Muddy points: I could not figure out Markov models and their application in this chapter.
1.1 Definition of Information Retrieval:
IR is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data.
First library: Ebla, Syria, 3000 BC.
Web search:
The search engine identifies a set of web pages containing the terms in the query, computes a score for each page, eliminates duplicate and redundant pages, generates summaries of the remaining pages, and finally returns the summaries and links to the user for browsing.
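The steps listed above can be sketched as a toy pipeline. This is only an illustration under strong assumptions: the page collection `PAGES`, its URLs, the count-based score, and the "first few words" summary are all made up for the example and are not how a real web search engine ranks or summarizes pages.

```python
from collections import Counter

# Hypothetical toy "web": URL -> page text (invented for illustration).
PAGES = {
    "a.example/1": "information retrieval systems rank documents",
    "a.example/2": "information retrieval systems rank documents",  # exact duplicate
    "b.example/ir": "retrieval of information from large text collections",
    "c.example/cats": "pictures of cats",
}

def search(query, pages, summary_words=5):
    terms = query.lower().split()
    results = []
    seen_texts = set()
    # Visiting URLs in sorted order makes the choice of which duplicate to keep deterministic.
    for url in sorted(pages):
        text = pages[url]
        counts = Counter(text.lower().split())
        # 1. Identify pages containing every query term.
        if not all(counts[t] > 0 for t in terms):
            continue
        # 2. Score: total occurrences of the query terms (a stand-in, not a real ranking formula).
        score = sum(counts[t] for t in terms)
        # 3. Eliminate duplicates (here: exact text matches only).
        if text in seen_texts:
            continue
        seen_texts.add(text)
        # 4. Summary: the first few words of the page.
        summary = " ".join(text.split()[:summary_words])
        results.append((score, url, summary))
    # 5. Return the highest-scoring pages first, breaking ties by URL.
    results.sort(key=lambda r: (-r[0], r[1]))
    return results

for score, url, summary in search("information retrieval", PAGES):
    print(score, url, summary)
```

The page about cats is dropped in step 1, and the second copy of the duplicated page is dropped in step 3.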
Other search applications:
Desktop and file systems provide other forms of search. An enterprise system may incorporate Internet search into its intranet, or it may have its own system for collecting and searching different kinds of files.
Other IR Applications:
Document routing, filtering, and selective dissemination reverse the typical IR process:
Application: email spam filtering.
Text clustering and categorization systems group documents according to shared properties:
A categorization system is trained on labeled examples of the various classes and then sorts unlabeled articles into those same categories.
Summarization systems reduce documents to a few key paragraphs:
Information extraction systems identify named entities:
Topic detection and tracking systems identify events in streams of news articles:
Question answering systems:
1.2 Basic Architecture of an Information Retrieval System:
Measuring the Systems:
Efficiency and effectiveness.
Efficiency can be measured in terms of time and space.
Effectiveness depends more on human judgement. It is measured in terms of relevance, following the Probability Ranking Principle. This is a good measure, but it neglects the scope of the information retrieved.
1.3 Working with electronic text:
Text Format:
HTML(HyperText Markup language)
XML(Extensible Markup Language)
The previous two languages are derived from SGML
Muddy points: I could not understand tokenization.
After tokenizing the text, we have the full set of distinct terms (the vocabulary), and we can also count how often each term occurs.
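Since tokenization was a muddy point, here is a minimal sketch of the idea. It assumes a very simple rule (lowercase the text and treat every run of letters or digits as one token), which is only one of many possible tokenization schemes. The function name `tokenize` and the sample sentence are chosen for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then treat each maximal run of letters or digits as a token.
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "To be, or not to be: that is the question."
tokens = tokenize(sample)

# Term frequencies: how many times each distinct token occurs.
frequencies = Counter(tokens)

# The vocabulary is the set of distinct tokens.
vocabulary = sorted(frequencies)

print(tokens)
print(vocabulary)
print(frequencies["to"], frequencies["question"])
```

Punctuation is discarded and "To" and "to" are counted as the same term, so the ten tokens of the sample yield a vocabulary of eight terms.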
Term Distribution:
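Supplementary notes from MBAlib:
Zipf's law: a law of word-frequency distribution proposed by George Kingsley Zipf in the 1940s. It can be stated as follows: count the number of occurrences of each word in a fairly long text and rank the words by decreasing frequency.
The most frequent word has rank 1, the next most frequent has rank 2, ..., and the least frequent word has rank D. If f denotes frequency and r denotes rank, then f × r = C (where C is a constant). This formula is known as Zipf's law.
The Long Tail: consider a collection of items ranked by popularity (such as [...]) and the value each item contributes.
In other words, given 1 million items, the 100 most popular items contribute one third of the total value, the next 10,000 items contribute another third, and the remaining 989,900 items contribute the final third. The value of a collection of n items is proportional to log(n).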
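The "one third each" arithmetic can be checked numerically. The sketch below assumes an idealized Zipf distribution in which the item of rank r contributes value 1/r (that is, C = 1). Real collections only follow this approximately, so the three shares come out near one third rather than exactly equal.

```python
import math

def harmonic(n):
    # H_n = 1/1 + 1/2 + ... + 1/n: total value of the n top-ranked items when value = 1 / rank.
    return sum(1.0 / r for r in range(1, n + 1))

N = 1_000_000
total = harmonic(N)
top_100 = harmonic(100)
next_10000 = harmonic(10_100) - top_100
rest = total - harmonic(10_100)

for label, value in [("top 100", top_100),
                     ("next 10,000", next_10000),
                     ("remaining 989,900", rest)]:
    print(f"{label}: {value / total:.3f} of total value")

# H_n grows like ln(n) plus a constant of about 0.5772,
# which is where the "proportional to log(n)" behaviour comes from.
print(total - math.log(N))
```

Each of the three groups spans roughly the same ratio of ranks (about a factor of 100), and under 1/r weighting equal ratios of ranks contribute roughly equal value.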
Language Modeling:
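Predictions concerning the content of unseen text are made by way of a special kind of probability distribution known as a language model. In the simplest model, each term in the vocabulary is assigned a fixed probability, and these probabilities sum to 1.
We can use this simple model to predict term probabilities in an unseen play by estimating each term's probability as its relative frequency in the text we have already seen. This technique is called maximum likelihood estimation.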
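A minimal sketch of the maximum likelihood estimate, assuming the text has already been split into tokens. The function name `mle_unigram_model` and the ten-token sample are illustrative only.

```python
from collections import Counter

def mle_unigram_model(tokens):
    # Maximum likelihood estimate: P(term) = count(term) / total number of tokens.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

seen_text = "to be or not to be that is the question".split()
model = mle_unigram_model(seen_text)

print(model["to"])                 # 2 occurrences out of 10 tokens
print(sum(model.values()))         # the probabilities sum to 1 (up to rounding)
print(model.get("denmark", 0.0))   # a term never seen gets probability 0 under plain MLE
```

The last line shows the main weakness of the plain estimate: any term missing from the seen text is predicted to be impossible in unseen text.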
Higher order models:
In higher-order models we use conditional probabilities: the probability of a term depends on the term or terms that precede it.
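As a sketch of the simplest higher-order case, a first-order model estimates the probability of the current term given the previous one from counts of adjacent pairs. The function name `first_order_model` and the sample text are assumptions made for the example.

```python
from collections import Counter, defaultdict

def first_order_model(tokens):
    # Count how often each term follows each preceding term.
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        pair_counts[prev][cur] += 1
    # Normalize the counts: P(cur | prev) = count(prev, cur) / count(prev followed by anything).
    model = {}
    for prev, followers in pair_counts.items():
        total = sum(followers.values())
        model[prev] = {cur: n / total for cur, n in followers.items()}
    return model

tokens = "to be or not to be that is the question".split()
model = first_order_model(tokens)

print(model["to"])  # "to" is always followed by "be" in this sample
print(model["be"])  # "be" is followed once by "or" and once by "that"
```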
Markov Models:
Markov Models are finite-state automata augmented with transition probabilities.
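To make the definition concrete, here is a small sketch of a finite-state automaton with transition probabilities. The two states, the symbols "a" and "b", and all the probabilities in `TRANSITIONS` are invented for illustration. The sketch also assumes that each symbol leads to exactly one next state, which a general Markov model need not satisfy.

```python
# Hypothetical two-state model: state -> {symbol: (probability, next_state)}.
# The probabilities leaving each state sum to 1.
TRANSITIONS = {
    1: {"a": (0.5, 1), "b": (0.5, 2)},
    2: {"a": (0.25, 1), "b": (0.75, 2)},
}

def sequence_probability(symbols, start_state=1):
    # Follow the transitions symbol by symbol, multiplying their probabilities.
    state = start_state
    probability = 1.0
    for symbol in symbols:
        if symbol not in TRANSITIONS[state]:
            return 0.0  # the model cannot generate this symbol from this state
        p, state = TRANSITIONS[state][symbol]
        probability *= p
    return probability

print(sequence_probability("abba"))  # 0.5 * 0.5 * 0.75 * 0.25
```

A first-order language model fits this picture if each state stands for the previous term, so that the probability of the next term depends only on the current state.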
TREC Tasks:
Open Source IR Systems:
FOA process of readers:
1. Asking a question
2. Constructing an answer
3. Assessing the answer
Working with the IR tradition:
Keywords are linguistic atoms: pieces of words or phrases used to characterize the subject of a document.
Elements of query language:
query language
natural language
domain of discourse
Computer Science is a broader term than AI.
Vocabulary size: the total number of keywords.