2014年1月10日星期五

Notes on Information Storage and Retrieval (Chapter 1)

Muddy Points: Can not figure out the Markov Models, and its application in this chapter. 

1.1 Definition of the information Retrieval: 
IR is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data.



History: 
First Library Elba Syria 3000 BC 

Categories: 
Web search:
The machines identifies a set of web pages containing the terms in the query, compute a score for each page, eliminate duplicate and redundant pages, generate summaries of the remaining pages and finally returns summaries and links back to the user for browsing.

Other search application:
Desktop and file system provides other forms of search.  Enterprise system once will have the internet search incorporated in the intranet or have their own system of collecting different files.

Other IR Applications:
Document Routing, filtering, and selective dissemination reverse the typical IR process: 
Application: The email spam.

Text clustering and categorization systems group documents according to shared properties to the system: 
The categorization system stems from the information training the various classes. Which would short unlabeled articles to the same categories.

Summarization systems reduce documents to a few key paragraphs:  
Information extraction systems named entities: 
Topic detection and tracking systems identify events in streams of news articles: 
Questions answering systems: 

1.2 Basic Architecture of Information Retrieval System: 





Measuring the Systems: 



efficiency and effectiveness. 
Efficiency can be measured in terms of time and space, 
Effectiveness is more dependable on human judgement, but they measure it with relevance: the probability of Ranking principle. This is although a good measurement but neglected the fact that the scope of the information retrieved.

1.3 Working with electronic text: 
Text Format: 
HTML(HyperText Markup language)
XML(Extensible Markup Language) 
The previous two languages are derived from SGML

The Muddy points: Could not understand the tokenization of the language. 

After tokenization of the language, we have the big whole sets of the languages and we can also know the frequencies of the languages. 



Term Distribution: 

The supplements from MBAliib: 
齊普夫定律是美國語言學家G.K.齊普夫George Kingsley Zipf)於本世紀40年代提出的詞頻分佈定律。它可以表述為:如果把一篇較長文章中每個詞出現的頻次統計起來,按照高頻詞在前、低頻詞在後的遞減順序排列,並用自然數個這些詞編上的等級序號,即頻次最高的詞等級為1,頻次次之的等級為2,......,頻次最小的詞等級為D,。若用f表示頻次,r 表示序號,則有fr=C(C為常數)。人們稱該式為齊普夫定律。
  齊普夫定律是描述一系列實際現象的特點非常到位的經驗定律之一。它認為,如果我們按照大小或者流行程度給某個大集合中的各項進行排序,集合中第二項的比重大約是第一項的一半,而第三項的比重大約是第一項的三分之一,以此類推。換句話來說,一般來講,排在第k位的項目其比重為第一項的1/k。
  齊普夫定律還從定量角度描述了目前流行的一個主題: 長尾巴定律The Long Tail)。以一個集合中按流行程度排名的物品(如亞馬遜網站上銷售的圖書)為例。表示流行程度的圖表會向下傾斜,位於左上角的是幾十本最流行的圖書。該圖會向右下角逐漸下降,那條長尾巴會列出每年銷量只有一兩本的幾十萬種圖書。換成英文即齊普夫定律最初應用的領域,這條長尾巴就是你很少會遇到的幾十萬個單詞,譬如floriferous或者refulgent。
  把流行程度作為大致衡量價值的標準,齊普夫定律隨後就會得出每一個物品的價值。也就是說,假設有100萬個物品,那麼最流行的100個物品將貢獻總價值的三分之一,其次的10000個物品將貢獻另外的三分之一; 剩餘的98.99萬個將貢獻剩下的三分之一。有n個物品的集合其價值與log(n)成正比。

Language Modeling: 
Predictions concerning the content of the unseen text made by way of a special kind of probability distribution known as a language model. This simple probability modelization is 1. 

we can use this simple model as to predict the unseen play probabilities. This technic is called maximum likelihood estimate.  


Higher order models: 

we have the conditional probabilities 


Markov Models: 
Markov Models are finite-state automata augmented with transition probabilities. 
马氏链描述一种随机过程,从一时期向下一时期的状态按一定的概率转移。下时期的状态取决于本期的状态,与过去无关。

存在两种链的方式:正则链:从任一状态出发有限次数转移能以正概率达到另一个状态
吸收链:从任一非吸收状态转移到达吸收状态。

TREC Tasks: 
Open Source IR Systems: 
LUCENE 
INDURI
WUMPUS 


---------------------------------------------------------------------------------------------------------------
FOA  process of readers: 
1. Asking a question 
2. Constructing an answer 
3. Assessing the answer 

Working with the IR tradition: 
Keywords are linguistic atoms- pieces of words or phrases  used to characterize the subject of a document. 


Elements of query language: 
query language 
natural language 
domain of discourse 
Computer Science is a border term of AI. 
The vocabulary size- the total number of keywords 



没有评论:

发表评论