MIR Section 7.4 and Chapter 8 Notes

Muddy Points:

  • How to evaluate different algorithms in searching the text 
  • The most common use of the algorithms in text searching 

(1) Searching Method:

  •      Search Sequentially 
  •      Build Data structure to store the data
(2) Main Indexing Technique:
  •     inverted files 
  •     suffix arrays 
  •    signature files (popular in 1980s) 
(3) Regular data structure using in information Retrieval
  •    sorted arrays 
  •    binary search trees 
  •    B-trees 
  •    Hash table
  •    Tries

Inverted Files: 
  • Two elements: Vocabulary and occurrence 
  • Techniques: Block Addressing 
  •     Vocabulary Search 
  •    Retrieval of Occurrence 
  •    Manipulation of Occurrence 
  •   Context inquires are complicated with inverted indices; 
  •     Building and maintaining an inverted index is relatively a low cost task 
  •    All the data will be keep in a trie tree data structure 
  •   Merging two indices include merging the sorted vocab   
Other Indices for text: 
Suffix Trees and Suffix Arrays: 
  •     Suffix Arrays are a space efficient implementation of suffix trees 
  •    Not all the text position needs to be indexed     
  •     Suffix Trees are built over all the suffix of the data 
  •    Patricia Tree 

Construction in Main Memory: 
  • concentrate on direct suffix construction 

Construction on Suffix Arrays for Large Text: 

Signature Files: 
Signature Files are word-oriented files based on hashing 
Use hash functions to map words 

Sequential Searching: 
Search on the text that has no data structure to build on 

Brute Force Method: 
  • Trying all possible text patterns 
  • Examine the Algor with the worst case 
KMP Method: 
  •      First linear with worst case behavior 

Boyer-Moore Family: 
Suffix Automation: 
Piratical Comparison: 
Pattern Matching: 

Regular Expression: 


Notes on Information Storage and Retrieval (Chapter 1)

Muddy Points: Can not figure out the Markov Models, and its application in this chapter. 

1.1 Definition of the information Retrieval: 
IR is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data.

First Library Elba Syria 3000 BC 

Web search:
The machines identifies a set of web pages containing the terms in the query, compute a score for each page, eliminate duplicate and redundant pages, generate summaries of the remaining pages and finally returns summaries and links back to the user for browsing.

Other search application:
Desktop and file system provides other forms of search.  Enterprise system once will have the internet search incorporated in the intranet or have their own system of collecting different files.

Other IR Applications:
Document Routing, filtering, and selective dissemination reverse the typical IR process: 
Application: The email spam.

Text clustering and categorization systems group documents according to shared properties to the system: 
The categorization system stems from the information training the various classes. Which would short unlabeled articles to the same categories.

Summarization systems reduce documents to a few key paragraphs:  
Information extraction systems named entities: 
Topic detection and tracking systems identify events in streams of news articles: 
Questions answering systems: 

1.2 Basic Architecture of Information Retrieval System: 

Measuring the Systems: 

efficiency and effectiveness. 
Efficiency can be measured in terms of time and space, 
Effectiveness is more dependable on human judgement, but they measure it with relevance: the probability of Ranking principle. This is although a good measurement but neglected the fact that the scope of the information retrieved.

1.3 Working with electronic text: 
Text Format: 
HTML(HyperText Markup language)
XML(Extensible Markup Language) 
The previous two languages are derived from SGML

The Muddy points: Could not understand the tokenization of the language. 

After tokenization of the language, we have the big whole sets of the languages and we can also know the frequencies of the languages. 

Term Distribution: 

The supplements from MBAliib: 
齊普夫定律是美國語言學家G.K.齊普夫George Kingsley Zipf)於本世紀40年代提出的詞頻分佈定律。它可以表述為:如果把一篇較長文章中每個詞出現的頻次統計起來,按照高頻詞在前、低頻詞在後的遞減順序排列,並用自然數個這些詞編上的等級序號,即頻次最高的詞等級為1,頻次次之的等級為2,......,頻次最小的詞等級為D,。若用f表示頻次,r 表示序號,則有fr=C(C為常數)。人們稱該式為齊普夫定律。
  齊普夫定律還從定量角度描述了目前流行的一個主題: 長尾巴定律The Long Tail)。以一個集合中按流行程度排名的物品(如亞馬遜網站上銷售的圖書)為例。表示流行程度的圖表會向下傾斜,位於左上角的是幾十本最流行的圖書。該圖會向右下角逐漸下降,那條長尾巴會列出每年銷量只有一兩本的幾十萬種圖書。換成英文即齊普夫定律最初應用的領域,這條長尾巴就是你很少會遇到的幾十萬個單詞,譬如floriferous或者refulgent。
  把流行程度作為大致衡量價值的標準,齊普夫定律隨後就會得出每一個物品的價值。也就是說,假設有100萬個物品,那麼最流行的100個物品將貢獻總價值的三分之一,其次的10000個物品將貢獻另外的三分之一; 剩餘的98.99萬個將貢獻剩下的三分之一。有n個物品的集合其價值與log(n)成正比。

Language Modeling: 
Predictions concerning the content of the unseen text made by way of a special kind of probability distribution known as a language model. This simple probability modelization is 1. 

we can use this simple model as to predict the unseen play probabilities. This technic is called maximum likelihood estimate.  

Higher order models: 

we have the conditional probabilities 

Markov Models: 
Markov Models are finite-state automata augmented with transition probabilities. 


TREC Tasks: 
Open Source IR Systems: 

FOA  process of readers: 
1. Asking a question 
2. Constructing an answer 
3. Assessing the answer 

Working with the IR tradition: 
Keywords are linguistic atoms- pieces of words or phrases  used to characterize the subject of a document. 

Elements of query language: 
query language 
natural language 
domain of discourse 
Computer Science is a border term of AI. 
The vocabulary size- the total number of keywords 


Algorithm Implementation Notes

Chapter one:
Review Java:
Static method:
Static methods are called functions in many programming languages, each static method is a sequence of statements that are being executed.

public static double sqrt (double c)

Properties of methods:

1.Arguments can pass by value
2. Method name can be overloaded: the same name with different arguments
3.A method can have a single return value but may have multiple return statements.
4.May have a side effect: such as a void main effort

The method can call itself.

Application programming interfaces:
That explains of library methods that are intended to use by others, mostly speaking, it is the static method.