The Newbie on information Science: MIR Section 7.4 and Chapter 8 Notes

2014年1月24日星期五

MIR Section 7.4 and Chapter 8 Notes

Muddy Points:

How to evaluate different algorithms in searching the text
The most common use of the algorithms in text searching

(1) Searching Method:

Search Sequentially
Build Data structure to store the data

(2) Main Indexing Technique:

inverted files
suffix arrays
signature files (popular in 1980s)

(3) Regular data structure using in information Retrieval

sorted arrays
binary search trees
B-trees
Hash table
Tries

Inverted Files:

Two elements: Vocabulary and occurrence
Techniques: Block Addressing

Searching:

Vocabulary Search
Retrieval of Occurrence
Manipulation of Occurrence
Context inquires are complicated with inverted indices;

Construction:

Building and maintaining an inverted index is relatively a low cost task
All the data will be keep in a trie tree data structure
Merging two indices include merging the sorted vocab

Other Indices for text:

Suffix Trees and Suffix Arrays:

Suffix Arrays are a space efficient implementation of suffix trees
Not all the text position needs to be indexed

Structure:

Suffix Trees are built over all the suffix of the data
Patricia Tree

Construction in Main Memory:

concentrate on direct suffix construction

Construction on Suffix Arrays for Large Text:

Signature Files:

Signature Files are word-oriented files based on hashing

Use hash functions to map words

Sequential Searching:

Search on the text that has no data structure to build on

Brute Force Method:

Trying all possible text patterns
Examine the Algor with the worst case

KMP Method:

First linear with worst case behavior

Boyer-Moore Family:

Shift-Or

Suffix Automation:

Piratical Comparison:

Pattern Matching:

Regular Expression:

没有评论:

发表评论

订阅：博文评论 (Atom)