The main tools we use to search the web today:
(1) Search engines: full-text databases of web pages.
(2) Web directories: searching the web by subject.
A related problem is statistics: measuring the size and content of the web is difficult.
First Class: Data
(1) Distributed Data
(2) High percentage of volatile data (pages and links are constantly added and deleted)
(3) Large Volume
(4) Unstructured and redundant data (much of the web is highly unstructured, and many pages are duplicated or mirrored)
(5) Quality of data: the quality of a page cannot be taken for granted and is sometimes low
Conclusion: the quality of the data will not change; these problems are inherent to the web.
Second Class: Interaction with the Data
(1) How to specify a query.
(2) How to interpret the answers given by the system.
Modeling the web:
Characteristics of web text: the vocabulary grows faster and the word distribution is more biased than in conventional text collections.
File sizes in the middle of the range follow a distribution similar to the normal, but the right tail is heavy and differs across file types.
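These two observations are usually stated as Heaps' law (the vocabulary of a text of n words grows as V(n) = K * n^β) and Zipf's law (the r-th most frequent word occurs with frequency proportional to 1/r^θ). A minimal Python sketch of both; the constants K, β, and θ below are illustrative assumptions, not measured web values:

```python
# Sketch of Heaps' and Zipf's laws; K, beta, and theta are
# assumed illustrative values, not parameters fitted to the web.

def heaps_vocabulary(n_words, K=50.0, beta=0.6):
    """Estimated vocabulary size after seeing n_words tokens (Heaps' law)."""
    return K * n_words ** beta

def zipf_frequency(rank, n_words, theta=1.8, vocab_size=10_000):
    """Estimated count of the rank-th most frequent word (Zipf's law)."""
    # Normalizing constant: sum of 1/r^theta over all ranks.
    H = sum(1.0 / r ** theta for r in range(1, vocab_size + 1))
    return n_words / (rank ** theta * H)

if __name__ == "__main__":
    for n in (10**4, 10**6, 10**8):
        print(f"{n:>10} tokens -> ~{heaps_vocabulary(n):,.0f} distinct words")
    for r in (1, 10, 100):
        print(f"rank {r:>4} word -> ~{zipf_frequency(r, 10**6):,.0f} occurrences")
```

A larger β means the vocabulary grows faster, and a larger θ means a more biased word distribution, which is exactly what is claimed for the web above.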
Centralized crawler-indexer architecture:
The crawler fetches pages and feeds them to the indexer, which builds the index; the query engine answers queries against that index.
The problem with this architecture: the crawler and indexer cannot cope with the high and growing volume of data.
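To make the division of labor concrete, here is a minimal sketch of the indexer and query engine, with the crawler replaced by a toy in-memory page collection (the URLs and texts are made up; a real crawler fetches pages over HTTP):

```python
from collections import defaultdict

# Toy page collection standing in for crawled pages (an assumption;
# a real crawler would fetch these over the network).
PAGES = {
    "http://a.example": "search engines index the web",
    "http://b.example": "web directories organize the web by subject",
}

def build_index(pages):
    """Indexer: build an inverted index mapping term -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def query(index, terms):
    """Query engine: AND-intersect the posting lists of the terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

index = build_index(PAGES)
print(query(index, ["web", "subject"]))   # -> {'http://b.example'}
```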
Harvest, a distributed architecture:
Gatherers and brokers:
Gatherers: collect and extract indexing information from one or more web servers.
Brokers: provide the indexing mechanism and the query interface to the gathered information.
One of the goals of Harvest is to build topic-specific brokers, focusing the index contents and avoiding many of the vocabulary and scaling problems of generic indices.
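A rough sketch of the gatherer/broker split, under simplified assumptions (the class names and the (term, url) summary format are invented for illustration; the real Harvest system exchanges richer object summaries):

```python
from collections import defaultdict

class Gatherer:
    """Collects and extracts indexing information from a set of pages."""
    def __init__(self, pages):
        self.pages = pages  # url -> text; stands in for fetched documents

    def summaries(self):
        """Yield (term, url) pairs; a simplified stand-in for Harvest summaries."""
        for url, text in self.pages.items():
            for term in set(text.lower().split()):
                yield term, url

class Broker:
    """Merges summaries from several gatherers and answers queries."""
    def __init__(self):
        self.index = defaultdict(set)

    def collect(self, gatherer):
        for term, url in gatherer.summaries():
            self.index[term].add(url)

    def search(self, term):
        return self.index.get(term.lower(), set())

# A topic-specific broker fed by two gatherers (URLs are made up).
broker = Broker()
broker.collect(Gatherer({"http://x.example/ir": "information retrieval models"}))
broker.collect(Gatherer({"http://y.example/web": "web information search"}))
print(broker.search("information"))
```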
User interaction: the query interface and the answer (results) interface.
The ranking algorithms that search engines actually use are not disclosed to others.
(1) Vector spread and Boolean spread: extensions of the classic vector and Boolean models that also take into account pages pointing to, or pointed to by, pages in the answer.
(2) PageRank algorithm: takes links and pointers into consideration.
The first assumption: PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q.
The second assumption: it is further assumed that this user never goes back to a previously visited page by following an already traversed hyperlink backwards. This process can be modeled with a Markov chain, from which the stationary probability of being at each page can be computed.
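In the usual formulation, the rank of page p is PR(p) = q/T + (1 - q) * sum_i PR(p_i)/C(p_i), where T is the total number of pages, the p_i are the pages pointing to p, and C(p_i) is the number of outgoing links of p_i. A minimal power-iteration sketch in Python; the three-page graph is made up, and the handling of dangling pages (spreading their rank uniformly) is one common convention rather than part of the definition above:

```python
def pagerank(links, q=0.15, iters=50):
    """Power-iteration sketch of PageRank.
    links: dict mapping each page to the list of pages it points to.
    q: probability of jumping to a random page."""
    pages = list(links)
    T = len(pages)
    pr = {p: 1.0 / T for p in pages}          # uniform starting distribution
    for _ in range(iters):
        new = {p: q / T for p in pages}       # random-jump term q/T
        for p, outs in links.items():
            if not outs:                      # dangling page: spread evenly
                for t in pages:
                    new[t] += (1 - q) * pr[p] / T
            else:                             # follow a random outgoing link
                for t in outs:
                    new[t] += (1 - q) * pr[p] / len(outs)
        pr = new
    return pr

# Made-up three-page web for illustration.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(toy))
```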
Hypertext Induced Topic Search (HITS): a query-dependent ranking computed on a set S of pages related to the query.
Pages in S that have many links pointing to them are called authorities (that is, they should have relevant content). Pages that have many outgoing links are called hubs (they should point to similar content).
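A minimal iterative sketch of HITS: the authority score of a page is the sum of the hub scores of the pages pointing to it, the hub score is the sum of the authority scores of the pages it points to, and both vectors are normalized each round. The three-page subgraph is an assumption for illustration:

```python
import math

def hits(links, iters=50):
    """Iterative HITS on a link subgraph S.
    links: dict mapping each page in S to the pages in S it points to.
    Returns (authority, hub) score dicts."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority: sum of hub scores of pages pointing to p.
        auth = {p: sum(hub[u] for u in pages if p in links[u]) for p in pages}
        # Hub: sum of authority scores of pages p points to.
        hub = {p: sum(auth[v] for v in links[p]) for p in pages}
        # Normalize so the scores stay bounded.
        an = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        hn = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        auth = {p: a / an for p, a in auth.items()}
        hub = {p: h / hn for p, h in hub.items()}
    return auth, hub

toy = {"a": ["b", "c"], "b": ["c"], "c": []}  # made-up subgraph S
auth, hub = hits(toy)
print(auth, hub)
```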
Crawling the web:
The simplest scheme is to start with a set of seed URLs, extract new URLs from the fetched pages, and follow them recursively in a breadth-first or depth-first fashion.
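A minimal sketch of the breadth-first variant, run over a made-up in-memory link graph (a real crawler would fetch each URL over HTTP and parse links out of the HTML; that step is simulated here by the WEB dictionary):

```python
from collections import deque

# Toy link graph standing in for the web (URLs are invented).
WEB = {
    "http://seed.example": ["http://a.example", "http://b.example"],
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": [],
    "http://c.example": ["http://seed.example"],
}

def crawl_bfs(seeds):
    """Visit pages breadth-first, extracting and queueing unseen URLs."""
    seen = set(seeds)
    frontier = deque(seeds)
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in WEB.get(url, []):   # stands in for fetch + link extraction
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl_bfs(["http://seed.example"]))
```

Swapping the deque for a stack (append/pop at the same end) turns this into the depth-first variant.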