(1) Search engines, which act as full-text databases of Web pages
(2) Web directories, which let users search the Web by subject
The outline:
(1) Statistics about the Web
(2) The main tools used to search the Web today
Challenges:
First Class: Data
(1) Distributed Data
(2) High percentage of volatile data (pages and links are constantly being added and deleted)
(3) Large Volume
(4) Unstructured and redundant data: the Web is highly unstructured, and much content is duplicated (e.g. mirrored pages)
(5) Quality of data: it has to be judged by the user, and it is often low (errors, typos, outdated or false information)
Conclusion: these properties of the data will not change.
Second Class: Interaction with the data
(1) How to specify a query
(2) How to interpret the answer given by the system
Modeling the web:
Characteristics of the Web: the vocabulary grows faster (Heaps' law with a larger exponent) and the word distribution is more biased (Zipf's law with a larger exponent) than in ordinary text collections.
File sizes have a distribution whose middle resembles a normal curve, but with a heavy right tail that differs across file types (see the measurement sketch below).
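To make the first two claims concrete, here is a small sketch of how they could be measured on an arbitrary text sample: Heaps' law fits vocabulary growth as V = K * n^beta (faster growth means a larger beta), and Zipf's law fits the rank-frequency curve as f(r) proportional to 1/r^theta (more bias means a larger theta). The naive tokenizer and helper names below are illustrative assumptions, not part of the original notes.

```python
import math
import re
from collections import Counter

# Sketch: measure vocabulary growth (Heaps' law) and rank-frequency bias
# (Zipf's law) on a text sample. Tokenization is deliberately naive.
def heaps_points(text):
    vocab, points = set(), []
    for i, word in enumerate(re.findall(r"[a-z]+", text.lower()), start=1):
        vocab.add(word)
        points.append((i, len(vocab)))   # (tokens seen so far, distinct words so far)
    return points

def zipf_exponent(text):
    # Needs at least two distinct words; this is only a sketch.
    freqs = sorted(Counter(re.findall(r"[a-z]+", text.lower())).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    # Least-squares slope of log(frequency) vs. log(rank) estimates -theta.
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return -slope
```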
Search Engines:
centralized crawler-indexer architecture
The crawler fetches the pages, the indexer builds the index, and the query engine answers queries using that index.
The problem with this centralized architecture: it cannot cope with the high volume and rate of change of the data (see the toy sketch below).
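A toy sketch of the indexer and query-engine roles in this architecture (the sample pages, function names, and AND-only query semantics are illustrative assumptions, not any real engine's code):

```python
from collections import defaultdict

# Toy centralized indexer: build an inverted index from crawled pages.
def build_index(pages):            # pages: dict url -> text fetched by the crawler
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

# Toy query engine: answer a conjunctive (AND) query against the index.
def query(index, terms):
    results = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*results) if results else set()

pages = {"http://a.example": "web search engines", "http://b.example": "web crawlers"}
idx = build_index(pages)
print(query(idx, ["web", "search"]))   # -> {'http://a.example'}
```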
Distributed Architecture:
The Harvest distributed architecture:
Gatherers and brokers:
Gatherers: collect and extract indexing information from one or more Web servers.
Brokers: provide the indexing mechanism and the query interface to the gathered data.
One of the goals of Harvest is to build topic-specific brokers, focusing the
index contents and avoiding many of the vocabulary and scaling problems of
generic indices.
User Interface:
Consists of the query interface and the answer interface.
Ranking:
The exact ranking algorithms used by most search engines are not publicly disclosed.
Improvements:
(1) Vector spread and Boolean spread (classical ranking extended to include pages that point to, or are pointed to by, pages in the answer)
(2) The PageRank algorithm (takes the link structure, i.e. incoming and outgoing pointers, into consideration)
The first assumption: PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q.
The second assumption: the user never goes back to a previously visited page by following an already traversed hyperlink backwards. This process can be modeled with a Markov chain, from which the stationary probability of being at each page can be computed (a power-iteration sketch follows below).
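Under this random-surfer model, the score of a page a with T total pages follows the recurrence PR(a) = q/T + (1 - q) * sum over pages p linking to a of PR(p)/L(p), where L(p) is the number of out-links of p. Below is a minimal power-iteration sketch of that recurrence; the example graph, the value of q, and the handling of dangling pages are illustrative assumptions.

```python
# Minimal PageRank power-iteration sketch (illustrative, not any engine's code).
# graph: dict mapping each page to the list of pages it links to.
def pagerank(graph, q=0.15, iterations=50):
    pages = list(graph)
    T = len(pages)
    pr = {p: 1.0 / T for p in pages}           # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {p: q / T for p in pages}     # jump to a random page with probability q
        for p, links in graph.items():
            if not links:                       # dangling page: spread its mass uniformly
                for t in pages:
                    new_pr[t] += (1 - q) * pr[p] / T
            else:
                for t in links:                 # follow a random out-link with prob. 1 - q
                    new_pr[t] += (1 - q) * pr[p] / len(links)
        pr = new_pr
    return pr

# Tiny example: "a" is pointed to by both "b" and "c", so it gets the highest score.
print(pagerank({"a": ["b"], "b": ["a"], "c": ["a"]}))
```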
Hypertext Induced Topic Search (HITS):
Pages that have many links pointing to them in S are called authorities (that is, they should have relevant content). Pages that have many outgoing links are called hubs (they should point to similar content).
In other words: if many pages point to a page (it has many incoming links), it is an authority; if a page has many outgoing links, it is called a hub (a sketch of the hub/authority iteration follows below).
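A minimal sketch of the iterative hub/authority computation used by HITS (the graph representation, normalization, and iteration count are illustrative assumptions):

```python
import math

# Minimal HITS sketch (illustrative). graph: page -> list of pages it links to,
# restricted to the set S of pages retrieved for the query.
def hits(graph, iterations=50):
    pages = list(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to the page.
        auth = {p: sum(hub[s] for s in pages if p in graph[s]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of the authority scores of the pages the page points to.
        hub = {p: sum(auth[t] for t in graph[p] if t in auth) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
print(auth)   # "c" gets the highest authority score: both "a" and "b" point to it
```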
Crawling the web:
The simplest approach is to start with a set of URLs and from there extract other URLs, which are followed recursively in a breadth-first or depth-first fashion (see the crawler sketch below).
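A minimal breadth-first crawler sketch along those lines (the regex-based link extraction, URL filtering, and page limit are illustrative simplifications; a real crawler must also respect robots.txt and rate limits):

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

# Minimal breadth-first crawler sketch (illustrative only).
def crawl(seed_urls, max_pages=20):
    seen = set(seed_urls)
    queue = deque(seed_urls)          # FIFO queue -> breadth-first traversal
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue
        fetched += 1
        # Crude link extraction with a regex; a real crawler would parse the HTML.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

Replacing popleft() with pop() turns the FIFO queue into a stack, giving the depth-first variant mentioned above.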