Monday, March 17, 2014

MIR Chapter 13

Three different forms of searching the web:
(1) search engines that index a portion of the Web as a full-text database
(2) Web directories, which classify selected documents by subject
(3) exploiting the hyperlink structure of the Web

The outline:
(1) the statistical characteristics of the Web
(2) the main tools used to search the Web today

Challenges:
First class: problems with the data itself
(1) distributed data
(2) high percentage of volatile data (pages and links are constantly being added and deleted)
(3) large volume
(4) unstructured and redundant data: much of the Web is highly unstructured, and many pages are duplicated or mirrored
(5) uneven quality: the quality of the data cannot be guaranteed and is often low

Conclusion: the (often low) quality of Web data is not expected to change.

Second class: problems of interacting with the data
(1) how to specify a query
(2) how to interpret the answer returned for the query

Modeling the Web:
Characteristics of Web text: compared with ordinary text collections, the vocabulary grows faster (Heaps' law with a larger exponent) and the word distribution is more biased (Zipf's law with a steeper slope).
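These two observations are usually stated as Heaps' law (sublinear vocabulary growth) and Zipf's law (skewed word frequencies). A minimal Python sketch of both laws; the constants K, beta, and theta below are made-up illustrative values, not measurements from the chapter:

# Heaps' law: vocabulary size V(n) = K * n**beta after n words of text.
# Zipf's law: the count of the r-th most frequent word is roughly proportional to 1 / r**theta.
# K, beta, and theta are illustrative assumptions, not figures from the book.

def heaps_vocabulary(n_words: int, K: float = 40.0, beta: float = 0.6) -> float:
    """Predicted number of distinct words after seeing n_words of text."""
    return K * n_words ** beta

def zipf_count(rank: int, total_words: int = 1_000_000, theta: float = 1.8) -> float:
    """Rough (unnormalised) expected count of the word at a given frequency rank."""
    return total_words / rank ** theta

for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(heaps_vocabulary(n)))
for r in (1, 10, 100):
    print(r, round(zipf_count(r)))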


File sizes: the middle of the distribution is roughly normal-shaped, but the right tail is heavy, and its shape differs depending on the file type.
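A rough way to picture this mixed shape is to sample most sizes from a (log-)normal body and a small fraction from a heavy Pareto tail; a minimal sketch, where all parameters are illustrative assumptions rather than figures from the chapter:

import random

def sample_file_size(p_tail: float = 0.1) -> float:
    # Toy model: log-normal body for typical pages, Pareto right tail for the rare huge files.
    # p_tail, the Pareto shape 1.4, and the size scales are made-up values.
    if random.random() < p_tail:
        return 100_000 * random.paretovariate(1.4)   # heavy right tail, starting near 100 KB
    return random.lognormvariate(8.0, 1.0)           # body centred around a few KB

sizes = sorted(sample_file_size() for _ in range(10_000))
print("median bytes:", round(sizes[len(sizes) // 2]), "max bytes:", round(max(sizes)))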

Search Engines:
Centralized crawler-indexer architecture:
the crawler fetches pages from the Web and hands them to the indexer, the indexer builds the index, and the query engine answers user queries against that index.

The problem with this centralized architecture: it cannot cope with the ever-growing volume of data.

Distributed Architecture:
Harvest is the main example of a distributed architecture.
Gatherers and brokers:
Gatherer: collects and extracts indexing information from one or more Web servers.
Broker: provides the indexing mechanism and the query interface; a broker retrieves information from one or more gatherers or from other brokers.

One of the goals of Harvest is to build topic-specific brokers, focusing the
index contents and avoiding many of the vocabulary and scaling problems of
generic indices.
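A minimal sketch of this division of labour (the class names and the in-memory term index are assumptions for illustration; the real Harvest system defines its own exchange formats and protocols):

from collections import defaultdict

class Gatherer:
    # Collects and extracts indexing information from a set of pages,
    # standing in here for one or more Web servers.
    def __init__(self, pages: dict[str, str]):
        self.pages = pages  # url -> page text

    def export(self) -> dict[str, set[str]]:
        index: dict[str, set[str]] = defaultdict(set)
        for url, text in self.pages.items():
            for term in text.lower().split():
                index[term].add(url)
        return index

class Broker:
    # Provides the indexing mechanism and the query interface,
    # merging information retrieved from one or more gatherers (or other brokers).
    def __init__(self):
        self.index: dict[str, set[str]] = defaultdict(set)

    def ingest(self, gatherer: Gatherer) -> None:
        for term, urls in gatherer.export().items():
            self.index[term] |= urls

    def query(self, term: str) -> set[str]:
        return self.index.get(term.lower(), set())

# A topic-specific broker built from two gatherers (hypothetical URLs):
broker = Broker()
broker.ingest(Gatherer({"http://a.example/1": "web search ranking"}))
broker.ingest(Gatherer({"http://b.example/2": "distributed web crawling"}))
print(broker.query("web"))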

User Interface:
the query interface and the answer interface

Ranking:
Commercial search engines do not disclose the details of their ranking algorithms.
Improvements that use hyperlink information:
(1) Boolean spread and vector spread (extend the classical Boolean/vector ranking with pages that point to, or are pointed to by, pages in the answer)
(2) the PageRank algorithm (takes the links pointing to a page into consideration)

The first assumption: PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q. 

The second assumption: it is further assumed that this user never goes back to a previously visited page by following an already traversed hyperlink backwards. This process can be modeled with a Markov chain, from which the stationary probability of being in each page can be computed.
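A minimal power-iteration sketch of this random-surfer model; the tiny link graph and the value of q are made-up examples, and the update follows PR(p) = q/T + (1 - q) * sum over pages u linking to p of PR(u) / out-degree(u), where T is the total number of pages:

def pagerank(links: dict[str, list[str]], q: float = 0.15, iters: int = 50) -> dict[str, float]:
    # links maps each page to the pages it points to; every target must also be a key.
    pages = list(links)
    T = len(pages)
    pr = {p: 1.0 / T for p in pages}              # start from a uniform distribution
    for _ in range(iters):
        new = {p: q / T for p in pages}           # random-jump component
        for p, outs in links.items():
            if not outs:                          # dangling page: spread its mass uniformly
                for t in pages:
                    new[t] += (1 - q) * pr[p] / T
            else:
                for t in outs:
                    new[t] += (1 - q) * pr[p] / len(outs)
        pr = new
    return pr

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}   # hypothetical pages
print(pagerank(toy_graph))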

Hypertext Induced Topic Search:
Pages that have many links pointing to them in S are called authorities (that is, they should have relevant content). Pages that have many outgoing links are called hubs (they should point to similar content).

In other words: a page to which many other pages point is an authority (its content should be relevant); a page with many outgoing links is called a hub (it should point to pages with similar, relevant content).
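A minimal sketch of the HITS iteration on a small subgraph S (the graph, iteration count, and normalisation are illustrative assumptions): an authority score is the sum of the hub scores of the pages pointing to it, and a hub score is the sum of the authority scores of the pages it points to.

def hits(links: dict[str, list[str]], iters: int = 50):
    # links maps each page in S to the pages it points to (targets must also be keys).
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[u] for u in pages if p in links[u]) for p in pages}
        hub = {p: sum(auth[v] for v in links[p]) for p in pages}
        # normalise so the scores stay bounded
        a_norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        h_norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {p: a / a_norm for p, a in auth.items()}
        hub = {p: h / h_norm for p, h in hub.items()}
    return hub, auth

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["b", "c"]}   # hypothetical pages
hubs, auths = hits(toy_graph)
print("best authority:", max(auths, key=auths.get), "best hub:", max(hubs, key=hubs.get))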


Crawling the web: 
The simplest is to start with a set of URLs and from there extract other URLs which are followed recursively in a breadth-first or depth-first fashion.
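A minimal breadth-first crawler sketch along those lines; the seed URL, page limit, and link filtering are simplified assumptions, and a real crawler would also need robots.txt handling, rate limiting, and duplicate detection:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # Collects href targets from anchor tags.
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed: str, max_pages: int = 20) -> list[str]:
    # Breadth-first traversal starting from a single seed URL.
    frontier = deque([seed])
    visited: set[str] = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue                                   # skip pages that fail to download
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)              # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
    return sorted(visited)

# print(crawl("https://example.com/"))                 # hypothetical seed URL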





