Monday, March 17, 2014

MIR Chapter 13

Three different forms of searching the web:
(1) search engines that index a portion of the Web as a full-text database
(2) Web directories, which classify selected documents by subject
(3) exploiting the hyperlink structure of the Web

The outline:
(1) the statistical characteristics of the Web
(2) the main tools used to search the Web today

Challenges:
First class: problems with the data itself
(1) distributed data
(2) high percentage of volatile data (pages and links are constantly being added and deleted)
(3) large volume
(4) unstructured and redundant data: much of the Web is highly unstructured, and many pages are duplicated or mirrored
(5) uneven quality: the quality of the data cannot be guaranteed and is often low

Conclusion: the (often low) quality of Web data is not expected to change.

Second class: problems of interacting with the data
(1) how to specify a query
(2) how to interpret the answer returned for the query

Modeling the Web:
Characteristics of Web text: compared with ordinary text collections, the vocabulary grows faster (Heaps' law with a larger exponent) and the word distribution is more biased (Zipf's law with a steeper slope).
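These two observations are usually stated as Heaps' law (sublinear vocabulary growth) and Zipf's law (skewed word frequencies). A minimal Python sketch of both laws; the constants K, beta, and theta below are made-up illustrative values, not measurements from the chapter:

# Heaps' law: vocabulary size V(n) = K * n**beta after n words of text.
# Zipf's law: the count of the r-th most frequent word is roughly proportional to 1 / r**theta.
# K, beta, and theta are illustrative assumptions, not figures from the book.

def heaps_vocabulary(n_words: int, K: float = 40.0, beta: float = 0.6) -> float:
    """Predicted number of distinct words after seeing n_words of text."""
    return K * n_words ** beta

def zipf_count(rank: int, total_words: int = 1_000_000, theta: float = 1.8) -> float:
    """Rough (unnormalised) expected count of the word at a given frequency rank."""
    return total_words / rank ** theta

for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(heaps_vocabulary(n)))
for r in (1, 10, 100):
    print(r, round(zipf_count(r)))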


File sizes: the middle of the distribution is roughly normal-shaped, but the right tail is heavy, and its shape differs depending on the file type.
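A rough way to picture this mixed shape is to sample most sizes from a (log-)normal body and a small fraction from a heavy Pareto tail; a minimal sketch, where all parameters are illustrative assumptions rather than figures from the chapter:

import random

def sample_file_size(p_tail: float = 0.1) -> float:
    # Toy model: log-normal body for typical pages, Pareto right tail for the rare huge files.
    # p_tail, the Pareto shape 1.4, and the size scales are made-up values.
    if random.random() < p_tail:
        return 100_000 * random.paretovariate(1.4)   # heavy right tail, starting near 100 KB
    return random.lognormvariate(8.0, 1.0)           # body centred around a few KB

sizes = sorted(sample_file_size() for _ in range(10_000))
print("median bytes:", round(sizes[len(sizes) // 2]), "max bytes:", round(max(sizes)))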

Search Engines:
Centralized crawler-indexer architecture:
the crawler fetches pages from the Web and hands them to the indexer, the indexer builds the index, and the query engine answers user queries against that index.

The problem with this centralized architecture: it cannot cope with the ever-growing volume of data.

Distributed Architecture:
Harvest is the main example of a distributed architecture.
Gatherers and brokers:
Gatherer: collects and extracts indexing information from one or more Web servers.
Broker: provides the indexing mechanism and the query interface; a broker retrieves information from one or more gatherers or from other brokers.

One of the goals of Harvest is to build topic-specific brokers, focusing the
index contents and avoiding many of the vocabulary and scaling problems of
generic indices.
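A minimal sketch of this division of labour (the class names and the in-memory term index are assumptions for illustration; the real Harvest system defines its own exchange formats and protocols):

from collections import defaultdict

class Gatherer:
    # Collects and extracts indexing information from a set of pages,
    # standing in here for one or more Web servers.
    def __init__(self, pages: dict[str, str]):
        self.pages = pages  # url -> page text

    def export(self) -> dict[str, set[str]]:
        index: dict[str, set[str]] = defaultdict(set)
        for url, text in self.pages.items():
            for term in text.lower().split():
                index[term].add(url)
        return index

class Broker:
    # Provides the indexing mechanism and the query interface,
    # merging information retrieved from one or more gatherers (or other brokers).
    def __init__(self):
        self.index: dict[str, set[str]] = defaultdict(set)

    def ingest(self, gatherer: Gatherer) -> None:
        for term, urls in gatherer.export().items():
            self.index[term] |= urls

    def query(self, term: str) -> set[str]:
        return self.index.get(term.lower(), set())

# A topic-specific broker built from two gatherers (hypothetical URLs):
broker = Broker()
broker.ingest(Gatherer({"http://a.example/1": "web search ranking"}))
broker.ingest(Gatherer({"http://b.example/2": "distributed web crawling"}))
print(broker.query("web"))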

User Interface:
the query interface and the answer interface

Ranking:
Commercial search engines do not disclose the details of their ranking algorithms.
Improvements that use hyperlink information:
(1) Boolean spread and vector spread (extend the classical Boolean/vector ranking with pages that point to, or are pointed to by, pages in the answer)
(2) the PageRank algorithm (takes the links pointing to a page into consideration)

The first assumption: PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q. 

The second assumption: it is further assumed that this user never goes back to a previously visited page by following an already traversed hyperlink backwards. This process can be modeled with a Markov chain, from which the stationary probability of being in each page can be computed.
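A minimal power-iteration sketch of this random-surfer model; the tiny link graph and the value of q are made-up examples, and the update follows PR(p) = q/T + (1 - q) * sum over pages u linking to p of PR(u) / out-degree(u), where T is the total number of pages:

def pagerank(links: dict[str, list[str]], q: float = 0.15, iters: int = 50) -> dict[str, float]:
    # links maps each page to the pages it points to; every target must also be a key.
    pages = list(links)
    T = len(pages)
    pr = {p: 1.0 / T for p in pages}              # start from a uniform distribution
    for _ in range(iters):
        new = {p: q / T for p in pages}           # random-jump component
        for p, outs in links.items():
            if not outs:                          # dangling page: spread its mass uniformly
                for t in pages:
                    new[t] += (1 - q) * pr[p] / T
            else:
                for t in outs:
                    new[t] += (1 - q) * pr[p] / len(outs)
        pr = new
    return pr

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}   # hypothetical pages
print(pagerank(toy_graph))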

Hypertext Induced Topic Search:
Pages that have many links pointing to them in S are called authorities (that is, they should have relevant content). Pages that have many outgoing links are called hubs (they should point to similar content).

In other words: a page to which many other pages point is an authority (its content should be relevant); a page with many outgoing links is called a hub (it should point to pages with similar, relevant content).
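A minimal sketch of the HITS iteration on a small subgraph S (the graph, iteration count, and normalisation are illustrative assumptions): an authority score is the sum of the hub scores of the pages pointing to it, and a hub score is the sum of the authority scores of the pages it points to.

def hits(links: dict[str, list[str]], iters: int = 50):
    # links maps each page in S to the pages it points to (targets must also be keys).
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[u] for u in pages if p in links[u]) for p in pages}
        hub = {p: sum(auth[v] for v in links[p]) for p in pages}
        # normalise so the scores stay bounded
        a_norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        h_norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {p: a / a_norm for p, a in auth.items()}
        hub = {p: h / h_norm for p, h in hub.items()}
    return hub, auth

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["b", "c"]}   # hypothetical pages
hubs, auths = hits(toy_graph)
print("best authority:", max(auths, key=auths.get), "best hub:", max(hubs, key=hubs.get))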


Crawling the web: 
The simplest is to start with a set of URLs and from there extract other URLs which are followed recursively in a breadth-first or depth-first fashion.
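A minimal breadth-first crawler sketch along those lines; the seed URL, page limit, and link filtering are simplified assumptions, and a real crawler would also need robots.txt handling, rate limiting, and duplicate detection:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # Collects href targets from anchor tags.
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed: str, max_pages: int = 20) -> list[str]:
    # Breadth-first traversal starting from a single seed URL.
    frontier = deque([seed])
    visited: set[str] = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue                                   # skip pages that fail to download
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)              # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
    return sorted(visited)

# print(crawl("https://example.com/"))                 # hypothetical seed URL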





