Thursday, March 27, 2014

Cross Language Information Retrieval

1. Major Approaches and Challenges in CLIR
Mistakes made during translation;
ambiguities exist in translation.

Three major challenges in CLIR:
1. What to translate
2. How to obtain translation knowledge
3. How to apply the translation knowledge

Identify the translation unit:
Three types of resources:
(1) bilingual dictionaries
(2) corpora
(3) word translations and phrase translations

Tokenization:
The process of recognizing word boundaries; this is harder for languages such as Chinese, which are written without spaces between words.


Stemming:
Related corpus processing includes part-of-speech tagging and syntactic parsing.
Phrases can be recognized with a bi-gram method: extract word bi-grams, rank them in decreasing order of a score such as frequency, and treat the highest-ranked bi-grams as phrases (a minimal sketch follows).
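A minimal sketch of that bi-gram ranking idea in Python; the raw-frequency score, the token list, and the cutoff are illustrative assumptions, not the exact method from the reading:

    from collections import Counter

    def top_bigram_phrases(tokens, top_k=10):
        """Count adjacent word pairs and rank them in decreasing order of frequency;
        the highest-ranked bi-grams are treated as candidate phrases."""
        counts = Counter(zip(tokens, tokens[1:]))
        return counts.most_common(top_k)

    # Hypothetical example:
    tokens = "cross language information retrieval needs cross language resources".split()
    print(top_bigram_phrases(tokens, top_k=3))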

Stop-words:


Obtaining Translation Knowledge:
Acquire bilingual translation resources,
and extract translation knowledge from those resources.

Obtain the bilingual resources (dictionaries and corpora):

Extracting the translation knowledge:

Dealing with out-of-vocabulary terms:
Transliteration: orthographic mapping (when the two languages share the same alphabet) and phonetic mapping.
Backoff translation: (1) match the surface form of the input term; (2) match the stem form of the input term; (3) use web mining to help identify translations (a sketch of this backoff chain appears below).
Pre-translation query expansion
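A minimal sketch of that backoff chain, assuming toy lookup tables; simple_stem, surface_dict, stem_dict, and web_mine are hypothetical placeholders rather than real resources:

    def simple_stem(term):
        # Toy stemmer: strip a few common English suffixes (illustrative only).
        for suffix in ("ing", "ed", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[: -len(suffix)]
        return term

    def backoff_translate(term, surface_dict, stem_dict, web_mine=None):
        # (1) Match the surface form of the input term.
        if term in surface_dict:
            return surface_dict[term]
        # (2) Back off to the stemmed form.
        stem = simple_stem(term)
        if stem in stem_dict:
            return stem_dict[stem]
        # (3) Fall back to web mining for terms not found in either dictionary.
        if web_mine is not None:
            return web_mine(term)
        return None  # still untranslated: treat as out-of-vocabulary upstream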

Using Translation Knowledge: 
Translation Disambiguation:

Weighting Translation Alternatives:
TF-IDF weighting (a minimal example follows)
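As a reminder of the standard TF-IDF formula (the exact variant used in the reading may differ), a minimal sketch:

    import math

    def tf_idf(tf, df, num_docs):
        """Term frequency times inverse document frequency."""
        if tf == 0 or df == 0:
            return 0.0
        return tf * math.log(num_docs / df)

    # Example: a term occurring 3 times in a document and in 100 of 10,000 documents.
    print(tf_idf(tf=3, df=100, num_docs=10000))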

Cranfield-style evaluation of the interactive system:






Wednesday, March 26, 2014

Parallel Information Retrieval

1. Parallel query processing
2. Parallel execution of off-line tasks


Parallel Query Processing: 
index partitioning;
replication;

Replication:
Treat each search server as a node and create full replicas of the index across n nodes; this yields an n-fold increase in the engine's service rate. This is called inter-query parallelism.

Alternatively, we could split the index into n parts and have each node work only on its own small part of the index. This approach is referred to as intra-query parallelism.

Two performance measures are used: the engine's service rate and the average time per query.

Document partitioning:
Each node holds a subset of the documents.
The main advantage of the document-partitioned approach is its simplicity. Because all index servers operate independently of each other, no additional complexity needs to be introduced into the low-level query processing routines.

How to divide the collection into subsets across the nodes:
For example, by storing similar documents on the same node.
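A minimal sketch of query processing over a document-partitioned index, assuming each node object exposes a hypothetical search(query, k) method that returns (score, doc_id) pairs for its own document subset:

    import heapq

    def broker_search(query, nodes, k=10):
        """Scatter the query to every index node, then merge the partial result lists."""
        partial = []
        for node in nodes:
            partial.extend(node.search(query, k))  # each node scores only its own documents
        return heapq.nlargest(k, partial)          # global top-k by score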


Term Partitioning:
Each node holds a subset of the collection's terms.
Term partitioning addresses the disk seek problem by splitting the collection into sets of terms instead of sets of documents.

Scalability:
As the collection becomes bigger, so do the individual postings lists. For a corpus composed of a billion documents, the postings list for a frequent term, such as “computer” or “internet”, can easily occupy several hundred megabytes.

Load Imbalance. 
Term partitioning suffers from an uneven load across the index nodes.

Term-at-a-Time.
Perhaps the most severe limitation of the term-partitioned approach is its inability to support efficient document-at-a-time query processing

Hybrid Schemes: 
Xi et al. (2002) propose a hybrid architecture in which the collection is divided into p subcollections according to a standard document partitioning scheme. The index for each subcollection is then term-partitioned across n/p nodes, so that each node in the system is responsible for all occurrences of a set of terms within one of the p subcollections.
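A minimal sketch of how such a hybrid layout might assign responsibility to nodes, assuming n is divisible by p; the hash-based assignment below is an illustrative choice, not the exact scheme of Xi et al.:

    def hybrid_node(doc_id, term, n, p):
        """Node responsible for a (term, document) pair: the collection is split into
        p document sub-collections, and each sub-collection's index is
        term-partitioned across n // p nodes."""
        nodes_per_group = n // p
        group = hash(doc_id) % p                  # which document sub-collection
        offset = hash(term) % nodes_per_group     # which term partition within it
        return group * nodes_per_group + offset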

Redundancy and Fault Tolerance:
Replication. We can maintain multiple copies of the same index node, as described in the previous section on hybrid partitioning schemes.

Partial Replication. Instead of replicating the entire index r times, we may choose
to replicate index information only for important documents

Dormant Replication. Suppose the search engine comprises a total of n nodes. We can divide the index found on each node v_i into n - 1 fragments and distribute them evenly among the n - 1 remaining nodes, but leave them dormant (on disk) and not use them for query processing.

Map-Reduce Framework: 
Large Scale Data Processing:  building and updating the index; identifying duplicate documents in the corpus; and analyzing the link structure of the document collection.

In current software implementations, one specifies a Map function that maps a set of key/value pairs into a new set of key/value pairs, and a concurrent Reduce function that merges all of the mapped values sharing the same key.

Simply put, a map function applies a specified operation to each element of a conceptual list of independent elements (for example, a list of test scores). For instance, if someone discovers that every student's score was overstated by one point, they can define a "subtract one" map function to correct the error.

In the map phase, key/value pairs are read from the input and the map function is
applied to each of them individually. The function is of the general form
map : (k, v) ↦ ⟨(k1, v1), (k2, v2), ...⟩    (14.9)

That is, for each key/value pair, map outputs a sequence of key/value pairs. This
sequence may or may not be empty, and the output keys may or may not be identical to
the input key (they usually aren’t).

• In the shuffle phase, the pairs produced during the map phase are sorted by their key,
and all values for the same key are grouped together.
• In the reduce phase, the reduce function is applied to each key and its values. The function
is of the form
reduce : (k, ⟨v1, v2, ...⟩) ↦ (k, ⟨v′1, v′2, ...⟩)    (14.10)

Suppose we have a total of n = m + r machines, where m is the number of map workers and r is the number of reduce workers. The input of the MapReduce is broken into small pieces called map shards. Each shard typically holds between 16 and 64 MB of data. The shards are treated independently, and each shard is assigned to one of the m map workers. In a large MapReduce, it is common to have dozens or hundreds of map shards assigned to each map worker. A worker usually works on only 1 shard at a time, so all its shards have to be processed sequentially. However, if a worker has more than 1 CPU, it may improve performance to have it work on multiple shards in parallel.

The map() function takes key/value pairs as input and produces another series of key/value pairs as intermediate output, which is written to local disk. The MapReduce framework automatically groups this intermediate data by key, and all data with the same key (the grouping strategy is user-configurable; by default the key is hashed modulo the number of reducers) is handed to the same reduce() function.
The reduce() function takes a key and its list of values as input; after merging the values that share the same key, it produces another series of key/value pairs as final output, which is written to HDFS.
Below, the "hello world" program of MapReduce, WordCount, is used to illustrate how such a program is designed.
A "hello world" program is the first program we write when learning any programming language. It is simple and easy to understand, and it helps readers get started quickly. Likewise, distributed processing frameworks have their own "hello world" program: WordCount. It counts the number of occurrences of each word in the input files. In MapReduce it can be written as follows (pseudocode).
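The pseudocode itself did not make it into these notes; a minimal Python-style sketch of the usual WordCount map() and reduce() pair, with a tiny single-machine driver standing in for the shuffle phase, might look like this:

    from collections import defaultdict

    def map_fn(doc_id, text):
        # Emit (word, 1) for every word in the document.
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Sum all the counts emitted for the same word.
        yield (word, sum(counts))

    def local_wordcount(documents):
        """Single-machine simulation: map, shuffle (group values by key), reduce."""
        groups = defaultdict(list)
        for doc_id, text in documents.items():
            for key, value in map_fn(doc_id, text):
                groups[key].append(value)
        results = {}
        for key, values in groups.items():
            for out_key, out_value in reduce_fn(key, values):
                results[out_key] = out_value
        return results

    print(local_wordcount({"d1": "hello world", "d2": "hello mapreduce"}))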




Tuesday, March 25, 2014

Multilingual Information Access

This essay mainly discusses multilingual models, specifically the cross-language retrieval model.

This model helps users improve cross-language search over text.

There are benefits for polyglots:
Polyglots can benefit from MLIA in at least three ways: (1) they can find documents in more than one language with a single search, (2) they can formulate queries in the language(s) for which their active vocabulary is largest, and (3) they can move more seamlessly across languages over the course of an information-seeking episode than would be possible if documents written in different languages were available only from different information systems.

Background: why people want to do multi-language searching:
1. The Cold War
2. World War II
3. The emergence of the World Wide Web
4. IBM: machines can learn to translate through statistical analysis.

Cross-Language Information Retrieval:
(1) language and character set identification, (2) language-specific processing, and (3) construction of an “inverted index” that allows rapid identification of which documents contain specific terms. Sometimes the language in which a document is written and the character set used to encode it can be inferred from its source (e.g., New York Times articles are almost always written in English, and typically encoded in ASCII) and sometimes the language and character set might be indicated using metadata (e.g., the HTML standard used for Web pages provides metadata fields for these purposes).

The distinguishing techniques for Cross-Language Information Retrieval:
(1) translate each term using the context in which that word appears to help select the right translation, (2) count the terms and then translate the aggregate counts without regard to the context of individual occurrences, or (3) compute some more sophisticated aggregate “term weight” for each term, and then translate those weights.
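A minimal sketch of option (2), counting terms first and then translating the aggregate counts with a bilingual dictionary; splitting each count evenly across the translation alternatives is an illustrative choice, not the weighting from the reading:

    from collections import Counter, defaultdict

    def translate_counts(tokens, bilingual_dict):
        """Count source-language terms, then map the counts into the target language."""
        counts = Counter(tokens)
        translated = defaultdict(float)
        for term, count in counts.items():
            alternatives = bilingual_dict.get(term, [])
            for alt in alternatives:
                translated[alt] += count / len(alternatives)  # share the count evenly
        return dict(translated)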

Problems that still exist in cross-language search:
1. The egg-or-the-basket problem
2. The result-presentation and ranking problem
3. Similar drivers for the development of new search engines

Muddiest Points

1. I still don't quite understand why PageRank works. What is the difference between PageRank and the HITS model? Could you explain more?

2. Could you introduce some useful crawlers we can use to grab text from the internet? I think it would be very useful if you provided some.


Tuesday, March 18, 2014

The Anatomy of a Large-Scale Hypertextual Web Search Engine

A case study of how to build an effective search engine like Google.

Scaling web search engines:
The World Wide Web Worm (WWWW) had indexed over 110,000 web pages;

The main components for building an effective search engine:
(1) A fast crawler
(2) Large, efficiently used storage
(3) Fast query handling
(4) A fast indexing system

Another problem:
The number of web pages keeps growing, but the number of results people are willing to look at has not changed.

Google's key techniques:
(1) PageRank (2) Using links to improve search results

PageRank Calculation:
The assumption:
A random surfer keeps choosing links at random and never hits the back button, but eventually gets bored and starts again on another random page.
The probability that the random surfer visits a page is its PageRank.

The damping factor captures the chance that the surfer gets bored and starts all over again.

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d
is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are
more details about d in the next section. Also C(A) is defined as the number of links going
out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all
web pages’ PageRanks will be one
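A minimal sketch of iterating that formula on a toy link graph, assuming the graph is given as a dict from page to its outgoing links; note that with this un-normalized form the PageRanks sum to the number of pages, so divide by N (or use (1-d)/N) if a true probability distribution is needed:

    def pagerank(links, d=0.85, iterations=50):
        """Iterate PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))."""
        pages = list(links)
        pr = {page: 1.0 for page in pages}
        for _ in range(iterations):
            pr = {
                page: (1 - d) + d * sum(pr[t] / len(links[t])
                                        for t in pages if page in links[t])
                for page in pages
            }
        return pr

    # Toy graph: A and B link to each other, C links to A.
    print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))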

The Google architecture overview:

Authoritative Sources in a Hyperlinked Environment

This paper mainly focuses on extracting information from the link structure of the web and gauging its effectiveness.

The procedure of searching, by type of user-supplied query:
Specific queries: they suffer from a scarcity problem, meaning relatively few pages contain the required information.
Broad-topic queries: they need a way to filter the very large number of relevant pages.
Similar-page queries.

There are two difficulties in the search for authorities:
1. Millions of pages contain a term such as "Harvard", and the most authoritative page is not necessarily the one that uses the term most often.
2. The "search engine" example: pages that should rank highly (such as the major search engines' own home pages) may not use the query term at all.

Analysis of the link structure:
Construct an algorithm that identifies the pages most central to the topic.
Since relevance and popularity do not always coincide, the authors build an algorithm to test whether a website serves as a hub or an authority.

The goal for this paper:
Our approach to discovering authoritative www sources is meant to have a global nature: We wish to identify the most central pages for broad search topics in the context of the www as a whole.

Section 2: constructing a focused subgraph of relevant pages for a topic
Sections 3 & 4: an algorithm for identifying hubs and authorities in the subgraph


The paper uses a graph algorithm to test whether the central pages can serve as hubs for all the websites on the topic.

Monday, March 17, 2014

MIR Chapter 13

Three different forms of searching the web:
(1) search engines that index the full text (full-text databases)
(2) searching the web by subject (directories)

The outline:
(1) Statistics of the web
(2) The main tools we use to search the web today

Challenges:
First class: problems with the data itself
(1) Distributed data
(2) A high percentage of volatile data (many pages and links are constantly being added and deleted)
(3) Large volume
(4) Unstructured and redundant data
(5) The quality of the data has to be judged, and it is sometimes low

Conclusion: the quality of the data will not change.

Second class: problems of interaction with the data
(1) How to specify a query
(2) How to interpret the answers returned for a query

Modeling the web:
Characteristics of the web text: the vocabulary grows faster and the word distribution is more biased than in classical text collections.

File sizes: the middle of the size distribution is similar to a normal distribution, but the right tail differs across file types and is heavier.

Search Engines:
Centralized crawler-indexer architecture:
The crawler feeds pages to the indexer, which builds the index; the query engine answers user queries using that index.

The problem with this architecture: the crawler-indexer cannot cope with the high volume of data.

Distributed Architecture:
Harvest is a distributed architecture built around gatherers and brokers.
Gatherers: collect and extract indexing information from one or more web servers.
Brokers: provide the indexing mechanism and the query interface to the gathered data.

One of the goals of Harvest is to build topic-specific brokers, focusing the index contents and avoiding many of the vocabulary and scaling problems of generic indices.

User Interface:
The query interface and the answer interface.

Ranking:
Search engines usually do not disclose the details of their ranking algorithms.
Improvements that exploit hyperlink information:
(1) Vector spreading and Boolean spreading
(2) The PageRank algorithm (which takes links and pointers into consideration)

The first assumption: PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q. 

The second assumption: it is further assumed that this user never goes back to a previously visited page by following an already traversed hyperlink backwards. This process can be modeled with a Markov chain, from which the stationary probability of being at each page can be computed.

Hypertext Induced Topic Search:
Pages that have many links pointing to them in S are called authorities (that is, they should have relevant content). Pages that have many outgoing links are called hubs (they should point to similar content).

In other words: if a page has many other pages pointing to it (many pages whose content points to it), it is an authority; if a page has many outgoing links to other pages, it is called a hub.
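A minimal sketch of the HITS hub/authority iteration on the focused subgraph, assuming the subgraph is given as a dict from page to its outgoing links; scores are normalized by their Euclidean length each round, as in Kleinberg's iterative procedure:

    import math

    def hits(links, iterations=20):
        """Authority score = sum of hub scores of pages linking in;
        hub score = sum of authority scores of pages linked to."""
        pages = list(links)
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
            hub = {p: sum(auth[q] for q in links[p] if q in auth) for p in pages}
            a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
            h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
            auth = {p: v / a_norm for p, v in auth.items()}
            hub = {p: v / h_norm for p, v in hub.items()}
        return hub, auth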


Crawling the web: 
The simplest is to start with a set of URLs and from there extract other URLs which are followed recursively in a breadth-first or depth-first fashion.
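A minimal breadth-first crawler sketch of this strategy; fetch and extract_urls are hypothetical stand-ins for a real HTTP client and an HTML link extractor:

    from collections import deque

    def crawl(seed_urls, fetch, extract_urls, max_pages=100):
        """Start from seed URLs and follow newly discovered URLs breadth-first."""
        frontier = deque(seed_urls)
        seen = set(seed_urls)
        pages = {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()          # FIFO queue => breadth-first order
            html = fetch(url)                 # hypothetical: download the page
            pages[url] = html
            for link in extract_urls(html):   # hypothetical: parse out hyperlinks
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages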