Wednesday, March 26, 2014

Parallel Information Retrieval

1. Parallel Query Processing 
2. Parallel execution of off-line tasks 


Parallel Query Processing: 
index partitioning;
replication;

Replication:
Treat each search engine as a node and maintain n replicas of it. Because each replica can answer queries on its own, this yields an n-fold increase in the engine's service rate. This is called inter-query parallelism.
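As a rough illustration (not from the original text), a front-end might simply cycle incoming queries over the n replicas; the ReplicatedEngine class and the replicas' process() method below are hypothetical:

import itertools

class ReplicatedEngine:
    """Round-robin dispatcher over n identical index replicas (illustrative sketch)."""

    def __init__(self, replicas):
        self.replicas = replicas               # each replica holds a full copy of the index
        self._cycle = itertools.cycle(replicas)

    def process(self, query):
        # Inter-query parallelism: each query is handled by exactly one replica,
        # so n replicas can serve roughly n times as many queries per second.
        replica = next(self._cycle)
        return replica.process(query)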

Alternatively, we could split the index into n parts and have each node work only on its own small part of the index. This approach is referred to as intra-query parallelism. 

Two performance measures are used to gauge the engine: its service rate (the number of queries it can serve per second) and the average time per query.

Document partitioning: 
Each node holds a subset of the documents.
The main advantage of the document-partitioned approach is its simplicity. Because all index servers operate independently of each other, no additional complexity needs to be introduced into the low-level query processing routines.
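A minimal sketch of the query path under document partitioning, assuming each index server exposes a hypothetical search(query, k) that returns scored (score, doc_id) pairs for its own document subset; the broker broadcasts the query and merges the per-node top-k lists:

import heapq

def broker_search(query, nodes, k=10):
    """Document partitioning: every node searches its own document subset,
    and the broker merges the per-node top-k lists into a global top-k."""
    partial_results = []
    for node in nodes:                                   # could be issued in parallel in practice
        partial_results.extend(node.search(query, k))    # [(score, doc_id), ...]
    # Keep the k highest-scoring documents across all nodes.
    return heapq.nlargest(k, partial_results, key=lambda pair: pair[0])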

How to divide the collection into subsets across the nodes: 
by storing similar documents on the same node.


Term Partitioning: 
Each node holds a subset of the collection's terms.
Term partitioning addresses the disk seek problem by splitting the collection into sets of terms instead of sets of documents.
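For illustration, a sketch of term-partitioned lookup, assuming terms are mapped to nodes by hashing and that each node exposes a hypothetical postings(term) method returning that term's complete postings list:

def node_for_term(term, nodes):
    """Term partitioning: a term's *entire* postings list lives on one node,
    chosen here (as an assumption) by hashing the term."""
    return nodes[hash(term) % len(nodes)]

def fetch_postings(query_terms, nodes):
    # Only the nodes owning the query terms are involved; each returns the
    # full postings list for its term (a single sequential read per term).
    return {t: node_for_term(t, nodes).postings(t) for t in query_terms}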

Scalability: 
As the collection becomes bigger, so do the individual postings lists. For a corpus of a billion documents, the postings list for a frequent term, such as “computer” or “internet”, can easily occupy several hundred megabytes, and under term partitioning that entire list must be handled by a single node.

Load Imbalance. 
Term partitioning suffers from an uneven load across the index nodes.

Term-at-a-Time.
Perhaps the most severe limitation of the term-partitioned approach is its inability to support efficient document-at-a-time query processing.

Hybrid Schemes: 
Xi et al. (2002) propose a hybrid architecture in which the collection is divided into p subcollections according to a standard document partitioning scheme. The index for each subcollection is then term-partitioned across n/p nodes, so that each node in the system is responsible for all occurrences of a set of terms within one of the p subcollections.
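A sketch of the node-addressing rule this layout implies, under the assumption that the n nodes are arranged as p document partitions of n/p term-partitioned nodes each, with terms assigned by hashing:

def hybrid_node(doc_partition, term, n, p):
    """Hybrid partitioning: the collection is split into p subcollections,
    and each subcollection's index is term-partitioned across n/p nodes.
    Returns the node holding `term` for the given subcollection."""
    nodes_per_partition = n // p
    term_slot = hash(term) % nodes_per_partition     # assumed hash-based term assignment
    return doc_partition * nodes_per_partition + term_slot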

Redundancy and Fault Tolerance:
Replication. We can maintain multiple copies of the same index node, as described in the previous section on hybrid partitioning schemes.

Partial Replication. Instead of replicating the entire index r times, we may choose to replicate index information only for important documents.

Dormant Replication. Suppose the search engine comprises a total of n nodes. We can divide the index found on each node vᵢ into n − 1 fragments and distribute them evenly among the n − 1 remaining nodes, but leave them dormant (on disk) and not use them for query processing.
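A purely illustrative sketch of that fragment placement: node i's index is cut into n − 1 fragments, and fragment k is parked, dormant, on the k-th of the other nodes:

def dormant_placement(n):
    """For each node i, assign its n - 1 dormant index fragments to the
    remaining n - 1 nodes (one fragment per node)."""
    placement = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # fragment k of node i is stored (dormant) on node others[k]
        placement[i] = {k: others[k] for k in range(n - 1)}
    return placement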

Map-Reduce Framework: 
Large-scale data processing tasks: building and updating the index; identifying duplicate documents in the corpus; and analyzing the link structure of the document collection.

Current software implementations specify a Map function, which maps a set of key/value pairs into a new set of key/value pairs, and a concurrent Reduce function, which guarantees that all of the mapped key/value pairs sharing the same key are grouped together.

Simply put, a map function performs a specified operation on each element of a conceptual list of independent elements (for example, a list of test scores). For instance, if someone discovers that every student's score was overestimated by one point, they can define a "subtract one" map function to correct the error.
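That "subtract one" correction is literally a one-line map over the list (a trivial sketch with made-up scores):

scores = [87, 92, 78, 95]                       # hypothetical test scores
corrected = list(map(lambda s: s - 1, scores))  # every score was one point too high
# corrected == [86, 91, 77, 94]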

• In the map phase, key/value pairs are read from the input and the map function is applied to each of them individually. The function is of the general form

map : (k, v) ↦ ⟨(k₁, v₁), (k₂, v₂), ...⟩.    (14.9)

That is, for each key/value pair, map outputs a sequence of key/value pairs. This sequence may or may not be empty, and the output keys may or may not be identical to the input key (they usually aren't).

• In the shuffle phase, the pairs produced during the map phase are sorted by their key, and all values for the same key are grouped together.
• In the reduce phase, the reduce function is applied to each key and its values. The function is of the form

reduce : (k, ⟨v₁, v₂, ...⟩) ↦ (k, ⟨v′₁, v′₂, ...⟩).    (14.10)

Suppose we have a total of n = m + r machines, where m is the number of map workers and r is the number of reduce workers. The input of a MapReduce job is broken into small pieces called map shards. Each shard typically holds between 16 and 64 MB of data. The shards are treated independently, and each shard is assigned to one of the m map workers. In a large MapReduce, it is common to have dozens or hundreds of map shards assigned to each map worker. A worker usually works on only one shard at a time, so all of its shards have to be processed sequentially. However, if a worker has more than one CPU, performance may improve by having it work on multiple shards in parallel.
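A minimal sketch of shard handling, assuming round-robin assignment and shards that iterate as (key, value) pairs; both details are assumptions, not taken from the text:

def assign_shards(shards, m):
    """Distribute map shards (16-64 MB pieces of the input) over m map workers."""
    assignment = {worker: [] for worker in range(m)}
    for i, shard in enumerate(shards):
        assignment[i % m].append(shard)        # round-robin placement is an assumption
    return assignment

def run_map_worker(worker_shards, map_fn):
    for shard in worker_shards:                # sequential: one shard at a time
        for key, value in shard:
            yield from map_fn(key, value)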

The map() function takes a key/value pair as input and produces another series of key/value pairs as intermediate output, written to local disk. The MapReduce framework automatically groups this intermediate data by key, and all data with the same key (the grouping policy can be configured by the user; by default the key is hashed and taken modulo) is handed to the same reduce() function.
The reduce() function takes a key and its corresponding list of values as input; after merging the values that share the same key, it produces another series of key/value pairs as the final output, written to HDFS.
Below, the "hello world" program of MapReduce, WordCount, is used to introduce how such programs are written.
A "hello world" program is the first program we write when learning any programming language. It is simple, easy to understand, and helps readers get started quickly. Likewise, distributed processing frameworks have their own "hello world" program: WordCount. It counts how many times each word occurs in the input files. In MapReduce it can be written as follows (pseudocode).
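The pseudocode listing itself is not reproduced here; a minimal Python sketch of the map() and reduce() functions it describes (my reconstruction, not the original listing) could look like this:

def wordcount_map(key, value):
    """key: document name (unused); value: document text.
    Emits (word, 1) for every word in the document."""
    for word in value.split():
        yield (word, 1)

def wordcount_reduce(key, values):
    """key: a word; values: the counts emitted for that word.
    Emits the total number of occurrences of the word."""
    yield (key, sum(values))

The framework's shuffle phase groups the (word, 1) pairs by word before wordcount_reduce is invoked for each word.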



