2. Parallel execution of off-line tasks
Parallel Query Processing:
Treat the servers as nodes and replicate each node n times, which yields an n-fold increase in the engine's service rate. This is called inter-query parallelism.
Alternatively, we can split the index into n parts and have each node work only on its own small part of the index. This approach is referred to as intra-query parallelism.
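The two forms of parallelism can be illustrated with a small sketch. The sub-indexes, document IDs, and helper names below are hypothetical; the point is that under intra-query parallelism a single query is broadcast to every node at once and the broker merges the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical document-partitioned sub-indexes: each node holds the
# full index for its own subset of the documents.
NODES = [
    {"computer": [1, 4], "internet": [2, 4]},    # node 0: docs 1-4
    {"computer": [7, 9], "internet": [9, 11]},   # node 1: docs 5-11
]

def search_node(index, term):
    """Run the query against a single node's sub-index."""
    return index.get(term, [])

def query(term):
    # Intra-query parallelism: the same query is processed by every
    # node concurrently; the broker merges the partial result lists.
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        partials = pool.map(search_node, NODES, [term] * len(NODES))
    return sorted(doc for partial in partials for doc in partial)
```

Inter-query parallelism, by contrast, would keep n full replicas of the same index and send *different* queries to different replicas, leaving each individual query sequential.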
Two performance indicators are used: the engine's service rate and the average time per query.
Document partitioning: each node holds a subset of the documents.
The main advantage of the document-partitioned approach is its simplicity. Because all index servers operate independently of each other, no additional complexity needs to be introduced into the low-level query processing routines.
How to divide the collection into subsets across the nodes: store similar documents on the same node.
Term partitioning: each node holds a subset of the collection's terms.
Term partitioning addresses the disk seek problem by splitting the collection into sets of terms
instead of sets of documents.
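The idea can be sketched as follows. The node layout and routing function are illustrative assumptions; the key property is that each term's complete postings list lives on exactly one node, so a query touches only the nodes owning its terms:

```python
# Term partitioning sketch: each node holds the complete postings
# lists for a subset of the terms (all data here is hypothetical).
NODES = [
    {"computer": [1, 4, 7], "database": [2, 7]},   # node 0
    {"internet": [2, 4, 9], "network": [4, 9]},    # node 1
]

def route(term):
    """Find the node responsible for a term (linear scan for clarity;
    a real system would use a hash or a term-to-node directory)."""
    for node_id, index in enumerate(NODES):
        if term in index:
            return node_id
    return None

def conjunctive_query(terms):
    # Fetch each term's full postings list from its owning node,
    # then intersect the lists at the query broker.
    lists = [set(NODES[route(t)][t]) for t in terms]
    return sorted(set.intersection(*lists))
```

Note how the intersection happens at a central broker: this is exactly why document-at-a-time processing is hard here, since no single node sees all the terms of a query.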
Scalability of query processing:
As the collection becomes bigger, so do the individual postings lists. For a corpus composed of a billion documents, the postings list for a frequent term, such as “computer” or “internet”, can easily occupy several hundred megabytes.
Term partitioning suffers from an uneven load across the index nodes.
Perhaps the most severe limitation of the term-partitioned approach is its inability to support efficient document-at-a-time query processing.
Xi et al. (2002) propose a hybrid architecture in which the collection is divided into p subcollections according to a standard document partitioning scheme. The index for each subcollection is then term-partitioned across n/p nodes, so that each node in the system is responsible for all occurrences of a set of terms within one of the p subcollections.
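A small sketch of the node-assignment arithmetic in this hybrid scheme may help. The hash-based term slot and contiguous document ranges below are illustrative assumptions, not details from Xi et al.:

```python
# Hybrid partitioning sketch (after Xi et al., 2002): the collection
# is split into p document subcollections, and each subcollection's
# index is term-partitioned across n/p nodes.

def node_for(term, doc_id, n, p, num_docs):
    """Return the node holding `term`'s postings for the
    subcollection containing `doc_id` (illustrative layout)."""
    docs_per_sub = num_docs // p
    subcollection = min(doc_id // docs_per_sub, p - 1)
    nodes_per_sub = n // p
    term_slot = hash(term) % nodes_per_sub  # term partition inside it
    return subcollection * nodes_per_sub + term_slot
```

With n = 8 nodes and p = 2 subcollections, each subcollection's terms are spread over 4 nodes; the same term maps to the same slot in every subcollection, but its postings for different subcollections live on different nodes.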
Redundancy and Fault Tolerance:
Replication. We can maintain multiple copies of the same index node, as described in
the previous section on hybrid partitioning schemes.
Partial Replication. Instead of replicating the entire index r times, we may choose to replicate index information only for important documents.
Dormant Replication. Suppose the search engine comprises a total of n nodes. We can divide the index found on each node vi into n − 1 fragments and distribute them evenly among the n − 1 remaining nodes, but leave them dormant (on disk), not using them for query processing under normal operation; the fragments are activated only if vi fails.
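The fragment layout can be sketched directly. The round-robin placement below is an illustrative assumption; any scheme works as long as the n − 1 fragments of each node land on n − 1 distinct other nodes:

```python
# Dormant replication sketch: node i's index is cut into n-1
# fragments, and fragment j is stored (dormant, on disk) on one of
# the other nodes.

def fragment_placement(n):
    """placement[i][j] = node storing fragment j of node i's index."""
    placement = {}
    for i in range(n):
        others = [k for k in range(n) if k != i]
        placement[i] = {j: others[j] for j in range(n - 1)}
    return placement
```

If node i fails, its n − 1 fragments are activated on n − 1 distinct nodes, so the extra load from the failure is spread evenly instead of doubling the load on a single backup.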
Large Scale Data Processing: building and updating the index; identifying duplicate documents in the corpus; and analyzing the link structure of the document collection.
• In the map phase, key/value pairs are read from the input and the map function is applied to each of them individually. The function is of the general form
map : (k, v) ↦ ⟨(k1, v1), (k2, v2), …⟩.  (14.9)
That is, for each key/value pair, map outputs a sequence of key/value pairs. This
sequence may or may not be empty, and the output keys may or may not be identical to
the input key (they usually aren’t).
• In the shuffle phase, the pairs produced during the map phase are sorted by their key,
and all values for the same key are grouped together.
• In the reduce phase, the reduce function is applied to each key and its values. The function is of the form
reduce : (k, ⟨v1, v2, …⟩) ↦ (k, ⟨v′1, v′2, …⟩).  (14.10)
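The three phases above can be modeled in a few lines. This is a single-process sketch with hypothetical function names, not a distributed implementation; it only demonstrates the data flow from map through shuffle to reduce:

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, records):
    """Minimal single-process model of the three MapReduce phases."""
    # Map phase: map_fn is applied to every input key/value pair
    # and may emit zero or more intermediate pairs.
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))
    # Shuffle phase: intermediate pairs are sorted by key and all
    # values for the same key are grouped together.
    groups = defaultdict(list)
    for k, v in sorted(intermediate):
        groups[k].append(v)
    # Reduce phase: reduce_fn turns each (key, values) group into
    # the final reduced value for that key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}
```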
Suppose we have a total of n = m + r machines, where m is the number of map workers and r is the number of reduce workers. The input of the MapReduce job is broken into small pieces called map shards. Each shard typically holds between 16 and 64 MB of data. The shards are treated independently, and each shard is assigned to one of the m map workers. In a large MapReduce, it is common to have dozens or hundreds of map shards assigned to each map worker. A worker usually works on only one shard at a time, so all its shards have to be processed sequentially. However, if a worker has more than one CPU, it may improve performance to have it work on multiple shards in parallel.
The "hello world" program is the first program we write when learning any programming language; it is simple and easy to understand, and helps readers get started quickly. Distributed processing frameworks likewise have their own "hello world" program: WordCount, which counts the number of occurrences of each word in the input files. In MapReduce, it can be written as follows (pseudocode).
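A self-contained sketch of WordCount, written in Python rather than true pseudocode; the map function emits (word, 1) for every word occurrence, and the reduce function sums the counts for each word:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit ("word", 1) for every word occurrence in the document.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Sum all the 1s emitted for this word.
    return (word, sum(counts))

def word_count(docs):
    # Map phase over every (doc_id, text) pair.
    intermediate = []
    for doc_id, text in docs.items():
        intermediate.extend(map_fn(doc_id, text))
    # Shuffle phase: group the emitted pairs by word.
    groups = defaultdict(list)
    for word, one in intermediate:
        groups[word].append(one)
    # Reduce phase: one output pair per distinct word.
    return dict(reduce_fn(w, counts) for w, counts in groups.items())
```

In a real framework only map_fn and reduce_fn would be user code; the grouping in the middle is the framework's shuffle.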