Friday, February 14, 2014

Evaluation

1. IIR Chapter 8, or MIR Chapter 3.
2. Karen Sparck Jones (2006). What's the value of TREC: is there a gap to jump or a chasm to bridge? ACM SIGIR Forum, 40(1), June 2006. http://doi.acm.org/10.1145/1147197.1147198
3. Kalervo Järvelin, Jaana Kekäläinen (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4), October 2002, pp. 422-446. http://doi.acm.org/10.1145/582415.582418


The TREC Programme has been very successful at generalising. It has shown that essentially simple methods of retrieving documents (standard statistically-based VSM, BM25, InQuery, etc) can give decent ‘basic benchmark’ performance across a range of datasets, including quite large ones. But retrieval contexts vary widely, much more widely than the contexts the TREC datasets embody. The TREC Programme has sought to address variation, but it has done this in a largely ad hoc and unsystematic way. Thus even while allowing for the Programme’s concentration on core retrieval system functionality, namely the ability to retrieve relevant documents, and excluding for now the tasks it has addressed that do not fall under the document retrieval heading, the generalisation the Programme seeks, or is assumed to be seeking, is incomplete. However, rather than simply continuing with the generalisation mission, intended to gain more support for the current findings, it may be time now to address particularisation.


Environment and contexts:
The simple historic model for environment variation within this laboratory paradigm was of micro variation, i.e. of change to the request set - say plain or fancy, or to the relevance criteria and hence set - say high only or high and partial, for the same set of documents; less commonly there has been change to the document set while holding request or relevance criteria/practice constant.

Changing all of D, Q and R might seem to imply more than micro variation, or at least could be deemed to do so if the type of request and/or style of relevance assessment changed, not merely the actual document sets. Such variation, embodying a new form of need as well as new documents to draw on in trying to meet it, might be deemed to constitute macro variation rather than micro variation, and therefore as implicitly enlarging the TREC Programme's reference to contexts.

TREC Strategies: 
This TREC failure is hardly surprising. In TREC, as in many other retrieval experiment situations, there is normally no material access to the encompassing wider context and especially to the actual users, whether because such real working contexts are too remote or because, fundamentally, they do not exist as prior, autonomous realities.

The factors framework refers to Input Factors (IF), Purpose Factors (PF), and Output Factors (OF).
Input Factors and Purpose Factors constrain the set of choices for Output Factors, but for any complex task cannot simply determine them. Under IF for summarising we have properties of the source texts. These include their Form, Subject type, and Units, which subsume a series of subfactors, as illustrated in Figure 2. It is not difficult to see that such factors apply, in the retrieval case, to documents. They will also apply, in retrieval, to requests, past and current, and to any available past relevant document sets. The particular characterisations of documents and requests may of course be different, e.g. documents but not requests might have a complex structure.

retrieval case - the setup with both system and context - offers.
The non-TREC literature refers to many studies of individual retrieval setups: what they are about, what they are for, how they seem to be working, how they might be specifically improved to serve their purposes better. The generalisation goal that the automated system research community has sought to achieve has worked against getting too involved in the particularities of any individual setups. My argument here is that, in the light of the generalisation we have achieved, we now need to revisit particularity. That is, to try to work with test data that is tied to an accessible and rich setup, that can be analysed for what it suggests to guide system development as well as for what it offers for fuller performance assessment. We need to start from the whole setup, not just from the system along with whatever we happen to be able to pull pretty straightforwardly from the setup into our conventional D * Q * R environment model.

Cumulated Gain-Based Evaluation of IR Techniques
Modern large retrieval environments tend to overwhelm their users by their large output. Since all documents are not of equal relevance to their users, highly relevant documents, or document components, should be identified and ranked first for presentation. This is often desirable from the user point of view. In order to develop IR techniques in this direction, it is necessary to develop evaluation methods that credit IR techniques for their ability to retrieve highly relevant documents.


The second point above stated that the greater the ranked position of a relevant document, the less valuable it is for the user, because the less likely it is that the user will ever examine the document due to time, effort, and cumulated information from documents already seen. This leads to comparison of IR techniques through test queries by their cumulated gain based on document rank with a rank-based discount factor. The greater the rank, the smaller the share of the document score that is added to the cumulated gain.
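To make the discount idea concrete, here is a minimal Python sketch of cumulated gain (CG), discounted cumulated gain (DCG) and normalised DCG along the lines of the paper's definitions as I read them: gains are graded relevance scores, and the gain at each rank at or beyond a base b is divided by log_b(rank). The 0-3 gain scale, the base b = 2, and the example run are my own illustrative choices, and the ideal vector is built here from the run's own gains rather than from all judged documents for the topic, as the paper does.

import math
from typing import List, Sequence

def cg(gains: Sequence[float]) -> List[float]:
    # Cumulated gain vector: CG[i] = G[1] + ... + G[i].
    out, total = [], 0.0
    for g in gains:
        total += g
        out.append(total)
    return out

def dcg(gains: Sequence[float], b: int = 2) -> List[float]:
    # Discounted cumulated gain: gains at rank >= b are divided by log_b(rank),
    # so documents found further down the ranking contribute progressively less.
    out, total = [], 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank < b else g / math.log(rank, b)
        out.append(total)
    return out

def ndcg(gains: Sequence[float], b: int = 2) -> List[float]:
    # Normalised DCG: the actual DCG vector divided position by position by the
    # DCG of an ideal ranking (here: the same gains sorted in decreasing order).
    ideal = dcg(sorted(gains, reverse=True), b)
    return [a / i if i > 0 else 0.0 for a, i in zip(dcg(gains, b), ideal)]

# Graded judgments (0 = not relevant ... 3 = highly relevant) for the top ten
# documents of a hypothetical run.
run = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(dcg(run))
print(ndcg(run))

The base b models user persistence: a small b discounts steeply, as for an impatient user, while a larger b leaves more of the gain to late-ranked documents. Together with the gain values and the last rank considered, these are exactly the parameters discussed at the end of this section.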

The normalized recall measure (NR, for short; Rocchio [1966] and Salton and McGill [1983]), the sliding ratio measure (SR, for short; Pollack [1968] and Korfhage [1997]), and the satisfaction—frustration—total measure (SFT, for short; Myaeng and Korfhage [1990] and Korfhage [1997]) all seek to take into account the order in which documents are presented to the user. The NR measure compares the actual performance of an IR technique to the ideal one (when all relevant documents are retrieved first). Basically it measures the area between the ideal and the actual curves. NR does not take the degree of document relevance into account and is highly sensitive to the last relevant document found late in the ranked order.
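For comparison, a small sketch of the normalised recall idea, using the rank-sum formulation I believe underlies it (NR = 1 - (sum of the actual ranks of the relevant documents - sum of the ideal ranks) / (n(N - n)) for n relevant documents among N ranked); the function and variable names are mine.

from typing import Sequence

def normalized_recall(relevant_ranks: Sequence[int], n_ranked: int) -> float:
    # Normalised recall: 1 minus the (normalised) area between the ideal recall
    # curve, where all relevant documents come first, and the actual one.
    n = len(relevant_ranks)
    if n == 0 or n_ranked <= n:
        return 1.0  # degenerate cases: nothing relevant, or nothing irrelevant
    actual = sum(relevant_ranks)             # 1-based ranks of the relevant docs
    ideal = sum(range(1, n + 1))             # ranks 1..n in the ideal ordering
    return 1.0 - (actual - ideal) / (n * (n_ranked - n))

# A single relevant document found late in the ranking drags the score down
# sharply, which is the sensitivity noted above; relevance grades play no role.
print(normalized_recall([1, 2, 3, 4], n_ranked=1000))    # 1.0
print(normalized_recall([1, 2, 3, 950], n_ranked=1000))  # about 0.76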

They first propose the use of each relevance level separately in recall and precision calculation. Thus different P–R curves are drawn for each level. Performance differences at different relevance levels between IR techniques may thus be analyzed. Furthermore, they generalize recall and precision calculation to directly utilize graded document relevance scores.
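If I read the generalisation correctly, graded relevance scores (mapped onto [0, 1]) simply replace the usual 0/1 indicators in the precision and recall sums. A sketch under that assumption, with hypothetical document identifiers and grades:

from typing import Mapping, Sequence, Tuple

def generalized_pr(retrieved: Sequence[str],
                   judgments: Mapping[str, float]) -> Tuple[float, float]:
    # Generalised precision: relevance mass of the retrieved documents divided
    # by the number retrieved. Generalised recall: the same mass divided by the
    # total relevance mass of all judged documents for the topic.
    mass = sum(judgments.get(d, 0.0) for d in retrieved)
    total = sum(judgments.values())
    precision = mass / len(retrieved) if retrieved else 0.0
    recall = mass / total if total else 0.0
    return precision, recall

# Graded judgments rescaled to [0, 1] (e.g. 0, 0.33, 0.67, 1.0 for a 0-3 scale).
judgments = {"d1": 1.0, "d2": 0.67, "d3": 0.33, "d4": 0.0, "d5": 1.0}
print(generalized_pr(["d1", "d4", "d5"], judgments))  # roughly (0.67, 0.67)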

The nonbinary relevance judgments were obtained by rejudging documents judged relevant by NIST assessors and about 5% of irrelevant documents for each topic. The new judgments were made by six Master's students of information studies, all of them fluent in English although not native speakers. The relevant and irrelevant documents were pooled, and the judges did not know the number of documents previously judged relevant or irrelevant in the pool.

The proposed measures are based on several parameters: the last rank considered, the gain values to employ, and discounting factors to apply. An experimenter needs to know which parameter values and combinations to use. In practice, the evaluation context and scenario should suggest these values.
