environment and contexts,
The simple historic model for environment variation within this laboratory paradigm was of micro
variation, i.e. of change to the request set - say plain or fancy, or to the relevance criteria and hence set -
say high only or high and partial, for the same set of documents; less commonly there has been change to
the document set while holding request or relevance criteria/practice constant.
Changing all of D, Q and R might seem to imply more than micro variation, or at least could be deemed
to do so if the type of request and/or style of relevance assessment changed, not merely the actual document sets. Such variation, embodying a new form of need as well as new documents to draw on in trying to meet it, might be deemed to constitute macro variation rather than micro variation, and therefore as implicitly enlarging the TREC Programme's reference to contexts.
This TREC failure is hardly surprising. In TREC, as in many other retrieval experiment situations,
there is normally no material access to the encompassing wider context and especially to the actual users, whether because such real working contexts are too remote or because, fundamentally, they do not exist as prior, autonomous realities.
The factors framework refers to Input Factors (IF), Purpose Factors (PF), and Output Factors (OF).
Input Factors and Purpose Factors constrain the set of choices for Output Factors, but for any complex task cannot simply determine them. Under IF for summarising we have properties of the source texts. This includes their Form, Subject type, and Units, which subsume a series of subfactors, as illustrated in Figure 2. It is not difficult to see that such factors apply, in the retrieval case, to documents. They will also apply, in retrieval, to requests, past and current, and to any available past relevant document sets. The particular characterisations of documents and requests may of course be different, e.g. documents but not requests might have a complex structure.
retrieval case - the setup with both system and context - offers.
The non-TREC literature refers to many studies of individual retrieval setups: what they are about,
what they are for, how they seem to be working, how they might be specifically improved to serve their
purposes better. The generalisation goal that the automated system research community has sought to
achieve has worked against getting too involved in the particularities of any individual setups. My argument here is that, in the light of the generalisation we have achieved, we now need to revisit particularity. That is, to try to work with test data that is tied to an accessible and rich setup, that can be analysed for what it suggests to guide system development as well as for what it offers for fuller performance assessment. We need to start from the whole setup, not just from the system along with whatever we happen to be able to pull pretty straightforwardly from the setup into our conventional D * Q * R environment model.
Cumulated Gain-Based Evaluation of IR Techniques
Modern large retrieval environments tend to overwhelm their users by their large output. Since not all documents are of equal relevance to their users, highly relevant documents, or document components, should be identified and ranked first for presentation. This is often desirable from the user point of view. In order to develop IR techniques in this direction, it is necessary to develop.
The second point above stated that the greater the ranked position of a relevant document, the less valuable it is for the user, because the less likely it is that the user will ever examine the document due to time, effort, and cumulated information from documents already seen. This leads to comparison of IR techniques through test queries by their cumulated gain based on document rank with a rank-based discount factor. The greater the rank, the smaller the share of the document score that is added to the cumulated gain.
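The cumulation and rank-based discounting described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes the common formulation in which the gain at rank i is divided by the logarithm of i to a base b, with no discount applied at ranks below b. The function names, the default base of 2, and the sample gain vector used in any call are illustrative assumptions.

```python
import math


def cumulated_gain(gains):
    """Return the running sum of gain values, one entry per rank."""
    cg, total = [], 0.0
    for gain in gains:
        total += gain
        cg.append(total)
    return cg


def discounted_cumulated_gain(gains, base=2):
    """Return the cumulated gain with a log-based rank discount.

    Assumption: ranks below `base` are not discounted; from rank `base`
    onward each gain is divided by log_base(rank), so a document found
    later contributes a smaller share of its score.
    """
    dcg, total = [], 0.0
    for rank, gain in enumerate(gains, start=1):
        if rank < base:
            total += gain
        else:
            total += gain / math.log(rank, base)
        dcg.append(total)
    return dcg
```

A larger base models a more patient user, since the discount then sets in later and grows more slowly.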
The normalized recall measure (NR, for short; Rocchio; Salton and McGill), the sliding ratio measure (SR, for short; Pollack; Korfhage), and the satisfaction-frustration-total measure (SFT, for short; Myaeng and Korfhage; Korfhage) all seek to take into account the order in which documents are presented to the user. The NR measure compares the actual performance of an IR technique to the ideal one (when all relevant documents are retrieved first). Basically it measures the area between the ideal and the actual curves. NR does not take the degree of document relevance into account and is highly sensitive to the last relevant document found late in the ranked order.
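To make the NR description concrete, the sketch below uses the textbook form of normalized recall, in which the area between the actual and ideal recall curves reduces to a difference of rank sums. The formula is not given in the text above and is assumed here; the function name and arguments are illustrative.

```python
def normalized_recall(relevant_ranks, collection_size):
    """Normalized recall under binary relevance (assumed textbook form).

    relevant_ranks: 1-based ranks at which the relevant documents appear.
    collection_size: total number of ranked documents N.
    Returns 1.0 when all relevant documents are ranked first and 0.0
    when they are all ranked last.
    """
    n = len(relevant_ranks)
    if n == 0 or n >= collection_size:
        raise ValueError("need at least one relevant and one irrelevant document")
    actual = sum(relevant_ranks)      # rank sum of the actual ordering
    ideal = n * (n + 1) // 2          # rank sum when relevant docs occupy 1..n
    return 1 - (actual - ideal) / (n * (collection_size - n))
```

Under this form, moving a single relevant document from rank 4 to rank 10 in a ten-document ranking drops the score from about 0.95 to about 0.67, which illustrates the sensitivity to a late-ranked relevant document noted above.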
They first propose the use of each relevance level separately in recall and precision calculation. Thus different P–R curves are drawn for each level. Performance differences at different relevance levels between IR techniques may thus be analyzed. Furthermore, they generalize recall and precision calculation to directly utilize graded document relevance scores.
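One way to read the generalization is sketched below, assuming graded relevance scores scaled to the interval [0, 1]: generalized precision is the mean score of the retrieved documents, and generalized recall is the retrieved share of the total score in the collection. The function name, the dictionary representation, and the scaling are assumptions made for illustration, not details taken from the text above.

```python
def generalized_precision_recall(retrieved, scores):
    """Return (generalized precision, generalized recall).

    retrieved: list of document ids returned by the IR technique.
    scores: dict mapping every judged document id to a relevance score
            in [0, 1]; documents missing from the dict count as 0.
    """
    if not retrieved:
        raise ValueError("retrieved set is empty")
    total_relevance = sum(scores.values())
    if total_relevance == 0:
        raise ValueError("no relevant documents in the collection")
    retrieved_relevance = sum(scores.get(doc, 0.0) for doc in retrieved)
    precision = retrieved_relevance / len(retrieved)
    recall = retrieved_relevance / total_relevance
    return precision, recall
```

With binary scores of 0 and 1 both values reduce to conventional precision and recall, so the graded form can be seen as a strict extension of the usual measures.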
The nonbinary relevance judgments were obtained by rejudging documents judged relevant by NIST assessors and about 5% of irrelevant documents for each topic. The new judgments were made by six Master’s students of information studies, all of them fluent in English although not native speakers. The
relevant and irrelevant documents were pooled, and the judges did not know the number of documents previously judged relevant or irrelevant in the pool.
The proposed measures are based on several parameters: the last rank considered, the gain values to employ, and discounting factors to apply. An experimenter needs to know which parameter values and combinations to use. In practice, the evaluation context and scenario should suggest these values.