MetadataShow full item record
Abstractcollections are the standard framework used in the evaluation of an information retrieval system and the comparison between different systems. A text test collection consists of a set of documents, a set of topics, and a set of relevance assessments which is a list indicating the relevance of each document to each topic. Traditionally, forming the relevance assessments is done manually by human judges. But in large scale environments, such as the web, examining each document retrieved to determine its relevance is not possible. In the past there have been several studies that aimed to reduce the human effort required in building these assessments which are referred to as qrels (query-based relevance sets). Some research has also been done to completely automate the process of generating the qrels. In this thesis, we present different methodologies that lead to producing the qrels automatically without any human intervention. A first method is based on keyphrase (KP) extraction from documents presumed relevant; a second method uses Machine Learning classifiers, Naïve Bayes and Support Vector Machines. The experiments were conducted on the TREC-6, TREC-7 and TREC-8 test collections. The use of machine learning classifiers produced qrels resulting in information retrieval system rankings which were better correlated with those produced by TREC human assessments than any of the automatic techniques proposed in the literature. In order to produce a test collection which could discriminate between the best performing systems, an enhancement to the machine learning technique was made that used a small number of real or actual qrels as training sets for the classifiers. These actual relevant documents were selected by Losada et al.’s (2016) pooling technique. This modification led to an improvement in the overall system rankings and enabled discrimination between the best systems with only a little human effort. We also used the bpref-10 and infAP measures for evaluating the systems and comparing between the rankings, since they are more robust in incomplete judgment environments. We applied our new techniques to the French and Finnish test collections from CLEF2003 in order to confirm their reproducibility on non-English languages, and we achieved high correlations as seen for English.
PublisherUniversity of Wolverhampton
TypeThesis or dissertation
DescriptionA thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy
The following licence applies to the copyright and re-use of this item:
- Creative Commons
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States