I think it is time the open source search community (and I don’t mean just Lucene) developed and published a set of TREC-style relevance judgments for freely available data that is easily obtained from the Internet. Simply put, I am wondering if there are volunteers out there who would be willing to develop a practical set of queries and judgments for datasets like Wikipedia, iBiblio, the Internet Archive, etc. We wouldn’t host these datasets; we would just provide the queries and judgments, as well as the info on how to obtain the data. From there, it would be easy enough to provide simple scripts that do things like run Lucene’s contrib/benchmark Quality tasks against said data.
Practically speaking, I don’t think we even need to go as deep as TREC. I think we would find the most use in making judgments on the top 10 or 20 results for any given query.
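To make the top-10-or-20 idea concrete, here is a minimal sketch of how shallow judgments could be scored. It assumes the judgments are stored in the usual four-column TREC qrels layout (topic, iteration, document id, relevance) and computes precision at k. The function names and the sample document ids are hypothetical, and unjudged documents are simply treated as not relevant, which is one possible choice rather than a settled one.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines of the form 'topic iteration docno relevance'."""
    judgments = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, relevance = parts
        judgments[topic][docno] = int(relevance)
    return judgments


def precision_at_k(ranked_docnos, topic_judgments, k=10):
    """Fraction of the top k slots filled by a document judged relevant.

    Assumption: unjudged documents count as not relevant, and the score is
    divided by k even when fewer than k results were returned.
    """
    top = ranked_docnos[:k]
    relevant = sum(1 for docno in top if topic_judgments.get(docno, 0) > 0)
    return relevant / k


# Hypothetical judgments and ranking, for illustration only.
qrels_lines = [
    "q1 0 docA 1",
    "q1 0 docB 0",
    "q1 0 docC 1",
]
judgments = parse_qrels(qrels_lines)
ranking = ["docA", "docB", "docC", "docD"]  # docD is unjudged
score = precision_at_k(ranking, judgments["q1"], k=4)
```

With judgments only on the first 10 or 20 results, a measure like this stays meaningful because it never looks deeper than the judged pool.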
So, what do others think? Am I off my rocker? Are there any volunteers out there? I think we could do this pretty simply through some scripts and the effective use of a wiki. I don’t think our goal in the short run is to be scientifically rigorous, though it should become so over time. Instead, I think our goal is to run a practical relevance test like any organization should when deploying search: take the top 50 queries, plus 20 or so random queries, and judge the results for each. (I wonder if Wikipedia would give us their top 50 queries, or maybe the list is already available.) Over time, we can add queries and refine judgments using the web 2.0 mentality of the wisdom of crowds.
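For what it is worth, picking the query set from a log could be as small as the sketch below. It assumes a plain list of raw query strings is available (the `query_log` input and the `pick_queries` name are hypothetical), takes the most frequent queries as the "top" set, and draws the random set uniformly from the remaining distinct queries. Sampling by distinct query rather than by log entry is an assumption; weighting by frequency would be an equally reasonable choice.

```python
import random
from collections import Counter


def pick_queries(query_log, n_top=50, n_random=20, seed=0):
    """Return (top queries, random queries) drawn from a list of raw query strings."""
    # Normalize lightly so trivial variants of a query are counted together.
    counts = Counter(q.strip().lower() for q in query_log if q.strip())

    top = [query for query, _count in counts.most_common(n_top)]
    top_set = set(top)

    # Sample from the distinct queries not already in the top set.
    # Sorting first keeps the result reproducible for a given seed.
    remainder = sorted(q for q in counts if q not in top_set)
    rng = random.Random(seed)
    sampled = rng.sample(remainder, min(n_random, len(remainder)))
    return top, sampled


# Tiny hypothetical log, for illustration only.
log = ["a"] * 5 + ["b"] * 3 + ["c", "d", "e"]
top, sampled = pick_queries(log, n_top=2, n_random=2)
```

A fixed seed means everyone who runs the script against the same log gets the same query set, which matters if judgments are collected on a wiki by many people.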