System-oriented IR evaluations are limited to rather abstract understandings of real user behavior

Last update: Nov 23, 2022

Overview

Validating Simulations of User Query Variants

This repository contains the scripts of the experiments and evaluations, simulated queries, as well as the figures of:

Timo Breuer, Norbert Fuhr, and Philipp Schaer. 2022. Validating Simulations of User Query Variants. In Proceedings of the 44th European Conference on IR Research, ECIR 2022.

System-oriented IR evaluations are limited to rather abstract understandings of real user behavior. As a solution, simulating user interactions provides a cost-efficient way to support system-oriented experiments with more realistic directives when no interaction logs are available. While there are several user models for simulated clicks or result list interactions, very few attempts have been made towards query simulations, and it has not been investigated if these can reproduce properties of real queries. In this work, we validate simulated user query variants with the help of TREC test collections in reference to real user queries that were made for the corresponding topics. Besides, we introduce a simple yet effective method that gives better reproductions of real queries than the established methods. Our evaluation framework validates the simulations regarding the retrieval performance, reproducibility of topic score distributions, shared task utility, effort and effect, and query term similarity when compared with real user query variants. While the retrieval effectiveness and statistical properties of the topic score distributions as well as economic aspects are close to that of real queries, it is still challenging to simulate exact term matches and later query reformulations.

Directory overview

Directory	Description
`config/`	Contains configuration files for the query simulations, experiments, and evaluations.
`data/`	Contains (intermediate) output data of the simulations and experiments as well as the figures of the paper.
`eval/`	Contains scripts of the experiments and evaluations.
`sim/`	Contains scripts of the query simulations.

Setup

Install Anserini and index Core17 (The New York Times Annotated Corpus) according to the regression guide:

anserini/target/appassembler/bin/IndexCollection \
    -collection NewYorkTimesCollection \
    -input /path/to/core17/ \
    -index anserini/indexes/lucene-index.core17 \
    -generator DefaultLuceneDocumentGenerator \
    -threads 4 \
    -storePositions \
    -storeDocvectors \
    -storeRaw \
    -storeContents \
    > anserini/logs/log.core17 &

Install the required Python packages:

pip install -r requirements.txt

Query simulation

To prepare the language models and simulate the queries, the scripts have to be executed in the order shown in the following table. All outputs are written to the `data/` directory. For better code readability, the names of the query reformulation strategies used in the paper have been mapped to consecutive numbers (paper → repository): S1 → S1; S2 → S2; S2' → S3; S3 → S4; S3' → S5; S4 → S6; S4' → S7; S4'' → S8. The names of the scripts and output files follow this mapping.
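The name mapping can be written down as a small lookup table, which helps when matching a strategy from the paper to a script or output file. This is a minimal sketch: the mapping itself is taken from the text above, while `STRATEGY_NAMES`, `PAPER_NAMES`, and `repo_name` are hypothetical names that do not come from the repository.

```python
# Paper notation -> name used in script and output file names.
# The pairs are copied from the README; the identifiers are illustrative only.
STRATEGY_NAMES = {
    "S1": "S1",
    "S2": "S2",
    "S2'": "S3",
    "S3": "S4",
    "S3'": "S5",
    "S4": "S6",
    "S4'": "S7",
    "S4''": "S8",
}

# Inverse lookup: repository name -> paper notation.
PAPER_NAMES = {repo: paper for paper, repo in STRATEGY_NAMES.items()}


def repo_name(paper_name: str) -> str:
    """Translate a strategy name from the paper into the repository's name."""
    return STRATEGY_NAMES[paper_name]


if __name__ == "__main__":
    for paper, repo in STRATEGY_NAMES.items():
        print(f"{paper:>5} -> {repo}")
```

Note that the mapping is one-to-one, so the same table can be read in both directions, but the two notations overlap: S3 in the paper is S4 in the repository, whereas S3 in the repository is S2' in the paper.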

Script	Description	Output files
`sim/make_background.py`	Make the background language model from all index terms of Core17. The background model is required for Controlled Query Generation (CQG) by Jordan et al.	`data/lm/background.csv`
`sim/make_cqg.py`	Make the CQG language models with different parameters of lambda from 0.0 to 1.0.	`data/lm/cqg.json`
`sim/simulate_queries_s12345.py`	Simulate TTS and KIS queries with strategies S1 to S3'	`data/queries/s12345.csv`
`sim/simulate_queries_s678.py`	Simulate TTS and KIS queries with strategies S4 to S4''	`data/queries/s678.csv`
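To illustrate what the lambda parameter of the CQG language models controls, the following sketch mixes a topic model with a background model by linear interpolation. It is an assumption-laden toy example and not the code of `sim/make_cqg.py`: the interpolation form, the direction in which lambda weights the two components, the function names, and the example terms are all assumptions made for illustration.

```python
from collections import Counter


def relative_frequencies(terms):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def mix_language_models(topic_lm, background_lm, lam):
    """Assumed mixture: (1 - lam) * P_topic(t) + lam * P_background(t)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0.0, 1.0]")
    vocabulary = set(topic_lm) | set(background_lm)
    return {
        term: (1.0 - lam) * topic_lm.get(term, 0.0)
        + lam * background_lm.get(term, 0.0)
        for term in vocabulary
    }


# Toy term lists (made up); the real models are built from Core17.
topic_lm = relative_frequencies(["wildlife", "extinction", "wildlife", "habitat"])
background_lm = relative_frequencies(["the", "of", "wildlife", "the"])

# One mixed model per lambda in 0.0, 0.1, ..., 1.0.
models = {
    step / 10: mix_language_models(topic_lm, background_lm, step / 10)
    for step in range(11)
}

if __name__ == "__main__":
    for lam in (0.0, 0.5, 1.0):
        top = sorted(models[lam].items(), key=lambda kv: kv[1], reverse=True)[:3]
        print(lam, top)
```

Under this assumed form, lambda = 0.0 yields the pure topic model and lambda = 1.0 the pure background model, so larger values make sampled query terms less topic-specific.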

Experimental evaluation and results

To reproduce the experiments of the study, the scripts have to be executed in the order shown in the following table.

Script	Description	Output files	Reproduction of ...
`eval/arp.py`, `eval/arp_first.py`, `eval/arp_max.py`	Retrieval performance: Evaluate the Average Retrieval Performance (ARP).	`data/experimental_results/arp.csv`, `data/experimental_results/arp_first.csv`, `data/experimental_results/arp_max.csv`	`Tab. A.1`
`eval/rmse_s12345.py`, `eval/rmse_s678.py`	Retrieval performance: Evaluate the Root-Mean-Square-Error (RMSE).	`data/experimental_results/rmse_map.csv`, `data/experimental_results/rmse_ndcg.csv`, `data/experimental_results/rmse_p1000.csv`, `data/experimental_results/rmse_uqv_vs_s12345_kis_ndcg.csv`, `data/experimental_results/rmse_uqv_vs_s12345_tts_ndcg.csv`, `data/figures/rmse_map.pdf`, `data/figures/rmse_ndcg.pdf`, `data/figures/rmse_p1000.pdf`, `data/figures/rmse_uqv_vs_s12345_kis_ndcg.pdf`, `data/figures/rmse_uqv_vs_s12345_tts_ndcg.pdf`	`Fig. A.1`, `Fig. 1`
`eval/t-test.py`	Retrieval performance: Evaluate the p-values of paired t-tests.	`data/experimental_results/ttest.csv`, `data/figures/ttest.pdf`	`Fig. A.2`
`eval/system_orderings.py`	Shared task utility: Evaluate Kendall's tau between relative system orderings.	`data/experimental_results/system_orderings.csv`, `data/figures/system_orderings.pdf`	`Fig. 2 (left)`
`eval/sdcg.py`	Effort and effect: Evaluate the Session Discounted Cumulative Gain (sDCG).	`data/experimental_results/sdcg_3queries.csv`, `data/experimental_results/sdcg_5queries.csv`, `data/experimental_results/sdcg_10queries.csv`, `data/figures/sdcg_3queries.pdf`, `data/figures/sdcg_5queries.pdf`, `data/figures/sdcg_10queries.pdf`	`Fig. 3 (top)`
`eval/economic.py`	Effort and effect: Evaluate tradeoffs between number of queries and browsing depth by isoquants.	`data/experimental_results/economic0.3.csv`, `data/experimental_results/economic0.4.csv`, `data/experimental_results/economic0.5.csv`, `data/figures/economic0.3.pdf`, `data/figures/economic0.4.pdf`, `data/figures/economic0.5.pdf`	`Fig. 3 (bottom)`
`eval/jaccard_similarity.py`	Query term similarity: Evaluate query term similarities.	`data/experimental_results/jacc.csv`, `data/figures/jacc.pdf`	`Fig. 2 (right)`
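Two of the measures named in the table are simple enough to sketch in a few lines. The example below shows the Jaccard similarity between the term sets of two queries and the RMSE between two lists of per-topic scores. It is a minimal illustration under assumptions, not the code of `eval/jaccard_similarity.py` or `eval/rmse_s12345.py`: whitespace tokenization, lowercasing, the function names, and the example values are all hypothetical.

```python
import math


def jaccard_similarity(query_a: str, query_b: str) -> float:
    """Size of the shared term set divided by the size of the combined term set."""
    terms_a = set(query_a.lower().split())
    terms_b = set(query_b.lower().split())
    union = terms_a | terms_b
    if not union:
        return 0.0  # both queries are empty
    return len(terms_a & terms_b) / len(union)


def rmse(scores_a, scores_b) -> float:
    """Root-mean-square error between two equally long lists of topic scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    squared_errors = [(a - b) ** 2 for a, b in zip(scores_a, scores_b)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Made-up queries: 2 shared terms out of 4 distinct terms.
    print(jaccard_similarity("endangered wildlife species",
                             "wildlife species protection"))
    # Made-up per-topic scores of a real and a simulated query set.
    print(rmse([0.5, 0.25], [0.25, 0.5]))
```

A Jaccard similarity of 1.0 means that both queries use exactly the same terms, and an RMSE of 0.0 means that the two score lists agree on every topic.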

Owner

IR Group at Technische Hochschule Köln
