Efficiency and Scalability of Large Search Architectures
The massive size and rapid growth of Internet resources forces search engine providers to continuously study new methods for improving the efficiency and scalability of their services. One line of research is directed at optimizing index-level retrieval, with an abundance of work in such areas as query processing, index compression, and document ranking. Orthogonal to this is work dedicated to parallelization of high-throughput, low-latency search systems, such as Google or Bing. Although some techniques—such as tiering, static document rankings, and load balancing—are known to be extensively used in commercial installations, few details are disclosed, and there is a need for filling the gaps in the somewhat incomplete academic literature. This work examines several problems related to the efficiency and scalability of search engines, with emphasis on large, distributed architectures.
Fast Disjunctive Candidate Generation Using Live Block Filtering
Antonio Mallia, Michal Siedlaczek, Torsten Suel. Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 2021.
A lot of research has focused on the efficiency of search engine query processing, and in particular on disjunctive top-k queries that return the highest scoring k results that contain at least one of the query terms. Disjunctive top-k queries over simple ranking functions are commonly used to retrieve an initial set of candidate results that are then reranked by more complex, often machine-learned rankers. Many optimized top-k algorithms have been proposed, including MaxScore, WAND, BMW, and JASS. While the fastest methods achieve impressive results on top-10 and top-100 queries, they tend to become much slower for the larger k commonly used for candidate generation.
In this paper, we focus on disjunctive top-k queries for larger k. We propose new algorithms that achieve much faster query processing for values of k up to thousands or tens of thousands. Our algorithms build on top of the live-block filtering approach of Dimopoulos et al., and exploit the SIMD capabilities of modern CPUs. We also perform a detailed experimental comparison of our methods with the fastest known approaches, and release a full model implementation of our methods and of the underlying live-block mechanism, which will allows others to design and experiment with additional methods under the live-block approach.
Comparing Top-k Threshold Estimation Techniques for Disjunctive Query Processing
Antonio Mallia, Michal Siedlaczek, Torsten Suel. Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2020.
In the top-k threshold estimation problem, given a query q, the goal is to estimate the score of the result at rank k. A good estimate of this score can result in significant performance improvements for several query processing scenarios, including selective search, index tiering, and widely used disjunctive query processing algorithms such as MaxScore, WAND, and BMW.
Several approaches have been proposed, including parametric approaches, methods using random sampling, and a recent approach based on machine learning. However, previous work fails to perform any experimental comparison between these approaches. In this paper, we address this issue by reimplementing four major approaches and comparing them in terms of estimation error, running time, likelihood of an overestimate, and end-to-end performance when applied to common classes of disjunctive top-k query processing algorithms.
More | Read
Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format
Jimmy Lin, Joel Mackenzie, Chris Kamphuis, Craig Macdonald, Antonio Mallia, Michał Siedlaczek, Andrew Trotman, Arjen de Vries. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020.
There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures and building wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASSv2, OldDog, PISA, and Terrier. Together, these systems provide a wide range of implementations and features, with different research goals. Overall, we recommend CIFF as a low-effort approach to support independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.
More | Read
Forward Index Compression for Instance Retrieval in an Augmented Reality Application
Qi Wang, Michal Siedlaczek, Yen-Yu Chen, Michael Gormish, Torsten Suel. In Proceedings of 2019 IEEE Big Data Conference. 2019.
Instance retrieval systems are widely used in applications such as robot navigation, medical diagnosis, and augmented reality. Blippar is a company that creates compelling augmented reality experiences or provides you with the tools to build your own. In this paper we focus on one of the company's augmented-reality applications, with which users are able to point their phone cameras at different objects in order to receive information about the objects in real time. In this paper, we provide what we believe to be the first study of forward index compression techniques for such instance retrieval systems. First, we perform an analysis of real-world data from a large-scale commercial instance retrieval system, run by Blippar focusing on augmented reality. Then we propose an entropy-based lossless compression strategy. Experiments show that our proposed Huffman-based approach outperforms a variety of other compression techniques, while also increasing overall system efficiency slightly.
More | Read
GPU-Accelerated Decoding of Integer Lists
Antonio Mallia, Michal Siedlaczek, Torsten Suel and Mohamed Zahran. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM). 2019.
An inverted index is the basic data structure used in most current large-scale information retrieval systems. It can be modeled as a collection of sorted sequences of integers. Many compression techniques for inverted indexes have been studied in the past, with some of them reaching tremendous decompression speeds through the use of SIMD instructions available on modern CPUs. While there has been some work on query processing algorithms for Graphics Processing Units (GPUs), little of it has focused on how to efficiently access compressed index structures, and we see some potential for significant improvements in decompression speed. In this paper, we describe and implement two encoding schemes for index decompression on GPU architectures. Their format and decoding algorithm is adapted from existing CPU-based compression methods to exploit the execution model and memory hierarchy offered by GPUs. We show that our solutions, GPU-BP and GPU-VByte, achieve significant speedups over their already carefully optimized CPU counterparts.
More | Read
PISA: Performant Indexes and Search for Academia
Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, Torsten Suel. In Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019). 2019.
Performant Indexes and Search for Academia (PISA) is an experimental search engine that focuses on efficient implementations of state-of-the-art representations and algorithms for text retrieval. In this work, we outline our effort in creating a replicable search run from PISA for the Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure. We also discuss the origins, current functionality, and future direction and challenges for PISA system.
More | Read
Exploiting Global Impact Ordering for Higher Throughput in Selective Search
Michał Siedlaczek, Juan Rodriguez, Torsten Suel. In Proceedings of 41st European Conference on Information Retrieval. 2019.
We investigate potential benefits of exploiting a global impact ordering in a selective search architecture. We propose a generalized, ordering-aware version of the learning-to-rank-resources framework along with a modified selection strategy. By allowing partial shard processing we are able to achieve a better initial trade-off between query cost and precision than the current state of the art. Thus, our solution is suitable for increasing query throughput during periods of peak load or in low-resource systems.
More | Read | Poster
An Experimental Study of Index Compression and DAAT Query Processing Methods
Antonio Mallia, Michał Siedlaczek, Torsten Suel. In Proceedings of 41st European Conference on Information Retrieval. 2019.
In the last two decades, the IR community has seen numerous advances in top-k query processing and inverted index compression techniques. While newly proposed techniques are typically proposed against a few baselines, these evaluations are often very limited, and we feel that there is no clear overall picture on the best choices of algorithms and compression methods.
In this paper, we attempt to address this issue by evaluating a number of state-of-the-art index compression methods and safe disjunctive DAAT query processing algorithms. Our goal is to understand how much index compression performance impacts overall query processing speeds, how the choice of query processing algorithm depends on the compression method used, and how performance is impacted by document reordering techniques and the number of results returned, keeping in mind that current search engines typically use sets of hundreds or thousands of candidates for further reranking.
More | Read
Fast bag-of-words candidate selection in content-based instance retrieval systems
Michał Siedlaczek, Qi Wang, Yen-Yu Chen, and Torsten Suel. In Proceedings of 2018 IEEE Big Data Conference. 2018.
Many content-based image search and instance retrieval systems implement bag-of-visual-words strategies for candidate selection. Visual processing of an image results in hundreds of visual words that make up a document, and these words are used to build an inverted index. Query processing then consists of an initial candidate selection phase that queries the inverted index, followed by more complex reranking of the candidates using various image features. The initial phase typically uses disjunctive top-k query processing algorithms originally proposed for searching text collections.
Our objective in this paper is to optimize the performance of disjunctive top-k computation for candidate selection in content-based instance retrieval systems. While there has been extensive previous work on optimizing this phase for textual search engines, we are unaware of any published work that studies this problem for instance retrieval, where both index and query data are quite different from the distributions commonly found and exploited in the textual case. Using data from a commercial large-scale instance retrieval system, we address this challenge in three steps. First, we analyze the quantitative properties of index structures and queries in the system, and discuss how they differ from the case of text retrieval. Second, we describe an optimized term-at-a-time retrieval strategy that significantly outperforms baseline term-at-a-time and document-at-a-time strategies, achieving up to 66% speed-up over the most efficient baseline. Finally, we show that due to the different properties of the data, several common safe and unsafe early termination techniques from the literature fail to provide any significant performance benefits.
More | Read | Slides