What is it about?

Data analytics over unstructured data (videos, images, text, audio) increasingly relies on machine learning (ML). Unfortunately, deploying ML is expensive. To reduce the cost of such queries, many recent systems (e.g., BlazeIt, NoScope, Tahoma, and SUPG) train a proxy model for each query to approximate an expensive target labeler, such as a costly ML model or a human labeling service. In this work, we present TASTI, a trainable semantic index that removes the need to train a query-specific proxy model. Once the index is constructed, TASTI can generate high-quality proxy models that are used downstream to accelerate queries such as aggregation and selection over large datasets. TASTI's design is motivated by the observation that many queries are highly correlated and share underlying semantic information. For instance, answering a query that counts the number of cars should help answer a different query that finds red cars. Prior work does not leverage this property, because it trains a new proxy model from scratch for each query.
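
To make the idea of reusing one index across correlated queries concrete, here is a minimal toy sketch. It is not TASTI's actual implementation: the 2-D embeddings, the hand-picked representatives, the nearest-representative lookup, and every name in the code (RECORDS, expensive_labeler, build_index, proxy_score) are assumptions made purely for illustration. The expensive labeler runs once on a few representative records, and two different queries then derive their proxy scores from those same cached annotations.

```python
import math

# Hypothetical toy data: each record has a 2-D embedding (a stand-in for a
# learned semantic embedding) and a ground-truth annotation that only the
# expensive target labeler can reveal.
RECORDS = [
    {"emb": (0.0, 0.0), "truth": {"cars": 0, "red_cars": 0}},
    {"emb": (0.1, 0.0), "truth": {"cars": 0, "red_cars": 0}},
    {"emb": (1.0, 1.0), "truth": {"cars": 2, "red_cars": 1}},
    {"emb": (1.1, 0.9), "truth": {"cars": 2, "red_cars": 1}},
    {"emb": (2.0, 2.0), "truth": {"cars": 4, "red_cars": 0}},
    {"emb": (2.1, 2.0), "truth": {"cars": 4, "red_cars": 0}},
]


def expensive_labeler(record):
    """Stand-in for the costly target labeler (e.g. a large ML model)."""
    return record["truth"]


def build_index(records, representative_ids):
    """Run the expensive labeler once, on the representatives only."""
    return {i: expensive_labeler(records[i]) for i in representative_ids}


def proxy_score(records, index, record_id, score_fn):
    """Score a record from the annotation of its nearest labeled representative."""
    emb = records[record_id]["emb"]
    nearest = min(index, key=lambda i: math.dist(emb, records[i]["emb"]))
    return score_fn(index[nearest])


# Assumption: representatives are chosen by hand here.
index = build_index(RECORDS, representative_ids=[0, 2, 4])

# Two different queries reuse the same index; no per-query training step.
count_cars = [
    proxy_score(RECORDS, index, i, lambda ann: ann["cars"])
    for i in range(len(RECORDS))
]
has_red_car = [
    proxy_score(RECORDS, index, i, lambda ann: ann["red_cars"] > 0)
    for i in range(len(RECORDS))
]
```

In this sketch the labeler is called three times in total, and both the counting query and the red-car query are served from the same three annotations, which is the kind of sharing the paragraph above describes.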

Read the Original

This page is a summary of: TASTI: Semantic Indexes for Machine Learning-based Queries over Unstructured Data, June 2022, ACM (Association for Computing Machinery), DOI: 10.1145/3514221.3517897.
