Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality

Christos Anagnostopoulos; Peter Triantafillou

doi:10.1145/3059177

What is it about?

Many predictive analytics tasks depend on predicting the number of data items that an analytics query will fetch. This matters to data analysts engaged in, e.g., interactive data exploration and data visualization, and to query processing optimization. However, in many modern data systems such prediction is too costly in monetary terms, unreliable (e.g., accurate statistics are difficult to obtain and maintain in modern Big Data query engines), or infeasible (e.g., because of privacy issues). We contribute a novel, query-driven ML methodology that provides function estimation models for analysts over dynamically changing workloads. The methodology is highly accurate in prediction and supports the well-known multi-dimensional distance nearest neighbors (radius) queries. It associates queries with analyst-defined data and adapts optimally to changes in the query workload, based on the principles of optimal stopping theory.
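To make the query-driven idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the model from the paper: a radius query is encoded as its center plus its radius, and the count for a new query is estimated by inverse-distance-weighted averaging over the k most similar past queries. The names QueryDrivenEstimator, observe, predict and true_count are hypothetical. The point it shows is that the estimator learns only from previously issued queries and their results, and never touches the underlying data.

```python
import math
import random


def true_count(data, center, radius):
    """Exact answer to a radius query: points within `radius` of `center`."""
    return sum(1 for point in data if math.dist(point, center) <= radius)


class QueryDrivenEstimator:
    """Hypothetical estimator: learns from (query, count) pairs, not from data."""

    def __init__(self, k=5):
        self.k = k
        self.history = []  # list of ((center..., radius), observed_count)

    def observe(self, center, radius, count):
        """Record one executed query together with its observed result size."""
        self.history.append(((*center, radius), count))

    def predict(self, center, radius):
        """Estimate the count as a weighted average over the k nearest past queries."""
        if not self.history:
            raise ValueError("no past queries observed yet")
        q = (*center, radius)
        scored = sorted((math.dist(vec, q), count) for vec, count in self.history)
        nearest = scored[: self.k]
        # Assumption: inverse-distance weights; the small constant avoids 1/0.
        weights = [1.0 / (d + 1e-9) for d, _ in nearest]
        total = sum(w * c for w, (_, c) in zip(weights, nearest))
        return total / sum(weights)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data, used here only to produce answers for the training queries.
    data = [(rng.random(), rng.random()) for _ in range(2000)]
    est = QueryDrivenEstimator(k=5)
    for _ in range(500):
        c = (rng.random(), rng.random())
        r = rng.uniform(0.05, 0.2)
        est.observe(c, r, true_count(data, c, r))
    c, r = (0.5, 0.5), 0.1
    print("predicted:", round(est.predict(c, r), 1))
    print("actual:   ", true_count(data, c, r))
```

In a real deployment the observed counts would come from the results of queries that analysts actually ran, so the estimator would need no access to the data store or its statistics.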


Why is it important?

The proposed methodology is decentralized, which facilitates scaling out predictive analytics tasks. The research significance of our idea is that (i) it is an attractive solution when data-driven exploration and statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different types of exploration queries, and (iv) its performance is superior to that of data-driven approaches. More interestingly, we treat the detection of query workload changes, which reflect how analysts shift their interests when exploring and analyzing data, as an optimal stopping time problem. Through this time-optimized stochastic framework, we can confidently decide when a query workload is novel, i.e., reflects a new analyst interest, with the ultimate purpose of minimizing the risk of low prediction accuracy.
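The flavor of deciding when to declare a workload change can be sketched with a classical sequential detector. The code below uses a one-sided CUSUM rule over a stream of prediction errors. This is a swapped-in, well-known technique chosen for illustration and is not the stopping rule derived in the paper; the function name cusum_stopping_time and the parameters baseline, drift and threshold are assumptions.

```python
def cusum_stopping_time(errors, baseline, drift=0.5, threshold=5.0):
    """Return the first index at which a one-sided CUSUM statistic exceeds
    `threshold`, or None if it never does.

    errors    -- prediction errors of the model, one per incoming query
    baseline  -- error level expected under the current (known) workload
    drift     -- slack subtracted at each step so small fluctuations are ignored
    threshold -- evidence required before declaring the workload novel
    """
    evidence = 0.0
    for t, err in enumerate(errors):
        # Accumulate only error in excess of baseline + drift; never go below 0.
        evidence = max(0.0, evidence + (err - baseline - drift))
        if evidence > threshold:
            return t  # stop here: treat the workload as changed and retrain
    return None


if __name__ == "__main__":
    # Errors stay at the baseline for 20 queries, then jump when interest shifts.
    stream = [1.0] * 20 + [3.0] * 20
    print("change declared at query index:", cusum_stopping_time(stream, baseline=1.0))
```

The trade-off that the threshold controls is the one described above: stopping too early triggers needless retraining, while stopping too late leaves analysts with predictions of low accuracy.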

Perspectives

Predictive analytics based only on knowledge from previously issued queries and their results is only now emerging, building on our idea of query-driven ML for query prediction. Our methodology is fundamentally decentralized and applies to modern scale-out data systems over distributed data stores. It provides elasticity (adding or removing data nodes) in distributed environments, since the model training phase is purely decentralized (local) and thus independent across data nodes.
CHRISTOS ANAGNOSTOPOULOS
University of Glasgow

This page is a summary of: Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality, ACM Transactions on Knowledge Discovery from Data, August 2017, ACM (Association for Computing Machinery),
DOI: 10.1145/3059177.

Resources

ML Databases (MLR)

UCI KDD Database

Learning set cardinality in distance nearest neighbours

Learning to accurately COUNT with query-driven predictive analytics

Aggregate Query Prediction under Dynamic Workloads

Large-scale predictive modeling and analytics through regression queries in data management systems
