What is it about?

Fundamental to many predictive analytics tasks is the ability to predict the number of data items that an analytics query will fetch. This is crucial for data analysts engaged in, e.g., interactive data exploration and data visualization, and for query processing optimization. However, in many modern data systems, such predictive analytics are too expensive, unreliable (e.g., in modern Big Data query engines accurate statistics are difficult to obtain and maintain), and/or infeasible (e.g., for privacy reasons). We contribute a novel, query-driven ML methodology that provides function estimation models for analysts over dynamically changing workloads. Our methodology is highly accurate in its predictions and accommodates the well-known multi-dimensional distance-nearest neighbors (radius) queries. It associates queries with analyst-defined data and adapts optimally to changes in the query workload, based on the principles of optimal stopping theory.
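To make the query-driven idea concrete, here is a minimal Python sketch, not the model from the paper. It assumes a radius query can be represented as a vector (center coordinates plus radius) and swaps in plain k-nearest-neighbor regression over past query-answer pairs as the estimator. All names (`query_vector`, `predict_cardinality`, the synthetic data) are hypothetical.

```python
import math
import random

def true_cardinality(points, center, radius):
    """Ground truth: number of points within `radius` of `center`."""
    return sum(1 for p in points if math.dist(p, center) <= radius)

def query_vector(center, radius):
    """Represent a radius query as one vector: center coordinates, then radius."""
    return (*center, radius)

def predict_cardinality(history, center, radius, k=5):
    """Average the answers of the k most similar past queries (k-NN regression)."""
    q = query_vector(center, radius)
    nearest = sorted(history, key=lambda item: math.dist(item[0], q))[:k]
    return sum(answer for _, answer in nearest) / len(nearest)

random.seed(0)
# Synthetic 2-D data set; in practice the predictor never touches this directly.
data = [(random.random(), random.random()) for _ in range(2000)]

# Past workload: the predictor only ever sees (query, answer) pairs.
history = []
for _ in range(500):
    c = (random.random(), random.random())
    r = random.uniform(0.05, 0.2)
    history.append((query_vector(c, r), true_cardinality(data, c, r)))

# Estimate the answer to an unseen query without accessing the data.
estimate = predict_cardinality(history, (0.5, 0.5), 0.1)
actual = true_cardinality(data, (0.5, 0.5), 0.1)
```

The point of the sketch is that `predict_cardinality` reads only `history`, never `data`, which is what makes the approach usable when data access or statistics are costly or unavailable.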

Why is it important?

The proposed methodology is decentralized, which facilitates the scaling-out of predictive analytics tasks. The research significance of our idea lies in that (i) it is an attractive solution when data-driven exploration and statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different types of exploration queries, and (iv) it offers performance superior to that of data-driven approaches. More interestingly, we treat query workload change detection, which reflects the way analysts change their interests when exploring and analyzing data, as an optimal stopping time problem. Through this time-optimized stochastic framework, we are able to reliably decide when a query workload is novel, reflecting a change in the analysts’ interests, with the ultimate purpose of minimizing the risk of low prediction accuracy.
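The workload change detection described above can be illustrated with a simple stopping rule. The sketch below is not the paper's formulation: it swaps in a one-sided CUSUM test, a classical sequential change-detection rule, and assumes each incoming query has a hypothetical novelty score (here, its distance to the closest known query pattern). The parameters `baseline`, `slack`, and `threshold` are illustrative assumptions.

```python
def cusum_stopping_time(distances, baseline, slack, threshold):
    """Return the index of the first query at which a change is declared, or None.

    distances: per-query novelty scores (assumed: distance to closest known pattern)
    baseline:  typical score under the current workload
    slack:     drift tolerated before evidence starts to accumulate
    threshold: evidence level at which we stop and declare the workload novel
    """
    evidence = 0.0
    for t, d in enumerate(distances):
        # Accumulate deviations above baseline + slack; never drop below zero.
        evidence = max(0.0, evidence + (d - baseline - slack))
        if evidence > threshold:
            return t
    return None

familiar = [0.10, 0.12, 0.09, 0.11, 0.10]  # queries close to known patterns
shifted = [0.40, 0.45, 0.38, 0.42]         # analyst interest has moved elsewhere
stop = cusum_stopping_time(familiar + shifted, baseline=0.10, slack=0.05, threshold=0.5)
```

With these numbers the rule stops at index 6, the second shifted query. Stopping too early would trigger needless retraining, while stopping too late would prolong a period of low prediction accuracy; the threshold sets that trade-off.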


The idea of predictive analytics based only on knowledge from previously issued queries and their results is only now emerging, building on our idea of query-driven ML for query prediction. Our methodology is fundamentally decentralized and applies to modern scale-out data systems over distributed data stores. This provides elasticity (adding/removing data nodes) in distributed environments, since the model training phase is purely decentralized (local) and thus independent across data nodes.
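The following Python sketch illustrates why local training makes adding and removing nodes cheap. It is an assumption-laden illustration, not the paper's system: it assumes the data is split into disjoint partitions, so the total cardinality is the sum of per-node counts, and it uses a hypothetical `LocalModel` that simply returns the answer of the closest past query.

```python
import math

class LocalModel:
    """Per-node model trained only on that node's own (query, local count) pairs."""

    def __init__(self):
        self.history = []  # list of ((cx, cy, radius), local_count)

    def observe(self, query, local_count):
        self.history.append((query, local_count))

    def predict(self, query):
        # 1-nearest-neighbor over past queries; 0.0 if nothing has been seen yet.
        if not self.history:
            return 0.0
        _, count = min(self.history, key=lambda item: math.dist(item[0], query))
        return float(count)

def predict_total(nodes, query):
    """Sum per-node estimates (valid under the disjoint-partition assumption)."""
    return sum(model.predict(query) for model in nodes.values())

query = (0.5, 0.5, 0.1)
nodes = {"node-a": LocalModel(), "node-b": LocalModel()}
nodes["node-a"].observe(query, 40)
nodes["node-b"].observe(query, 25)
total_before = predict_total(nodes, query)

# Elasticity: a new node joins with its own empty model; no other model is retrained.
nodes["node-c"] = LocalModel()
total_with_new_node = predict_total(nodes, query)

# A node leaves: its model is dropped and the remaining models are untouched.
del nodes["node-b"]
total_after_removal = predict_total(nodes, query)
```

Because no model depends on another node's training state, membership changes only affect which local estimates are summed.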

University of Glasgow

Read the Original

This page is a summary of: Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality, ACM Transactions on Knowledge Discovery from Data, August 2017, ACM (Association for Computing Machinery), DOI: 10.1145/3059177.