What is it about?
Fundamental to many predictive analytics tasks is the ability to predict the number of data items fetched by analytics queries. This is crucial for data analysts working on, e.g., interactive data exploration, data visualization, and query processing optimization. However, in many modern data systems, predictive analytics are too costly in monetary terms, unreliable (e.g., in modern Big Data query engines, accurate statistics are difficult to obtain and maintain), and/or infeasible (e.g., because of privacy issues). We contribute a novel, query-driven ML methodology that provides function estimation models for analysts over dynamically changing workloads. Our methodology is highly accurate in terms of prediction and accommodates the well-known multi-dimensional distance-nearest-neighbors (radius) queries. It associates queries with analyst-defined data and adapts optimally to changes in the query workload, based on the principles of optimal stopping theory.
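To make the query-driven idea concrete, here is a minimal Python sketch, not the model from the paper. It assumes a query is a center point plus a radius, and it swaps in a simple distance-weighted nearest-neighbor average over past queries as the predictor. The names `exact_count` and `QueryDrivenEstimator`, the parameter `k`, and the synthetic uniform data are all hypothetical, chosen only for illustration. The point it shows is that prediction uses the log of past (query, answer) pairs and never touches the underlying data.

```python
import math
import random


def exact_count(data, center, radius):
    """Ground truth: scan the data and count points within `radius` of `center`."""
    return sum(1 for p in data if math.dist(p, center) <= radius)


class QueryDrivenEstimator:
    """Predicts a radius query's cardinality from logged (query, answer) pairs only.

    Assumption: a distance-weighted k-nearest-neighbor average in query space,
    used here as a stand-in for the paper's learned function estimation models.
    """

    def __init__(self, k=5):
        self.k = k
        self.log = []  # entries: ((c_1, ..., c_d, radius), cardinality)

    def observe(self, center, radius, cardinality):
        # Record an executed query together with the answer the system returned.
        self.log.append(((*center, radius), cardinality))

    def predict(self, center, radius):
        if not self.log:
            raise ValueError("no past queries to learn from")
        q = (*center, radius)
        # Find the k logged queries closest to q in (center, radius) space.
        nearest = sorted(self.log, key=lambda e: math.dist(e[0], q))[: self.k]
        # Closer past queries get larger weights; 1e-9 avoids division by zero.
        weights = [1.0 / (math.dist(e[0], q) + 1e-9) for e in nearest]
        return sum(w * e[1] for w, e in zip(weights, nearest)) / sum(weights)


# Synthetic 2-D data and a training workload of 500 executed radius queries.
rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(2000)]

est = QueryDrivenEstimator(k=5)
for _ in range(500):
    c, r = (rng.random(), rng.random()), rng.uniform(0.05, 0.3)
    est.observe(c, r, exact_count(data, c, r))

# Compare the data-free prediction with the exact (data-scanning) answer.
center, radius = (0.5, 0.5), 0.2
print(est.predict(center, radius), exact_count(data, center, radius))
```

Only the training loop calls `exact_count`, mimicking answers the system has already produced. Once the log is populated, `predict` works without access to `data`, which is what makes this style of estimation attractive when data access is costly or restricted.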
Why is it important?
The proposed methodology is decentralized, which facilitates the scaling-out of predictive analytics tasks. The research significance of our idea is that (i) it is an attractive solution when data-driven exploration and statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different types of exploration queries, and (iv) it outperforms data-driven approaches. More interestingly, our novel query workload change detection, which reflects the way analysts change their interests when exploring and analyzing data, is treated as an optimal stopping time problem. Through this time-optimized stochastic framework, we are able to safely decide when a query workload is novel, reflecting a change in the analysts’ interests, with the ultimate purpose of minimizing the risk of low prediction accuracy.
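The workload change detection described above can be illustrated with a short Python sketch. It does not reproduce the paper's stopping rule. Instead it uses a one-sided CUSUM test, a classic sequential change-detection procedure, applied to a stream of prediction errors. The class `CusumDetector`, the helper `first_alarm`, and the `drift` and `threshold` values are hypothetical names and settings chosen only for this example.

```python
class CusumDetector:
    """One-sided CUSUM on prediction errors (a stand-in for the paper's rule).

    `drift` is the error level tolerated under the current workload;
    `threshold` controls how much accumulated excess error triggers an alarm.
    """

    def __init__(self, drift=0.5, threshold=5.0):
        self.drift = drift
        self.threshold = threshold
        self.stat = 0.0

    def update(self, error):
        """Feed one absolute prediction error; return True if a change is declared."""
        # Accumulate error in excess of the tolerated drift, never going below zero.
        self.stat = max(0.0, self.stat + error - self.drift)
        if self.stat > self.threshold:
            self.stat = 0.0  # reset so detection can restart on the new workload
            return True
        return False


def first_alarm(errors, detector):
    """Return the index of the first error at which the detector stops, or None."""
    for i, e in enumerate(errors):
        if detector.update(e):
            return i
    return None


# Small errors while the workload is familiar, then larger errors after a shift.
errors = [0.1] * 50 + [2.0] * 10
print(first_alarm(errors, CusumDetector(drift=0.5, threshold=5.0)))
```

The trade-off mirrors the one in the text: stopping too early retrains on noise, while stopping too late leaves analysts with inaccurate predictions. A larger `threshold` delays the alarm but makes false alarms less likely.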
Read the Original
This page is a summary of: Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality, ACM Transactions on Knowledge Discovery from Data, August 2017, ACM (Association for Computing Machinery), DOI: 10.1145/3059177.
ML Databases (MLR)
The dataset from the UCI Machine Learning Repository (MLR) contains multivariate data and is used for the performance analysis of our method and for comparison against the sampling method.
UCI KDD Database
The real dataset from the UCI MLR is used for comparison with the GenHist model.
Learning set cardinality in distance nearest neighbours
Christos Anagnostopoulos and Peter Triantafillou. 2015a. Learning set cardinality in distance nearest neighbours. In Proceedings of the IEEE International Conference on Data Mining (ICDM’15). 691–696.
Learning to accurately COUNT with query-driven predictive analytics
Christos Anagnostopoulos and Peter Triantafillou. 2015b. Learning to accurately COUNT with query-driven predictive analytics. In Proceedings of the IEEE International Conference on Big Data (Big Data’15). 14–23.
Aggregate Query Prediction under Dynamic Workloads
Savva, F., Anagnostopoulos, C. and Triantafillou, P. (2020) Aggregate Query Prediction under Dynamic Workloads. In: 2019 IEEE International Conference on Big Data (IEEE BigData 2019), Los Angeles, CA, USA, 09-12 Dec 2019, pp. 671-676. ISBN 9781728108582 (doi:10.1109/BigData47090.2019.9006267)
Large-scale predictive modeling and analytics through regression queries in data management systems
Anagnostopoulos, C. and Triantafillou, P. (2020) Large-scale predictive modeling and analytics through regression queries in data management systems. International Journal of Data Science and Analytics, 9(1), pp. 17-55.