What is it about?
Spark is currently the framework of choice for running data analytics workloads. However, its default configuration has been shown to be significantly sub-optimal, and with so many tunable parameters, finding the optimal configuration for a workload is non-trivial. To overcome this challenge, we employ a Random Forests model to identify the parameters that matter most for a given application and then tune those parameters using Bayesian Optimization. We also seed the search with well-performing configurations from previous tuning sessions to speed it up. Our evaluation with an extensive set of analytics workloads demonstrates that ROBOTune finds configurations that perform better on average while significantly reducing search cost and search time.
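As a rough illustration (not the authors' implementation), the Python sketch below shows the two-stage idea using scikit-learn: a Random Forest ranks parameters by importance, and a Gaussian-process Bayesian Optimization loop tunes only the top-ranked ones, reusing earlier measurements to warm-start the search. The parameter names and the run_workload() benchmark hook are hypothetical stand-ins, not part of ROBOTune.

```python
# Minimal sketch of RF-based parameter selection + Bayesian Optimization.
# Assumptions: parameters are normalized to [0, 1]; run_workload() is a
# hypothetical hook that runs the Spark job and returns runtime in seconds.
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Hypothetical Spark parameters (illustrative subset).
PARAMS = ["spark.executor.memory", "spark.executor.cores",
          "spark.sql.shuffle.partitions", "spark.memory.fraction",
          "spark.default.parallelism", "spark.shuffle.compress"]

def run_workload(config: np.ndarray) -> float:
    """Stand-in for executing the workload and measuring its runtime."""
    # Toy surrogate: only the first three parameters really matter here.
    return float(100 + 50 * (config[0] - 0.6) ** 2
                 + 30 * (config[1] - 0.4) ** 2
                 + 20 * np.sin(3 * config[2]) + rng.normal(0, 0.5))

# --- Stage 1: rank parameters with a Random Forest ------------------------
X = rng.uniform(size=(40, len(PARAMS)))        # randomly sampled configs
y = np.array([run_workload(x) for x in X])     # measured runtimes
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
top_k = np.argsort(rf.feature_importances_)[::-1][:3]   # keep top-3 params
print("Tuning only:", [PARAMS[i] for i in top_k])

# --- Stage 2: Bayesian Optimization over the reduced space ----------------
def expected_improvement(gp, X_cand, y_best):
    """Expected Improvement acquisition for runtime minimization."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

X_obs = X[:, top_k].copy()          # reuse earlier samples as a warm start
y_obs = y.copy()
base = np.median(X, axis=0)         # fix unimportant params to a default

for _ in range(15):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  normalize_y=True).fit(X_obs, y_obs)
    cand = rng.uniform(size=(500, len(top_k)))
    best = cand[np.argmax(expected_improvement(gp, cand, y_obs.min()))]
    full = base.copy()
    full[top_k] = best              # embed tuned params into full config
    X_obs = np.vstack([X_obs, best])
    y_obs = np.append(y_obs, run_workload(full))

print("Best runtime found:", y_obs.min())
```

In this sketch, restricting the Gaussian process to the few parameters the Random Forest flags as important keeps the search space small, which is what makes the optimization tractable at high dimensionality.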
Why is it important?
Deploying analytics workloads with a near-optimal configuration saves substantial cluster time and resources, so finding well-performing configurations, especially for recurring applications, is important. Our work aims to reduce tuning cost and time while suggesting well-performing configurations, which is critical for the real-world adoption of automated tuning of analytics workloads.
Read the Original
This page is a summary of: ROBOTune: High-Dimensional Configuration Tuning for Cluster-Based Data Analytics, August 2021, ACM (Association for Computing Machinery). DOI: 10.1145/3472456.3472518.