What is it about?

Climate datasets are usually enormous, but researchers are mostly interested in small chunks of a dataset at a time. Currently, researchers rely on hand-written Python or Julia scripts to filter and analyze large datasets. The direct alternative, dedicated spatiotemporal data management systems, is difficult to set up and use. Northlight is an extension for the popular SparkSQL data analysis framework that allows researchers to efficiently access and process large distributed climate datasets using standard SQL queries.

Why is it important?

Northlight includes a novel algorithm that converts a (possibly non-convex) query predicate into a set of convex, non-overlapping regions, which can then be loaded from the dataset. The algorithm does not rely on a pre-built index structure. Performing this decomposition by hand for each query would take considerable effort.
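
To give a feel for what such a decomposition involves, here is a minimal Python sketch. It is not Northlight's algorithm, which is described in the paper. It assumes the predicate has already been rewritten as a union of axis-aligned boxes (for example, latitude and longitude ranges joined by OR), and it uses plain box subtraction to make the pieces non-overlapping. The names `Box`, `subtract`, and `disjoint_cover` are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]   # half-open range [lo, hi) along one dimension
Box = Tuple[Interval, ...]       # one interval per dimension, e.g. (lat, lon)


def intersects(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one point."""
    return all(alo < bhi and blo < ahi for (alo, ahi), (blo, bhi) in zip(a, b))


def subtract(a: Box, b: Box) -> List[Box]:
    """Return disjoint boxes that together cover the part of `a` outside `b`."""
    if not intersects(a, b):
        return [a]
    pieces: List[Box] = []
    rest = list(a)  # shrinks toward the overlap of a and b, one dimension at a time
    for d, ((alo, ahi), (blo, bhi)) in enumerate(zip(a, b)):
        if alo < blo:  # slab of `a` lying below `b` in dimension d
            piece = rest.copy()
            piece[d] = (alo, blo)
            pieces.append(tuple(piece))
        if bhi < ahi:  # slab of `a` lying above `b` in dimension d
            piece = rest.copy()
            piece[d] = (bhi, ahi)
            pieces.append(tuple(piece))
        rest[d] = (max(alo, blo), min(ahi, bhi))
    return pieces


def disjoint_cover(boxes: List[Box]) -> List[Box]:
    """Turn a union of possibly overlapping boxes into non-overlapping boxes."""
    result: List[Box] = []
    for box in boxes:
        fragments = [box]
        for kept in result:
            # Remove everything already covered by an earlier region.
            fragments = [p for f in fragments for p in subtract(f, kept)]
        result.extend(fragments)
    return result


# Hypothetical L-shaped (non-convex) query region:
# (lat in [0, 10) AND lon in [0, 4)) OR (lat in [0, 4) AND lon in [0, 10))
l_shape = [((0, 10), (0, 4)), ((0, 4), (0, 10))]
regions = disjoint_cover(l_shape)
```

Each box in `regions` is convex and overlaps no other, so each one could be read from storage exactly once. A system like the one described would also have to handle predicates that are not already expressed as boxes, which this sketch does not attempt.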

Read the Original

This page is a summary of: Northlight: Declarative and Optimized Analysis of Atmospheric Datasets in SparkSQL, July 2022, ACM (Association for Computing Machinery), DOI: 10.1145/3538712.3538715.