What is it about?

Much research has been carried out based on MapReduce in order to reduce cloud storage costs and improve analytical efficiency. This paper adopts a fundamentally different approach, focusing instead on statistical characteristics of the data, such as the normal and Poisson distributions. In this paper, data reduction is achieved through unsupervised sampling from the original data, which can greatly reduce the data size while minimizing information loss.
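The core idea can be sketched as follows, assuming one-dimensional, approximately normal data. This is an illustrative sketch, not the paper's exact algorithm: fit a normal distribution to the original data and keep only a much smaller sample drawn from that fitted distribution. The function name and parameters are hypothetical.

```python
import numpy as np

def reduce_by_sampling(data, ratio=0.1, seed=None):
    """Illustrative distribution-based reduction: fit N(mu, sigma^2)
    to 1-D data, then draw a much smaller sample from the fit."""
    rng = np.random.default_rng(seed)
    mu, sigma = data.mean(), data.std()
    n = max(1, int(len(data) * ratio))
    # Store a small synthetic sample instead of all original points
    return rng.normal(mu, sigma, size=n)

rng = np.random.default_rng(0)
original = rng.normal(100.0, 15.0, size=100_000)   # hypothetical data set
reduced = reduce_by_sampling(original, ratio=0.05, seed=1)

print(len(reduced))            # 5000 points instead of 100,000
print(reduced.mean(), reduced.std())
```

The reduced sample preserves the mean and spread of the original data at a fraction of the storage cost, which is the trade-off the paper targets.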


Why is it important?

In the digital age, the explosion of data volume demands efficient solutions for reducing data size, particularly in cloud-based systems. This paper proposes a method based on data characteristics instead of a MapReduce solution. The experimental results show that the reduced one-dimensional data sets are more than 95% similar to the original data sets.
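As an illustration of how such a similarity figure might be quantified (the paper's exact metric is not reproduced here), one could compare the empirical distributions of the original and reduced data sets, for example via a two-sample Kolmogorov-Smirnov statistic. All data and numbers below are hypothetical.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of samples a and b."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(0)
original = rng.normal(50.0, 10.0, size=50_000)           # hypothetical 1-D data
sample = rng.choice(original, size=2_500, replace=False)  # 5% random sample

# A crude similarity score: 1 minus the maximum CDF distance
similarity = 1.0 - ks_statistic(original, sample)
print(f"similarity: {similarity:.3f}")
```

With a 5% random sample of well-behaved data, the empirical distributions stay very close, so a score above 0.95 is typical.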

Perspectives

We have developed sampling solutions for multi-dimensional data sets. In the future, more advanced methods will be applied to our models. One-dimensional sampling solves the storage-reduction problem for simple data structures; more complex multi-dimensional data structures can be handled using the covariance matrix, as in the paper we published at WI 2022.
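A minimal sketch of the multi-dimensional case, assuming jointly normal data (the variable names and numbers are hypothetical, not taken from the WI 2022 paper): estimate the mean vector and covariance matrix from the original records, then draw a much smaller synthetic sample that preserves the correlations between dimensions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 3-dimensional medical records (e.g. age, BP, heart rate)
data = rng.multivariate_normal(
    mean=[60.0, 120.0, 75.0],
    cov=[[9.0, 3.0, 1.0],
         [3.0, 16.0, 2.0],
         [1.0, 2.0, 4.0]],
    size=20_000,
)

# Fit the joint distribution, then store a small sample from the fit
mu = data.mean(axis=0)
cov = np.cov(data, rowvar=False)
reduced = rng.multivariate_normal(mu, cov, size=1_000)

print(reduced.shape)   # (1000, 3): 5% of the rows, same dimensions
```

Using the full covariance matrix, rather than sampling each dimension independently, is what keeps the cross-dimensional correlations intact in the reduced set.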

Henry Zane
NIT, Zhejiang University

Read the Original

This page is a summary of: Splitting Large Medical Data Sets Based on Normal Distribution in Cloud Environment, IEEE Transactions on Cloud Computing, April 2020, Institute of Electrical & Electronics Engineers (IEEE), DOI: 10.1109/tcc.2015.2462361.

