What is it about?

We develop a novel covariate ranking and selection algorithm for regularized ordinary logistic regression models fitted to high-dimensional and potentially massive datasets with severe class imbalance. A detailed simulation experiment shows that the algorithm is robust to severe class imbalance even when covariates are highly correlated, and that it consistently achieves stable and accurate variable selection with a very low false discovery rate. We illustrate the methodology with a case study of a severely imbalanced, high-dimensional wildland fire occurrence dataset comprising 13 million instances. Together, the case study and simulation results demonstrate that our framework provides a robust approach to variable selection in severely imbalanced big binary data.


Why is it important?

Fitting regularized versions of the ordinary logistic model to massive datasets is often computationally infeasible because of the limited amount of memory (RAM) available on a computer. The problem is especially acute in commonly used analysis languages such as R. The SVRS algorithm is therefore attractive in two respects: i) it circumvents the computational bottleneck by making it feasible to estimate the model from much smaller subsamples of the original data, and ii) its implementation is highly parallelizable, which yields further gains in computational efficiency. In summary, this study introduces a new variable selection method for logistic regression modeling of extremely rare events. The method combines response-based subsampling with commonly employed regularization methods to perform accurate variable selection in high-dimensional and large datasets. Our methodology is applicable to a wide array of contexts, and its performance is supported by an extensive simulation experiment and by the analysis of big and severely imbalanced real-life datasets.
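To make the idea of combining response-based subsampling with regularization concrete, here is a minimal Python sketch. It is not the SVRS algorithm from the paper; it is an illustration under stated assumptions. Every subsample keeps all rare events and a random draw of non-events, an L1-penalized logistic model is fitted to each subsample with a simple proximal-gradient loop, and covariates are ranked by how often they receive a non-zero coefficient. The function names, the 2:1 non-event ratio, the penalty value, and the simulated data are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter

def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def response_based_subsample(X, y, ratio, rng):
    # Assumption: keep every event (y == 1), draw `ratio` non-events per event.
    events = [i for i, v in enumerate(y) if v == 1]
    non_events = [i for i, v in enumerate(y) if v == 0]
    k = min(len(non_events), ratio * len(events))
    idx = events + rng.sample(non_events, k)
    return [X[i] for i in idx], [y[i] for i in idx]

def lasso_logistic(X, y, lam, step=0.5, iters=100):
    # L1-penalized logistic regression via proximal gradient descent (ISTA).
    # The intercept b0 is left unpenalized.
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        g0, g = 0.0, [0.0] * p
        for row, target in zip(X, y):
            r = sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - target
            g0 += r
            for j in range(p):
                g[j] += r * row[j]
        b0 -= step * g0 / n
        for j in range(p):
            z = beta[j] - step * g[j] / n
            # Soft-thresholding sets small coefficients exactly to zero.
            beta[j] = math.copysign(max(abs(z) - step * lam, 0.0), z)
    return beta

def selection_frequencies(X, y, n_subsamples=10, ratio=2, lam=0.05, seed=0):
    # Fraction of subsamples in which each covariate is selected (non-zero).
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_subsamples):
        Xs, ys = response_based_subsample(X, y, ratio, rng)
        beta = lasso_logistic(Xs, ys, lam)
        counts.update(j for j, b in enumerate(beta) if b != 0.0)
    return [counts[j] / n_subsamples for j in range(len(X[0]))]

def simulate(n=3000, p=5, seed=1):
    # Hypothetical rare-event data: only covariate 0 affects the response.
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(-5.0 + 2.5 * row[0]) else 0 for row in X]
    return X, y

X, y = simulate()
freq = selection_frequencies(X, y)
# Rank covariates from most to least frequently selected.
ranking = sorted(range(len(freq)), key=lambda j: -freq[j])
print("selection frequencies:", freq)
print("ranking:", ranking)
```

Each pass through the loop in `selection_frequencies` touches only a small subsample and does not depend on the other passes, so the fits could be distributed across worker processes. This is the sense in which such a scheme sidesteps the memory bottleneck and parallelizes naturally.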


This project was part of the thesis research of my MSc student and co-author on this study, Mehdi-Abderrahman Jabri. It was a wonderful experience supervising him on this project and working jointly with him to publish the findings.

Khurram Nadeem

Read the Original

This page is a summary of: Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data, PLoS ONE, January 2023, PLOS,
DOI: 10.1371/journal.pone.0280258.

