Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data

Khurram Nadeem; Mehdi-Abderrahman Jabri

doi:10.1371/journal.pone.0280258

What is it about?

We develop a new covariate ranking and selection algorithm for regularized ordinary logistic regression models fitted to high-dimensional, potentially massive datasets with severe class imbalance. A detailed simulation experiment shows that the algorithm is robust to severe class imbalance in the presence of highly correlated covariates, and that it consistently achieves stable and accurate variable selection with a very low false discovery rate. We illustrate the methodology with a case study of a severely imbalanced, high-dimensional wildland fire occurrence dataset comprising 13 million instances. Together, the case study and simulation results indicate that the framework provides a robust approach to variable selection in severely imbalanced big binary data.
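
To make the false discovery rate mentioned above concrete, here is a minimal Python sketch of how it could be scored in a single simulation run, where the truly active covariates are known. The function name and the covariate indices are hypothetical and chosen only for illustration; this is not code from the study. Strictly, the function returns the false discovery proportion of one run, and averaging it over many runs estimates the false discovery rate.

```python
def false_discovery_proportion(selected, truly_active):
    """Share of selected covariates that are not truly active (0.0 if nothing is selected)."""
    selected, truly_active = set(selected), set(truly_active)
    if not selected:
        return 0.0
    # Covariates that were selected although they have no true effect.
    false_selections = selected - truly_active
    return len(false_selections) / len(selected)


# Hypothetical run: covariates 0-3 are truly active; the method picked 0, 1, 2 and 5.
fdp = false_discovery_proportion(selected=[0, 1, 2, 5], truly_active=[0, 1, 2, 3])
print(fdp)
```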


Why is it important?

Fitting regularized versions of the ordinary logistic model to massive datasets is often computationally infeasible because of the limited memory (RAM) available on most computers. The problem is especially acute in commonly used analysis languages such as R. The SVRS (stable variable ranking and selection) algorithm is attractive for two reasons: i) it circumvents this computational bottleneck by making it feasible to estimate the model from much smaller subsamples of the original data, and ii) its implementation is highly parallelizable, which yields further gains in computational efficiency. In summary, this study introduces a new variable selection method for logistic regression modeling of extremely rare events. The method combines response-based subsampling with commonly employed regularization methods to perform accurate variable selection in high-dimensional and large datasets. The methodology applies to a wide range of contexts, and its performance is supported by an extensive simulation experiment and by analysis of big, severely imbalanced real-life datasets.
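
The general idea of response-based subsampling followed by aggregation across subsamples can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' SVRS implementation: all function names are hypothetical, the sampling ratio of one non-event per event is an arbitrary choice, and `toy_select` is a deliberately simple placeholder standing in for a regularized logistic fit (for example, a lasso fit that returns the covariates with nonzero coefficients).

```python
import random
from collections import Counter


def response_based_subsample(labels, neg_per_pos=1, rng=None):
    """Return row indices: every event (label 1) plus a random draw of non-events."""
    if rng is None:
        rng = random.Random()
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_neg = min(len(neg), neg_per_pos * len(pos))
    return pos + rng.sample(neg, n_neg)


def rank_by_selection_frequency(X, y, select, n_subsamples=50, seed=0):
    """Rank covariates by how often `select` keeps them across subsamples."""
    rng = random.Random(seed)
    counts = Counter()
    # Each subsample is processed on its own, so with per-subsample seeds
    # this loop could be distributed across workers.
    for _ in range(n_subsamples):
        idx = response_based_subsample(y, rng=rng)
        counts.update(select([X[i] for i in idx], [y[i] for i in idx]))
    freq = [(j, counts[j] / n_subsamples) for j in range(len(X[0]))]
    return sorted(freq, key=lambda kv: kv[1], reverse=True)


def toy_select(X, y, threshold=0.5):
    """Placeholder selector, NOT a regularized fit: keep covariates whose
    class means differ by more than `threshold`."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    keep = []
    for j in range(len(X[0])):
        mean_pos = sum(row[j] for row, label in zip(X, y) if label == 1) / n_pos
        mean_neg = sum(row[j] for row, label in zip(X, y) if label == 0) / n_neg
        if abs(mean_pos - mean_neg) > threshold:
            keep.append(j)
    return keep


# Toy data: 2000 rows with 1% events; only covariate 0 carries signal.
y = [1 if i % 100 == 0 else 0 for i in range(2000)]
X = [[float(label), 0.5, (i % 7) / 7] for i, label in enumerate(y)]
ranking = rank_by_selection_frequency(X, y, toy_select, n_subsamples=20)
print(ranking)
```

Each subsample holds only the events plus an equal number of non-events, so every fit touches a small fraction of the rows. In practice the placeholder selector would be replaced by a regularized logistic regression, and the sampling ratio and number of subsamples would be tuned to the data.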

Perspectives

This project was part of the thesis research of my MSc student and co-author on this study, Mehdi-Abderrahman Jabri. It was a wonderful experience supervising him on this project and working jointly with him to publish the findings.
Khurram Nadeem

This page is a summary of: Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data, PLoS ONE, January 2023, PLOS, DOI: 10.1371/journal.pone.0280258.

Contributors

The following have contributed to this page

Khurram Nadeem
