Histological classification of non-small cell lung cancer with RNA-seq data using machine learning models

Robert B. Eshun; Md Khurram Monir Rabby; A. K. M. Kamrul Islam; Marwan U. Bikdash

doi:10.1145/3459930.3471168

What is it about?

This study develops an automated model, built on supervised learning, for classifying the histological subtypes of non-small cell lung cancer (NSCLC). The machine learning (ML) approach is applied to gene expression profiles to support the diagnosis of lung cancer, the primary cause of cancer deaths worldwide. The performance of five classical ML estimators and four ensemble ML classifiers is evaluated on an RNA-Seq dataset of 127 NSCLC cases. The Decision Tree (DT) and Bagging models show promising results, with classification accuracy of up to 100% and areas under the curve (AUCs) above 0.97. The implemented ensemble methods collectively perform well, with AUCs ranging from 0.68 to 1.00. The findings are comparable to those of high-precision ML models, and the results offer insight into which supervised models can achieve higher diagnostic accuracy on RNA-Seq-based gene expression profiles of NSCLC subtypes.
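Since the results above are reported as AUCs, a small worked example may help show what that number measures. The sketch below is illustrative only: it computes AUC for a two-class problem as the fraction of (positive, negative) pairs that the classifier ranks correctly. The function name `roc_auc`, the 0/1 label encoding, and the toy scores are assumptions made for this example, not data or code from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative counts as half a correct ranking.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample from each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 and 0 stand for two arbitrary classes.
labels = [1, 1, 1, 0, 0, 0]
perfect = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # every positive outranks every negative
mixed = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]    # one positive falls below one negative

print(roc_auc(labels, perfect))  # 9 of 9 pairs ranked correctly
print(roc_auc(labels, mixed))    # 8 of 9 pairs ranked correctly
```

An AUC of 1.00 therefore means the model's scores separate the two classes completely, while 0.5 corresponds to ranking no better than chance.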

Why is it important?

This study is important because it demonstrates that highly accurate and computationally efficient diagnosis of non-small cell lung cancer (NSCLC) histological subtypes can be achieved using supervised machine learning models applied to RNA-Seq gene expression data. Unlike many recent studies that emphasize deep learning models requiring large datasets and substantial computational resources, this work shows that classical and ensemble machine learning approaches, particularly Decision Tree, Bagging, and AdaBoost, can attain near-perfect classification performance even with a limited number of samples and a drastically reduced feature space. The ability to reach accuracy of up to 100% and AUC of up to 1.00 using only four PCA-derived components is a key and distinctive contribution.

The work is timely because RNA-Seq has become increasingly accessible in clinical and translational research, yet there remains a critical need for reliable, interpretable, and low-overhead analytical frameworks that can be integrated into diagnostic pipelines. By systematically comparing classical and ensemble models with and without dimensionality reduction, this study provides practical guidance on model selection for transcriptomics-based cancer diagnosis. The findings highlight that simpler, interpretable models can rival or outperform more complex approaches, which is particularly relevant for clinical adoption, where transparency, reproducibility, and computational efficiency are essential.

The broader impact of this research lies in its potential to support faster, more accurate, and more cost-effective diagnosis of NSCLC subtypes, which directly influences treatment selection and patient outcomes. By reducing reliance on genome-wide feature sets and demonstrating robust performance on RNA-Seq data, this work contributes to advancing precision oncology and encourages wider use of machine learning-assisted diagnostic tools in real-world clinical settings.
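The comparison described above (tree-based and ensemble classifiers, with and without a reduction to four principal components) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' code: the data are synthetic (127 samples, with an arbitrary 500 stand-in "gene" features), and the scaling step, the 5-fold stratified cross-validation, and the default hyperparameters are all assumptions, since this summary does not state the study's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an expression matrix (assumption, not study data):
# 127 samples, 500 features, two classes.
X, y = make_classification(
    n_samples=127, n_features=500, n_informative=10, n_redundant=0, random_state=0
)

# The three model families named in the text, with default settings (assumption).
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def evaluate(model, n_components=None):
    """Mean cross-validated AUC, optionally with PCA placed before the model."""
    steps = [StandardScaler()]
    if n_components is not None:
        steps.append(PCA(n_components=n_components, random_state=0))
    steps.append(model)
    # Scaling and PCA live inside the pipeline, so they are re-fit on the
    # training part of each fold and never see the held-out samples.
    pipeline = make_pipeline(*steps)
    return cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc").mean()


results = {}
for name, model in models.items():
    results[name] = {
        "all_features": evaluate(model),
        "pca_4": evaluate(model, n_components=4),
    }
    print(name, results[name])
```

The scores printed on synthetic data say nothing about the study's results; the point of the sketch is the structure of the comparison, in which each model is scored once on the full feature set and once on four components under the same cross-validation splits.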

Perspectives

Working on this study was particularly meaningful to me because it sits at the intersection of data science, machine learning, and real-world clinical impact. Lung cancer remains one of the most devastating diseases globally, and throughout this work I was constantly reminded that behind every data point is a patient whose diagnosis and treatment depend on accurate and timely decisions. That awareness strongly shaped how I approached the modeling choices, with a focus not only on performance but also on simplicity, interpretability, and practicality.

One aspect of this work that I personally found rewarding was discovering that relatively simple and well-understood machine learning models could perform exceptionally well on complex RNA-Seq data when paired with thoughtful feature reduction. At a time when the field often gravitates toward increasingly complex deep learning architectures, it was encouraging to see that classical and ensemble methods can still offer powerful, transparent solutions, especially in settings where data and computational resources are limited.

I also hope this article helps bridge the gap between computational research and biomedical application. For researchers entering this space, I want it to demonstrate that meaningful contributions to precision medicine do not always require massive datasets or black-box models. For clinicians and translational scientists, I hope it reinforces confidence in machine learning as a supportive diagnostic tool rather than a replacement for expert judgment.

Ultimately, I see this work as a step toward making RNA-Seq-based decision support more accessible and clinically relevant. If this article encourages further collaboration between data scientists and medical researchers, or inspires others to pursue interpretable, efficient approaches to cancer diagnosis, then it will have achieved something beyond its technical contributions.
Md Khurram Monir Rabby

This page is a summary of: Histological classification of non-small cell lung cancer with RNA-seq data using machine learning models, August 2021, ACM (Association for Computing Machinery),
DOI: 10.1145/3459930.3471168.

Contributors

The following have contributed to this page

Md Khurram Monir Rabby
