A comprehensive simulation study on classification of RNA-Seq data

Gökmen Zararsız; Dincer Goksuluk; Selcuk Korkmaz; Vahap Eldem; Gozde Erturk Zararsiz; Izzet Parug Duru; Ahmet Ozturk

doi:10.1371/journal.pone.0182507

What is it about?

In this research, the authors compare classification algorithms for RNA sequencing (RNA-Seq) data. RNA-Seq uses next-generation sequencing technologies for gene expression profiling. Traditional statistical methods assume data on a continuous scale, so they cannot be applied directly to RNA-Seq data, which consist of discrete counts. Two groups of classifiers are therefore considered: count-based classifiers, such as PLDA with power transformation and NBLDA, and microarray-based classifiers applied after an rlog or vst transformation. The study also examines how several parameters affect model performance, including sample size, overdispersion, the number of genes, and the number of classes. The results indicate that increasing the sample size, decreasing the dispersion parameter, and decreasing the number of groups each increase classification accuracy. The authors conclude that PLDA after a power transformation may be a good choice as a count-based classifier, while NBLDA performance is not satisfactory. RF, SVM, and bagSVM may give accurate results after an rlog or vst transformation. Moreover, the efficiency of bagSVM improves markedly with increasing sample size. An R/Bioconductor package, MLSeq, has been developed for the classification of RNA-Seq data.
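To make the idea of a count-based classifier concrete, here is a minimal Python sketch of a Poisson likelihood classifier. It is not the authors' PLDA and not the MLSeq API: it assumes equal library sizes, and it omits the size factors, shrinkage, and power transformation that PLDA uses. The function names and the `pseudo` parameter are hypothetical.

```python
import math


def fit_poisson_classifier(counts, labels, pseudo=0.5):
    """Estimate class priors and per-class, per-gene Poisson rates.

    counts: list of samples, each a list of non-negative gene counts.
    labels: class label of each sample.
    pseudo: hypothetical pseudo-count that keeps every rate above zero.
    """
    rates, priors = {}, {}
    n_genes = len(counts[0])
    for k in sorted(set(labels)):
        rows = [x for x, y in zip(counts, labels) if y == k]
        priors[k] = len(rows) / len(counts)
        # Mean count per gene within the class, nudged away from zero.
        rates[k] = [(sum(r[j] for r in rows) + pseudo) / len(rows)
                    for j in range(n_genes)]
    return rates, priors


def predict(model, x):
    """Return the class with the highest Poisson log-posterior score."""
    rates, priors = model

    def score(k):
        # Poisson log-likelihood, dropping the log(x!) term because it is
        # the same for every class and does not change the argmax.
        loglik = sum(xj * math.log(lam) - lam for xj, lam in zip(x, rates[k]))
        return math.log(priors[k]) + loglik

    return max(rates, key=score)


if __name__ == "__main__":
    # Toy data: gene 0 is high in class A, gene 1 is high in class B.
    train = [[10, 1, 5], [12, 0, 6], [1, 9, 5], [0, 11, 4]]
    model = fit_poisson_classifier(train, ["A", "A", "B", "B"])
    print(predict(model, [9, 2, 5]))
```

A plain Poisson model like this one assumes the variance equals the mean, which is why overdispersed data call for either a transformation or a negative binomial model.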

Why is it important?

The study is important because it focuses on the classification of RNA-Seq data, which is a powerful technique for gene expression profiling. The increasing use of RNA-Seq in research and diagnostics highlights the need for effective classification algorithms that can handle the unique characteristics of RNA-Seq data, such as overdispersion and discrete counts.

Key Takeaways:
1. RNA-Seq data is overdispersed, which can negatively impact classification performance.
2. Count-based classifiers, such as PLDA with power transformation and NBLDA, are designed to handle overdispersed RNA-Seq counts directly.
3. Microarray-based classifiers, after rlog/vst transformations, can also be used for classifying RNA-Seq data.
4. The PLDA classifier after a power transformation may be a good choice as a count-based classifier due to its sparsity and efficiency.
5. Further research is needed to improve the performance of NBLDA as a count-based classifier and to extend it into a sparse classifier.
6. An R/Bioconductor package, MLSeq, is available for the classification of RNA-Seq data.
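Overdispersion means that the variance of the counts exceeds their mean. The sketch below illustrates this in Python by drawing negative binomial counts as a gamma-Poisson mixture and comparing their variance-to-mean ratio with that of plain Poisson counts. The parameterization (variance = mu + phi * mu^2) is a common convention and an assumption here, as are the function names and the values of `mu` and `phi`; none of them are taken from the study.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1


def sample_negative_binomial(mu, phi, rng):
    """Draw one count with mean mu and variance mu + phi * mu**2.

    The rate is gamma distributed (shape 1/phi, scale mu*phi), and the
    count is Poisson given that rate.
    """
    lam = rng.gammavariate(1.0 / phi, mu * phi)
    return sample_poisson(lam, rng)


def variance_to_mean(values):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return var / mean


if __name__ == "__main__":
    rng = random.Random(0)
    mu, phi = 20.0, 0.5  # illustrative values only
    nb = [sample_negative_binomial(mu, phi, rng) for _ in range(5000)]
    pois = [sample_poisson(mu, rng) for _ in range(5000)]
    print("negative binomial variance/mean:", variance_to_mean(nb))
    print("Poisson variance/mean:", variance_to_mean(pois))
```

Under these assumptions the theoretical ratio for the negative binomial draws is 1 + phi * mu, so larger dispersion values push the counts further from the Poisson case.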

Some of the content on this page has been created using generative AI.

This page is a summary of: A comprehensive simulation study on classification of RNA-Seq data, PLOS One, August 2017, PLOS,
DOI: 10.1371/journal.pone.0182507.

Contributors

The following have contributed to this page

Professor Turgay UNVER
Cankiri Karatekin University
