What is it about?

In this work, we analysed the visual words (codebook) of protein distance matrices and studied how the size of the vocabulary relates to classification accuracy. We found that codewords with higher relative frequency generally lie closer to the main diagonal of the distance matrix. We also showed that solenoid domains have a much lower proportion of unique codewords than globular proteins, and that the feature vector (codeword histogram), together with a support vector machine classifier, can be used very efficiently to discriminate between globular and solenoid proteins.
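To make the idea of a codeword histogram more concrete, here is a minimal sketch in Python. It is not the pipeline from the paper. It assumes a codebook is already available (in practice a codebook would be learned from many patches, for example by clustering), and the patch size, the toy coordinates, the two-word codebook and all function names are hypothetical choices made for illustration. The resulting histogram is the kind of feature vector that would then be passed to a classifier such as a support vector machine, a step not shown here.

```python
import math
from collections import Counter

def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def extract_patches(matrix, size, stride=1):
    """Yield (row, col, flattened patch) for every size x size window."""
    n = len(matrix)
    for i in range(0, n - size + 1, stride):
        for j in range(0, n - size + 1, stride):
            patch = [matrix[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            yield i, j, patch

def nearest_codeword(patch, codebook):
    """Index of the codeword closest to the patch (Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(patch, codebook[k]))

def codeword_histogram(matrix, codebook, size, stride=1):
    """Relative frequency of each codeword over all patches of one matrix."""
    counts = Counter(nearest_codeword(p, codebook)
                     for _, _, p in extract_patches(matrix, size, stride))
    total = sum(counts.values())
    return [counts[k] / total for k in range(len(codebook))]

# Toy data (hypothetical): six points on a straight line, 3.8 units apart.
coords = [(3.8 * i, 0.0, 0.0) for i in range(6)]
dm = distance_matrix(coords)

# Hypothetical two-word codebook for 2x2 patches:
# word 0 resembles a patch on the main diagonal, word 1 a patch far from it.
codebook = [[0.0, 3.8, 3.8, 0.0], [15.2, 19.0, 11.4, 15.2]]

hist = codeword_histogram(dm, codebook, size=2)
print(hist)  # one relative frequency per codeword
```

In this sketch every patch is assigned to its single nearest codeword, so the histogram discards where in the matrix each patch came from.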

Why is it important?

We showed that solenoid domains have a much lower proportion of unique codewords than globular proteins, and that the feature vector (codeword histogram), together with a support vector machine classifier, can be used very efficiently to discriminate between globular and solenoid proteins.

Perspectives

We believe that further work is needed to investigate whether the codeword histogram is useful for classifying tandem repeats. In addition, more advanced approaches, such as pooling methods, could be used to incorporate spatial information from the patches of the protein distance matrix.

Jure Pražnikar
University of Primorska Faculty of Mathematics, Natural Sciences and Information Technologies

Read the Original

This page is a summary of: Quantitative analysis of visual codewords of a protein distance matrix, PLoS ONE, February 2022, PLOS,
DOI: 10.1371/journal.pone.0263566.
