Better prediction of protein cellular localization sites with the k nearest neighbors classifier

Proc Int Conf Intell Syst Mol Biol. 1997:5:147-52.

Authors

P Horton¹, K Nakai

Affiliation

¹ Computer Science Division, University of California, Berkeley 94720, USA. paulh@cs.berkeley.edu

PMID: 9322029

Abstract

We have compared four classifiers on the problem of predicting the cellular localization sites of proteins in yeast and E. coli. A set of sequence-derived features, such as regions of high hydrophobicity, was used for each classifier. The methods compared were a structured probabilistic model specifically designed for the localization problem, the k nearest neighbors classifier, the binary decision tree classifier, and the naïve Bayes classifier. Tests using stratified cross-validation show that the k nearest neighbors classifier performs better than the other methods. In the case of yeast, this difference was statistically significant according to a cross-validated paired t-test. The k nearest neighbors classifier achieved an accuracy of approximately 60% for 10 yeast classes and 86% for 8 E. coli classes. The best previously reported accuracies for these datasets were 55% and 81%, respectively.
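
The evaluation procedure named in the abstract (k nearest neighbors scored by stratified cross-validation, with a paired t-test across folds) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the distance metric, the value of k, or the number of folds, so the Euclidean distance, the function names, the toy data, and the labels "A" and "B" below are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x.

    Euclidean distance is an assumption; the abstract names no metric.
    """
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def stratified_folds(labels, n_folds, seed=0):
    """Split indices into folds that each keep roughly the overall class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    pos = 0  # running counter so small classes are spread over all folds
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for i in members:
            folds[pos % n_folds].append(i)
            pos += 1
    return folds


def cross_validated_accuracies(X, y, k, n_folds, seed=0):
    """Per-fold accuracy of kNN, training on every fold except the held-out one."""
    accuracies = []
    for fold in stratified_folds(y, n_folds, seed):
        held_out = set(fold)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                      for i in fold)
        accuracies.append(correct / len(fold))
    return accuracies


def paired_t_statistic(acc_a, acc_b):
    """t statistic for per-fold accuracy differences between two classifiers.

    Undefined (division by zero) when every fold gives the same difference.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


# Toy data (hypothetical): two well-separated clusters with made-up labels.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3),
     (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.0, 5.3)]
y = ["A"] * 4 + ["B"] * 4
accs = cross_validated_accuracies(X, y, k=3, n_folds=4)
print(accs)
```

Comparing two classifiers this way means running both on the same folds and passing their per-fold accuracies to `paired_t_statistic`; the resulting value is then compared against a t distribution with (number of folds - 1) degrees of freedom.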

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Bacterial Proteins / chemistry
Bacterial Proteins / classification
Bacterial Proteins / metabolism
Bayes Theorem
Binding Sites
Databases, Factual
Decision Trees
Escherichia coli / metabolism
Evaluation Studies as Topic
Fungal Proteins / chemistry
Fungal Proteins / classification
Fungal Proteins / metabolism
Proteins / chemistry
Proteins / classification*
Proteins / metabolism*
Saccharomyces cerevisiae / metabolism
Sequence Alignment
Software
Subcellular Fractions / metabolism

Substances

Bacterial Proteins
Fungal Proteins
Proteins