Ecoli
About
This dataset describes the cellular localization sites of E. coli proteins.
The references below describe a predecessor to this dataset and its development. They also give results (not cross-validated) for classification by a rule-based expert system with that version of the dataset.
Reference: "Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria", Kenta Nakai & Minoru Kanehisa, PROTEINS: Structure, Function, and Genetics 11:95-110, 1991.
Reference: "A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells", Kenta Nakai & Minoru Kanehisa, Genomics 14:897-911, 1992.
Variables Info:
1. Sequence Name: Accession number for the SWISS-PROT database
2. mcg: McGeoch's method for signal sequence recognition.
3. gvh: von Heijne's method for signal sequence recognition.
4. lip: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.
5. chg: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.
6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
7. alm1: score of the ALOM membrane spanning region prediction program.
8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.
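The variable list above can be turned into a small parser. The following is a minimal sketch, not the dataset's official loader: it assumes each record is one line of whitespace-separated fields in the order listed (sequence name, the seven scores, then the class label). The names `EcoliRecord`, `FEATURE_NAMES`, and `parse_ecoli` are hypothetical and introduced only for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple

# Score columns in the order given by the variable list (items 2 to 8).
FEATURE_NAMES = ("mcg", "gvh", "lip", "chg", "aac", "alm1", "alm2")


class EcoliRecord(NamedTuple):
    sequence_name: str            # SWISS-PROT accession (item 1)
    features: Tuple[float, ...]   # seven scores, ordered as FEATURE_NAMES
    label: str                    # localization site, e.g. "cp"


def parse_ecoli(lines: Iterable[str]) -> List[EcoliRecord]:
    """Parse whitespace-separated records; blank lines are skipped.

    Assumption: the class label is the last field on each line.
    """
    expected = len(FEATURE_NAMES) + 2  # name + scores + label
    records = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != expected:
            raise ValueError(
                f"line {line_no}: expected {expected} fields, got {len(fields)}"
            )
        name, *values, label = fields
        records.append(EcoliRecord(name, tuple(float(v) for v in values), label))
    return records
```

Because `parse_ecoli` accepts any iterable of strings, an open file handle can be passed directly, for example `parse_ecoli(open(path))` where `path` points to a local copy of the data file (the file name and location are not specified here).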
Class labels:
cp (cytoplasm) 143
im (inner membrane without signal sequence) 77
pp (periplasm) 52
imU (inner membrane, uncleavable signal sequence) 35
om (outer membrane) 20
omL (outer membrane lipoprotein) 5
imL (inner membrane lipoprotein) 2
imS (inner membrane, cleavable signal sequence) 2
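The class counts above are strongly imbalanced, which matters when choosing an evaluation scheme. The sketch below uses only the counts listed here to compute the accuracy of always predicting the most frequent class and to find classes too small for a given number of stratified folds. The names `CLASS_COUNTS` and `classes_below` are hypothetical helpers, not part of the dataset.

```python
# Instance counts per localization site, copied from the class label list.
CLASS_COUNTS = {
    "cp": 143, "im": 77, "pp": 52, "imU": 35,
    "om": 20, "omL": 5, "imL": 2, "imS": 2,
}

total = sum(CLASS_COUNTS.values())

# Accuracy of a trivial classifier that always predicts the largest class.
majority_label = max(CLASS_COUNTS, key=CLASS_COUNTS.get)
majority_baseline = CLASS_COUNTS[majority_label] / total


def classes_below(min_count: int) -> list:
    """Labels with fewer than min_count instances, in listing order."""
    return [label for label, n in CLASS_COUNTS.items() if n < min_count]


if __name__ == "__main__":
    print(f"total instances: {total}")
    print(f"majority class: {majority_label} ({majority_baseline:.1%})")
    # Classes that cannot place one instance in every fold of 10-fold CV.
    print("fewer than 10 instances:", classes_below(10))
```

With only two instances each, imL and imS cannot appear in every fold of any stratified split with more than two folds, so per-class results for them should be read with caution.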
Subject Area
Biology
Instances
336
Features
8
Data Types
Multivariate
Tasks
Classification
Feature Types
Continuous
Features
| Name | Role | Type | Units | Missing Values | Description |
|---|---|---|---|---|---|
| Sequence | Id | Categorical | - | No | |
| mcg | Feature | Continuous | - | No | |
| gvh | Feature | Continuous | - | No | |
| lip | Feature | Binary | - | No | |
| chg | Feature | Binary | - | No | |
| aac | Feature | Continuous | - | No | |
| alm1 | Feature | Continuous | - | No | |
| alm2 | Feature | Continuous | - | No | |
| class | Target | Categorical | - | No | |
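The table states two properties that can be checked on loaded data: no column has missing values, and lip and chg are binary. A minimal validation sketch follows, assuming rows are held as dictionaries keyed by the column names in the table. `SCHEMA` and `validate` are hypothetical names, and the check deliberately does not assume which two values the binary columns take.

```python
import math

# Column name -> type, following the table's Name and Type columns.
SCHEMA = {
    "Sequence": "categorical",
    "mcg": "continuous", "gvh": "continuous",
    "lip": "binary", "chg": "binary",
    "aac": "continuous", "alm1": "continuous", "alm2": "continuous",
    "class": "categorical",
}


def _is_missing(value) -> bool:
    """Treat None, empty strings, and NaN as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def validate(rows) -> list:
    """Return a list of problems; an empty list means the rows match the table."""
    problems = []
    for i, row in enumerate(rows):
        for name in SCHEMA:
            if _is_missing(row.get(name)):
                problems.append(f"row {i}: missing value for {name}")
    for name, kind in SCHEMA.items():
        if kind != "binary":
            continue
        distinct = {row[name] for row in rows if not _is_missing(row.get(name))}
        if len(distinct) > 2:
            problems.append(
                f"{name}: {len(distinct)} distinct values, expected at most 2"
            )
    return problems
```

Returning a list of messages rather than raising on the first failure lets all problems in a file be reported in one pass.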
Introductory Paper
A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins
P. Horton, K. Nakai. 1996.
Intelligent Systems in Molecular Biology