UC Irvine
ML Repository

Molecular Biology (Splice-junction Gene Sequences)

Download (163.1 KB)

About

Primate splice-junction gene sequences (DNA) with associated imperfect domain theory.

Problem Description: Splice junctions are points on a DNA sequence at which "superfluous" DNA is removed during the process of protein creation in higher organisms. The problem posed in this dataset is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to as EI sites), and recognizing intron/exon boundaries (IE sites). (In the biological community, IE borders are referred to as "acceptors" while EI borders are referred to as "donors".)

This dataset was developed to help evaluate a "hybrid" learning algorithm (KBANN) that uses examples to inductively refine preexisting knowledge. Using a ten-fold cross-validation methodology on 1000 examples randomly selected from the complete set of 3190, the following error rates were produced by various ML algorithms (all experiments were run at the University of Wisconsin, sometimes with local implementations of published algorithms).

System           -- Neither -- EI    -- IE
---------------------------------------------------
KBANN            --  4.62   --  7.56 --  8.47
BACKPROP         --  5.29   --  5.74 -- 10.75
PEBLS            --  6.86   --  8.18 --  7.55
PERCEPTRON       --  3.99   -- 16.32 -- 17.41
ID3              --  8.84   -- 10.58 -- 13.99
COBWEB           -- 11.80   -- 15.04 --  9.46
Nearest Neighbor -- 31.11   -- 11.65 --  9.09
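The evaluation protocol described above (1000 examples drawn at random from 3190, then ten-fold cross-validation) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure the original experiments used: the function names `sample_and_split` and `train_test_pairs`, the fixed seed, and the round-robin assignment of examples to folds are all hypothetical choices, and the sketch works on row indices rather than on the data itself.

```python
import random

def sample_and_split(n_total=3190, n_sample=1000, n_folds=10, seed=0):
    """Draw n_sample distinct row indices and deal them into n_folds folds.

    The seed and the round-robin dealing are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(n_total), n_sample)  # sampling without replacement
    # Fold i receives every n_folds-th sampled index, starting at offset i.
    return [chosen[i::n_folds] for i in range(n_folds)]

def train_test_pairs(folds):
    """Yield (train_indices, test_indices), holding out each fold exactly once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

folds = sample_and_split()
pairs = list(train_test_pairs(folds))
```

With the defaults, each of the ten rounds trains on 900 indices and tests on the remaining 100. A per-class error rate like those reported above would then be averaged over the ten held-out folds.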
Subject Area
Biology
Instances
3,190
Features
61
Data Types
Sequential
Tasks
Classification
Feature Types
Categorical
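Because the features are categorical, numeric learners such as a perceptron or a backpropagation network generally need the sequence encoded first. The following is a minimal sketch of one-hot encoding, assuming each example is available as a string of nucleotide letters. The alphabet `ACGT`, the function name `one_hot`, and the choice to leave unrecognized symbols as all zeros are assumptions for illustration; the actual file may use additional symbols or a different layout.

```python
ALPHABET = "ACGT"  # assumed alphabet; extend it if the data uses other symbols

def one_hot(sequence, alphabet=ALPHABET):
    """Return a flat list of 0/1 values with len(alphabet) slots per position."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    encoded = []
    for ch in sequence.upper():
        slot = [0] * len(alphabet)
        if ch in index:  # symbols outside the alphabet stay all-zero (assumption)
            slot[index[ch]] = 1
        encoded.extend(slot)
    return encoded

vec = one_hot("GTAAGT")  # six positions, four slots each
```

Under these assumptions a sequence of length n becomes a vector of 4n binary values, so every position contributes the same number of inputs regardless of which nucleotide it holds.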

Features

Name -- Role -- Type -- Units -- Missing Values

Introductory Paper

–

Additional Metadata

Keywords
–
Authors
–
Year Created
1991
License
CC BY 4.0