UC Irvine
ML Repository

Spambase

About
Classifying Email as Spam or Non-Spam

The "spam" concept is diverse: advertisements for products/web sites, make-money-fast schemes, chain letters, pornography... The classification task for this dataset is to determine whether a given email is spam or not.

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. To build a general-purpose spam filter, one would either have to blind such non-spam indicators or collect a much wider sample of non-spam.

For background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam!, Communications of the ACM, 41(8):74-83, 1998.

Typical performance is around 7% misclassification error. False positives (marking good mail as spam) are very undesirable. If we insist on zero false positives in the training/testing set, 20-25% of the spam passes through the filter. See also Hewlett-Packard Internal-only Technical Report. External version forthcoming.
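The trade-off between misclassification error and false positives can be illustrated with a minimal sketch. It assumes some classifier has already assigned each email a spam score (higher means more spam-like) and that labels use 1 for spam and 0 for non-spam. The function names, the scores, and the labels below are hypothetical toy values chosen for illustration; they are not taken from the dataset and do not reproduce the figures quoted above.

```python
def misclassification_error(labels, predictions):
    """Fraction of emails whose predicted label differs from the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    wrong = sum(1 for y, p in zip(labels, predictions) if y != p)
    return wrong / len(labels)


def zero_false_positive_threshold(scores, labels):
    """Highest score given to any non-spam email.

    Flagging only emails that score strictly above this value
    guarantees that no non-spam email is marked as spam.
    """
    non_spam_scores = [s for s, y in zip(scores, labels) if y == 0]
    return max(non_spam_scores)


def spam_pass_through_rate(scores, labels, threshold):
    """Fraction of spam emails that are NOT flagged at the given threshold."""
    spam_scores = [s for s, y in zip(scores, labels) if y == 1]
    missed = sum(1 for s in spam_scores if s <= threshold)
    return missed / len(spam_scores)


# Hypothetical toy data: 4 spam emails (label 1) and 4 non-spam emails (label 0).
scores = [0.95, 0.80, 0.40, 0.90, 0.10, 0.60, 0.30, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

threshold = zero_false_positive_threshold(scores, labels)
predictions = [1 if s > threshold else 0 for s in scores]

false_positives = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
pass_through = spam_pass_through_rate(scores, labels, threshold)
error = misclassification_error(labels, predictions)

print(f"threshold={threshold}, false positives={false_positives}")
print(f"spam passed through={pass_through:.0%}, error={error:.1%}")
```

Choosing the threshold from the non-spam scores only guarantees zero false positives on the data it was computed from; on unseen mail some false positives may still occur.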
Subject Area
Computer Science
Instances
4,601
Features
57
Data Types
Multivariate
Tasks
Classification
Feature Types
Integer, Continuous
Authors
Mark Hopkins
Erik Reeber
George Forman
Jaap Suermondt
Year Created
1999
License
CC BY 4.0
Donated On
1 Jul 1999