Spambase

About
Classifying Email as Spam or Non-Spam
The "spam" concept is diverse: advertisements for products/web sites, make-money-fast schemes, chain letters, pornography...
The classification task for this dataset is to determine whether a given email is spam or not.
Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
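
One way to "blind" the personalized non-spam indicators is simply to drop their columns before training. The following is a minimal sketch of that idea, not the authors' procedure: the helper `blind_columns`, the set `PERSONAL_INDICATORS`, and the toy rows are all hypothetical, and the choice to hide exactly the 'george' and '650' features is an assumption based on the two examples named above.

```python
# Hypothetical set of personalized indicators to hide. Only 'george' and
# '650' are named in the text; blinding exactly these two is an assumption.
PERSONAL_INDICATORS = {"word_freq_george", "word_freq_650"}


def blind_columns(header, rows, blinded=PERSONAL_INDICATORS):
    """Return (header, rows) with the blinded columns removed."""
    keep = [i for i, name in enumerate(header) if name not in blinded]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Toy data invented for illustration; not taken from the dataset.
header = ["word_freq_free", "word_freq_george", "word_freq_650", "Class"]
rows = [[0.32, 0.0, 0.0, 1], [0.0, 1.5, 0.8, 0]]
blinded_header, blinded_rows = blind_columns(header, rows)
```

Other site-specific features in the table (for example `word_freq_hp` or `word_freq_hpl`) could be added to the set in the same way if they turn out to act as personalized indicators.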
For background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam!, Communications of the ACM, 41(8):74-83, 1998.
Typical performance is around 7% misclassification error. False positives (marking good mail as spam) are very undesirable. If we insist on zero false positives in the training/testing set, 20-25% of the spam passes through the filter. See also: Hewlett-Packard internal-only technical report (external version forthcoming).
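
The zero-false-positive trade-off can be illustrated with a small sketch. It assumes a classifier that outputs a spam score per message (higher means more spam-like) and labels coded as 1 for spam and 0 for non-spam; both conventions, the function name `zero_fp_miss_rate`, and the toy scores are assumptions for illustration, not part of the dataset description.

```python
def zero_fp_miss_rate(scores, labels):
    """Fraction of spam that passes when no non-spam may be flagged.

    scores: higher means more spam-like (assumed convention).
    labels: 1 = spam, 0 = non-spam (assumed coding).
    """
    ham_scores = [s for s, y in zip(scores, labels) if y == 0]
    spam_scores = [s for s, y in zip(scores, labels) if y == 1]
    if not spam_scores:
        raise ValueError("need at least one spam example")
    # Flag a message only if it scores strictly above every non-spam message,
    # so no non-spam message is ever flagged on this data.
    threshold = max(ham_scores) if ham_scores else float("-inf")
    missed = sum(1 for s in spam_scores if s <= threshold)
    return threshold, missed / len(spam_scores)


# Toy scores invented for illustration only.
scores = [0.95, 0.90, 0.40, 0.85, 0.10, 0.60]
labels = [1, 1, 1, 1, 0, 0]
threshold, miss_rate = zero_fp_miss_rate(scores, labels)
```

On this toy input one of the four spam messages scores below the highest non-spam score, so it passes through the filter. The figures say nothing about the real dataset; they only show how the quantity is computed.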
Subject Area
Computer Science
Instances
4,601
Features
57
Data Types
Multivariate
Tasks
Classification
Feature Types
Integer, Continuous
Features
Name | Role | Type | Units | Missing Values | Description |
---|---|---|---|---|---|
word_freq_make | Feature | Continuous | - | No | |
word_freq_address | Feature | Continuous | - | No | |
word_freq_all | Feature | Continuous | - | No | |
word_freq_3d | Feature | Continuous | - | No | |
word_freq_our | Feature | Continuous | - | No | |
word_freq_over | Feature | Continuous | - | No | |
word_freq_remove | Feature | Continuous | - | No | |
word_freq_internet | Feature | Continuous | - | No | |
word_freq_order | Feature | Continuous | - | No | |
word_freq_mail | Feature | Continuous | - | No | |
word_freq_receive | Feature | Continuous | - | No | |
word_freq_will | Feature | Continuous | - | No | |
word_freq_people | Feature | Continuous | - | No | |
word_freq_report | Feature | Continuous | - | No | |
word_freq_addresses | Feature | Continuous | - | No | |
word_freq_free | Feature | Continuous | - | No | |
word_freq_business | Feature | Continuous | - | No | |
word_freq_email | Feature | Continuous | - | No | |
word_freq_you | Feature | Continuous | - | No | |
word_freq_credit | Feature | Continuous | - | No | |
word_freq_your | Feature | Continuous | - | No | |
word_freq_font | Feature | Continuous | - | No | |
word_freq_000 | Feature | Continuous | - | No | |
word_freq_money | Feature | Continuous | - | No | |
word_freq_hp | Feature | Continuous | - | No | |
word_freq_hpl | Feature | Continuous | - | No | |
word_freq_george | Feature | Continuous | - | No | |
word_freq_650 | Feature | Continuous | - | No | |
word_freq_lab | Feature | Continuous | - | No | |
word_freq_labs | Feature | Continuous | - | No | |
word_freq_telnet | Feature | Continuous | - | No | |
word_freq_857 | Feature | Continuous | - | No | |
word_freq_data | Feature | Continuous | - | No | |
word_freq_415 | Feature | Continuous | - | No | |
word_freq_85 | Feature | Continuous | - | No | |
word_freq_technology | Feature | Continuous | - | No | |
word_freq_1999 | Feature | Continuous | - | No | |
word_freq_parts | Feature | Continuous | - | No | |
word_freq_pm | Feature | Continuous | - | No | |
word_freq_direct | Feature | Continuous | - | No | |
word_freq_cs | Feature | Continuous | - | No | |
word_freq_meeting | Feature | Continuous | - | No | |
word_freq_original | Feature | Continuous | - | No | |
word_freq_project | Feature | Continuous | - | No | |
word_freq_re | Feature | Continuous | - | No | |
word_freq_edu | Feature | Continuous | - | No | |
word_freq_table | Feature | Continuous | - | No | |
word_freq_conference | Feature | Continuous | - | No | |
char_freq_; | Feature | Continuous | - | No | |
char_freq_( | Feature | Continuous | - | No | |
char_freq_[ | Feature | Continuous | - | No | |
char_freq_! | Feature | Continuous | - | No | |
char_freq_$ | Feature | Continuous | - | No | |
char_freq_# | Feature | Continuous | - | No | |
capital_run_length_average | Feature | Continuous | - | No | |
capital_run_length_longest | Feature | Continuous | - | No | |
capital_run_length_total | Feature | Continuous | - | No | |
Class | Target | Binary | - | No | |
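
The table above lists 57 feature columns (48 `word_freq_*`, 6 `char_freq_*`, 3 `capital_run_length_*`) followed by the `Class` target. As a rough sketch of how one record could be split into features and target, the code below assumes the data arrive as headerless comma-separated lines in that column order; the file layout, the helper `parse_rows`, and the sample line are assumptions, not something this page states.

```python
import csv
import io

N_FEATURES = 57  # 48 word_freq_* + 6 char_freq_* + 3 capital_run_length_*


def parse_rows(text):
    """Split each comma-separated line into (features, label)."""
    examples = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        if len(fields) != N_FEATURES + 1:
            raise ValueError(
                f"expected {N_FEATURES + 1} fields, got {len(fields)}"
            )
        features = [float(x) for x in fields[:N_FEATURES]]
        label = int(fields[N_FEATURES])  # last column is the Class target
        examples.append((features, label))
    return examples


# Toy line made up for illustration; not a record from the dataset.
sample = ",".join(["0"] * 56 + ["278", "1"]) + "\n"
examples = parse_rows(sample)
```

Checking the field count per line is a cheap guard against reading a file whose layout differs from the assumed one.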
Authors
Mark Hopkins
Erik Reeber
George Forman
Jaap Suermondt
Year Created
1999
License
CC BY 4.0
Donated On
1 Jul 1999