UC Irvine
ML Repository

Product Classification and Clustering

Download (602.1 KB)

About

This dataset was collected from PriceRunner, a popular product comparison platform. It includes 35,311 product offers from 10 categories, provided by 306 different merchants. The dataset is well suited to evaluating classification, clustering, and entity matching algorithms. Although it contains product-related data, it can also be applied to other problems involving text or short-text mining.

Preprocessing description: Case folding and punctuation removal were applied to the titles in column 2.

Does this dataset contain sensitive information?: No
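
The case folding and punctuation removal mentioned in the preprocessing description can be sketched in a few lines of Python. This is only an illustration of the general technique, not the authors' actual preprocessing code: the function name `normalize_title`, the choice to delete (rather than replace) ASCII punctuation, the whitespace collapsing, and the sample title are all assumptions made for the example.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: punctuation is removed outright, not replaced by a space.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_title(title: str) -> str:
    """Case-fold a title, strip ASCII punctuation, and collapse whitespace."""
    folded = title.casefold()                 # case folding
    stripped = folded.translate(_PUNCT_TABLE)  # punctuation removal
    # Collapsing whitespace is an extra step added here for tidiness.
    return " ".join(stripped.split())


# Hypothetical product title, not taken from the dataset.
print(normalize_title("Apple iPhone 8 Plus (64GB) - Silver!"))
# prints: apple iphone 8 plus 64gb silver
```

Applying a function like this to new titles before comparing them with the dataset's titles keeps both sides in the same normalized form, which matters for matching and clustering on short text.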
Subject Area
Business
Instances
35,311
Features
7
Data Types
Tabular, Text
Tasks
Classification, Clustering
Feature Types
Categorical, Integer

Features

Name    Role    Type    Units    Missing Values

Introductory Paper

A self-verifying clustering approach to unsupervised matching of product titles
Leonidas Akritidis, Athanasios Fevgas, Panayiotis Bozanis, C. Makris. 2020.
Artificial Intelligence Review

Additional Metadata

Authors
Leonidas Akritidis
Year Created
2020
License
CC BY 4.0