Product Classification and Clustering
About
This dataset was collected from PriceRunner, a popular product comparison platform. It includes 35,311 product offers from 10 categories, provided by 306 different merchants.
This dataset is well suited to evaluating classification, clustering, and entity matching algorithms. Although the data is product-related, it can also be used for any problem involving text or short-text mining.
Preprocessing description:
Case folding and punctuation removal were applied to the product titles (column 2).
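To apply comparable normalization to new titles before matching them against this dataset, a minimal Python sketch follows. The exact rules used by the dataset authors are not documented here, so two details are assumptions: punctuation is taken to mean the ASCII set in `string.punctuation`, and each punctuation character is replaced by a space before whitespace is collapsed. The function name `normalize_title` is hypothetical.

```python
import string

# Assumption: every ASCII punctuation character becomes a space, so that
# "(64GB)" turns into "64gb" rather than fusing with neighbouring tokens.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize_title(title: str) -> str:
    """Case-fold a title, drop punctuation, and collapse whitespace."""
    folded = title.casefold()                      # case folding
    no_punct = folded.translate(_PUNCT_TO_SPACE)   # punctuation removal
    return " ".join(no_punct.split())              # collapse runs of spaces


print(normalize_title("Apple iPhone 8 Plus (64GB) - Silver"))
# prints: apple iphone 8 plus 64gb silver
```

Replacing punctuation with a space rather than deleting it keeps tokens such as "plus-size" from merging into one word; if the original preprocessing deleted characters outright, the translation table would map them to `None` instead.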
Does this dataset contain sensitive information?
no
Subject Area
Business
Instances
35,311
Features
7
Data Types
Tabular, Text
Tasks
Classification, Clustering
Feature Types
Categorical, Integer
Features
Name | Role | Type | Units | Missing Values |
---|---|---|---|---|
Product ID | Feature | Integer | - | No |
Product Title | Feature | Categorical | - | No |
Merchant ID | Feature | Integer | - | No |
Cluster ID | Feature | Integer | - | No |
Cluster Label | Feature | Categorical | - | No |
Category ID | Feature | Integer | - | No |
Category Label | Feature | Categorical | - | No |
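The seven columns in the table above can be read into typed records as sketched below in Python. This is an illustration under stated assumptions, not a description of the distributed file: it assumes a comma-separated file with a header row and the columns in the order listed in the table. The `Offer` class, the helper names, and the three sample rows are invented for the example and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    product_id: int
    title: str
    merchant_id: int
    cluster_id: int
    cluster_label: str
    category_id: int
    category_label: str


def read_offers(handle):
    """Parse offers from a CSV handle (assumed: header row, table column order)."""
    reader = csv.reader(handle)
    next(reader)  # skip the assumed header row
    offers = []
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        pid, title, mid, cid, clabel, catid, catlabel = (c.strip() for c in row)
        offers.append(
            Offer(int(pid), title, int(mid), int(cid), clabel, int(catid), catlabel)
        )
    return offers


def group_by_cluster(offers):
    """Group offers by Cluster ID, i.e. offers referring to the same product."""
    clusters = defaultdict(list)
    for offer in offers:
        clusters[offer.cluster_id].append(offer)
    return dict(clusters)


# Invented rows for illustration only; they do not come from the dataset.
SAMPLE = """Product ID,Product Title,Merchant ID,Cluster ID,Cluster Label,Category ID,Category Label
1,example phone 64gb silver,10,100,Example Phone 64GB,2000,Mobile Phones
2,example phone silver 64 gb,11,100,Example Phone 64GB,2000,Mobile Phones
3,example fridge 300l white,12,101,Example Fridge 300L,2001,Fridges
"""

offers = read_offers(io.StringIO(SAMPLE))
clusters = group_by_cluster(offers)
```

In this layout, Category Label would serve as the target for classification, while Cluster ID identifies offers that refer to the same product and so provides ground truth for clustering and entity matching.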
Introductory Paper
A self-verifying clustering approach to unsupervised matching of product titles
Leonidas Akritidis, Athanasios Fevgas, Panayiotis Bozanis, C. Makris. 2020.
Artificial Intelligence Review