11 Aug 2004
Scalable Clustering of Categorical Data
Speaker: Panos Karras
Abstract
Clustering is a problem of great practical importance in numerous applications.
The problem of clustering becomes more challenging when the data is
categorical, that is, when there is no inherent distance measure between data
values. LIMBO is a scalable hierarchical algorithm for clustering categorical
data; it builds on the Information Bottleneck (IB) framework, which quantifies
the relevant information preserved when clustering. As a
hierarchical algorithm, LIMBO has the advantage that it can produce
clusterings of different sizes in a single execution. The IB framework is used
to define a distance measure for categorical tuples, and a novel distance
measure for categorical attribute values is also presented. LIMBO can cluster
both tuples and values, and it handles large data sets by producing a
memory-bounded summary model of the data. LIMBO also supports a trade-off
between efficiency (in terms of space and time) and quality, allowing for
substantial improvements in efficiency with a negligible decrease in quality.
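
To make the IB-based tuple distance concrete, here is a minimal Python sketch. It is not taken from the talk or from the LIMBO implementation. It assumes the standard agglomerative IB formulation, where the information lost by merging two clusters is their combined probability mass times the weighted Jensen-Shannon divergence of their conditional distributions. It also assumes each tuple is represented as a uniform distribution over its attribute values. The function names (tuple_distribution, kl_divergence, merge_cost) and the toy data are hypothetical.

```python
import math


def tuple_distribution(tup, vocabulary):
    """Represent a tuple as a uniform distribution over its attribute values.

    Assumption: every attribute value in `vocabulary` is unique across
    attributes (for example, by prefixing it with the attribute name).
    """
    return [1.0 / len(tup) if value in tup else 0.0 for value in vocabulary]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(mass1, dist1, mass2, dist2):
    """Information loss incurred by merging two clusters.

    mass1, mass2: probability mass of each cluster (1/n for a single tuple).
    dist1, dist2: conditional distributions over attribute values.
    """
    total = mass1 + mass2
    w1, w2 = mass1 / total, mass2 / total
    # Distribution of the merged cluster: mass-weighted mixture of the two.
    merged = [w1 * a + w2 * b for a, b in zip(dist1, dist2)]
    # Weighted Jensen-Shannon divergence between the two distributions.
    js = w1 * kl_divergence(dist1, merged) + w2 * kl_divergence(dist2, merged)
    return total * js


# Toy data set of n = 3 tuples over two attributes (colour, size).
vocabulary = ["red", "blue", "small", "large"]
t1 = tuple_distribution(("red", "small"), vocabulary)
t2 = tuple_distribution(("red", "large"), vocabulary)
t3 = tuple_distribution(("red", "small"), vocabulary)

same = merge_cost(1 / 3, t1, 1 / 3, t3)    # identical tuples: 0.0
differ = merge_cost(1 / 3, t1, 1 / 3, t2)  # differ in one value: about 0.333
```

Under these assumptions, merging identical tuples loses no information, while merging tuples that disagree on one attribute has a positive cost. A hierarchical algorithm would repeatedly merge the pair with the smallest cost.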
Read the Presentation
Slides...
Referred Papers