HKU CS-Database Research Group

The University of Hong Kong
Department of Computer Science


11 Aug 2004

Scalable Clustering of Categorical Data

Speaker: Panos Karras

Abstract

Clustering is a problem of great practical importance in numerous applications. It becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. LIMBO is a scalable hierarchical algorithm for clustering categorical data that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO has the advantage that it can produce clusterings of different sizes in a single execution. The IB framework is used to define a distance measure for categorical tuples, and a novel distance measure for categorical attribute values is also presented. LIMBO can cluster both tuples and values, and it handles large data sets by producing a memory-bounded summary model of the data. It supports a trade-off between efficiency (in terms of space and time) and quality, allowing substantial improvements in efficiency with a negligible decrease in quality.
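The abstract does not state the distance measure itself, so the following is only a minimal sketch of how an IB-style distance between categorical tuples can be computed. It assumes that each tuple is modelled as a uniform distribution over its attribute values and that the cost of merging two clusters is their combined weight times the Jensen-Shannon divergence of their distributions, measured in bits. The function names `tuple_distribution` and `merge_cost` are hypothetical and do not come from the talk.

```python
import math


def tuple_distribution(tup):
    """Model a categorical tuple as a uniform distribution over its
    (attribute index, value) pairs. This representation is an assumption."""
    weight = 1.0 / len(tup)
    return {(i, v): weight for i, v in enumerate(tup)}


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.
    q must be non-zero wherever p is non-zero."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)


def merge_cost(p1, w1, p2, w2):
    """Information lost by merging two clusters with weights w1 and w2:
    (w1 + w2) times the weighted Jensen-Shannon divergence of p1 and p2.
    Returns the cost and the distribution of the merged cluster."""
    total = w1 + w2
    a, b = w1 / total, w2 / total
    keys = set(p1) | set(p2)
    merged = {k: a * p1.get(k, 0.0) + b * p2.get(k, 0.0) for k in keys}
    js = a * kl_divergence(p1, merged) + b * kl_divergence(p2, merged)
    return total * js, merged


# Two tuples that agree on the first attribute and differ on the second,
# each carrying half of the total weight of a two-tuple data set.
t1 = tuple_distribution(("red", "small"))
t2 = tuple_distribution(("red", "large"))
cost, merged = merge_cost(t1, 0.5, t2, 0.5)
print(cost)
```

Under these assumptions, identical tuples merge at zero cost, and the cost grows as the tuples share fewer attribute values. That is what lets an agglomerative procedure pick the cheapest pair to merge at each step.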


