Abstract
A very popular data mining task is to partition objects into
different groups such that the objects in each group share some common
properties. The task has long been formulated as two different
computational problems: clustering and classification. In clustering
(unsupervised learning), all objects are unlabeled and the grouping is
based on a similarity function. In contrast, classification (supervised
learning) takes as input a set of labeled training objects to build a
classifier for assigning labels to new unlabeled objects. The grouping
of new objects is mainly based on the patterns learnt from the training
set.
There are many real situations where clustering may not yield
satisfactory results, yet there is little or no labeled data for
building a classifier. In such situations, it is possible to obtain a
surprisingly large improvement in clustering accuracy by using only a
small amount of accessible domain knowledge. This learning paradigm is
now commonly known as semi-supervised clustering.
In this talk, I will discuss the motivation for this new technique
by addressing its potential applications in various domains, with a
special emphasis on explaining why traditional clustering and
classification techniques may not work or may give poorer results in
such cases. I will then go through the recent studies on the topic and
compare the proposed approaches along three dimensions: (1) the kind of
input knowledge being considered, (2) when the knowledge is supplied to
the clustering algorithm, and (3) how the knowledge affects the
clustering process.
I will also suggest some possible directions for future work,
particularly in the database and bioinformatics domains.