Abstract
Using relational tables to store XML documents is an established trend.
However, this approach fragments the documents and creates a large number of
joins that seriously impact query performance. If the collection contains documents
of different structures, we show that a proper clustering of the documents
will alleviate the problem.
To achieve a good clustering, we propose S-GRACE, a hierarchical clustering
algorithm for semi-structured data that clusters documents according to their
XML structures. We introduce the notion of a structure graph (s-graph), which
facilitates the definition of a distance metric applicable between documents
as well as between clusters of documents.
Our experiments with real data such as the DBLP database show that S-GRACE
can discover clusters that cannot be spotted easily by manual inspection.
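
To make the s-graph idea concrete, the following is a minimal sketch and not the S-GRACE implementation. The abstract does not define the s-graph or the metric, so the sketch assumes that an s-graph is the set of parent-to-child tag edges occurring in a document, that the s-graph of a cluster is the union of its members' edge sets, and that the distance is one minus the ratio of shared edges to the size of the larger edge set. The function names `s_graph`, `cluster_s_graph`, and `distance` are hypothetical.

```python
import xml.etree.ElementTree as ET


def s_graph(xml_text):
    """Return the set of (parent_tag, child_tag) edges in one document.

    Assumption: this edge set stands in for the paper's s-graph.
    """
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return edges


def cluster_s_graph(graphs):
    """Assumed s-graph of a cluster: the union of its members' edge sets."""
    return set().union(*graphs)


def distance(g1, g2):
    """Assumed metric: 1 - |shared edges| / max(|g1|, |g2|).

    Because clusters are also represented as edge sets, the same function
    applies between documents and between clusters.
    """
    larger = max(len(g1), len(g2))
    if larger == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - len(g1 & g2) / larger
```

Under these assumptions, two documents with the same element nesting have distance 0 regardless of how often each element repeats, and documents that share no parent-child edge have distance 1.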