CLUSTERING AND ENTITY RESOLUTION FOR SEMI-STRUCTURED DATA

Chuan Zhao

Using a graph representation of the data, a graph-based similarity measure is proposed to assess the similarity between data records. Both direct and indirect similarity are considered, so that the measure comprehensively captures the relationships between data records. Different data mining techniques and applications, including clustering and entity resolution, are explored.

First, the problem of clustering is considered for a dataset consisting of non-numeric attributes. The K-medoid clustering algorithm is used, and postprocessing steps are introduced to improve the quality of the clustering. A set of validity indices is proposed to assess the quality of the clustering results. To reduce computational complexity, a sampling strategy is introduced, and the effect of sampling on the values of the validity indices and on the clustering result is discussed. The influence of different similarity measures, postprocessing steps, and numbers of clusters on the quality of the clustering is discussed, both analytically and experimentally. Similar enhancements to the fuzzy K-medoid algorithm are provided. The clusters resulting from the proposed algorithm can sometimes be interpreted as groups of objects that share a common attribute not used in the clustering algorithm. A multi-medoid K-medoid algorithm is proposed, which introduces multiple medoids in each cluster to enhance the performance of the K-medoid algorithm. Finally, an optional node move step is introduced to produce better clustering results based on edge-oriented evaluation measures.

The entity resolution problem, which is the process of determining whether multiple records refer to the same real-world entity, is studied next. It is an important step in data cleaning and integration. A general entity resolution framework called ERUDITE is presented, which includes data preprocessing (filtering), record matching, and postprocessing (inconsistency elimination, record updating, and equivalent record elimination). Different record matching models are explored for both supervised and unsupervised learning methods. Two record updating algorithms are proposed that significantly improve the entity resolution result. Because the entity resolution result generally contains inconsistent decisions, new inconsistency elimination methods are proposed and their performance is compared with that of existing methods. Experiments with both unsupervised and supervised learning on two public datasets show that the proposed framework performs well.
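
As an illustration of the clustering setting described in the abstract, the following is a minimal sketch of a basic K-medoid loop run on a precomputed dissimilarity matrix over non-numeric records. It is not the proposed algorithm: the postprocessing steps, sampling strategy, multi-medoid variant, and node move step are all omitted, and Jaccard distance between attribute sets is used only as a stand-in for the graph-based similarity measure. The function names `jaccard_distance` and `k_medoids`, the toy `records`, and all parameters are assumptions made for this example.

```python
import random


def jaccard_distance(a, b):
    """Dissimilarity between two attribute sets (stand-in measure, assumed)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def assign(dist, medoids):
    """Label each record with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, k, max_iter=100, seed=0):
    """Basic alternating K-medoid clustering on a dissimilarity matrix."""
    n = len(dist)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)  # random initial medoids
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Empty cluster (possible with ties): keep the old medoid.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total
            # dissimilarity to the other members of its cluster.
            best = min(members, key=lambda m: sum(dist[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break  # medoids are stable, so the assignment is stable too
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy records with purely categorical attributes (hypothetical data).
records = [
    {"red", "small", "round"},
    {"red", "small", "oval"},
    {"red", "medium", "round"},
    {"blue", "large", "square"},
    {"blue", "large", "flat"},
    {"green", "large", "square"},
]
dist = [[jaccard_distance(a, b) for b in records] for a in records]

medoids, labels = k_medoids(dist, k=2)
print("medoids:", medoids)
print("labels:", labels)
```

Because the loop only ever reads the matrix `dist`, any other pairwise dissimilarity, including one derived from a graph representation of the records, could be substituted without changing the clustering code.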
