Applications of corpus-based semantic similarity and word segmentation to database schema matching

Description
Title: Applications of corpus-based semantic similarity and word segmentation to database schema matching
Authors: Islam, Md. Aminul
Date: 2006
Abstract: In this thesis, we present a method for database schema matching, the problem of identifying elements of two given schemas that correspond to each other. Schema matching is useful in e-commerce exchanges, in data integration and warehousing, and in Semantic Web applications. We first present two corpus-based methods: one for determining the semantic similarity of two target words and one for automatic word segmentation. We then present a name-based, element-level database schema matching method that exploits both the semantic similarity and the word segmentation methods. Our word similarity method uses Pointwise Mutual Information (PMI) to sort lists of important neighbor words of the two target words, identifies the words common to both lists, and aggregates their PMI values (from the opposite list) to calculate the relative similarity score. Our word segmentation method uses corpus type frequency information to choose the type with maximum length and frequency from "desegmented" text. If any non-matching portions of the text remain, it also applies a modified forward-backward matching technique based on maximum length frequency and entropy rate. For the database schema matching method, we also use normalized and modified versions of the Longest Common Subsequence (LCS) string matching algorithm, with weight factors to allow for a balanced combination. We validate our methods with experimental studies, the results of which suggest that these methods can be a useful addition to the set of existing methods.
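Illustration: the abstract names two standard building blocks, PMI and a normalized LCS. The Python sketch below shows the textbook form of each, only to make the terms concrete. It is not the thesis's implementation. The function names, the count-based PMI estimate, and the squared-length normalization are assumptions chosen for illustration, and the thesis's modified LCS variants and weight factors are not reproduced here.

```python
import math


def pmi(pair_count, count_x, count_y, total):
    """Textbook PMI, log2(p(x, y) / (p(x) * p(y))), estimated from raw counts.

    All four arguments are hypothetical corpus counts: co-occurrences of x
    and y, occurrences of x, occurrences of y, and the corpus size.
    """
    return math.log2((pair_count * total) / (count_x * count_y))


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b):
    """Assumed normalization: len(LCS)^2 / (len(a) * len(b)), a score in [0, 1]."""
    if not a or not b:
        return 0.0
    n = lcs_length(a, b)
    return (n * n) / (len(a) * len(b))
```

For two schema element names such as "custname" and "customername" (hypothetical examples), normalized_lcs returns a score between 0 and 1 that grows as the names share more characters in the same order.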
URL: http://hdl.handle.net/10393/27256
http://dx.doi.org/10.20381/ruor-11992
Collection: Thèses, 1910 - 2010 // Theses, 1910 - 2010
Files
MR18429.PDF (3.68 MB, Adobe PDF)