Text processing without a priori domain knowledge: Semi-automatic linguistic analysis for incremental knowledge acquisition.
Publisher
University of Ottawa (Canada)
Abstract
Technical texts are an invaluable source of the domain-specific knowledge that plays a crucial role in advanced knowledge-based systems today. However, acquiring such knowledge has always been a major difficulty in the construction of these systems; this critical obstacle is sometimes referred to as the "knowledge acquisition bottleneck". To lessen the burden on the knowledge engineer, several approaches have been proposed in the literature. A few of these suggest processing texts pertaining to the domain of interest in order to extract the knowledge they contain and thus facilitate domain modelling. We propose a new approach to knowledge acquisition from texts. It comprises a new methodology and a computational framework for implementing a linguistic processor, which is the central component of a system for acquiring knowledge from text. The system, named TANKA, is not given a complete domain model beforehand. It is designed to process technical texts in order to incrementally build a knowledge base containing a conceptual model of the domain. TANKA is an intelligent assistant to the knowledge engineer; when it cannot proceed entirely on its own, it asks the user to collaborate. In the process, the system acquires knowledge from text; it can be said to learn about the domain. The originality of the research lies mainly in the fact that we do not assume significant a priori domain-specific (semantic) knowledge; this assumption places a severe constraint on the natural language processor. The only external knowledge sources we consider in the proposed framework are "off-the-shelf", publicly available, domain-independent repositories, such as a basic dictionary containing surface syntactic information (i.e. The Collins) and a lexical database (i.e. WordNet). The other components of the proposed framework are general-purpose.
The parser (DIPETT) is domain-independent with a large coverage of English: our approach relies on full syntactic analysis. The Case-based semantic analyzer (HAIKU) is semi-automatic: it interacts with the user in order to get his [1] approval of the analysis it has just proposed and negotiates refined elements of the analysis when necessary. The combined processing of DIPETT and HAIKU allows TANKA, the encompassing system [2], to acquire knowledge, based on the conceptual elements produced by HAIKU. The thesis also describes experiments that have been conducted on a Prolog implementation of both of these text analysis components. The approach presented in the thesis is general and in principle portable to any domain in which suitable technical texts are available. The thesis presents theoretical considerations as well as engineering aspects of the many facets of this research work. We also provide a detailed discussion of many future work items that could be added to what has already been accomplished in order to make the framework even more productive. (Abstract shortened by UMI.)
[1] In order to lighten the text, the terms 'he' and 'his' have been used generically to refer equally to persons of either sex. No discrimination is either implied or intended.
[2] DIPETT and HAIKU constitute a conceptual analyzer that can be used independently of TANKA or within a different encompassing system.
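The interaction the abstract describes (a parser feeding a semi-automatic analyzer whose proposals the user approves or refines, with accepted results accumulating in a knowledge base that starts empty) can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the thesis implementation is in Prolog, and every name here (`KnowledgeBase`, `toy_parse`, `analyze`, `scripted_user`), the three-word sentence format, and the role labels "agent" and "object" are hypothetical and not taken from DIPETT, HAIKU, or TANKA.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical names throughout; nothing here is taken from DIPETT, HAIKU, or TANKA.
Slot = Tuple[str, str]                                   # (syntactic slot, filler word)
Confirm = Callable[[str, str, str, Optional[str]], str]  # returns the accepted role


@dataclass
class KnowledgeBase:
    """Starts empty and grows as the user accepts analyses."""
    patterns: Dict[Tuple[str, str], str] = field(default_factory=dict)
    facts: List[Tuple[str, str, str]] = field(default_factory=list)


def toy_parse(sentence: str) -> Tuple[str, List[Slot]]:
    """Stand-in for a full parser: accepts only 'subject verb object.'"""
    subject, verb, obj = sentence.rstrip(".").split()
    return verb, [("subject", subject), ("object", obj)]


def analyze(sentence: str, kb: KnowledgeBase, confirm: Confirm) -> int:
    """Propose a role per slot, let the user approve or correct it, store the result.

    Returns how many proposals the system could make from earlier sentences.
    """
    verb, slots = toy_parse(sentence)
    proposed = 0
    for slot, filler in slots:
        proposal = kb.patterns.get((verb, slot))  # None if this pattern is new
        if proposal is not None:
            proposed += 1
        role = confirm(verb, slot, filler, proposal)  # the user has the last word
        kb.patterns[(verb, slot)] = role
        kb.facts.append((verb, role, filler))
    return proposed


def scripted_user(verb: str, slot: str, filler: str, proposal: Optional[str]) -> str:
    """Simulated user: approves any proposal, otherwise supplies a role label."""
    if proposal is not None:
        return proposal
    return {"subject": "agent", "object": "object"}[slot]


kb = KnowledgeBase()
for text in ["pump moves water.", "valve moves oil."]:
    print(text, "->", analyze(text, kb, scripted_user), "roles proposed by the system")
```

In this toy setting the system has nothing to propose for the first sentence and relies on the user; for the second sentence it reuses the patterns accepted earlier, which is the sense in which knowledge is acquired incrementally without a prior domain model.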
Citation
Source: Dissertation Abstracts International, Volume: 56-01, Section: B, page: 0338.
