Structural identification of unintelligible documents.

Authors: Fontaine, Martin.
Date: 2000
Abstract: This thesis addresses the identification of unintelligible documents using machine learning techniques. An unintelligible document is one that is not necessarily expressed in a natural language. We have developed a three-level approach comprising the representation of a base alphabet, textual feature extraction, and the induction of classification models using machine learning techniques. For the representation of the base alphabet, we introduce a clustering technique that reduces the size of the alphabet and increases the density of the training documents. For textual feature extraction, we present four techniques adapted to unintelligible documents, two of which we developed based on the concept of data compression. We explored two approaches to inducing an identification model: a rule-based system called RIPPER and an approach based on grammatical inference. To combine different complementary techniques, we developed an object-oriented framework that encourages code reuse. The presented techniques were tested in an experiment on three different domains: the identification of e-mail documents, the identification of BASE64-encoded documents, and the distinction between GIF images and BMP images.
Collection: Thèses, 1910 - 2010 // Theses, 1910 - 2010
File: MQ58452.PDF (5.35 MB, Adobe PDF)
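
The abstract mentions feature extraction based on data compression and a reduced base alphabet, but this record does not describe how the thesis implements either. The sketch below is therefore only an illustration of the general idea, not the author's method: it assumes zlib as a stand-in compressor and uses hypothetical function names (`compression_ratio`, `alphabet_size`, `extract_features`) to turn a raw byte string into two numeric features that a classifier could consume.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (assumption: zlib, level 9).

    Lower values indicate more redundancy in the document.
    """
    if not data:
        return 0.0
    return len(zlib.compress(data, 9)) / len(data)


def alphabet_size(data: bytes) -> int:
    """Number of distinct byte values that occur in the document."""
    return len(set(data))


def extract_features(data: bytes) -> dict:
    """Hypothetical feature vector for one document, keyed by feature name."""
    return {
        "compression_ratio": compression_ratio(data),
        "alphabet_size": alphabet_size(data),
    }
```

Under these assumptions, a BASE64-encoded document would show at most 65 distinct byte values (64 symbols plus the padding character) when encoded without line breaks, while highly repetitive text would show a very low compression ratio. Features of this kind could then be passed to a rule learner or another classifier.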