Topical Structure in Long Informal Documents

Kazantseva, Anna

dc.contributor.author	Kazantseva, Anna
dc.contributor.supervisor	Szpakowicz, Stanislaw
dc.date.accessioned	2014-09-25T12:20:02Z
dc.date.available	2014-09-25T12:20:02Z
dc.date.created	2014
dc.date.issued	2014
dc.degree.discipline	Génie / Engineering
dc.degree.level	doctorate
dc.degree.name	PhD
dc.description.abstract	This dissertation describes a research project concerned with establishing the topical structure of long informal documents. We place special emphasis on literary data, but also work with speech transcripts and several other types of data. It has long been acknowledged that discourse is more than a sequence of sentences, yet for the purposes of many Natural Language Processing tasks it is often modelled in exactly that way. We propose a practical approach to modelling discourse structure, with an emphasis on computational feasibility and ease of application. Instead of following one of the many linguistic theories of discourse structure, we model the structure of a document as a tree of topical segments. Each segment encapsulates a span that concentrates on a particular topic at a certain level of granularity. Each span can be further sub-segmented based on finer fluctuations of topic. The lowest (most refined) level of segmentation consists of individual paragraphs. In our model, each topical segment is described by a segment centre: a sentence or a paragraph that best captures the contents of the segment. In this manner, the segmenter effectively builds an extractive hierarchical outline of the document. To achieve these goals, we use the framework of factor graphs and modify a recent clustering algorithm, Affinity Propagation, to perform hierarchical segmentation instead of clustering. While it is far from being a solved problem, topical text segmentation is not uncharted territory. The methods developed so far, however, perform worst where they are most needed: on documents that lack rigid formal structure, such as speech transcripts, personal correspondence or literature. The model described in this dissertation is geared towards dealing with just such types of documents.
To study how people create similar models of literary data, we built two corpora of topical segmentations, one flat and one hierarchical. Each document in these corpora is annotated for topical structure by 3-6 people. The corpora, the model of hierarchical segmentation and the segmentation software are the main contributions of this work.
dc.faculty.department	Science informatique et génie électrique / Electrical Engineering and Computer Science
dc.identifier.uri	http://hdl.handle.net/10393/31612
dc.identifier.uri	http://dx.doi.org/10.20381/ruor-6645
dc.language.iso	en
dc.publisher	Université d'Ottawa / University of Ottawa
dc.subject	Natural Language Processing
dc.subject	topic modelling
dc.subject	topical segmentation
dc.subject	discourse structure
dc.title	Topical Structure in Long Informal Documents
dc.type	Thesis
thesis.degree.discipline	Génie / Engineering
thesis.degree.level	Doctoral
thesis.degree.name	PhD
uottawa.department	Science informatique et génie électrique / Electrical Engineering and Computer Science

Files

Original bundle

Showing items 1 - 1 of 1

Name: Kazantseva_Anna_2014_thesis.pdf
Size: 1.81 MB
Format: Adobe Portable Document Format

License bundle

Showing items 1 - 1 of 1

Name: license.txt
Size: 4.07 KB
Format: Item-specific license agreed to upon submission

Collections

- Thèses, 2011 - // Theses, 2011 -