Topical Structure in Long Informal Documents

dc.contributor.author: Kazantseva, Anna
dc.contributor.supervisor: Szpakowicz, Stanislaw
dc.date.accessioned: 2014-09-25T12:20:02Z
dc.date.available: 2014-09-25T12:20:02Z
dc.date.created: 2014
dc.date.issued: 2014
dc.degree.discipline: Génie / Engineering
dc.degree.level: doctorate
dc.degree.name: PhD
dc.description.abstract: This dissertation describes a research project concerned with establishing the topical structure of long informal documents. In this research, we place special emphasis on literary data, but also work with speech transcripts and several other types of data. It has long been acknowledged that discourse is more than a sequence of sentences but, for the purposes of many Natural Language Processing tasks, it is often modelled exactly in that way. In this dissertation, we propose a practical approach to modelling discourse structure, with an emphasis on it being computationally feasible and easily applicable. Instead of following one of the many linguistic theories of discourse structure, we attempt to model the structure of a document as a tree of topical segments. Each segment encapsulates a span that concentrates on a particular topic at a certain level of granularity. Each span can be further sub-segmented based on finer fluctuations of topic. The lowest (most refined) level of segmentation is individual paragraphs. In our model, each topical segment is described by a segment centre -- a sentence or a paragraph that best captures the contents of the segment. In this manner, the segmenter effectively builds an extractive hierarchical outline of the document. In order to achieve these goals, we use the framework of factor graphs and modify a recent clustering algorithm, Affinity Propagation, to perform hierarchical segmentation instead of clustering. While it is far from being a solved problem, topical text segmentation is not uncharted territory. The methods developed so far, however, perform least well where they are most needed: on documents that lack rigid formal structure, such as speech transcripts, personal correspondence or literature. The model described in this dissertation is geared towards dealing with just such types of documents.
In order to study how people create similar models of literary data, we built two corpora of topical segmentations, one flat and one hierarchical. Each document in these corpora is annotated for topical structure by 3-6 people. The corpora, the model of hierarchical segmentation and software for segmentation are the main contributions of this work.
dc.faculty.department: Science informatique et génie électrique / Electrical Engineering and Computer Science
dc.identifier.uri: http://hdl.handle.net/10393/31612
dc.identifier.uri: http://dx.doi.org/10.20381/ruor-6645
dc.language.iso: en
dc.publisher: Université d'Ottawa / University of Ottawa
dc.subject: Natural Language Processing
dc.subject: topic modelling
dc.subject: topical segmentation
dc.subject: discourse structure
dc.title: Topical Structure in Long Informal Documents
dc.type: Thesis
thesis.degree.discipline: Génie / Engineering
thesis.degree.level: Doctoral
thesis.degree.name: PhD
uottawa.department: Science informatique et génie électrique / Electrical Engineering and Computer Science

Files

Original bundle

Name: Kazantseva_Anna_2014_thesis.pdf
Size: 1.81 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 4.07 KB
Format: Item-specific license agreed upon to submission