
Distributed Representations of Topics

dc.contributor.author: Angelov, Dimo
dc.contributor.supervisor: Inkpen, Diana
dc.date.accessioned: 2025-08-11T20:35:40Z
dc.date.available: 2025-08-11T20:35:40Z
dc.date.issued: 2025-08-11
dc.description.abstract: In an era where around 330 million terabytes of data are generated each day, effective methods for extracting knowledge and structure from this vast amount of information are crucial. Topic modeling is a technique for extracting themes, topics, and structure from large data sets, which allows the data to be organized, searched, and made sense of efficiently. It is a fundamental technique with many downstream uses, including information retrieval, recommender systems, content summarization, content tagging, and trend detection. Major challenges in topic modeling include finding the right resolution of topics, labeling topics, segmenting text by topic, evaluating topic model performance, and dealing with topic change over time. The most widely used methods for topic modeling are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis. Both are probabilistic generative models and, despite their popularity, have several weaknesses. To achieve optimal results, they often require the number of topics to be known in advance. They need custom stop-word lists, stemming, and lemmatization. Lastly, they model topics as distributions over a vocabulary, which necessarily makes uninformative words the most probable in a topic. Modern neural topic modeling approaches have tackled some of these problems, but none have solved all of them. We introduce distributed representations of topics, where topics are vectors in a semantic vector space. We redefine topics to be the most informationally representative of documents rather than an underlying distribution over a vocabulary. Our novel topic modeling approach uses contextual token embeddings of documents. It creates hierarchical topics, finds topic spans within documents, and labels topics with phrases rather than just words. We propose a density-based agglomerative clustering for semantic vector spaces, which is essential for building topic hierarchies.
Most previous topic modeling evaluation methods focus on topic coherence without assessing how well topics represent the documents specifically assigned to them, leaving a gap in topic model evaluation. To close this gap, we propose using BERTScore and topic information gain, alongside existing topic coherence measures, to evaluate topic coherence and how informative topics are of their underlying documents.
dc.identifier.uri: http://hdl.handle.net/10393/50745
dc.identifier.uri: https://doi.org/10.20381/ruor-31311
dc.language.iso: en
dc.publisher: Université d'Ottawa / University of Ottawa
dc.subject: topic modeling
dc.subject: representation learning
dc.subject: multi-vector document representation
dc.subject: topic segmentation
dc.subject: hierarchical topics
dc.subject: text embeddings
dc.title: Distributed Representations of Topics
dc.type: Thesis
thesis.degree.discipline: Sciences / Science
thesis.degree.level: Doctoral
thesis.degree.name: PhD
uottawa.department: Science informatique et génie électrique / Electrical Engineering and Computer Science
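The abstract describes topics as vectors in a semantic vector space, representative of the documents assigned to them and labeled with nearby terms. The following is a minimal sketch of that idea only, not the thesis's method: it uses tiny hand-made embeddings in place of contextual token embeddings, a fixed cluster assignment in place of the proposed density-based agglomerative clustering, and a hypothetical two-entry vocabulary for labeling.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy document embeddings (hypothetical; the thesis uses contextual token embeddings)
docs = normalize(np.array([
    [0.9, 0.1, 0.0],   # sports-like documents
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],   # finance-like documents
    [0.0, 0.8, 0.3],
]))

# Cluster assignments; in the thesis these would come from
# density-based agglomerative clustering of the embeddings
labels = np.array([0, 0, 1, 1])

# A topic vector here is the renormalized centroid of its documents' embeddings
topic_vectors = np.stack([
    normalize(docs[labels == k].mean(axis=0)) for k in np.unique(labels)
])

# Hypothetical candidate label embeddings in the same space
vocab = {"football": [1.0, 0.0, 0.0], "markets": [0.0, 1.0, 0.2]}
words = normalize(np.array(list(vocab.values())))

# Label each topic with the nearest candidate by cosine similarity
for k, t in enumerate(topic_vectors):
    best = int(np.argmax(words @ t))
    print(f"topic {k}: {list(vocab)[best]}")  # prints "topic 0: football", "topic 1: markets"
```

Because topics and documents live in the same vector space, the same nearest-neighbor machinery can rank documents for a topic, rank topics for a document, or merge similar topic vectors into a hierarchy.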

Files

Original bundle

Name:
Angelov_Dimo_2025_thesis.pdf
Size:
4.4 MB
Format:
Adobe Portable Document Format

License bundle

Name:
license.txt
Size:
6.65 KB
Description:
Item-specific license agreed upon to submission