Angelov, Dimo
2025-08-11
http://hdl.handle.net/10393/50745
https://doi.org/10.20381/ruor-31311

In an era where around 330 million terabytes of data are generated each day, it is crucial to have effective methods for extracting knowledge and structure from this vast amount of information. Topic modeling is a technique for extracting themes, topics, and structure from large data sets, allowing the data to be organized, searched, and made sense of efficiently. It is a fundamental technique with many downstream uses in information retrieval, recommender systems, content summarization, content tagging, trend detection, and more. Major challenges of topic modeling include finding the right resolution of topics, labeling the topics, segmenting text by topic, evaluating topic model performance, and dealing with topic change over time. The most widely used methods for topic modeling are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis. Both are probabilistic generative models and, despite their popularity, they have several weaknesses. To achieve optimal results, they often require the number of topics to be known in advance. They need custom stop-word lists, stemming, and lemmatization. Lastly, they model topics as distributions over a vocabulary, which necessarily makes uninformative words the most probable words in a topic. Modern neural topic modeling approaches have tackled some of these problems, but none have solved all of them. We introduce distributed representations of topics, where topics are vectors in a semantic vector space. We redefine topics to be the most informationally representative of their documents rather than an underlying distribution over a vocabulary. Our novel topic modeling approach uses contextual token embeddings of documents. It creates hierarchical topics, finds topic spans within documents, and labels topics with phrases rather than just words.
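The idea of topics as vectors in a shared semantic space can be illustrated with a minimal sketch. This is not the thesis's implementation: the embeddings below are hand-made 2-D vectors purely for illustration, and a real system would use contextual embeddings from a pretrained language model. A topic vector is taken as the centroid of its documents' vectors and is labeled by its nearest word vectors.

```python
# Toy sketch (not the thesis's implementation): topics as vectors in the
# same semantic space as words and documents. All embeddings here are
# hand-made 2-D vectors purely for illustration.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical word vectors: two loose semantic groups (sports vs. finance),
# plus an uninformative word sitting between them.
words = {
    "goal":  [0.9, 0.1],  "match":  [0.8, 0.2],  "team": [0.85, 0.15],
    "stock": [0.1, 0.9],  "market": [0.2, 0.8],  "bond": [0.15, 0.85],
    "the":   [0.5, 0.5],
}
word_mat = normalize(np.array(list(words.values())))
vocab = list(words)

# Document vectors (e.g., pooled from their token embeddings).
docs = normalize(np.array([[0.88, 0.12], [0.82, 0.18],    # sports docs
                           [0.12, 0.88], [0.18, 0.82]]))  # finance docs

# A topic vector is the centroid of the documents assigned to it.
topic_sports = normalize(docs[:2].mean(axis=0))
topic_finance = normalize(docs[2:].mean(axis=0))

# Label each topic by its nearest words (cosine similarity). The
# uninformative "the" never ranks highest, unlike in a distribution
# over the vocabulary.
for name, topic in [("sports", topic_sports), ("finance", topic_finance)]:
    sims = word_mat @ topic
    top = [vocab[i] for i in np.argsort(-sims)[:3]]
    print(name, top)
```

Note how the semantic-space view sidesteps the weakness described above: a word like "the" is close to everything and therefore nearest to nothing, whereas under a distribution over the vocabulary it would be among the most probable words in every topic.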
We propose a density-based agglomerative clustering method for semantic vector spaces, which is essential for building topic hierarchies. Most previous topic model evaluation methods focus on topic coherence without evaluating how well topics represent the documents specifically assigned to them, leaving a gap in topic model evaluation. To close this gap, we propose using BERTScore to evaluate topic coherence and topic information gain to evaluate how informative topics are of the underlying documents, in addition to the existing topic coherence measures.

Language: en
Keywords: topic modeling; representation learning; multi-vector document representation; topic segmentation; hierarchical topics; text embeddings
Title: Distributed Representations of Topics
Type: Thesis
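The BERTScore evaluation mentioned above rests on greedy soft matching between two sets of token embeddings. The sketch below shows that matching scheme in isolation, applied here to compare hypothetical topic-word embeddings against the tokens of a document; the hand-made 2-D embeddings are stand-ins, since the actual metric uses contextual BERT embeddings (and optionally IDF weighting), and the thesis's exact evaluation setup is not reproduced here.

```python
# Toy sketch of BERTScore-style greedy matching between a topic's words
# and a document's tokens. Embeddings are hand-made 2-D vectors for
# illustration only; the real metric uses contextual BERT embeddings.
import numpy as np

def bertscore(cand, ref):
    """Greedy-matching precision/recall/F1 over unit-normalized embeddings."""
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sims = cand @ ref.T                  # pairwise cosine similarities
    precision = sims.max(axis=1).mean()  # best match for each candidate token
    recall = sims.max(axis=0).mean()     # best match for each reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

topic_words = np.array([[0.9, 0.1], [0.8, 0.2]])               # hypothetical topic words
doc_tokens = np.array([[0.85, 0.15], [0.7, 0.3], [0.6, 0.4]])  # hypothetical doc tokens
p, r, f = bertscore(topic_words, doc_tokens)
print(round(f, 3))
```

Because the matching is soft, a topic word is rewarded for being semantically close to document tokens even without an exact string match, which is what lets this style of metric judge how well a topic represents the documents assigned to it.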