
Suicide Ideation Detection from Social Media Using Language Models: Data Augmentation and Interpretability

dc.contributor.author: Ghanadian, Hamideh
dc.contributor.supervisor: Al Osman, Hussein
dc.contributor.supervisor: Nejadgholi, Isar
dc.date.accessioned: 2026-05-07T17:35:25Z
dc.date.available: 2026-05-07T17:35:25Z
dc.date.issued: 2026-05-07
dc.description.abstract: Early detection of suicide risk is a vital research area with great potential for facilitating early prevention and intervention by mental health professionals. With accurate and reliable detection of suicide ideation, targeted interventions can be developed to reduce suicide rates and provide better support for at-risk individuals. While traditional methods of identifying individuals at risk of suicide have relied primarily on clinical assessments and crisis hotlines, the ubiquity of social media platforms has opened new avenues for early detection and intervention, as many individuals at risk of suicide may express suicidal ideation in their social media interactions. However, developing models for suicide detection on social media is a challenging area of research, primarily because of ethical and practical issues in data collection and annotation. In this work, we investigate the potential and limitations of Large Language Models (LLMs) in addressing data quality and accessibility issues in suicide detection on social media. First, we explore the capabilities of state-of-the-art generative LLMs as zero-shot or few-shot alternatives to classifiers trained on annotated datasets. Our evaluation of the ChatGPT system underscores the limitations of this model in detecting suicide notes and highlights the necessity of high-quality training datasets for fine-tuning specialized classifiers for this task. We then assess the quality of existing datasets collected from social media, seeking to uncover the extent to which they mirror or diverge from conventional psychological understanding of suicide-related topics. We ground our evaluation of the datasets in established psychological literature by identifying risk factors linked to suicide, such as mental health challenges, relationship conflicts, and financial distress.
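The few-shot setup mentioned above could be sketched, in outline only, as prompt construction for a generative LLM. The label names, instruction wording, and example posts below are illustrative placeholders, not the thesis's actual prompts, and the model call itself (e.g. to ChatGPT) is omitted:

```python
# Minimal sketch of building a few-shot classification prompt for suicidal
# ideation detection. Labels and examples are hypothetical illustrations;
# the actual LLM API call is intentionally left out.
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Assemble an instruction, labelled examples, and an unlabelled query."""
    lines = ["Classify each post as 'ideation' or 'no-ideation'.", ""]
    for text, label in examples:
        lines.append(f"Post: {text}\nLabel: {label}\n")
    lines.append(f"Post: {query}\nLabel:")  # model completes the final label
    return "\n".join(lines)
```

In a zero-shot variant, the `examples` list would simply be empty, leaving only the instruction and the query.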
Employing a guided topic modelling technique, we identify the distribution of mentions of these risk factors in existing datasets. Our results demonstrate that while surface-level risk factors such as depression and anxiety dominate the topics of these datasets, more stigmatized topics, such as racism, immigration challenges, or sexual-orientation prejudice, are entirely absent. These results highlight the necessity of creating more diverse datasets that cover the risk factors of under-represented social groups. Next, we address the topic-coverage gaps in training datasets. Acknowledging that the sensitivity surrounding suicide-related data makes it difficult to access diverse real-world examples, we introduce a strategy that leverages the capabilities of generative AI models, such as GPT models, Flan-T5, and Llama 2, to create synthetic data for suicidal ideation detection. Our data generation approach is grounded in social factors extracted from the psychology literature and aims to ensure coverage of essential information related to suicidal ideation. Our comparison of synthetic and real data shows that synthetic data is more balanced in terms of risk-factor coverage, is not significantly different from real data in complexity and readability, but is significantly less diverse in vocabulary. We then study the impact of psychology-grounded synthetic data on both the performance and the internal representations of suicide-detection models. We first use the generated synthetic data as standalone training data and as an augmentation source for fine-tuning BERT-family models for suicidal ideation detection. Our results show that synthetic datasets generated across multiple large language models enable strong generalization to real-world data, achieving an F-score of 82% when evaluated on held-out real samples.
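The guided topic-modelling step could be sketched, under heavy simplification, as seed-keyword matching: each risk factor is anchored by a small list of seed words, and a post counts toward the factor whose seeds it mentions most. The seed lists and example posts below are illustrative stand-ins, not the thesis's actual lexicons or corpora:

```python
# Sketch of guided (seeded) topic assignment for measuring risk-factor
# coverage in a dataset. Seed lists are hypothetical illustrations.
from collections import Counter

SEED_TOPICS = {
    "mental_health": {"depression", "anxiety", "hopeless", "worthless"},
    "relationships": {"breakup", "divorce", "lonely", "abandoned"},
    "financial": {"debt", "unemployed", "eviction", "bankrupt"},
}

def assign_topic(post: str) -> str:
    """Return the risk factor with the most seed-word hits, or 'other'."""
    tokens = [tok.strip(".,!?") for tok in post.lower().split()]
    hits = Counter()
    for topic, seeds in SEED_TOPICS.items():
        hits[topic] = sum(tok in seeds for tok in tokens)
    topic, count = hits.most_common(1)[0]
    return topic if count > 0 else "other"

def topic_distribution(posts: list[str]) -> Counter:
    """Distribution of risk-factor mentions across a dataset of posts."""
    return Counter(assign_topic(p) for p in posts)
```

A real guided topic model would work over learned topic-word distributions rather than raw keyword counts, but the output — a per-dataset distribution of risk-factor mentions — has the same shape.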
Moreover, augmenting this synthetic data with only 30% of the real dataset yields models that outperform those trained exclusively on the full real dataset, demonstrating a cost-effective strategy for improving performance while mitigating topic imbalance. Finally, we examine how topic-aware data augmentation influences the internal representations learned by these models. Using sparse autoencoders and geometric analyses, including UMAP projections and cosine-distance measurements, we analyze whether psychologically meaningful risk factors are encoded as more distinct and separable directions in the models' latent spaces. Our findings indicate that augmentation not only improves predictive performance but also leads to more structured and interpretable internal representations, with several previously under-represented risk factors becoming more clearly encoded. Together, these results highlight the value of combining synthetic data generation with representation-level analysis to develop more reliable and transparent models for suicidal ideation detection.
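The cosine-distance portion of the representation analysis described above could be sketched as follows. The vectors here are toy stand-ins for latent directions; the real analysis would operate on model activations and sparse-autoencoder features (and use UMAP for projection), none of which are reproduced here:

```python
# Sketch of measuring how separable risk-factor directions are in a latent
# space: average pairwise cosine distance over a set of direction vectors.
# Higher mean distance suggests factors are encoded as more distinct directions.
import math

def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 minus cosine similarity; larger means more separated directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(directions: dict[str, list[float]]) -> float:
    """Average cosine distance over all pairs of named risk-factor directions."""
    names = list(directions)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    total = sum(cosine_distance(directions[a], directions[b]) for a, b in pairs)
    return total / len(pairs)
```

Comparing this statistic before and after augmentation is one simple way to quantify the claim that under-represented risk factors become more clearly encoded: nearly collinear directions yield a mean distance near 0, orthogonal ones a mean distance of 1.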
dc.identifier.uri: http://hdl.handle.net/10393/51616
dc.identifier.uri: https://doi.org/10.20381/ruor-31919
dc.language.iso: en
dc.publisher: Université d'Ottawa | University of Ottawa
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Suicide Ideation Detection
dc.subject: Large Language Models (LLMs)
dc.subject: Model Interpretability
dc.subject: Synthetic Data Generation
dc.subject: Data Augmentation
dc.subject: Topic Modeling
dc.subject: Social Media Analysis
dc.title: Suicide Ideation Detection from Social Media Using Language Models: Data Augmentation and Interpretability
dc.type: Thesis
thesis.degree.discipline: Génie / Engineering
thesis.degree.level: Doctoral
thesis.degree.name: PhD
uottawa.department: Science informatique et génie électrique / Electrical Engineering and Computer Science

Files

Original bundle

Name: Ghanadian_Hamideh_2026_thesis.pdf
Size: 9.58 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.51 KB
Format: Item-specific license agreed upon to submission