
Suicide Ideation Detection from Social Media Using Language Models: Data Augmentation and Interpretability

dc.contributor.author: Ghanadian, Hamideh
dc.contributor.supervisor: Al Osman, Hussein
dc.contributor.supervisor: Nejadgholi, Isar
dc.date.accessioned: 2026-05-07T17:35:25Z
dc.date.available: 2026-05-07T17:35:25Z
dc.date.issued: 2026-05-07
dc.description.abstract: Early detection of suicide risk is a vital research area with great potential for facilitating early prevention and intervention by mental health professionals. With accurate and reliable detection of suicide ideation, targeted interventions can be developed to reduce suicide rates and provide better support for at-risk individuals. While traditional methods of identifying individuals at risk of suicide have relied primarily on clinical assessments and crisis hotlines, the ubiquity of social media platforms has opened new avenues for early detection and intervention, as many individuals at risk of suicide may express suicidal ideation in their social media interactions. However, developing models for suicide detection on social media is a challenging area of research, primarily because of ethical and practical issues in data collection and annotation. In this work, we investigate the potential and limitations of Large Language Models (LLMs) in addressing data quality and accessibility issues in suicide detection on social media. First, we explore the capabilities of state-of-the-art generative LLMs as zero-shot or few-shot alternatives to classifiers trained on annotated datasets. Our evaluation of the ChatGPT system underscores the limitations of this model in detecting suicide notes and highlights the necessity of high-quality training datasets for fine-tuning specialized classifiers for this task. We then assess the quality of existing datasets collected from social media, seeking to uncover the extent to which they mirror or diverge from conventional psychological understanding of suicide-related topics. We ground our evaluation of the datasets in established psychological literature by identifying risk factors linked to suicide, such as mental health challenges, relationship conflicts, and financial distress.
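The few-shot setup mentioned above could be sketched, in outline only, as prompt construction for a generative LLM. The label names, instruction wording, and example posts below are illustrative placeholders, not the thesis's actual prompts, and the model call itself (e.g. to ChatGPT) is omitted:

```python
# Minimal sketch of building a few-shot classification prompt for suicidal
# ideation detection. Labels and examples are hypothetical illustrations;
# the actual LLM API call is intentionally left out.
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Assemble an instruction, labelled examples, and an unlabelled query."""
    lines = ["Classify each post as 'ideation' or 'no-ideation'.", ""]
    for text, label in examples:
        lines.append(f"Post: {text}\nLabel: {label}\n")
    lines.append(f"Post: {query}\nLabel:")  # model completes the final label
    return "\n".join(lines)
```

In a zero-shot variant, the `examples` list would simply be empty, leaving only the instruction and the query.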
Employing a guided topic modelling technique, we identify the distribution of mentions of these risk factors in existing datasets. Our results demonstrate that while surface-level risk factors such as depression and anxiety dominate the topics of these datasets, more stigmatized topics, such as racism, immigration challenges, or sexual-orientation prejudice, are entirely absent. These results highlight the necessity of creating more diverse datasets that cover the risk factors of under-represented social groups. Next, we address the topic-coverage gaps in training datasets. Acknowledging that the sensitivity surrounding suicide-related data makes it difficult to access diverse real-world examples, we introduce a strategy that leverages the capabilities of generative AI models, such as GPT models, Flan-T5, and Llama 2, to create synthetic data for suicidal ideation detection. Our data generation approach is grounded in social factors extracted from the psychology literature and aims to ensure coverage of essential information related to suicidal ideation. Our comparison of synthetic and real data shows that synthetic data is more balanced in terms of risk-factor coverage, is not significantly different from real data in complexity and readability, but is significantly less diverse in vocabulary. We then study the impact of psychology-grounded synthetic data on both the performance and the internal representations of suicide-detection models. We first use the generated synthetic data as standalone training data and as an augmentation source for fine-tuning BERT-family models for suicidal ideation detection. Our results show that synthetic datasets generated across multiple large language models enable strong generalization to real-world data, achieving an F-score of 82% when evaluated on held-out real samples.
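The guided topic-modelling step could be sketched, under heavy simplification, as seed-keyword matching: each risk factor is anchored by a small list of seed words, and a post counts toward the factor whose seeds it mentions most. The seed lists and example posts below are illustrative stand-ins, not the thesis's actual lexicons or corpora:

```python
# Sketch of guided (seeded) topic assignment for measuring risk-factor
# coverage in a dataset. Seed lists are hypothetical illustrations.
from collections import Counter

SEED_TOPICS = {
    "mental_health": {"depression", "anxiety", "hopeless", "worthless"},
    "relationships": {"breakup", "divorce", "lonely", "abandoned"},
    "financial": {"debt", "unemployed", "eviction", "bankrupt"},
}

def assign_topic(post: str) -> str:
    """Return the risk factor with the most seed-word hits, or 'other'."""
    tokens = [tok.strip(".,!?") for tok in post.lower().split()]
    hits = Counter()
    for topic, seeds in SEED_TOPICS.items():
        hits[topic] = sum(tok in seeds for tok in tokens)
    topic, count = hits.most_common(1)[0]
    return topic if count > 0 else "other"

def topic_distribution(posts: list[str]) -> Counter:
    """Distribution of risk-factor mentions across a dataset of posts."""
    return Counter(assign_topic(p) for p in posts)
```

A real guided topic model would work over learned topic-word distributions rather than raw keyword counts, but the output — a per-dataset distribution of risk-factor mentions — has the same shape.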
Moreover, augmenting this synthetic data with only 30% of the real dataset yields models that outperform those trained exclusively on the full real dataset, demonstrating a cost-effective strategy for improving performance while mitigating topic imbalance. Finally, we examine how topic-aware data augmentation influences the internal representations learned by these models. Using sparse autoencoders and geometric analyses, including UMAP projections and cosine-distance measurements, we analyze whether psychologically meaningful risk factors are encoded as more distinct and separable directions in the models' latent spaces. Our findings indicate that augmentation not only improves predictive performance but also leads to more structured and interpretable internal representations, with several previously under-represented risk factors becoming more clearly encoded. Together, these results highlight the value of combining synthetic data generation with representation-level analysis to develop more reliable and transparent models for suicidal ideation detection.
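The cosine-distance portion of the representation analysis described above could be sketched as follows. The vectors here are toy stand-ins for latent directions; the real analysis would operate on model activations and sparse-autoencoder features (and use UMAP for projection), none of which are reproduced here:

```python
# Sketch of measuring how separable risk-factor directions are in a latent
# space: average pairwise cosine distance over a set of direction vectors.
# Higher mean distance suggests factors are encoded as more distinct directions.
import math

def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 minus cosine similarity; larger means more separated directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(directions: dict[str, list[float]]) -> float:
    """Average cosine distance over all pairs of named risk-factor directions."""
    names = list(directions)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    total = sum(cosine_distance(directions[a], directions[b]) for a, b in pairs)
    return total / len(pairs)
```

Comparing this statistic before and after augmentation is one simple way to quantify the claim that under-represented risk factors become more clearly encoded: nearly collinear directions yield a mean distance near 0, orthogonal ones a mean distance of 1.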
dc.identifier.uri: http://hdl.handle.net/10393/51616
dc.identifier.uri: https://doi.org/10.20381/ruor-31919
dc.language.iso: en
dc.publisher: Université d'Ottawa | University of Ottawa
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Suicide Ideation Detection
dc.subject: Large Language Models (LLMs)
dc.subject: Model Interpretability
dc.subject: Synthetic Data Generation
dc.subject: Data Augmentation
dc.subject: Topic Modeling
dc.subject: Social Media Analysis
dc.title: Suicide Ideation Detection from Social Media Using Language Models: Data Augmentation and Interpretability
dc.type: Thesis
thesis.degree.discipline: Génie / Engineering
thesis.degree.level: Doctoral
thesis.degree.name: PhD
uottawa.department: Science informatique et génie électrique / Electrical Engineering and Computer Science

Files

Original bundle

Name: Ghanadian_Hamideh_2026_thesis.pdf
Size: 9.58 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.51 KB
Format: Item-specific license agreed upon to submission