Author: Crothers, Evan
Date: 2024-11-20
URI: http://hdl.handle.net/10393/49870
DOI: https://doi.org/10.20381/ruor-30696

Abstract: The Transformer neural network architecture has had an enormous impact on state-of-the-art language model performance across a wide range of tasks in the text domain. For large language models based on this architecture to be suitable for widespread use, it is critical to ensure that they are not abused for malicious purposes, that they are robust against adversarial attack, and that their behaviour is well understood. The rapid proliferation of user-friendly interfaces to generative language models, such as ChatGPT, highlights the pressing need to prevent abuse of large language models while improving the adversarial robustness of systems designed to detect machine-generated text. This thesis makes contributions towards these goals in several ways. We begin with an in-depth survey of the categories of malicious attacks associated with machine-generated text, a threat modelling exercise exploring the cybersecurity threats related to these attacks, and a comprehensive overview of detection methodologies, with recommendations to improve defenses. This work was featured by cybersecurity expert Bruce Schneier as "a solid grounding amongst all of the hype", and a talk on the paper was presented as part of the United Nations "AI for Good" speaker series. Second, we demonstrate a new technique that uses statistical features to augment Transformer-derived features, improving adversarial robustness in the detection of computer-generated text, an important problem for the detection of spam and disinformation and a setting where adversarial attacks are likely. Third, we determine the extent to which existing metrics for assessing machine-generated text align with subjective human assessment, identifying gaps between computational metrics and human judgement. Finally, we perform an in-depth assessment of how masking-based faithfulness measures are applied to Transformer text classifiers, demonstrating pitfalls in faithfulness-based model comparisons, investigating the underlying mechanisms that cause these issues to arise, and determining how reliance on such measures affects adversarial robustness and fairness.

Language: en
License: Attribution-NonCommercial-ShareAlike 4.0 International (http://creativecommons.org/licenses/by-nc-sa/4.0/)
Keywords: large language models; generative AI; fairness; explainability; adversarial attacks; cybersecurity; NLP; ethical AI; interpretability; neural networks; machine learning
Title: Large Language Models: Towards Safety, Robustness, and Understanding
Type: Thesis
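
As an illustrative aside on the second contribution described in the abstract, below is a minimal sketch of one way statistical features can be concatenated with Transformer-derived features to build a machine-generated text detector. This is not the thesis's actual implementation: the encoder (distilbert-base-uncased), the particular statistics (type-token ratio, mean word length, punctuation rate), and the logistic-regression classifier are all assumptions chosen for brevity.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# Assumed encoder; the thesis may use a different Transformer backbone.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.eval()


def statistical_features(text: str) -> np.ndarray:
    """A few simple hand-crafted statistics (illustrative choices only)."""
    words = text.split()
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    return np.array([
        len({w.lower() for w in words}) / n_words,   # type-token ratio
        sum(len(w) for w in words) / n_words,        # mean word length
        sum(c in ".,;:!?" for c in text) / n_chars,  # punctuation rate
    ])


def transformer_features(text: str) -> np.ndarray:
    """Mean-pooled last hidden state of the encoder."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0).numpy()


def combined_features(text: str) -> np.ndarray:
    """Concatenate learned and statistical features into one detector input."""
    return np.concatenate([transformer_features(text), statistical_features(text)])


# Toy usage: fit a detector on a tiny placeholder corpus.
texts = ["An example of human-written text.", "An example of machine-generated text."]
labels = [0, 1]  # 0 = human, 1 = machine
X = np.stack([combined_features(t) for t in texts])
detector = LogisticRegression(max_iter=1000).fit(X, labels)
```

The abstract's motivation for this combination is adversarial robustness: augmenting learned Transformer representations with statistical features is intended to make the detector harder to evade than one relying on the learned features alone.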
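
For the final contribution, the sketch below shows one common masking-based faithfulness measure, a comprehensiveness-style score (the drop in predicted-class probability when the highest-ranked tokens are masked out), applied to a Transformer text classifier. The model name, the occlusion-based token ranking, and the choice of k are illustrative assumptions; the thesis's exact measures and experimental setup may differ.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed off-the-shelf classifier; any fine-tuned Transformer text classifier works.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def class_prob(input_ids: torch.Tensor, target: int) -> float:
    """Probability the classifier assigns to the target class."""
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits
    return torch.softmax(logits, dim=-1)[0, target].item()


def comprehensiveness(text: str, k: int = 3) -> float:
    """Drop in predicted-class probability after masking the k most important tokens."""
    ids = tokenizer(text, return_tensors="pt", truncation=True)["input_ids"]
    with torch.no_grad():
        target = model(input_ids=ids).logits.argmax(dim=-1).item()
    base = class_prob(ids, target)

    # Occlusion-based importance: probability drop when each token is masked alone.
    scores = []
    for i in range(1, ids.shape[1] - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        scores.append((base - class_prob(masked, target), i))

    # Mask the k highest-scoring tokens together; the joint drop is the score.
    top = [i for _, i in sorted(scores, reverse=True)[:k]]
    masked = ids.clone()
    masked[0, top] = tokenizer.mask_token_id
    return base - class_prob(masked, target)


print(comprehensiveness("The movie was surprisingly good, despite its slow start."))
```

The abstract's caution applies here: because masking produces inputs unlike anything seen in training, scores of this kind can mislead when used to compare models, which is the class of pitfall the final contribution investigates.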