Leveraging Metadata to Detect Phishing Where it Spreads

Publisher

Université d'Ottawa | University of Ottawa

Abstract

Phishing is one of the most persistent and evolving attacks in cybersecurity. Attackers exploit legitimate domains and a range of hosting and third-party platforms, which makes detection challenging. Many existing proactive detection methods still focus on identifying newly registered domains, often by monitoring data sources that signal suspicious domain activity. Although these approaches are effective at capturing a subset of attacks, specifically those involving attacker-owned domains, they leave open the question of how many phishing campaigns actually rely on attacker-controlled infrastructure rather than on compromised or third-party platforms. If attacker-owned domains are not the dominant category, then relying only on infrastructure-level data sources may miss a large portion of phishing activity. In such cases, effective detection needs to focus on other dimensions of phishing behavior, such as how campaigns propagate across platforms and how users interact with them. We begin with a phishing domain ownership taxonomy that distinguishes between attacker-owned domains, compromised domains, and third-party platforms. Using our large-scale dataset, PhishXtract, collected over a year from multiple reporting feeds, we analyze how attackers leverage these different infrastructures. Our findings show that most phishing attacks are not hosted on attacker-owned domains. Instead, attackers often abuse legitimate infrastructure and third-party platforms to create their campaigns, which makes infrastructure-level indicators alone insufficient for detection. To address this limitation, and recognizing the growing role of social media, we investigate social media as a channel of phishing propagation and explore its potential as an alternative basis for detection. These platforms are not only channels where phishing links spread but also rich sources of data, ranging from message content to metadata and propagation patterns, that can be used to identify phishing activity at an early stage.
By focusing on these signals, we use our TelePhish dataset to develop a lightweight approach that detects phishing at the message level without relying on infrastructure-based indicators. Finally, we compare this lightweight, propagation-based model against large language models (LLMs) to assess whether advanced text-driven approaches outperform metadata-driven approaches on social media. Our findings suggest that the lightweight, propagation-based models perform better for near real-time phishing detection and offer a balance between effectiveness and efficiency.

Keywords

Machine Learning, Phishing, Large Language Models, Telegram, Proactive Phishing Detection
