Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning

Hathurusinghe, Rajitha

Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning

dc.contributor.author	Hathurusinghe, Rajitha
dc.contributor.supervisor	Bolic, Miodrag
dc.contributor.supervisor	Nejadgholi, Isar
dc.date.accessioned	2020-09-16T17:51:32Z
dc.date.available	2020-09-16T17:51:32Z
dc.date.issued	2020-09-16	en_US
dc.description.abstract	This thesis explores the training of a deep neural network based named entity recognizer in an end-to-end privacy preserved setting where dataset creation and model training happen in an environment with minimal manual interventions. With the improvement of accuracy in Deep Learning Models for practical tasks, a rising concern is satisfying the demand for training data for these models amidst the concerns on the data privacy. Several scenarios of data protection are suggested in the recent past due to public concerns hence the legal guidelines to enforce them. A promising new development is the decentralized model training on isolated datasets, which eliminates the compromises of privacy upon providing data to a centralized entity. However, in this federated setting curating the data source is still a privacy risk mostly in unstructured data sources such as text. We explore the feasibility of automatic dataset annotation for a Named Entity Recognition (NER) task and training a deep learning model with it in two federated learning settings. We explore the feasibility of utilizing a dataset created in this manner for fine-tuning a stateof- the-art deep learning language model for the downstream task of named entity recognition. We also explore this novel setting of deep learning NLP model and federated learning for its deviation from the classical centralized setting. We created an automatically annotated dataset containing around 80,000 sentences, a manual human annotated test set and tools to extend the dataset with more manual annotations. We observed the noise from automated annotation can be overcome to a level by increasing the dataset size. We also contributed to the federated learning framework with state-of-the-art NLP model developments. Overall, our NER model achieved around 0.80 F1-score for recognition of entities in sentences.	en_US
dc.identifier.uri	http://hdl.handle.net/10393/41011
dc.identifier.uri	http://dx.doi.org/10.20381/ruor-25235
dc.language.iso	en	en_US
dc.publisher	Université d'Ottawa / University of Ottawa	en_US
dc.subject	Federated Learning	en_US
dc.subject	Named Entity Recognition	en_US
dc.subject	BERT	en_US
dc.subject	Transformer based NLP	en_US
dc.subject	NLP	en_US
dc.subject	NER	en_US
dc.subject	Deep learning	en_US
dc.subject	Privacy	en_US
dc.subject	Machine learning	en_US
dc.title	Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning	en_US
dc.type	Thesis	en_US
thesis.degree.discipline	Génie / Engineering	en_US
thesis.degree.level	Masters	en_US
thesis.degree.name	MASc	en_US
uottawa.department	Science informatique et génie électrique / Electrical Engineering and Computer Science	en_US

Fichiers

Trousse originale

Voici les éléments 1 - 1 sur 1

Nom:: Hathurusinghe_Rajitha_2020_thesis.pdf
Taille:: 2.41 MB
Format:: Adobe Portable Document Format
Description:

Télécharger

Trousse de licence

Voici les éléments 1 - 1 sur 1

Nom:: license.txt
Taille:: 6.65 KB
Format:: Item-specific license agreed upon to submission
Description:

Télécharger

Collections

- Thèses, 2011 - // Theses, 2011 -