
Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning

dc.contributor.author: Hathurusinghe, Rajitha
dc.contributor.supervisor: Bolic, Miodrag
dc.contributor.supervisor: Nejadgholi, Isar
dc.date.accessioned: 2020-09-16T17:51:32Z
dc.date.available: 2020-09-16T17:51:32Z
dc.date.issued: 2020-09-16 [en_US]
dc.description.abstract: This thesis explores the training of a deep neural network based named entity recognizer in an end-to-end privacy-preserved setting, where dataset creation and model training happen in an environment with minimal manual intervention. As the accuracy of deep learning models on practical tasks improves, a rising concern is how to satisfy these models' demand for training data amid concerns about data privacy. Several data protection schemes have been proposed in recent years in response to public concern, along with legal guidelines to enforce them. A promising new development is decentralized model training on isolated datasets, which eliminates the privacy compromise involved in handing data to a centralized entity. However, in this federated setting, curating the data source is still a privacy risk, particularly for unstructured data sources such as text. We explore the feasibility of automatically annotating a dataset for a Named Entity Recognition (NER) task and training a deep learning model with it in two federated learning settings. We explore the feasibility of using a dataset created in this manner to fine-tune a state-of-the-art deep learning language model for the downstream task of named entity recognition. We also examine how this novel combination of a deep learning NLP model and federated learning deviates from the classical centralized setting. We created an automatically annotated dataset containing around 80,000 sentences, a manually annotated test set, and tools to extend the dataset with more manual annotations. We observed that the noise from automated annotation can be overcome to some extent by increasing the dataset size. We also contributed state-of-the-art NLP model developments to the federated learning framework. Overall, our NER model achieved an F1-score of around 0.80 for recognition of entities in sentences. [en_US]
dc.identifier.uri: http://hdl.handle.net/10393/41011
dc.identifier.uri: http://dx.doi.org/10.20381/ruor-25235
dc.language.iso: en [en_US]
dc.publisher: Université d'Ottawa / University of Ottawa [en_US]
dc.subject: Federated Learning [en_US]
dc.subject: Named Entity Recognition [en_US]
dc.subject: BERT [en_US]
dc.subject: Transformer based NLP [en_US]
dc.subject: NLP [en_US]
dc.subject: NER [en_US]
dc.subject: Deep learning [en_US]
dc.subject: Privacy [en_US]
dc.subject: Machine learning [en_US]
dc.title: Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning [en_US]
dc.type: Thesis [en_US]
thesis.degree.discipline: Génie / Engineering [en_US]
thesis.degree.level: Masters [en_US]
thesis.degree.name: MASc [en_US]
uottawa.department: Science informatique et génie électrique / Electrical Engineering and Computer Science [en_US]

Files

Original bundle

Name: Hathurusinghe_Rajitha_2020_thesis.pdf
Size: 2.41 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.65 KB
Format: Item-specific license agreed upon to submission