Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning
| dc.contributor.author | Hathurusinghe, Rajitha | |
| dc.contributor.supervisor | Bolic, Miodrag | |
| dc.contributor.supervisor | Nejadgholi, Isar | |
| dc.date.accessioned | 2020-09-16T17:51:32Z | |
| dc.date.available | 2020-09-16T17:51:32Z | |
| dc.date.issued | 2020-09-16 | en_US |
| dc.description.abstract | This thesis explores training a deep neural network based named entity recognizer in an end-to-end privacy-preserved setting, where dataset creation and model training happen with minimal manual intervention. As the accuracy of deep learning models on practical tasks improves, a rising concern is meeting these models' demand for training data amid concerns about data privacy. Several data protection schemes have been suggested in the recent past in response to public concern, and legal guidelines have followed to enforce them. A promising new development is decentralized model training on isolated datasets, which eliminates the privacy compromise of handing data to a centralized entity. However, in this federated setting, curating the data source is still a privacy risk, especially for unstructured data such as text. We explore the feasibility of automatically annotating a dataset for a Named Entity Recognition (NER) task and training a deep learning model on it in two federated learning settings. We also explore the feasibility of using a dataset created in this manner to fine-tune a state-of-the-art deep learning language model for the downstream task of named entity recognition, and examine how this novel combination of a deep learning NLP model and federated learning deviates from the classical centralized setting. We created an automatically annotated dataset containing around 80,000 sentences, a manually annotated test set, and tools to extend the dataset with more manual annotations. We observed that the noise from automated annotation can be overcome to some extent by increasing the dataset size. We also contributed state-of-the-art NLP model developments to the federated learning framework. Overall, our NER model achieved an F1-score of around 0.80 for recognizing entities in sentences. | en_US |
| dc.identifier.uri | http://hdl.handle.net/10393/41011 | |
| dc.identifier.uri | http://dx.doi.org/10.20381/ruor-25235 | |
| dc.language.iso | en | en_US |
| dc.publisher | Université d'Ottawa / University of Ottawa | en_US |
| dc.subject | Federated Learning | en_US |
| dc.subject | Named Entity Recognition | en_US |
| dc.subject | BERT | en_US |
| dc.subject | Transformer based NLP | en_US |
| dc.subject | NLP | en_US |
| dc.subject | NER | en_US |
| dc.subject | Deep learning | en_US |
| dc.subject | Privacy | en_US |
| dc.subject | Machine learning | en_US |
| dc.title | Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning | en_US |
| dc.type | Thesis | en_US |
| thesis.degree.discipline | Génie / Engineering | en_US |
| thesis.degree.level | Masters | en_US |
| thesis.degree.name | MASc | en_US |
| uottawa.department | Science informatique et génie électrique / Electrical Engineering and Computer Science | en_US |
