
Beyond the Boundaries of SMOTE: A Framework for Manifold-based Synthetic Oversampling

dc.contributor.author: Bellinger, Colin
dc.contributor.supervisor: Japkowicz, Nathalie
dc.contributor.supervisor: Drummond, Christopher
dc.date.accessioned: 2016-05-12T16:58:55Z
dc.date.available: 2016-05-12T16:58:55Z
dc.date.issued: 2016
dc.description.abstract: Within machine learning, the problem of class imbalance refers to the scenario in which one or more classes are significantly outnumbered by the others. In the most extreme case, the minority class is not only significantly outnumbered by the majority class, but it is also considered to be rare, or absolutely imbalanced. Class imbalance appears in a wide variety of important domains, ranging from oil spill and fraud detection to text classification and medical diagnosis. Given this, it has been deemed one of the ten most important research areas in data mining, and for more than a decade now the machine learning community has been coming together in an attempt to unequivocally solve the problem. The fundamental challenge in the induction of a classifier from imbalanced training data is in managing the prediction bias. The current state-of-the-art methods deal with this by readjusting misclassification costs or by applying resampling methods. In cases of absolute imbalance, these methods are insufficient; rather, it has been observed that we need more training examples. The nature of class imbalance, however, dictates that additional examples cannot be acquired, and thus synthetic oversampling becomes the natural choice. We recognize the importance of selecting algorithms with assumptions and biases that are appropriate for the properties of the target data, and argue that this is of absolute importance when it comes to developing synthetic oversampling methods, because a large generative leap must be made from a relatively small training set. In particular, our research into gamma-ray spectral classification has demonstrated the benefits of incorporating prior knowledge of conformance to the manifold assumption into the synthetic oversampling algorithms. We empirically demonstrate the negative impact of the manifold property on the state-of-the-art methods, and propose a framework for manifold-based synthetic oversampling.
We algorithmically present the generic form of the framework and demonstrate formalizations of it with PCA and the denoising autoencoder. Through use of the helix and swiss roll datasets, which are standards in the manifold learning community, we visualize and qualitatively analyze the benefits of our proposed framework. Moreover, we unequivocally show the framework to be superior on three real-world gamma-ray spectral datasets and on sixteen benchmark UCI datasets in general. Specifically, our results demonstrate that the framework for manifold-based synthetic oversampling produces higher area under the ROC curve results than the current state-of-the-art and degrades less on data that conforms to the manifold assumption. [en]
dc.identifier.uri: http://hdl.handle.net/10393/34643
dc.identifier.uri: http://dx.doi.org/10.20381/ruor-5841
dc.language.iso: en [en]
dc.publisher: Université d'Ottawa / University of Ottawa [en]
dc.subject: machine learning [en]
dc.subject: class imbalance [en]
dc.subject: synthetic oversampling [en]
dc.subject: manifold learning [en]
dc.title: Beyond the Boundaries of SMOTE: A Framework for Manifold-based Synthetic Oversampling [en]
dc.type: Thesis [en]
thesis.degree.discipline: Génie / Engineering [en]
thesis.degree.level: Doctoral [en]
thesis.degree.name: PhD [en]
uottawa.department: Science informatique et génie électrique / Electrical Engineering and Computer Science [en]
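
Illustrative sketch (not part of the repository record): the abstract names PCA as one formalization of the manifold-based oversampling framework, but this record does not give the algorithm itself. The Python sketch below shows one way such an oversampler might look, under stated assumptions: learn a linear manifold from the minority class with PCA, perturb the embedded points, and map the results back to feature space. The function name `pca_oversample`, its parameters, and the choice of Gaussian perturbation in the embedded space are hypothetical and should not be read as the thesis's actual method.

```python
import numpy as np


def pca_oversample(X_min, n_new, n_components, noise_scale=0.1, seed=0):
    """Generate synthetic minority samples near a PCA-estimated linear manifold.

    Hypothetical illustration only; the sampling rule is an assumption.
    """
    rng = np.random.default_rng(seed)

    # Centre the minority-class data before estimating principal directions.
    mean = X_min.mean(axis=0)
    X_centred = X_min - mean

    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    W = Vt[:n_components]                      # shape (k, d)

    # Embed the minority samples in the k-dimensional manifold space.
    Z = X_centred @ W.T                        # shape (n, k)

    # Assumed sampling rule: pick random embedded points and add Gaussian
    # noise scaled by each component's standard deviation.
    idx = rng.integers(0, len(Z), size=n_new)
    noise = rng.normal(0.0, noise_scale * Z.std(axis=0),
                       size=(n_new, n_components))
    Z_new = Z[idx] + noise

    # Map the synthetic embedded points back to the original feature space.
    return Z_new @ W + mean


# Example: minority points on a straight line in 3-D (a 1-D linear manifold).
t = np.linspace(0.0, 1.0, 20)
X_line = np.stack([t, 2 * t, 3 * t], axis=1)
synthetic = pca_oversample(X_line, n_new=50, n_components=1)
```

Because the noise is added in the embedded space rather than in feature space, the synthetic points in this sketch stay on the estimated manifold instead of spreading into the surrounding space. A nonlinear formalization, such as the denoising autoencoder mentioned in the abstract, would replace the linear embed and reconstruct steps.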

Files

Original bundle

Name: bellinger_colin_2016_thesis.pdf
Size: 2.41 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.65 KB
Format: Item-specific license agreed upon to submission