Deep Learning Seminar / April 22, 2021, 10:00 – 11:00
Representation of Categorical Variables for Machine Learning Based Anomaly Detection Using Embeddings
(Bachelor Thesis) Speaker: Malte Silbernagel (Fraunhofer ITWM, Department »Financial Mathematics«)
Abstract:
Most machine learning algorithms can only handle numerical data. Hence, categorical values must be encoded as numeric values that represent the original data. This thesis discusses a neural network that learns a mapping of the categorical values onto a two-dimensional manifold according to the neighborhood relationships between samples in the input space. As a byproduct of the learned mapping, a higher-dimensional embedding of the values is produced. The performance of the embedding and of the two-dimensional representation is then compared with the commonly used one-hot encoding.
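To illustrate the difference between the two kinds of encoding, here is a minimal Python sketch. It is not the network from the thesis: the category names and the numbers in `embedding_table` are made up for illustration, and in practice the embedding rows would be learned rather than written by hand.

```python
# Hypothetical example data: one categorical column with three distinct values.
categories = ["card", "cash", "transfer"]
index = {value: i for i, value in enumerate(categories)}


def one_hot(value):
    """Return a vector with a single 1 at the position of `value`."""
    vector = [0] * len(categories)
    vector[index[value]] = 1
    return vector


# Illustrative 2-D embedding table. These rows are made-up numbers standing in
# for values that a model would learn (assumption, not taken from the thesis).
embedding_table = [
    [0.9, 0.1],   # card
    [-0.7, 0.4],  # cash
    [0.8, 0.2],   # transfer
]


def embed(value):
    """Look up the dense vector stored for `value`."""
    return embedding_table[index[value]]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

Under one-hot encoding every pair of distinct categories is equally far apart, so the encoding carries no notion of similarity. In the embedding table above, `card` and `transfer` were deliberately placed close together, which is the kind of structure a learned embedding can express.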
This thesis proposes a neighborhood, Probability Hamming, whose embedding yields a more accurate classification of fraudulent and non-fraudulent data. Comparing the best scores of the different downstream classifiers, this method increases the accuracy by 3.66 percentage points over one-hot encoding.
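The abstract does not define the Probability Hamming neighborhood, so the following Python sketch shows only the plain Hamming distance between categorical samples, which is one common way to decide which samples count as neighbors. The sample data and the helper `nearest_neighbors` are hypothetical, and the probability-based variant from the thesis is not reproduced here.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length samples differ."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of features")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbors(samples, query_index, k):
    """Return the indices of the k samples closest to samples[query_index].

    Ties are broken by the lower index. Hypothetical helper for illustration.
    """
    query = samples[query_index]
    candidates = [
        (hamming_distance(query, sample), i)
        for i, sample in enumerate(samples)
        if i != query_index
    ]
    candidates.sort()
    return [i for _, i in candidates[:k]]


# Made-up transactions: (payment type, country, channel).
samples = [
    ("card", "DE", "online"),
    ("card", "FR", "online"),
    ("cash", "FR", "shop"),
    ("card", "DE", "shop"),
]

neighbors_of_first = nearest_neighbors(samples, 0, 2)
```

Here the first transaction differs from the second and the fourth in one feature each and from the third in all three features, so its two nearest neighbors are the second and the fourth.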