One type of this unusual data is the »global anomaly«. Global anomalies differ markedly in structure from the other observations in our data. The figure above visualizes them: the points labeled X1 and X2 lie far away from all other points.
Another type is the »local anomaly«. Local anomalies lie close to clusters of normal data, i.e. collections of non-anomalous observations, but do not belong to these clusters. A third type is the cluster of anomalies. In the figure above, C2 could be such a cluster. Whether it really is a cluster of anomalies, or a cluster of normal data that simply occurs less frequently, depends on the use case.
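To make the difference between global and local anomalies more tangible, here is a minimal sketch. It is not the method behind the figure; it assumes a very simple score, the distance of each point to its k-th nearest neighbor, and uses made-up coordinates. The names `knn_distance_scores`, `cluster`, `local_anomaly` and `global_anomalies` are hypothetical and chosen for illustration only.

```python
import math


def knn_distance_scores(points, k=3):
    """Return, for each point, the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Made-up data: one dense cluster of normal points on a 9x9 grid in [-1, 1]^2.
cluster = [(x * 0.25, y * 0.25) for x in range(-4, 5) for y in range(-4, 5)]
local_anomaly = (2.0, 0.0)                       # near the cluster, but not in it
global_anomalies = [(10.0, 10.0), (-9.0, 8.0)]   # far away from everything

points = cluster + [local_anomaly] + global_anomalies
scores = knn_distance_scores(points, k=3)

cluster_max = max(scores[:len(cluster)])   # largest score among normal points
local_score = scores[len(cluster)]         # score of the local anomaly
global_scores = scores[len(cluster) + 1:]  # scores of the two global anomalies

print("largest score inside the cluster:", round(cluster_max, 3))
print("score of the local anomaly:      ", round(local_score, 3))
print("scores of the global anomalies:  ", [round(s, 3) for s in global_scores])
```

In this toy setup the global anomalies receive by far the largest scores, while the local anomaly scores only moderately higher than the points inside the cluster. This is why local anomalies are usually harder to detect and are often scored relative to the density of their neighborhood rather than by an absolute distance.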
Anomaly detection is used in many applications, for example in fraud detection, where the anomalies correspond to cases of fraud. Other use cases include detecting false information in records from care providers or emails from criminals, detecting abnormal medical conditions or patient behavior, detecting potential errors or machine failures, and much more.
Why Are Low-Occurrence Features a Challenge for Anomaly Detection Algorithms?
What is the difference between rare values, which may simply be insignificant, and anomalies? From the algorithm's point of view there is none: a value of a feature that occurs significantly less often than the others is treated as an anomaly.
In our example, simply collecting the data and applying an algorithm to detect anomalies presents at least two challenges:
- Our data may contain a numeric column whose values are on different scales. In our example of the chain of stores selling Christmas trees to Germany, France and Hungary, the amounts are recorded in euros or forints. In the raw data, 50 percent of the sales went to Germany, 42 percent to France and only eight percent to Hungary. Because of the exchange rate between euro and forint, the numeric amount of an individual sale to Hungary is much higher than for Germany or France. This would be true even if the data were almost balanced. One might assume that converting the Hungarian amounts to euros solves this problem. Yet the same observations are still flagged as anomalies. Why?
- After the currency has been converted to euros, the algorithm no longer detects anomalies caused by different scales. However, the number of observations from Hungary is still much smaller than for Germany or France, which is why the system still recognizes this data as anomalous.
In both cases, the observations from Hungary make up only a small share of the data. Algorithms that look for global outliers, i.e. a small number of observations that are far away from the others, will therefore recognize them as anomalies. It also explains why Hungary is no longer recognized as anomalous only once the data is balanced and uses a uniform currency.
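The two challenges can be reproduced with a small sketch. It is an illustration under stated assumptions, not the algorithm used in our example: the prices are synthetic, the rate `HUF_PER_EUR = 400` is an assumed round number, and a plain z-score with a threshold of 2.5 plus a 10 percent rarity cut-off stand in for a real anomaly detection algorithm. Only the shares of 50, 42 and 8 percent come from the text.

```python
from collections import Counter
from statistics import mean, pstdev

HUF_PER_EUR = 400.0  # assumed round exchange rate, for illustration only

# Synthetic sales matching the shares in the text: 50 / 42 / 8 out of 100.
counts = {"Germany": 50, "France": 42, "Hungary": 8}
sales = []  # list of (country, amount in local currency)
for country, n in counts.items():
    for i in range(n):
        price_eur = 25.0 + (i % 11)  # made-up prices between 25 and 35 euros
        factor = HUF_PER_EUR if country == "Hungary" else 1.0
        sales.append((country, price_eur * factor))


def zscore_flags(values, threshold=2.5):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [abs(v - mu) / sigma > threshold for v in values]


# Challenge 1: mixed currencies in one numeric column.
raw_amounts = [amount for _, amount in sales]
raw_flags = zscore_flags(raw_amounts)
flagged_raw = Counter(c for (c, _), flag in zip(sales, raw_flags) if flag)

# After converting forints to euros, the scale problem disappears.
eur_amounts = [a / HUF_PER_EUR if c == "Hungary" else a for c, a in sales]
eur_flags = zscore_flags(eur_amounts)

# Challenge 2: Hungary is still a rare value of the country feature.
shares = {c: n / len(sales) for c, n in Counter(c for c, _ in sales).items()}
rare_countries = [c for c, share in shares.items() if share < 0.10]

print("flagged on raw amounts:      ", dict(flagged_raw))
print("flagged on converted amounts:", sum(eur_flags))
print("countries below 10% share:   ", rare_countries)
```

On the raw amounts, only the Hungarian sales are flagged. After the conversion, the amount column no longer singles out any sale, but the country feature still marks Hungary as rare, which is the kind of signal that leads a detector to keep treating these observations as anomalous.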
We hope that anomaly detection algorithms are well implemented and that you can get your Christmas tree in the store of your choice on fair terms.