1. Introduction
Anomalies and outliers are closely related concepts in machine learning. Detecting and handling instances that deviate from the norm plays an essential role in building a robust and efficient model.
On top of cleaning the dataset before the training phase, we apply anomaly detection techniques to find unusual instances. Some well-known use cases are fraud detection in finance, defect detection in manufacturing, intrusion detection in IT security, and condition monitoring in healthcare.
Anomaly detection is also closely related to novelty detection and noise removal. All of these areas focus on finding and handling data instances that differ from the rest.
Most of the time, momentary errors cause anomalies and novelties. However, abnormal values can also be genuine observations rather than errors. Furthermore, the underlying patterns within the data can change over time.
This tutorial will focus on some fundamental terms in detecting unusual observations and shifts in trends.
2. Drift
When training a model, we assume that the data and the environment are stationary. So, we build a model to predict the outcomes based on the examples observed under certain conditions. But, as the circumstances change, the behavior of the data evolves as well.
As time passes, we may observe changes in the relationship between features, or in the underlying distribution or structure of the data. We call this drift.
Model decay describes how quickly a model’s accuracy decreases over time. It’s affected by the drift rate.
There are two types of drift:
- Concept drift
- Data drift
Data drift happens when the input data is affected in an unforeseen way. For example, changing the data collection logic or introducing a new category causes data drift. To sum up, data drift happens when the input data distribution changes.
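One common way to check for a change in the input distribution is to compare a feature's values at training time with its recent values in production. The sketch below is a minimal illustration, not a prescribed method: it computes the two-sample Kolmogorov-Smirnov statistic by hand and compares it with the large-sample critical value at the 5% significance level. The function names, the synthetic samples, and the choice of test are assumptions made for this example.

```python
import random


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    max_gap = 0.0
    while i < n and j < m:
        value = min(ref[i], cur[j])
        # Step past every observation equal to `value` in both samples.
        while i < n and ref[i] == value:
            i += 1
        while j < m and cur[j] == value:
            j += 1
        max_gap = max(max_gap, abs(i / n - j / m))
    return max_gap


def has_data_drift(reference, current, coefficient=1.358):
    """Flag drift when the KS statistic exceeds the large-sample critical value.

    coefficient=1.358 corresponds to a 5% significance level.
    """
    n, m = len(reference), len(current)
    critical = coefficient * ((n + m) / (n * m)) ** 0.5
    return ks_statistic(reference, current) > critical


# Hypothetical feature values: the production sample is shifted upward.
rng = random.Random(0)
training_sample = [rng.gauss(50, 5) for _ in range(500)]
production_sample = [rng.gauss(58, 5) for _ in range(500)]

print(has_data_drift(training_sample, production_sample))
```

In practice, a library routine such as `scipy.stats.ks_2samp` performs the same comparison and also returns a p-value.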
On the other hand, concept drift occurs when the output changes over time. As a result, the relationship between the input data and the target variable changes. There are three types of concept drift:
- Gradual drift
- Sudden drift
- Recurring drift
A gradual drift occurs when a set of minor variations accumulates over time. For example, the effects of climate change are minimal and hard to detect over short periods.
Conversely, a sudden drift changes the target value abruptly. For instance, the effects of the 2008 global financial crisis appeared abruptly in various economic variables.
There are also recurring drifts, where the changes follow a repeating pattern over time. As an example, ice cream sales rise and fall with the seasons every year.
To neutralize the effects of drift, we need to recalibrate the model or change the business process.
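Deciding when to recalibrate usually requires some form of monitoring. As a rough sketch, assuming that ground-truth labels eventually become available for past predictions, we can track accuracy over consecutive windows and flag the first window that falls well below the accuracy measured at deployment time. The function names, the baseline value, and the tolerance below are hypothetical.

```python
def windowed_accuracy(y_true, y_pred, window_size):
    """Accuracy of each consecutive, non-overlapping window of predictions."""
    scores = []
    for start in range(0, len(y_true) - window_size + 1, window_size):
        end = start + window_size
        hits = sum(t == p for t, p in zip(y_true[start:end], y_pred[start:end]))
        scores.append(hits / window_size)
    return scores


def first_decayed_window(scores, baseline, tolerance=0.1):
    """Index of the first window more than `tolerance` below `baseline`, or None."""
    for index, score in enumerate(scores):
        if baseline - score > tolerance:
            return index
    return None


# Made-up labels and predictions: the model degrades in the last window.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

scores = windowed_accuracy(y_true, y_pred, window_size=4)
decayed = first_decayed_window(scores, baseline=0.9, tolerance=0.2)
print(scores, decayed)
```

How steeply the windowed scores fall gives a rough picture of model decay, and the flagged window marks a reasonable point to consider recalibration.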
3. Anomaly
Anomalies are non-typical events that only happen under exceptional circumstances. For that reason, the term is often used interchangeably with outliers.
There are three types of anomalies:
- Point anomaly
- Contextual anomaly
- Collective anomaly
Point anomalies occur when an individual observation is abnormal compared to the rest of the observations in the dataset. Such an occurrence is generally easy to spot. For example, snow in July in the northern hemisphere is a point anomaly.
Contextual anomalies occur when a particular observation doesn’t fit in a specific context. For instance, having 100 visitors within an hour would be an anomaly for website A, but it would be typical for website B.
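To make the role of context concrete, the sketch below scores the same value against the history of each website separately. It's only an illustration: the visitor counts are made up, and the z-score threshold of 3 is a common convention, not a rule.

```python
from statistics import mean, stdev

# Hypothetical hourly visitor counts for two websites.
hourly_visitors = {
    "website_a": [8, 12, 10, 9, 11, 10, 13, 7],
    "website_b": [95, 110, 102, 98, 105, 90, 108, 100],
}


def contextual_z_score(value, context_history):
    """Z-score of `value` relative to past observations from the same context."""
    return (value - mean(context_history)) / stdev(context_history)


def is_contextual_anomaly(value, context, history=hourly_visitors, threshold=3.0):
    """True when `value` is far from what is typical for the given context."""
    return abs(contextual_z_score(value, history[context])) > threshold


print(is_contextual_anomaly(100, "website_a"))
print(is_contextual_anomaly(100, "website_b"))
```

With these numbers, 100 visitors is flagged for website A, whose history hovers around 10, but not for website B, whose history hovers around 100.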
Collective anomalies happen when a set of observations is non-typical with respect to the rest of the dataset. An example would be a website receiving only half of its usual traffic within a time range.
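The traffic example can be sketched with a sliding window: instead of judging each observation on its own, we judge the average of a run of consecutive observations. The function name, the counts, the baseline, and the 50% ratio are all assumptions chosen to mirror the example.

```python
def low_traffic_windows(counts, baseline, window_size, ratio=0.5):
    """Start indices of sliding windows whose mean is at most `ratio` * `baseline`."""
    flagged = []
    for start in range(len(counts) - window_size + 1):
        window = counts[start:start + window_size]
        if sum(window) / window_size <= ratio * baseline:
            flagged.append(start)
    return flagged


# Made-up hourly visitor counts with a usual level of about 100.
# The single dip at index 3 is isolated; the run from index 5 to 8 is sustained.
counts = [100, 96, 104, 40, 101, 52, 48, 50, 47, 99, 103]

flagged = low_traffic_windows(counts, baseline=100, window_size=4)
print(flagged)
```

Only the window covering the sustained drop is reported. The isolated dip doesn't pull any four-hour average down to half of the baseline, so it isn't treated as a collective anomaly.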
We can apply statistical methods such as the z-score or interquartile range analysis to detect anomalies in the dataset. Alternatively, we can train supervised learning algorithms like SVMs when labeled examples of anomalies are available. Or, we can turn to semi-supervised and unsupervised learning algorithms, such as isolation forests, clustering, and dimensionality reduction.
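The two statistical approaches can be sketched with the Python standard library alone. The cut-offs used here (3 standard deviations and 1.5 times the interquartile range) are common conventions rather than fixed rules, and the temperature readings are made up for the example.

```python
from statistics import mean, quantiles, stdev


def z_score_outliers(values, threshold=3.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Values lying more than `k` interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]


# Hypothetical temperature readings with one abnormal value.
readings = [21, 22, 20, 23, 21, 22, 20, 21, 23, 22,
            21, 22, 20, 23, 21, 22, 21, 22, 21, 60]

print(z_score_outliers(readings))
print(iqr_outliers(readings))
```

Both checks flag the reading of 60. The interquartile variant is usually less sensitive to the outlier itself, because a single extreme value barely moves the quartiles, whereas it inflates the mean and the standard deviation.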
We don’t need to recalibrate the model to eliminate the effects of an anomaly, as its existence doesn’t mean that our model is no longer valid.
4. Novelty
A novelty is a new observation that differs from the data the model was trained on.
In novelty detection, we start the training process with a preprocessed dataset from which all the outliers have been eliminated. Then, we train the detection algorithm to output whether a new observation fits with the training data or not. In this context, we also treat outliers as novelties.
We can think of novelty detection as a semi-supervised outlier detection process. In the case of novelties, we don’t change the model or the business process in response to them.
The One-Class SVM is a well-known novelty detection algorithm. It learns the frontier of observations in the initial dataset. Then, for each new observation, it decides whether the observation comes from the same distribution. As a result, the data points falling outside the frontier are marked as novelties.
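As an illustration, scikit-learn ships an implementation of this algorithm as `sklearn.svm.OneClassSVM`. The sketch below assumes scikit-learn and NumPy are installed. The synthetic training data and the parameter values (`nu`, `gamma`) are arbitrary choices for the example and would need tuning on real data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic "clean" training data: 200 points around the origin.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu bounds the fraction of training points allowed outside the frontier.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
detector.fit(X_train)

# One new observation near the training data and one far away from it.
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
labels = detector.predict(X_new)
print(labels)
```

`predict` returns +1 for observations inside the learned frontier and -1 for those outside it, so the second point, which lies far from the training data, is reported as a novelty.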
Novelty detection is widely adopted in online learning tasks, where new observations need to be classified as outliers or not in real time.
5. Comparison
Outliers are closely related to anomalies and novelties, as well as drift.
The term drift refers to a change in the data regime. It can result from either a change in the environment or a change in business logic. Drift in the data leads to a decrease in model accuracy, so we’ll need to update the model.
Drift describes a change in the data as a whole, whereas anomalies and novelties are individual data instances. However, all of them are closely related to outliers.
In simple terms, we can think of anomalies as unusual or unexpected data instances within a dataset. The term is often used interchangeably with outliers.
Similarly, novelties are also anomalies in data, but they only exist in new instances. They don’t reside in the original dataset.
The presence of outliers, anomalies, or novelties doesn’t imply a change in the underlying data distribution or regime. Hence, detecting them doesn’t invalidate a model, and there’s no need to change it.
6. Conclusion
In this article, we’ve learned about and compared some elementary concepts related to outliers in machine learning.
Firstly, we’ve defined drift and its types: concept and data drift. Then, we’ve given the meaning of anomaly and provided examples of its varieties: point, contextual, and collective anomaly. After that, we’ve described novelty.
Lastly, we’ve compared them to better understand their differences.