Reduction of False Positives in Online Outlier Detection over Time Series using Ensemble Learning
Alaiñe Iturria Aguinaga
14/04/2023
- SUPERVISORS: Javier Del Ser Lorente and Francisco Herrera Triguero
- UNIVERSITY: Universidad de Granada
ABSTRACT
The identification of unusual patterns or anomalous features allows critical information to be extracted from data. With the development of the Internet of Things, real-time time series anomaly detection has become a relevant task in many domains, including fault detection in the manufacturing industry, intrusion detection in cybersecurity, and fraud detection in banking. Within this research area, online time series anomaly detection is more challenging than classical outlier detection for several reasons. First, a complete dataset is not available for training, so training must be performed incrementally. Second, every incoming data sample must be processed exactly once, without multiple passes through the entire dataset (i.e., one-pass learning). Third, the distribution of the data is non-stationary and can change over time (concept drift), which requires outlier detection methods to include adaptation or forgetting mechanisms. These constraints hinder the design of online methods that detect anomalies in time series data effectively under such challenging conditions.
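The three constraints above (incremental training, one-pass processing, and forgetting) can be illustrated with a minimal sketch. It is not one of the detectors studied in this Thesis: the class `ForgettingStats`, the function `detect`, and the parameters `alpha`, `threshold`, and `warmup` are hypothetical names chosen for illustration.

```python
import math


class ForgettingStats:
    """Exponentially weighted mean and variance, updated one sample at a time.

    Hypothetical helper for illustration only.
    """

    def __init__(self, alpha=0.05):
        self.alpha = alpha  # forgetting factor: larger values forget old data faster
        self.mean = 0.0
        self.var = 0.0
        self.n = 0          # number of samples seen so far

    def update(self, x):
        # Incremental training: the model is adjusted with a single sample.
        self.n += 1
        if self.n == 1:
            self.mean = x
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)

    def zscore(self, x):
        std = math.sqrt(self.var)
        return 0.0 if std == 0.0 else abs(x - self.mean) / std


def detect(stream, alpha=0.05, threshold=3.0, warmup=10):
    """Flag each sample once, using only statistics of earlier samples."""
    stats = ForgettingStats(alpha)
    flags = []
    for x in stream:  # one pass: every sample is visited exactly once
        is_anomaly = stats.n >= warmup and stats.zscore(x) > threshold
        flags.append(is_anomaly)
        stats.update(x)  # old samples fade out through the forgetting factor
    return flags
```

Each sample is scored before it is used for training, so nothing from the future leaks into the decision, and no past sample is ever stored.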
This Thesis investigates online anomaly detection in time series. In recent years, several online time series anomaly detection algorithms have been proposed. However, the literature study performed in this Thesis has identified two crucial shortcomings in the field: the scarcity of open-source software and the high rates of false positives and false negatives of state-of-the-art proposals.
To reduce false positives, the Thesis focuses on ensemble techniques. Ensembles are well suited to this problem because they reduce the dependence of the model on a particular dataset and allow individual detectors to compensate for one another's weaknesses while reinforcing their strengths. Furthermore, several of the most recent studies on anomaly detection point to ensembles, or multiple classifier systems, as the most promising line of research for obtaining robust and accurate detectors.
To address the scarcity of open-source software, the first significant contribution of this Thesis is an efficient and easy-to-use R library named otsad, together with a comparative study of the detectors it implements. The otsad library is the first R package to collect a set of online time series anomaly detectors: PEWMA, SD-EWMA, TSSD-EWMA, KNN-LDCD, KNN-CAD, and CAD-OSE. It also implements a new false positive reduction technique that significantly improves the detectors' results. Inspired by real-life situations in which some time elapses between an alarm being triggered and corrective action being taken, the technique uses the number of data points processed between two detected anomalies to reduce the number of false positives. The package also includes advanced functionality, such as the NAB technique for measuring detector performance and a visualization function. Finally, a comparative study of the effectiveness and efficiency of the implemented methods was carried out.
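The idea behind the false positive reduction technique can be sketched as follows. The abstract does not give the exact rule used in otsad, so this sketch assumes the simplest variant: an alarm is kept only if at least `min_gap` data points have been processed since the last kept alarm. The function name `suppress_close_alarms` and the parameter `min_gap` are hypothetical and are not part of the otsad API.

```python
def suppress_close_alarms(flags, min_gap):
    """Drop alarms raised fewer than `min_gap` points after the last kept alarm.

    Assumption for illustration: a fixed gap measured from the last kept
    alarm. This is not the rule implemented in otsad.
    """
    kept = []
    since_last = None  # points processed since the last kept alarm (None: no alarm yet)
    for flag in flags:
        if since_last is not None:
            since_last += 1
        if flag and (since_last is None or since_last >= min_gap):
            kept.append(True)
            since_last = 0
        else:
            kept.append(False)
    return kept


raw = [False, True, True, False, True, False, False, False, True]
filtered = suppress_close_alarms(raw, min_gap=4)
# filtered == [False, True, False, False, False, False, False, False, True]
```

The function keeps a single counter and processes each flag once, so it fits the one-pass setting and can be applied to the output of any detector.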
The last contribution is divided into two parts. First, we propose a framework that allows more online time series outlier detection algorithms to be generated by facilitating the adaptation of available time series prediction algorithms. The framework aims to carry the advances made in online time series forecasting over to anomaly detection and thus provide the tools to expand the catalog of detectors. The proposed framework implements several online normalization and outlier scoring methods already available in state-of-the-art models, as well as novel proposals designed to improve upon these baselines. Specifically, two novel normalization methods, one-pass adaptive normalization (OAN) and one-pass adaptive min-max normalization (OAMN), and two novel scoring methods, sigma scoring (SS) and dynamic SS (DSS), are proposed.
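The general shape of such a framework (normalize, predict, score the prediction error) can be sketched as follows. The abstract does not define OAN, OAMN, SS, or DSS, so the sketch substitutes generic stand-ins: a running standardization for the normalization step and the prediction error measured in running standard deviations for the scoring step. All names (`RunningStats`, `LastValueForecaster`, `detect_with_forecaster`, `k`, `warmup`) are hypothetical.

```python
import math


class RunningStats:
    """One-pass mean and standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def sigmas(self, x):
        """Signed distance of x from the running mean, in standard deviations."""
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0.0 else (x - self.mean) / std


class LastValueForecaster:
    """Naive online forecaster: predicts the previous observation."""

    def __init__(self):
        self.last = None

    def predict(self):
        return self.last

    def update(self, x):
        self.last = x


def detect_with_forecaster(stream, forecaster, k=3.0, warmup=5):
    """Wrap any object exposing predict() and update(x) into an online detector."""
    norm, errs = RunningStats(), RunningStats()
    flags = []
    for x in stream:
        z = norm.sigmas(x)  # normalization stand-in, based on past samples only
        norm.update(x)
        pred = forecaster.predict()
        if pred is None:
            flags.append(False)  # no forecast available yet
        else:
            err = z - pred
            score = abs(errs.sigmas(err))  # scoring stand-in: error in sigmas
            flags.append(errs.n >= warmup and score > k)
            errs.update(err)
        forecaster.update(z)
    return flags


flags = detect_with_forecaster([1.0, 2.0] * 6 + [50.0], LastValueForecaster())
```

Because the wrapper only relies on `predict()` and `update(x)`, any online forecaster with that interface could replace `LastValueForecaster` without changing the normalization or scoring code.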
Then, we demonstrate the usability and efficacy of the proposed framework by discussing the adaptation of a novel ensemble-based online recurrent extreme learning machine, EORELM-AD. EORELM-AD was created by applying the steps of the proposed framework to an ensemble of Online Recurrent Extreme Learning Machines. The ensemble combines several instances initialized with different parameter settings. In this manner, the hyperparameter selection problem is circumvented, which reduces the dependence of the model's configuration on the target dataset. Furthermore, at each iteration the ensemble replaces deviating models with newly initialized ones, so it can adapt quickly to distribution changes and significantly reduce false positives. EORELM-AD is therefore much more robust to distribution changes and anomalies in time series.
To conclude, extensive experiments on well-known benchmark datasets for time series outlier detection are presented and discussed, yielding two main conclusions. First, the performance of the proposed EORELM-AD detector is competitive with that of several state-of-the-art outlier detection algorithms. Second, the proposed framework is a useful tool for adapting an online time series prediction algorithm to outlier detection.
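The replacement mechanism can be illustrated with a minimal sketch. It is not EORELM-AD: the ensemble members here are simple exponential smoothers standing in for Online Recurrent Extreme Learning Machines, and the replacement criterion (absolute error above `tolerance` times the median error of the ensemble) is an assumption made for illustration. The names `ExpSmoother`, `ReplacingEnsemble`, and `tolerance` are hypothetical.

```python
import random
import statistics


class ExpSmoother:
    """Stand-in ensemble member: simple exponential smoothing forecaster."""

    def __init__(self, alpha, start=None):
        self.alpha = alpha
        self.level = start

    def predict(self):
        return self.level

    def update(self, x):
        if self.level is None:
            self.level = x
        else:
            self.level += self.alpha * (x - self.level)


class ReplacingEnsemble:
    """Ensemble whose members differ in their settings and are replaced when they deviate."""

    def __init__(self, size=5, tolerance=3.0, seed=0):
        self.rng = random.Random(seed)
        self.tolerance = tolerance  # assumed replacement criterion, see lead-in
        self.members = [self._new_member() for _ in range(size)]
        self.replaced = 0

    def _new_member(self, start=None):
        # Each member draws its own setting, so no single value must be tuned.
        return ExpSmoother(alpha=self.rng.uniform(0.05, 0.95), start=start)

    def predict(self):
        preds = [m.predict() for m in self.members if m.predict() is not None]
        return statistics.median(preds) if preds else None

    def update(self, x):
        errors = [0.0 if m.predict() is None else abs(x - m.predict())
                  for m in self.members]
        typical = statistics.median(errors)
        for i, (member, err) in enumerate(zip(self.members, errors)):
            if typical > 0.0 and err > self.tolerance * typical:
                # Deviating member: swap it for a freshly initialized one.
                self.members[i] = self._new_member(start=x)
                self.replaced += 1
            else:
                member.update(x)


ensemble = ReplacingEnsemble(size=5, seed=0)
for value in [0.0] * 20 + [10.0] * 60:  # abrupt level shift halfway through
    ensemble.update(value)
```

After the level shift, members that lag far behind the rest are replaced by models started at the current value, which is one way an ensemble can track a distribution change without raising a long run of false alarms.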