Real-World Anomaly Detection in Surveillance through Semi-supervised Federated Active Learning

Nicolás Cubero Torres, Francisco Barranco Expósito, Eduardo Ros Vidal

Description

This project presents the deployment and research of semi-supervised deep learning models for anomaly detection in surveillance video, developed on a synchronous federated training architecture under the Federated Learning paradigm, in which training is distributed over multiple training nodes.

This research is accompanied by the deployment of an Active Learning framework that enables the model to learn continuously from live video recording streams.

Fig 1. Federated learning system scheme.

Spatio-temporal learning of normal behaviour

Fig 2. Spatio-temporal learner autoencoder architecture and event classification workflow.

Video-segment reconstruction autoencoder models, as proposed in [Rash_19, Yong_17, Mahm_16], are used to learn spatio-temporal features from video sequences containing normal events. The model is trained to reconstruct video segments containing normal events accurately. Since no abnormal event is fed during training, the reconstruction of normal video segments is expected to be more accurate than that of segments containing abnormal events.
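As an illustration, below is a minimal sketch of such a reconstruction autoencoder in the spirit of the convolutional/ConvLSTM architectures of [Yong_17]. The cuboid length, frame resolution, and layer sizes are illustrative assumptions, not the exact configuration used in this project.

```python
# Sketch of a spatio-temporal reconstruction autoencoder (assumptions:
# Keras backend, 8-frame greyscale cuboids at 224x224; hyperparameters
# are illustrative, not the project's exact architecture).
import tensorflow as tf
from tensorflow.keras import layers, models

CUBOID_LEN = 8               # frames per cuboid (assumed)
FRAME_H, FRAME_W = 224, 224  # input resolution (assumed)

def build_autoencoder():
    inp = layers.Input(shape=(CUBOID_LEN, FRAME_H, FRAME_W, 1))
    # Spatial encoder: 2D convolutions applied frame by frame
    x = layers.TimeDistributed(
        layers.Conv2D(128, 11, strides=4, padding="same", activation="tanh"))(inp)
    x = layers.TimeDistributed(
        layers.Conv2D(64, 5, strides=2, padding="same", activation="tanh"))(x)
    # Temporal bottleneck: ConvLSTM layers model motion across frames
    x = layers.ConvLSTM2D(64, 3, padding="same", return_sequences=True)(x)
    x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(x)
    x = layers.ConvLSTM2D(64, 3, padding="same", return_sequences=True)(x)
    # Spatial decoder: mirror the encoder to reconstruct the input cuboid
    x = layers.TimeDistributed(
        layers.Conv2DTranspose(128, 5, strides=2, padding="same", activation="tanh"))(x)
    out = layers.TimeDistributed(
        layers.Conv2DTranspose(1, 11, strides=4, padding="same", activation="sigmoid"))(x)
    model = models.Model(inp, out)
    # Trained on normal cuboids only; the RSSE anomaly score is computed
    # at evaluation time (see the detection sketch further below)
    model.compile(optimizer="adam", loss="mse")
    return model
```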

Fig 3. Frame cuboid analysis.

The root of the sum of squared errors (RSSE) is used as the loss function to measure reconstruction accuracy for the input video segments. A normalized anomaly threshold μ is then set to separate the two classes of video segments. Additionally, a second, temporal threshold λ is introduced to mitigate false positives caused by error peaks due to occlusions, sudden illumination changes, the appearance of new objects, etc. The temporal threshold determines the minimum number of consecutive anomalous cuboids required for a video time strip to be flagged as containing an abnormal event.
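A minimal sketch of this decision rule follows: per-cuboid RSSE scores, min-max normalization over the sequence, the normalized threshold μ, and the temporal threshold λ. Function and variable names are hypothetical.

```python
# Hedged sketch of the anomaly decision rule described above; names are
# illustrative, not the project's API.
import numpy as np

def cuboid_scores(model, cuboids):
    """RSSE reconstruction error for each input cuboid."""
    recons = model.predict(cuboids)
    err = np.sqrt(((cuboids - recons) ** 2).sum(axis=(1, 2, 3, 4)))
    # Normalize to [0, 1] over the sequence so mu is comparable across videos
    return (err - err.min()) / (err.max() - err.min() + 1e-12)

def detect_anomalies(scores, mu=0.5, lam=9):
    """Flag a time strip as abnormal only when at least `lam` consecutive
    cuboids exceed the normalized threshold `mu`, suppressing isolated
    error peaks from occlusions, illumination changes, etc."""
    flags = scores > mu
    out = np.zeros_like(flags)
    run = 0
    for i, f in enumerate(flags):
        run = run + 1 if f else 0
        if run >= lam:
            out[i - lam + 1 : i + 1] = True
    return out
```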

Federated learning from multiple data sources

Fig 4. Federated aggregation from multiple video data sources.

A synchronous federated learning architecture is proposed to train the autoencoder model from multiple data sources. Training is performed with two simulated client nodes, each of which trains a local autoencoder model on an exclusive set of video segments. FedAvg [Kone_15] is applied to aggregate the local models into the global model.
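As a sketch, FedAvg replaces the global weights after each synchronous round with the example-count-weighted average of the client weights. The snippet below assumes Keras-style `get_weights()`/`set_weights()` lists of numpy arrays; names are illustrative.

```python
# Minimal FedAvg aggregation sketch (assumed weight representation:
# per-client lists of numpy arrays, one array per layer).
import numpy as np

def fedavg(client_weights, client_sizes):
    """client_weights: list of per-client weight lists;
    client_sizes: number of training cuboids held by each client."""
    total = float(sum(client_sizes))
    coeffs = [n / total for n in client_sizes]
    return [
        sum(c * w[layer] for c, w in zip(coeffs, client_weights))
        for layer in range(len(client_weights[0]))
    ]

# One synchronous round with two simulated clients (hypothetical names):
#   for client, data in zip(clients, client_datasets):
#       client.set_weights(global_weights)
#       client.fit(data, data, epochs=local_epochs)
#   global_weights = fedavg([c.get_weights() for c in clients], sizes)
```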

Two sets of experiments have been designed. In the first set, each dataset is split into two disjoint subsets, one per client, and local training is performed on each subset followed by aggregation; this lets us evaluate the actual performance gain of federated aggregation against the baseline performance of centralized training. In the second set, the federated model is trained from multiple video sources drawn from different datasets that capture the same scenario from different views, with each client node training locally on its own video dataset; this evaluates the capability to aggregate information with heterogeneous spatial structure, compared against the baseline performance of single models trained on each individual dataset.
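For the first experiment set, the disjoint per-client partition can be sketched as follows (a minimal illustration; function name and shard count are hypothetical):

```python
# Split one dataset into disjoint client shards for simulated clients.
import numpy as np

def split_clients(cuboids, n_clients=2, seed=0):
    # Shuffle once, then hand each client a disjoint slice of the indices
    idx = np.random.default_rng(seed).permutation(len(cuboids))
    return [cuboids[part] for part in np.array_split(idx, n_clients)]
```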

The experimental results show that the federated models obtain quality metrics close to those of the single models trained on each individual dataset. This demonstrates the robust and accurate aggregation capabilities of the Federated Learning paradigm, both for data with identical spatial structure and for spatially heterogeneous data.

Fig 5. Training loss evolution of federated training on each client node (clients #0 and #1) for the first experiment set on the UCSD Ped 1 dataset.
Fig 6. Test per-class reconstruction error and training time comparison between the centralized learning paradigm and the federated learning paradigm for the first experiment set on the UCSD Ped 1 dataset.

Resources

Examples

Below, some examples of the global model's event prediction capability can be played over samples from each dataset. For all samples, μ=0.5 and λ=9 are used. Move the mouse over each sample to see the reconstruction made by the model:

UCSD Ped 1 - Test 002: original video / reconstructed video
UCSD Ped 1 - Test 010: original video / reconstructed video
UCSD Ped 1 - Test 027: original video / reconstructed video
UCSD Ped 2 - Test 2: original video / reconstructed video
UCSD Ped 2 - Test 7: original video / reconstructed video

References