A Deep Clustering via Automatic Feature Embedded Learning for Human Activity Recognition

Abstract

Traditional clustering algorithms are widely used to build bag-of-words (BOW) models that aggregate the spatiotemporal feature points extracted from a video for human activity recognition. Their performance is restricted by computational complexity, which limits the number of feature points that can be used. In contrast, deep clustering yields good clustering performance without limiting the number of feature points. This work therefore proposes a dual stacked autoencoders features embedded clustering (DSAFEC) and a BOW construction method based on the DSAFEC (B-DSAFEC) to reduce the computational complexity and to remove the restriction on feature point selection. The DSAFEC first transforms the feature points extracted from a video into a learned feature space, and then predicts the cluster assignment probabilities of the feature points, which are used to build BOWs for human activity recognition. Soft clustering is also supported: each feature point is assigned to the multiple clusters with the largest probabilities, instead of to a single cluster as in hard clustering. Experimental results on three benchmark human activity datasets show that the B-DSAFEC outperforms five reference methods based on either traditional clustering or deep clustering.
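The difference between hard and soft assignment when building a BOW can be illustrated with a small sketch. It assumes the cluster assignment probabilities have already been predicted and are given as an array with one row per feature point. The function name `build_bow`, the parameter `top_m`, and the choice to give every selected cluster one unweighted vote are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def build_bow(probs, top_m=1):
    """Build a normalized BOW histogram from cluster assignment probabilities.

    probs: array of shape (n_points, n_clusters); row i holds the predicted
           probabilities that feature point i belongs to each cluster.
    top_m: number of highest-probability clusters each point votes for.
           top_m=1 corresponds to hard clustering, top_m>1 to soft clustering.
    """
    probs = np.asarray(probs, dtype=float)
    n_points, n_clusters = probs.shape
    if not 1 <= top_m <= n_clusters:
        raise ValueError("top_m must be between 1 and the number of clusters")

    # Column indices of the top_m largest probabilities in every row.
    top_idx = np.argpartition(probs, -top_m, axis=1)[:, -top_m:]

    # Assumption: each selected cluster receives one unweighted vote.
    hist = np.zeros(n_clusters)
    np.add.at(hist, top_idx.ravel(), 1.0)

    # Normalize so videos with different numbers of points are comparable.
    return hist / hist.sum()


# Toy example: three feature points, three clusters.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
])
hard_bow = build_bow(probs, top_m=1)
soft_bow = build_bow(probs, top_m=2)
```

With `top_m=1` each point contributes to exactly one histogram bin, while `top_m=2` lets a point that lies between two clusters contribute to both. Weighting each vote by its probability would be an equally plausible variant.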

Existing System

• While existing unsupervised deep clustering remedies leverage network architectures and optimization objectives tailored for static image datasets, deep architectures that uncover cluster structures from raw sequence data captured by on-body sensors remain largely unexplored.
• Through extensive experiments, including comparisons with existing methods, we show the effectiveness of our approach in jointly learning unsupervised representations for sensory data and generating cluster assignments with strong semantic correspondence to distinct human activities.

Disadvantages

• Representing local features using the BOW and its variants is a successful and popular way of dealing with HAR problems.
• The scale invariant feature transform (SIFT) and the dense trajectory (DT) are successful feature extractors for video.
• The DT extracts features based on sampled trajectories and the motion boundary descriptor.
• Deep clustering (DC) has been applied to audio source separation problems by predicting implicit segmentation labels of the target spectrogram from audio.

Proposed System

• We demonstrate the effectiveness of our proposed approach and compare it with closely related approaches, including traditional clustering methods.
• We compare against end-to-end deep clustering methods proposed for still images and show their inability to cater for the sequential nature of time-series data.
• Our results demonstrate that our end-to-end approach not only outperforms traditional clustering algorithms applied on both the input data and auto-encoding spaces, but also offers a large performance margin over representative deep clustering baselines proposed for image data.

Advantages

• Based on the DT, an improved dense trajectory (IDT) is proposed and shows improved performance in HAR.
• The STIP is also employed to build the BOW, where the BOW size is selected automatically by minimizing the localized generalization error of a radial basis function neural network (RBFNN).
• Using action bank for feature extraction, an RBFNN is trained directly, yielding high performance, and uncertainty reduction is then performed for ambiguous classes.
• The B-DSAFEC can also use a soft cluster assignment (B-DSAFEC-S), which assigns a feature point to multiple clusters, to improve its performance.
