FACIAL DEPRESSION RECOGNITION BY DEEP JOINT LABEL DISTRIBUTION AND METRIC LEARNING

Abstract

In mental health assessment, it is well established that nonverbal cues such as facial expressions can indicate depressive disorders. Recently, the multimodal fusion of facial appearance and dynamics based on convolutional neural networks has demonstrated encouraging performance in depression analysis. However, the correlation and complementarity between different visual modalities have not been well studied in prior methods. In this paper, we propose a sequential fusion method for facial depression recognition. To mine the correlated and complementary depression patterns in multimodal learning, a chained-fusion mechanism is introduced to jointly learn facial appearance and dynamics in a unified framework. We show that such sequential fusion can provide a probabilistic perspective of the model correlation and complementarity between two different data modalities for improved depression recognition. Results on a benchmark dataset show the superiority of our method over several state-of-the-art alternatives.

Existing System

We propose a sequential fusion method for facial depression recognition. To mine the correlated and complementary depression patterns in multimodal learning, a chained-fusion mechanism is introduced to jointly learn facial appearance and dynamics in a unified framework. We show that such sequential fusion can provide a probabilistic perspective of the model correlation and complementarity between two different data modalities for improved depression recognition. Results on a benchmark dataset show the superiority of our method over several state-of-the-art alternatives.

Disadvantages

motion features for RGB-D gesture recognition. By imposing representation learning of the associations between different modalities, Zolfaghari et al. designed a chained multi-stream network to fully exploit pose, motion, and appearance cues for action classification and detection. Feichtenhofer et al. proposed a two-stream CNN architecture that fuses a spatial and a temporal network at a convolutional layer instead of at the softmax layer, which boosted performance on the action recognition problem with a substantial saving in parameters.
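The difference between fusing two streams at the softmax layer and fusing them at a convolutional layer can be illustrated with a minimal sketch. It is not the architecture of Feichtenhofer et al.: the function names are hypothetical, feature maps are plain nested lists shaped [channel][row][col], and element-wise summation is assumed as the fusion operator purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_logits, temporal_logits):
    """Softmax-level fusion: average the class probabilities of both streams."""
    p_s = softmax(spatial_logits)
    p_t = softmax(temporal_logits)
    return [(a + b) / 2.0 for a, b in zip(p_s, p_t)]


def sum_fusion(spatial_maps, temporal_maps):
    """Feature-level fusion (assumed operator): element-wise sum of two
    feature maps with identical [channel][row][col] shapes."""
    return [
        [[a + b for a, b in zip(row_s, row_t)]
         for row_s, row_t in zip(ch_s, ch_t)]
        for ch_s, ch_t in zip(spatial_maps, temporal_maps)
    ]
```

With softmax-level fusion each stream needs its own full set of later layers, whereas after feature-level fusion the two streams share a single set of subsequent layers, which is where a saving in parameters can come from.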

Proposed System

While encouraging progress has been made over the past few years, automated depression analysis in videos remains challenging for the following reasons. On the one hand, unlike large-scale image datasets for visual recognition, most existing depression datasets are relatively small due to privacy concerns. While representation learning based on convolutional neural networks (CNNs) has proven more effective than hand-crafted descriptors in visual-based depression recognition, the lack of labeled data makes training deep networks prone to over-fitting in practice.

Advantages

We are interested in visual-based approaches for depression recognition. In the AVEC 2013 competition, a facial descriptor named local phase quantization (LPQ) was used as a baseline for facial depression recognition, where the LPQ features extracted for each video frame are further employed to train a support vector regression (SVR) model. Cummins et al. used the pyramid of histogram of gradients (PHOG) and space-time interest points (STIPs) to extract behavioural cues for depression analysis. Meng et al. proposed using the motion history histogram (MHH) feature to model motion in videos, and then using partial least squares (PLS) to train a regression model.
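As a rough illustration of how a motion history histogram can summarize motion in a video, the sketch below thresholds frame differences at each pixel and counts runs of consecutive motion by their length. This is a simplified reading of the idea, not the implementation of Meng et al.; the function name, the `threshold` and `max_run` parameters, and the frame format (lists of rows of grayscale values) are assumptions made for the example.

```python
def motion_history_histogram(frames, threshold, max_run):
    """Count, per pixel, the runs of consecutive motion of each length.

    frames: list of grayscale frames, each a list of rows of numbers.
    threshold: absolute frame difference above which a pixel is "moving".
    max_run: longest run length that is recorded; longer runs are ignored.
    Returns hist[row][col][k], the number of runs of exactly k + 1 moving steps.
    """
    height = len(frames[0])
    width = len(frames[0][0])
    hist = [[[0] * max_run for _ in range(width)] for _ in range(height)]
    for r in range(height):
        for c in range(width):
            run = 0
            for prev, cur in zip(frames, frames[1:]):
                if abs(cur[r][c] - prev[r][c]) > threshold:
                    run += 1
                else:
                    # A run of motion just ended; record its length.
                    if 1 <= run <= max_run:
                        hist[r][c][run - 1] += 1
                    run = 0
            # Record a run that is still open at the end of the video.
            if 1 <= run <= max_run:
                hist[r][c][run - 1] += 1
    return hist
```

Flattening the returned histogram gives a fixed-length vector per video, which could then be passed to a regression model such as PLS or SVR.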
