Self-supervised vision transformer-based few-shot learning for facial expression recognition.

Inf. Sci. (2023)

Abstract
Facial expression recognition (FER) is embedded in many real-world human-computer interaction tasks, such as online learning, depression recognition and remote diagnosis. However, FER is often hindered by privacy concerns and low recognition accuracy due to inadequate data transfer restrictions on public clouds, insufficient quantities of effectively labeled samples and class imbalance. To address these challenges, we develop an automatic privacy-preserving learning state recognition system for supervising the quality of online teaching, in which edge servers and cloud servers cooperate to reduce the risk of privacy exposure. In particular, we propose few-shot facial expression recognition with a self-supervised vision transformer (SSF-ViT), which integrates self-supervised learning (SSL) and few-shot learning (FSL) to train a deep learning model with fewer labeled samples. Specifically, a vision transformer (ViT) is jointly pretrained on four self-supervised pretext tasks (image denoising and reconstruction, image rotation prediction, jigsaw puzzle solving and masked patch prediction) to obtain a pretrained ViT encoder. The pretrained ViT encoder is then fine-tuned on a lab-controlled labeled FER dataset, where it extracts spatiotemporal features and performs the FER task. Finally, we construct prototypes to perform few-shot classification for specific expression recognition. An in-the-wild FER dataset is divided into support and query sets, from which few-shot classification episodes are constructed. The fine-tuned ViT encoder serves as the feature extractor to build a prototype for each support-set category, and expression classification results are obtained by computing the Euclidean distance between the query samples and the prototypes. Extensive experimental results show that SSF-ViT achieves recognition accuracies of 74.95%, 66.04%, 63.69% and 90.98% on the FER2013, AffectNet, SFEW 2.0 and RAF-DB datasets, respectively. In addition, SSF-ViT improves the recognition performance for specific expression categories on these datasets.
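To make the pretraining stage concrete, below is a minimal PyTorch sketch of one of the four pretext tasks, rotation prediction, in which the encoder must classify which multiple of 90 degrees an unlabeled image was rotated by. The `timm` ViT backbone, the linear head and the batch of random tensors are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of the rotation-prediction pretext task (one of four described in the paper).
# Assumptions: a generic timm ViT backbone and a simple linear head; not the
# authors' actual architecture or hyperparameters.
import torch
import torch.nn as nn
import timm

encoder = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
rotation_head = nn.Linear(encoder.num_features, 4)  # classes: 0, 90, 180, 270 degrees

def rotation_batch(images: torch.Tensor):
    """Rotate each image by a random multiple of 90 degrees; labels come for free."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

criterion = nn.CrossEntropyLoss()
images = torch.randn(8, 3, 224, 224)           # stand-in for unlabeled face crops
rotated, labels = rotation_batch(images)
logits = rotation_head(encoder(rotated))        # encoder output: (batch, num_features)
loss = criterion(logits, labels)                # self-supervised: no human labels needed
loss.backward()
```

The other three pretext tasks (denoising/reconstruction, jigsaw puzzle solving, masked patch prediction) would attach their own heads to the same shared encoder during joint pretraining.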
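The few-shot classification step follows the prototypical recipe stated in the abstract: average each support class's embeddings into a prototype, then assign each query to the nearest prototype by Euclidean distance. The sketch below assumes a hypothetical `encoder` callable that maps an image batch to a (batch, dim) feature matrix; the episode shapes are illustrative.

```python
# Sketch of prototype-based few-shot classification, assuming `encoder` is the
# fine-tuned ViT feature extractor; episode sizes (N-way, K-shot) are illustrative.
import torch

def prototypical_classify(encoder, support_x, support_y, query_x, n_way: int):
    """Classify query images by Euclidean distance to class prototypes.

    support_x: (N*K, C, H, W) labeled support images
    support_y: (N*K,) integer class labels in [0, n_way)
    query_x:   (Q, C, H, W) query images to classify
    """
    with torch.no_grad():
        support_feats = encoder(support_x)               # (N*K, D)
        query_feats = encoder(query_x)                   # (Q, D)
    # Prototype = mean embedding over each support-set class.
    prototypes = torch.stack([support_feats[support_y == c].mean(dim=0)
                              for c in range(n_way)])    # (N, D)
    # Euclidean distance from every query to every prototype; nearest wins.
    dists = torch.cdist(query_feats, prototypes)         # (Q, N)
    return dists.argmin(dim=1)                           # predicted class per query
```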
Key words
facial expression recognition, learning, self-supervised, transformer-based, few-shot