Lightweight hybrid model based on MobileNet-v2 and Vision Transformer for human-robot interaction
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE (2024)
Abstract
In convolutional neural networks (CNNs), convolution operations are good at extracting local features but have difficulty capturing global representations. In Vision Transformer, multi-head self-attention can capture long-range feature dependencies but can degrade local feature details. Motivated by this, we propose HybridNet, a novel lightweight model built on MobileNet-v2 and Vision Transformer that combines the advantages of both CNNs and Vision Transformer. In addition, to enhance HybridNet's capability for temporal information interaction, we incorporate temporal-channel attention into the model. We conducted experiments on the Kinetics-400, Jester, and EgoGesture datasets to validate the effectiveness of HybridNet. The experimental results demonstrate that the lightweight HybridNet achieves 96.3% and 93.9% accuracy on Jester and EgoGesture, respectively, a performance close to or comparable with state-of-the-art methods. Finally, we use HybridNet as a real-time gesture recognition model and use its recognition results as commands to control robots in a simulation environment, achieving human-robot interaction. Gesture interaction between humans and robots improves communication, facilitates physical collaboration, enables non-verbal expression, enhances accessibility, and creates a more engaging user experience. It makes human-robot interaction more intuitive and efficient, and therefore more dynamic and interactive.
Key words
2-dimensional convolutional neural network, Vision Transformer, Lightweight model, Gesture recognition, Human-robot interaction