ChimpVLM: Ethogram-Enhanced Chimpanzee Behaviour Recognition
arXiv (2024)
Abstract
We show that chimpanzee behaviour understanding from camera traps can be
enhanced by providing visual architectures with access to an embedding of text
descriptions that detail species behaviours. In particular, we present a
vision-language model which employs multi-modal decoding of visual features
extracted directly from camera trap videos to process query tokens representing
behaviours and output class predictions. Query tokens are initialised using a
standardised ethogram of chimpanzee behaviour, rather than using random or
name-based initialisations. In addition, the effect of initialising query
tokens using a masked language model fine-tuned on a text corpus of known
behavioural patterns is explored. We evaluate our system on the PanAf500 and
PanAf20K datasets and demonstrate the performance benefits of our multi-modal
decoding approach and query initialisation strategy on multi-class and
multi-label recognition tasks, respectively. Results and ablations corroborate
performance improvements. We achieve state-of-the-art performance over vision
and vision-language models in top-1 accuracy (+6.34%) [...] (+1.1%) [...]
complete source code and network weights for full reproducibility of results
and easy utilisation.
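
As a rough illustration of the idea described above, the following minimal sketch shows query tokens initialised from text descriptions of behaviours and then decoded against visual features by cross-attention. It is not the released implementation. Everything here is an assumption made for illustration: `embed_text` is a hypothetical, deterministic stand-in for a language-model embedding, the three ethogram entries are invented examples rather than the standardised ethogram, the visual tokens are random numbers in place of video backbone features, and the decoder has no learned weights.

```python
import math
import random


def embed_text(description, dim=8):
    """Hypothetical stand-in for a language-model embedding of a description.

    Seeding the generator with the string makes the vector deterministic.
    """
    rng = random.Random(description)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_attend(query, visual_tokens):
    """Single-head cross-attention: one query attends over all visual tokens."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in visual_tokens])
    return [
        sum(w * v[i] for w, v in zip(weights, visual_tokens))
        for i in range(len(query))
    ]


def decode(ethogram, visual_tokens):
    """Return one untrained score per behaviour class.

    Each query token is initialised from the behaviour's text description
    rather than from random values or the class name alone.
    """
    scores = {}
    for name, description in ethogram.items():
        query = embed_text(description)
        attended = cross_attend(query, visual_tokens)
        scores[name] = dot(query, attended)  # toy scoring, no learned head
    return scores


# Invented example entries, NOT taken from the actual ethogram.
ETHOGRAM = {
    "climbing": "moving up or down a tree trunk or liana",
    "feeding": "picking up food items and bringing them to the mouth",
    "resting": "sitting or lying still without other activity",
}

# Random stand-in for features extracted from a camera trap video.
_rng = random.Random(0)
visual_tokens = [[_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]

scores = decode(ETHOGRAM, visual_tokens)

# Multi-class recognition: classes compete through a softmax.
multi_class = dict(zip(scores, softmax(list(scores.values()))))

# Multi-label recognition: each class gets an independent sigmoid.
multi_label = {name: sigmoid(s) for name, s in scores.items()}
```

The last two steps mirror the two evaluation settings named in the abstract: a softmax over classes for multi-class recognition, and independent per-class sigmoids for multi-label recognition. In a trained system the queries, attention projections and classification head would all be learned.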