DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution
CoRR (2024)
Abstract
Region-level multi-modality methods translate referred image regions into
human-preferred language descriptions. Unfortunately, most existing methods
use fixed visual inputs and thus lack the resolution adaptability required to
produce precise language descriptions. In this study, we propose a dynamic
resolution approach, referred to as DynRefer, that pursues high-accuracy
region-level referring by mimicking the resolution adaptability of human
visual cognition. DynRefer first performs stochastic vision-language
alignment: it aligns the desired language descriptions of multi-modality tasks
with images of stochastic resolution, constructed by nesting a set of views
around the referred region. DynRefer then performs dynamic multi-modality
referring, realized by selecting views based on image and language priors.
This allows the visual information used for referring to better match human
preferences, thereby improving the representational adaptability of
region-level multi-modality models. Extensive experiments show that DynRefer
brings mutual improvement across tasks including region-level captioning,
open-vocabulary region recognition, and attribute detection. Last but not
least, DynRefer achieves new state-of-the-art results on multiple
region-level multi-modality tasks using a single model. Code is available at
https://github.com/callsys/DynRefer.
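
To make the two mechanisms in the abstract concrete, below is a minimal
Python sketch of (1) nesting a set of views of stochastic resolution around a
referred region and (2) selecting views under a prior. This is an
illustrative assumption, not the official DynRefer implementation: the class
and function names (Box, nested_views, select_views) and the area-based
scoring heuristic are all hypothetical; see the linked repository for the
actual method.

```python
# Illustrative sketch only; names and the scoring heuristic are assumptions,
# not DynRefer's actual code.
import random
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    def expand(self, ratio: float, img_w: int, img_h: int) -> "Box":
        """Grow the box about its center by `ratio`, clipped to the image."""
        cx, cy = (self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2
        bw, bh = (self.x2 - self.x1) * ratio, (self.y2 - self.y1) * ratio
        return Box(max(0.0, cx - bw / 2), max(0.0, cy - bh / 2),
                   min(float(img_w), cx + bw / 2),
                   min(float(img_h), cy + bh / 2))


def nested_views(region: Box, img_w: int, img_h: int,
                 n_views: int = 3) -> list[Box]:
    """Build nested crops around the referred region, from a tight crop
    (high effective resolution) out to wider context. Random jitter on the
    expansion ratio makes the resolution stochastic during alignment."""
    views = []
    for i in range(n_views):
        ratio = 1.0 + i + random.uniform(0.0, 1.0)
        views.append(region.expand(ratio, img_w, img_h))
    return views


def select_views(views: list[Box], prior_scores: list[float],
                 k: int = 2) -> list[Box]:
    """Keep the top-k views under a task prior (a stand-in for selection
    driven by image and language priors)."""
    ranked = sorted(zip(prior_scores, views),
                    key=lambda pair: pair[0], reverse=True)
    return [view for _, view in ranked[:k]]


if __name__ == "__main__":
    region = Box(100, 120, 180, 200)
    views = nested_views(region, img_w=640, img_h=480)
    # Hypothetical prior: prefer tighter views, as one might for
    # attribute-style queries.
    scores = [1.0 / ((v.x2 - v.x1) * (v.y2 - v.y1)) for v in views]
    print(select_views(views, scores, k=2))
```

In this sketch, stochastic resolution during training comes from the jittered
expansion ratios, while the inference-time "dynamic referring" reduces to a
top-k ranking; the paper's actual priors are learned from image and language
signals rather than hand-coded.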