Zoom-VQA: Patches, Frames and Clips Integration for Video Quality Assessment
CoRR (2023)
Abstract
Video quality assessment (VQA) aims to simulate the human perception of video quality, which is influenced by factors ranging from low-level color and texture details to high-level semantic content. To effectively model these complicated quality-related factors, in this paper, we decompose video into three levels (i.e., patch level, frame level, and clip level) and propose a novel Zoom-VQA architecture to perceive spatio-temporal features at different levels. It integrates three components: a patch attention module, frame pyramid alignment, and a clip ensemble strategy, which respectively capture regions of interest in the spatial dimension, multi-level information across feature levels, and distortions distributed over the temporal dimension. Owing to this comprehensive design, Zoom-VQA obtains state-of-the-art results on four VQA benchmarks and achieves 2nd place in the NTIRE 2023 VQA challenge. Notably, Zoom-VQA outperforms the previous best results on the two subsets of LSVQ, achieving SRCC of 0.8860 (+1.0%) and 0.7985 (+1.9%) on the respective subsets. Extensive ablation studies further verify the effectiveness of each component. Code and models are released at https://github.com/k-zha14/Zoom-VQA.
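To make two of the named components concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' released code: a PatchAttention block reweights patch features toward quality-salient regions, and a clip_ensemble function averages per-clip scores sampled along the temporal dimension. All module names, tensor shapes, and the choice of mean pooling are assumptions made for illustration; consult the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn

class PatchAttention(nn.Module):
    """Reweight patch features so quality-salient regions dominate the score.

    Hypothetical sketch: the paper's patch attention module may differ.
    """
    def __init__(self, dim: int):
        super().__init__()
        # Small MLP producing one attention logit per patch (assumed design).
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.ReLU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim)
        weights = torch.softmax(self.score(patches), dim=1)  # attention over patches
        return (weights * patches).sum(dim=1)                # pooled feature: (batch, dim)

def clip_ensemble(model: nn.Module, clips: torch.Tensor) -> torch.Tensor:
    """Average per-clip quality scores into one video-level prediction.

    clips: (num_clips, batch, channels, frames, height, width), sampled along
    the temporal axis. Mean pooling is one simple ensemble choice; the
    abstract does not specify how the paper combines clip scores.
    """
    return torch.stack([model(clip) for clip in clips]).mean(dim=0)
```

Sampling several clips and ensembling their scores lets distortions that occur anywhere along the timeline influence the final prediction, which matches the abstract's motivation for the clip-level branch.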
Keywords
clip ensemble strategy, clip level, complicated quality-related factors, different feature levels, frame level, frame pyramid alignment, high-level semantic content, human perception, low-level color, multi-level information, novel Zoom-VQA architecture, NTIRE 2023 VQA challenge, patch attention module, patch level, spatio-temporal features, temporal dimension, texture details, video quality assessment, VQA benchmarks