GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos
arXiv (2023)
Abstract
We address the task of generating temporally consistent and physically
plausible images of actions and object state transformations. Given an input
image and a text prompt describing the targeted transformation, our generated
images preserve the environment and transform objects in the initial image. Our
contributions are threefold. First, we leverage a large body of instructional
videos and automatically mine a dataset of triplets of consecutive frames
corresponding to initial object states, actions, and resulting object
transformations. Second, equipped with this data, we develop and train a
conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a
variety of objects and actions and show superior performance compared to
existing methods. In particular, we introduce a quantitative evaluation where
GenHowTo achieves 88% and 74% on seen and unseen interaction categories,
respectively, outperforming prior work by a large margin.
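The abstract itself contains no code, and GenHowTo's implementation is not shown here. As a purely illustrative sketch of the general class of image- and text-conditioned diffusion editing the paper builds on, the snippet below runs the publicly released InstructPix2Pix pipeline from Hugging Face diffusers (a prior method of the same family, not GenHowTo itself); the file names and prompt are hypothetical placeholders.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Load a publicly available image+text conditioned editing model.
# (Illustrative only: InstructPix2Pix, not the GenHowTo model itself.)
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Hypothetical input: a frame showing the initial object state.
initial_state = Image.open("initial_state.jpg").convert("RGB").resize((512, 512))

# Text prompt describing the targeted action / state transformation.
prompt = "slice the tomato in half"

result = pipe(
    prompt=prompt,
    image=initial_state,
    num_inference_steps=50,
    image_guidance_scale=1.5,  # higher = preserve the input scene more
    guidance_scale=7.5,        # higher = follow the text prompt more
).images[0]
result.save("transformed_state.jpg")
```

The two guidance scales trade off preserving the input environment against following the textual instruction, which mirrors the tension the abstract describes: keeping the scene of the initial image fixed while transforming the objects in it.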