Abstract
Image diffusion models, trained on massive image collections, have emerged as the most versatile image generators in terms of quality and diversity. They support inverting real images and conditional (e.g., text-guided) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth-conditioned) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to the future frames via self-attention feature injection to adapt the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate the effectiveness of the approach through extensive experimentation and compare it against four different prior and parallel efforts. We demonstrate that realistic text-guided video edits are possible without any compute-intensive preprocessing or video-specific finetuning.
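As a minimal illustration of the first step (editing the anchor frame with a structure-guided image model), the sketch below uses the depth-conditioned Stable Diffusion pipeline from Hugging Face diffusers as one possible backbone; the checkpoint, file paths, and prompt are illustrative and not part of our released code.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

# Load a depth-conditioned image diffusion model (one possible structure-guided backbone).
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

# Edit the anchor frame: depth is estimated from the frame and used as structural guidance,
# so the scene layout is preserved while the prompt drives the appearance change.
anchor = Image.open("frames/frame_000.png").convert("RGB")
edited_anchor = pipe(
    prompt="a marble sculpture of a car driving on a mountain road",  # illustrative prompt
    image=anchor,
    strength=0.9,  # high strength allows a strong edit while depth keeps the structure
).images[0]
edited_anchor.save("edited/frame_000.png")
```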
Method Overview
Pix2Video first inverts each frame with DDIM inversion and treats the result as the initial noise for the denoising process. To edit each frame (lower row), we select a reference frame (upper row) and inject its self-attention features into the UNet. At each diffusion step, we also update the latent of the current frame guided by the latent of the reference.
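A simplified sketch of this per-frame loop is given below (not the exact implementation). It assumes a `ddim_invert(frame)` helper that returns a frame's inverted noise, and a `denoise_step(z, t, prompt, ref_attn=...)` helper that runs one denoising step of a depth-conditioned diffusion UNet, optionally replacing its self-attention keys/values with the given reference-frame features, and returns the updated latent together with the self-attention features it produced; for brevity, the previous frame serves as the reference here.

```python
def edit_video(frames, prompt, ddim_invert, denoise_step, timesteps, guide_weight=0.1):
    edited = []
    prev_latents, prev_attn = None, None          # denoising trajectory of the reference frame
    for frame in frames:
        z = ddim_invert(frame)                    # initial noise for the current frame
        cur_latents, cur_attn = [], []
        for i, t in enumerate(timesteps):
            ref = None if prev_attn is None else prev_attn[i]
            # Denoise while injecting the reference frame's self-attention features.
            z, attn = denoise_step(z, t, prompt, ref_attn=ref)
            if prev_latents is not None:
                # Guided latent update: keep the current latent close to the reference
                # frame's latent at the same diffusion step before continuing.
                z = z - guide_weight * (z - prev_latents[i])
            cur_latents.append(z)
            cur_attn.append(attn)
        edited.append(z)
        prev_latents, prev_attn = cur_latents, cur_attn   # current frame becomes the reference
    return edited
```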

Comparison with state-of-the-art approaches
[Video comparison grids on four example sequences, each showing: input | Jamriska et al. | Text2Live | Prompt-to-Prompt | Tune-a-Video | Pix2Video (ours)]
Applying Jamriska et al. [20] as post-processing
Our method uses depth as a structural cue, which helps to preserve the structure of the input video. Hence, it can be used in conjunction with style propagation methods as a post-processing step to further improve the results. Specifically, we provide every 3rd frame generated by our method as a keyframe to the method of Jamriska et al. [20] and propagate the style of these keyframes to the in-between frames.
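A small sketch of this keyframe selection is shown below; `propagate_style` stands in for the external style-propagation tool of Jamriska et al. [20] and is not part of our code.

```python
def select_keyframes(edited_frames, stride=3):
    # Keep every `stride`-th Pix2Video output frame as a style keyframe.
    return {i: frame for i, frame in enumerate(edited_frames) if i % stride == 0}

# keyframes = select_keyframes(pix2video_frames, stride=3)
# result = propagate_style(input_frames, keyframes)  # fills in the in-between frames
```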
[Two video examples, each showing: input video | ours | ours post-processed]
Using Adobe Firefly as the image generator
We have also implemented our method to use Adobe Firefly as the base image generator model and provide results below.
[Two video examples, each showing: input video | ours]
More Results
Concurrent work
With the increasing success of large-scale text-to-image generation models in image editing tasks, there have been several parallel efforts that focus on using such models for editing videos. We summarize a few concurrent works that we have recently come across. If we have missed another key concurrent work, please reach out to us over email.
Video-P2P extends the idea of null-text inversion to a video clip and optimizes a common null embedding when inverting the video frames. The authors also adapt a cross-attention control mechanism similar to Prompt-to-Prompt. We note that our method is orthogonal: the proposed inversion strategy and cross-attention control could be adopted in our method as well. While Video-P2P finetunes the generator given the input video, we do not perform any training.
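To make the shared null-embedding idea concrete, here is a rough, simplified sketch (not Video-P2P's actual code): one unconditional ("null") embedding per diffusion step is optimized so that classifier-free-guided denoising reproduces the DDIM-inversion trajectories of all frames at once. `cfg_denoise_step(z, t, text_emb, null_emb)` is an assumed helper that performs one guided DDIM denoising step with the model's UNet and scheduler.

```python
import torch

def optimize_shared_null_embeddings(inv_latents, timesteps, text_emb, null_emb_init,
                                    cfg_denoise_step, inner_steps=10, lr=1e-2):
    # inv_latents[f][i]: DDIM-inversion latent of frame f at timestep index i
    # (index 0 = most noisy, last = clean); len(timesteps) == len(inv_latents[f]) - 1.
    null_embs = []
    cur = [inv_latents[f][0] for f in range(len(inv_latents))]   # start from the noisy end
    for i, t in enumerate(timesteps):
        null_emb = null_emb_init.clone().requires_grad_(True)
        opt = torch.optim.Adam([null_emb], lr=lr)
        for _ in range(inner_steps):
            # One shared embedding must reconstruct the inversion trajectory of every frame.
            loss = sum(torch.nn.functional.mse_loss(
                           cfg_denoise_step(cur[f], t, text_emb, null_emb),
                           inv_latents[f][i + 1])
                       for f in range(len(inv_latents)))
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            cur = [cfg_denoise_step(cur[f], t, text_emb, null_emb)
                   for f in range(len(inv_latents))]
        null_embs.append(null_emb.detach())
    return null_embs
```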
Fate-Zero also proposes a training-free strategy for editing videos. They focus on utilizing cross-attention maps to compute blending masks that minimize inconsistency in the background regions of the edited videos.
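The sketch below illustrates the general idea of attention-derived blending masks under our own assumptions (it is not Fate-Zero's code): a cross-attention map aggregated for the edited token is upsampled to the latent resolution, thresholded into an edit mask, and used to keep background latents close to the source.

```python
import torch
import torch.nn.functional as F

def blend_with_attention_mask(src_latent, edit_latent, attn_map, threshold=0.3):
    # attn_map: (H, W) cross-attention map for the edited token, aggregated over heads/layers.
    h, w = src_latent.shape[-2:]
    m = F.interpolate(attn_map[None, None], size=(h, w), mode="bilinear",
                      align_corners=False)[0, 0]
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)   # normalize to [0, 1]
    mask = (m > threshold).float()                   # edit region = 1, background = 0
    # Background keeps the source latent; the edit region takes the edited latent.
    return mask * edit_latent + (1.0 - mask) * src_latent
```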
Tune-A-Video finetunes the image generation model given an input video and uses cross-frame attention for achieving consistent edits. We provide comparisons to this method using the version available at the time of paper preparation.
Gen-1 presents a large scale video generation model that uses depth as a structural cue. The model is trained on a large, mixed dataset of images and videos.