CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He¹, Yinghao Xu³, Yuwei Guo¹, Gordon Wetzstein³, Bo Dai², Hongsheng Li¹, Ceyuan Yang²

¹The Chinese University of Hong Kong, ²Shanghai Artificial Intelligence Laboratory, ³Stanford University



teaser_figure

Abstract

Controllability plays a crucial role in video generation since it allows users to create desired content. However, existing models have largely overlooked precise control of camera pose, which serves as a cinematic language for expressing deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, which enables accurate camera pose control for text-to-video (T2V) models. After precisely parameterizing the camera trajectory, we train a plug-and-play camera module on a T2V model, leaving the other components untouched. Additionally, we conduct a comprehensive study on the effect of various datasets, which suggests that videos with a diverse camera distribution and similar appearance enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise and domain-adaptive camera control, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs.

Demo Video

Framework


architecture_figure

(a) CameraCtrl pipeline. Given a pre-trained T2V model, CameraCtrl trains a camera encoder on top of it. The camera encoder takes the Plücker embedding as input and outputs multi-scale camera representations. These features are then integrated into the temporal attention layers of the U-Net at their respective scales to control the video generation process. (b) Camera feature injection process. The camera features c_t and the latent features z_t are first combined through element-wise addition. A learnable linear layer then further fuses the two representations, and the result is fed into the first temporal attention layer of each temporal block. The weights of the original T2V model are left untouched.
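The two steps in the caption can be illustrated with a minimal NumPy sketch. It is not the released implementation. The function names, the camera-to-world convention for R and t, the half-pixel offset, and the use of a single plain weight matrix for the linear layer are all assumptions made for illustration. The Plücker embedding is taken here as the per-pixel pair (o × d, d), where o is the camera center and d is the unit ray direction through the pixel.

```python
import numpy as np

def plucker_embedding(K, R, t, height, width):
    """Per-pixel Plücker coordinates (o x d, d) for one camera.

    K: (3, 3) intrinsics. R: (3, 3) and t: (3,) are assumed to map
    camera coordinates to world coordinates (hypothetical convention).
    Returns an array of shape (height, width, 6).
    """
    # Homogeneous pixel-center coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    # Back-project each pixel to a camera-space ray, then rotate to world space.
    d = pix @ np.linalg.inv(K).T @ R.T
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)
    # The camera center is the same for every pixel of this frame.
    o = np.broadcast_to(t, d.shape)
    moment = np.cross(o, d)
    return np.concatenate([moment, d], axis=-1)

def inject_camera_features(z_t, c_t, weight, bias):
    """Fuse latent and camera features before temporal attention.

    z_t, c_t: (frames, tokens, channels) arrays of equal shape.
    weight: (channels, channels), bias: (channels,). These stand in for
    the learnable linear layer; in training they would be optimized.
    """
    fused = z_t + c_t             # element-wise addition
    return fused @ weight + bias  # linear fusion layer

# Example with made-up camera parameters and random features.
K = np.array([[50.0, 0.0, 3.0], [0.0, 50.0, 2.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, -0.2, 0.3])
emb = plucker_embedding(K, R, t, height=4, width=6)

rng = np.random.default_rng(0)
z_t = rng.normal(size=(16, 24, 8))
c_t = rng.normal(size=(16, 24, 8))
out = inject_camera_features(z_t, c_t, np.eye(8), np.zeros(8))
```

In this sketch `emb` has shape (4, 6, 6), one 6-vector per pixel. In the pipeline described above, a camera encoder would turn a sequence of such maps (one per frame) into the multi-scale features c_t; that encoder is omitted here.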

Visualization Results


Same text prompt + Different camera trajectories


CameraCtrl for different domain videos



Integrating CameraCtrl with other video control methods



BibTeX

@misc{he2024cameractrl,
    title={CameraCtrl: Enabling Camera Control for Text-to-Video Generation},
    author={Hao He and Yinghao Xu and Yuwei Guo and Gordon Wetzstein and Bo Dai and Hongsheng Li and Ceyuan Yang},
    year={2024},
    eprint={2404.02101},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

We borrow the source code of this project page from DreamBooth.