Award-winning paper at ICML 2024: “VideoPoet: A large language model for zero-shot video generation.”
Citation
VideoPoet: A large language model for zero-shot video generation. Proceedings article (Best Paper).
In: Proceedings of International Conference on Machine Learning (ICML), 2024.
Received the Best Paper Award at ICML 2024. More details are available on the project website.
Abstract
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
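To make the pretraining setup in the abstract concrete, the following minimal sketch (not the authors' code) shows the core idea of a decoder-only transformer trained with next-token prediction over a single flat token sequence. In VideoPoet, images, video, audio, and text are first mapped to discrete tokens by separate tokenizers and concatenated into one stream; here the vocabulary size, model dimensions, and random token IDs are illustrative placeholders, not the paper's configuration.

# Minimal sketch: causal next-token prediction over a flat multimodal token stream.
# All sizes below are placeholders chosen for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 4096   # placeholder: combined text/video/audio token vocabulary
D_MODEL    = 256
N_HEADS    = 4
N_LAYERS   = 2
MAX_LEN    = 128

class DecoderOnlyLM(nn.Module):
    """Decoder-only transformer over one shared multimodal token sequence."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS,
            dim_feedforward=4 * D_MODEL, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):                      # tokens: (batch, seq)
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each position attends only to earlier tokens.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device),
            diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x)                         # (batch, seq, vocab)

# One pretraining-style step: predict every next token in the mixed sequence.
model = DecoderOnlyLM()
tokens = torch.randint(0, VOCAB_SIZE, (2, 64))      # stand-in for tokenized text+video+audio
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
loss.backward()
print(f"next-token loss: {loss.item():.3f}")

In this framing, different conditioning tasks (text-to-video, video-to-audio, inpainting, and so on) differ mainly in which spans of the token sequence are given as context and which are predicted, which is what lets one pretrained model be adapted to many video generation tasks.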
