Publications

1.

Kyungmin Lee, Xiaohang Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, Yinxiao Li

Calibrated Multi-Preference Optimization for Aligning Diffusion Models Proceedings Article

In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2025.

Tags: computer vision, CVPR, generative media, google, reinforcement learning

@inproceedings{2025-Lee-CMOADM,

title = {Calibrated Multi-Preference Optimization for Aligning Diffusion Models},

author = {Kyungmin Lee and Xiaohang Li and Qifei Wang and Junfeng He and Junjie Ke and Ming-Hsuan Yang and Irfan Essa and Jinwoo Shin and Feng Yang and Yinxiao Li},

url = {https://cvpr.thecvf.com/virtual/2025/poster/33781

https://arxiv.org/abs/2502.02588},

doi = {10.48550/arXiv.2502.02588},

year  = {2025},

date = {2025-06-13},

urldate = {2025-06-13},

booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},

abstract = {Aligning text-to-image (T2I) diffusion models with preference optimization is valuable for human-annotated datasets, but the heavy cost of manual data collection limits scalability. Using reward models offers an alternative, however, current preference optimization methods fall short in exploiting the rich information, as they only consider pairwise preference distribution. Furthermore, they lack generalization to multi-preference scenarios and struggle to handle inconsistencies between rewards. To address this, we present Calibrated Preference Optimization (CaPO), a novel method to align T2I diffusion models by incorporating the general preference from multiple reward models without human annotated data. The core of our approach involves a reward calibration method to approximate the general preference by computing the expected win-rate against the samples generated by the pretrained models. Additionally, we propose a frontier-based pair selection method that effectively manages the multi-preference distribution by selecting pairs from Pareto frontiers. Finally, we use regression loss to fine-tune diffusion models to match the difference between calibrated rewards of a selected pair. Experimental results show that CaPO consistently outperforms prior methods, such as Direct Preference Optimization (DPO), in both single and multi-reward settings validated by evaluation on T2I benchmarks, including GenEval and T2I-Compbench.},

keywords = {computer vision, CVPR, generative media, google, reinforcement learning},

pubstate = {published},

tppubtype = {inproceedings}

}


2.

Seung Hyun Lee, Jijun Jiang, Yiran Xu, Zhuofang Li, Junjie Ke, Yinxiao Li, Junfeng He, Steven Hickson, Katie Datsenko, Sangpil Kim, Ming-Hsuan Yang, Irfan Essa, Feng Yang

Cropper: Vision-Language Model for Image Cropping through In-Context Learning Proceedings Article

In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2025.

Tags: computer vision, CVPR, generative AI, generative media, google

3.

Gong Zhang, Kihyuk Sohn, Meera Hahn, Humphrey Shi, Irfan Essa

FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models Proceedings Article

In: Advances in Neural Information Processing Systems (NeurIPS), 2024.

Tags: computer vision, generative AI, generative media, machine learning, NeurIPS

4.

Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang

Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation Proceedings Article

In: Proceedings of European Conference on Computer Vision (ECCV), 2024.

Tags: arXiv, computer vision, ECCV, generative AI, google, reinforcement learning

@inproceedings{2024-Lee-PPMRLFTG,

title = {Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation},

author = {Seung Hyun Lee and Yinxiao Li and Junjie Ke and Innfarn Yoo and Han Zhang and Jiahui Yu and Qifei Wang and Fei Deng and Glenn Entis and Junfeng He and Gang Li and Sangpil Kim and Irfan Essa and Feng Yang},

url = {https://arxiv.org/abs/2401.05675

https://arxiv.org/pdf/2401.05675

https://dl.acm.org/doi/10.1007/978-3-031-72920-1_26},

doi = {10.48550/arXiv.2401.05675},

year  = {2024},

date = {2024-07-25},

urldate = {2024-07-25},

booktitle = {Proceedings of European Conference on Computer Vision (ECCV)},

abstract = {Recent works have demonstrated that using reinforcement learning (RL) with multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting reward weights poses challenges and may cause over-optimization in certain metrics. To solve this, we propose Parrot, which addresses the issue through multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate Pareto optimal. Utilizing batch-wise Pareto optimal selection, Parrot automatically identifies the optimal trade-off among different rewards. We use the novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, resulting in significant improvement of image quality and also allow to control the trade-off of different rewards using a reward related prompt during inference. Furthermore, we introduce original prompt-centered guidance at inference time, ensuring fidelity to user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.

},

keywords = {arXiv, computer vision, ECCV, generative AI, google, reinforcement learning},

pubstate = {published},

tppubtype = {inproceedings}

}


5.

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

Photorealistic Video Generation with Diffusion Models Proceedings Article

In: European Conference on Computer Vision (ECCV), 2024.

Tags: arXiv, computational video, computer vision, generative AI, google

6.

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang

VideoPoet: A large language model for zero-shot video generation Best Paper Proceedings Article

In: Proceedings of International Conference on Machine Learning (ICML), 2024.

Tags: arXiv, best paper award, computational video, computer vision, generative AI, google, ICML

@inproceedings{2024-Kondratyuk-VLLMZVG,

title = {VideoPoet: A large language model for zero-shot video generation},

author = {Dan Kondratyuk and Lijun Yu and Xiuye Gu and José Lezama and Jonathan Huang and Grant Schindler and Rachel Hornung and Vighnesh Birodkar and Jimmy Yan and Ming-Chang Chiu and Krishna Somandepalli and Hassan Akbari and Yair Alon and Yong Cheng and Josh Dillon and Agrim Gupta and Meera Hahn and Anja Hauth and David Hendon and Alonso Martinez and David Minnen and Mikhail Sirotenko and Kihyuk Sohn and Xuan Yang and Hartwig Adam and Ming-Hsuan Yang and Irfan Essa and Huisheng Wang and David A. Ross and Bryan Seybold and Lu Jiang},

url = {https://arxiv.org/pdf/2312.14125

https://arxiv.org/abs/2312.14125

https://sites.research.google/videopoet/},

doi = {10.48550/arXiv.2312.14125},

year  = {2024},

date = {2024-07-23},

urldate = {2024-07-23},

booktitle = {Proceedings of International Conference on Machine Learning (ICML)},

abstract = {We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/

},

keywords = {arXiv, best paper award, computational video, computer vision, generative AI, google, ICML},

pubstate = {published},

tppubtype = {inproceedings}

}


7.

Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, Humphrey Shi

Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models Proceedings Article

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8682–8692, 2024.

Tags: arXiv, computer vision, CVPR, generative AI

@inproceedings{2024-Xu-PDTTTDM,

title = {Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models},

author = {Xingqian Xu and Jiayi Guo and Zhangyang Wang and Gao Huang and Irfan Essa and Humphrey Shi},

url = {https://openaccess.thecvf.com/content/CVPR2024/papers/Xu_Prompt-Free_Diffusion_Taking_Text_out_of_Text-to-Image_Diffusion_Models_CVPR_2024_paper.pdf

https://openaccess.thecvf.com/content/CVPR2024/html/Xu_Prompt-Free_Diffusion_Taking_Text_out_of_Text-to-Image_Diffusion_Models_CVPR_2024_paper.html

https://arxiv.org/abs/2305.16223},

doi = {10.48550/arXiv.2305.16223},

year  = {2024},

date = {2024-06-18},

urldate = {2024-06-18},

booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},

pages = {8682--8692},

abstract = {Text-to-image (T2I) research has grown explosively in the past year, owing to the large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: the text prompt engineering, and searching high-quality text prompts for customized results is more art than science. Moreover, as commonly argued: "an image is worth a thousand words" - the attempt to describe a desired image with texts often ends up being ambiguous and cannot comprehensively cover delicate visual details, hence necessitating more additional controls from the visual domain. In this paper, we take a bold step forward: taking "Text" out of a pretrained T2I diffusion model to reduce the burdensome prompt engineering efforts for users. Our proposed framework, Prompt-Free Diffusion, relies on only visual inputs to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scene is Semantic Context Encoder (SeeCoder), substituting the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can also pre-train a SeeCoder in one T2I model and reuse it for another. Through extensive experiments, Prompt-Free Diffusion is experimentally found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts following the best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models will be open-sourced.},

keywords = {arXiv, computer vision, CVPR, generative AI},

pubstate = {published},

tppubtype = {inproceedings}

}


8.

Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation Proceedings Article

In: Proceedings of International Conference on Learning Representations (ICLR), 2024.

Tags: AI, arXiv, computer vision, generative AI, google, ICLR

9.

Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, Yuan Hao, Irfan Essa, Michael Rubinstein, Dilip Krishnan

StyleDrop: Text-to-Image Generation in Any Style Proceedings Article

In: Advances in Neural Information Processing Systems (NeurIPS), 2023.

Tags: arXiv, computer vision, generative AI, google, NeurIPS

@inproceedings{2023-Sohn-STGS,

title = {StyleDrop: Text-to-Image Generation in Any Style},

author = {Kihyuk Sohn and Nataniel Ruiz and Kimin Lee and Daniel Castro Chin and Irina Blok and Huiwen Chang and Jarred Barber and Lu Jiang and Glenn Entis and Yuanzhen Li and Yuan Hao and Irfan Essa and Michael Rubinstein and Dilip Krishnan},

url = {https://arxiv.org/abs/2306.00983

https://openreview.net/forum?id=KoaFh16uOc

https://proceedings.neurips.cc/paper_files/paper/2023/hash/d33b177b69425e7685b0b1c05bd2a5e4-Abstract-Conference.html},

doi = {10.48550/arXiv.2306.00983},

year  = {2023},

date = {2023-12-11},

urldate = {2023-12-11},

booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},

abstract = {Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize image styles, that leverage a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters) and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website.},

howpublished = {arXiv:2306.00983},

keywords = {arXiv, computer vision, generative AI, google, NeurIPS},

pubstate = {published},

tppubtype = {inproceedings}

}


10.

Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs Proceedings Article

In: Advances in Neural Information Processing Systems (NeurIPS), 2023.

Tags: arXiv, computational video, computer vision, generative AI, NeurIPS

11.

Nikolai Warner, Meera Hahn, Jonathan Huang, Irfan Essa, Vighnesh Birodkar

Text and Click inputs for unambiguous open vocabulary instance segmentation Proceedings Article

In: Proceedings of the British Machine Vision Conference (BMVC), 2023.

Tags: arXiv, BMVC, computer vision, google, image segmentation

12.

Dina Bashkirova, José Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa

MaskSketch: Unpaired Structure-guided Masked Image Generation Proceedings Article

In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.

Tags: computer vision, CVPR, generative AI, generative media, google

@inproceedings{2023-Bashkirova-MUSMIG,

title = {MaskSketch: Unpaired Structure-guided Masked Image Generation},

author = {Dina Bashkirova and José Lezama and Kihyuk Sohn and Kate Saenko and Irfan Essa},

url = {https://arxiv.org/abs/2302.05496

https://openaccess.thecvf.com/content/CVPR2023/papers/Bashkirova_MaskSketch_Unpaired_Structure-Guided_Masked_Image_Generation_CVPR_2023_paper.pdf

https://openaccess.thecvf.com/content/CVPR2023/supplemental/Bashkirova_MaskSketch_Unpaired_Structure-Guided_CVPR_2023_supplemental.pdf},

doi = {10.48550/ARXIV.2302.05496},

year  = {2023},

date = {2023-06-01},

urldate = {2023-06-01},

booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},

abstract = {Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling. MaskSketch utilizes a pre-trained masked generative transformer, requiring no model training or paired supervision, and works with input sketches of different levels of abstraction. We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation, as well as unpaired image-to-image translation approaches.},

keywords = {computer vision, CVPR, generative AI, generative media, google},

pubstate = {published},

tppubtype = {inproceedings}

}


13.

Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang

MAGVIT: Masked Generative Video Transformer Proceedings Article

In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.

Tags: computational video, computer vision, CVPR, generative AI, generative media, google

14.

Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang

Visual Prompt Tuning for Generative Transfer Learning Proceedings Article

In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.

Tags: computer vision, CVPR, generative AI, generative media, google

15.

Kihyuk Sohn, Albert Shaw, Yuan Hao, Han Zhang, Luisa Polania, Huiwen Chang, Lu Jiang, Irfan Essa

Learning Disentangled Prompts for Compositional Image Synthesis Technical Report

2023.

Tags: arXiv, computer vision, generative AI, google, prompt engineering

16.

Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra

Emergence of Maps in the Memories of Blind Navigation Agents Best Paper Proceedings Article

In: Proceedings of International Conference on Learning Representations (ICLR), 2023.

Tags: awards, best paper award, computer vision, google, ICLR, machine learning, robotics

@inproceedings{2023-Wijmans-EMMBNA,

title = {Emergence of Maps in the Memories of Blind Navigation Agents},

author = {Erik Wijmans and Manolis Savva and Irfan Essa and Stefan Lee and Ari S. Morcos and Dhruv Batra},

url = {https://arxiv.org/abs/2301.13261

https://wijmans.xyz/publication/eom/

https://openreview.net/forum?id=lTt4KjHSsyl

https://blog.iclr.cc/2023/03/21/announcing-the-iclr-2023-outstanding-paper-award-recipients/},

doi = {10.48550/ARXIV.2301.13261},

year  = {2023},

date = {2023-05-01},

urldate = {2023-05-01},

booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},

abstract = {Animal navigation research posits that organisms build and maintain internal spatial representations, or maps, of their environment. We ask if machines -- specifically, artificial intelligence (AI) navigation agents -- also build implicit (or 'mental') maps. A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial. Unlike animal navigation, we can judiciously design the agent's perceptual system and control the learning paradigm to nullify alternative navigation mechanisms. Specifically, we train 'blind' agents -- with sensing limited to only egomotion and no other sensing of any kind -- to perform PointGoal navigation ('go to Δ x, Δ y') via reinforcement learning. Our agents are composed of navigation-agnostic components (fully-connected and recurrent neural networks), and our experimental setup provides no inductive bias towards mapping. Despite these harsh conditions, we find that blind agents are (1) surprisingly effective navigators in new environments (~95% success); (2) they utilize memory over long horizons (remembering ~1,000 steps of past experience in an episode); (3) this memory enables them to exhibit intelligent behavior (following walls, detecting collisions, taking shortcuts); (4) there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates; and (5) the emergent maps are selective and task dependent (e.g. the agent 'forgets' exploratory detours). Overall, this paper presents no new techniques for the AI audience, but a surprising finding, an insight, and an explanation.},

keywords = {awards, best paper award, computer vision, google, ICLR, machine learning, robotics},

pubstate = {published},

tppubtype = {inproceedings}

}