A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is
Publications:
Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang
Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation Proceedings Article
In: Proceedings of European Conference on Computer Vision (ECCV), 2024.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, ECCV, generative AI, google, reinforcement learning
@inproceedings{2024-Lee-PPMRLFTG,
title = {Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation},
author = {Seung Hyun Lee and Yinxiao Li and Junjie Ke and Innfarn Yoo and Han Zhang and Jiahui Yu and Qifei Wang and Fei Deng and Glenn Entis and Junfeng He and Gang Li and Sangpil Kim and Irfan Essa and Feng Yang
},
url = {https://arxiv.org/abs/2401.05675
https://arxiv.org/pdf/2401.05675
https://dl.acm.org/doi/10.1007/978-3-031-72920-1_26},
doi = {10.48550/arXiv.2401.05675},
year = {2024},
date = {2024-07-25},
urldate = {2024-07-25},
booktitle = {Proceedings of European Conference on Computer Vision (ECCV)
},
abstract = {Recent works have demonstrated that using reinforcement learning (RL) with multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting reward weights poses challenges and may cause over-optimization in certain metrics. To solve this, we propose Parrot, which addresses the issue through multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate Pareto optimal. Utilizing batch-wise Pareto optimal selection, Parrot automatically identifies the optimal trade-off among different rewards. We use the novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, resulting in significant improvement of image quality and also allow to control the trade-off of different rewards using a reward related prompt during inference. Furthermore, we introduce original prompt-centered guidance at inference time, ensuring fidelity to user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
},
keywords = {arXiv, computer vision, ECCV, generative AI, google, reinforcement learning},
pubstate = {published},
tppubtype = {inproceedings}
}
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama
Photorealistic Video Generation with Diffusion Models Proceedings Article
In: European Conference on Computer Vision (ECCV), 2024.
Abstract | Links | BibTeX | Tags: arXiv, computational video, computer vision, generative AI, google
@inproceedings{2024-Gupta-PVGWDM,
title = {Photorealistic Video Generation with Diffusion Models},
author = {Agrim Gupta and Lijun Yu and Kihyuk Sohn and Xiuye Gu and Meera Hahn and Li Fei-Fei and Irfan Essa and Lu Jiang and José Lezama
},
url = {https://walt-video-diffusion.github.io/
https://arxiv.org/abs/2312.06662
https://arxiv.org/pdf/2312.06662
},
doi = {10.48550/arXiv.2312.06662},
year = {2024},
date = {2024-07-25},
urldate = {2024-07-25},
booktitle = {European Conference on Computer Vision (ECCV)},
abstract = {We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of 512×896 resolution at 8 frames per second.},
keywords = {arXiv, computational video, computer vision, generative AI, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang
VideoPoet: A large language model for zero-shot video generation (Best Paper) Proceedings Article
In: Proceedings of International Conference on Machine Learning (ICML), 2024.
Abstract | Links | BibTeX | Tags: arXiv, best paper award, computational video, computer vision, generative AI, google, ICML
@inproceedings{2024-Kondratyuk-VLLMZVG,
title = {VideoPoet: A large language model for zero-shot video generation},
author = {Dan Kondratyuk and Lijun Yu and Xiuye Gu and José Lezama and Jonathan Huang and Grant Schindler and Rachel Hornung and Vighnesh Birodkar and Jimmy Yan and Ming-Chang Chiu and Krishna Somandepalli and Hassan Akbari and Yair Alon and Yong Cheng and Josh Dillon and Agrim Gupta and Meera Hahn and Anja Hauth and David Hendon and Alonso Martinez and David Minnen and Mikhail Sirotenko and Kihyuk Sohn and Xuan Yang and Hartwig Adam and Ming-Hsuan Yang and Irfan Essa and Huisheng Wang and David A. Ross and Bryan Seybold and Lu Jiang
},
url = {https://arxiv.org/pdf/2312.14125
https://arxiv.org/abs/2312.14125
https://sites.research.google/videopoet/},
doi = {10.48550/arXiv.2312.14125},
year = {2024},
date = {2024-07-23},
urldate = {2024-07-23},
booktitle = {Proceedings of International Conference on Machine Learning (ICML)},
abstract = {We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
},
keywords = {arXiv, best paper award, computational video, computer vision, generative AI, google, ICML},
pubstate = {published},
tppubtype = {inproceedings}
}
Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, Humphrey Shi
Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models Proceedings Article
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8682–8692, 2024.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, CVPR, generative AI
@inproceedings{2024-Xu-PDTTTDM,
title = {Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models},
author = {Xingqian Xu and Jiayi Guo and Zhangyang Wang and Gao Huang and Irfan Essa and Humphrey Shi
},
url = {https://openaccess.thecvf.com/content/CVPR2024/papers/Xu_Prompt-Free_Diffusion_Taking_Text_out_of_Text-to-Image_Diffusion_Models_CVPR_2024_paper.pdf
https://openaccess.thecvf.com/content/CVPR2024/html/Xu_Prompt-Free_Diffusion_Taking_Text_out_of_Text-to-Image_Diffusion_Models_CVPR_2024_paper.html
https://arxiv.org/abs/2305.16223
},
doi = {10.48550/arXiv.2305.16223},
year = {2024},
date = {2024-06-18},
urldate = {2024-06-18},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
},
pages = {8682--8692},
abstract = {Text-to-image (T2I) research has grown explosively in the past year owing to the large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: the text prompt engineering and searching high-quality text prompts for customized results is more art than science. Moreover as commonly argued: "an image is worth a thousand words" - the attempt to describe a desired image with texts often ends up being ambiguous and cannot comprehensively cover delicate visual details hence necessitating more additional controls from the visual domain. In this paper we take a bold step forward: taking "Text" out of a pretrained T2I diffusion model to reduce the burdensome prompt engineering efforts for users. Our proposed framework Prompt-Free Diffusion relies on only visual inputs to generate new images: it takes a reference image as "context" an optional image structural conditioning and an initial noise with absolutely no text prompt. The core architecture behind the scene is Semantic Context Encoder (SeeCoder) substituting the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can also pre-train a SeeCoder in one T2I model and reuse it for another. Through extensive experiments Prompt-Free Diffusion is experimentally found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts following the best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on with promising quality. Our code and models will be open-sourced.
},
keywords = {arXiv, computer vision, CVPR, generative AI},
pubstate = {published},
tppubtype = {inproceedings}
}
Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation Proceedings Article
In: Proceedings of International Conference on Learning Representations (ICLR), 2024.
Abstract | Links | BibTeX | Tags: AI, arXiv, computer vision, generative AI, google, ICLR
@inproceedings{2024-Yu-LMBDVG,
title = {Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation},
author = {Lijun Yu and José Lezama and Nitesh B. Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Vighnesh Birodkar and Agrim Gupta and Xiuye Gu and Alexander G. Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A. Ross and Lu Jiang},
url = {https://arxiv.org/abs/2310.05737
https://arxiv.org/pdf/2310.05737},
doi = {10.48550/arXiv.2310.05737},
year = {2024},
date = {2024-05-14},
urldate = {2024-05-14},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)
},
abstract = {While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
},
keywords = {AI, arXiv, computer vision, generative AI, google, ICLR},
pubstate = {published},
tppubtype = {inproceedings}
}
Harish Haresamudram, Irfan Essa, Thomas Ploetz
Towards Learning Discrete Representations via Self-Supervision for Wearables-Based Human Activity Recognition Journal Article
In: Sensors, vol. 24, no. 4, 2024.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, wearable computing
@article{2023-Haresamudram-TLDRSWHAR,
title = {Towards Learning Discrete Representations via Self-Supervision for Wearables-Based Human Activity Recognition},
author = {Harish Haresamudram and Irfan Essa and Thomas Ploetz},
url = {https://arxiv.org/abs/2306.01108
https://www.mdpi.com/1424-8220/24/4/1238},
doi = {10.48550/arXiv.2306.01108},
year = {2024},
date = {2024-02-24},
urldate = {2023-06-01},
journal = {Sensors},
volume = {24},
number = {4},
abstract = {Human activity recognition (HAR) in wearable computing is typically based on direct processing of sensor data. Sensor readings are translated into representations, either derived through dedicated preprocessing, or integrated into end-to-end learning. Independent of their origin, for the vast majority of contemporary HAR, those representations are typically continuous in nature. That has not always been the case. In the early days of HAR, discretization approaches have been explored - primarily motivated by the desire to minimize computational requirements, but also with a view on applications beyond mere recognition, such as, activity discovery, fingerprinting, or large-scale search. Those traditional discretization approaches, however, suffer from substantial loss in precision and resolution in the resulting representations with detrimental effects on downstream tasks. Times have changed and in this paper we propose a return to discretized representations. We adopt and apply recent advancements in Vector Quantization (VQ) to wearables applications, which enables us to directly learn a mapping between short spans of sensor data and a codebook of vectors, resulting in recognition performance that is generally on par with their contemporary, continuous counterparts - sometimes surpassing them. Therefore, this work presents a proof-of-concept for demonstrating how effective discrete representations can be derived, enabling applications beyond mere activity classification but also opening up the field to advanced tools for the analysis of symbolic sequences, as they are known, for example, from domains such as natural language processing. Based on an extensive experimental evaluation on a suite of wearables-based benchmark HAR tasks, we demonstrate the potential of our learned discretization scheme and discuss how discretized sensor data analysis can lead to substantial changes in HAR.},
howpublished = {arXiv:2306.01108},
keywords = {activity recognition, arXiv, wearable computing},
pubstate = {published},
tppubtype = {article}
}
Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, Yuan Hao, Irfan Essa, Michael Rubinstein, Dilip Krishnan
StyleDrop: Text-to-Image Generation in Any Style Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2023.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, generative AI, google, NeurIPS
@inproceedings{2023-Sohn-STGS,
title = {StyleDrop: Text-to-Image Generation in Any Style},
author = {Kihyuk Sohn and Nataniel Ruiz and Kimin Lee and Daniel Castro Chin and Irina Blok and Huiwen Chang and Jarred Barber and Lu Jiang and Glenn Entis and Yuanzhen Li and Yuan Hao and Irfan Essa and Michael Rubinstein and Dilip Krishnan},
url = {https://arxiv.org/abs/2306.00983
https://openreview.net/forum?id=KoaFh16uOc
https://proceedings.neurips.cc/paper_files/paper/2023/hash/d33b177b69425e7685b0b1c05bd2a5e4-Abstract-Conference.html},
doi = {10.48550/arXiv.2306.00983},
year = {2023},
date = {2023-12-11},
urldate = {2023-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize image styles, that leverage a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters) and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: this https URL},
howpublished = {arXiv:2306.00983},
keywords = {arXiv, computer vision, generative AI, google, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2023.
Abstract | Links | BibTeX | Tags: arXiv, computational video, computer vision, generative AI, NeurIPS
@inproceedings{2023-Yu-SSPAMGWFL,
title = {SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs},
author = {Lijun Yu and Yong Cheng and Zhiruo Wang and Vivek Kumar and Wolfgang Macherey and Yanping Huang and David A. Ross and Irfan Essa and Yonatan Bisk and Ming-Hsuan Yang and Kevin Murphy and Alexander G. Hauptmann and Lu Jiang},
url = {https://arxiv.org/abs/2306.17842
https://openreview.net/forum?id=CXPUg86A1D
https://proceedings.neurips.cc/paper_files/paper/2023/hash/a526cc8f6ffb74bedb6ff313e3fdb450-Abstract-Conference.html},
doi = {10.48550/arXiv.2306.17842},
year = {2023},
date = {2023-12-11},
urldate = {2023-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.},
howpublished = {Advances in Neural Information Processing Systems (NeurIPS) (arXiv:2306.17842v2)},
keywords = {arXiv, computational video, computer vision, generative AI, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
Nikolai Warner, Meera Hahn, Jonathan Huang, Irfan Essa, Vighnesh Birodkar
Text and Click inputs for unambiguous open vocabulary instance segmentation Proceedings Article
In: Proceedings of the British Machine Vision Conference (BMVC), 2023.
Abstract | Links | BibTeX | Tags: arXiv, BMVC, computer vision, google, image segmentation
@inproceedings{2023-Warner-TACIFUOVIS,
title = {Text and Click inputs for unambiguous open vocabulary instance segmentation},
author = {Nikolai Warner and Meera Hahn and Jonathan Huang and Irfan Essa and Vighnesh Birodkar},
url = {https://doi.org/10.48550/arXiv.2311.14822
https://arxiv.org/abs/2311.14822
https://arxiv.org/pdf/2311.14822.pdf},
doi = {10.48550/arXiv.2311.14822},
year = {2023},
date = {2023-11-24},
urldate = {2023-11-24},
booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
abstract = {Segmentation localizes objects in an image on a fine-grained per-pixel scale. Segmentation benefits by humans-in-the-loop to provide additional input of objects to segment using a combination of foreground or background clicks. Tasks include photoediting or novel dataset annotation, where human annotators leverage an existing segmentation model instead of drawing raw pixel level annotations. We propose a new segmentation process, Text + Click segmentation, where a model takes as input an image, a text phrase describing a class to segment, and a single foreground click specifying the instance to segment. Compared to previous approaches, we leverage open-vocabulary image-text models to support a wide-range of text prompts. Conditioning segmentations on text prompts improves the accuracy of segmentations on novel or unseen classes. We demonstrate that the combination of a single user-specified foreground click and a text prompt allows a model to better disambiguate overlapping or co-occurring semantic categories, such as "tie", "suit", and "person". We study these results across common segmentation datasets such as refCOCO, COCO, VOC, and OpenImages. Source code available here.
},
keywords = {arXiv, BMVC, computer vision, google, image segmentation},
pubstate = {published},
tppubtype = {inproceedings}
}
K. Niranjan Kumar, Irfan Essa, Sehoon Ha
Words into Action: Learning Diverse Humanoid Robot Behaviors using Language Guided Iterative Motion Refinement Proceedings Article
In: CoRL Workshop on Language and Robot Learning: Language as Grounding (with CoRL 2023), 2023.
Abstract | Links | BibTeX | Tags: arXiv, CoRL, robotics, vision & language
@inproceedings{2023-Kumar-WIALDHRBULGIM,
title = {Words into Action: Learning Diverse Humanoid Robot Behaviors using Language Guided Iterative Motion Refinement},
author = {K. Niranjan Kumar and Irfan Essa and Sehoon Ha},
url = {https://doi.org/10.48550/arXiv.2310.06226
https://arxiv.org/abs/2310.06226
https://arxiv.org/pdf/2310.06226.pdf
https://www.kniranjankumar.com/words_into_action/
},
doi = {10.48550/arXiv.2310.06226},
year = {2023},
date = {2023-11-01},
urldate = {2023-11-01},
booktitle = {CoRL Workshop on Language and Robot Learning: Language as Grounding (with CoRL 2023)},
abstract = {We present a method to simplify controller design by enabling users to train and fine-tune robot control policies using natural language commands. We first learn a neural network policy that generates behaviors given a natural language command, such as “walk forward”, by combining Large Language Models (LLMs), motion retargeting, and motion imitation. Based on the synthesized motion, we iteratively fine-tune by updating the text prompt and querying LLMs to find the best checkpoint associated with the closest motion in history.},
keywords = {arXiv, CoRL, robotics, vision & language},
pubstate = {published},
tppubtype = {inproceedings}
}
Kihyuk Sohn, Albert Shaw, Yuan Hao, Han Zhang, Luisa Polania, Huiwen Chang, Lu Jiang, Irfan Essa
Learning Disentangled Prompts for Compositional Image Synthesis Technical Report
2023.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, generative AI, google, prompt engineering
@techreport{2023-Sohn-LDPCIS,
title = {Learning Disentangled Prompts for Compositional Image Synthesis},
author = {Kihyuk Sohn and Albert Shaw and Yuan Hao and Han Zhang and Luisa Polania and Huiwen Chang and Lu Jiang and Irfan Essa},
url = {https://arxiv.org/abs/2306.00763},
doi = {10.48550/arXiv.2306.00763},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
abstract = {We study domain-adaptive image synthesis, the problem of teaching pretrained image generative models a new style or concept from as few as one image to synthesize novel images, to better understand the compositional image synthesis. We present a framework that leverages a pre-trained class-conditional generation model and visual prompt tuning. Specifically, we propose a novel source class distilled visual prompt that learns disentangled prompts of semantic (e.g., class) and domain (e.g., style) from a few images. Learned domain prompt is then used to synthesize images of any classes in the style of target domain. We conduct studies on various target domains with the number of images ranging from one to a few to many, and show qualitative results which show the compositional generalization of our method. Moreover, we show that our method can help improve zero-shot domain adaptation classification accuracy.
},
howpublished = {arXiv:2306.00763 },
keywords = {arXiv, computer vision, generative AI, google, prompt engineering},
pubstate = {published},
tppubtype = {techreport}
}
Apoorva Beedu, Zhile Ren, Varun Agrawal, Irfan Essa
VideoPose: Estimating 6D object pose from videos Technical Report
2021.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, object detection, pose estimation
@techreport{2021-Beedu-VEOPFV,
title = {VideoPose: Estimating 6D object pose from videos},
author = {Apoorva Beedu and Zhile Ren and Varun Agrawal and Irfan Essa},
url = {https://arxiv.org/abs/2111.10677},
doi = {10.48550/arXiv.2111.10677},
year = {2021},
date = {2021-11-01},
urldate = {2021-11-01},
journal = {arXiv preprint arXiv:2111.10677},
abstract = {We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos. Our approach leverages the temporal information from a video sequence, and is computationally efficient and robust to support robotic and AR domains. Our proposed network takes a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame. Experimental evaluation on the YCB-Video dataset show that our approach is on par with the state-of-the-art algorithms. Further, with a speed of 30 fps, it is also more efficient than the state-of-the-art, and therefore applicable to a variety of applications that require real-time object pose estimation.},
keywords = {arXiv, computer vision, object detection, pose estimation},
pubstate = {published},
tppubtype = {techreport}
}
Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song
Neural Temporal Logic Programming Technical Report
2021.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, machine learning, openreview
@techreport{2021-Samel-NTLP,
title = {Neural Temporal Logic Programming},
author = {Karan Samel and Zelin Zhao and Binghong Chen and Shuang Li and Dharmashankar Subramanian and Irfan Essa and Le Song},
url = {https://openreview.net/forum?id=i7h4M45tU8},
year = {2021},
date = {2021-09-01},
urldate = {2021-09-01},
abstract = {Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher-level composite events. Examples of a composite event are a patient's medical symptom or a baseball player hitting a home run, caused by distinct temporal orderings of patient vitals and player movements respectively. Such salient composite events are provided as labels in temporal datasets and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP) which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite event labels for supervision. This is done through efficiently searching through the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and on healthcare data where it outperforms the baseline methods for rule discovery.},
howpublished = {https://openreview.net/forum?id=i7h4M45tU8},
keywords = {activity recognition, arXiv, machine learning, openreview},
pubstate = {published},
tppubtype = {techreport}
}
Dan Scarafoni, Irfan Essa, Thomas Ploetz
PLAN-B: Predicting Likely Alternative Next Best Sequences for Action Prediction Technical Report
no. arXiv:2103.15987, 2021.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, computer vision
@techreport{2021-Scarafoni-PPLANBSAP,
title = {PLAN-B: Predicting Likely Alternative Next Best Sequences for Action Prediction},
author = {Dan Scarafoni and Irfan Essa and Thomas Ploetz},
url = {https://arxiv.org/abs/2103.15987},
doi = {10.48550/arXiv.2103.15987},
year = {2021},
date = {2021-03-01},
urldate = {2021-03-01},
journal = {arXiv},
number = {arXiv:2103.15987},
abstract = {Action prediction focuses on anticipating actions before they happen. Recent works leverage probabilistic approaches to describe future uncertainties and sample future actions. However, these methods cannot easily find all alternative predictions, which are essential given the inherent unpredictability of the future, and current evaluation protocols do not measure a system's ability to find such alternatives. We re-examine action prediction in terms of its ability to predict not only the top predictions, but also top alternatives with the accuracy@k metric. In addition, we propose Choice F1: a metric inspired by F1 score which evaluates a prediction system's ability to find all plausible futures while keeping only the most probable ones. To evaluate this problem, we present a novel method, Predicting the Likely Alternative Next Best, or PLAN-B, for action prediction which automatically finds the set of most likely alternative futures. PLAN-B consists of two novel components: (i) a Choice Table which ensures that all possible futures are found, and (ii) a "Collaborative" RNN system which combines both action sequence and feature information. We demonstrate that our system outperforms state-of-the-art results on benchmark datasets.
},
keywords = {activity recognition, arXiv, computer vision},
pubstate = {published},
tppubtype = {techreport}
}
Erik Wijmans, Julian Straub, Dhruv Batra, Irfan Essa, Judy Hoffman, Ari Morcos
Analyzing Visual Representations in Embodied Navigation Tasks Technical Report
no. arXiv:2003.05993, 2020.
Abstract | Links | BibTeX | Tags: arXiv, embodied agents, navigation
@techreport{2020-Wijmans-AVRENT,
title = {Analyzing Visual Representations in Embodied Navigation Tasks},
author = {Erik Wijmans and Julian Straub and Dhruv Batra and Irfan Essa and Judy Hoffman and Ari Morcos},
url = {https://arxiv.org/abs/2003.05993
https://arxiv.org/pdf/2003.05993},
doi = {10.48550/arXiv.2003.05993},
year = {2020},
date = {2020-03-01},
urldate = {2020-03-01},
journal = {arXiv},
number = {arXiv:2003.05993},
abstract = {Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are often over specialized to the target task. In this work, we present a methodology to study the underlying potential causes for this specialization. We use the recently proposed projection weighted Canonical Correlation Analysis (PWCCA) to measure the similarity of visual representations learned in the same environment by performing different tasks.
We then leverage our proposed methodology to examine the task dependence of visual representations learned on related but distinct embodied navigation tasks. Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures. We then empirically demonstrate that visual representations learned on one task can be effectively transferred to a different task.},
howpublished = {arXiv:2003.05993},
keywords = {arXiv, embodied agents, navigation},
pubstate = {published},
tppubtype = {techreport}
}
Jonathan C Balloch, Varun Agrawal, Irfan Essa, Sonia Chernova
Unbiasing Semantic Segmentation For Robot Perception using Synthetic Data Feature Transfer Technical Report
no. arXiv:1809.03676, 2018.
Abstract | Links | BibTeX | Tags: arXiv, robotics, scene understanding
@techreport{2018-Balloch-USSRPUSDFT,
title = {Unbiasing Semantic Segmentation For Robot Perception using Synthetic Data Feature Transfer},
author = {Jonathan C Balloch and Varun Agrawal and Irfan Essa and Sonia Chernova},
url = {https://doi.org/10.48550/arXiv.1809.03676},
doi = {10.48550/arXiv.1809.03676},
year = {2018},
date = {2018-09-01},
urldate = {2018-09-01},
journal = {arXiv},
number = {arXiv:1809.03676},
abstract = {Robot perception systems need to perform reliable image segmentation in real-time on noisy, raw perception data. State-of-the-art segmentation approaches use large CNN models and carefully constructed datasets; however, these models focus on accuracy at the cost of real-time inference. Furthermore, the standard semantic segmentation datasets are not large enough for training CNNs without augmentation and are not representative of noisy, uncurated robot perception data. We propose improving the performance of real-time segmentation frameworks on robot perception data by transferring features learned from synthetic segmentation data. We show that pretraining real-time segmentation architectures with synthetic segmentation data instead of ImageNet improves fine-tuning performance by reducing the bias learned in pretraining and closing the \textit{transfer gap} as a result. Our experiments show that our real-time robot perception models pretrained on synthetic data outperform those pretrained on ImageNet for every scale of fine-tuning data examined. Moreover, the degree to which synthetic pretraining outperforms ImageNet pretraining increases as the availability of robot data decreases, making our approach attractive for robotics domains where dataset collection is hard and/or expensive.
},
howpublished = {arXiv:1809.03676},
keywords = {arXiv, robotics, scene understanding},
pubstate = {published},
tppubtype = {techreport}
}
Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar
Object category learning and retrieval with weak supervision Technical Report
no. arXiv:1801.08985, 2018.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, machine learning, object detection
@techreport{2018-Hickson-OCLRWWS,
title = {Object category learning and retrieval with weak supervision},
author = {Steven Hickson and Anelia Angelova and Irfan Essa and Rahul Sukthankar},
url = {https://arxiv.org/abs/1801.08985
https://arxiv.org/pdf/1801.08985},
doi = {10.48550/arXiv.1801.08985},
year = {2018},
date = {2018-07-01},
urldate = {2018-07-01},
journal = {arXiv},
number = {arXiv:1801.08985},
abstract = {We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision. To that end, we propose a fully differentiable unsupervised deep clustering approach to learn semantic classes in an end-to-end fashion without individual class labeling using only unlabeled object proposals. The key contributions of our work are 1) a kmeans clustering objective where the clusters are learned as parameters of the network and are represented as memory units, and 2) simultaneously building a feature representation, or embedding, while learning to cluster it. This approach shows promising results on two popular computer vision datasets: on CIFAR10 for clustering objects, and on the more complex and challenging Cityscapes dataset for semantically discovering classes which visually correspond to cars, people, and bicycles. Currently, the only supervision provided is segmentation objectness masks, but this method can be extended to use an unsupervised objectness-based object generation mechanism which will make the approach completely unsupervised.
},
howpublished = {arXiv:1801.08985},
keywords = {arXiv, computer vision, machine learning, object detection},
pubstate = {published},
tppubtype = {techreport}
}
Huda Alamri, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Jue Wang, Irfan Essa, Dhruv Batra, Devi Parikh, Anoop Cherian, Tim K Marks, Chiori Hori
Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7 Technical Report
no. arXiv:1806.00525, 2018.
Abstract | Links | BibTeX | Tags: arXiv, embodied agents, multimedia, vision & language
@techreport{2018-Alamri-AVSDACD,
title = {Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7},
author = {Huda Alamri and Vincent Cartillier and Raphael Gontijo Lopes and Abhishek Das and Jue Wang and Irfan Essa and Dhruv Batra and Devi Parikh and Anoop Cherian and Tim K Marks and Chiori Hori},
url = {https://video-dialog.com/
https://arxiv.org/abs/1806.00525},
doi = {10.48550/arXiv.1806.00525},
year = {2018},
date = {2018-06-01},
urldate = {2018-06-01},
journal = {arXiv},
number = {arXiv:1806.00525},
abstract = {Scene-aware dialog systems will be able to have conversations with users about the objects and events around them. Progress on such systems can be made by integrating state-of-the-art technologies from multiple research areas, including end-to-end dialog systems, visual dialog, and video description. We introduce the Audio Visual Scene Aware Dialog (AVSD) challenge and dataset. In this challenge, which is one track of the 7th Dialog System Technology Challenges (DSTC7) workshop, the task is to build a system that generates responses in a dialog about an input video.
},
howpublished = {arXiv:1806.00525},
keywords = {arXiv, embodied agents, multimedia, vision & language},
pubstate = {published},
tppubtype = {techreport}
}
Other Publication Sites
A few more sites that aggregate research publications: Academia.edu, Bibsonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.