A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is
Publications:
Dina Bashkirova, José Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa
MaskSketch: Unpaired Structure-guided Masked Image Generation Inproceedings Forthcoming
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), Forthcoming.
Abstract | Links | BibTeX | Tags: computer vision, CVPR, generative media, google
@inproceedings{2023-Bashkirova-MUSMIG,
title = {MaskSketch: Unpaired Structure-guided Masked Image Generation},
author = { Dina Bashkirova and José Lezama and Kihyuk Sohn and Kate Saenko and Irfan Essa},
url = {https://arxiv.org/abs/2302.05496},
doi = {10.48550/ARXIV.2302.05496},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling. MaskSketch utilizes a pre-trained masked generative transformer, requiring no model training or paired supervision, and works with input sketches of different levels of abstraction. We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation, as well as unpaired image-to-image translation approaches.},
keywords = {computer vision, CVPR, generative media, google},
pubstate = {forthcoming},
tppubtype = {inproceedings}
}
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
MAGVIT: Masked Generative Video Transformer Inproceedings Forthcoming
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), Forthcoming.
Abstract | Links | BibTeX | Tags: computational video, computer vision, CVPR, generative media, google
@inproceedings{2023-Yu-MMGVT,
title = {MAGVIT: Masked Generative Video Transformer},
author = {Lijun Yu and Yong Cheng and Kihyuk Sohn and José Lezama and Han Zhang and Huiwen Chang and Alexander G. Hauptmann and Ming-Hsuan Yang and Yuan Hao and Irfan Essa and Lu Jiang},
url = {https://arxiv.org/abs/2212.05199
https://magvit.cs.cmu.edu/},
doi = {10.48550/ARXIV.2212.05199},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at this https URL.},
keywords = {computational video, computer vision, CVPR, generative media, google},
pubstate = {forthcoming},
tppubtype = {inproceedings}
}
Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang
Visual Prompt Tuning for Generative Transfer Learning Inproceedings Forthcoming
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), Forthcoming.
Abstract | Links | BibTeX | Tags: computer vision, CVPR, generative media, google
@inproceedings{2022-Sohn-VPTGTL,
title = {Visual Prompt Tuning for Generative Transfer Learning},
author = {Kihyuk Sohn and Yuan Hao and José Lezama and Luisa Polania and Huiwen Chang and Han Zhang and Irfan Essa and Lu Jiang},
url = {https://arxiv.org/abs/2210.00990},
doi = {10.48550/ARXIV.2210.00990},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {Transferring knowledge from an image synthesis model trained on a large dataset is a promising direction for learning generative image models from various domains efficiently. While previous works have studied GAN models, we present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens to the autoregressive or non-autoregressive transformers. To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens called prompt to the image token sequence, and introduce a new prompt design for our task. We study on a variety of visual domains, including visual task adaptation benchmark~citezhai2019large, with varying amount of training images, and show effectiveness of knowledge transfer and a significantly better image generation quality over existing works.},
keywords = {computer vision, CVPR, generative media, google},
pubstate = {forthcoming},
tppubtype = {inproceedings}
}
José Lezama, Tim Salimans, Lu Jiang, Huiwen Chang, Jonathan Ho, Irfan Essa
Discrete Predictor-Corrector Diffusion Models for Image Synthesis Inproceedings Forthcoming
In: International Conference on Learning Representations (ICLR), Forthcoming.
Abstract | Links | BibTeX | Tags: computer vision, generative media, google, ICLR, machine learning
@inproceedings{2023-Lezama-DPDMIS,
title = {Discrete Predictor-Corrector Diffusion Models for Image Synthesis},
author = {José Lezama and Tim Salimans and Lu Jiang and Huiwen Chang and Jonathan Ho and Irfan Essa},
url = {https://openreview.net/forum?id=VM8batVBWvg},
year = {2023},
date = {2023-05-01},
urldate = {2023-05-01},
booktitle = {International Conference on Learning Representations (ICLR)},
abstract = {We introduce Discrete Predictor-Corrector diffusion models (DPC), extending predictor-corrector samplers in Gaussian diffusion models to the discrete case. Predictor-corrector samplers are a class of samplers for diffusion models, which improve on ancestral samplers by correcting the sampling distribution of intermediate diffusion states using MCMC methods. In DPC, the Langevin corrector, which does not have a direct counterpart in discrete space, is replaced with a discrete MCMC transition defined by a learned corrector kernel. The corrector kernel is trained to make the correction steps achieve asymptotic convergence, in distribution, to the correct marginal of the intermediate diffusion states. Equipped with DPC, we revisit recent transformer-based non-autoregressive generative models through the lens of discrete diffusion, and find that DPC can alleviate the compounding decoding error due to the parallel sampling of visual tokens. Our experiments show that DPC improves upon existing discrete latent space models for class-conditional image generation on ImageNet, and outperforms continuous diffusion models and GANs, according to standard metrics and user preference studies},
keywords = {computer vision, generative media, google, ICLR, machine learning},
pubstate = {forthcoming},
tppubtype = {inproceedings}
}
Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra
Emergence of Maps in the Memories of Blind Navigation Agents Inproceedings Forthcoming
In: Proceedings of International Conference on Learning Representations (ICLR), Forthcoming.
Abstract | Links | BibTeX | Tags: awards, best paper award, computer vision, google, ICLR, machine learning, robotics
@inproceedings{2023-Wijmans-EMMBNA,
title = {Emergence of Maps in the Memories of Blind Navigation Agents},
author = {Erik Wijmans and Manolis Savva and Irfan Essa and Stefan Lee and Ari S. Morcos and Dhruv Batra},
url = {https://arxiv.org/abs/2301.13261
https://openreview.net/forum?id=lTt4KjHSsyl
https://blog.iclr.cc/2023/03/21/announcing-the-iclr-2023-outstanding-paper-award-recipients/},
doi = {10.48550/ARXIV.2301.13261},
year = {2023},
date = {2023-05-01},
urldate = {2023-05-01},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
abstract = {Animal navigation research posits that organisms build and maintain internal spatial representations, or maps, of their environment. We ask if machines -- specifically, artificial intelligence (AI) navigation agents -- also build implicit (or 'mental') maps. A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial. Unlike animal navigation, we can judiciously design the agent's perceptual system and control the learning paradigm to nullify alternative navigation mechanisms. Specifically, we train 'blind' agents -- with sensing limited to only egomotion and no other sensing of any kind -- to perform PointGoal navigation ('go to Δ x, Δ y') via reinforcement learning. Our agents are composed of navigation-agnostic components (fully-connected and recurrent neural networks), and our experimental setup provides no inductive bias towards mapping. Despite these harsh conditions, we find that blind agents are (1) surprisingly effective navigators in new environments (~95% success); (2) they utilize memory over long horizons (remembering ~1,000 steps of past experience in an episode); (3) this memory enables them to exhibit intelligent behavior (following walls, detecting collisions, taking shortcuts); (4) there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates; and (5) the emergent maps are selective and task dependent (e.g. the agent 'forgets' exploratory detours). Overall, this paper presents no new techniques for the AI audience, but a surprising finding, an insight, and an explanation.},
keywords = {awards, best paper award, computer vision, google, ICLR, machine learning, robotics},
pubstate = {forthcoming},
tppubtype = {inproceedings}
}
Yi-Hao Peng, Peggy Chi, Anjuli Kannan, Meredith Morris, Irfan Essa
Slide Gestalt: Automatic Structure Extraction in Slide Decks for Non-Visual Access Inproceedings Forthcoming
In: ACM Symposium on User Interface Software and Technology (UIST), Forthcoming.
Abstract | Links | BibTeX | Tags: accessibility, CHI, google, human-computer interaction
@inproceedings{2023-Peng-SGASESDNA,
title = {Slide Gestalt: Automatic Structure Extraction in Slide Decks for Non-Visual Access},
author = {Yi-Hao Peng and Peggy Chi and Anjuli Kannan and Meredith Morris and Irfan Essa},
url = {https://research.google/pubs/pub52182/},
year = {2023},
date = {2023-04-23},
urldate = {2023-04-23},
booktitle = {ACM Symposium on User Interface Software and Technology (UIST)},
abstract = {Presentation slides commonly use visual patterns for structural navigation, such as titles, dividers, and build slides. However, screen readers do not capture such intention, making it time-consuming and less accessible for blind and visually impaired (BVI) users to linearly consume slides with repeated content. We present Slide Gestalt, an automatic approach that identifies the hierarchical structure in a slide deck. Slide Gestalt computes the visual and textual correspondences between slides to generate hierarchical groupings. Readers can navigate the slide deck from the higher-level section overview to the lower-level description of a slide group or individual elements interactively with our UI. We derived side consumption and authoring practices from interviews with BVI readers and sighted creators and an analysis of 100 decks. We performed our pipeline with 50 real-world slide decks and a large dataset. Feedback from eight BVI participants showed that Slide Gestalt helped navigate a slide deck by anchoring content more efficiently, compared to using accessible slides.},
keywords = {accessibility, CHI, google, human-computer interaction},
pubstate = {forthcoming},
tppubtype = {inproceedings}
}
Karan Samel, Jun Ma, Zhengyang Wang, Tong Zhao, Irfan Essa
Knowledge Relevance BERT: Integrating Noisy Knowledge into Language Representation. Inproceedings
In: AAAI workshop on Knowledge Augmented Methods for NLP (KnowledgeNLP-AAAI 2023), 2023.
Abstract | Links | BibTeX | Tags: AI, knowledge representation, NLP
@inproceedings{2023-Samel-KRBINKILR,
title = {Knowledge Relevance BERT: Integrating Noisy Knowledge into Language Representation.},
author = {Karan Samel and Jun Ma and Zhengyang Wang and Tong Zhao and Irfan Essa},
url = {https://knowledge-nlp.github.io/aaai2023/papers/005-KRBERT-oral.pdf},
year = {2023},
date = {2023-02-01},
urldate = {2023-02-01},
booktitle = {AAAI workshop on Knowledge Augmented Methods for NLP (KnowledgeNLP-AAAI 2023)},
abstract = {Integrating structured knowledge into language model representations increases recall of domain-specific information useful for downstream tasks. Matching between knowledge graph entities and text entity mentions can be easily performed when entity names are unique or entity-linking data exists. When extending this setting to new domains, newly mined knowledge contains ambiguous and incorrect information without explicit linking information. In such settings, we design a framework to robustly link relevant knowledge to input texts as an intermediate modeling step while performing end-to-end domain fine-tuning tasks. This is done by first computing the similarity of the existing task labels with candidate knowledge triplets to generate relevance labels. We use these labels to train a relevance model, which predicts the relevance of the inserted triplets to the original text. This relevance model is integrated within a language model, leading to our Knowledge Relevance BERT (KR-BERT) framework. We test KR-BERT for linking and ranking tasks on a real-world e-commerce dataset and a public entity linking task, where we show performance improvements over strong baselines.},
keywords = {AI, knowledge representation, NLP},
pubstate = {published},
tppubtype = {inproceedings}
}
Tianhao Zhang, Weilong Yang, Honglak Lee, Hung-Yu Tseng, Irfan Essa, Lu Jiang
Image manipulation by text instruction Patent
2023.
Abstract | Links | BibTeX | Tags: content creation, google, media generation, patents
@patent{2023-Zhang-IMTI,
title = {Image manipulation by text instruction},
author = {Tianhao Zhang and Weilong Yang and Honglak Lee and Hung-Yu Tseng and Irfan Essa and Lu Jiang},
url = {https://patents.google.com/patent/US11562518},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
abstract = {A method for generating an output image from an input image and an input text instruction that specifies a location and a modification of an edit applied to the input image using a neural network is described. The neural network includes an image encoder, an image decoder, and an instruction attention network. The method includes receiving the input image and the input text instruction; extracting, from the input image, an input image feature that represents features of the input image using the image encoder; generating a spatial feature and a modification feature from the input text instruction using the instruction attention network; generating an edited image feature from the input image feature, the spatial feature and the modification feature; and generating the output image from the edited image feature using the image decoder.},
howpublished = {US Patent # US11562518},
keywords = {content creation, google, media generation, patents},
pubstate = {published},
tppubtype = {patent}
}
Erik Wijmans, Irfan Essa, Dhruv Batra
How to Train PointGoal Navigation Agents on a (Sample and Compute) Budget Inproceedings
In: International Conference on Autonomous Agents and Multi-Agent Systems, 2022.
Abstract | Links | BibTeX | Tags: computer vision, embodied agents, navigation
@inproceedings{2022-Wijmans-TPNASCB,
title = {How to Train PointGoal Navigation Agents on a (Sample and Compute) Budget},
author = {Erik Wijmans and Irfan Essa and Dhruv Batra},
url = {https://arxiv.org/abs/2012.06117
https://ifaamas.org/Proceedings/aamas2022/pdfs/p1762.pdf},
doi = {10.48550/arXiv.2012.06117},
year = {2022},
date = {2022-12-01},
urldate = {2020-12-01},
booktitle = {International Conference on Autonomous Agents and Multi-Agent Systems},
journal = {arXiv},
number = {arXiv:2012.06117},
abstract = {PointGoal navigation has seen significant recent interest and progress, spurred on by the Habitat platform and associated challenge. In this paper, we study PointGoal navigation under both a sample budget (75 million frames) and a compute budget (1 GPU for 1 day). We conduct an extensive set of experiments, cumulatively totaling over 50,000 GPU-hours, that let us identify and discuss a number of ostensibly minor but significant design choices -- the advantage estimation procedure (a key component in training), visual encoder architecture, and a seemingly minor hyper-parameter change. Overall, these design choices to lead considerable and consistent improvements over the baselines present in Savva et al. Under a sample budget, performance for RGB-D agents improves 8 SPL on Gibson (14% relative improvement) and 20 SPL on Matterport3D (38% relative improvement). Under a compute budget, performance for RGB-D agents improves by 19 SPL on Gibson (32% relative improvement) and 35 SPL on Matterport3D (220% relative improvement). We hope our findings and recommendations will make serve to make the community's experiments more efficient.},
keywords = {computer vision, embodied agents, navigation},
pubstate = {published},
tppubtype = {inproceedings}
}
Erik Wijmans, Irfan Essa, Dhruv Batra
VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement Inproceedings
In: Oh, Alice H., Agarwal, Alekh, Belgrave, Danielle, Cho, Kyunghyun (Ed.): Advances in Neural Information Processing Systems (NeurIPS), 2022.
Abstract | Links | BibTeX | Tags: machine learning, NeurIPS, reinforcement learning, robotics
@inproceedings{2022-Wijmans-SOLENER,
title = {VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement},
author = {Erik Wijmans and Irfan Essa and Dhruv Batra},
editor = {Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
url = {https://arxiv.org/abs/2210.05064
https://openreview.net/forum?id=VrJWseIN98},
doi = {10.48550/ARXIV.2210.05064},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {We present Variable Experience Rollout (VER), a technique for efficiently scaling batched on-policy reinforcement learning in heterogenous environments (where different environments take vastly different times to generate rollouts) to many GPUs residing on, potentially, many machines. VER combines the strengths of and blurs the line between synchronous and asynchronous on-policy RL methods (SyncOnRL and AsyncOnRL, respectively). Specifically, it learns from on-policy experience (like SyncOnRL) and has no synchronization points (like AsyncOnRL) enabling high throughput.
We find that VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state of art for distributed SyncOnRL, with similar sample efficiency. For mobile manipulation tasks (open fridge/cabinet, pick/place objects) in Habitat 2.0 VER is 150% faster (2.5x speedup) on 1 GPU and 170% faster (2.7x speedup) on 8 GPUs than DD-PPO. Compared to SampleFactory (the current state-of-the-art AsyncOnRL), VER matches its speed on 1 GPU, and is 70% faster (1.7x speedup) on 8 GPUs with better sample efficiency.
We leverage these speed-ups to train chained skills for GeometricGoal rearrangement tasks in the Home Assistant Benchmark (HAB). We find a surprising emergence of navigation in skills that do not ostensible require any navigation. Specifically, the Pick skill involves a robot picking an object from a table. During training the robot was always spawned close to the table and never needed to navigate. However, we find that if base movement is part of the action space, the robot learns to navigate then pick an object in new environments with 50% success, demonstrating surprisingly high out-of-distribution generalization.},
keywords = {machine learning, NeurIPS, reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
We find that VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state of art for distributed SyncOnRL, with similar sample efficiency. For mobile manipulation tasks (open fridge/cabinet, pick/place objects) in Habitat 2.0 VER is 150% faster (2.5x speedup) on 1 GPU and 170% faster (2.7x speedup) on 8 GPUs than DD-PPO. Compared to SampleFactory (the current state-of-the-art AsyncOnRL), VER matches its speed on 1 GPU, and is 70% faster (1.7x speedup) on 8 GPUs with better sample efficiency.
We leverage these speed-ups to train chained skills for GeometricGoal rearrangement tasks in the Home Assistant Benchmark (HAB). We find a surprising emergence of navigation in skills that do not ostensible require any navigation. Specifically, the Pick skill involves a robot picking an object from a table. During training the robot was always spawned close to the table and never needed to navigate. However, we find that if base movement is part of the action space, the robot learns to navigate then pick an object in new environments with 50% success, demonstrating surprisingly high out-of-distribution generalization.
Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
End-to-end Multimodal Representation Learning for Video Dialog Inproceedings
In: NeuRIPS Workshop on Vision Transformers: Theory and applications, 2022.
Abstract | Links | BibTeX | Tags: computational video, computer vision, vision transformers
@inproceedings{2022-Alamri-EMRLVD,
title = {End-to-end Multimodal Representation Learning for Video Dialog},
author = {Huda Alamri and Anthony Bilic and Michael Hu and Apoorva Beedu and Irfan Essa},
url = {https://arxiv.org/abs/2210.14512},
doi = {10.48550/arXiv.2210.14512},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeuRIPS Workshop on Vision Transformers: Theory and applications},
abstract = {Video-based dialog task is a challenging multimodal learning task that has received increasing attention over the past few years with state-of-the-art obtaining new performance records. This progress is largely powered by the adaptation of the more powerful transformer-based language encoders. Despite this progress, existing approaches do not effectively utilize visual features to help solve tasks. Recent studies show that state-of-the-art models are biased towards textual information rather than visual cues. In order to better leverage the available visual information, this study proposes a new framework that combines 3D-CNN network and transformer-based networks into a single visual encoder to extract more robust semantic representations from videos. The visual encoder is jointly trained end-to-end with other input modalities such as text and audio. Experiments on the AVSD task show significant improvement over baselines in both generative and retrieval tasks.},
keywords = {computational video, computer vision, vision transformers},
pubstate = {published},
tppubtype = {inproceedings}
}
Apoorva Beedu, Huda Alamri, Irfan Essa
Video based Object 6D Pose Estimation using Transformers Inproceedings
In: NeuRIPS Workshop on Vision Transformers: Theory and applications, 2022.
Abstract | Links | BibTeX | Tags: computer vision, vision transformers
@inproceedings{2022-Beedu-VBOPEUT,
title = {Video based Object 6D Pose Estimation using Transformers},
author = {Apoorva Beedu and Huda Alamri and Irfan Essa},
url = {https://arxiv.org/abs/2210.13540},
doi = {10.48550/arXiv.2210.13540},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeuRIPS Workshop on Vision Transformers: Theory and applications},
abstract = {We introduce a Transformer based 6D Object Pose Estimation framework VideoPose, comprising an end-to-end attention based modelling architecture, that attends to previous frames in order to estimate accurate 6D Object Poses in videos. Our approach leverages the temporal information from a video sequence for pose refinement, along with being computationally efficient and robust. Compared to existing methods, our architecture is able to capture and reason from long-range dependencies efficiently, thus iteratively refining over video sequences.Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art Transformer methods, and performs significantly better relative to CNN based approaches. Further, with a speed of 33 fps, it is also more efficient and therefore applicable to a variety of applications that require real-time object pose estimation. Training code and pretrained models are available at https://anonymous.4open.science/r/VideoPose-3C8C.},
keywords = {computer vision, vision transformers},
pubstate = {published},
tppubtype = {inproceedings}
}
José Lezama, Huiwen Chang, Lu Jiang, Irfan Essa
Improved Masked Image Generation with Token-Critic Inproceedings
In: European Conference on Computer Vision (ECCV), arXiv, 2022, ISBN: 978-3-031-20050-2.
Abstract | Links | BibTeX | Tags: computer vision, ECCV, generative media, google
@inproceedings{2022-Lezama-IMIGWT,
title = {Improved Masked Image Generation with Token-Critic},
author = {José Lezama and Huiwen Chang and Lu Jiang and Irfan Essa},
url = {https://arxiv.org/abs/2209.04439
https://rdcu.be/c61MZ},
doi = {10.1007/978-3-031-20050-2_5},
isbn = {978-3-031-20050-2},
year = {2022},
date = {2022-10-28},
urldate = {2022-10-28},
booktitle = {European Conference on Computer Vision (ECCV)},
volume = {13683},
publisher = {arXiv},
abstract = {Non-autoregressive generative transformers recently demonstrated impressive image generation performance, and orders of magnitude faster sampling than their autoregressive counterparts. However, optimal parallel sampling from the true joint distribution of visual tokens remains an open challenge. In this paper we introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer. Given a masked-and-reconstructed real image, the Token-Critic model is trained to distinguish which visual tokens belong to the original image and which were sampled by the generative transformer. During non-autoregressive iterative sampling, Token-Critic is used to select which tokens to accept and which to reject and resample. Coupled with Token-Critic, a state-of-the-art generative transformer significantly improves its performance, and outperforms recent diffusion models and GANs in terms of the trade-off between generated image quality and diversity, in the challenging class-conditional ImageNet generation.},
keywords = {computer vision, ECCV, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa
BLT: Bidirectional Layout Transformer for Controllable Layout Generation Inproceedings
In: European Conference on Computer Vision (ECCV), 2022, ISBN: 978-3-031-19789-5.
Abstract | Links | BibTeX | Tags: computer vision, ECCV, generative media, google, vision transformer
@inproceedings{2022-Kong-BLTCLG,
title = {BLT: Bidirectional Layout Transformer for Controllable Layout Generation},
author = {Xiang Kong and Lu Jiang and Huiwen Chang and Han Zhang and Yuan Hao and Haifeng Gong and Irfan Essa},
url = {https://arxiv.org/abs/2112.05112
https://rdcu.be/c61AE},
doi = {10.1007/978-3-031-19790-1_29},
isbn = {978-3-031-19789-5},
year = {2022},
date = {2022-10-25},
urldate = {2022-10-25},
booktitle = {European Conference on Computer Vision (ECCV)},
volume = {13677},
abstract = {Creating visual layouts is a critical step in graphic design. Automatic generation of such layouts is essential for scalable and diverse visual designs. To advance conditional layout generation, we introduce BLT, a bidirectional layout transformer. BLT differs from previous work on transformers in adopting non-autoregressive transformers. In training, BLT learns to predict the masked attributes by attending to surrounding attributes in two directions. During inference, BLT first generates a draft layout from the input and then iteratively refines it into a high-quality layout by masking out low-confident attributes. The masks generated in both training and inference are controlled by a new hierarchical sampling policy. We verify the proposed model on six benchmarks of diverse design tasks. Experimental results demonstrate two benefits compared to the state-of-the-art layout transformer models. First, our model empowers layout transformers to fulfill controllable layout generation. Second, it achieves up to 10x speedup in generating a layout at inference time than the layout transformer baseline. Code is released at https://shawnkx.github.io/blt.},
keywords = {computer vision, ECCV, generative media, google, vision transformer},
pubstate = {published},
tppubtype = {inproceedings}
}
Peggy Chi, Tao Dong, Christian Frueh, Brian Colonna, Vivek Kwatra, Irfan Essa
Synthesis-Assisted Video Prototyping From a Document Inproceedings
In: Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pp. 1–10, 2022.
Abstract | Links | BibTeX | Tags: computational video, generative media, google, human-computer interaction, UIST, video editing
@inproceedings{2022-Chi-SVPFD,
title = {Synthesis-Assisted Video Prototyping From a Document},
author = {Peggy Chi and Tao Dong and Christian Frueh and Brian Colonna and Vivek Kwatra and Irfan Essa},
url = {https://research.google/pubs/pub51631/
https://dl.acm.org/doi/abs/10.1145/3526113.3545676},
doi = {10.1145/3526113.3545676},
year = {2022},
date = {2022-10-01},
urldate = {2022-10-01},
booktitle = {Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology},
pages = {1--10},
abstract = {Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source materials come from a written document -- such as a web tutorial, it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain -- programming cookbooks, we apply visual elements from the source document, such as a keyword, a code snippet or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.},
keywords = {computational video, generative media, google, human-computer interaction, UIST, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
Harish Haresamudram, Irfan Essa, Thomas Ploetz
Assessing the State of Self-Supervised Human Activity Recognition using Wearables Journal Article
In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), vol. 6, iss. 3, no. 116, pp. 1–47, 2022.
Abstract | Links | BibTeX | Tags: activity recognition, IMWUT, ubiquitous computing, wearable computing
@article{2022-Haresamudram-ASSHARUW,
title = {Assessing the State of Self-Supervised Human Activity Recognition using Wearables},
author = {Harish Haresamudram and Irfan Essa and Thomas Ploetz},
url = {https://dl.acm.org/doi/10.1145/3550299
https://arxiv.org/abs/2202.12938
https://arxiv.org/pdf/2202.12938
},
doi = {doi.org/10.1145/3550299},
year = {2022},
date = {2022-09-07},
urldate = {2022-09-07},
booktitle = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)},
journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)},
volume = {6},
number = {116},
issue = {3},
pages = {1–47},
publisher = {ACM},
abstract = {The emergence of self-supervised learning in the field of wearables-based human activity recognition (HAR) has opened up opportunities to tackle the most pressing challenges in the field, namely to exploit unlabeled data to derive reliable recognition systems for scenarios where only small amounts of labeled training samples can be collected. As such, self-supervision, i.e., the paradigm of 'pretrain-then-finetune' has the potential to become a strong alternative to the predominant end-to-end training approaches, let alone hand-crafted features for the classic activity recognition chain. Recently a number of contributions have been made that introduced self-supervised learning into the field of HAR, including, Multi-task self-supervision, Masked Reconstruction, CPC, and SimCLR, to name but a few. With the initial success of these methods, the time has come for a systematic inventory and analysis of the potential self-supervised learning has for the field. This paper provides exactly that. We assess the progress of self-supervised HAR research by introducing a framework that performs a multi-faceted exploration of model performance. We organize the framework into three dimensions, each containing three constituent criteria, such that each dimension captures specific aspects of performance, including the robustness to differing source and target conditions, the influence of dataset characteristics, and the feature space characteristics. We utilize this framework to assess seven state-of-the-art self-supervised methods for HAR, leading to the formulation of insights into the properties of these techniques and to establish their value towards learning representations for diverse scenarios.
},
keywords = {activity recognition, IMWUT, ubiquitous computing, wearable computing},
pubstate = {published},
tppubtype = {article}
}
Daniel Nkemelu, Harshil Shah, Irfan Essa, Michael L. Best
Tackling Hate Speech in Low-resource Languages with Context Experts Inproceedings
In: International Conference on Information & Communication Technologies and Development (ICTD), 2022.
Abstract | Links | BibTeX | Tags: computational journalism, ICTD, social computing
@inproceedings{2022-Nkemelu-THSLLWCE,
title = {Tackling Hate Speech in Low-resource Languages with Context Experts},
author = {Daniel Nkemelu and Harshil Shah and Irfan Essa and Michael L. Best},
url = {https://www.nkemelu.com/data/ictd2022_nkemelu_final.pdf
},
year = {2022},
date = {2022-06-01},
urldate = {2022-06-01},
booktitle = {International Conference on Information & Communication Technologies and Development (ICTD)},
abstract = {Given Myanmar's historical and socio-political context, hate speech spread on social media have escalated into offline unrest and violence. This paper presents findings from our remote study on the automatic detection of hate speech online in Myanmar. We argue that effectively addressing this problem will require community-based approaches that combine the knowledge of context experts with machine learning tools that can analyze the vast amount of data produced. To this end, we develop a systematic process to facilitate this collaboration covering key aspects of data collection, annotation, and model validation strategies. We highlight challenges in this area stemming from small and imbalanced datasets, the need to balance non-glamorous data work and stakeholder priorities, and closed data sharing practices. Stemming from these findings, we discuss avenues for further work in developing and deploying hate speech detection systems for low-resource languages.},
keywords = {computational journalism, ICTD, social computing},
pubstate = {published},
tppubtype = {inproceedings}
}
Niranjan Kumar, Irfan Essa, Sehoon Ha
Graph-based Cluttered Scene Generation and Interactive Exploration using Deep Reinforcement Learning Inproceedings
In: Proceedings International Conference on Robotics and Automation (ICRA), pp. 7521-7527, 2022.
Abstract | Links | BibTeX | Tags: ICRA, machine learning, reinforcement learning, robotics
@inproceedings{2021-Kumar-GCSGIEUDRL,
title = {Graph-based Cluttered Scene Generation and Interactive Exploration using Deep Reinforcement Learning},
author = {Niranjan Kumar and Irfan Essa and Sehoon Ha},
url = {https://doi.org/10.1109/ICRA46639.2022.9811874
https://arxiv.org/abs/2109.10460
https://arxiv.org/pdf/2109.10460
https://www.kniranjankumar.com/projects/5_clutr
https://kniranjankumar.github.io/assets/pdf/graph_based_clutter.pdf
https://youtu.be/T2Jo7wwaXss},
doi = {10.1109/ICRA46639.2022.9811874},
year = {2022},
date = {2022-05-01},
urldate = {2022-05-01},
booktitle = {Proceedings International Conference on Robotics and Automation (ICRA)},
journal = {arXiv},
number = {2109.10460},
pages = {7521-7527},
abstract = {We introduce a novel method to teach a robotic agent to interactively explore cluttered yet structured scenes, such as kitchen pantries and grocery shelves, by leveraging the physical plausibility of the scene. We propose a novel learning framework to train an effective scene exploration policy to discover hidden objects with minimal interactions. First, we define a novel scene grammar to represent structured clutter. Then we train a Graph Neural Network (GNN) based Scene Generation agent using deep reinforcement learning (deep RL), to manipulate this Scene Grammar to create a diverse set of stable scenes, each containing multiple hidden objects. Given such cluttered scenes, we then train a Scene Exploration agent, using deep RL, to uncover hidden objects by interactively rearranging the scene.
},
keywords = {ICRA, machine learning, reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song
Learning Temporal Rules from Noisy Timeseries Data Journal Article
In: arXiv preprint arXiv:2202.05403, 2022.
Abstract | Links | BibTeX | Tags: activity recognition, machine learning
@article{2022-Samel-LTRFNTD,
title = {Learning Temporal Rules from Noisy Timeseries Data},
author = {Karan Samel and Zelin Zhao and Binghong Chen and Shuang Li and Dharmashankar Subramanian and Irfan Essa and Le Song},
url = {https://arxiv.org/abs/2202.05403
https://arxiv.org/pdf/2202.05403},
year = {2022},
date = {2022-02-01},
urldate = {2022-02-01},
journal = {arXiv preprint arXiv:2202.05403},
abstract = {Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher level composite events. Examples of a composite event are a patient's medical symptom or a baseball player hitting a home run, caused distinct temporal orderings of patient vitals and player movements respectively. Such salient composite events are provided as labels in temporal datasets and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP) which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite events labels for supervision. This is done through efficiently searching through the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and healthcare datasets where it outperforms the baseline methods for rule discovery.
},
keywords = {activity recognition, machine learning},
pubstate = {published},
tppubtype = {article}
}
Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa
Discrete Representations Strengthen Vision Transformer Robustness Inproceedings
In: Proceedings of International Conference on Learning Representations (ICLR), 2022.
Abstract | Links | BibTeX | Tags: computer vision, google, machine learning, vision transformer
@inproceedings{2022-Mao-DRSVTR,
title = {Discrete Representations Strengthen Vision Transformer Robustness},
author = {Chengzhi Mao and Lu Jiang and Mostafa Dehghani and Carl Vondrick and Rahul Sukthankar and Irfan Essa},
url = {https://iclr.cc/virtual/2022/poster/6647
https://arxiv.org/abs/2111.10493
https://research.google/pubs/pub51388/
https://openreview.net/forum?id=8hWs60AZcWk},
doi = {10.48550/arXiv.2111.10493},
year = {2022},
date = {2022-01-28},
urldate = {2022-04-01},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
journal = {arXiv preprint arXiv:2111.10493},
abstract = {Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet are overly reliant on local textures and fail to make adequate use of shape information. ViTs thus have difficulties generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder. Different from the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and contain less information individually, which promote ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.},
keywords = {computer vision, google, machine learning, vision transformer},
pubstate = {published},
tppubtype = {inproceedings}
}
Other Publication Sites
A few more sites that aggregate research publications: Academic.edu, Bibsonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.