Publications in 2022
Here is a list of all papers from 2022. Kudos to all my collaborators -- well done!
Erik Wijmans, Irfan Essa, Dhruv Batra
How to Train PointGoal Navigation Agents on a (Sample and Compute) Budget Proceedings Article
In: International Conference on Autonomous Agents and Multi-Agent Systems, 2022.
Abstract | Links | BibTeX | Tags: computer vision, embodied agents, navigation
@inproceedings{2022-Wijmans-TPNASCB,
title = {How to Train PointGoal Navigation Agents on a (Sample and Compute) Budget},
author = {Erik Wijmans and Irfan Essa and Dhruv Batra},
url = {https://arxiv.org/abs/2012.06117
https://ifaamas.org/Proceedings/aamas2022/pdfs/p1762.pdf},
doi = {10.48550/arXiv.2012.06117},
year = {2022},
date = {2022-12-01},
urldate = {2020-12-01},
booktitle = {International Conference on Autonomous Agents and Multi-Agent Systems},
journal = {arXiv},
number = {arXiv:2012.06117},
abstract = {PointGoal navigation has seen significant recent interest and progress, spurred on by the Habitat platform and associated challenge. In this paper, we study PointGoal navigation under both a sample budget (75 million frames) and a compute budget (1 GPU for 1 day). We conduct an extensive set of experiments, cumulatively totaling over 50,000 GPU-hours, that let us identify and discuss a number of ostensibly minor but significant design choices -- the advantage estimation procedure (a key component in training), visual encoder architecture, and a seemingly minor hyper-parameter change. Overall, these design choices lead to considerable and consistent improvements over the baselines present in Savva et al. Under a sample budget, performance for RGB-D agents improves by 8 SPL on Gibson (14% relative improvement) and 20 SPL on Matterport3D (38% relative improvement). Under a compute budget, performance for RGB-D agents improves by 19 SPL on Gibson (32% relative improvement) and 35 SPL on Matterport3D (220% relative improvement). We hope our findings and recommendations will serve to make the community's experiments more efficient.},
keywords = {computer vision, embodied agents, navigation},
pubstate = {published},
tppubtype = {inproceedings}
}
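The abstract above singles out the advantage estimation procedure as one of the ostensibly minor design choices that matters. As a point of reference, here is a minimal sketch of Generalized Advantage Estimation (GAE), a common form of that component; the function and the toy data are illustrative and not taken from the paper.

```python
# Minimal sketch of Generalized Advantage Estimation (GAE), one common
# advantage-estimation procedure in on-policy RL. Illustrative only; the
# reward/value numbers below are placeholders, not from the paper.
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a single rollout.

    rewards: shape (T,)      rewards at each step
    values:  shape (T + 1,)  value estimates, including the bootstrap value
    dones:   shape (T,)      1.0 where an episode ended, else 0.0
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    last_gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        last_gae = delta + gamma * lam * not_done * last_gae
        advantages[t] = last_gae
    return advantages

# Toy usage with placeholder data.
rng = np.random.default_rng(0)
T = 8
rewards = rng.normal(size=T)
values = rng.normal(size=T + 1)
dones = np.zeros(T)
dones[-1] = 1.0
print(gae_advantages(rewards, values, dones))
```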
Erik Wijmans, Irfan Essa, Dhruv Batra
VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement Proceedings Article
In: Oh, Alice H., Agarwal, Alekh, Belgrave, Danielle, Cho, Kyunghyun (Ed.): Advances in Neural Information Processing Systems (NeurIPS), 2022.
Abstract | Links | BibTeX | Tags: machine learning, NeurIPS, reinforcement learning, robotics
@inproceedings{2022-Wijmans-SOLENER,
title = {VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement},
author = {Erik Wijmans and Irfan Essa and Dhruv Batra},
editor = {Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
url = {https://arxiv.org/abs/2210.05064
https://openreview.net/forum?id=VrJWseIN98},
doi = {10.48550/ARXIV.2210.05064},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {We present Variable Experience Rollout (VER), a technique for efficiently scaling batched on-policy reinforcement learning in heterogeneous environments (where different environments take vastly different times to generate rollouts) to many GPUs residing on, potentially, many machines. VER combines the strengths of and blurs the line between synchronous and asynchronous on-policy RL methods (SyncOnRL and AsyncOnRL, respectively). Specifically, it learns from on-policy experience (like SyncOnRL) and has no synchronization points (like AsyncOnRL), enabling high throughput.
We find that VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state of the art for distributed SyncOnRL, with similar sample efficiency. For mobile manipulation tasks (open fridge/cabinet, pick/place objects) in Habitat 2.0, VER is 150% faster (2.5x speedup) on 1 GPU and 170% faster (2.7x speedup) on 8 GPUs than DD-PPO. Compared to SampleFactory (the current state-of-the-art AsyncOnRL), VER matches its speed on 1 GPU, and is 70% faster (1.7x speedup) on 8 GPUs with better sample efficiency.
We leverage these speed-ups to train chained skills for GeometricGoal rearrangement tasks in the Home Assistant Benchmark (HAB). We find a surprising emergence of navigation in skills that do not ostensibly require any navigation. Specifically, the Pick skill involves a robot picking an object from a table. During training, the robot was always spawned close to the table and never needed to navigate. However, we find that if base movement is part of the action space, the robot learns to navigate and then pick an object in new environments with 50% success, demonstrating surprisingly high out-of-distribution generalization.},
keywords = {machine learning, NeurIPS, reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
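The core idea in VER, as the abstract describes it, is to stop fast environments from waiting on slow ones by letting each environment contribute a variable amount of experience toward a fixed-size batch. The sketch below illustrates only that batching idea with an invented ToyEnv and made-up step speeds; it is not the authors' implementation.

```python
# Toy sketch of variable-experience batching: each (simulated) environment
# contributes however many steps it finishes before the batch fills up,
# instead of a fixed number of steps per environment. The ToyEnv and its
# per-tick speed are invented for illustration; this is not the authors' code.
import random

class ToyEnv:
    def __init__(self, steps_per_tick):
        self.steps_per_tick = steps_per_tick  # how "fast" this env is

def collect_variable_rollouts(envs, batch_size):
    """Fill a batch of `batch_size` transitions from envs of unequal speed."""
    counts = {i: 0 for i in range(len(envs))}
    total = 0
    while total < batch_size:
        for i, env in enumerate(envs):
            take = min(env.steps_per_tick, batch_size - total)
            counts[i] += take
            total += take
            if total >= batch_size:
                break
    return counts

random.seed(0)
envs = [ToyEnv(steps_per_tick=random.choice([1, 2, 8])) for _ in range(4)]
# Unlike strictly synchronous rollout collection, where every env would
# contribute exactly batch_size // len(envs) steps, fast envs contribute more.
print(collect_variable_rollouts(envs, batch_size=64))
```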
Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
End-to-end Multimodal Representation Learning for Video Dialog Proceedings Article
In: NeurIPS Workshop on Vision Transformers: Theory and Applications, 2022.
Abstract | Links | BibTeX | Tags: computational video, computer vision, vision transformers
@inproceedings{2022-Alamri-EMRLVD,
title = {End-to-end Multimodal Representation Learning for Video Dialog},
author = {Huda Alamri and Anthony Bilic and Michael Hu and Apoorva Beedu and Irfan Essa},
url = {https://arxiv.org/abs/2210.14512},
doi = {10.48550/arXiv.2210.14512},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS Workshop on Vision Transformers: Theory and Applications},
abstract = {The video-based dialog task is a challenging multimodal learning task that has received increasing attention over the past few years, with state-of-the-art models obtaining new performance records. This progress is largely powered by the adaptation of more powerful transformer-based language encoders. Despite this progress, existing approaches do not effectively utilize visual features to help solve tasks. Recent studies show that state-of-the-art models are biased towards textual information rather than visual cues. In order to better leverage the available visual information, this study proposes a new framework that combines a 3D-CNN network and transformer-based networks into a single visual encoder to extract more robust semantic representations from videos. The visual encoder is jointly trained end-to-end with other input modalities such as text and audio. Experiments on the AVSD task show significant improvement over baselines in both generative and retrieval tasks.},
keywords = {computational video, computer vision, vision transformers},
pubstate = {published},
tppubtype = {inproceedings}
}
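The abstract describes combining a 3D-CNN with transformer-based networks into a single visual encoder. The sketch below shows one generic way such a combination can be wired up in PyTorch; the class name, layer sizes, and token count are arbitrary placeholders rather than the paper's architecture.

```python
# Hedged sketch of the general idea of fusing a 3D-CNN front end with a
# transformer encoder into a single visual encoder. Layer sizes and the
# module name are arbitrary placeholders, not the paper's architecture.
import torch
import torch.nn as nn

class VideoEncoderSketch(nn.Module):
    def __init__(self, dim=256, nhead=4, num_layers=2):
        super().__init__()
        # A small 3D-CNN stem that pools space-time features per clip segment.
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(64, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 1, 1)),  # keep 4 temporal tokens
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, video):                      # video: (B, 3, T, H, W)
        feats = self.cnn3d(video)                  # (B, dim, 4, 1, 1)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, 4, dim)
        return self.transformer(tokens)            # contextualized visual tokens

x = torch.randn(2, 3, 16, 64, 64)
print(VideoEncoderSketch()(x).shape)               # torch.Size([2, 4, 256])
```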
Apoorva Beedu, Huda Alamri, Irfan Essa
Video based Object 6D Pose Estimation using Transformers Proceedings Article
In: NeurIPS Workshop on Vision Transformers: Theory and Applications, 2022.
Abstract | Links | BibTeX | Tags: computer vision, vision transformers
@inproceedings{2022-Beedu-VBOPEUT,
title = {Video based Object 6D Pose Estimation using Transformers},
author = {Apoorva Beedu and Huda Alamri and Irfan Essa},
url = {https://arxiv.org/abs/2210.13540},
doi = {10.48550/arXiv.2210.13540},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS Workshop on Vision Transformers: Theory and Applications},
abstract = {We introduce VideoPose, a Transformer-based 6D object pose estimation framework comprising an end-to-end attention-based modelling architecture that attends to previous frames in order to estimate accurate 6D object poses in videos. Our approach leverages the temporal information from a video sequence for pose refinement, along with being computationally efficient and robust. Compared to existing methods, our architecture is able to capture and reason from long-range dependencies efficiently, thus iteratively refining over video sequences. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art Transformer methods, and performs significantly better relative to CNN-based approaches. Further, with a speed of 33 fps, it is also more efficient and therefore applicable to a variety of applications that require real-time object pose estimation. Training code and pretrained models are available at https://anonymous.4open.science/r/VideoPose-3C8C.},
keywords = {computer vision, vision transformers},
pubstate = {published},
tppubtype = {inproceedings}
}
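The abstract describes attending to previous frames with a transformer to refine 6D object poses. A hedged sketch of that general recipe follows: per-frame features pass through a temporal transformer, and a quaternion plus translation are regressed for the latest frame. All names and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of the general recipe "attend over per-frame features with a
# transformer, then regress a 6D pose (rotation + translation)". All names,
# dimensions, and the quaternion/translation heads are illustrative choices.
import torch
import torch.nn as nn

class TemporalPoseSketch(nn.Module):
    def __init__(self, feat_dim=128, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=nhead,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.rot_head = nn.Linear(feat_dim, 4)    # unit quaternion
        self.trans_head = nn.Linear(feat_dim, 3)  # translation (x, y, z)

    def forward(self, frame_feats):               # (B, T, feat_dim)
        h = self.temporal(frame_feats)            # attend across frames
        last = h[:, -1]                           # pose for the latest frame
        quat = torch.nn.functional.normalize(self.rot_head(last), dim=-1)
        return quat, self.trans_head(last)

feats = torch.randn(2, 5, 128)                    # features for 5 frames
quat, trans = TemporalPoseSketch()(feats)
print(quat.shape, trans.shape)                    # (2, 4) (2, 3)
```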
José Lezama, Huiwen Chang, Lu Jiang, Irfan Essa
Improved Masked Image Generation with Token-Critic Proceedings Article
In: European Conference on Computer Vision (ECCV), Springer, 2022, ISBN: 978-3-031-20050-2.
Abstract | Links | BibTeX | Tags: computer vision, ECCV, generative AI, generative media, google
@inproceedings{2022-Lezama-IMIGWT,
title = {Improved Masked Image Generation with Token-Critic},
author = {José Lezama and Huiwen Chang and Lu Jiang and Irfan Essa},
url = {https://arxiv.org/abs/2209.04439
https://rdcu.be/c61MZ},
doi = {10.1007/978-3-031-20050-2_5},
isbn = {978-3-031-20050-2},
year = {2022},
date = {2022-10-28},
urldate = {2022-10-28},
booktitle = {European Conference on Computer Vision (ECCV)},
volume = {13683},
publisher = {Springer},
abstract = {Non-autoregressive generative transformers recently demonstrated impressive image generation performance, and orders of magnitude faster sampling than their autoregressive counterparts. However, optimal parallel sampling from the true joint distribution of visual tokens remains an open challenge. In this paper we introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer. Given a masked-and-reconstructed real image, the Token-Critic model is trained to distinguish which visual tokens belong to the original image and which were sampled by the generative transformer. During non-autoregressive iterative sampling, Token-Critic is used to select which tokens to accept and which to reject and resample. Coupled with Token-Critic, a state-of-the-art generative transformer significantly improves its performance, and outperforms recent diffusion models and GANs in terms of the trade-off between generated image quality and diversity, in the challenging class-conditional ImageNet generation.},
keywords = {computer vision, ECCV, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
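The sampling procedure described in the abstract alternates between filling in masked visual tokens with the generator and letting Token-Critic decide which tokens to keep and which to re-mask. The sketch below shows only that accept/reject control flow; the generator and critic are random stubs, and the masking schedule is an illustrative choice.

```python
# Hedged sketch of the accept/reject idea from the abstract: after the
# generator fills in masked visual tokens, a critic scores each token and the
# lowest-scored ones are re-masked for the next iteration. Both models are
# random stubs here; only the control flow is illustrated.
import torch

VOCAB, SEQ_LEN, MASK_ID = 1024, 16, 0

def generator_fill(tokens, mask):
    """Stub generator: sample random token ids at masked positions."""
    sampled = torch.randint(1, VOCAB, tokens.shape)
    return torch.where(mask, sampled, tokens)

def critic_scores(tokens):
    """Stub critic: score in [0, 1] per token (higher = more likely real)."""
    return torch.rand(tokens.shape)

def iterative_sample(steps=8):
    tokens = torch.full((1, SEQ_LEN), MASK_ID)
    mask = torch.ones(1, SEQ_LEN, dtype=torch.bool)
    for step in range(steps):
        tokens = generator_fill(tokens, mask)
        # Re-mask a shrinking fraction of the lowest-scored tokens.
        num_to_mask = int(SEQ_LEN * (1.0 - (step + 1) / steps))
        if num_to_mask == 0:
            break
        scores = critic_scores(tokens)
        lowest = scores.topk(num_to_mask, largest=False).indices
        mask = torch.zeros_like(mask)
        mask[0, lowest[0]] = True
        tokens[mask] = MASK_ID
    return tokens

print(iterative_sample())
```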
Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa
BLT: Bidirectional Layout Transformer for Controllable Layout Generation Proceedings Article
In: European Conference on Computer Vision (ECCV), 2022, ISBN: 978-3-031-19789-5.
Abstract | Links | BibTeX | Tags: computer vision, ECCV, generative AI, generative media, google, vision transformer
@inproceedings{2022-Kong-BLTCLG,
title = {BLT: Bidirectional Layout Transformer for Controllable Layout Generation},
author = {Xiang Kong and Lu Jiang and Huiwen Chang and Han Zhang and Yuan Hao and Haifeng Gong and Irfan Essa},
url = {https://arxiv.org/abs/2112.05112
https://rdcu.be/c61AE},
doi = {10.1007/978-3-031-19790-1_29},
isbn = {978-3-031-19789-5},
year = {2022},
date = {2022-10-25},
urldate = {2022-10-25},
booktitle = {European Conference on Computer Vision (ECCV)},
volume = {13677},
abstract = {Creating visual layouts is a critical step in graphic design. Automatic generation of such layouts is essential for scalable and diverse visual designs. To advance conditional layout generation, we introduce BLT, a bidirectional layout transformer. BLT differs from previous work on transformers in adopting non-autoregressive transformers. In training, BLT learns to predict the masked attributes by attending to surrounding attributes in two directions. During inference, BLT first generates a draft layout from the input and then iteratively refines it into a high-quality layout by masking out low-confident attributes. The masks generated in both training and inference are controlled by a new hierarchical sampling policy. We verify the proposed model on six benchmarks of diverse design tasks. Experimental results demonstrate two benefits compared to the state-of-the-art layout transformer models. First, our model empowers layout transformers to fulfill controllable layout generation. Second, it achieves up to 10x speedup in generating a layout at inference time than the layout transformer baseline. Code is released at https://shawnkx.github.io/blt.},
keywords = {computer vision, ECCV, generative AI, generative media, google, vision transformer},
pubstate = {published},
tppubtype = {inproceedings}
}
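The inference procedure in the abstract first drafts a full layout and then iteratively re-masks and re-predicts the lowest-confidence attributes. The sketch below illustrates that loop with a random stub in place of the trained layout model; the attribute layout (class, x, y, w, h) and the refill fraction are assumptions for illustration.

```python
# Hedged sketch of "draft then refine" layout inference: predict all layout
# attributes, then repeatedly re-mask the lowest-confidence slots and
# re-predict them. The predictor is a random stub; attribute names and counts
# are illustrative, not the paper's exact setup.
import torch

N_ELEMENTS, N_ATTRS, VOCAB = 4, 5, 128   # e.g. [class, x, y, w, h] per element

def predict(tokens):
    """Stub layout model: returns (values, confidences) for every slot."""
    flat = N_ELEMENTS * N_ATTRS
    values = torch.randint(0, VOCAB, (flat,))
    confidences = torch.rand(flat)
    return values, confidences

def refine_layout(rounds=4, refill_frac=0.3):
    values, conf = predict(None)                    # draft layout
    for _ in range(rounds):
        k = max(1, int(values.numel() * refill_frac))
        low = conf.topk(k, largest=False).indices   # least confident slots
        new_values, new_conf = predict(values)
        values[low], conf[low] = new_values[low], new_conf[low]
    return values.view(N_ELEMENTS, N_ATTRS)

print(refine_layout())
```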
Peggy Chi, Tao Dong, Christian Frueh, Brian Colonna, Vivek Kwatra, Irfan Essa
Synthesis-Assisted Video Prototyping From a Document Proceedings Article
In: Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pp. 1–10, 2022.
Abstract | Links | BibTeX | Tags: computational video, generative media, google, human-computer interaction, UIST, video editing
@inproceedings{2022-Chi-SVPFD,
title = {Synthesis-Assisted Video Prototyping From a Document},
author = {Peggy Chi and Tao Dong and Christian Frueh and Brian Colonna and Vivek Kwatra and Irfan Essa},
url = {https://research.google/pubs/pub51631/
https://dl.acm.org/doi/abs/10.1145/3526113.3545676},
doi = {10.1145/3526113.3545676},
year = {2022},
date = {2022-10-01},
urldate = {2022-10-01},
booktitle = {Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology},
pages = {1--10},
abstract = {Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source materials come from a written document -- such as a web tutorial -- it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain -- programming cookbooks -- we apply visual elements from the source document, such as a keyword, a code snippet or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.},
keywords = {computational video, generative media, google, human-computer interaction, UIST, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
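The pipeline in the abstract decomposes a source document into scenes, each pairing narration with visual elements such as keywords, code snippets, or screenshots. The data-structure sketch below shows one plausible way to represent that intermediate script; the field names and the toy splitting rule are invented for illustration, not taken from Doc2Video.

```python
# Hedged sketch of how a document-to-video pipeline might represent its
# intermediate script: a list of scenes, each pairing narration text with the
# visual elements pulled from the source document. Field names and the
# splitting rule are invented for illustration.
from dataclasses import dataclass, field
from typing import List

@dataclass
class VisualElement:
    kind: str          # e.g. "keyword", "code_snippet", "screenshot"
    content: str

@dataclass
class Scene:
    narration: str
    visuals: List[VisualElement] = field(default_factory=list)

def split_document(paragraphs: List[str]) -> List[Scene]:
    """Toy decomposition: one scene per paragraph; code lines become visuals."""
    scenes = []
    for p in paragraphs:
        visuals = [VisualElement("code_snippet", line)
                   for line in p.splitlines() if line.startswith(">>>")]
        scenes.append(Scene(narration=p, visuals=visuals))
    return scenes

doc = ["Install the package first.", ">>> pip install example\nThen import it."]
for s in split_document(doc):
    print(s.narration[:30], len(s.visuals))
```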
Harish Haresamudram, Irfan Essa, Thomas Ploetz
Assessing the State of Self-Supervised Human Activity Recognition using Wearables Journal Article
In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), vol. 6, iss. 3, no. 116, pp. 1–47, 2022.
Abstract | Links | BibTeX | Tags: activity recognition, IMWUT, ubiquitous computing, wearable computing
@article{2022-Haresamudram-ASSHARUW,
title = {Assessing the State of Self-Supervised Human Activity Recognition using Wearables},
author = {Harish Haresamudram and Irfan Essa and Thomas Ploetz},
url = {https://dl.acm.org/doi/10.1145/3550299
https://arxiv.org/abs/2202.12938
https://arxiv.org/pdf/2202.12938
},
doi = {10.1145/3550299},
year = {2022},
date = {2022-09-07},
urldate = {2022-09-07},
booktitle = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)},
journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)},
volume = {6},
number = {116},
issue = {3},
pages = {1--47},
publisher = {ACM},
abstract = {The emergence of self-supervised learning in the field of wearables-based human activity recognition (HAR) has opened up opportunities to tackle the most pressing challenges in the field, namely to exploit unlabeled data to derive reliable recognition systems for scenarios where only small amounts of labeled training samples can be collected. As such, self-supervision, i.e., the paradigm of 'pretrain-then-finetune', has the potential to become a strong alternative to the predominant end-to-end training approaches, let alone hand-crafted features for the classic activity recognition chain. Recently, a number of contributions have been made that introduced self-supervised learning into the field of HAR, including Multi-task self-supervision, Masked Reconstruction, CPC, and SimCLR, to name but a few. With the initial success of these methods, the time has come for a systematic inventory and analysis of the potential self-supervised learning has for the field. This paper provides exactly that. We assess the progress of self-supervised HAR research by introducing a framework that performs a multi-faceted exploration of model performance. We organize the framework into three dimensions, each containing three constituent criteria, such that each dimension captures specific aspects of performance, including the robustness to differing source and target conditions, the influence of dataset characteristics, and the feature space characteristics. We utilize this framework to assess seven state-of-the-art self-supervised methods for HAR, leading to the formulation of insights into the properties of these techniques and establishing their value towards learning representations for diverse scenarios.
},
keywords = {activity recognition, IMWUT, ubiquitous computing, wearable computing},
pubstate = {published},
tppubtype = {article}
}
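The paper assesses the 'pretrain-then-finetune' paradigm for wearables-based HAR. For readers unfamiliar with that workflow, the sketch below shows its skeleton: a frozen encoder standing in for a self-supervised pretrained model, with a small classifier fitted on a handful of labeled windows. The data, encoder, and class count are stubs, not any method from the paper.

```python
# Minimal sketch of the 'pretrain-then-finetune' paradigm the paper assesses:
# freeze a (here: untrained stand-in) encoder learned with self-supervision,
# then fit a small classifier on the few labeled windows available. Data and
# encoder are random stubs; only the workflow is illustrated.
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # stand-in for a pretrained encoder
    nn.Conv1d(3, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
)
for p in encoder.parameters():
    p.requires_grad = False                   # frozen during fine-tuning

classifier = nn.Linear(32, 6)                 # e.g. 6 activity classes
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)

x = torch.randn(64, 3, 100)                   # 64 windows, 3-axis accel, 100 samples
y = torch.randint(0, 6, (64,))
for _ in range(5):                            # tiny fine-tuning loop
    logits = classifier(encoder(x))
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```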
Daniel Nkemelu, Harshil Shah, Irfan Essa, Michael L. Best
Tackling Hate Speech in Low-resource Languages with Context Experts Proceedings Article
In: International Conference on Information & Communication Technologies and Development (ICTD), 2022.
Abstract | Links | BibTeX | Tags: computational journalism, ICTD, social computing
@inproceedings{2022-Nkemelu-THSLLWCE,
title = {Tackling Hate Speech in Low-resource Languages with Context Experts},
author = {Daniel Nkemelu and Harshil Shah and Irfan Essa and Michael L. Best},
url = {https://www.nkemelu.com/data/ictd2022_nkemelu_final.pdf
},
year = {2022},
date = {2022-06-01},
urldate = {2022-06-01},
booktitle = {International Conference on Information & Communication Technologies and Development (ICTD)},
abstract = {Given Myanmar's historical and socio-political context, hate speech spread on social media has escalated into offline unrest and violence. This paper presents findings from our remote study on the automatic detection of hate speech online in Myanmar. We argue that effectively addressing this problem will require community-based approaches that combine the knowledge of context experts with machine learning tools that can analyze the vast amount of data produced. To this end, we develop a systematic process to facilitate this collaboration covering key aspects of data collection, annotation, and model validation strategies. We highlight challenges in this area stemming from small and imbalanced datasets, the need to balance non-glamorous data work and stakeholder priorities, and closed data sharing practices. Based on these findings, we discuss avenues for further work in developing and deploying hate speech detection systems for low-resource languages.},
keywords = {computational journalism, ICTD, social computing},
pubstate = {published},
tppubtype = {inproceedings}
}
Niranjan Kumar, Irfan Essa, Sehoon Ha
Graph-based Cluttered Scene Generation and Interactive Exploration using Deep Reinforcement Learning Proceedings Article
In: Proceedings of the International Conference on Robotics and Automation (ICRA), pp. 7521-7527, 2022.
Abstract | Links | BibTeX | Tags: ICRA, machine learning, reinforcement learning, robotics
@inproceedings{2021-Kumar-GCSGIEUDRL,
title = {Graph-based Cluttered Scene Generation and Interactive Exploration using Deep Reinforcement Learning},
author = {Niranjan Kumar and Irfan Essa and Sehoon Ha},
url = {https://doi.org/10.1109/ICRA46639.2022.9811874
https://arxiv.org/abs/2109.10460
https://arxiv.org/pdf/2109.10460
https://www.kniranjankumar.com/projects/5_clutr
https://kniranjankumar.github.io/assets/pdf/graph_based_clutter.pdf
https://youtu.be/T2Jo7wwaXss},
doi = {10.1109/ICRA46639.2022.9811874},
year = {2022},
date = {2022-05-01},
urldate = {2022-05-01},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
journal = {arXiv},
number = {2109.10460},
pages = {7521-7527},
abstract = {We introduce a novel method to teach a robotic agent to interactively explore cluttered yet structured scenes, such as kitchen pantries and grocery shelves, by leveraging the physical plausibility of the scene. We propose a novel learning framework to train an effective scene exploration policy to discover hidden objects with minimal interactions. First, we define a novel scene grammar to represent structured clutter. Then we train a Graph Neural Network (GNN) based Scene Generation agent using deep reinforcement learning (deep RL), to manipulate this Scene Grammar to create a diverse set of stable scenes, each containing multiple hidden objects. Given such cluttered scenes, we then train a Scene Exploration agent, using deep RL, to uncover hidden objects by interactively rearranging the scene.
},
keywords = {ICRA, machine learning, reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
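The abstract's starting point is a scene grammar for structured clutter in which objects can hide behind others on shelves. The toy sketch below shows one simple way to represent such a scene as objects with supports and rows, plus a rule for which objects count as hidden; it is an illustrative stand-in, not the paper's grammar or its GNN-based agents.

```python
# Hedged sketch of representing structured clutter: nodes are objects, each
# records which surface supports it and whether it sits in the front or back
# row, and an object counts as hidden if a wider front-row object shares its
# support. This is an illustrative toy, not the paper's scene grammar.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SceneObject:
    name: str
    width: float
    support: Optional[str] = None     # name of the supporting shelf/surface
    row: int = 0                      # 0 = front row, 1 = behind

def hidden_objects(objects: List[SceneObject]) -> List[str]:
    """An object is hidden if a wider front-row object shares its support."""
    hidden = []
    for obj in objects:
        if obj.row > 0 and any(o.support == obj.support and o.row == 0
                               and o.width >= obj.width for o in objects):
            hidden.append(obj.name)
    return hidden

shelf = [
    SceneObject("cereal_box", width=0.3, support="shelf_1", row=0),
    SceneObject("soup_can", width=0.1, support="shelf_1", row=1),
    SceneObject("jar", width=0.12, support="shelf_2", row=0),
]
print(hidden_objects(shelf))   # ['soup_can']
```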
Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song
Learning Temporal Rules from Noisy Timeseries Data Journal Article
In: arXiv preprint arXiv:2202.05403, 2022.
Abstract | Links | BibTeX | Tags: activity recognition, machine learning
@article{2022-Samel-LTRFNTD,
title = {Learning Temporal Rules from Noisy Timeseries Data},
author = {Karan Samel and Zelin Zhao and Binghong Chen and Shuang Li and Dharmashankar Subramanian and Irfan Essa and Le Song},
url = {https://arxiv.org/abs/2202.05403
https://arxiv.org/pdf/2202.05403},
year = {2022},
date = {2022-02-01},
urldate = {2022-02-01},
journal = {arXiv preprint arXiv:2202.05403},
abstract = {Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher-level composite events. Examples of a composite event are a patient's medical symptom or a baseball player hitting a home run, caused by distinct temporal orderings of patient vitals and player movements, respectively. Such salient composite events are provided as labels in temporal datasets and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP) which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite event labels for supervision. This is done through efficiently searching through the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and healthcare datasets where it outperforms the baseline methods for rule discovery.
},
keywords = {activity recognition, machine learning},
pubstate = {published},
tppubtype = {article}
}
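Neural TLP, as summarized above, learns temporal relations between atomic events in a differentiable way and composes them into rules for composite events. The sketch below shows one building block such a system might use: a soft "A before B" predicate on event timestamps, combined with a product for conjunction. The temperature and the product-as-AND choice are assumptions for illustration, not the paper's operators.

```python
# Hedged sketch of a differentiable temporal predicate: a soft "A happens
# before B" score computed from event timestamps, which can be multiplied to
# score a composite event. Temperature and the AND-as-product choice are
# illustrative assumptions.
import torch

def soft_before(t_a: torch.Tensor, t_b: torch.Tensor, tau: float = 1.0):
    """Approaches 1 when t_a << t_b, 0 when t_a >> t_b; differentiable in both."""
    return torch.sigmoid((t_b - t_a) / tau)

# Toy composite event "symptom": fever before rash AND rash before recovery.
t_fever = torch.tensor(2.0, requires_grad=True)
t_rash = torch.tensor(5.0, requires_grad=True)
t_recovery = torch.tensor(9.0, requires_grad=True)

score = soft_before(t_fever, t_rash) * soft_before(t_rash, t_recovery)
score.backward()                       # gradients flow back to the event times
print(float(score), float(t_fever.grad))
```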
Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa
Discrete Representations Strengthen Vision Transformer Robustness Proceedings Article
In: Proceedings of International Conference on Learning Representations (ICLR), 2022.
Abstract | Links | BibTeX | Tags: computer vision, google, machine learning, vision transformer
@inproceedings{2022-Mao-DRSVTR,
title = {Discrete Representations Strengthen Vision Transformer Robustness},
author = {Chengzhi Mao and Lu Jiang and Mostafa Dehghani and Carl Vondrick and Rahul Sukthankar and Irfan Essa},
url = {https://iclr.cc/virtual/2022/poster/6647
https://arxiv.org/abs/2111.10493
https://research.google/pubs/pub51388/
https://openreview.net/forum?id=8hWs60AZcWk},
doi = {10.48550/arXiv.2111.10493},
year = {2022},
date = {2022-01-28},
urldate = {2022-04-01},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
journal = {arXiv preprint arXiv:2111.10493},
abstract = {Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet are overly reliant on local textures and fail to make adequate use of shape information. ViTs thus have difficulties generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder. Different from the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and contain less information individually, which promote ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.},
keywords = {computer vision, google, machine learning, vision transformer},
pubstate = {published},
tppubtype = {inproceedings}
}
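The modification described in the abstract adds discrete tokens from a vector-quantized encoder to ViT's input layer. The sketch below shows one generic way to combine per-patch discrete codes with continuous patch embeddings; the stub codes, the additive combination, and all sizes are illustrative assumptions rather than the paper's exact design.

```python
# Hedged sketch of the input-layer idea described in the abstract: alongside
# the usual continuous patch embeddings, embed a discrete code per patch
# (here produced by a stub in place of a real vector-quantized encoder) and
# feed the combined tokens to a transformer. Sizes are arbitrary placeholders.
import torch
import torch.nn as nn

PATCHES, PATCH_DIM, DIM, CODEBOOK = 196, 768, 256, 1024

class DiscreteTokenViTInput(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_proj = nn.Linear(PATCH_DIM, DIM)       # continuous pixel tokens
        self.code_embed = nn.Embedding(CODEBOOK, DIM)     # discrete token embeddings
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches, codes):
        tokens = self.patch_proj(patches) + self.code_embed(codes)
        return self.encoder(tokens)

patches = torch.randn(2, PATCHES, PATCH_DIM)              # flattened image patches
codes = torch.randint(0, CODEBOOK, (2, PATCHES))          # stub VQ-encoder output
print(DiscreteTokenViTInput()(patches, codes).shape)      # (2, 196, 256)
```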
Steven Hickson, Karthik Raveendran, Irfan Essa
Sharing Decoders: Network Fission for Multi-Task Pixel Prediction Proceedings Article
In: IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3771–3780, 2022.
Abstract | Links | BibTeX | Tags: computer vision, google, machine learning
@inproceedings{2022-Hickson-SDNFMPP,
title = {Sharing Decoders: Network Fission for Multi-Task Pixel Prediction},
author = {Steven Hickson and Karthik Raveendran and Irfan Essa},
url = {https://openaccess.thecvf.com/content/WACV2022/papers/Hickson_Sharing_Decoders_Network_Fission_for_Multi-Task_Pixel_Prediction_WACV_2022_paper.pdf
https://openaccess.thecvf.com/content/WACV2022/supplemental/Hickson_Sharing_Decoders_Network_WACV_2022_supplemental.pdf
https://youtu.be/qqYODA4C6AU},
doi = {10.1109/WACV51458.2022.00371},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
pages = {3771--3780},
abstract = {We examine the benefits of splitting encoder-decoders for multitask learning and showcase results on three tasks (semantics, surface normals, and depth) while adding very few FLOPS per task. Current hard parameter sharing methods for multi-task pixel-wise labeling use one shared encoder with separate decoders for each task. We generalize this notion and term the splitting of encoder-decoder architectures at different points as fission. Our ablation studies on fission show that sharing most of the decoder layers in multi-task encoder-decoder networks results in improvement while adding far fewer parameters per task. Our proposed method trains faster, uses less memory, results in better accuracy, and uses significantly fewer floating point operations (FLOPS) than conventional multi-task methods, with additional tasks only requiring 0.017% more FLOPS than the single-task network.},
keywords = {computer vision, google, machine learning},
pubstate = {published},
tppubtype = {inproceedings}
}
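The fission idea summarized above shares most of the decoder across tasks and splits into per-task layers only near the output. The sketch below illustrates that split with a small shared convolutional decoder and one lightweight head per task (semantics, normals, depth); layer sizes and the choice of split point are illustrative, not the paper's architecture.

```python
# Hedged sketch of sharing most of the decoder across tasks and splitting off
# ("fission") only lightweight per-task prediction layers near the output.
# Layer sizes and the split point are illustrative choices.
import torch
import torch.nn as nn

class FissionDecoderSketch(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.shared_decoder = nn.Sequential(      # shared by all tasks
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
        )
        self.heads = nn.ModuleDict({              # tiny per-task heads
            "semantics": nn.Conv2d(64, 21, 1),    # e.g. 21 classes
            "normals": nn.Conv2d(64, 3, 1),
            "depth": nn.Conv2d(64, 1, 1),
        })

    def forward(self, encoder_feats):
        shared = self.shared_decoder(encoder_feats)
        return {task: head(shared) for task, head in self.heads.items()}

feats = torch.randn(1, 256, 32, 32)
outs = FissionDecoderSketch()(feats)
print({k: tuple(v.shape) for k, v in outs.items()})
```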
Niranjan Kumar, Irfan Essa, Sehoon Ha
Cascaded Compositional Residual Learning for Complex Interactive Behaviors Proceedings Article
In: Sim-to-Real Robot Learning: Locomotion and Beyond Workshop at the Conference on Robot Learning (CoRL), arXiv, 2022.
Abstract | Links | BibTeX | Tags: reinforcement learning, robotics
@inproceedings{2022-Kumar-CCRLCIB,
title = {Cascaded Compositional Residual Learning for Complex Interactive Behaviors},
author = {Niranjan Kumar and Irfan Essa and Sehoon Ha},
url = {https://arxiv.org/abs/2212.08954
https://www.kniranjankumar.com/ccrl/static/pdf/paper.pdf
https://youtu.be/fAklIxiK7Qg
},
doi = {10.48550/ARXIV.2212.08954},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Sim-to-Real Robot Learning: Locomotion and Beyond Workshop at the Conference on Robot Learning (CoRL)},
publisher = {arXiv},
abstract = {Real-world autonomous missions often require rich interaction with nearby objects, such as doors or switches, along with effective navigation. However, such complex behaviors are difficult to learn because they involve both high-level planning and low-level motor control. We present a novel framework, Cascaded Compositional Residual Learning (CCRL), which learns composite skills by recursively leveraging a library of previously learned control policies. Our framework learns multiplicative policy composition, task-specific residual actions, and synthetic goal information simultaneously while freezing the prerequisite policies. We further explicitly control the style of the motion by regularizing residual actions. We show that our framework learns joint-level control policies for a diverse set of motor skills ranging from basic locomotion to complex interactive navigation, including navigating around obstacles, pushing objects, crawling under a table, pushing a door open with its leg, and holding it open while walking through it. The proposed CCRL framework leads to policies with consistent styles and lower joint torques, which we successfully transfer to a real Unitree A1 robot without any additional fine-tuning.},
keywords = {reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
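CCRL, as described above, composes a library of frozen prerequisite policies with learned weights while also learning a task-specific residual action. The sketch below shows a simplified version of that recipe using a softmax-weighted sum of base-policy actions plus a scaled residual; this simplification, and every size in it, is an assumption for illustration rather than the paper's multiplicative composition.

```python
# Hedged sketch of the high-level recipe in the abstract: keep a library of
# frozen prerequisite policies, learn weights to combine their actions, and
# add a learned task-specific residual action. The softmax-weighted sum is a
# simplification for illustration, not the paper's exact composition.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, N_BASE = 32, 12, 3

base_policies = nn.ModuleList(                     # previously learned skills
    [nn.Linear(OBS_DIM, ACT_DIM) for _ in range(N_BASE)]
)
for p in base_policies.parameters():
    p.requires_grad = False                        # frozen prerequisites

class CompositeResidualPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight_net = nn.Linear(OBS_DIM, N_BASE)     # composition weights
        self.residual_net = nn.Linear(OBS_DIM, ACT_DIM)  # task-specific residual

    def forward(self, obs):
        w = torch.softmax(self.weight_net(obs), dim=-1)               # (B, N)
        base = torch.stack([pi(obs) for pi in base_policies], dim=1)  # (B, N, A)
        composed = (w.unsqueeze(-1) * base).sum(dim=1)
        return composed + 0.1 * self.residual_net(obs)                # small residual

obs = torch.randn(4, OBS_DIM)
print(CompositeResidualPolicy()(obs).shape)        # torch.Size([4, 12])
```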