A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is
Publications:
Erik Wijmans, Irfan Essa, Dhruv Batra
How to Train PointGoal Navigation Agents on a (Sample and Compute) Budget Proceedings Article
In: International Conference on Autonomous Agents and Multi-Agent Systems, 2022.
Abstract | Links | BibTeX | Tags: computer vision, embodied agents, navigation
@inproceedings{2022-Wijmans-TPNASCB,
title = {How to Train PointGoal Navigation Agents on a (Sample and Compute) Budget},
author = {Erik Wijmans and Irfan Essa and Dhruv Batra},
url = {https://arxiv.org/abs/2012.06117
https://ifaamas.org/Proceedings/aamas2022/pdfs/p1762.pdf},
doi = {10.48550/arXiv.2012.06117},
year = {2022},
date = {2022-12-01},
urldate = {2020-12-01},
booktitle = {International Conference on Autonomous Agents and Multi-Agent Systems},
journal = {arXiv},
number = {arXiv:2012.06117},
abstract = {PointGoal navigation has seen significant recent interest and progress, spurred on by the Habitat platform and associated challenge. In this paper, we study PointGoal navigation under both a sample budget (75 million frames) and a compute budget (1 GPU for 1 day). We conduct an extensive set of experiments, cumulatively totaling over 50,000 GPU-hours, that let us identify and discuss a number of ostensibly minor but significant design choices -- the advantage estimation procedure (a key component in training), visual encoder architecture, and a seemingly minor hyper-parameter change. Overall, these design choices lead to considerable and consistent improvements over the baselines present in Savva et al. Under a sample budget, performance for RGB-D agents improves by 8 SPL on Gibson (14% relative improvement) and 20 SPL on Matterport3D (38% relative improvement). Under a compute budget, performance for RGB-D agents improves by 19 SPL on Gibson (32% relative improvement) and 35 SPL on Matterport3D (220% relative improvement). We hope our findings and recommendations will serve to make the community's experiments more efficient.},
keywords = {computer vision, embodied agents, navigation},
pubstate = {published},
tppubtype = {inproceedings}
}
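The entry above singles out the advantage-estimation procedure as one of the design choices that matter under tight sample and compute budgets. As a rough illustration (not the paper's code), the sketch below computes generalized advantage estimation (GAE) for a single rollout; the function name and the gamma/lam defaults are illustrative choices, not values taken from the paper.

import numpy as np

def generalized_advantage_estimates(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: length-T arrays for one rollout; values: length T+1,
    # with the last entry being the bootstrap value of the state after the
    # final step. gamma/lam defaults are illustrative, not from the paper.
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # One-step TD error; the bootstrap term is zeroed at episode ends.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    return advantages, returns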
Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra
Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views Proceedings Article
In: Proceedings of American Association of Artificial Intelligence Conference (AAAI), AAAI, 2021.
Abstract | Links | BibTeX | Tags: AAAI, AI, embodied agents, first-person vision
@inproceedings{2021-Cartillier-SMBASRFEV,
title = {Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views},
author = {Vincent Cartillier and Zhile Ren and Neha Jain and Stefan Lee and Irfan Essa and Dhruv Batra},
url = {https://arxiv.org/abs/2010.01191
https://vincentcartillier.github.io/smnet.html
https://ojs.aaai.org/index.php/AAAI/article/view/16180/15987},
doi = {10.48550/arXiv.2010.01191},
year = {2021},
date = {2021-02-01},
urldate = {2021-02-01},
booktitle = {Proceedings of American Association of Artificial Intelligence Conference (AAAI)},
publisher = {AAAI},
abstract = {We study the task of semantic mapping -- specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map (`what is where?') from egocentric observations of an RGB-D camera with known pose (via localization sensors). Importantly, our goal is to build neural episodic memories and spatio-semantic representations of 3D spaces that enable the agent to easily learn subsequent tasks in the same space -- navigating to objects seen during the tour (`Find chair') or answering questions about the space (`How many chairs did you see in the house?').
Towards this goal, we present Semantic MapNet (SMNet), which consists of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length × width × feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the spatio-semantic allocentric representations built by SMNet for the task of ObjectNav and Embodied Question Answering.},
keywords = {AAAI, AI, embodied agents, first-person vision},
pubstate = {published},
tppubtype = {inproceedings}
}
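The SMNet entry above describes projecting egocentric features onto an allocentric floor-plan memory using known camera pose. The sketch below is a minimal, assumption-laden version of that projection step: it back-projects pixels using depth and intrinsics, discretizes the resulting world coordinates into grid cells, and sums features per cell. The function name, cell size, grid origin, y-up world convention, and the simple sum-and-count accumulation (in place of SMNet's learned accumulation) are all hypothetical.

import numpy as np

def project_to_floorplan(features, depth, K, cam_to_world, memory, counts,
                         cell_size=0.05, origin=(0.0, 0.0)):
    # features: (H, W, C) egocentric feature map; depth: (H, W) in meters.
    # K: (3, 3) intrinsics; cam_to_world: (4, 4) known camera pose (y-up world).
    # memory: (L, Wg, C) running feature sums; counts: (L, Wg) hits per cell.
    # cell_size and origin are hypothetical grid parameters.
    H, W, C = features.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    # Back-project pixels to camera coordinates, then transform to world.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    # Discretize the horizontal (x, z) world coordinates into grid cells.
    gi = np.floor((pts_world[:, 0] - origin[0]) / cell_size).astype(int)
    gj = np.floor((pts_world[:, 2] - origin[1]) / cell_size).astype(int)
    valid = (z.reshape(-1) > 0) & (gi >= 0) & (gi < memory.shape[0]) \
            & (gj >= 0) & (gj < memory.shape[1])
    feats = features.reshape(-1, C)
    # Accumulate by summation; dividing memory by counts later gives a
    # per-cell mean, a crude stand-in for SMNet's learned accumulation.
    np.add.at(memory, (gi[valid], gj[valid]), feats[valid])
    np.add.at(counts, (gi[valid], gj[valid]), 1)
    return memory, counts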
Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra
Decentralized Distributed PPO: Solving PointGoal Navigation Proceedings Article
In: Proceedings of International Conference on Learning Representations (ICLR), 2020.
Abstract | Links | BibTeX | Tags: embodied agents, ICLR, navigation, systems for ML
@inproceedings{2020-Wijmans-DDSPN,
title = {Decentralized Distributed PPO: Solving PointGoal Navigation},
author = {Erik Wijmans and Abhishek Kadian and Ari Morcos and Stefan Lee and Irfan Essa and Devi Parikh and Manolis Savva and Dhruv Batra},
url = {https://arxiv.org/abs/1911.00357
https://paperswithcode.com/paper/decentralized-distributed-ppo-solving},
year = {2020},
date = {2020-04-01},
urldate = {2020-04-01},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
abstract = {We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever stale), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.
This massive-scale training not only sets the state of the art on the Habitat Autonomous Navigation Challenge 2019, but essentially solves the task -- near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs. computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of ImageNet pre-training + task-specific fine-tuning for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available).},
keywords = {embodied agents, ICLR, navigation, systems for ML},
pubstate = {published},
tppubtype = {inproceedings}
}
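DD-PPO's core idea, per the abstract above, is synchronous, decentralized gradient sharing with no parameter server. The sketch below illustrates that pattern with a plain all-reduce over gradients via torch.distributed; it assumes the process group has already been initialized, and it is a generic illustration of the pattern rather than the authors' released implementation.

import torch
import torch.distributed as distrib

def synchronous_decentralized_step(model, optimizer, loss):
    # Assumes torch.distributed.init_process_group() has been called and
    # that every worker computed `loss` from its own locally collected
    # rollouts. Gradients are averaged with an all-reduce, so there is no
    # central server and no gradient is ever stale.
    optimizer.zero_grad()
    loss.backward()
    world_size = distrib.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            distrib.all_reduce(p.grad.data)   # sum gradients across workers
            p.grad.data.div_(world_size)      # average
    optimizer.step()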
Erik Wijmans, Julian Straub, Dhruv Batra, Irfan Essa, Judy Hoffman, Ari Morcos
Analyzing Visual Representations in Embodied Navigation Tasks Technical Report
no. arXiv:2003.05993, 2020.
Abstract | Links | BibTeX | Tags: arXiv, embodied agents, navigation
@techreport{2020-Wijmans-AVRENT,
title = {Analyzing Visual Representations in Embodied Navigation Tasks},
author = {Erik Wijmans and Julian Straub and Dhruv Batra and Irfan Essa and Judy Hoffman and Ari Morcos},
url = {https://arxiv.org/abs/2003.05993
https://arxiv.org/pdf/2003.05993},
doi = {10.48550/arXiv.2003.05993},
year = {2020},
date = {2020-03-01},
urldate = {2020-03-01},
journal = {arXiv},
number = {arXiv:2003.05993},
abstract = {Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are often over-specialized to the target task. In this work, we present a methodology to study the underlying potential causes for this specialization. We use the recently proposed projection weighted Canonical Correlation Analysis (PWCCA) to measure the similarity of visual representations learned in the same environment by performing different tasks.
We then leverage our proposed methodology to examine the task dependence of visual representations learned on related but distinct embodied navigation tasks. Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures. We then empirically demonstrate that visual representations learned on one task can be effectively transferred to a different task.},
howpublished = {arXiv:2003.05993},
keywords = {arXiv, embodied agents, navigation},
pubstate = {published},
tppubtype = {techreport}
}
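The report above measures the similarity of visual representations with projection-weighted CCA (PWCCA). The sketch below is a simplified take on that metric for two activation matrices collected on the same inputs; it omits the SVD-based preprocessing that is typically applied before CCA, and the projection weighting follows my reading of Morcos et al. (2018) rather than the paper's code.

import numpy as np

def pwcca_similarity(X, Y, eps=1e-10):
    # X, Y: (num_examples, num_neurons) activations of two networks (or two
    # layers) recorded on the same inputs, with num_examples >= num_neurons.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Orthonormal bases for the two activation subspaces.
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the canonical correlations rho_i.
    U, rho, _ = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    # Canonical variates of X, and weights measuring how much of X each
    # canonical direction accounts for.
    H = Qx @ U
    alpha = np.abs(H.T @ X).sum(axis=1)
    alpha = alpha / max(alpha.sum(), eps)
    # PWCCA: a projection-weighted average of the canonical correlations
    # instead of a plain mean.
    return float((alpha * rho).sum())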
Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
Audio Visual Scene-Aware Dialog Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Abstract | Links | BibTeX | Tags: computational video, computer vision, CVPR, embodied agents, vision & language
@inproceedings{2019-Alamri-AVSD,
title = {Audio Visual Scene-Aware Dialog},
author = {Huda Alamri and Vincent Cartillier and Abhishek Das and Jue Wang and Anoop Cherian and Irfan Essa and Dhruv Batra and Tim K. Marks and Chiori Hori and Peter Anderson and Stefan Lee and Devi Parikh},
url = {https://openaccess.thecvf.com/content_CVPR_2019/papers/Alamri_Audio_Visual_Scene-Aware_Dialog_CVPR_2019_paper.pdf
https://video-dialog.com/
https://arxiv.org/abs/1901.09107},
doi = {10.1109/CVPR.2019.00774},
year = {2019},
date = {2019-06-01},
urldate = {2019-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
},
keywords = {computational video, computer vision, CVPR, embodied agents, vision & language},
pubstate = {published},
tppubtype = {inproceedings}
}
Huda Alamri, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Jue Wang, Irfan Essa, Dhruv Batra, Devi Parikh, Anoop Cherian, Tim K Marks, Chiori Hori
Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7 Technical Report
no. arXiv:1806.00525, 2018.
Abstract | Links | BibTeX | Tags: arXiv, embodied agents, multimedia, vision & language
@techreport{2018-Alamri-AVSDACD,
title = {Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7},
author = {Huda Alamri and Vincent Cartillier and Raphael Gontijo Lopes and Abhishek Das and Jue Wang and Irfan Essa and Dhruv Batra and Devi Parikh and Anoop Cherian and Tim K Marks and Chiori Hori},
url = {https://video-dialog.com/
https://arxiv.org/abs/1806.00525},
doi = {10.48550/arXiv.1806.00525},
year = {2018},
date = {2018-06-01},
urldate = {2018-06-01},
journal = {arXiv},
number = {arXiv:1806.00525},
abstract = {Scene-aware dialog systems will be able to have conversations with users about the objects and events around them. Progress on such systems can be made by integrating state-of-the-art technologies from multiple research areas including end-to-end dialog systems, visual dialog, and video description. We introduce the Audio Visual Scene-Aware Dialog (AVSD) challenge and dataset. In this challenge, which is one track of the 7th Dialog System Technology Challenges (DSTC7) workshop, the task is to build a system that generates responses in a dialog about an input video.
},
howpublished = {arXiv:1806.00525},
keywords = {arXiv, embodied agents, multimedia, vision & language},
pubstate = {published},
tppubtype = {techreport}
}
Other Publication Sites
A few more sites that aggregate research publications: Academic.edu, Bibsonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.