A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is https://orcid.org/0000-0002-6236-2969.

Publications:
Harish Haresamudram, Irfan Essa, Thomas Ploetz
Contrastive Predictive Coding for Human Activity Recognition Journal Article
In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 2, pp. 1–26, 2021.
Abstract | Links | BibTeX | Tags: activity recognition, IMWUT, machine learning, ubiquitous computing
@article{2021-Haresamudram-CPCHAR,
title = {Contrastive Predictive Coding for Human Activity Recognition},
author = {Harish Haresamudram and Irfan Essa and Thomas Ploetz},
url = {https://doi.org/10.1145/3463506
https://arxiv.org/abs/2012.05333},
doi = {10.1145/3463506},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
booktitle = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
volume = {5},
number = {2},
pages = {1--26},
abstract = {Feature extraction is crucial for human activity recognition (HAR) using body-worn movement sensors. Recently, learned representations have been used successfully, offering promising alternatives to manually engineered features. Our work focuses on effective use of small amounts of labeled data and the opportunistic exploitation of unlabeled data that are straightforward to collect in mobile and ubiquitous computing scenarios. We hypothesize and demonstrate that explicitly considering the temporality of sensor data at representation level plays an important role for effective HAR in challenging scenarios. We introduce the Contrastive Predictive Coding (CPC) framework to human activity recognition, which captures the long-term temporal structure of sensor data streams. Through a range of experimental evaluations on real-life recognition tasks, we demonstrate its effectiveness for improved HAR. CPC-based pre-training is self-supervised, and the resulting learned representations can be integrated into standard activity chains. It leads to significantly improved recognition performance when only small amounts of labeled training data are available, thereby demonstrating the practical value of our approach.},
keywords = {activity recognition, IMWUT, machine learning, ubiquitous computing},
pubstate = {published},
tppubtype = {article}
}
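For readers who want to see the shape of the objective described above, here is a minimal, hedged sketch of an InfoNCE-style contrastive predictive coding loss on windows of body-worn sensor data (Python/PyTorch). The convolutional encoder, GRU aggregator, window length, and all names are illustrative assumptions, not the paper's actual architecture or hyperparameters.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCSketch(nn.Module):
    """Illustrative CPC-style model: encode timesteps, summarize context, predict future latents."""
    def __init__(self, in_channels=3, feat_dim=128, context_dim=256, pred_steps=4):
        super().__init__()
        self.encoder = nn.Sequential(                        # per-timestep encoder z_t
            nn.Conv1d(in_channels, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        self.context = nn.GRU(feat_dim, context_dim, batch_first=True)   # autoregressive context c_t
        self.predictors = nn.ModuleList(
            [nn.Linear(context_dim, feat_dim) for _ in range(pred_steps)]
        )

    def forward(self, x):                                    # x: (batch, channels, time)
        z = self.encoder(x).transpose(1, 2)                  # (batch, time, feat_dim)
        c, _ = self.context(z)                               # (batch, time, context_dim)
        t = z.size(1) // 2                                   # predict the future from mid-window
        loss = 0.0
        for k, predictor in enumerate(self.predictors, start=1):
            pred = predictor(c[:, t])                        # predicted latent k steps ahead
            target = z[:, t + k]                             # true future latent
            logits = pred @ target.T                         # score against all windows in the batch
            labels = torch.arange(x.size(0), device=x.device)
            loss = loss + F.cross_entropy(logits, labels)    # InfoNCE: the true pair is the positive
        return loss / len(self.predictors)

# Toy usage: a batch of 16 accelerometer windows (3 axes, 64 timesteps).
model = CPCSketch()
loss = model(torch.randn(16, 3, 64))
loss.backward()

After pre-training with such an objective, the encoder (and optionally the context network) would be reused as a feature extractor in a standard activity-recognition pipeline, which is the integration the abstract refers to.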
Anh Truong, Peggy Chi, David Salesin, Irfan Essa, Maneesh Agrawala
Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos Proceedings Article
In: ACM CHI Conference on Human Factors in Computing Systems, 2021.
Abstract | Links | BibTeX | Tags: CHI, computational video, google, human-computer interaction, video summarization
@inproceedings{2021-Truong-AGTHTFIMV,
title = {Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos},
author = {Anh Truong and Peggy Chi and David Salesin and Irfan Essa and Maneesh Agrawala},
url = {https://dl.acm.org/doi/10.1145/3411764.3445721
https://research.google/pubs/pub50007/
http://anhtruong.org/makeup_breakdown/},
doi = {10.1145/3411764.3445721},
year = {2021},
date = {2021-05-01},
urldate = {2021-05-01},
booktitle = {ACM CHI Conference on Human Factors in Computing Systems},
abstract = {We present a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos. Our approach is inspired by prior research in cognitive psychology, which suggests that people mentally segment procedural tasks into event hierarchies, where coarse-grained events focus on objects while fine-grained events focus on actions. In the instructional makeup domain, we find that objects correspond to facial parts while fine-grained steps correspond to actions on those facial parts. Given an input instructional makeup video, we apply a set of heuristics that combine computer vision techniques with transcript text analysis to automatically identify the fine-level action steps and group these steps by facial part to form the coarse-level events. We provide a voice-enabled, mixed-media UI to visualize the resulting hierarchy and allow users to efficiently navigate the tutorial (e.g., skip ahead, return to previous steps) at their own pace. Users can navigate the hierarchy at both the facial-part and action-step levels using click-based interactions and voice commands. We demonstrate the effectiveness of segmentation algorithms and the resulting mixed-media UI on a variety of input makeup videos. A user study shows that users prefer following instructional makeup videos in our mixed-media format to the standard video UI and that they find our format much easier to navigate.},
keywords = {CHI, computational video, google, human-computer interaction, video summarization},
pubstate = {published},
tppubtype = {inproceedings}
}
Dan Scarafoni, Irfan Essa, Thomas Ploetz
PLAN-B: Predicting Likely Alternative Next Best Sequences for Action Prediction Technical Report
no. arXiv:2103.15987, 2021.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, computer vision
@techreport{2021-Scarafoni-PPLANBSAP,
title = {PLAN-B: Predicting Likely Alternative Next Best Sequences for Action Prediction},
author = {Dan Scarafoni and Irfan Essa and Thomas Ploetz},
url = {https://arxiv.org/abs/2103.15987},
doi = {10.48550/arXiv.2103.15987},
year = {2021},
date = {2021-03-01},
urldate = {2021-03-01},
journal = {arXiv},
number = {arXiv:2103.15987},
abstract = {Action prediction focuses on anticipating actions before they happen. Recent works leverage probabilistic approaches to describe future uncertainties and sample future actions. However, these methods cannot easily find all alternative predictions, which are essential given the inherent unpredictability of the future, and current evaluation protocols do not measure a system's ability to find such alternatives. We re-examine action prediction in terms of its ability to predict not only the top predictions, but also top alternatives with the accuracy@k metric. In addition, we propose Choice F1: a metric inspired by F1 score which evaluates a prediction system's ability to find all plausible futures while keeping only the most probable ones. To evaluate this problem, we present a novel method, Predicting the Likely Alternative Next Best, or PLAN-B, for action prediction which automatically finds the set of most likely alternative futures. PLAN-B consists of two novel components: (i) a Choice Table which ensures that all possible futures are found, and (ii) a "Collaborative" RNN system which combines both action sequence and feature information. We demonstrate that our system outperforms state-of-the-art results on benchmark datasets.
},
keywords = {activity recognition, arXiv, computer vision},
pubstate = {published},
tppubtype = {techreport}
}
Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra
Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views Proceedings Article
In: Proceedings of American Association of Artificial Intelligence Conference (AAAI), AAAI, 2021.
Abstract | Links | BibTeX | Tags: AAAI, AI, embodied agents, first-person vision
@inproceedings{2021-Cartillier-SMBASRFEV,
title = {Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views},
author = {Vincent Cartillier and Zhile Ren and Neha Jain and Stefan Lee and Irfan Essa and Dhruv Batra},
url = {https://arxiv.org/abs/2010.01191
https://vincentcartillier.github.io/smnet.html
https://ojs.aaai.org/index.php/AAAI/article/view/16180/15987},
doi = {10.48550/arXiv.2010.01191},
year = {2021},
date = {2021-02-01},
urldate = {2021-02-01},
booktitle = {Proceedings of American Association of Artificial Intelligence Conference (AAAI)},
publisher = {AAAI},
abstract = {We study the task of semantic mapping -- specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map (`what is where?') from egocentric observations of an RGB-D camera with known pose (via localization sensors). Importantly, our goal is to build neural episodic memories and spatio-semantic representations of 3D spaces that enable the agent to easily learn subsequent tasks in the same space -- navigating to objects seen during the tour (`Find chair') or answering questions about the space (`How many chairs did you see in the house?').
Towards this goal, we present Semantic MapNet (SMNet), which consists of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length × width × feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the spatio-semantic allocentric representations built by SMNet for the task of ObjectNav and Embodied Question Answering.},
keywords = {AAAI, AI, embodied agents, first-person vision},
pubstate = {published},
tppubtype = {inproceedings}
}
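As a rough illustration of the "Feature Projector" idea named in the abstract, the sketch below lifts per-pixel egocentric features to 3D with a depth map and a known camera pose, then accumulates them into an allocentric floor-plan grid. The intrinsics, map resolution, ground-plane convention, and all names are assumptions chosen for illustration, not SMNet's implementation.

import torch

def project_to_floorplan(feats, depth, pose, fx=256.0, fy=256.0, map_size=100, cell=0.1):
    """feats: (C, H, W) per-pixel features; depth: (H, W) in meters;
    pose: 4x4 camera-to-world transform. Returns a (C, map_size, map_size) top-down map."""
    C, H, W = feats.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - W / 2) * z / fx                                  # back-project pixels to camera frame
    y = (v - H / 2) * z / fy
    pts = torch.stack([x, y, z, torch.ones_like(z)], dim=-1).reshape(-1, 4)
    world = (pose @ pts.T).T[:, :3]                           # camera -> world via the known pose
    gx = (world[:, 0] / cell).long() + map_size // 2          # discretize onto floor-plan cells
    gy = (world[:, 2] / cell).long() + map_size // 2          # x/z taken as the ground plane
    valid = (gx >= 0) & (gx < map_size) & (gy >= 0) & (gy < map_size)
    idx = gy[valid] * map_size + gx[valid]
    flat = torch.zeros(C, map_size * map_size)
    flat.index_add_(1, idx, feats.reshape(C, -1)[:, valid])   # accumulate features per map cell
    return flat.reshape(C, map_size, map_size)

# Toy usage: random 8-channel features, a view 2 meters deep, identity pose.
topdown = project_to_floorplan(torch.randn(8, 60, 80), torch.full((60, 80), 2.0), torch.eye(4))

In the paper's pipeline, a learned memory tensor and a map decoder sit on top of this kind of projection; the sketch only covers the geometric accumulation step.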
Niranjan Kumar, Irfan Essa, Sehoon Ha, C. Karen Liu
Estimating Mass Distribution of Articulated Objects through Non-prehensile Manipulation Proceedings Article
In: Neural Information Processing Systems (NeurIPS) Workshop on Object Representations for Learning and Reasoning, NeurIPS 2020.
Abstract | Links | BibTeX | Tags: reinforcement learning, robotics
@inproceedings{2020-Kumar-EMDAOTNM,
title = {Estimating Mass Distribution of Articulated Objects through Non-prehensile Manipulation},
author = {Niranjan Kumar and Irfan Essa and Sehoon Ha and C. Karen Liu},
url = {https://orlrworkshop.github.io/program/orlr_25.html
http://arxiv.org/abs/1907.03964
https://www.kniranjankumar.com/projects/1_mass_prediction
https://www.youtube.com/watch?v=o3zBdVWvWZw
https://kniranjankumar.github.io/assets/pdf/Estimating_Mass_Distribution_of_Articulated_Objects_using_Non_prehensile_Manipulation.pdf},
year = {2020},
date = {2020-12-01},
urldate = {2020-12-01},
booktitle = {Neural Information Processing Systems (NeurIPS) Workshop on Object Representations for Learning and Reasoning},
organization = {NeurIPS},
abstract = {We explore the problem of estimating the mass distribution of an articulated object by an interactive robotic agent. Our method predicts the mass distribution of an object by using limited sensing and actuating capabilities of a robotic agent that is interacting with the object. We are inspired by the role of exploratory play in human infants. We take the combined approach of supervised and reinforcement learning to train an agent that learns to strategically interact with the object to estimate the object's mass distribution. Our method consists of two neural networks: (i) the policy network which decides how to interact with the object, and (ii) the predictor network that estimates the mass distribution given a history of observations and interactions. Using our method, we train a robotic arm to estimate the mass distribution of an object with moving parts (e.g. an articulated rigid body system) by pushing it on a surface with unknown friction properties. We also demonstrate how our training from simulations can be transferred to real hardware using a small amount of real-world data for fine-tuning. We use a UR10 robot to interact with 3D printed articulated chains with varying mass distributions and show that our method significantly outperforms the baseline system that uses random pushes to interact with the object.},
howpublished = {arXiv preprint arXiv:1907.03964},
keywords = {reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
Peggy Chi, Zheng Sun, Katrina Panovich, Irfan Essa
Automatic Video Creation From a Web Page Proceedings Article
In: Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pp. 279–292, ACM, 2020.
Abstract | Links | BibTeX | Tags: computational video, google, human-computer interaction, UIST, video editing
@inproceedings{2020-Chi-AVCFP,
title = {Automatic Video Creation From a Web Page},
author = {Peggy Chi and Zheng Sun and Katrina Panovich and Irfan Essa},
url = {https://dl.acm.org/doi/abs/10.1145/3379337.3415814
https://research.google/pubs/pub49618/
https://ai.googleblog.com/2020/10/experimenting-with-automatic-video.html
https://www.youtube.com/watch?v=3yFYc-Wet8k},
doi = {10.1145/3379337.3415814},
year = {2020},
date = {2020-10-01},
urldate = {2020-10-01},
booktitle = {Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology},
pages = {279--292},
organization = {ACM},
abstract = {Creating marketing videos from scratch can be challenging, especially when designing for multiple platforms with different viewing criteria. We present URL2Video, an automatic approach that converts a web page into a short video given temporal and visual constraints. URL2Video captures quality materials and design styles extracted from a web page, including fonts, colors, and layouts. Using constraint programming, URL2Video's design engine organizes the visual assets into a sequence of shots and renders to a video with user-specified aspect ratio and duration. Creators can review the video composition, modify constraints, and generate video variation through a user interface. We learned the design process from designers and compared our automatically generated results with their creation through interviews and an online survey. The evaluation shows that URL2Video effectively extracted design elements from a web page and supported designers by bootstrapping the video creation process.},
keywords = {computational video, google, human-computer interaction, UIST, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
Harish Haresamudram, Apoorva Beedu, Varun Agrawal, Patrick L Grady, Irfan Essa, Judy Hoffman, Thomas Plötz
Masked reconstruction based self-supervision for human activity recognition Proceedings Article
In: Proceedings of the International Symposium on Wearable Computers (ISWC), pp. 45–49, 2020.
Abstract | Links | BibTeX | Tags: activity recognition, ISWC, machine learning, wearable computing
@inproceedings{2020-Haresamudram-MRBSHAR,
title = {Masked reconstruction based self-supervision for human activity recognition},
author = {Harish Haresamudram and Apoorva Beedu and Varun Agrawal and Patrick L Grady and Irfan Essa and Judy Hoffman and Thomas Plötz},
url = {https://dl.acm.org/doi/10.1145/3410531.3414306
https://harkash.github.io/publication/masked-reconstruction
https://arxiv.org/abs/2202.12938},
doi = {10.1145/3410531.3414306},
year = {2020},
date = {2020-09-01},
urldate = {2020-09-01},
booktitle = {Proceedings of the International Symposium on Wearable Computers (ISWC)},
pages = {45--49},
abstract = {The ubiquitous availability of wearable sensing devices has rendered large scale collection of movement data a straightforward endeavor. Yet, annotation of these data remains a challenge and as such, publicly available datasets for human activity recognition (HAR) are typically limited in size as well as in variability, which constrains HAR model training and effectiveness. We introduce masked reconstruction as a viable self-supervised pre-training objective for human activity recognition and explore its effectiveness in comparison to state-of-the-art unsupervised learning techniques. In scenarios with small labeled datasets, the pre-training results in improvements over end-to-end learning on two of the four benchmark datasets. This is promising because the pre-training objective can be integrated "as is" into state-of-the-art recognition pipelines to effectively facilitate improved model robustness, and thus, ultimately, leading to better recognition performance.
},
keywords = {activity recognition, ISWC, machine learning, wearable computing},
pubstate = {published},
tppubtype = {inproceedings}
}
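To make the pre-training objective concrete, here is a minimal sketch of a masked-reconstruction step on a batch of sensor windows: random timesteps are zeroed out and a network is trained to reconstruct them, with the loss computed only at the masked positions. The MLP architecture, masking ratio, and names are illustrative assumptions rather than the paper's configuration.

import torch
import torch.nn as nn

class MaskedReconstructor(nn.Module):
    def __init__(self, channels=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, channels),                      # reconstruct the sensor reading
        )

    def forward(self, x):                                     # x: (batch, time, channels)
        return self.net(x)

def masked_reconstruction_step(model, x, mask_ratio=0.15):
    mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio   # choose timesteps to hide
    x_masked = x.clone()
    x_masked[mask] = 0.0                                      # zero out the selected timesteps
    recon = model(x_masked)
    return ((recon - x) ** 2)[mask].mean()                    # MSE only on masked positions

# Toy usage: 32 windows of 50 timesteps from a 3-axis accelerometer.
model = MaskedReconstructor()
loss = masked_reconstruction_step(model, torch.randn(32, 50, 3))
loss.backward()

The appeal noted in the abstract is that an encoder pre-trained this way can be dropped into an existing recognition pipeline before fine-tuning on the small labeled set.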
Hsin-Ying Lee, Lu Jiang, Irfan Essa, Madison Le, Haifeng Gong, Ming-Hsuan Yang, Weilong Yang
Neural Design Network: Graphic Layout Generation with Constraints Proceedings Article
In: Proceedings of European Conference on Computer Vision (ECCV), 2020.
Links | BibTeX | Tags: computer vision, content creation, ECCV, generative media, google
@inproceedings{2020-Lee-NDNGLGWC,
title = {Neural Design Network: Graphic Layout Generation with Constraints},
author = {Hsin-Ying Lee and Lu Jiang and Irfan Essa and Madison Le and Haifeng Gong and Ming-Hsuan Yang and Weilong Yang},
url = {https://arxiv.org/abs/1912.09421
https://rdcu.be/c7sqw},
doi = {10.1007/978-3-030-58580-8_29},
year = {2020},
date = {2020-08-01},
urldate = {2020-08-01},
booktitle = {Proceedings of European Conference on Computer Vision (ECCV)},
keywords = {computer vision, content creation, ECCV, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Caroline Pantofaru, Vinay Bettadapura, Krishna Bharat, Irfan Essa
Systems and methods for directing content generation using a first-person point-of-view device. Patent
2020.
Abstract | Links | BibTeX | Tags: computer vision, google, patents
@patent{2020-Pantofaru-SMDCGUFPD,
title = {Systems and methods for directing content generation using a first-person point-of-view device.},
author = {Caroline Pantofaru and Vinay Bettadapura and Krishna Bharat and Irfan Essa},
url = {https://patents.google.com/patent/US10721439},
year = {2020},
date = {2020-07-21},
urldate = {2020-07-01},
publisher = {(US Patent # 10721439)},
abstract = {A method for personalizing a content item using captured footage is disclosed. The method includes receiving a first video feed from a first camera, wherein the first camera is designated as a source camera for capturing an event during a first time duration. The method also includes receiving data from a second camera, and determining, based on the received data from the second camera, that an action was performed using the second camera, the action being indicative of a region of interest (ROI) of the user of the second camera occurring within a second time duration. The method further includes designating the second camera as the source camera for capturing the event during the second time duration.
},
howpublished = {US Patent # 10721439},
keywords = {computer vision, google, patents},
pubstate = {published},
tppubtype = {patent}
}
Peggy Chi, Irfan Essa
Interactive Visual Description of a Web Page for Smart Speakers Proceedings Article
In: Proceedings of ACM CHI Workshop, CUI@CHI: Mapping Grand Challenges for the Conversational User Interface Community, Honolulu, Hawaii, USA, 2020.
Abstract | Links | BibTeX | Tags: accessibility, CHI, google, human-computer interaction
@inproceedings{2020-Chi-IVDPSS,
title = {Interactive Visual Description of a Web Page for Smart Speakers},
author = {Peggy Chi and Irfan Essa},
url = {https://research.google/pubs/pub49441/
http://www.speechinteraction.org/CHI2020/programme.html},
year = {2020},
date = {2020-05-01},
urldate = {2020-05-01},
booktitle = {Proceedings of ACM CHI Workshop, CUI@CHI: Mapping Grand Challenges for the Conversational User Interface Community},
address = {Honolulu, Hawaii, USA},
abstract = {Smart speakers are becoming ubiquitous for accessing lightweight information using speech. While these devices are powerful for question answering and service operations using voice commands, it is challenging to navigate content of rich formats–including web pages–that are consumed by mainstream computing devices. We conducted a comparative study with 12 participants that suggests and motivates the use of a narrative voice output of a web page as being easier to follow and comprehend than a conventional screen reader. We are developing a tool that automatically narrates web documents based on their visual structures with interactive prompts. We discuss the design challenges for a conversational agent to intelligently select content for a more personalized experience, where we hope to contribute to the CUI workshop and form a discussion for future research.
},
keywords = {accessibility, CHI, google, human-computer interaction},
pubstate = {published},
tppubtype = {inproceedings}
}
Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar
Category learning neural networks Patent
2020.
Abstract | Links | BibTeX | Tags: google, machine learning, patents
@patent{2020-Hickson-CLNN,
title = {Category learning neural networks},
author = {Steven Hickson and Anelia Angelova and Irfan Essa and Rahul Sukthankar},
url = {https://patents.google.com/patent/US10635979},
year = {2020},
date = {2020-04-28},
urldate = {2020-04-28},
publisher = {(US Patent # 10635979)},
abstract = {Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining a clustering of images into a plurality of semantic categories. In one aspect, a method comprises: training a categorization neural network, comprising, at each of a plurality of iterations: processing an image depicting an object using the categorization neural network to generate (i) a current prediction for whether the image depicts an object or a background region, and (ii) a current embedding of the image; determining a plurality of current cluster centers based on the current values of the categorization neural network parameters, wherein each cluster center represents a respective semantic category; and determining a gradient of an objective function that includes a classification loss and a clustering loss, wherein the clustering loss depends on a similarity between the current embedding of the image and the current cluster centers.
},
howpublished = {US Patent #10635979},
keywords = {google, machine learning, patents},
pubstate = {published},
tppubtype = {patent}
}
Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra
Decentralized Distributed PPO: Solving PointGoal Navigation Proceedings Article
In: Proceedings of International Conference on Learning Representations (ICLR), 2020.
Abstract | Links | BibTeX | Tags: embodied agents, ICLR, navigation, systems for ML
@inproceedings{2020-Wijmans-DDSPN,
title = {Decentralized Distributed PPO: Solving PointGoal Navigation},
author = {Erik Wijmans and Abhishek Kadian and Ari Morcos and Stefan Lee and Irfan Essa and Devi Parikh and Manolis Savva and Dhruv Batra},
url = {https://arxiv.org/abs/1911.00357
https://paperswithcode.com/paper/decentralized-distributed-ppo-solving},
year = {2020},
date = {2020-04-01},
urldate = {2020-04-01},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
abstract = {We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever stale), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.
This massive-scale training not only sets the state of art on Habitat Autonomous Navigation Challenge 2019, but essentially solves the task --near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of ImageNet pre-training + task-specific fine-tuning for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available).},
keywords = {embodied agents, ICLR, navigation, systems for ML},
pubstate = {published},
tppubtype = {inproceedings}
}
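The "distributed, decentralized, and synchronous" recipe can be sketched as a peer-to-peer gradient all-reduce: each worker computes gradients on its own rollouts and averages them with all other workers, so there is no parameter server and no stale update. The sketch below (Python/PyTorch, chosen here as an assumption of tooling) shows only that step; rollout collection, the PPO surrogate loss, and the paper's preemption of straggling workers are omitted.

import torch
import torch.distributed as dist

def synchronous_update(model, optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():                              # decentralized: peer-to-peer all-reduce
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size                              # average gradients across workers
    optimizer.step()                                          # every worker applies the same update

if __name__ == "__main__":
    # Single-process demo; in practice one process per GPU is launched.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", world_size=1, rank=0)
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss = model(torch.randn(8, 4)).sum()                     # stand-in for the PPO surrogate loss
    synchronous_update(model, optimizer, loss)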
Erik Wijmans, Julian Straub, Dhruv Batra, Irfan Essa, Judy Hoffman, Ari Morcos
Analyzing Visual Representations in Embodied Navigation Tasks Technical Report
no. arXiv:2003.05993, 2020.
Abstract | Links | BibTeX | Tags: arXiv, embodied agents, navigation
@techreport{2020-Wijmans-AVRENT,
title = {Analyzing Visual Representations in Embodied Navigation Tasks},
author = {Erik Wijmans and Julian Straub and Dhruv Batra and Irfan Essa and Judy Hoffman and Ari Morcos},
url = {https://arxiv.org/abs/2003.05993
https://arxiv.org/pdf/2003.05993},
doi = {10.48550/arXiv.2003.05993},
year = {2020},
date = {2020-03-01},
urldate = {2020-03-01},
journal = {arXiv},
number = {arXiv:2003.05993},
abstract = {Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are often over specialized to the target task. In this work, we present a methodology to study the underlying potential causes for this specialization. We use the recently proposed projection weighted Canonical Correlation Analysis (PWCCA) to measure the similarity of visual representations learned in the same environment by performing different tasks.
We then leverage our proposed methodology to examine the task dependence of visual representations learned on related but distinct embodied navigation tasks. Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures. We then empirically demonstrate that visual representations learned on one task can be effectively transferred to a different task.},
howpublished = {arXiv:2003.05993},
keywords = {arXiv, embodied agents, navigation},
pubstate = {published},
tppubtype = {techreport}
}
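The comparison method named here, projection-weighted CCA (PWCCA), builds on plain canonical correlation analysis between two sets of activations recorded for the same inputs. The sketch below computes the unweighted mean canonical correlation as a stand-in; the projection weighting and all names are not taken from the paper.

import numpy as np

def _inv_sqrt(m, eps=1e-8):
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, eps, None)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def mean_cca_similarity(x, y):
    """x: (n_samples, d1) and y: (n_samples, d2) activations for the same inputs."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    sxx, syy, sxy = x.T @ x, y.T @ y, x.T @ y
    t = _inv_sqrt(sxx) @ sxy @ _inv_sqrt(syy)
    corrs = np.linalg.svd(t, compute_uv=False)                # canonical correlations
    return float(np.clip(corrs, 0.0, 1.0).mean())

# Toy usage: a representation compared against a linearly related one and a random one.
a = np.random.randn(512, 64)
b = a @ np.random.randn(64, 64) + 0.1 * np.random.randn(512, 64)
print(mean_cca_similarity(a, b), mean_cca_similarity(a, np.random.randn(512, 64)))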
Thad Eugene Starner, Irfan Essa, Hayes Solos Raffle, Daniel Aminzade
Object occlusion to initiate a visual search Patent
2019, (US Patent 10,437,882).
Abstract | Links | BibTeX | Tags: computer vision, google, patents
@patent{2019-Starner-OOIVS,
title = {Object occlusion to initiate a visual search},
author = {Thad Eugene Starner and Irfan Essa and Hayes Solos Raffle and Daniel Aminzade},
url = {https://patents.google.com/patent/US10437882},
year = {2019},
date = {2019-10-01},
urldate = {2019-10-01},
publisher = {(US Patent # 10437882)},
abstract = {Methods, systems, and apparatus, including computer programs encoded on computer storage media, for video segmentation. One of the methods includes receiving a digital video; performing hierarchical graph-based video segmentation on at least one frame of the digital video to generate a boundary representation for the at least one frame; generating a vector representation from the boundary representation for the at least one frame of the digital video, wherein generating the vector representation includes generating a polygon composed of at least three vectors, wherein each vector comprises two vertices connected by a line segment, from a boundary in the boundary representation; linking the vector representation to the at least one frame of the digital video; and storing the vector representation with the at least one frame of the digital video.
},
howpublished = {US Patent # 10437882},
note = {US Patent 10,437,882},
keywords = {computer vision, google, patents},
pubstate = {published},
tppubtype = {patent}
}
Steven Hickson, Karthik Raveendran, Alireza Fathi, Kevin Murphy, Irfan Essa
Floors are Flat: Leveraging Semantics for Real-Time Surface Normal Prediction Proceedings Article
In: IEEE International Conference on Computer Vision (ICCV) Workshop on Geometry Meets Deep Learning, 2019.
Abstract | Links | BibTeX | Tags: computer vision, google, ICCV
@inproceedings{2019-Hickson-FFLSRSNP,
title = {Floors are Flat: Leveraging Semantics for Real-Time Surface Normal Prediction},
author = {Steven Hickson and Karthik Raveendran and Alireza Fathi and Kevin Murphy and Irfan Essa},
url = {https://arxiv.org/abs/1906.06792
https://openaccess.thecvf.com/content_ICCVW_2019/papers/GMDL/Hickson_Floors_are_Flat_Leveraging_Semantics_for_Real-Time_Surface_Normal_Prediction_ICCVW_2019_paper.pdf},
doi = {10.1109/ICCVW.2019.00501},
year = {2019},
date = {2019-10-01},
urldate = {2019-10-01},
booktitle = {IEEE International Conference on Computer Vision (ICCV) Workshop on Geometry Meets Deep Learning},
abstract = {We propose 4 insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image. These insights are: (1) denoise the "ground truth" surface normals in the training set to ensure consistency with the semantic labels; (2) concurrently train on a mix of real and synthetic data, instead of pretraining on synthetic and fine-tuning on real; (3) jointly predict normals and semantics using a shared model, but only backpropagate errors on pixels that have valid training labels; (4) slim down the model and use grayscale instead of color inputs. Despite the simplicity of these steps, we demonstrate consistently improved state of the art results on several datasets, using a model that runs at 12 fps on a standard mobile phone.
},
howpublished = {arXiv preprint arXiv:1906.06792},
keywords = {computer vision, google, ICCV},
pubstate = {published},
tppubtype = {inproceedings}
}
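Insight (3) in the abstract, predicting normals and semantics with one shared model while backpropagating only through validly labeled pixels, can be illustrated with a small masked multi-task loss. The function below is a hedged sketch under assumed shapes and conventions (cosine loss for normals, an ignore index for unlabeled semantics), not the paper's training code.

import torch
import torch.nn.functional as F

def joint_masked_loss(pred_normals, pred_semantics, gt_normals, gt_labels, normal_valid, ignore_index=255):
    """pred_normals: (B,3,H,W); pred_semantics: (B,K,H,W); gt_normals: (B,3,H,W);
    gt_labels: (B,H,W) with ignore_index on unlabeled pixels; normal_valid: (B,H,W) bool."""
    cos = F.cosine_similarity(pred_normals, gt_normals, dim=1)        # (B,H,W)
    normal_loss = (1.0 - cos)[normal_valid].mean()                    # only valid normal pixels
    sem_loss = F.cross_entropy(pred_semantics, gt_labels, ignore_index=ignore_index)
    return normal_loss + sem_loss

# Toy usage on a 4-image batch with 13 semantic classes.
B, K, H, W = 4, 13, 32, 32
loss = joint_masked_loss(torch.randn(B, 3, H, W), torch.randn(B, K, H, W),
                         F.normalize(torch.randn(B, 3, H, W), dim=1),
                         torch.randint(0, K, (B, H, W)),
                         torch.rand(B, H, W) > 0.2)
print(float(loss))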
Zoher Ghogawala, Melissa Dunbar, Irfan Essa
Artificial Intelligence for the Treatment of Lumbar Spondylolisthesis Journal Article
In: Neurosurgery Clinics of North America, vol. 30, no. 3, pp. 383–389, 2019, ISSN: 1042-3680, (Lumbar Spondylolisthesis).
Abstract | Links | BibTeX | Tags: AI, computational health, Predictive analytics
@article{2019-Ghogawala-AITLS,
title = {Artificial Intelligence for the Treatment of Lumbar Spondylolisthesis},
author = {Zoher Ghogawala and Melissa Dunbar and Irfan Essa},
url = {http://www.sciencedirect.com/science/article/pii/S1042368019300257
https://pubmed.ncbi.nlm.nih.gov/31078239/},
doi = {10.1016/j.nec.2019.02.012},
issn = {1042-3680},
year = {2019},
date = {2019-07-01},
urldate = {2019-07-01},
journal = {Neurosurgery Clinics of North America},
volume = {30},
number = {3},
pages = {383--389},
abstract = {Multiple registries are currently collecting patient-specific data on lumbar spondylolisthesis including outcomes data. The collection of imaging diagnostics data along with comparative outcomes data following decompression versus decompression and fusion treatments for degenerative spondylolisthesis represents an enormous opportunity for modern machine-learning analytics research.
},
note = {Lumbar Spondylolisthesis},
keywords = {AI, computational health, Predictive analytics},
pubstate = {published},
tppubtype = {article}
}
Aneeq Zia, Liheng Guo, Linlin Zhou, Irfan Essa, Anthony Jarc
Novel evaluation of surgical activity recognition models using task-based efficiency metrics Journal Article
In: International Journal of Computer Assisted Radiology and Surgery, 2019.
Abstract | Links | BibTeX | Tags: activity assessment, activity recognition, surgical training
@article{2019-Zia-NESARMUTEM,
title = {Novel evaluation of surgical activity recognition models using task-based efficiency metrics},
author = {Aneeq Zia and Liheng Guo and Linlin Zhou and Irfan Essa and Anthony Jarc},
url = {https://www.ncbi.nlm.nih.gov/pubmed/31267333},
doi = {10.1007/s11548-019-02025-w},
year = {2019},
date = {2019-07-01},
urldate = {2019-07-01},
journal = {International Journal of Computer Assisted Radiology and Surgery},
abstract = {PURPOSE: Surgical task-based metrics (rather than entire procedure metrics) can be used to improve surgeon training and, ultimately, patient care through focused training interventions. Machine learning models to automatically recognize individual tasks or activities are needed to overcome the otherwise manual effort of video review. Traditionally, these models have been evaluated using frame-level accuracy. Here, we propose evaluating surgical activity recognition models by their effect on task-based efficiency metrics. In this way, we can determine when models have achieved adequate performance for providing surgeon feedback via metrics from individual tasks. METHODS: We propose a new CNN-LSTM model, RP-Net-V2, to recognize the 12 steps of robotic-assisted radical prostatectomies (RARP). We evaluated our model both in terms of conventional methods (e.g., Jaccard Index, task boundary accuracy) as well as novel ways, such as the accuracy of efficiency metrics computed from instrument movements and system events. RESULTS: Our proposed model achieves a Jaccard Index of 0.85 thereby outperforming previous models on RARP. Additionally, we show that metrics computed from tasks automatically identified using RP-Net-V2 correlate well with metrics from tasks labeled by clinical experts. CONCLUSION: We demonstrate that metrics-based evaluation of surgical activity recognition models is a viable approach to determine when models can be used to quantify surgical efficiencies. We believe this approach and our results illustrate the potential for fully automated, postoperative efficiency reports.},
keywords = {activity assessment, activity recognition, surgical training},
pubstate = {published},
tppubtype = {article}
}
Zoher Ghogawala, Melissa Dunbar, Irfan Essa
Lumbar spondylolisthesis: modern registries and the development of artificial intelligence Journal Article
In: Journal of Neurosurgery: Spine (JNSPG 75th Anniversary Invited Review Article), vol. 30, no. 6, pp. 729-735, 2019.
Links | BibTeX | Tags: AI, computational health, Predictive analytics
@article{2019-Ghogawala-LSMRDAI,
title = {Lumbar spondylolisthesis: modern registries and the development of artificial intelligence},
author = {Zoher Ghogawala and Melissa Dunbar and Irfan Essa},
doi = {10.3171/2019.2.SPINE18751},
year = {2019},
date = {2019-06-01},
urldate = {2019-06-01},
journal = {Journal of Neurosurgery: Spine (JNSPG 75th Anniversary Invited Review Article)},
volume = {30},
number = {6},
pages = {729-735},
keywords = {AI, computational health, Predictive analytics},
pubstate = {published},
tppubtype = {article}
}
Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra
Embodied Question Answering in Photorealistic Environments With Point Cloud Perception Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Links | BibTeX | Tags: computer vision, CVPR, vision & language
@inproceedings{2019-Wijmans-EQAPEWPCP,
title = {Embodied Question Answering in Photorealistic Environments With Point Cloud Perception},
author = {Erik Wijmans and Samyak Datta and Oleksandr Maksymets and Abhishek Das and Georgia Gkioxari and Stefan Lee and Irfan Essa and Devi Parikh and Dhruv Batra},
doi = {10.1109/CVPR.2019.00682},
year = {2019},
date = {2019-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
keywords = {computer vision, CVPR, vision & language},
pubstate = {published},
tppubtype = {inproceedings}
}
Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
Audio Visual Scene-Aware Dialog Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Abstract | Links | BibTeX | Tags: computational video, computer vision, CVPR, embodied agents, vision & language
@inproceedings{2019-Alamri-AVSD,
title = {Audio Visual Scene-Aware Dialog},
author = {Huda Alamri and Vincent Cartillier and Abhishek Das and Jue Wang and Anoop Cherian and Irfan Essa and Dhruv Batra and Tim K. Marks and Chiori Hori and Peter Anderson and Stefan Lee and Devi Parikh},
url = {https://openaccess.thecvf.com/content_CVPR_2019/papers/Alamri_Audio_Visual_Scene-Aware_Dialog_CVPR_2019_paper.pdf
https://video-dialog.com/
https://arxiv.org/abs/1901.09107},
doi = {10.1109/CVPR.2019.00774},
year = {2019},
date = {2019-06-01},
urldate = {2019-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
},
keywords = {computational video, computer vision, CVPR, embodied agents, vision & language},
pubstate = {published},
tppubtype = {inproceedings}
}
Other Publication Sites
A few more sites that aggregate research publications: Academia.edu, BibSonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.