A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is https://orcid.org/0000-0002-6236-2969.

Publications:
Harish Haresamudram, Irfan Essa, Thomas Ploetz
Contrastive Predictive Coding for Human Activity Recognition Journal Article
In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 2, pp. 1–26, 2021.
Abstract | Links | BibTeX | Tags: activity recognition, IMWUT, machine learning, ubiquitous computing
@article{2021-Haresamudram-CPCHAR,
title = {Contrastive Predictive Coding for Human Activity Recognition},
author = {Harish Haresamudram and Irfan Essa and Thomas Ploetz},
url = {https://doi.org/10.1145/3463506
https://arxiv.org/abs/2012.05333},
doi = {10.1145/3463506},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
booktitle = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
volume = {5},
number = {2},
pages = {1--26},
abstract = {Feature extraction is crucial for human activity recognition (HAR) using body-worn movement sensors. Recently, learned representations have been used successfully, offering promising alternatives to manually engineered features. Our work focuses on effective use of small amounts of labeled data and the opportunistic exploitation of unlabeled data that are straightforward to collect in mobile and ubiquitous computing scenarios. We hypothesize and demonstrate that explicitly considering the temporality of sensor data at representation level plays an important role for effective HAR in challenging scenarios. We introduce the Contrastive Predictive Coding (CPC) framework to human activity recognition, which captures the long-term temporal structure of sensor data streams. Through a range of experimental evaluations on real-life recognition tasks, we demonstrate its effectiveness for improved HAR. CPC-based pre-training is self-supervised, and the resulting learned representations can be integrated into standard activity chains. It leads to significantly improved recognition performance when only small amounts of labeled training data are available, thereby demonstrating the practical value of our approach.},
keywords = {activity recognition, IMWUT, machine learning, ubiquitous computing},
pubstate = {published},
tppubtype = {article}
}
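For readers who want to see the shape of the objective described above, here is a minimal, hedged sketch of an InfoNCE-style contrastive predictive coding loss on windows of body-worn sensor data (Python/PyTorch). The convolutional encoder, GRU aggregator, window length, and all names are illustrative assumptions, not the paper's actual architecture or hyperparameters.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCSketch(nn.Module):
    """Illustrative CPC-style model: encode timesteps, summarize context, predict future latents."""
    def __init__(self, in_channels=3, feat_dim=128, context_dim=256, pred_steps=4):
        super().__init__()
        self.encoder = nn.Sequential(                        # per-timestep encoder z_t
            nn.Conv1d(in_channels, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        self.context = nn.GRU(feat_dim, context_dim, batch_first=True)   # autoregressive context c_t
        self.predictors = nn.ModuleList(
            [nn.Linear(context_dim, feat_dim) for _ in range(pred_steps)]
        )

    def forward(self, x):                                    # x: (batch, channels, time)
        z = self.encoder(x).transpose(1, 2)                  # (batch, time, feat_dim)
        c, _ = self.context(z)                               # (batch, time, context_dim)
        t = z.size(1) // 2                                   # predict the future from mid-window
        loss = 0.0
        for k, predictor in enumerate(self.predictors, start=1):
            pred = predictor(c[:, t])                        # predicted latent k steps ahead
            target = z[:, t + k]                             # true future latent
            logits = pred @ target.T                         # score against all windows in the batch
            labels = torch.arange(x.size(0), device=x.device)
            loss = loss + F.cross_entropy(logits, labels)    # InfoNCE: the true pair is the positive
        return loss / len(self.predictors)

# Toy usage: a batch of 16 accelerometer windows (3 axes, 64 timesteps).
model = CPCSketch()
loss = model(torch.randn(16, 3, 64))
loss.backward()

After pre-training with such an objective, the encoder (and optionally the context network) would be reused as a feature extractor in a standard activity-recognition pipeline, which is the integration the abstract refers to.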
Anh Truong, Peggy Chi, David Salesin, Irfan Essa, Maneesh Agrawala
Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos Proceedings Article
In: ACM CHI Conference on Human Factors in Computing Systems, 2021.
Abstract | Links | BibTeX | Tags: CHI, computational video, google, human-computer interaction, video summarization
@inproceedings{2021-Truong-AGTHTFIMV,
title = {Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos},
author = {Anh Truong and Peggy Chi and David Salesin and Irfan Essa and Maneesh Agrawala},
url = {https://dl.acm.org/doi/10.1145/3411764.3445721
https://research.google/pubs/pub50007/
http://anhtruong.org/makeup_breakdown/},
doi = {10.1145/3411764.3445721},
year = {2021},
date = {2021-05-01},
urldate = {2021-05-01},
booktitle = {ACM CHI Conference on Human Factors in Computing Systems},
abstract = {We present a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos. Our approach is inspired by prior research in cognitive psychology, which suggests that people mentally segment procedural tasks into event hierarchies, where coarse-grained events focus on objects while fine-grained events focus on actions. In the instructional makeup domain, we find that objects correspond to facial parts while fine-grained steps correspond to actions on those facial parts. Given an input instructional makeup video, we apply a set of heuristics that combine computer vision techniques with transcript text analysis to automatically identify the fine-level action steps and group these steps by facial part to form the coarse-level events. We provide a voice-enabled, mixed-media UI to visualize the resulting hierarchy and allow users to efficiently navigate the tutorial (e.g., skip ahead, return to previous steps) at their own pace. Users can navigate the hierarchy at both the facial-part and action-step levels using click-based interactions and voice commands. We demonstrate the effectiveness of segmentation algorithms and the resulting mixed-media UI on a variety of input makeup videos. A user study shows that users prefer following instructional makeup videos in our mixed-media format to the standard video UI and that they find our format much easier to navigate.},
keywords = {CHI, computational video, google, human-computer interaction, video summarization},
pubstate = {published},
tppubtype = {inproceedings}
}
Dan Scarafoni, Irfan Essa, Thomas Ploetz
PLAN-B: Predicting Likely Alternative Next Best Sequences for Action Prediction Technical Report
no. arXiv:2103.15987, 2021.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, computer vision
@techreport{2021-Scarafoni-PPLANBSAP,
title = {PLAN-B: Predicting Likely Alternative Next Best Sequences for Action Prediction},
author = {Dan Scarafoni and Irfan Essa and Thomas Ploetz},
url = {https://arxiv.org/abs/2103.15987},
doi = {10.48550/arXiv.2103.15987},
year = {2021},
date = {2021-03-01},
urldate = {2021-03-01},
journal = {arXiv},
number = {arXiv:2103.15987},
abstract = {Action prediction focuses on anticipating actions before they happen. Recent works leverage probabilistic approaches to describe future uncertainties and sample future actions. However, these methods cannot easily find all alternative predictions, which are essential given the inherent unpredictability of the future, and current evaluation protocols do not measure a system's ability to find such alternatives. We re-examine action prediction in terms of its ability to predict not only the top predictions, but also top alternatives with the accuracy@k metric. In addition, we propose Choice F1: a metric inspired by F1 score which evaluates a prediction system's ability to find all plausible futures while keeping only the most probable ones. To evaluate this problem, we present a novel method, Predicting the Likely Alternative Next Best, or PLAN-B, for action prediction which automatically finds the set of most likely alternative futures. PLAN-B consists of two novel components: (i) a Choice Table which ensures that all possible futures are found, and (ii) a "Collaborative" RNN system which combines both action sequence and feature information. We demonstrate that our system outperforms state-of-the-art results on benchmark datasets.
},
keywords = {activity recognition, arXiv, computer vision},
pubstate = {published},
tppubtype = {techreport}
}
Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra
Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views Proceedings Article
In: Proceedings of American Association of Artificial Intelligence Conference (AAAI), AAAI, 2021.
Abstract | Links | BibTeX | Tags: AAAI, AI, embodied agents, first-person vision
@inproceedings{2021-Cartillier-SMBASRFEV,
title = {Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views},
author = {Vincent Cartillier and Zhile Ren and Neha Jain and Stefan Lee and Irfan Essa and Dhruv Batra},
url = {https://arxiv.org/abs/2010.01191
https://vincentcartillier.github.io/smnet.html
https://ojs.aaai.org/index.php/AAAI/article/view/16180/15987},
doi = {10.48550/arXiv.2010.01191},
year = {2021},
date = {2021-02-01},
urldate = {2021-02-01},
booktitle = {Proceedings of American Association of Artificial Intelligence Conference (AAAI)},
publisher = {AAAI},
abstract = {We study the task of semantic mapping -- specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map (`what is where?') from egocentric observations of an RGB-D camera with known pose (via localization sensors). Importantly, our goal is to build neural episodic memories and spatio-semantic representations of 3D spaces that enable the agent to easily learn subsequent tasks in the same space -- navigating to objects seen during the tour (`Find chair') or answering questions about the space (`How many chairs did you see in the house?').
Towards this goal, we present Semantic MapNet (SMNet), which consists of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length × width × feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the spatio-semantic allocentric representations built by SMNet for the task of ObjectNav and Embodied Question Answering.},
keywords = {AAAI, AI, embodied agents, first-person vision},
pubstate = {published},
tppubtype = {inproceedings}
}
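As a rough illustration of the "Feature Projector" idea named in the abstract, the sketch below lifts per-pixel egocentric features to 3D with a depth map and a known camera pose, then accumulates them into an allocentric floor-plan grid. The intrinsics, map resolution, ground-plane convention, and all names are assumptions chosen for illustration, not SMNet's implementation.

import torch

def project_to_floorplan(feats, depth, pose, fx=256.0, fy=256.0, map_size=100, cell=0.1):
    """feats: (C, H, W) per-pixel features; depth: (H, W) in meters;
    pose: 4x4 camera-to-world transform. Returns a (C, map_size, map_size) top-down map."""
    C, H, W = feats.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - W / 2) * z / fx                                  # back-project pixels to camera frame
    y = (v - H / 2) * z / fy
    pts = torch.stack([x, y, z, torch.ones_like(z)], dim=-1).reshape(-1, 4)
    world = (pose @ pts.T).T[:, :3]                           # camera -> world via the known pose
    gx = (world[:, 0] / cell).long() + map_size // 2          # discretize onto floor-plan cells
    gy = (world[:, 2] / cell).long() + map_size // 2          # x/z taken as the ground plane
    valid = (gx >= 0) & (gx < map_size) & (gy >= 0) & (gy < map_size)
    idx = gy[valid] * map_size + gx[valid]
    flat = torch.zeros(C, map_size * map_size)
    flat.index_add_(1, idx, feats.reshape(C, -1)[:, valid])   # accumulate features per map cell
    return flat.reshape(C, map_size, map_size)

# Toy usage: random 8-channel features, a view 2 meters deep, identity pose.
topdown = project_to_floorplan(torch.randn(8, 60, 80), torch.full((60, 80), 2.0), torch.eye(4))

In the paper's pipeline, a learned memory tensor and a map decoder sit on top of this kind of projection; the sketch only covers the geometric accumulation step.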
Niranjan Kumar, Irfan Essa, Sehoon Ha, C. Karen Liu
Estimating Mass Distribution of Articulated Objects through Non-prehensile Manipulation Proceedings Article
In: Neural Information Processing Systems (NeurIPS) Workshop on Object Representations for Learning and Reasoning, NeurIPS 2020.
Abstract | Links | BibTeX | Tags: reinforcement learning, robotics
@inproceedings{2020-Kumar-EMDAOTNM,
title = {Estimating Mass Distribution of Articulated Objects through Non-prehensile Manipulation},
author = {Niranjan Kumar and Irfan Essa and Sehoon Ha and C. Karen Liu},
url = {https://orlrworkshop.github.io/program/orlr_25.html
http://arxiv.org/abs/1907.03964
https://www.kniranjankumar.com/projects/1_mass_prediction
https://www.youtube.com/watch?v=o3zBdVWvWZw
https://kniranjankumar.github.io/assets/pdf/Estimating_Mass_Distribution_of_Articulated_Objects_using_Non_prehensile_Manipulation.pdf},
year = {2020},
date = {2020-12-01},
urldate = {2020-12-01},
booktitle = {Neural Information Processing Systems (NeurIPS) Workshop on Object Representations for Learning and Reasoning},
organization = {NeurIPS},
abstract = {We explore the problem of estimating the mass distribution of an articulated object by an interactive robotic agent. Our method predicts the mass distribution of an object by using limited sensing and actuating capabilities of a robotic agent that is interacting with the object. We are inspired by the role of exploratory play in human infants. We take the combined approach of supervised and reinforcement learning to train an agent that learns to strategically interact with the object to estimate the object's mass distribution. Our method consists of two neural networks: (i) the policy network which decides how to interact with the object, and (ii) the predictor network that estimates the mass distribution given a history of observations and interactions. Using our method, we train a robotic arm to estimate the mass distribution of an object with moving parts (e.g. an articulated rigid body system) by pushing it on a surface with unknown friction properties. We also demonstrate how our training from simulations can be transferred to real hardware using a small amount of real-world data for fine-tuning. We use a UR10 robot to interact with 3D printed articulated chains with varying mass distributions and show that our method significantly outperforms the baseline system that uses random pushes to interact with the object.},
howpublished = {arXiv preprint arXiv:1907.03964},
keywords = {reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
Peggy Chi, Zheng Sun, Katrina Panovich, Irfan Essa
Automatic Video Creation From a Web Page Proceedings Article
In: Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pp. 279–292, ACM, 2020.
Abstract | Links | BibTeX | Tags: computational video, google, human-computer interaction, UIST, video editing
@inproceedings{2020-Chi-AVCFP,
title = {Automatic Video Creation From a Web Page},
author = {Peggy Chi and Zheng Sun and Katrina Panovich and Irfan Essa},
url = {https://dl.acm.org/doi/abs/10.1145/3379337.3415814
https://research.google/pubs/pub49618/
https://ai.googleblog.com/2020/10/experimenting-with-automatic-video.html
https://www.youtube.com/watch?v=3yFYc-Wet8k},
doi = {10.1145/3379337.3415814},
year = {2020},
date = {2020-10-01},
urldate = {2020-10-01},
booktitle = {Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology},
pages = {279--292},
organization = {ACM},
abstract = {Creating marketing videos from scratch can be challenging, especially when designing for multiple platforms with different viewing criteria. We present URL2Video, an automatic approach that converts a web page into a short video given temporal and visual constraints. URL2Video captures quality materials and design styles extracted from a web page, including fonts, colors, and layouts. Using constraint programming, URL2Video's design engine organizes the visual assets into a sequence of shots and renders to a video with user-specified aspect ratio and duration. Creators can review the video composition, modify constraints, and generate video variation through a user interface. We learned the design process from designers and compared our automatically generated results with their creation through interviews and an online survey. The evaluation shows that URL2Video effectively extracted design elements from a web page and supported designers by bootstrapping the video creation process.},
keywords = {computational video, google, human-computer interaction, UIST, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
Harish Haresamudram, Apoorva Beedu, Varun Agrawal, Patrick L Grady, Irfan Essa, Judy Hoffman, Thomas Plötz
Masked reconstruction based self-supervision for human activity recognition Proceedings Article
In: Proceedings of the International Symposium on Wearable Computers (ISWC), pp. 45–49, 2020.
Abstract | Links | BibTeX | Tags: activity recognition, ISWC, machine learning, wearable computing
@inproceedings{2020-Haresamudram-MRBSHAR,
title = {Masked reconstruction based self-supervision for human activity recognition},
author = {Harish Haresamudram and Apoorva Beedu and Varun Agrawal and Patrick L Grady and Irfan Essa and Judy Hoffman and Thomas Plötz},
url = {https://dl.acm.org/doi/10.1145/3410531.3414306
https://harkash.github.io/publication/masked-reconstruction
https://arxiv.org/abs/2202.12938},
doi = {10.1145/3410531.3414306},
year = {2020},
date = {2020-09-01},
urldate = {2020-09-01},
booktitle = {Proceedings of the International Symposium on Wearable Computers (ISWC)},
pages = {45--49},
abstract = {The ubiquitous availability of wearable sensing devices has rendered large scale collection of movement data a straightforward endeavor. Yet, annotation of these data remains a challenge and as such, publicly available datasets for human activity recognition (HAR) are typically limited in size as well as in variability, which constrains HAR model training and effectiveness. We introduce masked reconstruction as a viable self-supervised pre-training objective for human activity recognition and explore its effectiveness in comparison to state-of-the-art unsupervised learning techniques. In scenarios with small labeled datasets, the pre-training results in improvements over end-to-end learning on two of the four benchmark datasets. This is promising because the pre-training objective can be integrated "as is" into state-of-the-art recognition pipelines to effectively facilitate improved model robustness, and thus, ultimately, leading to better recognition performance.
},
keywords = {activity recognition, ISWC, machine learning, wearable computing},
pubstate = {published},
tppubtype = {inproceedings}
}
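To make the pre-training objective concrete, here is a minimal sketch of a masked-reconstruction step on a batch of sensor windows: random timesteps are zeroed out and a network is trained to reconstruct them, with the loss computed only at the masked positions. The MLP architecture, masking ratio, and names are illustrative assumptions rather than the paper's configuration.

import torch
import torch.nn as nn

class MaskedReconstructor(nn.Module):
    def __init__(self, channels=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, channels),                      # reconstruct the sensor reading
        )

    def forward(self, x):                                     # x: (batch, time, channels)
        return self.net(x)

def masked_reconstruction_step(model, x, mask_ratio=0.15):
    mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio   # choose timesteps to hide
    x_masked = x.clone()
    x_masked[mask] = 0.0                                      # zero out the selected timesteps
    recon = model(x_masked)
    return ((recon - x) ** 2)[mask].mean()                    # MSE only on masked positions

# Toy usage: 32 windows of 50 timesteps from a 3-axis accelerometer.
model = MaskedReconstructor()
loss = masked_reconstruction_step(model, torch.randn(32, 50, 3))
loss.backward()

The appeal noted in the abstract is that an encoder pre-trained this way can be dropped into an existing recognition pipeline before fine-tuning on the small labeled set.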
Hsin-Ying Lee, Lu Jiang, Irfan Essa, Madison Le, Haifeng Gong, Ming-Hsuan Yang, Weilong Yang
Neural Design Network: Graphic Layout Generation with Constraints Proceedings Article
In: Proceedings of European Conference on Computer Vision (ECCV), 2020.
Links | BibTeX | Tags: computer vision, content creation, ECCV, generative media, google
@inproceedings{2020-Lee-NDNGLGWC,
title = {Neural Design Network: Graphic Layout Generation with Constraints},
author = {Hsin-Ying Lee and Lu Jiang and Irfan Essa and Madison Le and Haifeng Gong and Ming-Hsuan Yang and Weilong Yang},
url = {https://arxiv.org/abs/1912.09421
https://rdcu.be/c7sqw},
doi = {10.1007/978-3-030-58580-8_29},
year = {2020},
date = {2020-08-01},
urldate = {2020-08-01},
booktitle = {Proceedings of European Conference on Computer Vision (ECCV)},
keywords = {computer vision, content creation, ECCV, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Caroline Pantofaru, Vinay Bettadapura, Krishna Bharat, Irfan Essa
Systems and methods for directing content generation using a first-person point-of-view device. Patent
2020.
Abstract | Links | BibTeX | Tags: computer vision, google, patents
@patent{2020-Pantofaru-SMDCGUFPD,
title = {Systems and methods for directing content generation using a first-person point-of-view device.},
author = {Caroline Pantofaru and Vinay Bettadapura and Krishna Bharat and Irfan Essa},
url = {https://patents.google.com/patent/US10721439},
year = {2020},
date = {2020-07-21},
urldate = {2020-07-01},
publisher = {(US Patent # 10721439)},
abstract = {A method for personalizing a content item using captured footage is disclosed. The method includes receiving a first video feed from a first camera, wherein the first camera is designated as a source camera for capturing an event during a first time duration. The method also includes receiving data from a second camera, and determining, based on the received data from the second camera, that an action was performed using the second camera, the action being indicative of a region of interest (ROI) of the user of the second camera occurring within a second time duration. The method further includes designating the second camera as the source camera for capturing the event during the second time duration.
},
howpublished = {US Patent # 10721439},
keywords = {computer vision, google, patents},
pubstate = {published},
tppubtype = {patent}
}
Peggy Chi, Irfan Essa
Interactive Visual Description of a Web Page for Smart Speakers Proceedings Article
In: Proceedings of ACM CHI Workshop, CUI@CHI: Mapping Grand Challenges for the Conversational User Interface Community, Honolulu, Hawaii, USA, 2020.
Abstract | Links | BibTeX | Tags: accessibility, CHI, google, human-computer interaction
@inproceedings{2020-Chi-IVDPSS,
title = {Interactive Visual Description of a Web Page for Smart Speakers},
author = {Peggy Chi and Irfan Essa},
url = {https://research.google/pubs/pub49441/
http://www.speechinteraction.org/CHI2020/programme.html},
year = {2020},
date = {2020-05-01},
urldate = {2020-05-01},
booktitle = {Proceedings of ACM CHI Workshop, CUI@CHI: Mapping Grand Challenges for the Conversational User Interface Community},
address = {Honolulu, Hawaii, USA},
abstract = {Smart speakers are becoming ubiquitous for accessing lightweight information using speech. While these devices are powerful for question answering and service operations using voice commands, it is challenging to navigate content of rich formats–including web pages–that are consumed by mainstream computing devices. We conducted a comparative study with 12 participants that suggests and motivates the use of a narrative voice output of a web page as being easier to follow and comprehend than a conventional screen reader. We are developing a tool that automatically narrates web documents based on their visual structures with interactive prompts. We discuss the design challenges for a conversational agent to intelligently select content for a more personalized experience, where we hope to contribute to the CUI workshop and form a discussion for future research.
},
keywords = {accessibility, CHI, google, human-computer interaction},
pubstate = {published},
tppubtype = {inproceedings}
}
Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar
Category learning neural networks Patent
2020.
Abstract | Links | BibTeX | Tags: google, machine learning, patents
@patent{2020-Hickson-CLNN,
title = {Category learning neural networks},
author = {Steven Hickson and Anelia Angelova and Irfan Essa and Rahul Sukthankar},
url = {https://patents.google.com/patent/US10635979},
year = {2020},
date = {2020-04-28},
urldate = {2020-04-28},
publisher = {(US Patent # 10635979)},
abstract = {Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining a clustering of images into a plurality of semantic categories. In one aspect, a method comprises: training a categorization neural network, comprising, at each of a plurality of iterations: processing an image depicting an object using the categorization neural network to generate (i) a current prediction for whether the image depicts an object or a background region, and (ii) a current embedding of the image; determining a plurality of current cluster centers based on the current values of the categorization neural network parameters, wherein each cluster center represents a respective semantic category; and determining a gradient of an objective function that includes a classification loss and a clustering loss, wherein the clustering loss depends on a similarity between the current embedding of the image and the current cluster centers.
},
howpublished = {US Patent #10635979},
keywords = {google, machine learning, patents},
pubstate = {published},
tppubtype = {patent}
}
Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra
Decentralized Distributed PPO: Solving PointGoal Navigation Proceedings Article
In: Proceedings of International Conference on Learning Representations (ICLR), 2020.
Abstract | Links | BibTeX | Tags: embodied agents, ICLR, navigation, systems for ML
@inproceedings{2020-Wijmans-DDSPN,
title = {Decentralized Distributed PPO: Solving PointGoal Navigation},
author = {Erik Wijmans and Abhishek Kadian and Ari Morcos and Stefan Lee and Irfan Essa and Devi Parikh and Manolis Savva and Dhruv Batra},
url = {https://arxiv.org/abs/1911.00357
https://paperswithcode.com/paper/decentralized-distributed-ppo-solving},
year = {2020},
date = {2020-04-01},
urldate = {2020-04-01},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
abstract = {We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever stale), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.
This massive-scale training not only sets the state of art on Habitat Autonomous Navigation Challenge 2019, but essentially solves the task --near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of ImageNet pre-training + task-specific fine-tuning for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available).},
keywords = {embodied agents, ICLR, navigation, systems for ML},
pubstate = {published},
tppubtype = {inproceedings}
}
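The "distributed, decentralized, and synchronous" recipe can be sketched as a peer-to-peer gradient all-reduce: each worker computes gradients on its own rollouts and averages them with all other workers, so there is no parameter server and no stale update. The sketch below (Python/PyTorch, chosen here as an assumption of tooling) shows only that step; rollout collection, the PPO surrogate loss, and the paper's preemption of straggling workers are omitted.

import torch
import torch.distributed as dist

def synchronous_update(model, optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():                              # decentralized: peer-to-peer all-reduce
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size                              # average gradients across workers
    optimizer.step()                                          # every worker applies the same update

if __name__ == "__main__":
    # Single-process demo; in practice one process per GPU is launched.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", world_size=1, rank=0)
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss = model(torch.randn(8, 4)).sum()                     # stand-in for the PPO surrogate loss
    synchronous_update(model, optimizer, loss)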
Erik Wijmans, Julian Straub, Dhruv Batra, Irfan Essa, Judy Hoffman, Ari Morcos
Analyzing Visual Representations in Embodied Navigation Tasks Technical Report
no. arXiv:2003.05993, 2020.
Abstract | Links | BibTeX | Tags: arXiv, embodied agents, navigation
@techreport{2020-Wijmans-AVRENT,
title = {Analyzing Visual Representations in Embodied Navigation Tasks},
author = {Erik Wijmans and Julian Straub and Dhruv Batra and Irfan Essa and Judy Hoffman and Ari Morcos},
url = {https://arxiv.org/abs/2003.05993
https://arxiv.org/pdf/2003.05993},
doi = {10.48550/arXiv.2003.05993},
year = {2020},
date = {2020-03-01},
urldate = {2020-03-01},
journal = {arXiv},
number = {arXiv:2003.05993},
abstract = {Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are often over specialized to the target task. In this work, we present a methodology to study the underlying potential causes for this specialization. We use the recently proposed projection weighted Canonical Correlation Analysis (PWCCA) to measure the similarity of visual representations learned in the same environment by performing different tasks.
We then leverage our proposed methodology to examine the task dependence of visual representations learned on related but distinct embodied navigation tasks. Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures. We then empirically demonstrate that visual representations learned on one task can be effectively transferred to a different task.},
howpublished = {arXiv:2003.05993},
keywords = {arXiv, embodied agents, navigation},
pubstate = {published},
tppubtype = {techreport}
}
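The comparison method named here, projection-weighted CCA (PWCCA), builds on plain canonical correlation analysis between two sets of activations recorded for the same inputs. The sketch below computes the unweighted mean canonical correlation as a stand-in; the projection weighting and all names are not taken from the paper.

import numpy as np

def _inv_sqrt(m, eps=1e-8):
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, eps, None)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def mean_cca_similarity(x, y):
    """x: (n_samples, d1) and y: (n_samples, d2) activations for the same inputs."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    sxx, syy, sxy = x.T @ x, y.T @ y, x.T @ y
    t = _inv_sqrt(sxx) @ sxy @ _inv_sqrt(syy)
    corrs = np.linalg.svd(t, compute_uv=False)                # canonical correlations
    return float(np.clip(corrs, 0.0, 1.0).mean())

# Toy usage: a representation compared against a linearly related one and a random one.
a = np.random.randn(512, 64)
b = a @ np.random.randn(64, 64) + 0.1 * np.random.randn(512, 64)
print(mean_cca_similarity(a, b), mean_cca_similarity(a, np.random.randn(512, 64)))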
Thad Eugene Starner, Irfan Essa, Hayes Solos Raffle, Daniel Aminzade
Object occlusion to initiate a visual search Patent
2019, (US Patent 10,437,882).
Abstract | Links | BibTeX | Tags: computer vision, google, patents
@patent{2019-Starner-OOIVS,
title = {Object occlusion to initiate a visual search},
author = {Thad Eugene Starner and Irfan Essa and Hayes Solos Raffle and Daniel Aminzade},
url = {https://patents.google.com/patent/US10437882},
year = {2019},
date = {2019-10-01},
urldate = {2019-10-01},
publisher = {(US Patent # 10437882)},
abstract = {Methods, systems, and apparatus, including computer programs encoded on computer storage media, for video segmentation. One of the methods includes receiving a digital video; performing hierarchical graph-based video segmentation on at least one frame of the digital video to generate a boundary representation for the at least one frame; generating a vector representation from the boundary representation for the at least one frame of the digital video, wherein generating the vector representation includes generating a polygon composed of at least three vectors, wherein each vector comprises two vertices connected by a line segment, from a boundary in the boundary representation; linking the vector representation to the at least one frame of the digital video; and storing the vector representation with the at least one frame of the digital video.
},
howpublished = {US Patent # 10437882},
note = {US Patent 10,437,882},
keywords = {computer vision, google, patents},
pubstate = {published},
tppubtype = {patent}
}
Steven Hickson, Karthik Raveendran, Alireza Fathi, Kevin Murphy, Irfan Essa
Floors are Flat: Leveraging Semantics for Real-Time Surface Normal Prediction Proceedings Article
In: IEEE International Conference on Computer Vision (ICCV) Workshop on Geometry Meets Deep Learning, 2019.
Abstract | Links | BibTeX | Tags: computer vision, google, ICCV
@inproceedings{2019-Hickson-FFLSRSNP,
title = {Floors are Flat: Leveraging Semantics for Real-Time Surface Normal Prediction},
author = {Steven Hickson and Karthik Raveendran and Alireza Fathi and Kevin Murphy and Irfan Essa},
url = {https://arxiv.org/abs/1906.06792
https://openaccess.thecvf.com/content_ICCVW_2019/papers/GMDL/Hickson_Floors_are_Flat_Leveraging_Semantics_for_Real-Time_Surface_Normal_Prediction_ICCVW_2019_paper.pdf},
doi = {10.1109/ICCVW.2019.00501},
year = {2019},
date = {2019-10-01},
urldate = {2019-10-01},
booktitle = {IEEE International Conference on Computer Vision (ICCV) Workshop on Geometry Meets Deep Learning},
abstract = {We propose 4 insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image. These insights are: (1) denoise the "ground truth" surface normals in the training set to ensure consistency with the semantic labels; (2) concurrently train on a mix of real and synthetic data, instead of pretraining on synthetic and fine-tuning on real; (3) jointly predict normals and semantics using a shared model, but only backpropagate errors on pixels that have valid training labels; (4) slim down the model and use grayscale instead of color inputs. Despite the simplicity of these steps, we demonstrate consistently improved state of the art results on several datasets, using a model that runs at 12 fps on a standard mobile phone.
},
howpublished = {arXiv preprint arXiv:1906.06792},
keywords = {computer vision, google, ICCV},
pubstate = {published},
tppubtype = {inproceedings}
}
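Insight (3) in the abstract, predicting normals and semantics with one shared model while backpropagating only through validly labeled pixels, can be illustrated with a small masked multi-task loss. The function below is a hedged sketch under assumed shapes and conventions (cosine loss for normals, an ignore index for unlabeled semantics), not the paper's training code.

import torch
import torch.nn.functional as F

def joint_masked_loss(pred_normals, pred_semantics, gt_normals, gt_labels, normal_valid, ignore_index=255):
    """pred_normals: (B,3,H,W); pred_semantics: (B,K,H,W); gt_normals: (B,3,H,W);
    gt_labels: (B,H,W) with ignore_index on unlabeled pixels; normal_valid: (B,H,W) bool."""
    cos = F.cosine_similarity(pred_normals, gt_normals, dim=1)        # (B,H,W)
    normal_loss = (1.0 - cos)[normal_valid].mean()                    # only valid normal pixels
    sem_loss = F.cross_entropy(pred_semantics, gt_labels, ignore_index=ignore_index)
    return normal_loss + sem_loss

# Toy usage on a 4-image batch with 13 semantic classes.
B, K, H, W = 4, 13, 32, 32
loss = joint_masked_loss(torch.randn(B, 3, H, W), torch.randn(B, K, H, W),
                         F.normalize(torch.randn(B, 3, H, W), dim=1),
                         torch.randint(0, K, (B, H, W)),
                         torch.rand(B, H, W) > 0.2)
print(float(loss))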
Zoher Ghogawala, Melissa Dunbar, Irfan Essa
Artificial Intelligence for the Treatment of Lumbar Spondylolisthesis Journal Article
In: Neurosurgery Clinics of North America, vol. 30, no. 3, pp. 383–389, 2019, ISSN: 1042-3680, (Lumbar Spondylolisthesis).
Abstract | Links | BibTeX | Tags: AI, computational health, Predictive analytics
@article{2019-Ghogawala-AITLS,
title = {Artificial Intelligence for the Treatment of Lumbar Spondylolisthesis},
author = {Zoher Ghogawala and Melissa Dunbar and Irfan Essa},
url = {http://www.sciencedirect.com/science/article/pii/S1042368019300257
https://pubmed.ncbi.nlm.nih.gov/31078239/},
doi = {10.1016/j.nec.2019.02.012},
issn = {1042-3680},
year = {2019},
date = {2019-07-01},
urldate = {2019-07-01},
journal = {Neurosurgery Clinics of North America},
volume = {30},
number = {3},
pages = {383--389},
abstract = {Multiple registries are currently collecting patient-specific data on lumbar spondylolisthesis including outcomes data. The collection of imaging diagnostics data along with comparative outcomes data following decompression versus decompression and fusion treatments for degenerative spondylolisthesis represents an enormous opportunity for modern machine-learning analytics research.
},
note = {Lumbar Spondylolisthesis},
keywords = {AI, computational health, Predictive analytics},
pubstate = {published},
tppubtype = {article}
}
Aneeq Zia, Liheng Guo, Linlin Zhou, Irfan Essa, Anthony Jarc
Novel evaluation of surgical activity recognition models using task-based efficiency metrics Journal Article
In: International Journal of Computer Assisted Radiology and Surgery, 2019.
Abstract | Links | BibTeX | Tags: activity assessment, activity recognition, surgical training
@article{2019-Zia-NESARMUTEM,
title = {Novel evaluation of surgical activity recognition models using task-based efficiency metrics},
author = {Aneeq Zia and Liheng Guo and Linlin Zhou and Irfan Essa and Anthony Jarc},
url = {https://www.ncbi.nlm.nih.gov/pubmed/31267333},
doi = {10.1007/s11548-019-02025-w},
year = {2019},
date = {2019-07-01},
urldate = {2019-07-01},
journal = {International Journal of Computer Assisted Radiology and Surgery},
abstract = {PURPOSE: Surgical task-based metrics (rather than entire procedure metrics) can be used to improve surgeon training and, ultimately, patient care through focused training interventions. Machine learning models to automatically recognize individual tasks or activities are needed to overcome the otherwise manual effort of video review. Traditionally, these models have been evaluated using frame-level accuracy. Here, we propose evaluating surgical activity recognition models by their effect on task-based efficiency metrics. In this way, we can determine when models have achieved adequate performance for providing surgeon feedback via metrics from individual tasks. METHODS: We propose a new CNN-LSTM model, RP-Net-V2, to recognize the 12 steps of robotic-assisted radical prostatectomies (RARP). We evaluated our model both in terms of conventional methods (e.g., Jaccard Index, task boundary accuracy) as well as novel ways, such as the accuracy of efficiency metrics computed from instrument movements and system events. RESULTS: Our proposed model achieves a Jaccard Index of 0.85 thereby outperforming previous models on RARP. Additionally, we show that metrics computed from tasks automatically identified using RP-Net-V2 correlate well with metrics from tasks labeled by clinical experts. CONCLUSION: We demonstrate that metrics-based evaluation of surgical activity recognition models is a viable approach to determine when models can be used to quantify surgical efficiencies. We believe this approach and our results illustrate the potential for fully automated, postoperative efficiency reports.},
keywords = {activity assessment, activity recognition, surgical training},
pubstate = {published},
tppubtype = {article}
}
Zoher Ghogawala, Melissa Dunbar, Irfan Essa
Lumbar spondylolisthesis: modern registries and the development of artificial intelligence Journal Article
In: Journal of Neurosurgery: Spine (JNSPG 75th Anniversary Invited Review Article), vol. 30, no. 6, pp. 729-735, 2019.
Links | BibTeX | Tags: AI, computational health, Predictive analytics
@article{2019-Ghogawala-LSMRDAI,
title = {Lumbar spondylolisthesis: modern registries and the development of artificial intelligence},
author = {Zoher Ghogawala and Melissa Dunbar and Irfan Essa},
doi = {10.3171/2019.2.SPINE18751},
year = {2019},
date = {2019-06-01},
urldate = {2019-06-01},
journal = {Journal of Neurosurgery: Spine (JNSPG 75th Anniversary Invited Review Article)},
volume = {30},
number = {6},
pages = {729-735},
keywords = {AI, computational health, Predictive analytics},
pubstate = {published},
tppubtype = {article}
}
Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra
Embodied Question Answering in Photorealistic Environments With Point Cloud Perception Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Links | BibTeX | Tags: computer vision, CVPR, vision & language
@inproceedings{2019-Wijmans-EQAPEWPCP,
title = {Embodied Question Answering in Photorealistic Environments With Point Cloud Perception},
author = {Erik Wijmans and Samyak Datta and Oleksandr Maksymets and Abhishek Das and Georgia Gkioxari and Stefan Lee and Irfan Essa and Devi Parikh and Dhruv Batra},
doi = {10.1109/CVPR.2019.00682},
year = {2019},
date = {2019-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
keywords = {computer vision, CVPR, vision & language},
pubstate = {published},
tppubtype = {inproceedings}
}
Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
Audio Visual Scene-Aware Dialog Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Abstract | Links | BibTeX | Tags: computational video, computer vision, CVPR, embodied agents, vision & language
@inproceedings{2019-Alamri-AVSD,
title = {Audio Visual Scene-Aware Dialog},
author = {Huda Alamri and Vincent Cartillier and Abhishek Das and Jue Wang and Anoop Cherian and Irfan Essa and Dhruv Batra and Tim K. Marks and Chiori Hori and Peter Anderson and Stefan Lee and Devi Parikh},
url = {https://openaccess.thecvf.com/content_CVPR_2019/papers/Alamri_Audio_Visual_Scene-Aware_Dialog_CVPR_2019_paper.pdf
https://video-dialog.com/
https://arxiv.org/abs/1901.09107},
doi = {10.1109/CVPR.2019.00774},
year = {2019},
date = {2019-06-01},
urldate = {2019-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
},
keywords = {computational video, computer vision, CVPR, embodied agents, vision & language},
pubstate = {published},
tppubtype = {inproceedings}
}
Other Publication Sites
A few more sites that aggregate research publications: Academia.edu, BibSonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.