A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is
Publications:
Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa
BLT: Bidirectional Layout Transformer for Controllable Layout Generation Proceedings Article
In: European Conference on Computer Vision (ECCV), 2022, ISBN: 978-3-031-19789-5.
Abstract | Links | BibTeX | Tags: computer vision, ECCV, generative AI, generative media, google, vision transformer
@inproceedings{2022-Kong-BLTCLG,
title = {BLT: Bidirectional Layout Transformer for Controllable Layout Generation},
author = {Xiang Kong and Lu Jiang and Huiwen Chang and Han Zhang and Yuan Hao and Haifeng Gong and Irfan Essa},
url = {https://arxiv.org/abs/2112.05112
https://rdcu.be/c61AE},
doi = {10.1007/978-3-031-19790-1_29},
isbn = {978-3-031-19789-5},
year = {2022},
date = {2022-10-25},
urldate = {2022-10-25},
booktitle = {European Conference on Computer Vision (ECCV)},
volume = {13677},
abstract = {Creating visual layouts is a critical step in graphic design. Automatic generation of such layouts is essential for scalable and diverse visual designs. To advance conditional layout generation, we introduce BLT, a bidirectional layout transformer. BLT differs from previous work on transformers in adopting non-autoregressive transformers. In training, BLT learns to predict the masked attributes by attending to surrounding attributes in two directions. During inference, BLT first generates a draft layout from the input and then iteratively refines it into a high-quality layout by masking out low-confident attributes. The masks generated in both training and inference are controlled by a new hierarchical sampling policy. We verify the proposed model on six benchmarks of diverse design tasks. Experimental results demonstrate two benefits compared to the state-of-the-art layout transformer models. First, our model empowers layout transformers to fulfill controllable layout generation. Second, it achieves up to 10x speedup in generating a layout at inference time over the layout transformer baseline. Code is released at https://shawnkx.github.io/blt.},
keywords = {computer vision, ECCV, generative AI, generative media, google, vision transformer},
pubstate = {published},
tppubtype = {inproceedings}
}
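As a rough illustration of the decoding idea summarized in the abstract above (predict all masked layout attributes in parallel, then repeatedly re-mask and re-predict the least confident ones), here is a minimal Python sketch. The model callable, the mask token id, and the linear unmasking schedule are assumptions for illustration; it omits BLT's hierarchical grouping of attributes and is not the released code (see https://shawnkx.github.io/blt for that).

import torch

MASK_ID = 0  # hypothetical id of the [MASK] attribute token

def iterative_decode(model, tokens, mask, steps=4):
    """Non-autoregressive decoding: fill every masked position in parallel,
    then re-mask and re-predict the least confident ones each round.
    tokens: (seq_len,) long tensor with MASK_ID at unknown positions
    mask:   (seq_len,) bool tensor, True where attributes must be generated
    """
    tokens, mask = tokens.clone(), mask.clone()
    total = int(mask.sum().item())
    for step in range(steps):
        logits = model(tokens.unsqueeze(0)).squeeze(0)    # (seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)   # per-position confidence
        tokens = torch.where(mask, pred, tokens)          # fill the still-masked positions
        n_remask = int(total * (1 - (step + 1) / steps))  # linear schedule toward zero
        if n_remask == 0:
            break
        conf = conf.masked_fill(~mask, float("inf"))      # only re-mask generated positions
        remask_idx = conf.argsort()[:n_remask]            # lowest-confidence predictions
        tokens[remask_idx] = MASK_ID
        mask = torch.zeros_like(mask)
        mask[remask_idx] = True                           # next round re-predicts only these
    return tokens

# Example with a stand-in "model" returning random logits over 128 attribute values:
# out = iterative_decode(lambda x: torch.randn(x.shape[0], x.shape[1], 128),
#                        torch.zeros(25, dtype=torch.long),
#                        torch.ones(25, dtype=torch.bool))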
Peggy Chi, Tao Dong, Christian Frueh, Brian Colonna, Vivek Kwatra, Irfan Essa
Synthesis-Assisted Video Prototyping From a Document Proceedings Article
In: Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pp. 1–10, 2022.
Abstract | Links | BibTeX | Tags: computational video, generative media, google, human-computer interaction, UIST, video editing
@inproceedings{2022-Chi-SVPFD,
title = {Synthesis-Assisted Video Prototyping From a Document},
author = {Peggy Chi and Tao Dong and Christian Frueh and Brian Colonna and Vivek Kwatra and Irfan Essa},
url = {https://research.google/pubs/pub51631/
https://dl.acm.org/doi/abs/10.1145/3526113.3545676},
doi = {10.1145/3526113.3545676},
year = {2022},
date = {2022-10-01},
urldate = {2022-10-01},
booktitle = {Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology},
pages = {1--10},
abstract = {Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source materials come from a written document -- such as a web tutorial -- it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain -- programming cookbooks -- we apply visual elements from the source document, such as a keyword, a code snippet or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.},
keywords = {computational video, generative media, google, human-computer interaction, UIST, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
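The pipeline described above starts by decomposing a source document into scenes with narration text and visual assets. The sketch below illustrates only that first decomposition step on a Markdown-like tutorial (headings start scenes, fenced code blocks become visual assets, remaining prose becomes candidate narration sentences); the data layout and rules are hypothetical and are not the authors' Doc2Video implementation.

import re
from dataclasses import dataclass, field

FENCE = "`" * 3  # Markdown code fence

@dataclass
class Scene:
    title: str
    narration: list = field(default_factory=list)   # sentences to synthesize as voiceover
    visuals: list = field(default_factory=list)     # code snippets / screenshots to show

def document_to_scenes(markdown_text):
    scenes, current = [], Scene(title="Intro")
    in_code, code_lines = False, []
    for line in markdown_text.splitlines():
        if line.startswith(FENCE):
            if in_code:                              # closing fence: store the snippet
                current.visuals.append("\n".join(code_lines))
                code_lines = []
            in_code = not in_code
        elif in_code:
            code_lines.append(line)
        elif line.startswith("#"):                   # a heading opens a new scene
            scenes.append(current)
            current = Scene(title=line.lstrip("# ").strip())
        elif line.strip():
            current.narration += re.split(r"(?<=[.!?])\s+", line.strip())
    scenes.append(current)
    return [s for s in scenes if s.narration or s.visuals]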
Harish Haresamudram, Irfan Essa, Thomas Ploetz
Assessing the State of Self-Supervised Human Activity Recognition using Wearables Journal Article
In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), vol. 6, iss. 3, no. 116, pp. 1–47, 2022.
Abstract | Links | BibTeX | Tags: activity recognition, IMWUT, ubiquitous computing, wearable computing
@article{2022-Haresamudram-ASSHARUW,
title = {Assessing the State of Self-Supervised Human Activity Recognition using Wearables},
author = {Harish Haresamudram and Irfan Essa and Thomas Ploetz},
url = {https://dl.acm.org/doi/10.1145/3550299
https://arxiv.org/abs/2202.12938
https://arxiv.org/pdf/2202.12938
},
doi = {10.1145/3550299},
year = {2022},
date = {2022-09-07},
urldate = {2022-09-07},
booktitle = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)},
journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)},
volume = {6},
number = {116},
issue = {3},
pages = {1--47},
publisher = {ACM},
abstract = {The emergence of self-supervised learning in the field of wearables-based human activity recognition (HAR) has opened up opportunities to tackle the most pressing challenges in the field, namely to exploit unlabeled data to derive reliable recognition systems for scenarios where only small amounts of labeled training samples can be collected. As such, self-supervision, i.e., the paradigm of 'pretrain-then-finetune', has the potential to become a strong alternative to the predominant end-to-end training approaches, let alone hand-crafted features for the classic activity recognition chain. Recently, a number of contributions have been made that introduced self-supervised learning into the field of HAR, including Multi-task self-supervision, Masked Reconstruction, CPC, and SimCLR, to name but a few. With the initial success of these methods, the time has come for a systematic inventory and analysis of the potential self-supervised learning has for the field. This paper provides exactly that. We assess the progress of self-supervised HAR research by introducing a framework that performs a multi-faceted exploration of model performance. We organize the framework into three dimensions, each containing three constituent criteria, such that each dimension captures specific aspects of performance, including the robustness to differing source and target conditions, the influence of dataset characteristics, and the feature space characteristics. We utilize this framework to assess seven state-of-the-art self-supervised methods for HAR, leading to the formulation of insights into the properties of these techniques and to establish their value towards learning representations for diverse scenarios.
},
keywords = {activity recognition, IMWUT, ubiquitous computing, wearable computing},
pubstate = {published},
tppubtype = {article}
}
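The 'pretrain-then-finetune' paradigm that this assessment revolves around can be summarized in a few lines. The skeleton below is generic (placeholder 1D-conv encoder, arbitrary window and channel sizes, any self-supervised loss passed in as a callable) and does not correspond to any specific method evaluated in the paper.

import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Placeholder encoder for (batch, channels, time) sensor windows."""
    def __init__(self, channels=3, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())

    def forward(self, x):
        return self.net(x)                               # (batch, feat_dim)

def pretrain(encoder, unlabeled_loader, ssl_loss_fn, epochs=10):
    """Stage 1: fit the encoder on unlabeled windows with any self-supervised loss."""
    opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    for _ in range(epochs):
        for windows in unlabeled_loader:
            loss = ssl_loss_fn(encoder, windows)
            opt.zero_grad(); loss.backward(); opt.step()

def finetune(encoder, labeled_loader, num_classes, feat_dim=64, epochs=10):
    """Stage 2: train only a small classifier head on the few labeled windows."""
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)   # encoder stays frozen
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for windows, labels in labeled_loader:
            with torch.no_grad():
                feats = encoder(windows)
            loss = ce(head(feats), labels)
            opt.zero_grad(); loss.backward(); opt.step()
    return head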
Daniel Nkemelu, Harshil Shah, Irfan Essa, Michael L. Best
Tackling Hate Speech in Low-resource Languages with Context Experts Proceedings Article
In: International Conference on Information & Communication Technologies and Development (ICTD), 2022.
Abstract | Links | BibTeX | Tags: computational journalism, ICTD, social computing
@inproceedings{2022-Nkemelu-THSLLWCE,
title = {Tackling Hate Speech in Low-resource Languages with Context Experts},
author = {Daniel Nkemelu and Harshil Shah and Irfan Essa and Michael L. Best},
url = {https://www.nkemelu.com/data/ictd2022_nkemelu_final.pdf
},
year = {2022},
date = {2022-06-01},
urldate = {2022-06-01},
booktitle = {International Conference on Information & Communication Technologies and Development (ICTD)},
abstract = {Given Myanmar's historical and socio-political context, hate speech spread on social media has escalated into offline unrest and violence. This paper presents findings from our remote study on the automatic detection of hate speech online in Myanmar. We argue that effectively addressing this problem will require community-based approaches that combine the knowledge of context experts with machine learning tools that can analyze the vast amount of data produced. To this end, we develop a systematic process to facilitate this collaboration covering key aspects of data collection, annotation, and model validation strategies. We highlight challenges in this area stemming from small and imbalanced datasets, the need to balance non-glamorous data work and stakeholder priorities, and closed data sharing practices. Stemming from these findings, we discuss avenues for further work in developing and deploying hate speech detection systems for low-resource languages.},
keywords = {computational journalism, ICTD, social computing},
pubstate = {published},
tppubtype = {inproceedings}
}
Niranjan Kumar, Irfan Essa, Sehoon Ha
Graph-based Cluttered Scene Generation and Interactive Exploration using Deep Reinforcement Learning Proceedings Article
In: Proceedings International Conference on Robotics and Automation (ICRA), pp. 7521-7527, 2022.
Abstract | Links | BibTeX | Tags: ICRA, machine learning, reinforcement learning, robotics
@inproceedings{2021-Kumar-GCSGIEUDRL,
title = {Graph-based Cluttered Scene Generation and Interactive Exploration using Deep Reinforcement Learning},
author = {Niranjan Kumar and Irfan Essa and Sehoon Ha},
url = {https://doi.org/10.1109/ICRA46639.2022.9811874
https://arxiv.org/abs/2109.10460
https://arxiv.org/pdf/2109.10460
https://www.kniranjankumar.com/projects/5_clutr
https://kniranjankumar.github.io/assets/pdf/graph_based_clutter.pdf
https://youtu.be/T2Jo7wwaXss},
doi = {10.1109/ICRA46639.2022.9811874},
year = {2022},
date = {2022-05-01},
urldate = {2022-05-01},
booktitle = {Proceedings International Conference on Robotics and Automation (ICRA)},
journal = {arXiv},
number = {2109.10460},
pages = {7521-7527},
abstract = {We introduce a novel method to teach a robotic agent to interactively explore cluttered yet structured scenes, such as kitchen pantries and grocery shelves, by leveraging the physical plausibility of the scene. We propose a novel learning framework to train an effective scene exploration policy to discover hidden objects with minimal interactions. First, we define a novel scene grammar to represent structured clutter. Then we train a Graph Neural Network (GNN) based Scene Generation agent using deep reinforcement learning (deep RL), to manipulate this Scene Grammar to create a diverse set of stable scenes, each containing multiple hidden objects. Given such cluttered scenes, we then train a Scene Exploration agent, using deep RL, to uncover hidden objects by interactively rearranging the scene.
},
keywords = {ICRA, machine learning, reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
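As a toy illustration of the kind of structured-clutter representation the abstract refers to, the snippet below encodes a shelf as a list of objects with "in front of" relations and derives which objects are hidden. The field names and layout are invented for exposition and are unrelated to the paper's actual scene grammar or its GNN-based agents.

from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    position: tuple                                      # (x, depth) on the shelf
    in_front_of: list = field(default_factory=list)      # names of objects it occludes

def hidden_objects(scene):
    """An object is hidden if some other object stands directly in front of it."""
    occluded = {name for obj in scene for name in obj.in_front_of}
    return [obj for obj in scene if obj.name in occluded]

shelf = [
    SceneObject("cereal", (0.1, 0.0), in_front_of=["rice"]),
    SceneObject("rice",   (0.1, 0.3)),
    SceneObject("pasta",  (0.5, 0.0)),
]
print([o.name for o in hidden_objects(shelf)])           # ['rice']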
Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song
Learning Temporal Rules from Noisy Timeseries Data Journal Article
In: arXiv preprint arXiv:2202.05403, 2022.
Abstract | Links | BibTeX | Tags: activity recognition, machine learning
@article{2022-Samel-LTRFNTD,
title = {Learning Temporal Rules from Noisy Timeseries Data},
author = {Karan Samel and Zelin Zhao and Binghong Chen and Shuang Li and Dharmashankar Subramanian and Irfan Essa and Le Song},
url = {https://arxiv.org/abs/2202.05403
https://arxiv.org/pdf/2202.05403},
year = {2022},
date = {2022-02-01},
urldate = {2022-02-01},
journal = {arXiv preprint arXiv:2202.05403},
abstract = {Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher-level composite events. Examples of a composite event are a patient's medical symptom or a baseball player hitting a home run, caused by distinct temporal orderings of patient vitals and player movements, respectively. Such salient composite events are provided as labels in temporal datasets and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP) which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite events labels for supervision. This is done through efficiently searching through the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and healthcare datasets where it outperforms the baseline methods for rule discovery.
},
keywords = {activity recognition, machine learning},
pubstate = {published},
tppubtype = {article}
}
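The core idea of scoring temporal orderings differentiably can be illustrated with a soft "before" predicate. The formulation below (a sigmoid of a time difference, with a product t-norm for conjunction) is a stand-in for exposition and is not the rule representation used by Neural TLP.

import torch

def soft_before(t_a, t_b, temperature=1.0):
    """Soft truth value of "atomic event A happens before atomic event B".
    t_a, t_b: tensors of (possibly predicted) event times; differentiable,
    so upstream event detectors can be trained end to end."""
    return torch.sigmoid((t_b - t_a) / temperature)

def rule_score(t_a, t_b, t_c):
    """Score a composite-event rule such as "A before B and B before C"
    with a soft conjunction (product t-norm)."""
    return soft_before(t_a, t_b) * soft_before(t_b, t_c)

# rule_score(torch.tensor(1.0), torch.tensor(2.5), torch.tensor(4.0))  # close to 1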
Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa
Discrete Representations Strengthen Vision Transformer Robustness Proceedings Article
In: Proceedings of International Conference on Learning Representations (ICLR), 2022.
Abstract | Links | BibTeX | Tags: computer vision, google, machine learning, vision transformer
@inproceedings{2022-Mao-DRSVTR,
title = {Discrete Representations Strengthen Vision Transformer Robustness},
author = {Chengzhi Mao and Lu Jiang and Mostafa Dehghani and Carl Vondrick and Rahul Sukthankar and Irfan Essa},
url = {https://iclr.cc/virtual/2022/poster/6647
https://arxiv.org/abs/2111.10493
https://research.google/pubs/pub51388/
https://openreview.net/forum?id=8hWs60AZcWk},
doi = {10.48550/arXiv.2111.10493},
year = {2022},
date = {2022-01-28},
urldate = {2022-04-01},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
journal = {arXiv preprint arXiv:2111.10493},
abstract = {Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet are overly reliant on local textures and fail to make adequate use of shape information. ViTs thus have difficulties generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder. Different from the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and contain less information individually, which promote ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.},
keywords = {computer vision, google, machine learning, vision transformer},
pubstate = {published},
tppubtype = {inproceedings}
}
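The architectural change described above, adding discrete tokens from a vector-quantized encoder alongside the usual continuous patch tokens, can be sketched as follows. The projection, codebook size, and fusion by concatenation are assumptions for illustration rather than the paper's exact input layer.

import torch
import torch.nn as nn

class DiscretePatchTokens(nn.Module):
    def __init__(self, patch_dim=768, codebook_size=1024, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.to_code_space = nn.Linear(patch_dim, code_dim)
        self.fuse = nn.Linear(patch_dim + code_dim, patch_dim)

    def forward(self, patch_tokens):                     # (batch, n_patches, patch_dim)
        z = self.to_code_space(patch_tokens)             # project into codebook space
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        codes = torch.cdist(z, book).argmin(dim=-1)      # nearest discrete code per patch
        discrete = self.codebook(codes)                  # perturbation-robust token stream
        return self.fuse(torch.cat([patch_tokens, discrete], dim=-1))

# tokens = DiscretePatchTokens()(torch.randn(2, 196, 768))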
Steven Hickson, Karthik Raveendran, Irfan Essa
Sharing Decoders: Network Fission for Multi-Task Pixel Prediction Proceedings Article
In: IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3771–3780, 2022.
Abstract | Links | BibTeX | Tags: computer vision, google, machine learning
@inproceedings{2022-Hickson-SDNFMPP,
title = {Sharing Decoders: Network Fission for Multi-Task Pixel Prediction},
author = {Steven Hickson and Karthik Raveendran and Irfan Essa},
url = {https://openaccess.thecvf.com/content/WACV2022/papers/Hickson_Sharing_Decoders_Network_Fission_for_Multi-Task_Pixel_Prediction_WACV_2022_paper.pdf
https://openaccess.thecvf.com/content/WACV2022/supplemental/Hickson_Sharing_Decoders_Network_WACV_2022_supplemental.pdf
https://youtu.be/qqYODA4C6AU},
doi = {10.1109/WACV51458.2022.00371},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
pages = {3771--3780},
abstract = {We examine the benefits of splitting encoder-decoders for multitask learning and showcase results on three tasks (semantics, surface normals, and depth) while adding very few FLOPS per task. Current hard parameter sharing methods for multi-task pixel-wise labeling use one shared encoder with separate decoders for each task. We generalize this notion and term the splitting of encoder-decoder architectures at different points as fission. Our ablation studies on fission show that sharing most of the decoder layers in multi-task encoder-decoder networks results in improvement while adding far fewer parameters per task. Our proposed method trains faster, uses less memory, results in better accuracy, and uses significantly fewer floating point operations (FLOPS) than conventional multi-task methods, with additional tasks only requiring 0.017% more FLOPS than the single-task network.},
keywords = {computer vision, google, machine learning},
pubstate = {published},
tppubtype = {inproceedings}
}
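The "fission" idea, one encoder and a decoder trunk shared by all tasks with only tiny per-task heads at the end, can be written down compactly. The layer sizes, the late split point, and the three output channel counts below are placeholders; the paper ablates where the split should actually happen.

import torch
import torch.nn as nn

class SharedDecoderNet(nn.Module):
    def __init__(self, task_channels=None):
        super().__init__()
        task_channels = task_channels or {"semantics": 21, "normals": 3, "depth": 1}
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.shared_decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU())
        # Fission only at the very end: each extra task costs a single 1x1 convolution.
        self.heads = nn.ModuleDict(
            {task: nn.Conv2d(64, ch, 1) for task, ch in task_channels.items()})

    def forward(self, x):
        features = self.shared_decoder(self.encoder(x))
        return {task: head(features) for task, head in self.heads.items()}

# outputs = SharedDecoderNet()(torch.randn(1, 3, 128, 128))  # dict with 3 pixel maps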
Niranjan Kumar, Irfan Essa, Sehoon Ha
Cascaded Compositional Residual Learning for Complex Interactive Behaviors Proceedings Article
In: Sim-to-Real Robot Learning: Locomotion and Beyond Workshop at the Conference on Robot Learning (CoRL), arXiv, 2022.
Abstract | Links | BibTeX | Tags: reinforcement learning, robotics
@inproceedings{2022-Kumar-CCRLCIB,
title = {Cascaded Compositional Residual Learning for Complex Interactive Behaviors},
author = {Niranjan Kumar and Irfan Essa and Sehoon Ha},
url = {https://arxiv.org/abs/2212.08954
https://www.kniranjankumar.com/ccrl/static/pdf/paper.pdf
https://youtu.be/fAklIxiK7Qg
},
doi = {10.48550/ARXIV.2212.08954},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Sim-to-Real Robot Learning: Locomotion and Beyond Workshop at the Conference on Robot Learning (CoRL)},
publisher = {arXiv},
abstract = {Real-world autonomous missions often require rich interaction with nearby objects, such as doors or switches, along with effective navigation. However, such complex behaviors are difficult to learn because they involve both high-level planning and low-level motor control. We present a novel framework, Cascaded Compositional Residual Learning (CCRL), which learns composite skills by recursively leveraging a library of previously learned control policies. Our framework learns multiplicative policy composition, task-specific residual actions, and synthetic goal information simultaneously while freezing the prerequisite policies. We further explicitly control the style of the motion by regularizing residual actions. We show that our framework learns joint-level control policies for a diverse set of motor skills ranging from basic locomotion to complex interactive navigation, including navigating around obstacles, pushing objects, crawling under a table, pushing a door open with its leg, and holding it open while walking through it. The proposed CCRL framework leads to policies with consistent styles and lower joint torques, which we successfully transfer to a real Unitree A1 robot without any additional fine-tuning.},
keywords = {reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
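A stripped-down version of "compose frozen base policies, then add a task-specific residual" might look like the following. Note that the paper composes policies multiplicatively; for brevity this sketch blends base actions with a softmax gate instead, so treat it as the general shape of the idea rather than the method itself.

import torch
import torch.nn as nn

class ComposedPolicy(nn.Module):
    def __init__(self, base_policies, obs_dim, act_dim):
        super().__init__()
        self.base_policies = base_policies                # previously learned, kept frozen
        for p in self.base_policies:
            p.requires_grad_(False)
        self.gate = nn.Linear(obs_dim, len(base_policies))
        self.residual = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

    def forward(self, obs):
        with torch.no_grad():
            base_actions = torch.stack([p(obs) for p in self.base_policies], dim=1)
        weights = self.gate(obs).softmax(dim=-1).unsqueeze(-1)   # (batch, K, 1)
        blended = (weights * base_actions).sum(dim=1)            # combine base skills
        return blended + self.residual(obs)                      # task-specific correction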
Apoorva Beedu, Zhile Ren, Varun Agrawal, Irfan Essa
VideoPose: Estimating 6D object pose from videos Technical Report
2021.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, object detection, pose estimation
@techreport{2021-Beedu-VEOPFV,
title = {VideoPose: Estimating 6D object pose from videos},
author = {Apoorva Beedu and Zhile Ren and Varun Agrawal and Irfan Essa},
url = {https://arxiv.org/abs/2111.10677},
doi = {10.48550/arXiv.2111.10677},
year = {2021},
date = {2021-11-01},
urldate = {2021-11-01},
journal = {arXiv preprint arXiv:2111.10677},
abstract = {We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos. Our approach leverages the temporal information from a video sequence, and is computationally efficient and robust to support robotic and AR domains. Our proposed network takes a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art algorithms. Further, with a speed of 30 fps, it is also more efficient than the state-of-the-art, and therefore applicable to a variety of applications that require real-time object pose estimation.},
keywords = {arXiv, computer vision, object detection, pose estimation},
pubstate = {published},
tppubtype = {techreport}
}
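The described structure, per-frame features from a pretrained 2D detector aggregated by a recurrent network into per-frame pose estimates, reduces to a few lines. The feature dimension and the translation-plus-quaternion pose parameterization below are assumptions, not details taken from the report.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoPoseSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.trans_head = nn.Linear(hidden, 3)
        self.rot_head = nn.Linear(hidden, 4)

    def forward(self, frame_features):                    # (batch, time, feat_dim)
        h, _ = self.rnn(frame_features)                   # temporal aggregation
        translation = self.trans_head(h)                  # (batch, time, 3)
        rotation = F.normalize(self.rot_head(h), dim=-1)  # (batch, time, 4) unit quaternions
        return translation, rotation

# t, r = VideoPoseSketch()(torch.randn(2, 30, 512))       # detector features for 30-frame clips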
Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, Irfan Essa
Text as Neural Operator: Image Manipulation by Text Instruction Proceedings Article
In: ACM International Conference on Multimedia (ACM-MM), ACM Press, 2021.
Abstract | Links | BibTeX | Tags: computer vision, generative media, google, multimedia
@inproceedings{2021-Zhang-TNOIMTI,
title = {Text as Neural Operator: Image Manipulation by Text Instruction},
author = {Tianhao Zhang and Hung-Yu Tseng and Lu Jiang and Weilong Yang and Honglak Lee and Irfan Essa},
url = {https://dl.acm.org/doi/10.1145/3474085.3475343
https://arxiv.org/abs/2008.04556},
doi = {10.1145/3474085.3475343},
year = {2021},
date = {2021-10-01},
urldate = {2021-10-01},
booktitle = {ACM International Conference on Multimedia (ACM-MM)},
publisher = {ACM Press},
abstract = {In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision community. The input to conditional image generation has evolved from image-only to multimodality. In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects. The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image. We propose a GAN-based method to tackle this problem. The key idea is to treat text as neural operators to locally modify the image feature. We show that the proposed model performs favorably against recent strong baselines on three public datasets. Specifically, it generates images of greater fidelity and semantic relevance, and when used as an image query, leads to better retrieval performance.},
keywords = {computer vision, generative media, google, multimedia},
pubstate = {published},
tppubtype = {inproceedings}
}
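One simplified reading of "text as an operator that locally modifies image features" is a FiLM-style modulation gated by a predicted spatial mask, sketched below. This is an illustration of the idea only; the paper's GAN architecture and its actual operator differ.

import torch
import torch.nn as nn

class TextOperator(nn.Module):
    def __init__(self, text_dim=256, feat_channels=128):
        super().__init__()
        self.scale_shift = nn.Linear(text_dim, 2 * feat_channels)
        self.where = nn.Conv2d(feat_channels + text_dim, 1, kernel_size=1)

    def forward(self, image_feat, text_emb):              # (B, C, H, W), (B, text_dim)
        b, c, h, w = image_feat.shape
        scale, shift = self.scale_shift(text_emb).chunk(2, dim=-1)
        edited = image_feat * (1 + scale.view(b, c, 1, 1)) + shift.view(b, c, 1, 1)
        text_map = text_emb.view(b, -1, 1, 1).expand(b, text_emb.size(1), h, w)
        gate = torch.sigmoid(self.where(torch.cat([image_feat, text_map], dim=1)))
        return gate * edited + (1 - gate) * image_feat    # edit only where the gate fires

# out = TextOperator()(torch.randn(2, 128, 16, 16), torch.randn(2, 256))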
Peggy Chi, Nathan Frey, Katrina Panovich, Irfan Essa
Automatic Instructional Video Creation from a Markdown-Formatted Tutorial Proceedings Article
In: ACM Symposium on User Interface Software and Technology (UIST), ACM Press, 2021.
Abstract | Links | BibTeX | Tags: google, human-computer interaction, UIST, video editing
@inproceedings{2021-Chi-AIVCFMT,
title = {Automatic Instructional Video Creation from a Markdown-Formatted Tutorial},
author = {Peggy Chi and Nathan Frey and Katrina Panovich and Irfan Essa},
url = {https://doi.org/10.1145/3472749.3474778
https://research.google/pubs/pub50745/
https://youtu.be/WmrZ7PUjyuM},
doi = {10.1145/3472749.3474778},
year = {2021},
date = {2021-10-01},
urldate = {2021-10-01},
booktitle = {ACM Symposium on User Interface Software and Technology (UIST)},
publisher = {ACM Press},
abstract = {We introduce HowToCut, an automatic approach that converts a Markdown-formatted tutorial into an interactive video that presents the visual instructions with a synthesized voiceover for narration. HowToCut extracts instructional content from a multimedia document that describes a step-by-step procedure. Our method selects and converts text instructions to a voiceover. It makes automatic editing decisions to align the narration with edited visual assets, including step images, videos, and text overlays. We derive our video editing strategies from an analysis of 125 web tutorials and apply Computer Vision techniques to the assets. To enable viewers to interactively navigate the tutorial, HowToCut's conversational UI presents instructions in multiple formats upon user commands. We evaluated our automatically-generated video tutorials through user studies (N=20) and validated the video quality via an online survey (N=93). The evaluation shows that our method was able to effectively create informative and useful instructional videos from a web tutorial document for both reviewing and following.},
keywords = {google, human-computer interaction, UIST, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song
Neural Temporal Logic Programming Technical Report
2021.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, machine learning, openreview
@techreport{2021-Samel-NTLP,
title = {Neural Temporal Logic Programming},
author = {Karan Samel and Zelin Zhao and Binghong Chen and Shuang Li and Dharmashankar Subramanian and Irfan Essa and Le Song},
url = {https://openreview.net/forum?id=i7h4M45tU8},
year = {2021},
date = {2021-09-01},
urldate = {2021-09-01},
abstract = {Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher-level composite events. Examples of a composite event are a patient's medical symptom or a baseball player hitting a home run, caused by distinct temporal orderings of patient vitals and player movements, respectively. Such salient composite events are provided as labels in temporal datasets and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP) which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite events labels for supervision. This is done through efficiently searching through the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and on healthcare data where it outperforms the baseline methods for rule discovery.},
howpublished = {https://openreview.net/forum?id=i7h4M45tU8},
keywords = {activity recognition, arXiv, machine learning, openreview},
pubstate = {published},
tppubtype = {techreport}
}
Nathan Frey, Peggy Chi, Weilong Yang, Irfan Essa
Automatic Style Transfer for Non-Linear Video Editing Proceedings Article
In: Proceedings of CVPR Workshop on AI for Content Creation (AICC), 2021.
Links | BibTeX | Tags: computational video, CVPR, google, video editing
@inproceedings{2021-Frey-ASTNVE,
title = {Automatic Style Transfer for Non-Linear Video Editing},
author = {Nathan Frey and Peggy Chi and Weilong Yang and Irfan Essa},
url = {https://arxiv.org/abs/2105.06988
https://research.google/pubs/pub50449/},
doi = {10.48550/arXiv.2105.06988},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
booktitle = {Proceedings of CVPR Workshop on AI for Content Creation (AICC)},
keywords = {computational video, CVPR, google, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
Unsupervised Discovery of Actions in Instructional Videos Proceedings Article
In: British Machine Vision Conference (BMVC), 2021.
Abstract | Links | BibTeX | Tags: activity recognition, computational video, computer vision, google
@inproceedings{2021-Piergiovanni-UDAIV,
title = {Unsupervised Discovery of Actions in Instructional Videos},
author = {AJ Piergiovanni and Anelia Angelova and Michael S. Ryoo and Irfan Essa},
url = {https://arxiv.org/abs/2106.14733
https://www.bmvc2021-virtualconference.com/assets/papers/0773.pdf},
doi = {10.48550/arXiv.2106.14733},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
booktitle = {British Machine Vision Conference (BMVC)},
number = {arXiv:2106.14733},
abstract = {In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos. Instructional videos contain complex activities and are a rich source of information for intelligent agents, such as autonomous robots or virtual assistants, which can, for example, automatically `read' the steps from an instructional video and execute them. However, videos are rarely annotated with atomic activities, their boundaries or duration. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos. We propose a sequential stochastic autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task, and which provides automatic and unsupervised self-labeling for videos. Our approach outperforms the state-of-the-art unsupervised methods with large margins. We will open source the code.
},
keywords = {activity recognition, computational video, computer vision, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Harish Haresamudram, Irfan Essa, Thomas Ploetz
Contrastive Predictive Coding for Human Activity Recognition Journal Article
In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 2, pp. 1–26, 2021.
Abstract | Links | BibTeX | Tags: activity recognition, IMWUT, machine learning, ubiquitous computing
@article{2021-Haresamudram-CPCHAR,
title = {Contrastive Predictive Coding for Human Activity Recognition},
author = {Harish Haresamudram and Irfan Essa and Thomas Ploetz},
url = {https://doi.org/10.1145/3463506
https://arxiv.org/abs/2012.05333},
doi = {10.1145/3463506},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
booktitle = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
volume = {5},
number = {2},
pages = {1--26},
abstract = {Feature extraction is crucial for human activity recognition (HAR) using body-worn movement sensors. Recently, learned representations have been used successfully, offering promising alternatives to manually engineered features. Our work focuses on effective use of small amounts of labeled data and the opportunistic exploitation of unlabeled data that are straightforward to collect in mobile and ubiquitous computing scenarios. We hypothesize and demonstrate that explicitly considering the temporality of sensor data at representation level plays an important role for effective HAR in challenging scenarios. We introduce the Contrastive Predictive Coding (CPC) framework to human activity recognition, which captures the long-term temporal structure of sensor data streams. Through a range of experimental evaluations on real-life recognition tasks, we demonstrate its effectiveness for improved HAR. CPC-based pre-training is self-supervised, and the resulting learned representations can be integrated into standard activity chains. It leads to significantly improved recognition performance when only small amounts of labeled training data are available, thereby demonstrating the practical value of our approach.},
keywords = {activity recognition, IMWUT, machine learning, ubiquitous computing},
pubstate = {published},
tppubtype = {article}
}
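The CPC pre-training objective at the heart of this paper can be condensed to: encode each timestep, summarize the past with a GRU, and score the true future latent against in-batch negatives (InfoNCE). The sketch below uses a single-step prediction horizon and toy dimensions; the paper's encoder and prediction setup are more involved.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCSketch(nn.Module):
    def __init__(self, channels=3, z_dim=64, c_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(channels, z_dim), nn.ReLU())
        self.gru = nn.GRU(z_dim, c_dim, batch_first=True)
        self.predict = nn.Linear(c_dim, z_dim)            # predicts the next latent step

    def info_nce(self, x):                                # x: (batch, time, channels)
        z = self.encoder(x)                               # per-timestep latents
        c, _ = self.gru(z[:, :-1])                        # context from all but the last step
        pred = self.predict(c[:, -1])                     # predicted latent for the final step
        target = z[:, -1]                                 # true final-step latent
        logits = pred @ target.t()                        # each row: true future vs. others'
        labels = torch.arange(x.size(0))
        return F.cross_entropy(logits, labels)

# loss = CPCSketch().info_nce(torch.randn(16, 50, 3))     # 16 windows of 50 timesteps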
Anh Truong, Peggy Chi, David Salesin, Irfan Essa, Maneesh Agrawala
Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos Proceedings Article
In: ACM CHI Conference on Human factors in Computing Systems, 2021.
Abstract | Links | BibTeX | Tags: CHI, computational video, google, human-computer interaction, video summarization
@inproceedings{2021-Truong-AGTHTFIMV,
title = {Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos},
author = {Anh Truong and Peggy Chi and David Salesin and Irfan Essa and Maneesh Agrawala},
url = {https://dl.acm.org/doi/10.1145/3411764.3445721
https://research.google/pubs/pub50007/
http://anhtruong.org/makeup_breakdown/},
doi = {10.1145/3411764.3445721},
year = {2021},
date = {2021-05-01},
urldate = {2021-05-01},
booktitle = {ACM CHI Conference on Human factors in Computing Systems},
abstract = {We present a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos. Our approach is inspired by prior research in cognitive psychology, which suggests that people mentally segment procedural tasks into event hierarchies, where coarse-grained events focus on objects while fine-grained events focus on actions. In the instructional makeup domain, we find that objects correspond to facial parts while fine-grained steps correspond to actions on those facial parts. Given an input instructional makeup video, we apply a set of heuristics that combine computer vision techniques with transcript text analysis to automatically identify the fine-level action steps and group these steps by facial part to form the coarse-level events. We provide a voice-enabled, mixed-media UI to visualize the resulting hierarchy and allow users to efficiently navigate the tutorial (e.g., skip ahead, return to previous steps) at their own pace. Users can navigate the hierarchy at both the facial-part and action-step levels using click-based interactions and voice commands. We demonstrate the effectiveness of segmentation algorithms and the resulting mixed-media UI on a variety of input makeup videos. A user study shows that users prefer following instructional makeup videos in our mixed-media format to the standard video UI and that they find our format much easier to navigate.},
keywords = {CHI, computational video, google, human-computer interaction, video summarization},
pubstate = {published},
tppubtype = {inproceedings}
}
Dan Scarafoni, Irfan Essa, Thomas Ploetz
PLAN-B: Predicting Likely Alternative Next Best Sequences for Action Prediction Technical Report
no. arXiv:2103.15987, 2021.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, computer vision
@techreport{2021-Scarafoni-PPLANBSAP,
title = {PLAN-B: Predicting Likely Alternative Next Best Sequences for Action Prediction},
author = {Dan Scarafoni and Irfan Essa and Thomas Ploetz},
url = {https://arxiv.org/abs/2103.15987},
doi = {10.48550/arXiv.2103.15987},
year = {2021},
date = {2021-03-01},
urldate = {2021-03-01},
journal = {arXiv},
number = {arXiv:2103.15987},
abstract = {Action prediction focuses on anticipating actions before they happen. Recent works leverage probabilistic approaches to describe future uncertainties and sample future actions. However, these methods cannot easily find all alternative predictions, which are essential given the inherent unpredictability of the future, and current evaluation protocols do not measure a system's ability to find such alternatives. We re-examine action prediction in terms of its ability to predict not only the top predictions, but also top alternatives with the accuracy@k metric. In addition, we propose Choice F1: a metric inspired by F1 score which evaluates a prediction system's ability to find all plausible futures while keeping only the most probable ones. To evaluate this problem, we present a novel method, Predicting the Likely Alternative Next Best, or PLAN-B, for action prediction which automatically finds the set of most likely alternative futures. PLAN-B consists of two novel components: (i) a Choice Table which ensures that all possible futures are found, and (ii) a "Collaborative" RNN system which combines both action sequence and feature information. We demonstrate that our system outperforms state-of-the-art results on benchmark datasets.
},
keywords = {activity recognition, arXiv, computer vision},
pubstate = {published},
tppubtype = {techreport}
}
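The accuracy@k evaluation mentioned in the abstract is easy to make concrete: a sample counts as correct if the ground-truth next action appears among the k highest-scoring candidates. The few lines below implement that metric; Choice F1 depends on details in the paper and is not reproduced here.

import numpy as np

def accuracy_at_k(scores, labels, k=5):
    """scores: (n_samples, n_actions) array of per-action scores.
    labels: (n_samples,) array of ground-truth action indices."""
    topk = np.argsort(scores, axis=1)[:, -k:]             # indices of the k best actions
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

# accuracy_at_k(np.random.rand(100, 20), np.random.randint(0, 20, size=100), k=3)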
Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, Dhruv Batra
Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views Proceedings Article
In: Proceedings of American Association of Artificial Intelligence Conference (AAAI), AAAI, 2021.
Abstract | Links | BibTeX | Tags: AAAI, AI, embodied agents, first-person vision
@inproceedings{2021-Cartillier-SMBASRFEV,
title = {Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views},
author = {Vincent Cartillier and Zhile Ren and Neha Jain and Stefan Lee and Irfan Essa and Dhruv Batra},
url = {https://arxiv.org/abs/2010.01191
https://vincentcartillier.github.io/smnet.html
https://ojs.aaai.org/index.php/AAAI/article/view/16180/15987},
doi = {10.48550/arXiv.2010.01191},
year = {2021},
date = {2021-02-01},
urldate = {2021-02-01},
booktitle = {Proceedings of American Association of Artificial Intelligence Conference (AAAI)},
publisher = {AAAI},
abstract = {We study the task of semantic mapping -- specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map (`what is where?') from egocentric observations of an RGB-D camera with known pose (via localization sensors). Importantly, our goal is to build neural episodic memories and spatio-semantic representations of 3D spaces that enable the agent to easily learn subsequent tasks in the same space -- navigating to objects seen during the tour (`Find chair') or answering questions about the space (`How many chairs did you see in the house?').
Towards this goal, we present Semantic MapNet (SMNet), which consists of: (1) an Egocentric
Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length × width × feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the spatio-semantic allocentric representations built by SMNet for the task of ObjectNav and Embodied Question Answering.},
keywords = {AAAI, AI, embodied agents, first-person vision},
pubstate = {published},
tppubtype = {inproceedings}
}
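The projection step in the abstract, carrying egocentric features onto an allocentric floor-plan grid, can be illustrated as below, assuming the per-pixel world coordinates have already been computed from depth and the known camera pose. SMNet learns the accumulation with a spatial memory tensor; this sketch simply max-pools features that land in the same cell.

import torch

def project_to_floorplan(features, world_xy, grid_size=100, cell=0.1):
    """features: (N, D) egocentric per-pixel features; world_xy: (N, 2) metric coordinates."""
    floor_map = torch.zeros(grid_size, grid_size, features.size(1))
    cells = (world_xy / cell).long().clamp(0, grid_size - 1)      # grid indices per pixel
    for (i, j), feat in zip(cells.tolist(), features):
        floor_map[i, j] = torch.maximum(floor_map[i, j], feat)    # keep the strongest evidence
    return floor_map                                              # (grid_size, grid_size, D)

# top_down = project_to_floorplan(torch.rand(5000, 64), torch.rand(5000, 2) * 10)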
Niranjan Kumar, Irfan Essa, Sehoon Ha, C. Karen Liu
Estimating Mass Distribution of Articulated Objects through Non-prehensile Manipulation Proceedings Article
In: Neural Information Processing Systems (NeurIPS) Workshop on Object Representations for Learning and Reasoning, NeurIPS 2020.
Abstract | Links | BibTeX | Tags: reinforcement learning, robotics
@inproceedings{2020-Kumar-EMDAOTNM,
title = {Estimating Mass Distribution of Articulated Objects through Non-prehensile Manipulation},
author = {Niranjan Kumar and Irfan Essa and Sehoon Ha and C. Karen Liu},
url = {https://orlrworkshop.github.io/program/orlr_25.html
http://arxiv.org/abs/1907.03964
https://www.kniranjankumar.com/projects/1_mass_prediction
https://www.youtube.com/watch?v=o3zBdVWvWZw
https://kniranjankumar.github.io/assets/pdf/Estimating_Mass_Distribution_of_Articulated_Objects_using_Non_prehensile_Manipulation.pdf},
year = {2020},
date = {2020-12-01},
urldate = {2020-12-01},
booktitle = {Neural Information Processing Systems (NeurIPS) Workshop on Object Representations for Learning and Reasoning},
organization = {NeurIPS},
abstract = {We explore the problem of estimating the mass distribution of an articulated object by an interactive robotic agent. Our method predicts the mass distribution of an object by using limited sensing and actuating capabilities of a robotic agent that is interacting with the object. We are inspired by the role of exploratory play in human infants. We take the combined approach of supervised and reinforcement learning to train an agent that learns to strategically interact with the object to estimate the object's mass distribution. Our method consists of two neural networks: (i) the policy network which decides how to interact with the object, and (ii) the predictor network that estimates the mass distribution given a history of observations and interactions. Using our method, we train a robotic arm to estimate the mass distribution of an object with moving parts (e.g. an articulated rigid body system) by pushing it on a surface with unknown friction properties. We also demonstrate how our training from simulations can be transferred to real hardware using a small amount of real-world data for fine-tuning. We use a UR10 robot to interact with 3D printed articulated chains with varying mass distributions and show that our method significantly outperforms the baseline system that uses random pushes to interact with the object.},
howpublished = {arXiv preprint arXiv:1907.03964},
keywords = {reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
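The two-network layout described above, a policy that proposes the next push and a predictor that reads the interaction history, can be sketched as follows; every dimension (observation, action, number of links) is a placeholder rather than a value from the paper.

import torch
import torch.nn as nn

class PushPolicy(nn.Module):
    """Decides how to interact next, given the current observation."""
    def __init__(self, obs_dim=32, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())

    def forward(self, obs):                               # (batch, obs_dim)
        return self.net(obs)

class MassPredictor(nn.Module):
    """Regresses per-link masses from the history of observations and pushes."""
    def __init__(self, obs_dim=32, act_dim=4, num_links=3, hidden=128):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_links)

    def forward(self, obs_seq, act_seq):                  # (batch, time, obs/act dims)
        h, _ = self.gru(torch.cat([obs_seq, act_seq], dim=-1))
        return self.head(h[:, -1]).relu()                 # non-negative mass estimates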
Other Publication Sites
A few more sites that aggregate research publications: Academia.edu, Bibsonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.