A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is
Publications:
Gong Zhang, Kihyuk Sohn, Meera Hahn, Humphrey Shi, Irfan Essa
FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2024.
Abstract | Links | BibTeX | Tags: computer vision, generative AI, generative media, machine learning, NeurIPS
@inproceedings{2024-Zhang-FFCSPTM,
title = {FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models},
author = {Gong Zhang and Kihyuk Sohn and Meera Hahn and Humphrey Shi and Irfan Essa},
url = {https://neurips.cc/virtual/2024/poster/96863
https://openreview.net/forum?id=1SmXUGzrH8},
year = {2024},
date = {2024-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {Few-shot fine-tuning of text-to-image (T2I) generation models enables people to create unique images in their own style using natural language without requiring extensive prompt engineering. However, fine-tuning with only a handful, as little as one, of image-text paired data prevents fine-grained control of style attributes at generation. In this paper, we present FineStyle, a few-shot fine-tuning method that allows enhanced controllability for style personalized text-to-image generation. To overcome the lack of training data for fine-tuning, we propose a novel concept-oriented data scaling that amplifies the number of image-text pairs, each of which focuses on different concepts (e.g., objects) in the style reference image. We also identify the benefit of parameter-efficient adapter tuning of key and value kernels of cross-attention layers. Extensive experiments show the effectiveness of FineStyle at following fine-grained text prompts and delivering visual quality faithful to the specified style, measured by CLIP scores and human raters.
},
keywords = {computer vision, generative AI, generative media, machine learning, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
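To make the adapter-tuning idea from the abstract above concrete, here is a minimal, hypothetical PyTorch-style sketch (not the authors' code) of low-rank adapters added to the key and value projections of a cross-attention layer, with the pretrained weights frozen; all class and variable names are illustrative assumptions.

import torch
import torch.nn as nn

class KVAdapterCrossAttention(nn.Module):
    # Illustrative sketch: cross-attention with low-rank adapters on the key/value kernels.
    def __init__(self, dim, ctx_dim, rank=4):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)
        for p in self.parameters():          # freeze the pretrained projections
            p.requires_grad = False
        # Only these small adapters are trained during few-shot personalization.
        self.k_adapter = nn.Sequential(nn.Linear(ctx_dim, rank, bias=False),
                                       nn.Linear(rank, dim, bias=False))
        self.v_adapter = nn.Sequential(nn.Linear(ctx_dim, rank, bias=False),
                                       nn.Linear(rank, dim, bias=False))

    def forward(self, x, context):
        q = self.to_q(x)
        k = self.to_k(context) + self.k_adapter(context)
        v = self.to_v(context) + self.v_adapter(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.to_out(attn @ v)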
Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra
Emergence of Maps in the Memories of Blind Navigation Agents Best Paper Proceedings Article
In: Proceedings of International Conference on Learning Representations (ICLR), 2023.
Abstract | Links | BibTeX | Tags: awards, best paper award, computer vision, google, ICLR, machine learning, robotics
@inproceedings{2023-Wijmans-EMMBNA,
title = {Emergence of Maps in the Memories of Blind Navigation Agents},
author = {Erik Wijmans and Manolis Savva and Irfan Essa and Stefan Lee and Ari S. Morcos and Dhruv Batra},
url = {https://arxiv.org/abs/2301.13261
https://wijmans.xyz/publication/eom/
https://openreview.net/forum?id=lTt4KjHSsyl
https://blog.iclr.cc/2023/03/21/announcing-the-iclr-2023-outstanding-paper-award-recipients/},
doi = {10.48550/ARXIV.2301.13261},
year = {2023},
date = {2023-05-01},
urldate = {2023-05-01},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
abstract = {Animal navigation research posits that organisms build and maintain internal spatial representations, or maps, of their environment. We ask if machines -- specifically, artificial intelligence (AI) navigation agents -- also build implicit (or 'mental') maps. A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial. Unlike animal navigation, we can judiciously design the agent's perceptual system and control the learning paradigm to nullify alternative navigation mechanisms. Specifically, we train 'blind' agents -- with sensing limited to only egomotion and no other sensing of any kind -- to perform PointGoal navigation ('go to Δ x, Δ y') via reinforcement learning. Our agents are composed of navigation-agnostic components (fully-connected and recurrent neural networks), and our experimental setup provides no inductive bias towards mapping. Despite these harsh conditions, we find that blind agents are (1) surprisingly effective navigators in new environments (~95% success); (2) they utilize memory over long horizons (remembering ~1,000 steps of past experience in an episode); (3) this memory enables them to exhibit intelligent behavior (following walls, detecting collisions, taking shortcuts); (4) there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates; and (5) the emergent maps are selective and task dependent (e.g. the agent 'forgets' exploratory detours). Overall, this paper presents no new techniques for the AI audience, but a surprising finding, an insight, and an explanation.},
keywords = {awards, best paper award, computer vision, google, ICLR, machine learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
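As a rough illustration of the 'blind' agent setup described in the abstract above, the following hypothetical sketch shows a recurrent PointGoal policy whose only inputs are the goal offset (Δx, Δy) and the previous action; it is a minimal, assumption-laden sketch, not the paper's architecture.

import torch
import torch.nn as nn

class BlindPointGoalPolicy(nn.Module):
    # Illustrative sketch: no visual sensing, only egomotion-derived goal offset and last action.
    def __init__(self, num_actions=4, hidden=512):
        super().__init__()
        self.embed_goal = nn.Linear(2, 32)                      # (Δx, Δy) goal vector
        self.embed_prev_action = nn.Embedding(num_actions + 1, 32)  # +1 for "no previous action"
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, num_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, goal, prev_action, h):
        x = torch.cat([self.embed_goal(goal), self.embed_prev_action(prev_action)], dim=-1)
        out, h = self.rnn(x.unsqueeze(1), h)
        out = out.squeeze(1)
        return self.policy_head(out), self.value_head(out), h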
José Lezama, Tim Salimans, Lu Jiang, Huiwen Chang, Jonathan Ho, Irfan Essa
Discrete Predictor-Corrector Diffusion Models for Image Synthesis Proceedings Article
In: International Conference on Learning Representations (ICLR), 2023.
Abstract | Links | BibTeX | Tags: computer vision, generative AI, generative media, google, ICLR, machine learning
@inproceedings{2023-Lezama-DPDMIS,
title = {Discrete Predictor-Corrector Diffusion Models for Image Synthesis},
author = {José Lezama and Tim Salimans and Lu Jiang and Huiwen Chang and Jonathan Ho and Irfan Essa},
url = {https://openreview.net/forum?id=VM8batVBWvg},
year = {2023},
date = {2023-05-01},
urldate = {2023-05-01},
booktitle = {International Conference on Learning Representations (ICLR)},
abstract = {We introduce Discrete Predictor-Corrector diffusion models (DPC), extending predictor-corrector samplers in Gaussian diffusion models to the discrete case. Predictor-corrector samplers are a class of samplers for diffusion models, which improve on ancestral samplers by correcting the sampling distribution of intermediate diffusion states using MCMC methods. In DPC, the Langevin corrector, which does not have a direct counterpart in discrete space, is replaced with a discrete MCMC transition defined by a learned corrector kernel. The corrector kernel is trained to make the correction steps achieve asymptotic convergence, in distribution, to the correct marginal of the intermediate diffusion states. Equipped with DPC, we revisit recent transformer-based non-autoregressive generative models through the lens of discrete diffusion, and find that DPC can alleviate the compounding decoding error due to the parallel sampling of visual tokens. Our experiments show that DPC improves upon existing discrete latent space models for class-conditional image generation on ImageNet, and outperforms continuous diffusion models and GANs, according to standard metrics and user preference studies.},
keywords = {computer vision, generative AI, generative media, google, ICLR, machine learning},
pubstate = {published},
tppubtype = {inproceedings}
}
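A minimal sketch of the discrete predictor-corrector sampling loop described in the abstract above, under the assumption of two callables, predictor and corrector, that return per-token categorical logits; the names and interfaces are hypothetical, not the authors' implementation.

import torch

def sample_dpc(predictor, corrector, x_T, num_steps, num_corrector_steps=1):
    x = x_T                                      # initial (e.g. fully masked) token grid
    for t in reversed(range(1, num_steps + 1)):
        logits = predictor(x, t)                 # predictor proposes tokens for step t-1
        x = torch.distributions.Categorical(logits=logits).sample()
        for _ in range(num_corrector_steps):     # learned discrete MCMC correction step(s)
            corr_logits = corrector(x, t - 1)
            x = torch.distributions.Categorical(logits=corr_logits).sample()
    return x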
Erik Wijmans, Irfan Essa, Dhruv Batra
VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement Proceedings Article
In: Oh, Alice H., Agarwal, Alekh, Belgrave, Danielle, Cho, Kyunghyun (Eds.): Advances in Neural Information Processing Systems (NeurIPS), 2022.
Abstract | Links | BibTeX | Tags: machine learning, NeurIPS, reinforcement learning, robotics
@inproceedings{2022-Wijmans-SOLENER,
title = {VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement},
author = {Erik Wijmans and Irfan Essa and Dhruv Batra},
editor = {Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
url = {https://arxiv.org/abs/2210.05064
https://openreview.net/forum?id=VrJWseIN98},
doi = {10.48550/ARXIV.2210.05064},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {We present Variable Experience Rollout (VER), a technique for efficiently scaling batched on-policy reinforcement learning in heterogenous environments (where different environments take vastly different times to generate rollouts) to many GPUs residing on, potentially, many machines. VER combines the strengths of and blurs the line between synchronous and asynchronous on-policy RL methods (SyncOnRL and AsyncOnRL, respectively). Specifically, it learns from on-policy experience (like SyncOnRL) and has no synchronization points (like AsyncOnRL) enabling high throughput.
We find that VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state of art for distributed SyncOnRL, with similar sample efficiency. For mobile manipulation tasks (open fridge/cabinet, pick/place objects) in Habitat 2.0 VER is 150% faster (2.5x speedup) on 1 GPU and 170% faster (2.7x speedup) on 8 GPUs than DD-PPO. Compared to SampleFactory (the current state-of-the-art AsyncOnRL), VER matches its speed on 1 GPU, and is 70% faster (1.7x speedup) on 8 GPUs with better sample efficiency.
We leverage these speed-ups to train chained skills for GeometricGoal rearrangement tasks in the Home Assistant Benchmark (HAB). We find a surprising emergence of navigation in skills that do not ostensibly require any navigation. Specifically, the Pick skill involves a robot picking an object from a table. During training the robot was always spawned close to the table and never needed to navigate. However, we find that if base movement is part of the action space, the robot learns to navigate then pick an object in new environments with 50% success, demonstrating surprisingly high out-of-distribution generalization.},
keywords = {machine learning, NeurIPS, reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
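The core idea of variable experience rollouts can be illustrated with a toy sketch: rather than waiting for every (possibly slow) environment to complete a fixed-length rollout, learning proceeds once a total step budget has been gathered from whichever environments are ready. The environment and policy interfaces below (ready(), step(), act(), observation()) are hypothetical stand-ins, not the Habitat API.

def collect_variable_rollout(envs, policy, step_budget):
    # Fast environments naturally contribute more steps to the batch;
    # assumes at least one environment is always ready (toy simplification).
    batch = []
    steps = 0
    while steps < step_budget:
        env = next(e for e in envs if e.ready())
        transition = env.step(policy.act(env.observation()))
        batch.append(transition)
        steps += 1
    return batch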
Niranjan Kumar, Irfan Essa, Sehoon Ha
Graph-based Cluttered Scene Generation and Interactive Exploration using Deep Reinforcement Learning Proceedings Article
In: Proceedings International Conference on Robotics and Automation (ICRA), pp. 7521-7527, 2022.
Abstract | Links | BibTeX | Tags: ICRA, machine learning, reinforcement learning, robotics
@inproceedings{2021-Kumar-GCSGIEUDRL,
title = {Graph-based Cluttered Scene Generation and Interactive Exploration using Deep Reinforcement Learning},
author = {Niranjan Kumar and Irfan Essa and Sehoon Ha},
url = {https://doi.org/10.1109/ICRA46639.2022.9811874
https://arxiv.org/abs/2109.10460
https://arxiv.org/pdf/2109.10460
https://www.kniranjankumar.com/projects/5_clutr
https://kniranjankumar.github.io/assets/pdf/graph_based_clutter.pdf
https://youtu.be/T2Jo7wwaXss},
doi = {10.1109/ICRA46639.2022.9811874},
year = {2022},
date = {2022-05-01},
urldate = {2022-05-01},
booktitle = {Proceedings International Conference on Robotics and Automation (ICRA)},
journal = {arXiv},
number = {2109.10460},
pages = {7521-7527},
abstract = {We introduce a novel method to teach a robotic agent to interactively explore cluttered yet structured scenes, such as kitchen pantries and grocery shelves, by leveraging the physical plausibility of the scene. We propose a novel learning framework to train an effective scene exploration policy to discover hidden objects with minimal interactions. First, we define a novel scene grammar to represent structured clutter. Then we train a Graph Neural Network (GNN) based Scene Generation agent using deep reinforcement learning (deep RL), to manipulate this Scene Grammar to create a diverse set of stable scenes, each containing multiple hidden objects. Given such cluttered scenes, we then train a Scene Exploration agent, using deep RL, to uncover hidden objects by interactively rearranging the scene.
},
keywords = {ICRA, machine learning, reinforcement learning, robotics},
pubstate = {published},
tppubtype = {inproceedings}
}
Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song
Learning Temporal Rules from Noisy Timeseries Data Journal Article
In: arXiv preprint arXiv:2202.05403, 2022.
Abstract | Links | BibTeX | Tags: activity recognition, machine learning
@article{2022-Samel-LTRFNTD,
title = {Learning Temporal Rules from Noisy Timeseries Data},
author = {Karan Samel and Zelin Zhao and Binghong Chen and Shuang Li and Dharmashankar Subramanian and Irfan Essa and Le Song},
url = {https://arxiv.org/abs/2202.05403
https://arxiv.org/pdf/2202.05403},
year = {2022},
date = {2022-02-01},
urldate = {2022-02-01},
journal = {arXiv preprint arXiv:2202.05403},
abstract = {Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher level composite events. Examples of a composite event are a patient's medical symptom or a baseball player hitting a home run, caused by distinct temporal orderings of patient vitals and player movements respectively. Such salient composite events are provided as labels in temporal datasets and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP) which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite events labels for supervision. This is done through efficiently searching through the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and healthcare datasets where it outperforms the baseline methods for rule discovery.
},
keywords = {activity recognition, machine learning},
pubstate = {published},
tppubtype = {article}
}
Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa
Discrete Representations Strengthen Vision Transformer Robustness Proceedings Article
In: Proceedings of International Conference on Learning Representations (ICLR), 2022.
Abstract | Links | BibTeX | Tags: computer vision, google, machine learning, vision transformer
@inproceedings{2022-Mao-DRSVTR,
title = {Discrete Representations Strengthen Vision Transformer Robustness},
author = {Chengzhi Mao and Lu Jiang and Mostafa Dehghani and Carl Vondrick and Rahul Sukthankar and Irfan Essa},
url = {https://iclr.cc/virtual/2022/poster/6647
https://arxiv.org/abs/2111.10493
https://research.google/pubs/pub51388/
https://openreview.net/forum?id=8hWs60AZcWk},
doi = {10.48550/arXiv.2111.10493},
year = {2022},
date = {2022-01-28},
urldate = {2022-04-01},
booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
journal = {arXiv preprint arXiv:2111.10493},
abstract = {Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet are overly reliant on local textures and fail to make adequate use of shape information. ViTs thus have difficulties generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder. Different from the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and contain less information individually, which promote ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.},
keywords = {computer vision, google, machine learning, vision transformer},
pubstate = {published},
tppubtype = {inproceedings}
}
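A minimal, hypothetical sketch of the input-layer modification described in the abstract above: embeddings of discrete codes from a (frozen) vector-quantized encoder are used alongside the usual continuous patch tokens. The fusion choice (concatenating the two token streams) and all names are assumptions for illustration, not the paper's exact design.

import torch
import torch.nn as nn

class DiscreteTokenInput(nn.Module):
    def __init__(self, patch_dim, model_dim, codebook_size=1024):
        super().__init__()
        self.pixel_proj = nn.Linear(patch_dim, model_dim)        # continuous pixel tokens
        self.code_embed = nn.Embedding(codebook_size, model_dim) # discrete VQ-code tokens

    def forward(self, patches, code_indices):
        # patches: (B, N, patch_dim) pixel patches; code_indices: (B, N) codes from a frozen VQ encoder
        pixel_tokens = self.pixel_proj(patches)
        discrete_tokens = self.code_embed(code_indices)
        # One simple fusion choice: concatenate both token streams before the transformer.
        return torch.cat([pixel_tokens, discrete_tokens], dim=1)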
Steven Hickson, Karthik Raveendran, Irfan Essa
Sharing Decoders: Network Fission for Multi-Task Pixel Prediction Proceedings Article
In: IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3771–3780, 2022.
Abstract | Links | BibTeX | Tags: computer vision, google, machine learning
@inproceedings{2022-Hickson-SDNFMPP,
title = {Sharing Decoders: Network Fission for Multi-Task Pixel Prediction},
author = {Steven Hickson and Karthik Raveendran and Irfan Essa},
url = {https://openaccess.thecvf.com/content/WACV2022/papers/Hickson_Sharing_Decoders_Network_Fission_for_Multi-Task_Pixel_Prediction_WACV_2022_paper.pdf
https://openaccess.thecvf.com/content/WACV2022/supplemental/Hickson_Sharing_Decoders_Network_WACV_2022_supplemental.pdf
https://youtu.be/qqYODA4C6AU},
doi = {10.1109/WACV51458.2022.00371},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
pages = {3771--3780},
abstract = {We examine the benefits of splitting encoder-decoders for multitask learning and showcase results on three tasks (semantics, surface normals, and depth) while adding very few FLOPS per task. Current hard parameter sharing methods for multi-task pixel-wise labeling use one shared encoder with separate decoders for each task. We generalize this notion and term the splitting of encoder-decoder architectures at different points as fission. Our ablation studies on fission show that sharing most of the decoder layers in multi-task encoder-decoder networks results in improvement while adding far fewer parameters per task. Our proposed method trains faster, uses less memory, results in better accuracy, and uses significantly fewer floating point operations (FLOPS) than conventional multi-task methods, with additional tasks only requiring 0.017% more FLOPS than the single-task network.},
keywords = {computer vision, google, machine learning},
pubstate = {published},
tppubtype = {inproceedings}
}
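The fission idea from the abstract above can be sketched as a single encoder and a mostly shared decoder that splits only at the very end into cheap per-task heads; the module boundaries and names here are hypothetical, not the paper's exact network.

import torch.nn as nn

class FissionNet(nn.Module):
    def __init__(self, encoder, shared_decoder, decoder_channels, task_channels):
        super().__init__()
        self.encoder = encoder                 # shared backbone
        self.shared_decoder = shared_decoder   # decoder layers shared across all tasks
        # Per-task 1x1 heads: the only unshared parameters, adding very few FLOPS per task.
        self.heads = nn.ModuleDict({
            task: nn.Conv2d(decoder_channels, out_ch, kernel_size=1)
            for task, out_ch in task_channels.items()
        })

    def forward(self, x):
        features = self.shared_decoder(self.encoder(x))
        return {task: head(features) for task, head in self.heads.items()}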
Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song
Neural Temporal Logic Programming Technical Report
2021.
Abstract | Links | BibTeX | Tags: activity recognition, arXiv, machine learning, openreview
@techreport{2021-Samel-NTLP,
title = {Neural Temporal Logic Programming},
author = {Karan Samel and Zelin Zhao and Binghong Chen and Shuang Li and Dharmashankar Subramanian and Irfan Essa and Le Song},
url = {https://openreview.net/forum?id=i7h4M45tU8},
year = {2021},
date = {2021-09-01},
urldate = {2021-09-01},
abstract = {Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher-level composite events. Examples of a composite event are a patient's medical symptom or a baseball player hitting a home run, caused by distinct temporal orderings of patient vitals and player movements respectively. Such salient composite events are provided as labels in temporal datasets and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP) which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite events labels for supervision. This is done through efficiently searching through the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and on healthcare data where it outperforms the baseline methods for rule discovery.},
howpublished = {https://openreview.net/forum?id=i7h4M45tU8},
keywords = {activity recognition, arXiv, machine learning, openreview},
pubstate = {published},
tppubtype = {techreport}
}
Harish Haresamudram, Irfan Essa, Thomas Ploetz
Contrastive Predictive Coding for Human Activity Recognition Journal Article
In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 2, pp. 1–26, 2021.
Abstract | Links | BibTeX | Tags: activity recognition, IMWUT, machine learning, ubiquitous computing
@article{2021-Haresamudram-CPCHAR,
title = {Contrastive Predictive Coding for Human Activity Recognition},
author = {Harish Haresamudram and Irfan Essa and Thomas Ploetz},
url = {https://doi.org/10.1145/3463506
https://arxiv.org/abs/2012.05333},
doi = {10.1145/3463506},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
booktitle = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
volume = {5},
number = {2},
pages = {1--26},
abstract = {Feature extraction is crucial for human activity recognition (HAR) using body-worn movement sensors. Recently, learned representations have been used successfully, offering promising alternatives to manually engineered features. Our work focuses on effective use of small amounts of labeled data and the opportunistic exploitation of unlabeled data that are straightforward to collect in mobile and ubiquitous computing scenarios. We hypothesize and demonstrate that explicitly considering the temporality of sensor data at representation level plays an important role for effective HAR in challenging scenarios. We introduce the Contrastive Predictive Coding (CPC) framework to human activity recognition, which captures the long-term temporal structure of sensor data streams. Through a range of experimental evaluations on real-life recognition tasks, we demonstrate its effectiveness for improved HAR. CPC-based pre-training is self-supervised, and the resulting learned representations can be integrated into standard activity chains. It leads to significantly improved recognition performance when only small amounts of labeled training data are available, thereby demonstrating the practical value of our approach.},
keywords = {activity recognition, IMWUT, machine learning, ubiquitous computing},
pubstate = {published},
tppubtype = {article}
}
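A minimal sketch of how Contrastive Predictive Coding can be set up on a body-worn sensor stream: a convolutional encoder produces per-timestep latents, a GRU summarizes the past into a context vector, and per-offset heads predict future latents (to be scored against negatives with an InfoNCE loss, omitted here). Architecture sizes and names are assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class CPCSensor(nn.Module):
    def __init__(self, in_ch=3, latent=128, context=256, k_future=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv1d(in_ch, latent, 5, padding=2), nn.ReLU())
        self.context_rnn = nn.GRU(latent, context, batch_first=True)
        # One linear head per future offset, used by the (omitted) InfoNCE loss.
        self.predictors = nn.ModuleList([nn.Linear(context, latent) for _ in range(k_future)])

    def forward(self, x):                     # x: (B, C, T) accelerometer stream
        z = self.encoder(x).transpose(1, 2)   # (B, T, latent) per-timestep latents
        c, _ = self.context_rnn(z)            # (B, T, context) summaries of the past
        return z, c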
Harish Haresamudram, Apoorva Beedu, Varun Agrawal, Patrick L Grady, Irfan Essa, Judy Hoffman, Thomas Plötz
Masked reconstruction based self-supervision for human activity recognition Proceedings Article
In: Proceedings of the International Symposium on Wearable Computers (ISWC), pp. 45–49, 2020.
Abstract | Links | BibTeX | Tags: activity recognition, ISWC, machine learning, wearable computing
@inproceedings{2020-Haresamudram-MRBSHAR,
title = {Masked reconstruction based self-supervision for human activity recognition},
author = {Harish Haresamudram and Apoorva Beedu and Varun Agrawal and Patrick L Grady and Irfan Essa and Judy Hoffman and Thomas Plötz},
url = {https://dl.acm.org/doi/10.1145/3410531.3414306
https://harkash.github.io/publication/masked-reconstruction
https://arxiv.org/abs/2202.12938},
doi = {10.1145/3410531.3414306},
year = {2020},
date = {2020-09-01},
urldate = {2020-09-01},
booktitle = {Proceedings of the International Symposium on Wearable Computers (ISWC)},
pages = {45--49},
abstract = {The ubiquitous availability of wearable sensing devices has rendered large scale collection of movement data a straightforward endeavor. Yet, annotation of these data remains a challenge and as such, publicly available datasets for human activity recognition (HAR) are typically limited in size as well as in variability, which constrains HAR model training and effectiveness. We introduce masked reconstruction as a viable self-supervised pre-training objective for human activity recognition and explore its effectiveness in comparison to state-of-the-art unsupervised learning techniques. In scenarios with small labeled datasets, the pre-training results in improvements over end-to-end learning on two of the four benchmark datasets. This is promising because the pre-training objective can be integrated "as is" into state-of-the-art recognition pipelines to effectively facilitate improved model robustness, and thus, ultimately, leading to better recognition performance.
},
keywords = {activity recognition, ISWC, machine learning, wearable computing},
pubstate = {published},
tppubtype = {inproceedings}
}
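The masked-reconstruction pretext task described above can be summarized in a few lines: randomly zero out timesteps of a sensor window and train the network to reconstruct them, computing the loss only at the masked positions. This is an illustrative sketch with hypothetical shapes, not the authors' code.

import torch

def masked_reconstruction_loss(model, x, mask_prob=0.1):
    # x: (B, T, C) windows of body-worn sensor data
    mask = torch.rand(x.shape[:2], device=x.device) < mask_prob   # (B, T) masked timesteps
    corrupted = x.clone()
    corrupted[mask] = 0.0
    recon = model(corrupted)                  # model predicts the full window
    # Loss is computed only on the masked timesteps.
    return ((recon - x)[mask] ** 2).mean()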
Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar
Category learning neural networks Patent
2020.
Abstract | Links | BibTeX | Tags: google, machine learning, patents
@patent{2020-Hickson-CLNN,
title = {Category learning neural networks},
author = {Steven Hickson and Anelia Angelova and Irfan Essa and Rahul Sukthankar},
url = {https://patents.google.com/patent/US10635979},
year = {2020},
date = {2020-04-28},
urldate = {2020-04-28},
publisher = {(US Patent # 10635979)},
abstract = {Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining a clustering of images into a plurality of semantic categories. In one aspect, a method comprises: training a categorization neural network, comprising, at each of a plurality of iterations: processing an image depicting an object using the categorization neural network to generate (i) a current prediction for whether the image depicts an object or a background region, and (ii) a current embedding of the image; determining a plurality of current cluster centers based on the current values of the categorization neural network parameters, wherein each cluster center represents a respective semantic category; and determining a gradient of an objective function that includes a classification loss and a clustering loss, wherein the clustering loss depends on a similarity between the current embedding of the image and the current cluster centers.
},
howpublished = {US Patent #10635979},
keywords = {google, machine learning, patents},
pubstate = {published},
tppubtype = {patent}
}
Unaiza Ahsan, Rishi Madhok, Irfan Essa
Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition Proceedings Article
In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 179-189, 2019, ISSN: 1550-5790.
Links | BibTeX | Tags: activity recognition, computer vision, machine learning, WACV
@inproceedings{2019-Ahsan-VJULSCVAR,
title = {Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition},
author = {Unaiza Ahsan and Rishi Madhok and Irfan Essa},
url = {https://ieeexplore.ieee.org/abstract/document/8659002},
doi = {10.1109/WACV.2019.00025},
issn = {1550-5790},
year = {2019},
date = {2019-01-01},
urldate = {2019-01-01},
booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
pages = {179-189},
keywords = {activity recognition, computer vision, machine learning, WACV},
pubstate = {published},
tppubtype = {inproceedings}
}
Unaiza Ahsan, Rishi Madhok, Irfan Essa
Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition Journal Article
In: arXiv, no. arXiv:1808.07507, 2018.
BibTeX | Tags: activity recognition, computer vision, machine learning
@article{2018-Ahsan-VJULSCVAR,
title = {Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition},
author = {Unaiza Ahsan and Rishi Madhok and Irfan Essa},
year = {2018},
date = {2018-08-01},
journal = {arXiv},
number = {arXiv:1808.07507},
keywords = {activity recognition, computer vision, machine learning},
pubstate = {published},
tppubtype = {article}
}
Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar
Object category learning and retrieval with weak supervision Technical Report
no. arXiv:1801.08985, 2018.
Abstract | Links | BibTeX | Tags: arXiv, computer vision, machine learning, object detection
@techreport{2018-Hickson-OCLRWWS,
title = {Object category learning and retrieval with weak supervision},
author = {Steven Hickson and Anelia Angelova and Irfan Essa and Rahul Sukthankar},
url = {https://arxiv.org/abs/1801.08985
https://arxiv.org/pdf/1801.08985},
doi = {10.48550/arXiv.1801.08985},
year = {2018},
date = {2018-07-01},
urldate = {2018-07-01},
journal = {arXiv},
number = {arXiv:1801.08985},
abstract = {We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision. To that end, we propose a fully differentiable unsupervised deep clustering approach to learn semantic classes in an end-to-end fashion without individual class labeling using only unlabeled object proposals. The key contributions of our work are 1) a kmeans clustering objective where the clusters are learned as parameters of the network and are represented as memory units, and 2) simultaneously building a feature representation, or embedding, while learning to cluster it. This approach shows promising results on two popular computer vision datasets: on CIFAR10 for clustering objects, and on the more complex and challenging Cityscapes dataset for semantically discovering classes which visually correspond to cars, people, and bicycles. Currently, the only supervision provided is segmentation objectness masks, but this method can be extended to use an unsupervised objectness-based object generation mechanism which will make the approach completely unsupervised.
},
howpublished = {arXiv:1801.08985},
keywords = {arXiv, computer vision, machine learning, object detection},
pubstate = {published},
tppubtype = {techreport}
}
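A toy sketch of the clustering objective described above, under the assumption that cluster centers are stored as network parameters ("memory units") and a k-means-style loss pulls each embedding toward its nearest center; the full method also combines this with a classification/objectness loss, omitted here, and all names are illustrative.

import torch
import torch.nn as nn

class DeepClustering(nn.Module):
    def __init__(self, embed_net, num_clusters, embed_dim):
        super().__init__()
        self.embed_net = embed_net                       # produces image/proposal embeddings
        self.centers = nn.Parameter(torch.randn(num_clusters, embed_dim))  # learned cluster centers

    def clustering_loss(self, x):
        z = self.embed_net(x)                            # (B, D) embeddings
        dists = torch.cdist(z, self.centers)             # (B, K) distances to each center
        return dists.min(dim=1).values.pow(2).mean()     # squared distance to nearest center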
Unaiza Ahsan, Chen Sun, Irfan Essa
DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks Journal Article
In: arXiv, no. arXiv:1801.07230, 2018.
BibTeX | Tags: activity recognition, computer vision, machine learning
@article{2018-Ahsan-DSARFVUGAN,
title = {DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks},
author = {Unaiza Ahsan and Chen Sun and Irfan Essa},
year = {2018},
date = {2018-01-01},
journal = {arXiv},
number = {arXiv:1801.07230},
keywords = {activity recognition, computer vision, machine learning},
pubstate = {published},
tppubtype = {article}
}
Unaiza Ahsan, Munmun De Choudhury, Irfan Essa
Towards Using Visual Attributes to Infer Image Sentiment Of Social Events Proceedings Article
In: Proceedings of The International Joint Conference on Neural Networks, International Neural Network Society, Anchorage, Alaska, US, 2017.
Abstract | Links | BibTeX | Tags: computational journalism, computer vision, IJNN, machine learning
@inproceedings{2017-Ahsan-TUVAIISSE,
title = {Towards Using Visual Attributes to Infer Image Sentiment Of Social Events},
author = {Unaiza Ahsan and Munmun De Choudhury and Irfan Essa},
url = {https://ieeexplore.ieee.org/abstract/document/7966013},
doi = {10.1109/IJCNN.2017.7966013},
year = {2017},
date = {2017-05-01},
urldate = {2017-05-01},
booktitle = {Proceedings of The International Joint Conference on Neural Networks},
publisher = {International Neural Network Society},
address = {Anchorage, Alaska, US},
abstract = {Widespread and pervasive adoption of smartphones has led to instant sharing of photographs that capture events ranging from mundane to life-altering happenings. We propose to capture sentiment information of such social event images leveraging their visual content. Our method extracts an intermediate visual representation of social event images based on the visual attributes that occur in the images going beyond sentiment-specific attributes. We map the top predicted attributes to sentiments and extract the dominant emotion associated with a picture of a social event. Unlike recent approaches, our method generalizes to a variety of social events and even to unseen events, which are not available at training time. We demonstrate the effectiveness of our approach on a challenging social event image dataset and our method outperforms state-of-the-art approaches for classifying complex event images into sentiments.
},
keywords = {computational journalism, computer vision, IJNN, machine learning},
pubstate = {published},
tppubtype = {inproceedings}
}
Unaiza Ahsan, Chen Sun, James Hays, Irfan Essa
Complex Event Recognition from Images with Few Training Examples Proceedings Article
In: IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
Abstract | Links | BibTeX | Tags: activity recognition, computer vision, machine learning, WACV
@inproceedings{2017-Ahsan-CERFIWTE,
title = {Complex Event Recognition from Images with Few Training Examples},
author = {Unaiza Ahsan and Chen Sun and James Hays and Irfan Essa},
url = {https://arxiv.org/abs/1701.04769
https://www.computer.org/csdl/proceedings-article/wacv/2017/07926663/12OmNzZEAzy},
doi = {10.1109/WACV.2017.80},
year = {2017},
date = {2017-03-01},
urldate = {2017-03-01},
booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
abstract = {We propose to leverage concept-level representations for complex event recognition in photographs given limited training examples. We introduce a novel framework to discover event concept attributes from the web and use that to extract semantic features from images and classify them into social event categories with few training examples. Discovered concepts include a variety of objects, scenes, actions and event sub-types, leading to a discriminative and compact representation for event images. Web images are obtained for each discovered event concept and we use (pretrained) CNN features to train concept classifiers. Extensive experiments on challenging event datasets demonstrate that our proposed method outperforms several baselines using deep CNN features directly in classifying images into events with limited training examples. We also demonstrate that our method achieves the best overall accuracy on a dataset with unseen event categories using a single training example.
},
keywords = {activity recognition, computer vision, machine learning, WACV},
pubstate = {published},
tppubtype = {inproceedings}
}
Daniel Castro, Steven Hickson, Vinay Bettadapura, Edison Thomaz, Gregory Abowd, Henrik Christensen, Irfan Essa
Predicting Daily Activities from Egocentric Images Using Deep Learning Proceedings Article
In: Proceedings of International Symposium on Wearable Computers (ISWC), 2015.
Abstract | Links | BibTeX | Tags: activity recognition, computer vision, ISWC, machine learning, wearable computing
@inproceedings{2015-Castro-PDAFEIUDL,
title = {Predicting Daily Activities from Egocentric Images Using Deep Learning},
author = {Daniel Castro and Steven Hickson and Vinay Bettadapura and Edison Thomaz and Gregory Abowd and Henrik Christensen and Irfan Essa},
url = {https://dl.acm.org/doi/10.1145/2802083.2808398
https://arxiv.org/abs/1510.01576
http://www.cc.gatech.edu/cpl/projects/dailyactivities/
},
doi = {10.1145/2802083.2808398},
year = {2015},
date = {2015-09-01},
urldate = {2015-09-01},
booktitle = {Proceedings of International Symposium on Wearable Computers (ISWC)},
abstract = {We present a method to analyze images taken from a passive egocentric wearable camera along with contextual information, such as time and day of the week, to learn and predict the everyday activities of an individual. We collected a dataset of 40,103 egocentric images over 6 months with 19 activity classes and demonstrate the benefit of state-of-the-art deep learning techniques for learning and predicting daily activities. Classification is conducted using a Convolutional Neural Network (CNN) with a classification method we introduce called a late fusion ensemble. This late fusion ensemble incorporates relevant contextual information and increases our classification accuracy. Our technique achieves an overall accuracy of 83.07% in predicting a person's activity across the 19 activity classes. We also demonstrate some promising results from two additional users by fine-tuning the classifier with one day of training data.},
keywords = {activity recognition, computer vision, ISWC, machine learning, wearable computing},
pubstate = {published},
tppubtype = {inproceedings}
}
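The late fusion ensemble idea from the abstract above can be sketched as a small classifier that combines the CNN's class probabilities with contextual features such as time of day and day of week; layer sizes and names below are hypothetical, not the paper's exact configuration.

import torch
import torch.nn as nn

class LateFusionEnsemble(nn.Module):
    def __init__(self, cnn, num_classes=19, context_dim=2):
        super().__init__()
        self.cnn = cnn                                    # image-based activity CNN
        self.fusion = nn.Sequential(
            nn.Linear(num_classes + context_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes))

    def forward(self, image, context):
        probs = torch.softmax(self.cnn(image), dim=-1)    # CNN class probabilities
        return self.fusion(torch.cat([probs, context], dim=-1))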
Edison Thomaz, Irfan Essa, Gregory Abowd
A Practical Approach for Recognizing Eating Moments with Wrist-Mounted Inertial Sensing Proceedings Article
In: ACM International Conference on Ubiquitous Computing (UBICOMP), 2015.
Abstract | Links | BibTeX | Tags: activity recognition, computational health, machine learning, Ubicomp, ubiquitous computing
@inproceedings{2015-Thomaz-PAREMWWIS,
title = {A Practical Approach for Recognizing Eating Moments with Wrist-Mounted Inertial Sensing},
author = {Edison Thomaz and Irfan Essa and Gregory Abowd},
url = {https://dl.acm.org/doi/10.1145/2750858.2807545},
doi = {10.1145/2750858.2807545},
year = {2015},
date = {2015-09-01},
urldate = {2015-09-01},
booktitle = {ACM International Conference on Ubiquitous Computing (UBICOMP)},
abstract = {Recognizing when eating activities take place is one of the key challenges in automated food intake monitoring. Despite progress over the years, most proposed approaches have been largely impractical for everyday usage, requiring multiple on-body sensors or specialized devices such as neck collars for swallow detection. In this paper, we describe the implementation and evaluation of an approach for inferring eating moments based on 3-axis accelerometry collected with a popular off-the-shelf smartwatch. Trained with data collected in a semi-controlled laboratory setting with 20 subjects, our system recognized eating moments in two free-living condition studies (7 participants, 1 day; 1 participant, 31 days), with F-scores of 76.1% (66.7% Precision, 88.8% Recall), and 71.3% (65.2% Precision, 78.6% Recall). This work represents a contribution towards the implementation of a practical, automated system for everyday food intake monitoring, with applicability in areas ranging from health research and food journaling.
},
keywords = {activity recognition, computational health, machine learning, Ubicomp, ubiquitous computing},
pubstate = {published},
tppubtype = {inproceedings}
}
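A hypothetical sketch of this kind of pipeline: sliding windows over 3-axis wrist accelerometry, simple per-window statistics, and an off-the-shelf classifier. The specific features, window length, and classifier here are illustrative assumptions, not the paper's exact method.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(acc, win=128, step=64):
    # acc: (T, 3) accelerometer samples; returns one feature row per sliding window.
    feats = []
    for start in range(0, len(acc) - win + 1, step):
        w = acc[start:start + win]
        feats.append(np.concatenate([w.mean(0), w.std(0), w.min(0), w.max(0)]))
    return np.array(feats)

# Hypothetical usage, with one eating / non-eating label per window:
# clf = RandomForestClassifier(n_estimators=100).fit(window_features(acc), labels)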
Other Publication Sites
A few more sites that aggregate research publications: Academia.edu, BibSonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.