A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is
Publications:
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama
Photorealistic Video Generation with Diffusion Models Proceedings Article
In: European Conference on Computer Vision (ECCV), 2024.
@inproceedings{2024-Gupta-PVGWDM,
title = {Photorealistic Video Generation with Diffusion Models},
author = {Agrim Gupta and Lijun Yu and Kihyuk Sohn and Xiuye Gu and Meera Hahn and Li Fei-Fei and Irfan Essa and Lu Jiang and José Lezama
},
url = {https://walt-video-diffusion.github.io/
https://arxiv.org/abs/2312.06662
https://arxiv.org/pdf/2312.06662
},
doi = {10.48550/arXiv.2312.06662},
year = {2024},
date = {2024-07-25},
urldate = {2024-07-25},
booktitle = {European Conference on Computer Vision (ECCV)},
abstract = {We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of 512×896 resolution at 8 frames per second.},
keywords = {arXiv, computational video, computer vision, generative AI, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang
VideoPoet: A large language model for zero-shot video generation Best Paper Proceedings Article
In: Proceedings of International Conference on Machine Learning (ICML), 2024.
@inproceedings{2024-Kondratyuk-VLLMZVG,
title = {VideoPoet: A large language model for zero-shot video generation},
author = {Dan Kondratyuk and Lijun Yu and Xiuye Gu and José Lezama and Jonathan Huang and Grant Schindler and Rachel Hornung and Vighnesh Birodkar and Jimmy Yan and Ming-Chang Chiu and Krishna Somandepalli and Hassan Akbari and Yair Alon and Yong Cheng and Josh Dillon and Agrim Gupta and Meera Hahn and Anja Hauth and David Hendon and Alonso Martinez and David Minnen and Mikhail Sirotenko and Kihyuk Sohn and Xuan Yang and Hartwig Adam and Ming-Hsuan Yang and Irfan Essa and Huisheng Wang and David A. Ross and Bryan Seybold and Lu Jiang
},
url = {https://arxiv.org/pdf/2312.14125
https://arxiv.org/abs/2312.14125
https://sites.research.google/videopoet/},
doi = {10.48550/arXiv.2312.14125},
year = {2024},
date = {2024-07-23},
urldate = {2024-07-23},
booktitle = {Proceedings of International Conference on Machine Learning (ICML)},
abstract = {We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
},
keywords = {arXiv, best paper award, computational video, computer vision, generative AI, google, ICML},
pubstate = {published},
tppubtype = {inproceedings}
}
Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs Proceedings Article
In: Advances in Neural Information Processing Systems (NeurIPS), 2023.
@inproceedings{2023-Yu-SSPAMGWFL,
title = {SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs},
author = {Lijun Yu and Yong Cheng and Zhiruo Wang and Vivek Kumar and Wolfgang Macherey and Yanping Huang and David A. Ross and Irfan Essa and Yonatan Bisk and Ming-Hsuan Yang and Kevin Murphy and Alexander G. Hauptmann and Lu Jiang},
url = {https://arxiv.org/abs/2306.17842
https://openreview.net/forum?id=CXPUg86A1D
https://proceedings.neurips.cc/paper_files/paper/2023/hash/a526cc8f6ffb74bedb6ff313e3fdb450-Abstract-Conference.html},
doi = {10.48550/arXiv.2306.17842},
year = {2023},
date = {2023-12-11},
urldate = {2023-12-11},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
abstract = {In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.},
howpublished = {Advances in Neural Information Processing Systems (NeurIPS) (arXiv:2306.17842v2)},
keywords = {arXiv, computational video, computer vision, generative AI, NeurIPS},
pubstate = {published},
tppubtype = {inproceedings}
}
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
MAGVIT: Masked Generative Video Transformer Proceedings Article
In: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
@inproceedings{2023-Yu-MMGVT,
title = {MAGVIT: Masked Generative Video Transformer},
author = {Lijun Yu and Yong Cheng and Kihyuk Sohn and José Lezama and Han Zhang and Huiwen Chang and Alexander G. Hauptmann and Ming-Hsuan Yang and Yuan Hao and Irfan Essa and Lu Jiang},
url = {https://arxiv.org/abs/2212.05199
https://magvit.cs.cmu.edu/
https://openaccess.thecvf.com/content/CVPR2023/papers/Yu_MAGVIT_Masked_Generative_Video_Transformer_CVPR_2023_paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/supplemental/Yu_MAGVIT_Masked_Generative_CVPR_2023_supplemental.pdf},
doi = {10.48550/ARXIV.2212.05199},
year = {2023},
date = {2023-06-01},
urldate = {2023-06-01},
booktitle = {IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)},
abstract = {We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu/.},
keywords = {computational video, computer vision, CVPR, generative AI, generative media, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
End-to-end Multimodal Representation Learning for Video Dialog Proceedings Article
In: NeurIPS Workshop on Vision Transformers: Theory and Applications, 2022.
@inproceedings{2022-Alamri-EMRLVD,
title = {End-to-end Multimodal Representation Learning for Video Dialog},
author = {Huda Alamri and Anthony Bilic and Michael Hu and Apoorva Beedu and Irfan Essa},
url = {https://arxiv.org/abs/2210.14512},
doi = {10.48550/arXiv.2210.14512},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS Workshop on Vision Transformers: Theory and Applications},
abstract = {The video-based dialog task is a challenging multimodal learning task that has received increasing attention over the past few years, with state-of-the-art models obtaining new performance records. This progress is largely powered by the adaptation of the more powerful transformer-based language encoders. Despite this progress, existing approaches do not effectively utilize visual features to help solve tasks. Recent studies show that state-of-the-art models are biased towards textual information rather than visual cues. In order to better leverage the available visual information, this study proposes a new framework that combines a 3D-CNN network and transformer-based networks into a single visual encoder to extract more robust semantic representations from videos. The visual encoder is jointly trained end-to-end with other input modalities such as text and audio. Experiments on the AVSD task show significant improvement over baselines in both generative and retrieval tasks.},
keywords = {computational video, computer vision, vision transformers},
pubstate = {published},
tppubtype = {inproceedings}
}
Peggy Chi, Tao Dong, Christian Frueh, Brian Colonna, Vivek Kwatra, Irfan Essa
Synthesis-Assisted Video Prototyping From a Document Proceedings Article
In: Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pp. 1–10, 2022.
@inproceedings{2022-Chi-SVPFD,
title = {Synthesis-Assisted Video Prototyping From a Document},
author = {Peggy Chi and Tao Dong and Christian Frueh and Brian Colonna and Vivek Kwatra and Irfan Essa},
url = {https://research.google/pubs/pub51631/
https://dl.acm.org/doi/abs/10.1145/3526113.3545676},
doi = {10.1145/3526113.3545676},
year = {2022},
date = {2022-10-01},
urldate = {2022-10-01},
booktitle = {Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology},
pages = {1--10},
abstract = {Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source materials come from a written document -- such as a web tutorial -- it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain -- programming cookbooks -- we apply visual elements from the source document, such as a keyword, a code snippet or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.},
keywords = {computational video, generative media, google, human-computer interaction, UIST, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
Nathan Frey, Peggy Chi, Weilong Yang, Irfan Essa
Automatic Style Transfer for Non-Linear Video Editing Proceedings Article
In: Proceedings of CVPR Workshop on AI for Content Creation (AICC), 2021.
@inproceedings{2021-Frey-ASTNVE,
title = {Automatic Style Transfer for Non-Linear Video Editing},
author = {Nathan Frey and Peggy Chi and Weilong Yang and Irfan Essa},
url = {https://arxiv.org/abs/2105.06988
https://research.google/pubs/pub50449/},
doi = {10.48550/arXiv.2105.06988},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
booktitle = {Proceedings of CVPR Workshop on AI for Content Creation (AICC)},
keywords = {computational video, CVPR, google, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
Unsupervised Discovery of Actions in Instructional Videos Proceedings Article
In: British Machine Vision Conference (BMVC), 2021.
@inproceedings{2021-Piergiovanni-UDAIV,
title = {Unsupervised Discovery of Actions in Instructional Videos},
author = {AJ Piergiovanni and Anelia Angelova and Michael S. Ryoo and Irfan Essa},
url = {https://arxiv.org/abs/2106.14733
https://www.bmvc2021-virtualconference.com/assets/papers/0773.pdf},
doi = {10.48550/arXiv.2106.14733},
year = {2021},
date = {2021-06-01},
urldate = {2021-06-01},
booktitle = {British Machine Vision Conference (BMVC)},
number = {arXiv:2106.14733},
abstract = {In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos. Instructional videos contain complex activities and are a rich source of information for intelligent agents, such as autonomous robots or virtual assistants, which can, for example, automatically `read' the steps from an instructional video and execute them. However, videos are rarely annotated with atomic activities, their boundaries or duration. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos. We propose a sequential stochastic autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task, and which provides automatic and unsupervised self-labeling for videos. Our approach outperforms the state-of-the-art unsupervised methods with large margins. We will open source the code.
},
keywords = {activity recognition, computational video, computer vision, google},
pubstate = {published},
tppubtype = {inproceedings}
}
Anh Truong, Peggy Chi, David Salesin, Irfan Essa, Maneesh Agrawala
Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos Proceedings Article
In: ACM CHI Conference on Human Factors in Computing Systems, 2021.
@inproceedings{2021-Truong-AGTHTFIMV,
title = {Automatic Generation of Two-Level Hierarchical Tutorials from Instructional Makeup Videos},
author = {Anh Truong and Peggy Chi and David Salesin and Irfan Essa and Maneesh Agrawala},
url = {https://dl.acm.org/doi/10.1145/3411764.3445721
https://research.google/pubs/pub50007/
http://anhtruong.org/makeup_breakdown/},
doi = {10.1145/3411764.3445721},
year = {2021},
date = {2021-05-01},
urldate = {2021-05-01},
booktitle = {ACM CHI Conference on Human Factors in Computing Systems},
abstract = {We present a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos. Our approach is inspired by prior research in cognitive psychology, which suggests that people mentally segment procedural tasks into event hierarchies, where coarse-grained events focus on objects while fine-grained events focus on actions. In the instructional makeup domain, we find that objects correspond to facial parts while fine-grained steps correspond to actions on those facial parts. Given an input instructional makeup video, we apply a set of heuristics that combine computer vision techniques with transcript text analysis to automatically identify the fine-level action steps and group these steps by facial part to form the coarse-level events. We provide a voice-enabled, mixed-media UI to visualize the resulting hierarchy and allow users to efficiently navigate the tutorial (e.g., skip ahead, return to previous steps) at their own pace. Users can navigate the hierarchy at both the facial-part and action-step levels using click-based interactions and voice commands. We demonstrate the effectiveness of segmentation algorithms and the resulting mixed-media UI on a variety of input makeup videos. A user study shows that users prefer following instructional makeup videos in our mixed-media format to the standard video UI and that they find our format much easier to navigate.},
keywords = {CHI, computational video, google, human-computer interaction, video summarization},
pubstate = {published},
tppubtype = {inproceedings}
}
Peggy Chi, Zheng Sun, Katrina Panovich, Irfan Essa
Automatic Video Creation From a Web Page Proceedings Article
In: Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pp. 279–292, ACM 2020.
@inproceedings{2020-Chi-AVCFP,
title = {Automatic Video Creation From a Web Page},
author = {Peggy Chi and Zheng Sun and Katrina Panovich and Irfan Essa},
url = {https://dl.acm.org/doi/abs/10.1145/3379337.3415814
https://research.google/pubs/pub49618/
https://ai.googleblog.com/2020/10/experimenting-with-automatic-video.html
https://www.youtube.com/watch?v=3yFYc-Wet8k},
doi = {10.1145/3379337.3415814},
year = {2020},
date = {2020-10-01},
urldate = {2020-10-01},
booktitle = {Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology},
pages = {279--292},
organization = {ACM},
abstract = {Creating marketing videos from scratch can be challenging, especially when designing for multiple platforms with different viewing criteria. We present URL2Video, an automatic approach that converts a web page into a short video given temporal and visual constraints. URL2Video captures quality materials and design styles extracted from a web page, including fonts, colors, and layouts. Using constraint programming, URL2Video's design engine organizes the visual assets into a sequence of shots and renders to a video with user-specified aspect ratio and duration. Creators can review the video composition, modify constraints, and generate video variation through a user interface. We learned the design process from designers and compared our automatically generated results with their creation through interviews and an online survey. The evaluation shows that URL2Video effectively extracted design elements from a web page and supported designers by bootstrapping the video creation process.},
keywords = {computational video, google, human-computer interaction, UIST, video editing},
pubstate = {published},
tppubtype = {inproceedings}
}
Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
Audio Visual Scene-Aware Dialog Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
@inproceedings{2019-Alamri-AVSD,
title = {Audio Visual Scene-Aware Dialog},
author = {Huda Alamri and Vincent Cartillier and Abhishek Das and Jue Wang and Anoop Cherian and Irfan Essa and Dhruv Batra and Tim K. Marks and Chiori Hori and Peter Anderson and Stefan Lee and Devi Parikh},
url = {https://openaccess.thecvf.com/content_CVPR_2019/papers/Alamri_Audio_Visual_Scene-Aware_Dialog_CVPR_2019_paper.pdf
https://video-dialog.com/
https://arxiv.org/abs/1901.09107},
doi = {10.1109/CVPR.2019.00774},
year = {2019},
date = {2019-06-01},
urldate = {2019-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
},
keywords = {computational video, computer vision, CVPR, embodied agents, vision & language},
pubstate = {published},
tppubtype = {inproceedings}
}
Vinay Bettadapura, Caroline Pantofaru, Irfan Essa
Leveraging Contextual Cues for Generating Basketball Highlights Proceedings Article
In: ACM International Conference on Multimedia (ACM-MM), ACM 2016.
@inproceedings{2016-Bettadapura-LCCGBH,
title = {Leveraging Contextual Cues for Generating Basketball Highlights},
author = {Vinay Bettadapura and Caroline Pantofaru and Irfan Essa},
url = {https://dl.acm.org/doi/10.1145/2964284.2964286
http://www.vbettadapura.com/highlights/basketball/index.htm},
doi = {10.1145/2964284.2964286},
year = {2016},
date = {2016-10-01},
urldate = {2016-10-01},
booktitle = {ACM International Conference on Multimedia (ACM-MM)},
organization = {ACM},
abstract = {The massive growth of sports videos has resulted in a need for automatic generation of sports highlights that are comparable in quality to the hand-edited highlights produced by broadcasters such as ESPN. Unlike previous works that mostly use audio-visual cues derived from the video, we propose an approach that additionally leverages contextual cues derived from the environment that the game is being played in. The contextual cues provide information about the excitement levels in the game, which can be ranked and selected to automatically produce high-quality basketball highlights. We introduce a new dataset of 25 NCAA games along with their play-by-play stats and the ground-truth excitement data for each basket. We explore the informativeness of five different cues derived from the video and from the environment through user studies. Our experiments show that for our study participants, the highlights produced by our system are comparable to the ones produced by ESPN for the same games.},
keywords = {ACM, ACMMM, activity recognition, computational video, computer vision, sports visualization, video summarization},
pubstate = {published},
tppubtype = {inproceedings}
}
Daniel Castro, Vinay Bettadapura, Irfan Essa
Discovering Picturesque Highlights from Egocentric Vacation Video Proceedings Article
In: IEEE Winter Conference on Applications of Computer Vision (WACV), 2016.
@inproceedings{2016-Castro-DPHFEVV,
title = {Discovering Picturesque Highlights from Egocentric Vacation Video},
author = {Daniel Castro and Vinay Bettadapura and Irfan Essa},
url = {https://ieeexplore.ieee.org/document/7477707
http://www.cc.gatech.edu/cpl/projects/egocentrichighlights/
https://youtu.be/lIONi21y-mk},
doi = {10.1109/WACV.2016.7477707},
year = {2016},
date = {2016-03-01},
urldate = {2016-03-01},
booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
abstract = {We present an approach for identifying picturesque highlights from large amounts of egocentric video data. Given a set of egocentric videos captured over the course of a vacation, our method analyzes the videos and looks for images that have good picturesque and artistic properties. We introduce novel techniques to automatically determine aesthetic features such as composition, symmetry and color vibrancy in egocentric videos and rank the video frames based on their photographic qualities to generate highlights. Our approach also uses contextual information such as GPS, when available, to assess the relative importance of each geographic location where the vacation videos were shot. Furthermore, we specifically leverage the properties of egocentric videos to improve our highlight detection. We demonstrate results on a new egocentric vacation dataset which includes 26.5 hours of videos taken over a 14-day vacation that spans many famous tourist destinations and also provide results from a user study to assess our results.
},
keywords = {computational photography, computational video, computer vision, WACV},
pubstate = {published},
tppubtype = {inproceedings}
}
Steven Hickson, Stan Birchfield, Irfan Essa, Henrik Christensen
Efficient Hierarchical Graph-Based Segmentation of RGBD Videos Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society 2014.
@inproceedings{2014-Hickson-EHGSRV,
title = {Efficient Hierarchical Graph-Based Segmentation of RGBD Videos},
author = {Steven Hickson and Stan Birchfield and Irfan Essa and Henrik Christensen},
url = {http://www.cc.gatech.edu/cpl/projects/4dseg},
year = {2014},
date = {2014-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
organization = {IEEE Computer Society},
keywords = {computational video, computer vision, CVPR, video segmentation},
pubstate = {published},
tppubtype = {inproceedings}
}
Syed Hussain Raza, Matthias Grundmann, Irfan Essa
Geometric Context from Video Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society 2013.
@inproceedings{2013-Raza-GCFV,
title = {Geometric Context from Video},
author = {Syed Hussain Raza and Matthias Grundmann and Irfan Essa},
url = {http://www.cc.gatech.edu/cpl/projects/videogeometriccontext/},
doi = {10.1109/CVPR.2013.396},
year = {2013},
date = {2013-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
organization = {IEEE Computer Society},
keywords = {computational video, computer vision, CVPR, video segmentation},
pubstate = {published},
tppubtype = {inproceedings}
}
Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa
Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society 2013.
@inproceedings{2013-Bettadapura-ABDDTSIAR,
title = {Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition},
author = {Vinay Bettadapura and Grant Schindler and Thomas Ploetz and Irfan Essa},
url = {http://www.cc.gatech.edu/cpl/projects/abow/},
doi = {10.1109/CVPR.2013.338},
year = {2013},
date = {2013-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
organization = {IEEE Computer Society},
keywords = {activity recognition, computational video, computer vision, CVPR},
pubstate = {published},
tppubtype = {inproceedings}
}
Matthias Grundmann, Vivek Kwatra, Daniel Castro, Irfan Essa
Calibration-Free Rolling Shutter Removal Best Paper Proceedings Article
In: IEEE Conference on Computational Photography (ICCP), IEEE Computer Society, 2012.
@inproceedings{2012-Grundmann-CRSR,
title = {Calibration-Free Rolling Shutter Removal},
author = {Matthias Grundmann and Vivek Kwatra and Daniel Castro and Irfan Essa},
url = {http://www.cc.gatech.edu/cpl/projects/rollingshutter/
https://research.google.com/pubs/archive/37744.pdf
https://youtu.be/_Pr_fpbAok8},
doi = {10.1109/ICCPhot.2012.6215213},
year = {2012},
date = {2012-01-01},
urldate = {2012-01-01},
booktitle = {IEEE Conference on Computational Photography (ICCP)},
publisher = {IEEE Computer Society},
abstract = {We present a novel algorithm for efficient removal of rolling shutter distortions in uncalibrated streaming videos. Our proposed method is calibration free as it does not need any knowledge of the camera used, nor does it require calibration using specially recorded calibration sequences. Our algorithm can perform rolling shutter removal under varying focal lengths, as in videos from CMOS cameras equipped with an optical zoom. We evaluate our approach across a broad range of cameras and video sequences demonstrating robustness, scalability, and repeatability. We also conducted a user study, which demonstrates preference for the output of our algorithm over other state-of-the-art methods. Our algorithm is computationally efficient, easy to parallelize, and robust to challenging artifacts introduced by various cameras with differing technologies.
},
keywords = {awards, best paper award, computational photography, computational video, computer graphics, computer vision, ICCP},
pubstate = {published},
tppubtype = {inproceedings}
}
M. Grundmann, V. Kwatra, I. Essa
Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, 2011.
@inproceedings{2011-Grundmann-AVSWROCP,
title = {Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths},
author = {M. Grundmann and V. Kwatra and I. Essa},
url = {http://www.cc.gatech.edu/cpl/projects/videostabilization/},
doi = {10.1109/CVPR.2011.5995525},
year = {2011},
date = {2011-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
publisher = {IEEE Computer Society},
keywords = {computational video, computer vision, CVPR},
pubstate = {published},
tppubtype = {inproceedings}
}
M. Grundmann, V. Kwatra, M. Han, I. Essa
Efficient Hierarchical Graph-Based Video Segmentation Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
@inproceedings{2010-Grundmann-EHGVS,
title = {Efficient Hierarchical Graph-Based Video Segmentation},
author = {M. Grundmann and V. Kwatra and M. Han and I. Essa},
url = {http://www.cc.gatech.edu/cpl/projects/videosegmentation/},
doi = {10.1109/CVPR.2010.5539893},
year = {2010},
date = {2010-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
keywords = {computational video, computer vision, CVPR, video segmentation},
pubstate = {published},
tppubtype = {inproceedings}
}
M. Grundmann, V. Kwatra, M. Han, I. Essa
Discontinuous Seam-Carving for Video Retargeting Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, 2010.
@inproceedings{2010-Grundmann-DSVR,
title = {Discontinuous Seam-Carving for Video Retargeting},
author = {M. Grundmann and V. Kwatra and M. Han and I. Essa},
url = {http://www.cc.gatech.edu/cpl/projects/videoretargeting/},
doi = {10.1109/CVPR.2010.5540165},
year = {2010},
date = {2010-06-01},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
publisher = {IEEE Computer Society},
keywords = {computational video, computer vision, CVPR},
pubstate = {published},
tppubtype = {inproceedings}
}
Other Publication Sites
A few more sites that aggregate research publications: Academia.edu, BibSonomy, CiteULike, Mendeley.
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced by using the teachPress plugin for WordPress.