A searchable list of some of my publications is below. You can also access my publications from the following sites.
My ORCID is https://orcid.org/0000-0002-6236-2969
Publications:
1.
Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
End-to-end Multimodal Representation Learning for Video Dialog (Proceedings Article)
In: NeurIPS Workshop on Vision Transformers: Theory and Applications, 2022.
@inproceedings{2022-Alamri-EMRLVD,
title = {End-to-end Multimodal Representation Learning for Video Dialog},
author = {Huda Alamri and Anthony Bilic and Michael Hu and Apoorva Beedu and Irfan Essa},
url = {https://arxiv.org/abs/2210.14512},
doi = {10.48550/arXiv.2210.14512},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS Workshop on Vision Transformers: Theory and Applications},
abstract = {The video-based dialog task is a challenging multimodal learning task that has received increasing attention over the past few years, with state-of-the-art models setting new performance records. This progress is largely powered by the adoption of more powerful transformer-based language encoders. Despite this progress, existing approaches do not effectively utilize visual features to help solve the task. Recent studies show that state-of-the-art models are biased towards textual information rather than visual cues. To better leverage the available visual information, this study proposes a new framework that combines a 3D-CNN network and transformer-based networks into a single visual encoder to extract more robust semantic representations from videos. The visual encoder is jointly trained end-to-end with the other input modalities, such as text and audio. Experiments on the AVSD task show significant improvements over baselines in both generative and retrieval tasks.},
keywords = {computational video, computer vision, vision transformers},
pubstate = {published},
tppubtype = {inproceedings}
}
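As a concrete illustration of the approach this abstract describes, below is a minimal PyTorch sketch of a visual encoder that runs a 3D-CNN over video frames, contextualizes the per-frame features with a transformer, and fuses the pooled result with text and audio embeddings for joint end-to-end training. This is a sketch under my own assumptions (module sizes, the tiny 3D-CNN stand-in, and concatenation-based fusion), not the paper's implementation.

# Minimal sketch (not the paper's code): 3D-CNN + transformer visual encoder
# trained end-to-end with text/audio embeddings.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """A 3D-CNN front end followed by a transformer over the temporal axis."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Tiny 3D-CNN stand-in for a full video backbone (e.g., a 3D ResNet).
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time, pool space away
        )
        self.proj = nn.Linear(64, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, video):  # video: (batch, 3, time, height, width)
        x = self.cnn3d(video)                         # (batch, 64, time, 1, 1)
        x = x.flatten(3).squeeze(-1).transpose(1, 2)  # (batch, time, 64)
        return self.transformer(self.proj(x))         # per-frame features

class MultimodalFusion(nn.Module):
    """Concatenates pooled visual features with text/audio embeddings."""
    def __init__(self, d_model=256, n_classes=10):
        super().__init__()
        self.visual = VisualEncoder(d_model)
        self.head = nn.Linear(3 * d_model, n_classes)

    def forward(self, video, text_emb, audio_emb):
        v = self.visual(video).mean(dim=1)  # average-pool over time
        return self.head(torch.cat([v, text_emb, audio_emb], dim=-1))

model = MultimodalFusion()
scores = model(torch.randn(2, 3, 8, 64, 64),  # two 8-frame RGB clips
               torch.randn(2, 256), torch.randn(2, 256))
print(scores.shape)  # torch.Size([2, 10])

Because the whole stack is a single differentiable module, gradients from the downstream objective flow back through the fusion head into the visual encoder, which is the end-to-end joint training the abstract refers to.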
2.
Apoorva Beedu, Huda Alamri, Irfan Essa
Video based Object 6D Pose Estimation using Transformers (Proceedings Article)
In: NeurIPS Workshop on Vision Transformers: Theory and Applications, 2022.
@inproceedings{2022-Beedu-VBOPEUT,
title = {Video based Object 6D Pose Estimation using Transformers},
author = {Apoorva Beedu and Huda Alamri and Irfan Essa},
url = {https://arxiv.org/abs/2210.13540},
doi = {10.48550/arXiv.2210.13540},
year = {2022},
date = {2022-12-01},
urldate = {2022-12-01},
booktitle = {NeurIPS Workshop on Vision Transformers: Theory and Applications},
abstract = {We introduce VideoPose, a transformer-based 6D object pose estimation framework comprising an end-to-end attention-based modelling architecture that attends to previous frames in order to estimate accurate 6D object poses in videos. Our approach leverages the temporal information in a video sequence for pose refinement, while being computationally efficient and robust. Compared to existing methods, our architecture captures and reasons over long-range dependencies efficiently, iteratively refining its estimates over the video sequence. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with state-of-the-art transformer-based methods and performs significantly better than CNN-based approaches. Furthermore, running at 33 fps, it is also more efficient and therefore applicable to a variety of applications that require real-time object pose estimation. Training code and pretrained models are available at https://anonymous.4open.science/r/VideoPose-3C8C.},
keywords = {computer vision, vision transformers},
pubstate = {published},
tppubtype = {inproceedings}
}
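For illustration, here is a minimal PyTorch sketch of the general idea the abstract describes: a transformer that attends over per-frame features from a short video window and regresses a rotation (as a quaternion) and a translation for the most recent frame. It is a sketch under my own assumptions (feature dimension, window length, output heads), not the released VideoPose code.

# Minimal sketch (not the VideoPose release): attend over past-frame features
# to refine the 6D pose estimate for the latest frame.
import torch
import torch.nn as nn

class PoseRefiner(nn.Module):
    def __init__(self, feat_dim=512, d_model=256, n_heads=4, n_layers=3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.rot_head = nn.Linear(d_model, 4)    # unit quaternion (rotation)
        self.trans_head = nn.Linear(d_model, 3)  # xyz translation

    def forward(self, frame_feats):  # (batch, n_frames, feat_dim)
        h = self.encoder(self.proj(frame_feats))
        last = h[:, -1]  # token for the most recent frame
        quat = nn.functional.normalize(self.rot_head(last), dim=-1)
        return quat, self.trans_head(last)

refiner = PoseRefiner()
quat, trans = refiner(torch.randn(2, 5, 512))   # a 5-frame feature window
print(quat.shape, trans.shape)  # torch.Size([2, 4]) torch.Size([2, 3])

Self-attention over the frame window is what lets the model use long-range temporal context cheaply: each new frame reuses the same window-based refinement rather than recomputing the pose from scratch.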
Other Publication Sites
Copyright/About
[Please see the Copyright Statement that may apply to the content listed here.]
This list of publications is produced using the teachPress plugin for WordPress.