video-question-answering · GitHub Topics

#Large Language Models# [CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.

captioning-videos ChatGPT gradio langchain video-question-answering video-understanding stablelm chat Video big-model foundation-models large-language-models

Python 3.3k

8 months ago

OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding

foundation-models video-understanding vision-transformer action-recognition multimodal temporal-action-localization video-question-answering zero-shot-classification benchmark contrastive-learning self-supervised instruction-tuning video-clip

Python 2.05k

1 month ago

jayleicn / ClipBERT

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.

PyTorch video-question-answering vqa vision-and-language cvpr2021

Python 722

2 years ago

Vision-CAIR / MiniGPT4-video

Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding

video-question-answering video-understanding

Python 627

9 months ago

X-PLUG / Youku-mPLUG

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks

benchmark 中文 dataset mllm multimodal multimodal-large-language-models multimodal-pretraining Video video-question-answering youku

Python 301

2 years ago

apple / ml-slowfast-llava

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

multimodal-large-language-models video-question-answering

Python 263

1 year ago

X-PLUG / mPLUG-2

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023)

foundation-models mllm multimodal multimodal-pretraining Video image-retrieval mplug video-question-answering vqa

Python 229

2 years ago

Yui010206 / SeViLA

[NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering

mllm video-question-answering

Python 188

2 years ago

salesforce / ALPRO

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

vision-and-language video-question-answering representation-learning prompt-learning

Python 188

4 months ago

doc-doc / NExT-QA

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21)

vision-language video-question-answering video-understanding

Python 169

1 month ago

antoyang / FrozenBiLM

[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

multimodal-learning video-understanding vqa large-language-models pre-training video-question-answering vision-and-language visual-question-answering

Python 158

9 months ago

bytedance / Shot2Story

A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.

benchmark dataset large-language-models video-language-pretraining video-question-answering vision-language video-captioning research

Python 155

7 months ago

jpthu17 / EMCL

[NeurIPS 2022 Spotlight] Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

cross-modal-retrieval neurips video-captioning video-question-answering

Python 139

1 year ago

tsujuifu / pytorch_violet

A PyTorch implementation of VIOLET

PyTorch vision-and-language pre-training video-question-answering

Python 138

2 years ago

jayleicn / TVQAplus

[ACL 2020] PyTorch code for TVQA+: Spatio-Temporal Grounding for Video Question Answering

video-question-answering dataset PyTorch

Python 129

3 years ago

antoyang / just-ask

[ICCV 2021 Oral + TPAMI] Just Ask: Learning to Answer Questions from Millions of Narrated Videos

vqa visual-question-answering video-question-answering video-understanding vision-and-language pre-training multimodal-learning

Jupyter Notebook 123

2 years ago

jpthu17 / HBI

[CVPR 2023 Highlight & TPAMI] Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

cross-modal-retrieval cvpr video-question-answering

Python 122

9 months ago

doc-doc / NExT-GQA

Can I Trust Your Answer? Visually Grounded Video Question Answering (CVPR'24, Highlight)

video-question-answering

Python 79

1 year ago

mlvlab / Flipped-VQA

Large Language Models are Temporal and Causal Reasoners for Video Question Answering (EMNLP 2023)

emnlp2023 large-language-models multi-modal video-question-answering visual-question-answering

Python 76

6 months ago

bcmi / Causal-VidQA

[CVPR 2022] A large-scale public benchmark dataset for video question-answering, especially about evidence and commonsense reasoning. The code used in our paper "From Representation to Reasoning: Towa...

commonsense-reasoning video-question-answering

Python 72

3 months ago