multimodal-large-language-models

BradyFU / Awesome-Multimodal-Large-Language-Models

✨✨Latest Advances on Multimodal Large Language Models

instruction-tuning instruction-following large-vision-language-model visual-instruction-tuning multi-modality in-context-learning large-language-models large-vision-language-models multimodal-chain-of-thought multimodal-in-context-learning multimodal-large-language-models chain-of-thought

16.25 k

10 days ago

X-PLUG / MobileAgent

#Android# Mobile-Agent: The Powerful GUI Agent Family

agent mllm mobile-agents multimodal multimodal-large-language-models multimodal-agent Android App GUI mobile-automation copilot

Python 5.61 k

10 hours ago

joanrod / star-vector

#Large Language Models# StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textu...

large-language-model multimodal-large-language-models SVG vlm

Python 4.03 k

5 months ago

modelscope / ms-agent

#Large Language Models# MS-Agent: Lightweight Framework for Empowering Agents with Autonomous Exploration in Complex Task Scenarios

agent gpts large-language-model qwen open-gpts multi-agents assistantapi chatbot multimodal-large-language-models rag Code data-science deep-research

Python 3.42 k

5 days ago

ictnlp / LLaMA-Omni

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

large-language-models multimodal-large-language-models speech-to-text

Python 3.07 k

4 months ago

VITA-MLLM / VITA

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

large-multimodal-models multimodal-large-language-models

Python 2.4 k

6 months ago

X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

chart-understanding document-understanding mllm multimodal multimodal-large-language-models table-understanding

Python 2.25 k

3 months ago

cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

chatbot clip machine-vision dino instruction-tuning large-language-models large-language-model mllm multimodal-large-language-models representation-learning

Python 1.95 k

10 months ago

YangLing0818 / RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

large-language-models multimodal-large-language-models image-editting text-to-image

Jupyter Notebook 1.82 k

7 months ago

sherlockchou86 / VideoPipe

#Face Recognition# A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)

C++ 1.82 k

9 days ago

ByteDance-Seed / Seed1.5-VL

Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

cookbook large-language-model multimodal-large-language-models vision-language-model

Jupyter Notebook 1.43 k

3 months ago

AIDC-AI / Ovis

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

chatbot llama3 multimodal multimodal-large-language-models multimodality qwen vision-language-model

Python 1.34 k

6 days ago

Henry-23 / VideoChat

Real-time voice-interactive digital human, supporting an end-to-end voice solution (GLM-4-Voice - THG) and a cascaded solution (ASR-LLM-TTS-THG). Appearance and voice are customizable with no training required, voice cloning is supported, and first-packet latency is as low as 3 s.

dialogue-systems real-time digital-human lip-sync musetalk streaming talking-head asr tts end-to-end multimodal-large-language-models

Python 1.08 k

6 months ago