Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM / VLM!
#Natural Language Processing# An on-premises, OCR-free toolkit for unstructured data extraction, Markdown conversion, and benchmarking. (https://idp-leaderboard.org/)
Official repository for VisionZip (CVPR 2025)
#Large Language Model# [CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
#Large Language Model# Scala client for the OpenAI API and other major LLM providers
[CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"
[NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
This repository collects research papers on large foundation models for scenario generation and analysis in autonomous driving. It is continuously updated to track the latest work.
[CVPR 2025] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
[NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
#Computer Science# A hub for researchers exploring VLMs and multimodal learning :)
Benchmarking Vision-Language Models on OCR tasks in Dynamic Video Environments
[ICASSP 2024] The official repo for "Harnessing the Power of Large Vision Language Models for Synthetic Image Detection"
[COLM 2025] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
Convert documents and images to high-quality Markdown using vision LLMs; built for RAG ingestion pipelines (see the sketch after this list).
[NAACL 2025] Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning
We introduce VLM-Mamba, the first Vision-Language Model built entirely on State Space Models (SSMs), specifically leveraging the Mamba architecture.
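To make the VLM-Mamba entry concrete: Mamba-style models build on the linear state-space recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t. Below is a minimal NumPy sketch of that sequential core, not code from the VLM-Mamba repository; the matrices, dimensions, and `ssm_scan` helper are made-up illustrations, and Mamba itself additionally makes A, B, C input-dependent ("selective") and replaces this loop with a hardware-aware parallel scan.

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Run the linear SSM recurrence over a sequence xs of shape (T, d_in):
       h_t = A @ h_{t-1} + B @ x_t,  y_t = C @ h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + B @ x   # state update
        ys.append(C @ h)    # readout
    return np.stack(ys)

# Toy example with random parameters (hypothetical sizes).
rng = np.random.default_rng(0)
d_in, d_state, d_out, T = 4, 8, 4, 16
A = rng.normal(scale=0.1, size=(d_state, d_state))
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
xs = rng.normal(size=(T, d_in))
print(ssm_scan(A, B, C, xs).shape)  # (16, 4)
```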
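For the VLM-based document-to-Markdown converter listed above, the core pattern is sending a page image to a vision-capable LLM with a transcription prompt. Here is a minimal sketch using the official `openai` Python client; the model name, prompt, and `page_to_markdown` helper are hypothetical placeholders, not that repository's actual API.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def page_to_markdown(image_path: str, model: str = "gpt-4o-mini") -> str:
    """Ask a vision LLM to transcribe one page image as Markdown."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page as clean Markdown. "
                         "Preserve headings, lists, and tables."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

In a RAG ingestion pipeline, the returned Markdown would then be chunked and embedded like any other text source.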