Efficient vision foundation models for high-resolution generation and perception.
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!