Vision-Language Models (VLMs)
- Qwen-VL-Chat-1.1: 🤗, 🗄️
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
- LLaVA-NeXT (LLaVA-1.6) (see the loading sketch after this list)
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models 🤗
- DeepSeek-VL-7B-Chat: 🤗
- DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
- UReader: Universal OCR-free Visually-situated Language Understanding
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- VILA-2.7b: Vision-Language Model from NVIDIA and MIT, demo, 🤗
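
Most of the checkpoints above are distributed on the Hugging Face Hub. Below is a minimal sketch of loading one of them with the `transformers` library, using LLaVA-NeXT (LLaVA-1.6) as the example; the model id, image URL, and prompt template are illustrative assumptions and other entries in the list may require model-specific code (e.g. `trust_remote_code=True`).

```python
# Minimal sketch: run a LLaVA-NeXT (LLaVA-1.6) checkpoint with Hugging Face transformers.
# The model id and image URL below are assumptions for illustration only.
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed LLaVA-1.6 checkpoint id
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this URL is only an example.
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prompt template for the Mistral-based LLaVA-1.6 variant.
prompt = "[INST] <image>\nDescribe this image. [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```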