Vision-Language Models (VLMs)
- Qwen-VL-Chat-1.1: 🤗, 🗄️
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
- LLaVA-NeXT (LLaVA-1.6) (see the loading sketch after this list)
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models 🤗
- DeepSeek-VL-7B-Chat: 🤗
- DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
- UReader: Universal OCR-free Visually-situated Language Understanding
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- VILA-2.7b: Vision-Language Model from NVIDIA and MIT, demo, 🤗
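
Most of the checkpoints above are distributed on the Hugging Face Hub. Below is a minimal sketch of loading one of them with the `transformers` library, using LLaVA-NeXT (LLaVA-1.6) as the example; the model id, image URL, and prompt template are illustrative assumptions and other entries in the list may require model-specific code (e.g. `trust_remote_code=True`).

```python
# Minimal sketch: run a LLaVA-NeXT (LLaVA-1.6) checkpoint with Hugging Face transformers.
# The model id and image URL below are assumptions for illustration only.
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed LLaVA-1.6 checkpoint id
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this URL is only an example.
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prompt template for the Mistral-based LLaVA-1.6 variant.
prompt = "[INST] <image>\nDescribe this image. [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```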