zai-org/VisualGLM-6B
Open-source multimodal conversational language model with 7.8 billion parameters, supporting dialogue about images in Chinese and English.

VisualGLM-6B is a multimodal dialogue language model that combines text and image understanding. Its language component is based on ChatGLM-6B, which has 6.2 billion parameters. Visual information reaches the language model through a BLIP2-Qformer bridge, trained to align visual representations with the language model; with the visual side included, the full model has 7.8 billion parameters. The model supports bilingual Chinese-English conversation with image inputs, so users can discuss visual content in natural language. The project is built on the SwissArmyTransformer library for efficient model training and deployment.
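
The parameter figures above imply a simple budget, which the sketch below works out. It is a back-of-the-envelope estimate only: it assumes the 7.8 billion total and the 6.2 billion language figure can be subtracted directly, and it counts weight storage alone (no activations, caches, or framework overhead), so the results are lower bounds rather than measured memory requirements. The function names and the bytes-per-parameter table are illustrative and not part of the project.

```python
# Parameter counts quoted in the description, in billions.
TOTAL_PARAMS_B = 7.8
LANGUAGE_PARAMS_B = 6.2

# Bytes needed to store one parameter at common precisions.
# These are general facts about the number formats, not project-specific values.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def non_language_params_b(total_b=TOTAL_PARAMS_B, language_b=LANGUAGE_PARAMS_B):
    """Parameters outside the language component, in billions (rounded to 0.1)."""
    return round(total_b - language_b, 1)


def weight_storage_gb(params_b, precision):
    """Weights-only storage in GB (1 GB = 1e9 bytes) for a given precision.

    params_b billion parameters * bytes per parameter = GB, because the
    factor of 1e9 in the parameter count cancels the 1e9 bytes per GB.
    """
    return params_b * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    print(f"non-language parameters: {non_language_params_b()}B")
    for precision in ("fp16", "int8", "int4"):
        size = weight_storage_gb(TOTAL_PARAMS_B, precision)
        print(f"{precision}: about {size:.1f} GB for weights alone")
```

Under these assumptions, roughly 1.6 billion parameters sit outside the language component, and the full set of weights takes about 15.6 GB at 16-bit precision.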