airaria/Visual-Chinese-LLaMA-Alpaca
A multimodal Chinese LLaMA model extended with visual encoding to process and understand image inputs alongside text.

Visual-Chinese-LLaMA-Alpaca (VisualCLA) extends the Chinese LLaMA/Alpaca foundation model with image encoding modules so that it can take images as input alongside text. Training proceeds in two stages: multimodal pretraining on Chinese image-text pairs aligns visual and textual representations, and instruction tuning on multimodal datasets then improves instruction following and conversational ability. The project provides inference code and deployment scripts for Gradio and Text-Generation-WebUI.
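
To make the idea of aligning visual and textual representations more concrete, the following is a minimal, illustrative sketch in plain Python. It is not the project's code. It assumes a common design in which image features are mapped by a learned linear projection into the language model's embedding space and then prepended to the text token embeddings. All names (`project`, `build_multimodal_input`, `VISION_DIM`, `TEXT_DIM`) and all sizes are hypothetical, and random numbers stand in for learned weights and real encoder outputs.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random values (stand-in for learned weights or features)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(features, weight):
    """Apply a bias-free linear layer: map each vector of length d_in to length d_out."""
    d_out = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(vec)) for j in range(d_out)]
        for vec in features
    ]


def build_multimodal_input(image_features, text_embeddings, weight):
    """Project image features into the text embedding space and prepend them to the text embeddings."""
    visual_tokens = project(image_features, weight)
    return visual_tokens + text_embeddings


rng = random.Random(0)
VISION_DIM, TEXT_DIM = 8, 4  # hypothetical sizes, far smaller than in a real model

image_features = make_matrix(3, VISION_DIM, rng)   # 3 image feature vectors (assumed encoder output)
text_embeddings = make_matrix(5, TEXT_DIM, rng)    # 5 text token embeddings
projection = make_matrix(VISION_DIM, TEXT_DIM, rng)  # assumed learned projection

sequence = build_multimodal_input(image_features, text_embeddings, projection)
print(len(sequence), len(sequence[0]))  # 8 4
```

In this sketch the language model would receive one sequence of 8 vectors, all in the same 4-dimensional space. Under this assumption, pretraining on image-text pairs would adjust the projection so that the visual vectors become useful context for predicting the paired text.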