AlaaLab/InstructCV
InstructCV fine-tunes Stable Diffusion to handle computer vision tasks such as segmentation, detection, depth estimation, and classification by treating them as text-to-image generation problems.

InstructCV is an instruction-tuned text-to-image diffusion model for computer vision. It adapts Stable Diffusion by casting diverse vision tasks as generation problems: the model takes an input image together with a text instruction describing the task, and it generates an output image that visually encodes the task result. Training uses multiple vision datasets covering segmentation, object detection, depth estimation, and classification, and an LLM paraphrases the task instructions into diverse prompt templates.
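
To make the "instruction plus input image, output image as result" framing concrete, here is a minimal sketch of how one training record could be assembled. Everything in it is an assumption for illustration: the template strings are hand-written stand-ins for the LLM-generated paraphrases, and the names `TEMPLATES`, `build_instruction`, and `make_example` are hypothetical, not part of the InstructCV codebase.

```python
import random

# Hypothetical paraphrase templates. In InstructCV these come from an LLM;
# the strings below are hand-written placeholders, not the project's prompts.
TEMPLATES = {
    "segmentation": [
        "Segment the {category}.",
        "Create a mask that covers the {category}.",
    ],
    "detection": [
        "Detect the {category}.",
        "Draw a bounding box around the {category}.",
    ],
    "depth": [
        "Estimate the depth map of this image.",
        "Show how far each pixel is from the camera.",
    ],
    "classification": [
        "Show whether the image contains a {category}.",
        "Is there a {category} in this picture?",
    ],
}


def build_instruction(task, category="", rng=None):
    """Pick one paraphrase for the task and fill in the object category."""
    if rng is None:
        rng = random.Random()
    template = rng.choice(TEMPLATES[task])
    # Templates without a {category} placeholder ignore the extra argument.
    return template.format(category=category)


def make_example(task, image_path, target_path, category="", rng=None):
    """Bundle one (input image, instruction, target image) training record."""
    return {
        "input_image": image_path,
        "instruction": build_instruction(task, category, rng),
        "target_image": target_path,
    }
```

Sampling a different paraphrase each time a record is built exposes the model to varied wording for the same task, which is the role the source attributes to the LLM-generated templates. Passing a seeded `random.Random` makes the sampling reproducible.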