Vision-language models (VLMs) have recently become a key focus of artificial intelligence research. Notably, CLIP [1] and ALIGN [2] were trained on vast collections of image-text pairs, enabling them to ...