[2025.03.03] - 🔥🔥🔥 We have open-sourced AnyText2, which is faster, performs better, and allows you to set properties such as font and color for the text! See ...
Abstract: Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to ...
Abstract: This article presents ORB-SLAM3, the first system able to perform visual, visual-inertial and multimap SLAM with monocular, stereo and RGB-D cameras, using pin-hole and fisheye lens models.
Alibaba has released Qwen3.5-Omni, an omnimodal AI model capable of processing text, images, audio, and video, available in three different variants. The model reportedly outperforms Google's Gemini 3 ...