Researchers Shanghai AI Lab and SenseTime Propose MM-Grounding-DINO: An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

Object detection plays a vital role in multi-modal understanding systems, where images are input into models to generate proposals aligned with text. This process is crucial for state-of-the-art models handling Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). OVD models are trained on base categories in zero-shot scenarios but must predict both base and novel categories within a broad vocabulary. PG provides a phrase to describe candidate categories and output corresponding boxes, while REC accurately identifies a target from text and outlines its position using a bounding box. Grounding-DINO addresses OVD, PG, and REC, gaining widespread adoption for diverse applications.

Researchers from Shanghai AI Lab and SenseTime Research have developed MM-Grounding-DINO, a user-friendly and open-source pipeline created using the MMDetection toolbox. It utilizes diverse vision datasets for pre-training and a range of detection and grounding datasets for fine-tuning. A comprehensive analysis of reported results and detailed settings for reproducibility are provided. Through extensive experiments on benchmarks, MM-Grounding-DINO-Tiny surpasses the performance of the Grounding-DINO-Tiny baseline.

MM-Grounding-DINO builds upon the foundation of Grounding-DINO. It operates by aligning textual descriptions with corresponding generated bounding boxes in images with varied shapes. The main components of the MM-Grounding-DINO include a text backbone responsible for extracting features from text, an image backbone for extracting features from images, a feature enhancer for thorough fusion of image and text features, a language-guided query selection module for initializing queries, and a cross-modality decoder for refining bounding boxes.

When presented with an image-text pair, MM-Grounding-DINO employs an image backbone to extract features from the image at various scales. Simultaneously, a text backbone extracts features from the accompanying text. These extracted features are input into a feature enhancer module, facilitating cross-modality fusion. Within this module, text and image features undergo fusion through a Bi-Attention Block, encompassing text-to-image and image-to-text cross-attention layers. Subsequently, the fused features undergo further enhancement through vanilla self-attention and deformable self-attention layers, followed by a Feedforward Network (FFN) layer.

The study presents an open, comprehensive pipeline for unified object grounding and detection covering OVD, PG, and REC tasks. The model’s performance is evaluated through a visualization-based analysis, which reveals inaccuracies in the ground-truth annotations of the evaluation dataset. The MM-Grounding-DINO model achieves state-of-the-art performance in zero-shot settings on COCO, with a mean average precision (mAP) of 52.5. The MM-Grounding-DINO model also outperforms fine-tuned models in various domains, including marine objects, brain tumor detection, urban street scenes, and people in paintings, setting new benchmarks for mAP.

In conclusion, The study introduces a comprehensive and open pipeline for unified object grounding and detection, addressing tasks like OVD, PG, and REC. The model exhibits notable improvements in mAP across various datasets, such as COCO and LVIS, through fine-tuning. The model’s predictions’ precision surpasses existing annotations for specific objects. The authors propose an extensive evaluation framework facilitating systematic assessment across diverse datasets, including COCO, LVIS, RefCOCOg, Flickr30k Entities, ODinW1335, and Description Detection Dataset (D3).

Check out the Paper and Github. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our 36k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.