
UC Berkeley Researchers Introduce the Touch-Vision-Language (TVL) Dataset for Multimodal Alignment

Mar 4, 2024

Almost all forms of biological perception are multimodal by design, allowing agents to integrate and synthesize data from several sources. Linking modalities, including vision, language, audio, temperature, and robot behaviors, has been the focus of recent research in artificial multimodal representation learning. Nevertheless, the tactile modality remains largely unexplored in multimodal understanding, even though our sense of touch allows us to identify surface textures, materials, dimensions, and contact forces.

Additionally, numerous studies have investigated visual-tactile associations, developed cross-modal generators, and used cross-modal information for surface-roughness estimation, fabric classification, and material-property recognition, but only with a limited vocabulary.

Tactile perception in humans, however, is deeply integrated with language and captures a wide variety of semantic information, not limited to tactile-visual correlations. The lack of diverse data is a major hurdle for integrating touch and language. Although there have been efforts to collect datasets of paired tactile and visual observations, as well as human-labeled datasets for touch-based texture or material classification, no existing tactile dataset, to the researchers' knowledge, includes open-vocabulary language labels.

To gather synchronized touch-vision data “in the wild,” away from a controlled lab environment, the researchers built a bespoke handheld device. With this setup, they capture tactile readings and close-up visual observations while pressing and sliding on various foreground surfaces and objects against different backgrounds.

Language descriptions of tactile experiences are subjective and differ between individuals, adding another obstacle to the already expensive human labeling process. To tackle these issues, prior research on training vision-language models (VLMs) and large language models (LLMs) shows that vision-language understanding can be learned from data synthesized by the models themselves or by existing LLMs. The researchers argue that a commercially available LLM (GPT-4V) can serve as an effective captioner, compensating for the absence of labeled tactile-language data by producing tactile descriptions based on visual observations.
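As a rough illustration of this pseudo-labeling idea, the sketch below asks a GPT-4V-style endpoint to describe how a pictured surface would likely feel. It assumes the OpenAI Python client; the prompt wording and model name are illustrative only, not taken from the paper.

```python
# Sketch: pseudo-labeling tactile descriptions from the visual frame with GPT-4V.
# Assumes the OpenAI Python client; prompt and model name are illustrative, not
# the paper's exact protocol.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption_touch_from_image(image_path: str) -> str:
    """Ask a vision-language model how the pictured surface would likely feel."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # hypothetical choice of GPT-4V endpoint
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe, in a few adjectives, how the surface in this "
                         "close-up image would feel to the touch."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=64,
    )
    return response.choices[0].message.content

# Example: pseudo_label = caption_touch_from_image("frame_000123.jpg")
```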

Researchers from UC Berkeley, Meta AI, and TU Dresden introduced the Touch-Vision-Language (TVL) dataset, composed of 44,000 paired vision-tactile observations. Humans annotate 10% of the data, while GPT-4V labels the rest. Using this dataset, the researchers train a tactile encoder by pairwise contrastive learning among all three modalities rather than binding all modalities to vision. By leveraging existing OpenCLIP vision and language encoders, they obtain a tactile encoder compatible with the visual and textual modalities, and they assess alignment through the encoder’s touch-vision and touch-language classification performance. LLaMA2 7B is then fine-tuned on the dataset, with the trained tactile encoder, to generate textual descriptions of tactile images from visual and tactile observations.
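The following is a minimal sketch of how pairwise contrastive alignment against frozen OpenCLIP towers could look. The ResNet-18 tactile backbone, the model names, and the temperature are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: pairwise contrastive alignment of a tactile encoder with frozen
# OpenCLIP vision and text encoders. Backbone choice, model names, and the
# temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import open_clip
from torchvision.models import resnet18

class TactileEncoder(nn.Module):
    """Small image backbone projecting tactile sensor frames into CLIP space."""
    def __init__(self, embed_dim: int):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()            # keep the 512-d pooled features
        self.backbone = backbone
        self.proj = nn.Linear(512, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(self.backbone(x)), dim=-1)

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss between two batches of unit-norm embeddings."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Frozen OpenCLIP vision/text towers provide the shared embedding space.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad_(False)

tactile_encoder = TactileEncoder(embed_dim=clip_model.visual.output_dim)

def training_step(touch_imgs, vision_imgs, captions):
    """One pairwise contrastive step: touch<->vision plus touch<->language."""
    z_touch = tactile_encoder(touch_imgs)
    with torch.no_grad():
        z_vision = F.normalize(clip_model.encode_image(vision_imgs), dim=-1)
        z_text = F.normalize(clip_model.encode_text(tokenizer(captions)), dim=-1)
    return info_nce(z_touch, z_vision) + info_nce(z_touch, z_text)
```

Keeping the CLIP towers frozen and aligning only the tactile encoder means touch inherits the vision-language structure already present in the pretrained embedding space.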

The proposed TVL Benchmark asks multimodal models to produce tactile descriptions and then uses an LLM to score how well those descriptions match ground-truth human annotations. On this benchmark, the proposed touch-vision-language model statistically significantly outperforms both open-source VLMs (+32% improvement) and GPT-4V (+12% improvement), the label-generating model, despite training on a relatively modest amount of human-labeled data.
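For intuition, an LLM-as-judge evaluation of this kind might look like the sketch below. The prompt, judge model, and 1-10 scale are illustrative assumptions, not the benchmark's exact protocol.

```python
# Sketch: scoring a generated tactile description against a human ground-truth
# annotation with an LLM judge. Prompt, model, and scale are assumptions.
from openai import OpenAI

client = OpenAI()

def judge_description(generated: str, reference: str) -> int:
    """Return an LLM-assigned match score between 1 and 10."""
    prompt = (
        "Rate from 1 to 10 how well the candidate tactile description matches "
        "the reference human annotation. Reply with the number only.\n"
        f"Reference: {reference}\nCandidate: {generated}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4,
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# Example: judge_description("rough, grainy, rigid", "coarse sandpaper, hard and gritty")
```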

The team believes this work will be helpful to researchers interested in pseudo-label-based learning methods, as well as to future large generative models that incorporate touch. The presented methodology should also help advance touch digitization and robotic touch applications.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.
