Microsoft Researchers Introduce StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

Large transformer-based language models (LLMs) have achieved remarkable progress in Natural Language Processing (NLP) in recent years, and they are now branching out into other fields such as robotics, audio, and medicine.

Modern approaches allow LLMs to produce visual data by using specialized modules such as VQ-VAE and VQ-GAN, which convert continuous pixel data into discrete grid tokens. The LLM then processes these grid tokens much as it processes words in text, which lets it apply its usual generative modeling to images. Even so, LLMs that generate images this way still fall short of diffusion models.
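To make the idea of discrete grid tokens concrete, here is a minimal sketch of the nearest-codebook lookup at the heart of vector quantization. It is not the VQ-VAE or VQ-GAN implementation: the codebook values, the two-dimensional "patch features," and the function names are all made up for illustration, and a real model learns both the encoder and the codebook.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


# Hypothetical 4-entry codebook and four encoded image patches (toy values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patches = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.7, 0.9)]

# Each patch becomes one integer token that an LLM can treat like a word ID.
tokens = quantize(patches, codebook)
print(tokens)  # [0, 1, 2, 3]
```

The resulting integer sequence is what the language model sees in place of pixels.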

By adopting an alternative image format, vector graphics, a new study by Soochow University, Microsoft Research Asia, and Microsoft Azure AI presents a fresh method that better preserves the semantic concepts of images. Vector graphics capture the semantic concepts of an image directly, unlike pixel-based formats, which conceal how objects are constructed. In the proposed “stroke” token system, an image of a dolphin, for instance, is divided into a series of linked strokes, and each stroke unit carries complete semantic information.

The team emphasizes that they are not arguing for the inherent superiority of vector graphics over raster images; rather, they are presenting a new way of looking at visual representation. The “stroke” token idea has several benefits:

Each stroke token has visual semantics built-in, making semantic segmentation of image content more intuitive.
Vector graphics are inherently compatible with LLMs because their creation process is sequential and interconnected, similar to how LLMs process information. Put another way, LLMs can digest the strokes more naturally, since each one is formed in relation to the ones that come before and after it.
Vector graphic strokes can be highly compressed, which drastically reduces data size without sacrificing quality or semantic integrity. Each stroke token can therefore hold a rich, compressed representation of the visual information.
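The sequential nature of strokes can be illustrated by serializing an SVG path into an ordered list of drawing commands. The sketch below is a simplification and not the paper's preprocessing: it assumes paths that use only the absolute commands M (move to), L (line to), and C (cubic Bézier curve), each written with exactly one set of arguments, and the function name `path_to_strokes` is hypothetical.

```python
import re
from typing import List, Tuple

Stroke = Tuple[str, Tuple[float, ...]]

# Number of numeric arguments each supported command takes.
ARG_COUNT = {"M": 2, "L": 2, "C": 6}


def path_to_strokes(d: str) -> List[Stroke]:
    """Split an SVG path string into an ordered list of (command, args) strokes."""
    strokes: List[Stroke] = []
    for cmd, args in re.findall(r"([MLC])([^MLC]*)", d):
        numbers = tuple(float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", args))
        if len(numbers) != ARG_COUNT[cmd]:
            raise ValueError(f"{cmd} expects {ARG_COUNT[cmd]} numbers, got {len(numbers)}")
        strokes.append((cmd, numbers))
    return strokes


# A toy path: move to a start point, draw one curve, then one straight line.
path = "M 10 80 C 40 10 65 10 95 80 L 150 80"
strokes = path_to_strokes(path)
for command, args in strokes:
    print(command, args)
```

Each stroke starts where the previous one ended, so the list reads as a sequence in which every element depends on its predecessors, much like words in a sentence.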

Based on the analysis above, they present StrokeNUWA, a model that generates vector graphics without relying on a visual module. StrokeNUWA is made up of a VQ-Stroke module and an Encoder-Decoder model. VQ-Stroke, which is based on the residual quantizer design, compresses serialized vector graphic data into a sequence of SVG tokens. The Encoder-Decoder model relies mainly on a pre-trained LLM to generate SVG tokens in response to textual instructions.
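For readers unfamiliar with residual quantization, the following sketch shows the general technique on toy data. It is not the VQ-Stroke module itself: the two hand-written codebooks, the two-dimensional input, and the function names are assumptions chosen for illustration, whereas a real residual quantizer learns its codebooks and operates on encoded stroke features.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(v: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])),
    )


def residual_quantize(v: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize v stage by stage; each stage encodes what the previous ones missed."""
    residual = list(v)
    indices: List[int] = []
    for codebook in codebooks:
        k = nearest(residual, codebook)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices


def reconstruct(indices: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Sum the selected entry from every stage to approximate the original vector."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical codebooks: a coarse first stage and a finer second stage.
coarse = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
fine = [(-0.25, -0.25), (0.0, 0.0), (0.25, 0.25)]

indices = residual_quantize((1.2, 1.3), [coarse, fine])
approx = reconstruct(indices, [coarse, fine])
print(indices, approx)  # [1, 2] [1.25, 1.25]
```

Stacking stages in this way lets a few small codebooks approximate a vector more closely than one codebook of the same size, which is one reason the design suits compression.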

The researchers compare StrokeNUWA with optimization-based approaches on the text-guided SVG generation task. The proposed method improves CLIPScore metrics, which suggests that stroke tokens can produce content with richer visual semantics. It also outperforms LLM-based baselines on all metrics, indicating that stroke tokens can be integrated successfully with LLMs. Lastly, thanks to the compression capabilities inherent in vector graphics, the approach generates output up to 94 times faster, demonstrating high efficiency.

This study highlights the potential of stroke tokens for vector graphic creation. The team’s long-term goal is to further refine stroke token quality using advanced visual tokenization techniques tailored to LLMs. They also plan to extend stroke tokens to further domains (3D) and tasks (SVG understanding), and to generate SVGs from real-world photos.

Check out the Paper. All credit for this research goes to the researchers of this project.