Revolutionizing Sketching: MIT and Stanford’s AI Tool Mimics Human Creativity

Cartoon of a robotic arm making childlike line drawings of things like butterflies, elephants, and DNA helixes on pieces of paper.

When trying to communicate or understand ideas, words sometimes fall short. A simple sketch can often provide clarity, like diagramming a circuit to understand how a system works. But what if artificial intelligence could help us explore ideas visually? While AI systems typically excel at producing realistic images and polished cartoons, many models struggle to capture the essence of sketching: the iterative, stroke-by-stroke process that helps humans brainstorm and refine ideas.

A new drawing system from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University, named “SketchAgent,” aims to address this gap. Using a multimodal language model, SketchAgent transforms natural language prompts into sketches within seconds. It can independently doodle a house or collaborate with a human, integrating text-based input to sketch each component separately.

Breakthrough in AI Sketching
Researchers demonstrated SketchAgent’s ability to create abstract drawings of varied concepts, from a robot to the Sydney Opera House. The tool could eventually grow into an interactive art game, help educators and researchers diagram complex ideas, or give users quick drawing lessons.

Yael Vinker, a CSAIL postdoc and lead author of the paper introducing SketchAgent, says the tool points toward a more natural way for humans and AI to communicate. “Not everyone is aware of how much they draw in their daily life,” Vinker notes. “Our tool aims to emulate that process, making multimodal language models more useful for visually expressing ideas.”

Teaching AI to Draw
SketchAgent teaches models to draw stroke-by-stroke without relying on pre-existing sketch data. The researchers developed a “sketching language” in which a sketch is translated into a numbered sequence of strokes placed on a grid. The system learns from examples, such as a drawing of a house with each stroke labeled by function, like a rectangle representing a “front door,” allowing the model to generalize to new concepts.
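To make this concrete, here is a minimal Python sketch of what such a stroke encoding could look like. The grid size, the stroke tuples, and the `strokes_to_svg` helper are illustrative assumptions for this article, not SketchAgent’s actual format.

```python
# A minimal, hypothetical illustration of a "sketching language":
# a sketch is a numbered sequence of labeled strokes, each a list of
# (x, y) points on a coarse grid. These names and formats are
# assumptions for illustration, not SketchAgent's actual encoding.

GRID_SIZE = 50  # coarse drawing grid, e.g. 50 x 50 cells

# Each stroke: (label, [(x, y), ...]) -- points in grid coordinates.
HOUSE = [
    ("base",       [(10, 40), (40, 40), (40, 20), (10, 20), (10, 40)]),
    ("roof",       [(10, 20), (25, 5), (40, 20)]),
    ("front door", [(22, 40), (22, 30), (28, 30), (28, 40)]),
]

def strokes_to_svg(strokes, grid=GRID_SIZE, scale=8):
    """Render a numbered stroke sequence as an SVG string."""
    size = grid * scale
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    for i, (label, points) in enumerate(strokes, start=1):
        pts = " ".join(f"{x * scale},{y * scale}" for x, y in points)
        # The label travels with the stroke, so a learner can see which
        # strokes realize which part of the concept ("roof", "front door", ...).
        parts.append(f'<polyline points="{pts}" fill="none" stroke="black">'
                     f"<title>stroke {i}: {label}</title></polyline>")
    parts.append("</svg>")
    return "\n".join(parts)

if __name__ == "__main__":
    print(strokes_to_svg(HOUSE))  # paste the output into an .svg file to view
```

Because each stroke carries both an order and a label, a model that emits a format like this can be corrected or extended one stroke at a time, which is what makes a stroke-by-stroke setup amenable to collaboration.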

The paper, co-authored with CSAIL affiliates and Stanford University researchers, will be presented at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR).

Evaluating AI’s Sketching Abilities
Text-to-image models like DALL-E 3 can generate compelling drawings but lack the spontaneous, creative sketching process where each stroke influences the overall design. In contrast, SketchAgent’s drawings, modeled as a sequence of strokes, appear more natural and fluid, akin to human sketches.

Previous models mimicked this process by training on human-drawn sketch datasets, which are often limited in scale and diversity. SketchAgent takes a different approach: it uses pre-trained language models, which are knowledgeable about many concepts but unskilled at sketching. Teaching these models the stroke-by-stroke process allowed SketchAgent to sketch diverse concepts it was never explicitly trained to draw.

The team also explored SketchAgent’s collaborative potential, testing whether it contributes meaningfully when working with humans on sketches or merely operates alongside them. In collaboration mode, a human and the language model jointly draw a concept. Removing SketchAgent’s contributions revealed that its strokes were essential to the final image: in a sailboat drawing, for instance, removing the AI-drawn strokes left the sketch unrecognizable.
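As a rough illustration of that ablation, the toy snippet below reuses the same hypothetical stroke format from earlier, tags each stroke with its author, and keeps only the human’s strokes; the `without_author` helper and the sailboat strokes are invented for this example.

```python
# A toy version of the ablation described above: tag each stroke with its
# author ("human" or "agent"), then drop the agent's strokes to see how
# much the AI contributed. The format and helper are hypothetical.

def without_author(strokes, author):
    """Return the strokes not drawn by the given author."""
    return [(who, label, pts) for (who, label, pts) in strokes if who != author]

SAILBOAT = [
    ("human", "hull", [(5, 30), (35, 30), (30, 36), (10, 36), (5, 30)]),
    ("agent", "mast", [(20, 30), (20, 8)]),
    ("agent", "sail", [(20, 8), (32, 28), (20, 28)]),
]

ablated = without_author(SAILBOAT, "agent")
print([label for (_who, label, _pts) in ablated])  # -> ['hull']; mast and sail are gone
```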

Experimenting with Multimodal Models
CSAIL and Stanford researchers experimented by integrating different multimodal language models into SketchAgent to identify which produced the most recognizable sketches. Their default model, Claude 3.5 Sonnet, generated the most human-like vector graphics, outperforming models like GPT-4o and Claude 3 Opus.

“Claude 3.5 Sonnet’s performance suggests it processes and generates visual-related information differently,” says co-author Tamar Rott Shaham. Shaham envisions SketchAgent as a valuable interface for collaborating with AI models beyond standard text-based interaction. “Models that can understand and generate sketches open up new ways for users to express ideas and receive responses that feel more intuitive and human-like,” she adds.

Overcoming Sketching Challenges
Despite its potential, SketchAgent cannot yet produce professional-level sketches, rendering simple concepts as stick figures and doodles. It often requires multiple prompts to generate a human-like doodle and may misinterpret user intentions, such as drawing a bunny with two heads.

These misinterpretations may stem from how the model breaks tasks into smaller steps (so-called “chain-of-thought” reasoning). When collaborating, the model can also misjudge which part of the outline a human is contributing to. The researchers aim to refine these skills by training on synthetic data from diffusion models and by making interaction with the underlying multimodal language models smoother.

Although still a work in progress, SketchAgent suggests that AI can sketch diverse concepts the way humans do, enabling step-by-step human-AI collaboration that produces final designs more closely aligned with a user’s intent.

