Multi-Modal AI is a UX Problem

Sep 26, 2023

Transformers and other AI breakthroughs have achieved state-of-the-art performance across different modalities: text, images, audio, and code.

The next frontier in AI is combining these modalities in interesting ways. Explain what’s happening in a photo. Debug a program with your voice. Generate music from an image. There’s still technical work to be done on combining these modalities, but the greatest challenge isn’t a technical one; it’s a user experience one.
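To see how thin the purely technical layer already is for some of these tasks, here’s a minimal sketch of the first one (describing a photo), assuming the Hugging Face transformers library and a pretrained BLIP captioning checkpoint; the model choice and the photo path are placeholders, not a recommendation:

```python
# Minimal sketch: "explain what's happening in a photo" with an
# off-the-shelf image-to-text model.
# Assumes `pip install transformers pillow` and a local image at photo.jpg.
from transformers import pipeline

# BLIP is one of several pretrained captioning models; any image-to-text
# checkpoint could be swapped in here.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

result = captioner("photo.jpg")
print(result[0]["generated_text"])  # e.g. "a dog running on the beach"
```

The model call is a few lines. Everything around it — where the photo comes from, how the caption is shown, how the user corrects it — is the open question.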

What is the right UX for these use cases?

Chat isn’t always the best interface for every task, although it’s one of the most intuitive, especially when users are being introduced to a new technology (why does every AI cycle start with chat?). Sticking images, audio, and other modalities into a chat interface gets confusing very quickly. It’s why technologies like Jupyter Notebooks (which combine markup, graphs, and code in the same interface) are so polarizing: great for many exploratory tasks, but master of none.

There’s a huge opportunity in the UX layer for integrating these modalities. How do we best present the different types of output to users: audio, text, images, or code? How do we let users iterate on these models and provide feedback (e.g., what does it mean to fine-tune a multimodal model)?
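One way to make the presentation question concrete: before an interface can render or collect feedback on a model’s output, it needs some shared representation of what a single turn can contain. Here’s a rough sketch, with hypothetical type names that aren’t any particular product’s schema:

```python
# Hypothetical sketch of a multimodal "turn" a UI would have to render.
# Type names and fields are illustrative only.
from dataclasses import dataclass, field
from typing import Literal, Optional, Union

@dataclass
class TextPart:
    kind: Literal["text"] = "text"
    content: str = ""

@dataclass
class ImagePart:
    kind: Literal["image"] = "image"
    url: str = ""
    alt_text: str = ""   # what does alt text mean when the model wrote the image?

@dataclass
class AudioPart:
    kind: Literal["audio"] = "audio"
    url: str = ""
    transcript: str = ""  # do users read it, listen to it, or both?

@dataclass
class CodePart:
    kind: Literal["code"] = "code"
    language: str = "python"
    source: str = ""

Part = Union[TextPart, ImagePart, AudioPart, CodePart]

@dataclass
class ModelTurn:
    parts: list[Part] = field(default_factory=list)
    # Feedback is just as unsettled: a thumbs-up on which part, or on the whole turn?
    feedback: Optional[str] = None
```

Even this toy structure surfaces the UX questions immediately: which parts get shown inline, which get players or editors, and what a user’s correction should attach to.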