ChatGPT Goes Multimodal with Voice and Image Capabilities

Leading startup OpenAI recently announced they’re rolling out new capabilities for their popular AI chatbot, ChatGPT, to allow it to “see, hear, and speak.”

By adding voice and image recognition, OpenAI is turning ChatGPT into a full multimodal AI tool, with the aim of giving users more ways to interact with the app and get more value out of it.

This announcement follows closely on the heels of the release of DALL-E 3, the third version of the firm’s flagship text-to-image generator, which also added language capabilities.

Improved ChatGPT with Audio and Image Features

The new version of ChatGPT will incorporate voice and image recognition to expand and enhance the user experience, making it more interactive and accessible.

Now a multimodal LLM (large language model), ChatGPT will let you hold real-time, back-and-forth conversations. Using your voice, you can ask it to answer questions, generate content, or give you ideas, just as you did before in writing, and it will respond in a synthesized, human-like voice.

Image recognition is powered by GPT-3.5 and GPT-4, the latter released in March. You can snap a picture for the bot to “see” and then ask for information, suggestions, and other details based on the image.

OpenAI intends this upgrade to make ChatGPT more accessible and to offer everyone useful real-time assistance.

It’s important to note that, according to the firm, this iteration of ChatGPT excels in English but struggles to produce the same results in other languages, particularly those written in non-Roman scripts.

How the New ChatGPT Works

Per the announcement, the new voice and image capabilities will be enabled for the mobile app on both iOS and Android, and image recognition will also be available on the web version.

The voice feature is opt-in: open the settings in the mobile app, tap “New Features,” and enable “Voice conversations.” You can then choose between five different voices for the app to speak in. Once the feature is activated, you can hold back-and-forth voice chats with your AI assistant on the go.

To use the image feature, tap the photo button (on mobile, hit the “+” button first) and either take a new photo or select one or more photos from your storage. The mobile app also includes a drawing tool you can use to highlight a specific part of an image. Once your pictures are selected, you can start a conversation about them.

Safety and Ethics Policies Surrounding the New Features

As part of their announcement, OpenAI disclosed how they are working to prevent this new functionality from being used in problematic or malicious ways.

For one, they are limiting voice synthesis to the voice chat feature only, to prevent users from impersonating public figures or committing fraud. The synthetic voices were created in collaboration with professional voice actors, using a model that can synthesize human-like speech from just a few seconds of real voice samples. A separate speech recognition model transcribes the user’s spoken words into text.

The company has also used red teamers and alpha testers to analyze possible harmful uses of the software and, among other measures, has restricted ChatGPT’s ability to analyze and make direct statements about people depicted in pictures.

Finally, they’re transparent about the many possible limitations of the tool and discourage its use in high-risk scenarios.

When Will the New ChatGPT Be Available

As with its other products, OpenAI is rolling out this new version of the chatbot gradually. Paid users (ChatGPT Plus and Enterprise) will get access to the voice and image features over the next two weeks, with other users receiving the update soon after, although the company did not say how soon that will be.

Are you ready to try the new and improved multimodal ChatGPT? We sure are!

THE AUTHOR

Ivanna Attie

I am an experienced author with expertise in digital communication, stock media, design, and creative tools. I have closely followed and reported on AI developments in this field since its early days. I have gained valuable industry insight through my work with leading digital media professionals since 2014.
