ChatGPT now also understands images and voice commands

OpenAI keeps improving its ChatGPT chatbot. The new version lets users interact with ChatGPT through voice and images as well, which raises new questions and concerns. So what does the new version bring, and when?

Most of the changes OpenAI is making to ChatGPT relate to what the AI-powered bot can do: what questions it can answer, what information it can access, and so on. This time, however, it’s also changing the way you can use ChatGPT yourself. The company is introducing a new version of the service that lets you interact with the AI bot not just by typing sentences into a text box, but also by talking to it or just uploading an image. The new features will be available to Plus subscribers in the coming weeks, with everyone else getting the new functionality “soon after.”

The voice command part isn't anything groundbreaking: you tap a button and ask your question, ChatGPT converts it to text and feeds it to a large language model, then converts the answer back to speech and reads it out to you. It should be similar to talking to Alexa or Google Assistant, except that, OpenAI hopes, the answers will be better thanks to improved underlying technology. Most virtual assistants seem to be getting rebuilt around large language models, and OpenAI is one step ahead of them all.
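The three-step flow described above (speech to text, text through a language model, text back to speech) can be pictured as a simple chain of functions. The sketch below is only an illustration of that idea, not OpenAI's implementation: the class name `VoiceAssistant` and the three `fake_*` functions are hypothetical stand-ins, and no real speech or language model is called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAssistant:
    """Chains three pluggable stages: speech-to-text, language model, text-to-speech."""
    transcribe: Callable[[bytes], str]   # audio in, text out
    generate: Callable[[str], str]       # question text in, answer text out
    synthesize: Callable[[str], bytes]   # answer text in, audio out

    def answer(self, audio_question: bytes) -> bytes:
        text_question = self.transcribe(audio_question)
        text_answer = self.generate(text_question)
        return self.synthesize(text_answer)


# Toy stand-ins (assumptions for illustration): they treat "audio" as UTF-8 bytes
# so the example runs without any speech or language model.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


assistant = VoiceAssistant(fake_transcribe, fake_generate, fake_synthesize)
reply = assistant.answer(b"What is the weather?")
print(reply.decode("utf-8"))  # prints "You asked: What is the weather?"
```

In a real assistant each stand-in would be replaced by a call to an actual model, but the order of the steps would stay the same.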

OpenAI’s excellent Whisper model handles much of the speech-to-text conversion, and the company is also introducing a new text-to-speech model that is said to be able to create “human-like sound, just from text and a few seconds of sample speech.” You’ll be able to choose a voice for ChatGPT from five options, but OpenAI seems to think the model has far more potential than that. For example, OpenAI is working with Spotify to translate podcasts into other languages while preserving the host's voice. There are many interesting uses for synthetic voices, and OpenAI could become a big part of that industry.

Still, the fact that a decent synthetic voice can be created from just a few seconds of recorded audio opens the door to all sorts of potentially problematic use cases. “These capabilities introduce new threats, such as the possibility of malicious actors impersonating public figures and the like,” the company wrote in a blog post announcing the new features. That’s why the model isn’t available for general use; instead, it will be tightly controlled and limited to specific use cases and partnerships.

The image search feature is somewhat similar to Google Lens. You snap a photo, and ChatGPT tries to understand what you're asking and responds accordingly. You can also use the drawing tool in the app to make the question as clear as possible, or speak or type questions related to the picture. This is where ChatGPT's conversational nature comes in particularly handy: instead of running a search, getting the wrong answer, and then running a new search, you can nudge the bot and refine the answer as you go. This is very similar to what Google is doing with multimodal search.

Obviously, adding image input to ChatGPT also has its drawbacks. One of them arises when you point ChatGPT at a person: OpenAI says it has deliberately limited “ChatGPT’s ability to analyze and make direct statements about people,” both for the sake of accuracy and for privacy. This means that one of the most sci-fi visions of artificial intelligence, the ability to look at someone and tell who they are, is not going to be realized anytime soon. Which is probably a good thing.

Almost a year after ChatGPT's launch, it seems that OpenAI is still trying to figure out how to give its model more features and capabilities without creating new problems and downsides. With these releases, the company has tried to walk that fine line by deliberately limiting what its new models can do. But this approach will not always work. As more and more people use voice control and image search, and as ChatGPT moves closer to becoming a truly multimodal, useful virtual assistant, it will become increasingly difficult to maintain all of these safeguards.