ChatGPT is expanding well beyond its original role as a text-based chatbot. OpenAI announced today that it is adding new voice and image capabilities.
Since its launch, the hugely popular AI assistant has become one of technology's most notable success stories, letting users generate essays, poems, and summaries from simple text prompts. Now ChatGPT is set to become far more interactive, as users will soon be able to hold voice conversations with the chatbot.
The announcement coincides with Amazon's commitment to invest up to $4 billion in OpenAI's rival Anthropic, the latest move in an intensifying race among the world's biggest tech companies to lead in generative AI. Google is trying to catch up with its Bard chatbot, Meta is pursuing an open-source approach to gain an edge, and Microsoft has aligned itself closely with OpenAI.
Today marks a notable evolution for the generative AI movement, with OpenAI meshing the familiar world of voice-based assistants with its powerful large language models (LLMs).
For instance, a user can ask ChatGPT out loud to make up a bedtime story on the spot, using a few spoken prompts to steer the narrative. Or they can ask it a question and hear ChatGPT answer in spoken form.
ChatGPT users will also be able to search for answers using images: for example, they can upload a picture of something and ask ChatGPT to explain what it is or to give instructions for accomplishing a task.
The new features will begin rolling out to paying Plus and Enterprise subscribers over the next two weeks. To activate voice, users need to open the "settings" menu in the app, go to "new features," and opt in to voice conversations. They then tap the headphone button in the top-right corner and choose from five different voices.
The voice feature combines two components: a new text-to-speech model that can generate human-like audio from text and just a few seconds of sample speech, and OpenAI's open-source Whisper speech recognition system, which transcribes spoken words into text. OpenAI said it worked with professional voice actors to create each of the five voices.
Spotify was also unveiled as a launch partner, with the music-streaming giant introducing a new feature that lets podcasters translate their shows from English into Spanish, French, or German while retaining their own voice. OpenAI appears to be wary of attracting criticism, however, and is not making the technology broadly available: the launch is limited to a handful of podcasters, including Dax Shepard, Monica Padman, Lex Fridman, Bill Simmons, and Steven Bartlett.
“The new voice technology — capable of crafting realistic synthetic voices from just a few seconds of real speech — opens doors to many creative and accessibility-focused applications,” the company wrote in a blog post. “However, these capabilities also present new risks, such as the potential for malicious actors to impersonate public figures or commit fraud.”
Initially, voice will be available only in the ChatGPT Android and iOS apps as an opt-in beta, while image search will arrive on all platforms and be enabled by default.