Learn how to transcribe user audio (STT, Speech-to-Text).
Speech-to-text (STT) is a technology that converts audio input into text. It allows users to interact with AI Agents using their voice.
There are two ways to use the speech-to-text feature: real-time transcription of the user's audio track, and one-off transcription of a recorded audio blob.
The recommended approach is real-time transcription: it connects to the user's audio track and streams the transcription results as the user speaks.
You can configure STT options in the config when calling join(). You may explicitly specify a language or enable automatic language detection. Specifying a language improves transcription accuracy. (link) If no language is specified, STT runs in the same language as the AI Avatar.
With this configuration, STT will automatically start when the joined event is triggered.
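A minimal sketch of this flow. The SDK entry point and the config option names (stt, language, detectLanguage) are placeholders for illustration, not the documented API; check the SDK reference for the real names.

```ts
// Placeholder import; substitute your SDK's actual entry point.
import { createClient } from "your-sdk";

const client = createClient({ apiKey: "YOUR_API_KEY" });

// STT starts automatically once the joined event fires,
// so register the listener before calling join().
client.on("joined", () => {
  console.log("Joined the room; STT is running.");
});

// Configure STT in the config passed to join().
// The option names below are assumptions for illustration.
await client.join({
  stt: {
    language: "en-US",       // omit to use the AI Avatar's language
    // detectLanguage: true, // or enable automatic language detection
  },
});
```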
To start streaming the user's audio, call unmuteUserAudio(). If the track is already published, it resumes sending audio data; if it is not published, it publishes the track and starts transmitting data.
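For example, using the hypothetical client from the sketch above:

```ts
// Publishes the user's audio track if it isn't published yet;
// otherwise simply resumes sending audio data.
await client.unmuteUserAudio();
```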
Both unpublishUserAudio() and muteUserAudio() stop STT. The difference is that unpublishUserAudio() disconnects the audio track entirely, while muteUserAudio() keeps the track published but stops sending audio data. To resume STT, use unmuteUserAudio(). If you stopped STT with muteUserAudio(), it resumes more quickly, so this is the recommended approach.
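A sketch of both options, again with the hypothetical client from above:

```ts
// Option 1 (recommended): pause STT but keep the track published.
await client.muteUserAudio();
// ...later: resumes quickly because the track is still published.
await client.unmuteUserAudio();

// Option 2: stop STT and disconnect the audio track entirely.
await client.unpublishUserAudio();
```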
Transcription results are delivered via events.
By default, transcription uses the same language as the agent. If you'd like to detect the spoken language automatically, you can configure that when joining the room. (link)
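A sketch of a result listener. The event name and payload fields (text, isFinal) are assumptions for illustration; see the SDK's event reference for the actual shape.

```ts
// Hypothetical event name and payload shape.
client.on("transcription", (result: { text: string; isFinal: boolean }) => {
  if (result.isFinal) {
    console.log("Final transcript:", result.text);
  } else {
    console.log("Partial transcript:", result.text);
  }
});
```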
The second approach converts a recorded audio blob into text.
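A sketch of this one-off flow using the browser's MediaRecorder API. The method name transcribeAudio() and its return shape are placeholders, since the source doesn't name the actual API:

```ts
// Record a short clip with MediaRecorder, then transcribe the resulting blob.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(stream);
const chunks: BlobPart[] = [];

recorder.ondataavailable = (e) => chunks.push(e.data);
recorder.onstop = async () => {
  const blob = new Blob(chunks, { type: "audio/webm" });
  const { text } = await client.transcribeAudio(blob); // placeholder API
  console.log("Transcript:", text);
};

recorder.start();
setTimeout(() => recorder.stop(), 5000); // record ~5 seconds
```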
To learn more and see the full feature set, see the following topics: