1. Voice commands are captured through the remote control's built-in microphone or a smart device.
2. When an utterance begins, the client sends information about the current screen to the server, and the server extracts context information from that screen data to determine the current state.
3. The voice signal is converted into text (speech-to-text, STT).
4. The intent of the utterance is identified through morphological, syntactic, and semantic analysis of the transcribed text.
5. If the utterance matches a pattern registered in the language model, the corresponding command is passed to the client.
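The five steps above can be sketched as a minimal pipeline. The function and payload names below (`extract_context`, `speech_to_text`, `analyze_intent`, the `screen_info` dictionary) are hypothetical stand-ins, not Voiceable's actual API:

```python
def extract_context(screen_info):
    # Step 2: derive a context (state) from the current screen information.
    return screen_info.get("menu", "home")

def speech_to_text(audio):
    # Step 3: STT stub; a real system would decode the audio signal.
    # For this sketch we assume the "audio" is already a transcript.
    return audio

def analyze_intent(text, context):
    # Step 4: toy intent analysis via keyword matching, standing in for
    # real morphological/syntactic/semantic analysis.
    lowered = text.lower()
    if "volume up" in lowered:
        return {"command": "VOLUME_UP"}
    if lowered.startswith("turn") and lowered.endswith("on"):
        return {"command": "TUNE", "target": text.split()[1], "context": context}
    return None

def handle_utterance(audio, screen_info):
    # Steps 1-5 combined: returns the command to pass back to the client,
    # or None when no registered pattern matches.
    context = extract_context(screen_info)
    text = speech_to_text(audio)
    return analyze_intent(text, context)
```

A call such as `handle_utterance("Turn TED on", {"menu": "live"})` returns a command object that carries both the recognized intent and the screen context it was resolved in.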
The technical elements for implementing this screen-recognition voice service are as follows:
• Context
A context is the basic processing unit: a hierarchical structure, with priorities for matching, that contains the patterns predefined for a given screen. To identify the intent of an utterance and respond accurately, the system must check the state of the screen and register the matching patterns and commands in each context. For example, if the user says "Volume up," the same action can be taken in any state; but for a command like "Turn TED on," the system must decide whether to tune to the live channel or open the program in the catch-up menu.
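One way to picture priority-ordered contexts is a list checked top to bottom, where screen-specific contexts shadow a global one. The `CONTEXTS` table and command names below are illustrative assumptions, not Voiceable's internal representation:

```python
import re

CONTEXTS = [
    # Highest priority first: screen-specific contexts are tried before
    # the global context, so the same utterance can resolve differently.
    {"screen": "catchup", "patterns": {r"turn (\w+) on": "PLAY_CATCHUP"}},
    {"screen": "live",    "patterns": {r"turn (\w+) on": "TUNE_CHANNEL"}},
    {"screen": None,      "patterns": {r"volume up": "VOLUME_UP"}},  # global
]

def match_command(utterance, screen):
    # Walk contexts in priority order; a context applies when it is
    # global (screen is None) or matches the current screen state.
    for ctx in CONTEXTS:
        if ctx["screen"] in (None, screen):
            for pattern, command in ctx["patterns"].items():
                m = re.fullmatch(pattern, utterance.lower())
                if m:
                    return command, m.groups()
    return None, ()
```

With this table, "Turn TED on" yields `PLAY_CATCHUP` on the catch-up screen but `TUNE_CHANNEL` on the live screen, while "Volume up" falls through to the global context in any state.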
• Entity
An entity is an object that appears in the user's speech patterns, mainly nouns such as channel names, VOD categories, program names, content names, and channel numbers.
Entities can be defined per context, and voice recognition can be handled with preregistered entities alone, which yields fast performance. For example, if "Avengers" is registered in the movie-category entity, the content can be searched immediately when only "Avengers" is spoken, without interpreting an entire sentence.
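The fast path for bare entity mentions can be sketched as a dictionary lookup. The `ENTITIES` table and the URI-like values are hypothetical examples:

```python
# Hypothetical preregistered entity tables, keyed by category.
ENTITIES = {
    "movie":   {"avengers": "vod:movie/avengers"},
    "channel": {"ted": "live:ch/ted"},
}

def resolve_entity(utterance):
    # A single-token utterance that matches a registered entity resolves
    # immediately, with no sentence-level interpretation needed.
    token = utterance.strip().lower()
    for category, table in ENTITIES.items():
        if token in table:
            return category, table[token]
    return None
```

So speaking just "Avengers" resolves straight to the movie entry, skipping the full NLU pipeline.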
• Pattern
A pattern is used to determine whether a command uttered as a sentence, recognized by the language model, matches one of the registered patterns. Since different people express the same voice command in different ways, various patterns must be prepared according to the characteristics of each language. For example, a channel change may be phrased with "tune," "move," or "go," and the system must grasp the phrase accurately and send the channel-change command to the client.
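Mapping several verb variants onto one canonical command can be sketched with a single regular expression. The verb set and the `CHANGE_CHANNEL` command name are assumptions for illustration:

```python
import re

# Hypothetical synonym set: different verbs map to one canonical command.
CHANNEL_VERBS = ("tune", "move", "go")
PATTERN = re.compile(
    r"(?:%s)(?:\s+to)?\s+(?:channel\s+)?(\w+)" % "|".join(CHANNEL_VERBS)
)

def parse_channel_change(utterance):
    # "Tune 11", "Move to TED", and "Go to channel 7" all normalize
    # to the same command with only the target differing.
    m = PATTERN.fullmatch(utterance.lower())
    return {"command": "CHANGE_CHANNEL", "target": m.group(1)} if m else None
```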
• Language Model
A language model analyzes text to create patterns and extracts sentences by predicting the probabilities between the words being spoken. Voiceable's NLU determines the meaning of a user command by referring to a hierarchical language model that is defined dynamically according to the application's state. The language model is specialized for TV viewing environments and is optimized for creating patterns and grasping user intent.
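The idea of "predicting the probability between words" can be illustrated with a toy bigram model; this is a generic sketch trained on a made-up TV-domain corpus, not Voiceable's actual model:

```python
from collections import defaultdict

# Tiny made-up in-domain corpus for the sketch.
corpus = [
    "turn ted on",
    "turn the volume up",
    "go to channel seven",
]

# Count word-pair (bigram) occurrences, with <s> marking sentence start.
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1

def score(sentence):
    # Score a candidate sentence as the product of its bigram probabilities;
    # unseen transitions score zero in this unsmoothed sketch.
    words = ["<s>"] + sentence.split()
    p = 1.0
    for a, b in zip(words, words[1:]):
        total = sum(counts[a].values())
        p *= counts[a][b] / total if total else 0.0
    return p
```

A likely in-domain word order such as "turn ted on" scores higher than a scrambled one, which is how a language model helps pick the right transcript and pattern.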
• Client Application
The client application receives voice data from the device, sends it to the server, and transmits the real-time screen information captured at the moment of the utterance. It can also display the transcribed speech in real time in a prompt window on the screen while the user is speaking.
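The client's request can be pictured as audio bundled with a screen snapshot so the server can select a context. The payload shape and field names below are assumptions for illustration, not the actual protocol:

```python
import json
import time

def build_request(audio_chunk, screen):
    # Hypothetical client payload: the audio is sent together with a
    # snapshot of the current screen state at the moment of utterance.
    return json.dumps({
        "timestamp": time.time(),
        "audio": audio_chunk,   # e.g. base64-encoded audio in a real client
        "screen": screen,       # e.g. {"menu": "live", "channel": 7}
    })
```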