Voice API: Speech recognition

Introduction
Modes of operation
     Interface modes
     Processing modes
     Freeform or dictionary-based
Usage examples
How does it work in implementations?


Introduction

Speech recognition is the process of recognizing WHAT is being said, i.e. the process of converting speech to text. It is not a biometric method of identification, but rather a method of "translating" spoken input into textual input.

Speech recognition is most commonly used for transcription, Internet bots (chatbots), voice assistants, etc.

The MachineSense speech recognition engine/platform is based directly on our academic research and uses state-of-the-art deep learning methods. It follows the latest research in the field of speech recognition and is regularly maintained and updated.

MachineSense speech recognition is available in 27 languages, and new ones are constantly being added. Currently supported languages are: English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, Polish, Uzbek and Korean.

Modes of operation

Interface modes

MachineSense speech recognition is available in three interface modes: online file-based, online streaming and phone streaming. Online file-based mode is used to transcribe freshly saved or previously recorded audio files, while online streaming mode is used for real-time transcription of live internet-routed audio streams. Phone-streaming mode is used for phone conversations.

Interfaces to our speech recognition engine are available as a REST API (online modes) and as a SIP X-HEADER API (phone mode).
Additionally, real-time online streaming is available through our WebRTC-based interface, which sends back the results (transcriptions) in real time as the audio is streamed to our servers, consolidating those results at each half-second speech pause. Similar results can be achieved with our SIP X-HEADER API by sending the audio in chunks and receiving the results both in real time (approximate) and on pause (definite). Most speech-recognition companies apply similar result methods.
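
For illustration, here is a minimal sketch of a call to the online file-based interface. The endpoint URL, authentication header, parameter names and response shape below are assumptions made for the sake of the example, not the documented contract; consult the API reference for the actual values.

    # Minimal sketch of the online file-based mode (Python, using the
    # "requests" library). Endpoint, headers, field names and response
    # shape are hypothetical.
    import requests

    API_URL = "https://api.machinesense.example/v1/recognize"  # hypothetical endpoint

    with open("caller_response.wav", "rb") as audio:
        resp = requests.post(
            API_URL,
            headers={"X-Api-Key": "YOUR_API_KEY"},       # assumed auth header
            files={"audio": audio},                      # the recorded audio file
            data={"language": "en", "mode": "full"},     # assumed parameter names
        )

    resp.raise_for_status()
    print(resp.json()["text"])  # assumed field carrying the transcription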

Processing modes

MachineSense speech recognition is available in three processing modes: Full, Quick and Ultra-quick.

These modes differ in the amount of processing done on the audio file/stream. Full mode is the most precise, but also the slowest, and requires the most processing power. Quick mode is somewhat less precise, but faster and less demanding. Ultra-quick mode is the fastest; we also call it "mobile" or "portable" mode. While it is theoretically the least precise, it often proves the most useful, especially when combined with the dictionary-based mode (see below) for ultra-quick return of results to IVRs and Internet Bots.

Full mode is mostly used for transcription, while Quick and Ultra-quick modes are mostly used for IVRs and Internet Bots.
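
As a rough guide to choosing a processing mode, the following snippet maps typical use cases to modes. The identifiers "full", "quick" and "ultra-quick" are assumed values for the mode parameter shown in the sketch above; the actual identifiers may differ.

    # Assumed mode identifiers -- check the API reference for the real values.
    MODE_BY_USE_CASE = {
        "offline_transcription": "full",        # most precise, slowest, heaviest
        "ivr_freeform": "quick",                # slightly less precise, faster
        "ivr_with_dictionary": "ultra-quick",   # fastest; pair with dictionary mode
    }

    data = {"language": "en", "mode": MODE_BY_USE_CASE["ivr_with_dictionary"]}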

Freeform or dictionary-based

MachineSense speech recognition is available in two recognition modes: freeform and dictionary-based.

In some cases, it is not necessary to use the freeform mode (which analyzes any spoken utterance and tries to find the appropriate words for it). In fact, for most practical implementations, especially with IVRs and Internet Bots, it is much more useful to use the dictionary-based mode, in which the system expects only a limited number of words or phrases to be spoken. This is why we developed this mode: it is much faster and more precise than the freeform mode.

In implementations of the dictionary-based mode, we custom-tailor dictionaries of possible answers. These may be, for example, names of cities, names of products, names of people, etc. In addition, if alphanumeric input is to be spelled out by the users, we will preset a spelling dictionary for all languages needed for that particular implementation, alongside any other dictionaries.

When MachineSense analyzes the voice input and finds a match in the dictionary, it returns the result much more quickly, and with unparalleled precision. This is why we recommend this mode for most practical IVR/Bot implementations. Dictionary mode is proprietary to MachineSense and is one of our USPs.
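
As an illustration, here is a sketch of a dictionary-based request. The recognition/dictionary parameters and their format are assumptions made for the example; in a real implementation the dictionaries are custom-tailored with us, as described above.

    # Hypothetical dictionary-based request: the system only expects
    # one of the listed city names to be spoken.
    import requests

    CITY_DICTIONARY = ["Berlin", "Hamburg", "Munich", "Cologne", "Frankfurt"]

    with open("destination_reply.wav", "rb") as audio:
        resp = requests.post(
            "https://api.machinesense.example/v1/recognize",  # hypothetical endpoint
            headers={"X-Api-Key": "YOUR_API_KEY"},            # assumed auth header
            files={"audio": audio},
            data={
                "language": "en",
                "recognition": "dictionary",              # vs. "freeform" (assumed values)
                "dictionary": ",".join(CITY_DICTIONARY),  # assumed list format
            },
        )

    resp.raise_for_status()
    print(resp.json()["text"])  # a dictionary entry, or a no-match indicator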

Usage examples

Here are a few practical examples of usage:

  • IVR (Interactive Voice Response) telephony systems - speech recognition is used to recognize what the caller is saying in response to pre-recorded prompts. This is used in many call centers, banks, etc.
    The response is received in the form of text, which is then used to determine the next step in the IVR process. It can also be combined with speaker recognition to identify the caller, and then use speech recognition to determine the next step.
    Typically, call centers ask callers to identify themselves by entering an ID number, stating their date of birth, or stating some other information. MachineSense speech recognition can be used to recognize that information and use it to identify the caller, as well as to continue the IVR process by, for example, narrowing the scope of further choices and funneling the call to appropriate call-center personnel (see the sketch after this list).
    Most high-end call-center software packages allow the calls, and the responses received to certain prompts, to be recorded. These responses are sent to the MachineSense cloud-based API, and results are returned in sub-second time, allowing for real-time processing of the calls.
  • Internet bots / interactive voice assistants operate in a similar manner to IVRs. They also need to recognize what the user is saying and then respond appropriately. Once voice input is received from the user, it is sent to the MachineSense speech recognition API, and the response is received in the form of text. This text can then be used to determine the next step in the conversation with the user.
  • Online call and conference-call transcriptions are becoming a necessity nowadays. Whether it is a 1:1 call, a group or conference call, or even an online seminar, users expect to receive a transcription of the conversation, not just a voice recording. In times of increased use of online communication, this allows for more efficient following of multiple conversations, as well as much easier and smarter commenting, return actions and conversation forwarding.
    One typical use in this area is creating summaries of conversations, which can be used for later reference, or for creating minutes of the meeting, quick comments, follow-ups, etc.
  • Pre-processing and funneling of users into type groups. Users are asked to speak a few product-identification numbers/characters and are funneled into particular groups that receive the same information.
    Example in the transport industry: the user states their destination city, or the last few digits of their boarding pass or issued ticket, and from that moment on receives information about that flight/drive/etc.
    Retail example: the user states the type of product they are seeking and is guided to that particular product group, possibly receiving tips and incentives.
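
To make the IVR example above concrete, here is a sketch of the recognize-and-branch step. It reuses the hypothetical endpoint and parameters from the earlier sketches; the department names, queue numbers and payload fields are invented for illustration.

    # Hypothetical IVR fragment: recognize a caller's reply in dictionary
    # mode and route the call based on the recognized text.
    import requests

    def recognize(audio_path: str, dictionary: list[str]) -> str:
        """Send a short recording to the (hypothetical) MachineSense API
        in dictionary mode and return the recognized text."""
        with open(audio_path, "rb") as audio:
            resp = requests.post(
                "https://api.machinesense.example/v1/recognize",  # hypothetical
                headers={"X-Api-Key": "YOUR_API_KEY"},            # assumed auth header
                files={"audio": audio},
                data={
                    "language": "en",
                    "mode": "ultra-quick",               # fast turnaround for IVRs
                    "recognition": "dictionary",
                    "dictionary": ",".join(dictionary),
                },
            )
        resp.raise_for_status()
        return resp.json()["text"]

    DEPARTMENT_QUEUE = {"billing": 101, "support": 102, "sales": 103}  # invented routing

    answer = recognize("caller_reply.wav", list(DEPARTMENT_QUEUE))
    queue = DEPARTMENT_QUEUE.get(answer.lower())
    if queue is not None:
        print(f"Routing caller to queue {queue}")
    else:
        print("No match -- replay the prompt or fall back to keypad input")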

How does it work in implementations?

  • Your call-center software or internet platform includes a plugin/widget sending either (short) recordings or streams to the MachineSense speech recognition API, using one of the modes mentioned above (online file-based, online streaming or phone streaming).
  • In the MachineSense Dashboard, you specify the data-originating and data-receiving endpoints, as well as the language of the speech being sent and the processing type (freeform or dictionary-based, data-analysis depth, transcription mode, etc.).
    You may also specify additional information that you want to receive back with the results, such as the ID of the call, the ID of the user, or anything else: this is free-form data, used for your own identification of the response.
  • After your files or streams have been received and processed, you receive the results in the form of text, which you can then process further within your programming environment.

    Note: In the online-streaming and phone-streaming modes, you will receive the results on your receiver endpoint in two stages (see the sketch after this list):
    In the first stage, you receive the results in real time, as the audio is being streamed to our servers. These are approximate results.
    In the second stage, which is triggered by the end of a sequence (most typically identified by a half-second pause), the previous real-time results are consolidated ("sharpened") and you receive the definite voice transcript.
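
As an illustration of the two-stage delivery, here is a sketch of a receiver endpoint. It assumes the results are pushed as JSON with "call_id", "stage" and "text" fields; the actual payload shape and delivery mechanism are defined by the API, and Flask is just one convenient way to expose such an endpoint.

    # Hypothetical receiver endpoint for two-stage streaming results.
    from flask import Flask, request

    app = Flask(__name__)
    live_text = {}  # call_id -> latest approximate transcript

    @app.post("/machinesense/results")
    def results():
        payload = request.get_json()
        call_id = payload["call_id"]          # assumed payload fields
        if payload["stage"] == "approximate":
            # Stage 1: rolling real-time hypothesis; display it, but
            # don't act on it yet.
            live_text[call_id] = payload["text"]
        else:
            # Stage 2: consolidated ("sharpened") transcript after a
            # speech pause; this is the definite result to store/act on.
            print(f"[{call_id}] final: {payload['text']}")
            live_text.pop(call_id, None)
        return "", 204

    if __name__ == "__main__":
        app.run(port=8080)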