Introduction
Modes of operation
Interface modes
Processing modes
Usage examples
How does it work in implementations?
Speech recognition is the process of recognizing WHAT is being said, i.e. of converting speech to text. It is not a biometric method of identification, but rather a method of "translating" spoken input into textual input.
Speech recognition is most commonly used for transcriptions, Internet bots (chatbots), voice assistants, etc.
The MachineSense speech recognition engine/platform is built directly on our academic research and uses state-of-the-art deep learning methods. It follows the latest research in the field of speech recognition and is maintained and updated regularly.
MachineSense speech recognition is available in 27 languages, and new ones are added constantly. Currently supported languages are: English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, Polish, Uzbek and Korean.
MachineSense speech recognition is available in three interface modes: online file-based, online streaming and phone streaming. Online file-based mode is used for transcribing saved or previously recorded audio files; online streaming mode is used for real-time transcription of live internet-routed audio streams; phone-streaming mode is used for phone conversations.
Interfaces to our speech recognition engine are available as REST API (online mode) and as SIP X-HEADER API (phone mode).
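As a sketch of how a file-based REST call might be assembled on the client side — the endpoint URL, header name, and field names below are illustrative assumptions, not the actual MachineSense API:

```python
import json

# Hypothetical request builder for the file-based REST interface.
# Endpoint, header, and field names are illustrative assumptions only.
API_URL = "https://api.example.com/v1/transcribe"  # placeholder endpoint

def build_transcription_request(audio_path, language="en", mode="full"):
    """Assemble the parts of a hypothetical multipart POST request."""
    headers = {"X-Api-Key": "YOUR_API_KEY"}  # placeholder credential
    params = {"language": language, "mode": mode}
    files = {"audio": audio_path}  # would be an open file handle in practice
    return {"url": API_URL, "headers": headers, "params": params, "files": files}

request = build_transcription_request("meeting.wav", language="de")
print(json.dumps(request["params"]))
```

In a real integration the returned parts would be passed to an HTTP client as a multipart POST; here they are only printed for inspection.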
Additionally, real-time online streaming mode is available through our WebRTC-based interface, which sends back the results (transcriptions) in real time as the audio is streamed to our servers, consolidating those results at each half-second speech pause. Similar results can be achieved with our SIP X-HEADER API by sending the audio in chunks and receiving the results both in real time (approximate) and on pause (definite). Most speech-recognition companies apply similar result-delivery methods.
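The chunked sending described above can be sketched as follows; the 16 kHz/16-bit audio format and the 100 ms chunk size are illustrative assumptions, not part of the actual protocol:

```python
# Illustrative only: split a raw PCM buffer into fixed-duration chunks,
# as a streaming client would before sending them to the server.
SAMPLE_RATE = 16000      # assumed 16 kHz mono
BYTES_PER_SAMPLE = 2     # assumed 16-bit PCM
CHUNK_MS = 100           # assumed 100 ms per chunk

def split_into_chunks(pcm: bytes):
    """Cut a PCM buffer into chunks of CHUNK_MS milliseconds each."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]

# One second of silence yields ten 100 ms chunks.
chunks = split_into_chunks(b"\x00" * SAMPLE_RATE * BYTES_PER_SAMPLE)
print(len(chunks))  # → 10
```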
MachineSense speech recognition is available in three processing modes: Full mode, Quick mode, and Ultra-quick mode.
These modes differ in the amount of processing done on the audio file/stream. Full mode is the most precise, but also the slowest, and requires the most processing power. Quick mode is somewhat less precise, but faster and less demanding. Ultra-quick mode is the fastest; we also call it "mobile" or "portable" mode. While it is theoretically the least precise, it proves the most useful in practice, especially when combined with the dictionary-based mode (see below) for ultra-quick return of results to IVRs and Internet Bots.
Full mode is mostly used for transcriptions, while quick and ultra-quick modes are mostly used for IVRs and Internet Bots.
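The guidance above can be condensed into a small helper that picks a processing mode from the use case. The mode names come from this document; the helper itself is a hypothetical convenience, not part of the MachineSense API:

```python
# Hypothetical helper mirroring the guidance above. The mapping follows
# the text; the function is not part of the MachineSense API.
MODE_FOR_USE_CASE = {
    "transcription": "full",   # most precise, slowest
    "ivr": "ultra-quick",      # fastest, pairs well with dictionary mode
    "chatbot": "ultra-quick",
}

def pick_mode(use_case: str) -> str:
    """Return a processing mode, defaulting to the balanced 'quick'."""
    return MODE_FOR_USE_CASE.get(use_case, "quick")

print(pick_mode("transcription"))  # → full
print(pick_mode("ivr"))            # → ultra-quick
```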
MachineSense speech recognition is available in two recognition modes: freeform and dictionary-based.
In some cases it is not necessary to use the freeform mode (which analyzes any spoken utterance and tries to find appropriate words for it). In fact, for most practical implementations, especially with IVRs and Internet Bots, it is much more useful to use the dictionary-based mode, where the system expects only a limited set of words or phrases. This is why we developed this mode, which is much faster and more precise than the freeform mode.
In implementations of the dictionary-based mode, we custom-tailor dictionaries of possible answers. These may be, for example, names of cities, names of products, or names of people. Also, if alphanumeric input is to be spelled out by users, we will preset a spelling dictionary for all languages needed for that particular implementation, in addition to any other dictionaries.
When MachineSense analyzes the voice input and finds a match in the dictionary, it returns the result much more quickly, and with unparalleled precision. This is why we recommend this mode for most practical IVR/Bot implementations. Dictionary mode is proprietary to MachineSense and is one of our USPs.
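A toy illustration of the dictionary-based idea follows. The real engine matches against the dictionary at the acoustic level; this sketch only mimics the behavior on already-transcribed text, and the city dictionary and cutoff value are invented for the example:

```python
from difflib import get_close_matches

# Toy text-level sketch of dictionary-based matching. The real engine
# matches acoustically; this only mimics the behavior on text.
CITY_DICTIONARY = ["berlin", "munich", "hamburg", "cologne", "frankfurt"]

def match_utterance(utterance: str, dictionary, cutoff=0.7):
    """Return the closest dictionary entry, or None if nothing is close."""
    candidates = get_close_matches(utterance.lower().strip(), dictionary,
                                   n=1, cutoff=cutoff)
    return candidates[0] if candidates else None

print(match_utterance("Berln", CITY_DICTIONARY))  # → berlin
print(match_utterance("tokyo", CITY_DICTIONARY))  # → None
```

Constraining the search space this way is what makes dictionary mode both faster and more precise than matching against an open vocabulary.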
Here are a few practical examples of usage:
Note: In online-streaming or phone-streaming mode, you will receive the results at your receiver endpoint in two stages:
In the first stage, you receive the results in real time, as the audio is streamed to our servers. This is the approximate result.
In the second stage, triggered by the end of a sequence (most typically identified by a half-second pause), the previous real-time results are consolidated ("sharpened") and you receive the definite voice transcript.
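Client-side handling of the two stages might look like the sketch below. The message shape (a "stage" field with "approximate"/"definite" values) is an assumed format for illustration, not the actual wire protocol:

```python
# Illustrative client-side handling of the two result stages. The
# message shape ("stage" / "text" fields) is an assumed format.
def handle_results(messages):
    """Track the latest approximate text; collect definite transcripts."""
    live_text = ""
    transcripts = []
    for msg in messages:
        if msg["stage"] == "approximate":
            live_text = msg["text"]          # provisional text, e.g. for a live UI
        elif msg["stage"] == "definite":
            transcripts.append(msg["text"])  # consolidated, final segment
            live_text = ""
    return transcripts

stream = [
    {"stage": "approximate", "text": "hello wor"},
    {"stage": "approximate", "text": "hello world"},
    {"stage": "definite", "text": "Hello, world."},  # sent after the pause
]
print(handle_results(stream))  # → ['Hello, world.']
```

The approximate results overwrite each other as they arrive, while each definite result closes out a segment — mirroring the two-stage delivery described above.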