API: Speech recognition

Introduction
Call
Response


Introduction

Speech recognition converts speech into text. It is not a biometric method of identification; it "translates" spoken input into text. Typical uses include transcription, Internet bots (chatbots), and voice assistants.

The MachineSense API can handle both recordings of speech (audio files) and live speech (streaming). Recordings are the primary use case: in most implementations the speech is recorded first and then sent to the server for processing. These recordings are mostly very small files/BLOBs (as used, for example, in voice chatbots or IVRs), and the results are always definite.

This page describes (ad-hoc) recording-based speech recognition. It assumes standard REST API calls: you send the recording to the server, specifying the language it was spoken in, and receive a response with the content of the speech (what was said, in textual form).

Streaming speech recognition is described separately. In brief: in streaming mode, MachineSense servers expect SIP/RTP or WebRTC streams on our endpoints and send the content of the speech back to the webhooks predefined in your account settings. Two types of results are sent in streaming mode. Immediate (approximate) results are sent in near real time. Final results are sent when the end user stops speaking or pauses for more than 1 second; at that point the results are "sharpened" and sent as definite.

Read more about features of our speech recognition.

Call

(Call from your server to our API.)

POST /voice/v1/speech

Parameters / body:

            {
                "audio": "string",
                "api_key": "string",
                "ref": "string",
                "dictionary": "string",
                "language": "string",
                "precision": "string"
            }

Parameters explained:

  • "audio" = (mandatory) Voice-sample BLOB: AAC audio, Base64-encoded.
  • "api_key" = (mandatory) Your developer key, found in your Settings.
  • "ref" = (optional, default="") Any string you wish to send back to yourself; it is returned unchanged with the response.
  • "dictionary" = (optional, default="") Dictionary to be used for speech recognition. If omitted, free-form mode is used (no dictionary). If specified, it must be a valid dictionary name, as defined in your account settings. Each dictionary is prepared per use case, from your (customer) input and our customization for that particular need.
  • "language" = (mandatory) Language of the spoken utterance (example: NL-NL or EN-US).
  • "precision" = Depth and processing level of the speech recognition algorithm. Possible values are: "full", "quick", "ultraquick".
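
As an illustration of the call described above, here is a minimal Python sketch that builds the request body and posts it. It is not an official client. The host in BASE_URL, the helper names build_speech_payload and post_speech, and the "Content-Type: application/json" header are assumptions not given on this page. The page also does not say whether "precision" is optional; this sketch simply leaves it out when no value is passed.

```python
import base64
import json
import urllib.request

# Hypothetical host (assumption); replace with the host for your account.
BASE_URL = "https://api.example.com"

# Values listed for "precision" on this page.
ALLOWED_PRECISION = {"full", "quick", "ultraquick"}


def build_speech_payload(aac_bytes, api_key, language,
                         ref="", dictionary="", precision=None):
    """Build the JSON body for POST /voice/v1/speech."""
    if precision is not None and precision not in ALLOWED_PRECISION:
        raise ValueError(f"unsupported precision: {precision!r}")
    payload = {
        # AAC audio bytes, Base64-encoded as text.
        "audio": base64.b64encode(aac_bytes).decode("ascii"),
        "api_key": api_key,
        "ref": ref,
        "dictionary": dictionary,
        "language": language,
    }
    if precision is not None:
        payload["precision"] = precision
    return payload


def post_speech(payload, base_url=BASE_URL, timeout=30):
    """Send the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + "/voice/v1/speech",
        data=json.dumps(payload).encode("utf-8"),
        # JSON content type is an assumption based on the JSON body shown above.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call might then look like post_speech(build_speech_payload(open("sample.aac", "rb").read(), "YOUR_KEY", "EN-US", ref="call-17")), where the file name, key, and ref value are placeholders.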

Response

Code: 200

Default response:

            {
                "result": "Ok",
                "code": 0,
                "message": "string",
                "data": {
                  "ref": "string",
                  "speech": "string"
                }
            }

Response explained:

  • "result" = "Ok" or "Err" (error).
  • "code" = 0 or an error code (int).
  • "message" = If "result" is "Err", a textual description of the error (string).
  • "data" = JSON object with data:
    • "ref" = Referential free-form string sent in the call and returned here for your reference.
    • "speech" = Content of the speech in text form, in the language specified in the call.
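
The response fields above could be handled along the following lines. This is a sketch only: the SpeechApiError class and the extract_transcript helper are hypothetical names, and the sketch assumes that an error response carries "result", "code", and "message" as described, with "data" possibly absent.

```python
class SpeechApiError(Exception):
    """Raised when the API reports "result": "Err" (hypothetical helper)."""

    def __init__(self, code, message):
        super().__init__(f"speech API error {code}: {message}")
        self.code = code
        self.message = message


def extract_transcript(response):
    """Return (ref, speech) from a decoded response, or raise on error."""
    if response.get("result") != "Ok":
        raise SpeechApiError(response.get("code"), response.get("message", ""))
    # Tolerate a missing or null "data" object rather than failing on lookup.
    data = response.get("data") or {}
    return data.get("ref", ""), data.get("speech", "")
```

Returning "ref" alongside the text lets the caller match the transcript to the request that produced it.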
