API: Voice (speaker) verification

Introduction
Call
Response


Introduction

Voice verification is the procedure of confirming a person's claimed identity by comparing the person's voice sample (utterance) to the enrolled voiceprint of that identity.

During this procedure, you will send both the enrolled voiceprint (obtained during the voice-enrollment procedure) and the current voice sample/utterance of the person attempting to verify, together with some options.
As a result, you will receive a response with a similarity score (how similar the two voiceprints are) and, if you opted for it by setting the corresponding option in the call, the content of the spoken phrase.

A MachineSense customer/partner will initially send a voice-utterance BLOB from their website or mobile app to their own servers (this step is completely independent of MachineSense), and then call the MachineSense API, including that BLOB and some parameters.
In your call to the MachineSense API, you will include both the enrolled vector and the freshly recorded utterance of the end user attempting to identify/verify.
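
Below is a minimal sketch of that server-side call in Node.js (18+, using the built-in fetch). It follows the call documented below, but the base URL, file name, API key, "ref" value, and the enrolled x-vector are placeholders to replace with your own values.

            // Minimal server-side sketch (Node.js 18+, built-in fetch).
            // The base URL, "sample.aac", the API key and the "ref" value are
            // placeholders; the enrolled x-vector comes from your own storage.
            const fs = require("fs");

            async function verifyVoice(xvector) {
              // Read the freshly recorded utterance and Base64-encode it (AAC expected)
              const audio = fs.readFileSync("sample.aac").toString("base64");

              const response = await fetch("https://api.machinesense.example/voice/v1/verify_voice", {
                method: "POST",
                headers: { "Content-Type": "application/json" },
                body: JSON.stringify({
                  audio,                   // current utterance (mandatory)
                  xvector,                 // previously enrolled voiceprint (mandatory)
                  api_key: "YOUR_API_KEY", // from your Settings
                  ref: "login-attempt-42", // free-form correlation string
                  method: "text_indep",
                  phrase: "",
                  content: { include: false, language: "", precision: "full" }
                })
              });
              return response.json();      // { result, code, message, data: {...} }
            }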

In order to help customers start quickly with such a client-side (web-based) implementation, MachineSense offers a set of examples and code, ready to copy/paste into your applications and customize/modify. Basic operations such as capturing the audio, setting up parameters, etc. are already present in those examples.
Examples are written in vanilla JavaScript, and can be used in any web-based application.
You can find them on our Demo page as well as our GitHub repository.
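
As an illustration of what those client-side examples cover, here is a minimal vanilla-JavaScript capture sketch. It records roughly five seconds of microphone audio and uploads the blob to your own server; the "/upload-utterance" URL is a placeholder, and note that browsers record WebM/Opus or MP4/AAC depending on the engine, so converting the sample to AAC for the API call is assumed to happen on your server.

            // Minimal capture sketch in vanilla JavaScript. Records ~5 seconds
            // from the microphone and uploads the blob to your own server
            // (the "/upload-utterance" URL is a placeholder).
            async function captureUtterance() {
              const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
              const recorder = new MediaRecorder(stream);
              const chunks = [];

              recorder.ondataavailable = (e) => chunks.push(e.data);
              const stopped = new Promise((resolve) => (recorder.onstop = resolve));

              recorder.start();
              setTimeout(() => recorder.stop(), 5000); // text-independent needs at least 5 s
              await stopped;

              stream.getTracks().forEach((t) => t.stop()); // release the microphone

              const blob = new Blob(chunks, { type: recorder.mimeType });
              await fetch("/upload-utterance", { method: "POST", body: blob });
            }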

Customer creates their own client-side page or app, including capture of the user's voice utterance.
An exception to this is when the customer uses the MachineSense whitelabel client-side (in which case this is already done for them) or the MachineSense WASM component. The latter, however, is a two-step process and relates to the pre-built / ready-to-use modules.

More details about single-step and two-step processes.

Call

(Call from your server to our API.)

POST /voice/v1/verify_voice

Parameters / body:

            {
                "audio": "string",
                "xvector": [
                    0
                ],
                "api_key": "string",
                "ref": "string",
                "method": "string",
                "phrase": "string",
                "content": {
                    "include": false,
                    "language": "string",
                    "precision": "string"
                }
            }

Parameters explained:

  • "audio" = (mandatory) Voice-sample BLOB, encoded as AAC in b64. Speaker's utterance claiming the identity (current).
  • "xvector" = (mandatory) Previously enrolled vector of the person trying to identify now.
  • "api_key" = (mandatory) Your developer key found in your Settings
  • "ref" = (optional, default="") A string that you've sent back to yourself (free-form, whatever you need to identify the call after you receive this response).
  • "method" = (mandatory) MachineSense method required. Can be one of:
    • "text_dep_fixed" - Text-dependent with fixed phrase (such as "My voice is my passport"). Phrase specified in "phrase" parameter.
    • "text_dep_dob" - Text-dependent with end-user's date-of-birth. If content.include = true, will receive in the response the actual date-of-birth.
    • "text_dep_user" - Text-dependent with user-defined pass-phrase (but "text dependent" because that pass-phrase should be repeated when authenticating later.
    • "text_indep" - Text-independent, meaning - any phrase of at least 5 seconds can be used for enrolment (and any phrase of at least 5 seconds can be used for verification later).
    • "liveness" - Enrolment with possible liveness detection on verification. This is done by presenting to the end user a list of words (for example - numbers 0-9) to speak for enrolment. On verification, few of those words, in shuffled order are presented to verify within n seconds. This is assuming that potential spoofer did not have enough time to record/edit/deepfake the spoofed voice sample (playback).
      List of words is supplied under "phrase" parameter, as a space-separated string (Example: "one two three" or "cat dog house").
  • "phrase" - Specified for "text_dep_fixed" or "liveness" methods. Otherwise - empty. A choice of words mentioned here should also be presented to the user prior to obtaining the current utterance, so that user conforms to this. If "liveness", this should be a list of choices from enrolled words (for example a 4 - digit ad-hoc PIN-code to be spoken, if user enrolled by speaking numbers from 0 to 9).
  • "content" - Object relating to speech-recognition (content of the spoken text).
    • "language" - if "include": true, specifies the language for which speech recognition should be performed (example: NL-NL or EN-US)
    • "precision" - depth and processing level of the speech recognition algorithm. Possible values are: "full", "quick", "ultraquick". For best results, should be the same as was used for enrollment.

Response

Code: 200

Default response:

            {
                "result": "Ok",
                "code": 0,
                "message": "string",
                "data": {
                  "ref": "string",
                  "vector": [
                    0
                  ],
                  "speech": "string",
                  "distance": 0.01,
                  "confidence": 99.99
                }
            }

Response explained:

  • "result" = "Ok" or "Err" (error)
  • "code" = 0 or error code (int)
  • "message" = If result "Err" - textual description (string)
  • "data" = JSON object with data
    • "ref" = Referential free-form string sent in either single-step- or two-step process (on session init).
    • "vector" = Array of numbers, end-user's voice-print when asking for verify. X-Vector. Can be used for re-enrollment.
    • "speech" = Content of the spoken enrolment utterance. If content asked for. In language specified in the call.
    • "distance" = Distance of the verify-utterance from the enrolled voiceprint. Smaller the distance, more likely it is the same person. Value between 0 and 1 (float).
    • "confidence" = Confidence score that it's the same person. Value from 0-100 (float), representing the percentage
