API: Voice (speaker) enrolment

Introduction
Call
Response


Introduction

Voice enrolment is the first step in the voice-biometric authentication process. In this step, a mathematical representation of the main characteristics of a person's voice is derived by analysing a voice-sample (utterance) of that person. This representation is called a "voice-print". The enrolment call returns it as a voice-vector (a series/array of numerical values), which is later compared against the voice-sample of a person trying to authenticate.

As mentioned, such vectors (face-prints and voice-prints) are non-PII (not Personally Identifiable Information). This means that, by themselves, they cannot be used to reconstruct the face-image or voice-sample from which they were created.

Initially, the MachineSense customer/partner sends a captured voice-sample of their end-user from their website or mobile app to their own servers (this step is completely independent of MachineSense), and then calls the MachineSense API, including that voice-sample and some parameters.
To help customers start quickly with such a client-side (web-based) implementation, MachineSense offers a set of examples and code, ready to copy/paste into your applications and to customize/modify. Basic operations such as capturing the voice, setting up parameters, formatting the BLOB, etc. are already present in those examples.
Examples are written in vanilla JavaScript, and can be used in any web-based application.
You can find them on our Demo page as well as our GitHub repository.

The customer creates their own client-side page or app, including the capture of the user's voice-sample. An exception is when the customer uses the MachineSense white-label client side, in which case this is already done for them. The latter, however, is a two-step process and relates to pre-built / ready-to-use modules.

When capturing the voice-sample, you should choose the manner in which you will authenticate your end-user. MachineSense offers several authentication options (which is not common among voice-auth providers). Depending on the verification option chosen, you should also choose how you collect the end-user's voice-sample, and give the user instructions to do that correctly.
Those options are (among others):

  • Text-dependent enrolment/verification on a fixed pre-set sentence (such as "My voice is my passport").
  • Text-dependent enrolment/verification on a user-defined sentence (the user's date-of-birth or a randomly chosen passphrase, for example).
  • Text-independent enrolment/verification, where the user may speak anything, in a language of their choice (as long as the duration is satisfactory, typically at least 5 seconds). In this case, the user's voice-sample might even be taken in the background.
  • Liveness-checked enrolment/verification, where the voice characteristics are used together with the content of the speech, and form the basis for verifying randomly presented strings (which could not have been pre-recorded, hence anti-spoofing). For more details on this, please see this doc about MachineSense speaker recognition.
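The liveness option relies on presenting the end-user with an unpredictable selection of previously enrolled words. The following is a minimal sketch of how a customer's own server might build such a challenge, assuming the enrolled words are kept as a space-separated string. The function name `make_liveness_challenge` and its parameters are hypothetical and are not part of the MachineSense API.

```python
import random
from typing import List, Optional


def make_liveness_challenge(
    enrolled_words: str,
    count: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick `count` distinct enrolled words in random order.

    `enrolled_words` is a space-separated string such as "one two three".
    """
    # Drop duplicates while keeping the original order of the words.
    words = list(dict.fromkeys(enrolled_words.split()))
    if count < 1 or count > len(words):
        raise ValueError("count must be between 1 and the number of distinct words")
    # SystemRandom draws from the OS entropy source, so the selection is
    # hard to predict; a seeded Random can be passed in for testing.
    rng = rng or random.SystemRandom()
    return rng.sample(words, count)
```

For example, `make_liveness_challenge("zero one two three four five", count=3)` returns three of the six words in a random order, which the client can then display for the end-user to speak.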

More details about single-step and two-step processes.

Call

(Call from your server to our API.)

POST /voice/v1/enroll_voice

Parameters / body:

            {
                "audio": "string",
                "api_key": "string",
                "ref": "string",
                "method": "string",
                "phrase": "string",
                "content": {
                    "include": false,
                    "language": "string",
                    "precision": "string"
                }
            }
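As an illustration, the body above could be assembled on your server roughly as follows. This is a sketch under stated assumptions: the base URL `https://api.example.com` is a placeholder (use the host given in your MachineSense account), the `Content-Type: application/json` header is assumed, and the helper name `build_enroll_request` is hypothetical. The request object is only built here, not sent.

```python
import base64
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, an assumption
ENROLL_PATH = "/voice/v1/enroll_voice"


def build_enroll_request(
    audio_bytes: bytes,
    api_key: str,
    method: str,
    phrase: str = "",
    ref: str = "",
    include_content: bool = False,
    language: str = "",
    precision: str = "",
) -> urllib.request.Request:
    """Build (but do not send) the POST request for voice enrolment."""
    body = {
        # The voice-sample is sent as a base64-encoded string.
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "api_key": api_key,
        "ref": ref,
        "method": method,
        "phrase": phrase,
        "content": {
            "include": include_content,
            "language": language,
            "precision": precision,
        },
    }
    return urllib.request.Request(
        BASE_URL + ENROLL_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` once the real host and API key are in place.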

Parameters explained:

  • "audio" = (mandatory) Voice-sample BLOB, encoded as AAC in b64.
  • "api_key" = (mandatory) Your developer key found in your Settings
  • "ref" = (optional, default="") Any string you wish to send back to yourself, that you will receive with the later response to your webhook.
  • "method" = (mandatory) MachineSense method required. Can be one of:
    • "text_dep_fixed" - Text-dependent with fixed phrase (such as "My voice is my passport"). Phrase specified in "phrase" parameter.
    • "text_dep_dob" - Text-dependent with the end-user's date-of-birth. If content.include = true, the response will contain the actual date-of-birth.
    • "text_dep_user" - Text-dependent with a user-defined pass-phrase ("text-dependent" because that pass-phrase must be repeated when authenticating later).
    • "text_indep" - Text-independent, meaning - any phrase of at least 5 seconds can be used for enrolment (and any phrase of at least 5 seconds can be used for verification later).
    • "liveness" - Enrolment with possible liveness detection on verification. This is done by presenting the end-user with a list of words (for example, the numbers 0-9) to speak for enrolment. On verification, a few of those words are presented in shuffled order, to be spoken within n seconds. This assumes that a potential spoofer does not have enough time to record/edit/deepfake a spoofed voice sample (playback).
      List of words is supplied under "phrase" parameter, as a space-separated string (Example: "one two three" or "cat dog house").
  • "phrase" = Specified for the "text_dep_fixed" or "liveness" methods. Otherwise empty.
  • "content" = Object relating to speech recognition (content of the spoken text)
    • "include" - boolean, whether to include the speech-recognition result (content of the spoken utterance) in the response
    • "language" - if "include": true, specifies the language for which speech recognition should be performed (example: NL-NL or EN-US)
    • "precision" - depth and processing level of the speech recognition algorithm. Possible values are: "full", "quick", "ultraquick".
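The rules listed above can be checked on your own server before the call is made. The sketch below is one possible pre-flight check. The function name `validate_enrol_params` is hypothetical, and requiring "language" and checking "precision" only when "include" is true is an assumption rather than documented API behaviour.

```python
from typing import List

METHODS = {"text_dep_fixed", "text_dep_dob", "text_dep_user", "text_indep", "liveness"}
PHRASE_METHODS = {"text_dep_fixed", "liveness"}  # methods that use "phrase"
PRECISIONS = {"full", "quick", "ultraquick"}


def validate_enrol_params(params: dict) -> List[str]:
    """Return a list of problems found in the request body (empty if none)."""
    errors = []
    for key in ("audio", "api_key", "method"):
        if not params.get(key):
            errors.append(f'"{key}" is mandatory')

    method = params.get("method")
    if method and method not in METHODS:
        errors.append(f'unknown method "{method}"')

    phrase = params.get("phrase") or ""
    if method in PHRASE_METHODS and not phrase.strip():
        errors.append(f'"phrase" is required for method "{method}"')
    if method in METHODS and method not in PHRASE_METHODS and phrase:
        errors.append(f'"phrase" should be empty for method "{method}"')

    content = params.get("content") or {}
    if content.get("include"):
        # Assumption: language and precision only matter when content is requested.
        if not content.get("language"):
            errors.append('"content.language" is required when "include" is true')
        if content.get("precision") not in PRECISIONS:
            errors.append('"content.precision" must be "full", "quick" or "ultraquick"')
    return errors
```

An empty list means the body passed these local checks. The API may still apply checks of its own.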

Response

Code: 200

Default response:

            {
                "result": "Ok",
                "code": 0,
                "message": "string",
                "data": {
                  "ref": "string",
                  "vector": [
                    0
                  ],
                  "speech": "string"
                }
            }

Response explained:

  • "result" = "Ok" or "Err" (error)
  • "code" = 0 or error code (int)
  • "message" = If "result" is "Err", a textual description of the error (string)
  • "data" = JSON object with data
    • "ref" = Free-form reference string sent in either the single-step or the two-step process (on session init).
    • "vector" = Array of numbers: the end-user's voice-print (an x-vector).
    • "speech" = Content of the spoken enrolment utterance, returned only if content was requested, in the language specified in the call.
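A response of this shape could be handled along the following lines. This is only a sketch: the names `Enrolment`, `EnrolmentError` and `parse_enrol_response` are hypothetical, and treating an empty "speech" value as "no content requested" is an assumption.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


class EnrolmentError(Exception):
    """Raised when the response reports "result": "Err"."""

    def __init__(self, code, message):
        super().__init__(f"enrolment failed ({code}): {message}")
        self.code = code
        self.message = message


@dataclass
class Enrolment:
    ref: str
    vector: List[float]    # the end-user's voice-print
    speech: Optional[str]  # recognized text, if content was requested


def parse_enrol_response(raw: str) -> Enrolment:
    """Turn the JSON response text into an Enrolment, or raise on error."""
    payload = json.loads(raw)
    if payload.get("result") != "Ok":
        raise EnrolmentError(payload.get("code"), payload.get("message", ""))
    data = payload.get("data") or {}
    return Enrolment(
        ref=data.get("ref", ""),
        vector=[float(x) for x in data.get("vector", [])],
        speech=data.get("speech") or None,
    )
```

The returned `vector` is what you would store against your end-user for later verification, and `ref` lets you match the response to your own record.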
