Voice API: Speaker recognition

Introduction
Enrolment and speaker recognition methods
Enrolment steps
Verification
Verification steps


Introduction

Speaker recognition is one of the biometric methods of identifying a person, alongside face recognition, fingerprint recognition, and similar techniques. In fact, most of the procedures involved in speaker recognition are similar to, or strongly resemble, those of face recognition.

A person is first enrolled into the system by providing a voice sample / utterance. This voice sample is used to create a voiceprint of that person. The voiceprint is sent back to you in the form of a vector (an array of numbers) and is used for future identification.

As such, speaker recognition is not dependent on the spoken language. It compares the characteristics of a particular spoken utterance with the voiceprint of the person the speaker claims to be, and it works in any spoken language.

Speaker recognition is used for identity verification, authentication, and identification.

MachineSense creates and verifies/compares the latest generation of vectors (x-vectors), which are much more reliable than legacy i-vectors.

Enrolment and speaker recognition methods

Enrolment is the process of creating a voiceprint of a person. This is done by providing a voice sample / utterance of that person. The resulting voiceprint is sent back to you in the form of a vector (an array of numbers) and is used for future identification.
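
For illustration, such a vector is simply an array of floating-point numbers; the sketch below is not real output, just the shape of the data:

    # Illustrative only: a voiceprint is an array of numbers (an x-vector).
    voiceprint = [0.0231, -0.4402, 0.1187, 0.3094]  # real vectors have many more dimensions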

Enrolment (as well as subsequent identification) can be done in several ways:

  • Text-dependent. Your end-user is enrolled and later checked with a fixed spoken sentence (such as "My voice is my passport", for example).
  • Text-independent. The end-user may speak any sentence (whatever they think of at the moment, with a minimum duration of, typically, 5 seconds). As convenient as this might sound, matches are sometimes returned with a lower confidence score.
    Text-independent enrolment is typically used for forensic matching, where the voiceprint is created from a voice sample of an unsuspecting caller and then compared against a database of known offenders.
    An even more popular use of text-independent enrolment is in call centers, which may want to identify the caller in order to speed up the identification process. In such cases, the caller is automatically recorded speaking any sentence (with a minimum duration) and later recognized by the call-center software, which performs a forensic match or identification of the caller in conjunction with one more parameter (for example: phone number, account number, date of birth, etc.).

In addition to these typical methods, MachineSense introduces a few new methods, which might serve you better and be safer both for your platform and your customers:

  • Text-independent with a fixed password/keyword. In this case, your end-user thinks of a password or pass-phrase and repeats it every time their identity is checked (verified). This can be any sentence (in any language) that your user thinks of, but it should be repeatable, similar to how a password is used.
    This method is safer than the plain text-independent method, as it is less prone to spoofing, and matches are closer (the confidence level is more precise than in standard text-independent mode).
  • Text-independent with an identifying parameter. The end-user speaks a sentence that also contains a second identifying parameter (such as a date of birth or an account number, for example). Both the voice characteristics and the identifying information are then used (the latter through speech recognition), and our customer gets a match (is this the person we are identifying?) as well as the second parameter, thus narrowing down the possible identification candidate(s).
  • Enrolment for liveness-detection-enabled verification. This is a special method of enrolment, where the end-user is asked to speak a specific set of words, which are later used as elements of a randomly presented challenge (during verification).
    This set of words may be, for example, the digits 0 to 9. In that case, the user is presented during verification with a kind of PIN-code challenge ("Please say the following 4 digits: 4751"). If such a challenge is time-limited, there is a fair chance that the user cannot record (or otherwise manipulate) someone else's voice to fulfil the challenge, creating a kind of liveness-detection scenario.
    Another example could be a set of words, such as "Cat, House, Tree, Car, Dog, Mouse", etc. In this case, the user is presented with a shuffled, random selection of those words during the challenge. You may create a different set of words for each user, but the end-user is only ever presented with a random selection from that set. You can also mix and match alphanumeric characters with words, making it even more difficult to spoof (see the sketch after this list).
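
The sketch below shows one way such a challenge could be generated on your side; the word list and challenge length are illustrative choices, not API requirements:

    import random

    # Words the user recorded at enrolment; the digits 0-9 work the same way.
    enrolled_words = ["cat", "house", "tree", "car", "dog", "mouse"]

    def make_challenge(words, length=4):
        """Pick a random, shuffled subset of the enrolled words.

        A fresh, time-limited challenge each time makes it hard to replay
        a recording of someone else's voice.
        """
        return random.sample(words, k=length)

    print("Please say the following words:", " ".join(make_challenge(enrolled_words)))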

Verification performed later, on any of these types of enrolment, must match the enrolment type.
For example, if you have enrolled your user with a text-dependent method, you should also verify them with a text-dependent method. It will be more difficult, or in some cases impossible, to verify that user with a more advanced method (such as the liveness-detection method).

Most other voice-processing platforms/APIs require multiple enrolment utterances per user, in one go. MachineSense asks for a single, clearly spoken utterance, which is enough for us to create a voiceprint of that person. This greatly improves the experience for your end-users, who might otherwise be annoyed by the repeated process.

Enrolment steps

Enrolment is performed in the following steps (a request sketch follows the list):

  • Session initialization, where your portal's back-end calls the MachineSense API, sending the API key and several parameters (including the enrolment type and the callback URL to which results will be sent). MachineSense returns a Unique Session ID (USID).
    In addition to the standard parameters, our customers/partners can also send a free-form field, which is sent back to them when the result is returned. This free-form data is sometimes needed to additionally identify the user as registered in the customer's system, or to carry other utility data that has to be processed when the result comes back.
  • Voice recording on the portal, where your portal records the voice sample and presents the end-user with instructions on how to provide it correctly. Voice samples are recorded and sent as b64-encoded AAC blobs.
  • Sending the voice sample to the MachineSense API, done by posting this b64-encoded blob together with the USID.
  • Processing of the voice, where the MachineSense API creates a voiceprint/vector out of that voice sample and sends it back to our customer/partner. The customer can further identify that vector by the previously acquired USID and by the free-form identification data (at session init, the customer can send anything that will later be used to match up their user(s); this data is returned together with the vectorized sample).
  • Storing the vectorized voiceprint in the customer's database or other data set, together with any other user-identification data.
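
A minimal sketch of these steps from the portal back-end's point of view, assuming hypothetical endpoint paths and field names (the actual schema is defined in the API reference):

    import base64
    import requests  # third-party HTTP client

    API_BASE = "https://api.machinesense.example/v1"  # placeholder base URL
    API_KEY = "your-api-key"

    # Step 1: session initialization; MachineSense returns the USID.
    init = requests.post(f"{API_BASE}/enrolment/init", json={
        "api_key": API_KEY,
        "enrolment_type": "text_independent",            # hypothetical value
        "callback_url": "https://portal.example.com/voice/results",
        "free_form": "customer-user-42",                 # echoed back with the result
    })
    usid = init.json()["usid"]

    # Steps 2-3: the portal records the utterance, then posts it as a
    # b64-encoded AAC blob together with the USID.
    with open("utterance.aac", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    requests.post(f"{API_BASE}/enrolment/sample", json={"usid": usid, "audio": audio_b64})

    # Steps 4-5: the voiceprint/vector arrives at the callback URL together
    # with the USID and free-form data; store it with your user record.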

Verification

Voice verification is performed after successful enrolment. It is the process of checking whether the person claiming an identity is indeed that person. The enrolled vector is sent to the MachineSense API (in the session-init method) and is compared, in the second step, to the voiceprint of a real-time recorded sample of the person trying to authenticate.

Verification should be done according to the enrolment method. If enrolment was done with a text-dependent method, for example, the verification should also be text-dependent, and so on for the other methods.

Verification returns a value indicating the distance between the enrolled voiceprint and the current speaker's voiceprint, a number between 0 and 1. The lower the number, the higher the confidence that it is the same person (a smaller distance between the two voiceprints).
In addition, the MachineSense verification API also returns the current speaker's voiceprint, which can be used for further successive enrolment (re-enrolment), and a percentage-expressed "confidence score" (in case some customers prefer this to the distance). Optionally, it also returns the content (text) of the spoken utterance, if speech recognition was requested together with the speaker recognition.
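
Since the distance is a number between 0 and 1 with lower meaning a closer match, a typical decision is a simple threshold comparison, as in this sketch (the 0.35 threshold is purely illustrative; the acceptance policy is yours, not the API's):

    def is_same_speaker(distance: float, threshold: float = 0.35) -> bool:
        """Accept the speaker if the voiceprint distance is below the threshold.

        Lower distance means higher confidence that the enrolled and current
        speakers are the same person. Tune the threshold to your own
        false-accept / false-reject trade-off.
        """
        return distance < threshold

    print(is_same_speaker(0.25))  # True: the voiceprints are close
    print(is_same_speaker(0.80))  # False: the voiceprints are far apart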

Verification steps

Verification is performed in the following steps (a request sketch follows the list):

  • Session initialization, similar to the enrolment session initialization, but specifying that this is a verification session and supplying the enrolled vector.
  • Voice recording on the portal, where your portal records the voice sample and presents the end-user with instructions on how to provide it correctly (including the instructions needed to match the enrolment method). Voice samples are recorded and sent as b64-encoded AAC blobs.
  • Sending the voice sample to the MachineSense API, done by posting this b64-encoded blob together with the USID.
  • Processing of the voice, where the MachineSense API creates a voiceprint/vector out of that voice sample and compares it to the enrolled voiceprint/vector, before sending the results back to our customer/partner.
    The customer can further identify that vector by the previously acquired USID and by their own free-form data, which was sent during session initialization.
    The returned data includes the distance between the two voiceprints (and possibly the percentage-expressed confidence score), the current speaker's voiceprint, and optionally the content (text) of the spoken utterance, if speech recognition was requested together with the speaker recognition.
  • Optionally, storing (re-enrolling) the new voiceprint vector and making decisions based on the returned distance/confidence score and, optionally, the content of the spoken text.
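
A minimal sketch of the verification flow, again with hypothetical endpoint paths and field names:

    import base64
    import requests  # third-party HTTP client

    API_BASE = "https://api.machinesense.example/v1"  # placeholder base URL
    API_KEY = "your-api-key"

    # The vector stored at enrolment, loaded from your own database
    # (truncated example values).
    stored_voiceprint = [0.0231, -0.4402, 0.1187, 0.3094]

    # Step 1: session initialization, flagged as verification and carrying
    # the enrolled vector.
    init = requests.post(f"{API_BASE}/verification/init", json={
        "api_key": API_KEY,
        "vector": stored_voiceprint,
        "callback_url": "https://portal.example.com/voice/results",
        "speech_recognition": True,   # optional: also return the spoken text
    })
    usid = init.json()["usid"]

    # Steps 2-3: post the freshly recorded sample as a b64-encoded AAC blob.
    with open("attempt.aac", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    requests.post(f"{API_BASE}/verification/sample", json={"usid": usid, "audio": audio_b64})

    # Steps 4-5: the callback receives the distance (and optional confidence
    # score), the new voiceprint (usable for re-enrolment), and optionally
    # the recognized text; apply your threshold and store the new vector.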