Executing the jar file directly invokes the speaker diarization method developed for broadcast news recordings. Suppose we need to compute the diarization "./showName.seg" of the audio file "./showName.wav". The command line to accomplish this would be:
/usr/bin/java -Xmx2048m -jar ./LIUM_SpkDiarization.jar \
    --fInputMask=./showName.wav --sOutputMask=./showName.seg --doCEClustering showName
java: the name of the Java virtual machine (JVM).
-Xmx2048m: sets the memory of the JVM to 2048 MB, which is appropriate to process a one-hour show.
-jar ./LIUM_SpkDiarization.jar: specifies the jar to use.
--fInputMask=./showName.wav: the name of the audio file. It can be in Sphere format or Wave format (16 kHz / 16-bit PCM mono); the type is auto-detected according to the extension.
--sOutputMask=./showName.seg: the output file containing the segmentation.
--doCEClustering: if set, the program computes the NCLR/CE clustering at the end, which minimizes the diarization error rate. If this option is not set, the program stops right after the detection of the gender, and the resulting segmentation is sufficient for a transcription system.
showName: the name of the show.
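As a minimal sketch, the command line described above can also be assembled programmatically, for instance from Python. The helper name build_diarization_command and its parameters are hypothetical; only the flags listed above are used, and the jar path and heap size are the ones from the example.

```python
import shlex


def build_diarization_command(show, jar="./LIUM_SpkDiarization.jar",
                              heap_mb=2048, ce_clustering=True):
    """Return the argument list for the command line described above.

    Hypothetical helper: it only strings together the options documented
    in this section and does not run anything.
    """
    cmd = [
        "/usr/bin/java",
        f"-Xmx{heap_mb}m",
        "-jar", jar,
        f"--fInputMask=./{show}.wav",
        f"--sOutputMask=./{show}.seg",
    ]
    if ce_clustering:
        # Without this flag the program stops after gender detection.
        cmd.append("--doCEClustering")
    cmd.append(show)
    return cmd


cmd = build_diarization_command("showName")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Java and the jar are available.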
The other possible options are:
--trace: displays information during processing.
--help: displays a brief usage guide of the tools.
--system=current: selects the diarization system (currently unused).
--saveAllStep: saves every step of the diarization. They are saved in the following files:
--loadInputSegmentation: loads the initial segmentation (UEM) from the file specified by the option --sInputMask. By default, the initial segmentation is composed of one segment ranging from the start to the end of the show.
Caution: there is an unresolved problem under Windows: loading resources (such as the GMMs) does not work.
The Sphinx 4 tools are used for the computation of features from the signal. For the first three steps described below, the features are composed of 13 MFCCs with coefficient C0 as energy, and are not normalized (no CMS or warping). Different sets of features are used for further steps, similarly computed using the Sphinx tools.
Before segmenting the signal into homogeneous regions, a safety check is performed over the features. They are checked to ensure that there is no sequence of several identical features (usually resulting from a problem during the recording of the sound), for such sequences would disturb the segmentation process.
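The safety check can be pictured with the short sketch below. It is an illustration only, not the toolkit's code: the function name find_identical_runs and the min_run parameter are assumptions, and the text does not say what is done with the sequences once they are found.

```python
def find_identical_runs(features, min_run=2):
    """Return (start, end) pairs, end exclusive, of runs of identical frames.

    features is a list of feature vectors (lists of floats); min_run is a
    hypothetical parameter giving the shortest run worth reporting.
    """
    runs = []
    start = 0
    for i in range(1, len(features) + 1):
        # A run ends at the end of the data or when the frame changes.
        if i == len(features) or features[i] != features[start]:
            if i - start >= min_run:
                runs.append((start, i))
            start = i
    return runs


frames = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [5.0, 6.0]]
runs = find_identical_runs(frames)
```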
A pass of distance-based segmentation detects the instantaneous change points corresponding to segment boundaries. It detects the change points through a generalized likelihood ratio (GLR), computed using Gaussians with full covariance matrices. The Gaussians are estimated over a five-second window sliding along the whole signal. A change point, i.e. a segment boundary, is present in the middle of the window when the GLR reaches a local maximum.
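A rough sketch of this pass is given below, assuming the usual form of the GLR between the two halves of the window, with one full-covariance Gaussian per half and one for the whole window. The function names, the half_window parameter (in frames) and the toy data are illustrative assumptions; a real five-second window would span many more frames.

```python
import math


def covariance(frames):
    """Maximum-likelihood full covariance matrix of a list of vectors."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    return [[sum((f[a] - mean[a]) * (f[b] - mean[b]) for f in frames) / n
             for b in range(d)] for a in range(d)]


def log_det(matrix):
    """Log-determinant of a positive-definite matrix (Cholesky factorization)."""
    d = len(matrix)
    chol = [[0.0] * d for _ in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][i] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return total


def glr(left, right):
    """GLR between two sets of frames, each modeled by one Gaussian."""
    n1, n2 = len(left), len(right)
    return ((n1 + n2) / 2.0 * log_det(covariance(left + right))
            - n1 / 2.0 * log_det(covariance(left))
            - n2 / 2.0 * log_det(covariance(right)))


def change_points(frames, half_window):
    """Window centers where the GLR curve reaches a local maximum."""
    centers = range(half_window, len(frames) - half_window + 1)
    scores = [glr(frames[c - half_window:c], frames[c:c + half_window])
              for c in centers]
    return [centers[i] for i in range(1, len(scores) - 1)
            if scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]]


# Toy signal: 40 frames of one source, then 40 frames shifted by (5, 5).
pattern = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = ([pattern[t % 4] for t in range(40)]
          + [(pattern[t % 4][0] + 5.0, pattern[t % 4][1] + 5.0) for t in range(40)])
boundaries = change_points(frames, half_window=8)
```

On this toy signal the true boundary at frame 40 is among the detected maxima. Flat regions of the curve can also produce tiny spurious maxima from rounding, which is one reason a later pass is needed.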
A second pass over the signal fuses consecutive segments of the same speaker from the start to the end of the recording. The measure employs ∆BIC, using full covariance Gaussians, as defined in equation 1 below.
The algorithm is based upon a hierarchical agglomerative clustering. The initial set of clusters is composed of one segment per cluster. Each cluster is modeled by a Gaussian with a full covariance matrix. The ∆BIC measure is employed to select the candidate clusters to group as well as to stop the merging process. The two closest clusters i and j are merged at each iteration until ∆BIC_{i,j} > 0.
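The merging loop can be sketched as follows. To keep the example short it uses one-dimensional features, so each "full covariance" reduces to a variance, and it assumes the common form of ∆BIC with a penalty computed from the two candidate clusters only. The names variance, delta_bic and bic_clustering, the default lam=1.0 and the toy data are assumptions made for illustration.

```python
import math


def variance(samples):
    """Maximum-likelihood variance of a list of floats."""
    mean = sum(samples) / len(samples)
    return sum((x - mean) ** 2 for x in samples) / len(samples)


def delta_bic(ci, cj, lam=1.0):
    """Delta-BIC between two one-dimensional clusters (assumed form)."""
    ni, nj = len(ci), len(cj)
    d = 1  # feature dimension of this toy example
    penalty = 0.5 * (d + d * (d + 1) / 2) * math.log(ni + nj)
    return ((ni + nj) / 2 * math.log(variance(ci + cj))
            - ni / 2 * math.log(variance(ci))
            - nj / 2 * math.log(variance(cj))
            - lam * penalty)


def bic_clustering(segments, lam=1.0):
    """Merge the two closest clusters until the smallest Delta-BIC is > 0."""
    clusters = [list(s) for s in segments]  # one segment per cluster
    while len(clusters) > 1:
        pairs = [(delta_bic(clusters[i], clusters[j], lam), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = min(pairs)
        if best > 0:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
    return clusters


segments = [
    [0.0, 0.1, -0.1, 0.05],
    [0.02, -0.05, 0.08, -0.02],
    [10.0, 10.1, 9.9, 10.05],
    [10.02, 9.95, 10.08, 9.98],
]
clusters = bic_clustering(segments)
```

In this toy case the two segments around 0 are merged, as are the two around 10, and the loop then stops because the remaining pair has a positive ∆BIC.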
∆BIC is defined in equation 1. Let |Σ_i|, |Σ_j| and |Σ| be the determinants of the covariance matrices of the Gaussians associated with the clusters i, j and i + j. λ is a parameter to set up. The penalty factor P (eq. 2) depends on d, the dimension of the features, as well as on n_i and n_j, referring to the total length of cluster i and cluster j respectively.
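For reference, the two equations can be written as follows. This is a reconstruction from the symbols defined above, using the usual form of ∆BIC for full-covariance Gaussians with a penalty over the two clusters only; it should be read as an assumption rather than as a quotation.

```latex
\begin{align}
\Delta BIC_{i,j} &= \frac{n_i + n_j}{2}\,\log|\Sigma|
                  - \frac{n_i}{2}\,\log|\Sigma_i|
                  - \frac{n_j}{2}\,\log|\Sigma_j|
                  - \lambda P \tag{1} \\
P &= \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log(n_i + n_j) \tag{2}
\end{align}
```

With this sign convention, a negative value means that a single Gaussian describes the two clusters better than two separate ones, which agrees with merging stopping once the smallest value becomes positive.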
This penalty factor only takes the length of the two candidate clusters into account whereas the standard factor uses the length of the whole data.
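The difference between the two penalties can be made concrete with a small numeric sketch. It assumes the usual BIC penalty for a full-covariance Gaussian, (1/2)(d + d(d+1)/2) log n, where n is either the combined length of the two clusters (local) or the length of the whole data (standard). The cluster lengths and the rate of 100 frames per second are made-up values.

```python
import math


def bic_penalty(d, n):
    """Penalty for one full-covariance Gaussian of dimension d over n frames."""
    return 0.5 * (d + d * (d + 1) / 2) * math.log(n)


d = 13                  # 13 MFCCs, as in the features described earlier
n_i, n_j = 300, 500     # hypothetical cluster lengths in frames
n_total = 360000        # one hour, assuming 100 frames per second

local_penalty = bic_penalty(d, n_i + n_j)
standard_penalty = bic_penalty(d, n_total)
print(local_penalty, standard_penalty)
```

The local penalty is smaller for short clusters and grows as clusters are merged, whereas the standard penalty stays constant for a given recording.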
A Viterbi decoding is performed to generate a new segmentation. A cluster is modeled by an HMM with only one state, represented by a GMM with 8 components (diagonal covariance). The GMM is learned by EM-ML over the segments of the cluster. The log-penalty between two HMMs is fixed experimentally.
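The decoding can be sketched as below, assuming the per-frame log-likelihoods under each cluster's GMM have already been computed and that moving from one cluster to another costs a single fixed log-penalty. The function name viterbi_segmentation, the switch_penalty parameter and the toy scores are illustrative assumptions.

```python
def viterbi_segmentation(loglik, switch_penalty):
    """Best cluster index per frame for one-state HMMs.

    loglik[t][k] is the log-likelihood of frame t under the model of
    cluster k; switch_penalty (>= 0) is subtracted on every change of
    cluster.
    """
    n_clusters = len(loglik[0])
    score = list(loglik[0])
    back = []
    for frame in loglik[1:]:
        # With a uniform penalty, the best cluster to switch from is the
        # one with the highest score so far.
        best_prev = max(range(n_clusters), key=lambda k: score[k])
        new_score, pointers = [], []
        for k in range(n_clusters):
            stay = score[k]
            switch = score[best_prev] - switch_penalty
            if stay >= switch:
                pointers.append(k)
                new_score.append(stay + frame[k])
            else:
                pointers.append(best_prev)
                new_score.append(switch + frame[k])
        back.append(pointers)
        score = new_score
    state = max(range(n_clusters), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy scores: cluster 0 fits the first six frames except for one blip,
# cluster 1 fits the last four.
loglik = ([[-1.0, -3.0]] * 3 + [[-2.0, -1.0]] + [[-1.0, -3.0]] * 2
          + [[-3.0, -1.0]] * 4)
path = viterbi_segmentation(loglik, switch_penalty=2.0)
```

With the penalty set to 2 the single-frame blip is absorbed into the first segment; with a penalty of 0 the path simply follows the best model frame by frame.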
The segment boundaries produced by the Viterbi decoding are not perfect: for example, some of them fall within words. In order to avoid this, the boundaries are adjusted by applying a set of rules defined experimentally. They are moved slightly in order to be located in low energy regions. Long segments are also cut recursively at their points of lowest energy in order to yield segments shorter than 20 seconds.
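The recursive cutting of long segments can be sketched as follows. The rule set itself is not detailed here, so this only illustrates the last sentence: the function name, the margin parameter (which keeps cuts away from the segment edges) and the toy energy values are assumptions, and lengths are counted in frames rather than seconds.

```python
def split_long_segment(start, end, energy, max_len, margin=1):
    """Cut [start, end) at lowest-energy frames until pieces fit max_len.

    energy[t] is the energy of frame t. margin is a hypothetical guard
    that forbids cuts closer than margin frames to either edge.
    """
    if margin < 1 or 2 * margin > max_len:
        raise ValueError("margin must satisfy 1 <= margin <= max_len / 2")
    if end - start <= max_len:
        return [(start, end)]
    # Lowest-energy frame strictly inside the segment.
    cut = min(range(start + margin, end - margin), key=lambda t: energy[t])
    return (split_long_segment(start, cut, energy, max_len, margin)
            + split_long_segment(cut, end, energy, max_len, margin))


energy = [5, 5, 5, 1, 5, 5, 5, 0, 5, 5]
pieces = split_long_segment(0, len(energy), energy, max_len=4)
```

Here the first cut falls on frame 7, the lowest-energy frame, and the remaining long piece is then cut at frame 3.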
In order to remove music and jingle regions, a segmentation into speech / non-speech is obtained using a Viterbi decoding with 8 one-state HMMs. The eight models consist of 2 models of silence (wide and narrow band), 3 models of wide band speech (clean, over noise or over music), 1 model of narrow band speech, 1 model of jingles, and 1 model of music.
Each state is represented by a GMM with 64 diagonal components, trained by EM-ML on ESTER 1 data. The features are 12 MFCCs completed by ∆ coefficients (coefficient C0 is removed).
Detection of gender and bandwidth is done using a GMM (with 128 diagonal components) for each of the 4 combinations of gender (male / female) and bandwidth (narrow / wide band). Each cluster is labeled according to the characteristics of the GMM which maximizes likelihood over the features of the cluster.
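The labeling rule can be sketched as follows: score the frames of a cluster against each of the four models and keep the label of the best one. The tiny one-dimensional models below are placeholders (the real ones have 128 diagonal components), and the function names and the (weight, means, variances) layout are assumptions made for the example.

```python
import math


def gmm_loglik(frame, gmm):
    """Log-likelihood of one frame under a diagonal-covariance GMM.

    gmm is a list of (weight, means, variances) tuples.
    """
    terms = []
    for weight, means, variances in gmm:
        ll = math.log(weight)
        for x, m, v in zip(frame, means, variances):
            ll -= 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        terms.append(ll)
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def label_cluster(frames, models):
    """Label of the model with the highest likelihood over the cluster."""
    return max(models,
               key=lambda label: sum(gmm_loglik(f, models[label]) for f in frames))


# Placeholder models, one per (gender, bandwidth) combination.
models = {
    ("male", "narrow"): [(1.0, [0.0], [1.0])],
    ("male", "wide"): [(1.0, [3.0], [1.0])],
    ("female", "narrow"): [(1.0, [6.0], [1.0])],
    ("female", "wide"): [(0.5, [9.0], [1.0]), (0.5, [10.0], [1.0])],
}
label = label_cluster([[5.8], [6.1], [6.3]], models)
```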
Each model is learned from about one hour of speech extracted from the ESTER training corpus. The features are composed of 12 MFCCs and ∆ coefficients (C0 is removed). The entire features of the recording are warped using a 3 second sliding window as proposed in , before the features of each cluster are normalized (centered and reduced).
The diarization resulting from this step fits the needs of automatic speech recognition: the segments are shorter than 20 seconds; they contain the voice of only one speaker; and bandwidth and gender are known for each segment.
In the segmentation and clustering steps above, features were used unnormalized in order to preserve information on the background environment, which helps differentiate between speakers. At this point, however, each cluster contains the voice of only one speaker, but several clusters can be related to the same speaker. The contribution of the background environment to the cluster models must be removed (through feature normalization) before a hierarchical agglomerative clustering is performed over the last diarization in order to obtain a one-to-one relationship between clusters and speakers.
Thanks to the greater length of the speaker clusters resulting from the BIC hierarchical clustering, more robust, complex speaker models can be used for this step. A Universal Background Model (UBM), resulting from the fusion of the four gender- and bandwidth-dependent GMMs used earlier, serves as a base. The means of the UBM are adapted for each cluster to obtain the model for its speaker.
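The text does not say which adaptation rule is used. As an illustration only, the sketch below assumes the common MAP adaptation of the means with a relevance factor, leaving weights and variances untouched. The function name, the relevance value of 16 and the toy two-component UBM are assumptions.

```python
import math


def map_adapt_means(frames, ubm, relevance=16.0):
    """Return MAP-adapted means for each component of a diagonal GMM.

    ubm is a list of (weight, means, variances) tuples. Each new mean is
    (sum of posterior-weighted frames + relevance * old mean)
    / (posterior count + relevance).
    """
    dim = len(frames[0])
    counts = [0.0] * len(ubm)
    sums = [[0.0] * dim for _ in ubm]
    for frame in frames:
        logs = []
        for weight, means, variances in ubm:
            ll = math.log(weight)
            for x, m, v in zip(frame, means, variances):
                ll -= 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            logs.append(ll)
        top = max(logs)
        post = [math.exp(l - top) for l in logs]
        norm = sum(post)
        for k, p in enumerate(post):
            counts[k] += p / norm
            for a in range(dim):
                sums[k][a] += p / norm * frame[a]
    return [[(sums[k][a] + relevance * means[a]) / (counts[k] + relevance)
             for a in range(dim)]
            for k, (_, means, _) in enumerate(ubm)]


ubm = [(0.5, [0.0], [1.0]), (0.5, [10.0], [1.0])]
cluster_frames = [[1.0]] * 16
adapted = map_adapt_means(cluster_frames, ubm)
```

In this toy case the component near the data moves halfway towards it (16 frames against a relevance factor of 16), while the distant component keeps its original mean.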
At each iteration, the two clusters that maximize a given measure are merged. The default measure is the Cross Entropy (CE/NCLR). The clustering stops when the measure gets higher than a threshold set a priori.