GMM-Based Maghreb Dialect Identification System
Abstract
While Modern Standard Arabic is the formal spoken and written language of the Arab world, dialects are the major communication mode in everyday life. Identifying a speaker's dialect is therefore critical in the Arabic-speaking world for speech processing tasks such as automatic speech recognition or speaker identification. In this paper, we examine two approaches that reduce the Universal Background Model (UBM) in an automatic dialect identification system covering the following five Arabic Maghreb dialects: Moroccan, Tunisian, and three dialects of the western (Oranian), central (Algiersian), and eastern (Constantinian) regions of Algeria. We applied our approaches to a Maghreb dialect detection corpus containing a collection of 10-second utterances, and we compared the precision obtained on the dialect samples by a baseline GMM-UBM system against that of our improved GMM-UBM system, which uses a Reduced UBM algorithm. Our experiments show that our approaches significantly improve identification performance over purely acoustic features, with an identification rate of 80.49%.
1. Introduction
One of the key challenges in Arabic speech research is to find the differences between Arabic dialects. Most recent work on Arabic speech has addressed the problem of identifying or recognizing Modern Standard Arabic. A few studies have focused on Arabic dialects [1,2], but no research has been carried out on the western Arab countries (the Maghreb). Arabic Maghreb dialects differ from Modern Standard Arabic, and from each other, along many dimensions of the linguistic spectrum: morphologically, lexically, syntactically, and phonologically.
One of the guiding questions of our research was: can a speaker's regional origin or regional dialect within a given language group be determined from a small sample of his or her speech? Our aim was to identify the dialect of a speaker from among the following five Maghrebian ones: Moroccan, Tunisian, and three Algerian dialects (Oranian, Algiersian, and Constantinian).
Since speakers with different dialects often pronounce some words differently and consistently alter certain phonemes, identifying the regional dialect prior to automatic speech identification allows the use of a more restricted pronunciation dictionary during decoding, which results in a reduced search space with lower perplexity. However, no work in the speech literature has addressed the issues related to Maghrebian dialects.
To handle this problem, we improved a UBM-GMM identification system by reducing its Universal Background Model (UBM) using two approaches based on Support Vector Machines (SVMs) that are reduced to Minimal Enclosing Ball (MEB) problems [3] via the fuzzy C-means clustering method. The core idea of these two approaches is to adopt a multiclass SVM formulation and an MEB formulation to reduce the size of the dataset by eliminating the data that lie outside the ball defined by the MEB.
We extracted Mel-Frequency Cepstral Coefficient (MFCC) features from our own corpus (cf. Section 2) and then computed Shifted-Delta Cepstral (SDC) coefficients to identify the dialect of a regional speaker. We conducted a series of experiments to test our approach on spontaneous conversations in five different Arabic Maghreb dialects. We then compared the accuracy of our improved UBM-GMM identification system against a baseline UBM-GMM identification system.
In this paper, we use the variables t, n, and m as follows:
t: frame index; T: number of frames.
n: feature-dimension index; N: feature dimensionality.
m: Gaussian-component index; M: number of Gaussian components.
The remainder of the paper is organized as follows. Sections 2–4 present the Maghrebian corpus, the Gaussian Mixture Model (GMM), and UBM MAP adaptation. Two approaches to reducing data based on MEBs are described in Section 5. In Section 6, we present our proposed dialect identification system based on UBM-GMM. In Section 7, we report on the empirical experiments that we conducted on our own database. Finally, Section 8 concludes, summarizing the contributions of this work and outlining research opportunities in Maghreb dialect identification.
2. Maghreb Dialect Corpus
Maghreb refers to the Arabic geographical region that includes Morocco, Tunisia, Algeria, and western Libya. The Maghreb dialects are the languages spoken in these countries, referred to by the majority of their speakers as Darija, meaning 'dialect'. Because France and Spain colonized the Maghreb region, its dialects combine many French and Spanish words with Arabic affixes to form words. This form of Arabic is not written and is less static, as it changes frequently. Phonologically, Maghreb dialect speakers make no distinction between short and long vowels.
When training a system to identify dialects, it is important to use training and testing corpora recorded under similar acoustic conditions. For our study, however, we used our own corpus of spontaneous speech drawn from movies and TV shows featuring native speakers of the Arabic Maghreb dialects, for which the acoustic conditions are not uniform. The corpus was made up of Moroccan, Tunisian, and three Algerian dialects (Oranian, Algiersian, and Constantinian). We used speech from:
92 speakers (54.19 h) of Moroccan conversational speech, holding out 25 speakers for testing.
98 speakers (49.73 h) of Oranian conversational speech, holding out 40 speakers for testing.
125 speakers (51.32 h) of Algiersian conversational speech, holding out 32 speakers for testing.
80 speakers (45.18 h) of Constantinian conversational speech, holding out 21 speakers for testing.
130 speakers (53.73 h) of Tunisian conversational speech, holding out 43 speakers for testing.
3. Gaussian Mixture Model
GMMs are widely used in many speech identification and recognition applications. They provide a convenient means of modeling complex probability distributions by representing the probability density function of a random variable as a sum of weighted Gaussians. We give a brief outline of the equations that we used to form our models [4].
A GMM is a density model that represents a dialect or language. It combines several Gaussian distributions, each with its own mean, covariance, and weight. Let M be the number of Gaussian components. The GMM models the probability density of an N-dimensional random vector x as a weighted combination of multivariate Gaussian densities:

p(x|λ) = ∑_{m=1}^{M} w_{m} 𝒩(x; μ_{m}, ∑_{m}),

where w_{m} are the Gaussian mixture weights with ∑_{m=1}^{M} w_{m} = 1, μ_{m} are the mean vectors, and ∑_{m} are the diagonal covariance matrices.
The GMM is fully specified by the mean vector, covariance matrix, and weight of each of its components:

λ = {w_{m}, μ_{m}, ∑_{m}}, m = 1, …, M.
In a GMM-based dialect identification system, each dialect is modeled by an M-component GMM with parameters λ_{d} = {w_{m}, μ_{m}, ∑_{m}}, m = 1, …, M. The model parameters λ_{d} for dialect d are estimated from the spectral features with the Expectation-Maximization (EM) algorithm.
The GMM parameters are estimated by maximum likelihood training:

λ̂ = argmax_{λ} ∏_{t=1}^{T} p(x_{t}|λ).
The EM algorithm estimates the maximum-likelihood parameters. The basic idea is to initialize the model and then iteratively re-estimate it so that the new model better fits the data. After training each dialect, we obtained the mean, covariance, and weight of each Gaussian component. The algorithm consists of two main steps: the expectation (E) step and the maximization (M) step. In the E-step, the expected value of the complete-data log-likelihood is computed using the current parameters, while in the M-step this expected function is maximized to obtain the new parameters. The E-step and M-step are iterated until convergence.
First, we define the auxiliary function Q as:

Q(λ, λ̄) = ∑_{t=1}^{T} ∑_{m=1}^{M} Pr(m|x_{t}, λ) log[w̄_{m} 𝒩(x_{t}; μ̄_{m}, ∑̄_{m})],

where m indexes the Gaussian components, λ is the current model parameter set, and λ̄ is the new parameter set.
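To make the E-step/M-step loop concrete, here is a minimal numpy sketch of EM training for a diagonal-covariance GMM. The function and variable names are ours, not the paper's implementation, and the component count is kept small here (the paper's systems use M = 512):

```python
import numpy as np

def em_train(x, M, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM with EM.

    x: (T, N) feature frames; M: number of Gaussian components.
    Returns (weights (M,), means (M, N), variances (M, N)).
    """
    rng = np.random.default_rng(seed)
    T, N = x.shape
    means = x[rng.choice(T, size=M, replace=False)].copy()  # init from data
    variances = np.tile(x.var(axis=0), (M, 1))
    weights = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: responsibility Pr(m | x_t) under the current model
        diff = x[:, None, :] - means[None, :, :]                  # (T, M, N)
        log_comp = (-0.5 * np.log(2 * np.pi * variances).sum(axis=1)[None, :]
                    - 0.5 * (diff**2 / variances[None, :, :]).sum(axis=2))
        log_post = np.log(weights)[None, :] + log_comp
        log_post -= log_post.max(axis=1, keepdims=True)           # stabilize
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)                   # (T, M)
        # M-step: re-estimate weights, means, and variances
        nk = np.maximum(post.sum(axis=0), 1e-10)                  # soft counts
        weights = nk / T
        means = (post.T @ x) / nk[:, None]
        variances = (post.T @ x**2) / nk[:, None] - means**2
        variances = np.maximum(variances, 1e-6)                   # variance floor
    return weights, means, variances
```

Each pass computes the posterior responsibilities (E-step) and then re-estimates all three parameter sets from the responsibility-weighted data (M-step), exactly the iteration described above.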
During identification, an unknown speech utterance X = {x_{1}, …, x_{T}} is classified according to the average log-likelihood produced by each dialect model, given by:

L(X|λ_{d}) = (1/T) ∑_{t=1}^{T} log p(x_{t}|λ_{d}).
The maximum-likelihood classifier hypothesis Ĥ is calculated as:

Ĥ = argmax_{1≤d≤D} L(X|λ_{d}),

where D is the number of dialects.
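The decision rule can be illustrated with a toy example. The two "dialect models" below use hand-picked, hypothetical parameters rather than trained ones; the test utterance is assigned to the model with the highest average per-frame log-likelihood:

```python
import numpy as np

def gauss_pdf(X, mu, var):
    """Diagonal-covariance Gaussian density for each row of X."""
    mu, var = np.asarray(mu), np.asarray(var)
    norm = np.sqrt((2 * np.pi * var).prod())
    return np.exp(-0.5 * ((X - mu) ** 2 / var).sum(axis=1)) / norm

def avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a diagonal GMM."""
    dens = sum(w * gauss_pdf(X, mu, v)
               for w, mu, v in zip(weights, means, variances))
    return float(np.mean(np.log(dens)))

# Two hypothetical 2-component models standing in for dialect GMMs
models = {
    "algiersian": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0]] * 2),
    "oranian":    ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0]] * 2),
}
X = np.full((20, 2), 0.5)                 # 20 frames of a test utterance
scores = {d: avg_loglik(X, *params) for d, params in models.items()}
best = max(scores, key=scores.get)        # argmax over dialect models
```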
Generally, GMMs do not capture temporal dependencies satisfactorily. Hence, the introduction of Shifted-Delta Cepstral coefficients as acoustic features yields acceptable performance [5]. Excellent language identification results [6,7] have established GMMs as a major language identification approach.
4. UBM MAP Adaptation
The EM algorithm estimates the UBM and the dialect models in a similar way. However, to reduce computation and to improve performance when only a limited number of training utterances is available, we propose the use of Bayesian maximum a posteriori (MAP) adaptation.
The MAP principle [8] differs from maximum likelihood in that it treats the parameters λ_{d} of the distribution p(X|λ_{d}) as a random variable with a prior distribution p(λ_{d}). The MAP principle states that we should select the λ̂_{d} that maximizes the posterior probability density:

λ̂_{d} = argmax_{λ_{d}} p(λ_{d}|X) = argmax_{λ_{d}} p(X|λ_{d}) p(λ_{d}).
Using MAP for dialect model adaptation usually means that the prior distribution of the dialect model parameters is represented by the world model parameters [9]. Moreover, by using a global parameter to tune the relative importance of the prior distribution, we can simplify further without loss in performance. Based on the posterior probability of Gaussian m, we calculate ŵ_{m}, μ̂_{m}, and ∑̂_{m}, the new weights, means, and diagonal covariance matrices that correspond, respectively, to the weights, means, and diagonal covariance matrices of the world model.
The posterior probability of Gaussian m given frame x_{t} is defined as follows:

Pr(m|x_{t}) = w_{m} 𝒩(x_{t}; μ_{m}, ∑_{m}) / ∑_{j=1}^{M} w_{j} 𝒩(x_{t}; μ_{j}, ∑_{j}).
Adaptation of all the parameters of Gaussian m is then done as follows. First, the sufficient statistics are accumulated:

n_{m} = ∑_{t=1}^{T} Pr(m|x_{t}), E_{m}(x) = (1/n_{m}) ∑_{t=1}^{T} Pr(m|x_{t}) x_{t}, E_{m}(x²) = (1/n_{m}) ∑_{t=1}^{T} Pr(m|x_{t}) x_{t}²,

then the parameters are updated:

ŵ_{m} = [α n_{m}/T + (1 − α) w_{m}] γ, μ̂_{m} = α E_{m}(x) + (1 − α) μ_{m}, σ̂_{m}² = α E_{m}(x²) + (1 − α)(σ_{m}² + μ_{m}²) − μ̂_{m}²,

where γ is a scale factor ensuring that the adapted weights sum to one.
For each mixture and each parameter, a data-dependent adaptation coefficient α is used in the above equations, defined as:

α = n_{m} / (n_{m} + r),

where r is a fixed relevance factor.
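A minimal numpy sketch of mean-only MAP adaptation (the variant used later in Section 7.3, where only the mean vectors are adapted with r = 16); the function name and array shapes are our illustrative assumptions:

```python
import numpy as np

def map_adapt_means(x, ubm_weights, ubm_means, ubm_vars, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM.

    x: (T, N) dialect frames; UBM params: (M,), (M, N), (M, N).
    Returns the adapted mean vectors (M, N).
    """
    # Posterior Pr(m | x_t) under the UBM
    diff = x[:, None, :] - ubm_means[None, :, :]
    log_comp = (-0.5 * np.log(2 * np.pi * ubm_vars).sum(axis=1)[None, :]
                - 0.5 * (diff**2 / ubm_vars[None, :, :]).sum(axis=2))
    log_post = np.log(ubm_weights)[None, :] + log_comp
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                 # (T, M)
    n_m = post.sum(axis=0)                                  # soft counts
    e_x = (post.T @ x) / np.maximum(n_m, 1e-10)[:, None]    # E_m(x)
    alpha = n_m / (n_m + r)                                 # adaptation coefficient
    # New means interpolate the data estimate and the UBM prior
    return alpha[:, None] * e_x + (1.0 - alpha[:, None]) * ubm_means
```

Components that see many frames (large n_{m}) move toward the data estimate; components that see none keep the UBM means, exactly as the relevance-factor rule above prescribes.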
5. Reducing Data Based on MEBs
This section presents two approaches based on L2-SVMs reduced to MEB problems [3] using the fuzzy C-means clustering method. Algorithms for computing L2-SVMs via the MEB equivalence rely on the greedy computation of a Core-Set, a typically small data subset that yields the same MEB as the full dataset. We therefore formulated a new multiclass SVM problem using Core-Sets to reduce large datasets, which can optimally match the input demands of different background architectures of language or dialect identification systems. The core idea of the two approaches is to adopt a multiclass SVM formulation and an MEB formulation to reduce the dataset, so that the data located far from the ball defined by the Core-Set are eliminated.
5.1 L2-Support Vector Machines
Given a training data set S = {(x_{t}, y_{t})}_{t=1}^{T} with labels y_{t} ∈ {−1, +1}, the L2-SVM solves the primal problem

min_{w,b,ξ} ‖w‖²/2 + b²/2 + (C/2) ∑_{t} ξ_{t}² subject to y_{t}(w·φ(x_{t}) + b) ≥ 1 − ξ_{t},

where φ is the feature map and C the regularization parameter. After introducing Lagrange multipliers α_{t} ≥ 0, the problem to solve is equivalent to the dual

max_{α} − ∑_{t,t′} α_{t} α_{t′} K̃_{tt′} subject to α_{t} ≥ 0, ∑_{t} α_{t} = 1,

where K̃_{tt′} = y_{t} y_{t′} k(x_{t}, x_{t′}) + y_{t} y_{t′} + δ_{tt′}/C.
The optimal value of C is determined using model selection techniques and depends on the degree of noise and overlap among the classes [10]. With respect to the Karush-Kuhn-Tucker (KKT) conditions, the hyperplane parameters are recovered as w = ∑_{t} α_{t} y_{t} φ(x_{t}) and b = ∑_{t} α_{t} y_{t}.
5.2 Minimal Enclosing Balls
In [11], it is shown that the main appeal of the L2-SVM implementation is that it supports a convenient reduction to an MEB problem when the kernel used in the SVM is normalized, that is, k(x,x) = κ ∀x ∈ X, where κ is a constant. The advantage of this equivalence is that the Bãdoiu and Clarkson algorithm [12] can efficiently approximate the solution of an MEB problem to any degree of accuracy.
If the training data set is S = {x_{t}}_{t=1}^{T}, the MEB of S in feature space is the smallest ball ℬ(c, R) enclosing all the points φ(x_{t}):

min_{c,R} R² subject to ‖φ(x_{t}) − c‖² ≤ R², t = 1, …, T.

After introducing Lagrange multipliers, we obtain the following dual problem with respect to the optimality conditions:

max_{α} ∑_{t} α_{t} k(x_{t}, x_{t}) − ∑_{t,t′} α_{t} α_{t′} k(x_{t}, x_{t′}) subject to α_{t} ≥ 0, ∑_{t} α_{t} = 1.

If we consider that the kernel is normalized, k(x,x) = κ, the linear term is constant, and the dual reduces to minimizing ∑_{t,t′} α_{t} α_{t′} k(x_{t}, x_{t′}) under the same constraints.

In [11], it is shown that the primal variables c and R can be recovered from the optimal α as:

c = ∑_{t} α_{t} φ(x_{t}), R = √(κ − ∑_{t,t′} α_{t} α_{t′} k(x_{t}, x_{t′})).
5.3 Core-Set Definition
Bãdoiu and Clarkson [12] define the Core-Set of S as a set C_{S} ⊂ S whose MEB is equivalent to the MEB of all the points in S. A ball ℬ(c, R) is said to be an ε-approximation of the MEB ℬ^{*}(S, c^{*}, R^{*}) of S if R ≤ R^{*} and it contains S up to precision ε, that is, S ⊂ ℬ(c, (1+ε)R). Consequently, a set C_{S,ε} is called an ε-Core-Set if the MEB of C_{S,ε} is an ε-approximation of ℬ^{*}(S, c^{*}, R^{*}) (cf. Fig. 2).
If S is a set of T points in ℝ^{N} and R is the radius of MEB(S), then there exists a subset C_{S} ⊂ S such that:

➢ The center c(C_{S}) of MEB(C_{S}) satisfies d(z, c(C_{S})) ≤ (1+ε)R, ∀z ∈ S; such a subset C_{S} is a Core-Set of S for the MEB.
➢ The size of C_{S} does not depend on the dimension N.
➢ The solution for C_{S} can then approximate the solution for S; for an ε-Core-Set, the solution for C_{S} is within ε of the solution for S.
Next, we present the most common version of the algorithm used in [12].
In [12], it is proved that the Bãdoiu and Clarkson algorithm is a greedy approach that finds an ε-Core-Set of S, converging in no more than ⌈1/ε²⌉ iterations.
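The greedy iteration can be sketched in a few lines of numpy. This is the simple Bãdoiu-Clarkson update (repeatedly move the center a shrinking step toward the current farthest point), with the selected points kept as the Core-Set; names are ours:

```python
import numpy as np

def badoiu_clarkson(points, eps=0.1):
    """Approximate MEB of a point set.

    points: (T, N) array. Runs ceil(1/eps^2) greedy iterations and
    returns (center, radius, core_set_indices).
    """
    points = np.asarray(points, dtype=float)
    c = points[0].copy()                      # start from an arbitrary point
    core = [0]
    for i in range(1, int(np.ceil(1.0 / eps**2)) + 1):
        d = np.linalg.norm(points - c, axis=1)
        far = int(d.argmax())                 # farthest point from the center
        core.append(far)
        c += (points[far] - c) / (i + 1)      # shrinking step toward it
    radius = np.linalg.norm(points - c, axis=1).max()
    return c, radius, sorted(set(core))
```

The farthest points picked along the way form the Core-Set; only they constrain the final ball, which is the property the data-reduction approaches below exploit.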
5.4 Multi-Class Extensions
In a multiclass problem, the samples {x_{t}} belong to a set of L categories c ∈ {c_{l}; l = 1, …, L} with L > 2, and hence the two 'codes' +1 and −1 used to denote the two sides of a separating hyperplane are no longer enough to implement a decision function.
There are two types of extensions to build multiclass SVMs [13,14]. The first is the One-Versus-One (OVO) approach, which uses several binary classifiers that are trained separately and joined into a multicategory decision function. The second is the One-Versus-All (OVA) approach, where a different binary SVM is used to separate each class from all the other classes.
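The two reductions can be sketched generically. Here a trivial nearest-class-mean rule stands in for the binary L2-SVM, so this illustrates only the OVO voting and OVA scoring wiring, not the SVM itself; all names are ours:

```python
import numpy as np
from itertools import combinations

def train_binary(xa, xb):
    """Toy stand-in for a binary classifier: nearest class mean."""
    ma, mb = xa.mean(axis=0), xb.mean(axis=0)
    return lambda x: 1 if np.linalg.norm(x - ma) < np.linalg.norm(x - mb) else -1

def ovo_predict(x, class_data):
    """One-Versus-One: one classifier per class pair, majority vote."""
    votes = {c: 0 for c in class_data}
    for a, b in combinations(sorted(class_data), 2):
        clf = train_binary(class_data[a], class_data[b])
        votes[a if clf(x) == 1 else b] += 1
    return max(votes, key=votes.get)

def ova_predict(x, class_data):
    """One-Versus-All: one classifier per class against the rest, argmax score."""
    scores = {}
    for c in sorted(class_data):
        rest = np.vstack([v for k, v in class_data.items() if k != c])
        mc, mr = class_data[c].mean(axis=0), rest.mean(axis=0)
        # signed "margin": how much closer x is to class c than to the rest
        scores[c] = np.linalg.norm(x - mr) - np.linalg.norm(x - mc)
    return max(scores, key=scores.get)
```

OVO trains L(L−1)/2 small problems on pairs of classes, whereas OVA trains L larger ones; this trade-off is what distinguishes Algorithms 3 and 4 below.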
In [15], it is shown that the multiclass extension of L2-SVMs preserves the reduction to an MEB problem, which is the key requirement of our algorithms for improving the Maghreb dialect identification system, as detailed below.
Let the training dataset be
Several choices are possible for the norm ‖W‖². A common one is the so-called Frobenius norm ‖W‖² = trace(W′W). Hence, the dual of the optimization problem obtained after introducing Lagrange multipliers is:
where
Hence, the primal solutions W and b are obtained, with respect to the Karush-Kuhn-Tucker (KKT) conditions on Eq. (22), as
Now, the question that arises is how to choose the label vectors. Following [16], we define y_{tl} ∈ ℝ as the l^{th} element of the label vector y_{t} corresponding to z_{t}. One convenient choice is:
Then the inner product between the vectors will be:
5.5 MEB and Multi-Class L2-SVM Equivalence
Now the computation of the MEB is carried out in the feature space Z̃ = φ(X), induced from X by the mapping φ: X → Z̃, where the dot products in Z̃ can be computed directly from X using a kernel function
As seen above, the optimization problem in Eq. (17) is equivalent to solving the following quadratic program:

where K̃_{tt′} = k(x_{t}, x_{t′}). This problem coincides with the binary L2-SVM problem in Eq. (23), obtained from the dual objective in Eq. (18), and with its multiclass implementation in Eq. (21). As seen above, for the binary case, we set
5.6 Data Reduction Approaches
The key idea of our method is to cast an L2-SVM as an MEB problem reduced to a Core-Set, using a feature space Z̃ = φ(X) in which the training examples are embedded through the mapping φ. Hence, we first formulated an algorithm to compute the MEB of the image S̃ of S in Z̃ when S is decomposed into a collection of subsets S_{p}. Then, we instantiated the solution for classifiers supporting the reduction to MEB problems (cf. Fig. 3).
Our proposed algorithm computes a Core-Set for each subset S̃_{p} = φ(S_{p}) and takes the union of all the Core-Sets as an approximation to a Core-Set for S̃ = ∪_{p} S̃_{p}. Algorithm 2 depicts the generic procedure: in the first step, the algorithm extracts a Core-Set from each subset S_{p}; in the second step, the MEB of the union of the Core-Sets is computed.
The decomposition of S into a collection of subsets S_{p} is carried out with the fuzzy C-means clustering method, which allows one piece of data to belong to two or more clusters. The algorithm was developed by Dunn and improved by Bezdek [17,18], and it aims to find the optimal clustering of the data.
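A compact numpy sketch of the standard fuzzy C-means updates (the fuzzifier m = 2 and the iteration count are our assumptions, not the paper's settings):

```python
import numpy as np

def fuzzy_cmeans(x, n_clusters, m=2.0, n_iter=100, seed=0):
    """x: (T, N) data. Returns (memberships u (T, C), centers (C, N))."""
    rng = np.random.default_rng(seed)
    u = rng.random((x.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
    centers = None
    for _ in range(n_iter):
        um = u ** m                               # fuzzified memberships
        centers = (um.T @ x) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-10)
        # u_tc = 1 / sum_k (d_tc / d_tk)^(2/(m-1))
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        u = 1.0 / ratio.sum(axis=2)
    return u, centers
```

Because each point carries a graded membership to every cluster, the subsets S_{p} obtained by thresholding u may overlap, which is exactly the behavior the decomposition above relies on.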
As shown in the previous sections, the kernel
So, for both the OVO and OVA multiclass cases, an instantiation of Algorithm 2 consists of computing Core-Sets for the subsets of examples belonging to each pair of classes (respectively, each class), joining them, and finally recovering the solution, as depicted in Algorithm 3 and Algorithm 4, respectively.
6. A UBM-GMM-Based Dialect Identification System
A UBM is a GMM representing the characteristics of all the dialects processed by the dialect identification system. Instead of training dialect-dependent models separately, these models are created later by Bayesian adaptation from the UBM using the dialect-specific training speech. Test observations not covered by the models would typically not favor any particular dialect identification model.
The UBM technique significantly increases the number of mixtures of the GMM, as well as the dimension of the feature vector, thereby making it possible to model the characteristics of each dialect more accurately.
For our experiments, we introduced two systems. The first served as a baseline, as illustrated in Fig. 4. The second improved on the first by applying the data reduction of Algorithm 3 or Algorithm 4 to the UBM, as illustrated in Fig. 5.
In both systems, the mixture components of the adapted model of each dialect retain a correspondence with the UBM (System 1) or the Reduced UBM (System 2), since each model is adapted from the same prior. Therefore, the average log-likelihood score for the dialect-adapted models was computed by scoring only the top 10 significant mixtures. Owing to the correspondence of mixtures between the UBM or Reduced UBM and the dialect models, these significant mixtures can be obtained by selecting the mixtures of the UBM or Reduced UBM with the highest scores. This mixture-testing strategy significantly reduces the score computation.
A universal, dialect-independent background model is created using a portion of the training data from all dialects. Then, using MAP adaptation, all the dialect models were trained by adapting the UBM or Reduced UBM, and identification was performed as defined in the previous section. An advantage of employing UBMs in dialect identification systems is the significant reduction in the quantity of training data required.
The implementation is simple. For each test feature vector, we determine the top 10 highest-scoring mixtures among all UBM mixtures. Because each dialect model was adapted from the UBM or the Reduced UBM, calculating a dialect model's likelihood only requires testing the 10 mixtures that correspond to the top 10 mixtures of the UBM [19]. Applying this approach to the dialect identification system reduces the score computation complexity, as shown below.
Given that both the GMM and the UBM have M mixtures, and that we test the top N mixtures for each of D dialects, the number of mixture tests (Nb_{mixture}) is:

Nb_{mixture} = M + (N × D).

Alternatively, for the standard GMM system, in which all mixtures are tested, the number of mixture tests is:

Nb_{mixture} = M × D.
In our case, we tested five dialects using 512-mixture GMMs and scored the top 10 mixtures from the adapted models. This gives Nb_{mixture} = 512 + (10 × 5) = 562 mixture tests, compared to Nb_{mixture} = 512 × 5 = 2,560 for the standard GMM system, a computational saving of roughly a factor of 4.5. One pitfall of this method is a possible degradation in accuracy.
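The counts above can be checked directly:

```python
# Mixture-test counts from the formulas above (values from the paper)
M, N, D = 512, 10, 5            # UBM mixtures, top-N scored, dialects
top_n_tests = M + N * D         # UBM scored once, then N mixtures per dialect
full_tests = M * D              # every mixture of every dialect model
speedup = full_tests / top_n_tests
```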
7. Experiments
We used our own database, described in Section 2, for all the experiments in this paper. Prior to automatic dialect identification, the speech signals are preprocessed with the zero-frequency filtering (ZFF) method [20]. The ZFF method is robust against various degradations, since most frequency components are attenuated. It starts by differencing the speech signal s(n) to remove any time-varying low-frequency noise:

x(n) = s(n) − s(n−1).
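A rough numpy sketch of the usual ZFF recipe (difference the signal, pass it through two zero-frequency resonators implemented as cumulative sums, then remove the slowly varying trend); the sampling rate, window length, and number of trend-removal passes are our assumptions, not the paper's settings:

```python
import numpy as np

def remove_trend(y, w):
    """Subtract a moving average of odd length w (local-mean removal)."""
    kernel = np.ones(w) / w
    return y - np.convolve(y, kernel, mode="same")

def zff(signal, fs=8000, win_ms=10):
    """Zero-frequency filtered signal, same length as the input."""
    x = np.diff(signal, prepend=signal[0])        # x(n) = s(n) - s(n-1)
    # Each ideal zero-frequency resonator y(n) = 2y(n-1) - y(n-2) + x(n)
    # is a double cumulative sum; two cascaded resonators -> four cumsums.
    y = np.cumsum(np.cumsum(np.cumsum(np.cumsum(x))))
    w = int(fs * win_ms / 1000) | 1               # odd-length averaging window
    for _ in range(3):                            # repeated trend removal
        y = remove_trend(y, w)
    return y
```

The resonator output grows polynomially, so the repeated local-mean subtraction is what turns it back into a bounded signal dominated by the excitation-source information.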
7.1 Parameterization
From the 10-second training and test utterance sets, we extracted 39-dimensional feature vectors consisting of 12 MFCCs derived from 20 filter banks. Each feature vector was extracted at 10-ms intervals using a 30-ms Hamming window on band-limited (300–3,400 Hz) speech. In a first stage, utterance-based cepstral mean subtraction was applied to the features to remove channel distortion. Then, from the cepstral features, we computed 12 SDC coefficients. SDC computation is controlled by four parameters (N, d, P, k), as discussed in [6,7]. For our study, we used the (10,1,3,3) SDC parameter configuration, a choice adopted by many researchers after a series of development tests.
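The SDC stacking with the paper's (N, d, P, k) = (10, 1, 3, 3) configuration can be sketched as follows; the edge padding at utterance boundaries is our assumption:

```python
import numpy as np

def sdc(cepstra, N=10, d=1, P=3, k=3):
    """Shifted-Delta Cepstra: (T, >=N) MFCC frames -> (T, N*k) features.

    For each frame t, stack k delta blocks taken P frames apart, each
    delta spanning +/- d frames: c(t + i*P + d) - c(t + i*P - d).
    """
    c = np.asarray(cepstra)[:, :N]
    T = c.shape[0]
    off = d + (k - 1) * P                         # padding so shifts stay in range
    pad = np.pad(c, ((off, off), (0, 0)), mode="edge")
    blocks = [pad[off + i * P + d : off + i * P + d + T]
              - pad[off + i * P - d : off + i * P - d + T]
              for i in range(k)]
    return np.hstack(blocks)                      # (T, N*k)
```

With N = 10 and k = 3, each frame yields a 30-dimensional SDC vector; stacking deltas at shifted offsets is what gives the GMM a window onto the temporal context it cannot model on its own.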
7.2 Reducing Data
There are two key configurations in the data reduction experiments. In the first, we used the system based on the reduced data from Algorithm 3 (the multiclass OVO approach). In the second, we used the system based on the reduced data from Algorithm 4 (the multiclass OVA approach). We used the fuzzy C-means clustering algorithm in both cases.
7.3 Training
In order to train the UBM, the training data from all the dialects was pooled together. Since this increases the training set size, the trained UBM has a higher number of Gaussian mixtures than the GMMs trained on individual dialects.

We trained a gender-independent, 512-mixture UBM with diagonal covariance matrices. The kernel used for the two algorithms (OVO and OVA approaches) was the Gaussian radial basis function with a fixed value of σ = 0.50. MAP adaptation in training was applied only to the mean vectors of the UBM, with a relevance factor r of 16.
7.4 Testing
The purpose of the test was to find the maximum score for dialect identification. In this process, five clusters with mixture orders from 2 to 512 were created, one for each Maghrebian dialect. For each test sample, the SDC coefficients were calculated and compared with each of the five clusters for mixture orders of 2, 4, 8, 16, and so on up to 512. The test sample was assigned to the cluster with the highest score. Precision was calculated for each dialect using the formula Precision = (Correct/Total) × 100, where Correct is the number of samples correctly classified and Total is the total number of samples given for testing.
Three configurations drive the systematic series of experiments. In the first, we used the baseline system. In the second and third, we used the second system, with the Reduced UBM obtained from Algorithm 3 (the multiclass OVO approach) or Algorithm 4 (the multiclass OVA approach), respectively. Dialect identification performance was then measured as a function of the different training and testing sets. Finally, we compared the dialect identification accuracy of the systems. Tables 1–3 show the percentage precision for the five dialects at different mixture orders.
Our results showed that the system based on the GMM-UBM reduced by the OVA multiclass L2-SVM outperformed the GMM-UBM baseline, with a precision of 74.99% compared to 72.84%. The system based on the GMM-UBM reduced by the OVO multiclass L2-SVM exhibited the best performance, with a precision of 80.49%.
8. Conclusion
Our study addressed the automatic identification of Arabic Maghrebian dialects, which no previous study has investigated. In this paper, we introduced two multiclass SVM approaches reduced to MEB algorithms to improve a baseline GMM-UBM dialect identification system that automatically identifies acoustic differences between dialects, by reducing the data in the UBM and eliminating the data outside the ball defined by the MEB.
We proposed two algorithms to compute an approximate formulation of the MEB for a given finite set of vectors. Both algorithms are especially well suited to large-scale instances of the MEB problem and can compute a small Core-Set whose size depends only on the approximation parameter.
In addition, it is important to note that the Gaussians affected by MAP adaptation lead to the high performance of the system, as shown in our experiments.
We conducted a series of experiments to test our approach on five Arabic Maghrebian dialects of spontaneous conversations and to compare our results to those of the baseline system. The system based on the multiclass SVM OVO approach outperformed the other approaches.
Comparing our OVO and OVA approaches applied to the dialect identification system with the corresponding baseline system, we obtained improved absolute precisions of 80.49% for the former and 74.99% for the latter.
References
Biography
NourEddine Lachachi
He is an Assistant Master in Computer Science at Oran University, Algeria. He received his State Engineering degree in 1988 from the Study and Research Center on Computer Science, Algiers, and his Magister in Computer Science from Oran University. He is currently a doctoral student. His research focuses on automatic spoken dialect identification and recognition.
Abdelkader Adla
He is a Full Professor in Computer Science at the University of Oran, Algeria. He received his Ph.D. in Computer Science (Artificial Intelligence) from Paul Sabatier University, Toulouse III, France. He also received a State Doctorate in Computer-Aided Design and Simulation from the University of Oran in 2007. He has published papers on collaborative decision making, decision support systems (DSS), distributed group DSS, and multi-agent DSS. His research interests focus on group DSS, facilitation, cooperative and collaborative systems, organizational memory, and multi-agent decision support systems.