I-93
Preparations for the presentation of "Structurally discriminative
graphical models for automatic speech recognition - results from the
2001 John Hopkins Summer Workshop" by G. G. Zweig and others
This file has been partially translated to Japanese.
Main Parts:
Paper Structure
- Text
- Abstract
- former techniques
- until now main focus was on discriminative parameter
training techniques which are typified by Maximum Mutual Information
training
- a fixed statistical modeling structure is assumed, only the
associated numerical parameters are optimized
- this paper's new techniques
- focus is on structure learning
- fundamental dependency relationships between random
variables are learned discriminatively and seperately from the numerical
parameters
- before: structure <- not learned (fixed), parameters
<- learned
now: structure <- learned, parameters <- learned
- what was done, data sources from 2001 Johns Hiokins Summer
Workshop briefly described
- adoptation of graphical model -> application of
principles of structural discriminability possible
- the Graphical Model Toolkit (GMTK) was used
- result:significant gains through the use of discriminative
structural analysis with both MFCC and AM-FM features
- 1. Introduction
- Maximum Mutual
Information
- procedure for discriminatively optimizing HMM transition
and observation probabilities
- model: fixed model-structure, only numeric parameters are
optimized
- difference between structural discriminability
and optimizing parameters emphasized
- focus is learning of actual dependency structure between
random variables in class-conditional probabilistic models
- general pattern classification setup described
- set of K classes {C1, ... CK}
- representation of each Ck as set of T random
variables X1:T={X1, ..., XT},
1<=k<=T
- representation of the class-conditional distributions (used
to perform pattern classification) over the random variables by a
probabilistic model P(X1:T | Ck) for each Ck
- Bayes Decision Theory
- modulo a 0/1-loss function: the optimal choice is the clas
with the highest posterior probability:
k* = argmax_over_k P(Ck | X1:T) = argmax_over_k P(X1:T
| Ck)*P(Ck)
- aim of
structural discriminability
- identify minimum set of dependencies in P(X1:T | Ck)
to obtain results as close to the above formula as possible
- general description of Directed Graphical Models /
Bayesian Networks
- nodes represent random variables
- arcs encode conditional independence assumptions amongst
variables
- Xπi be parent of Xi
- a specific value of Xi be xi
- then we can factor the joint distribution:
P(X1 = x1, ..., Xn = xn) = Πi
P(Xi = xi | Xπi = xi)
- Software used: GMTK
- results obtained from 2001 Johns Hiokins Summer Workshop,
paper structure previewed
- 2. Base Graphical Model
- how to emulate a HMM using a Graphical Model (GM)
- create state and transition variables in each time frame
=> values must refer to the corresponding in an underlying finite
state HMM graph
- set conditional probabilities in the graphical model: each
assignment of its variables corresponds to a path through the HMM graph,
with same probabilities
- explanation of Figure 1
- see figure 1 for the
listing of the variables
- in decoding single
likeliest assignment of variables is found
- Figure 2 briefly described, i.e.
changes to Figure 1
- 3. Discriminative Structural Learning Algorithms
- in the past, the maximum
likelihood principle was emphasized on
- identifying model structures that maximize the probability
of some observed data
- focus of this work
- class conditional
distribution which maximizes the likelihood of the observed data
is no longer required, instead the focus is on finding discriminative
representations which maximize classification
accuracy
- 2 orthogonal
possibilities:
- 1.) discriminative parameter training methods
- 2.) each class-conditional model's underlying structure
only represents that which helps for discrimination
- how can we produce a
discriminatively structured graphical model?
- need to measure discriminative quality of an edge (in a
graph), Xti is the ith component of the tth feature vector, whereas t is the time (s. Fig. 4 & Fig. 5) and i is the ith entry in the vector,
in a HMM a hidden variable representing a phone or some sub-phonetic
unit is always the parent of Xti whilst in a DGM/DBN we add additional
between-observation edges by allowing Xrj also to be a parent of Xti where r<t
- properly choosing additional between-observation edges, Explaining
Away Residual (EAR) briefly introduced
- Q be a class random variable
- we consider adding edges to all the elements of the random
vector X from all elements of Y
- EAR(X, Y) = I(X, Y | Q) - I(X, Y)
- optimizing the EAR is equal to optimizing the Mutual Information function, specified
as I(X, Q | Y) in this case => implies that the goal of the EAR
measure is to choose additional parents of X to increase the MI between
X and as much as possible
- EAR problems
- EAR in its most general form is difficult to compute =>
in many cases only approximation can be used
- here X and Y are scalars instead of vectors and thus only
the pairwise quality of edges can be measured
- hence the utility of multiple edges can not be measured and
single edges were then chosen in a greedy fashion
- other work: switching measure approximation was designed
for each Q=q using I(X, Y | Q=q) - I(X, Y)
- this work: single global discriminative structure using the
average across the possible values of Q, Q can range over either words
or individual states within words
- 4. Experimental Results
- 4.1 Corpora
- data used in the experiment
- Aurora continuous digit recognition task
- TIDigits data passed through telephone channel filters
and subjected to additive noises
- sets A,B and C of the multi-condition set -> 8440
utterances, 70000 test sentences
- database processed in two different ways:
- standard front-end used to produce MFCCs
- delta features and double-delta features added
- cepstral mean substraction
- => 42 dimensional feature vector (s. Fig. 3)
- AM-FM task
- spectrum seperated into 20 equally spaced bands by
using multiplex quadrature band pass filters
- neighbouring pair of filters: higher-band filter output
is multiplied by the conjugate of the lower-band output
- then low-pass filtered and sampled every 10ms
- FM: sine of the angle of the sampled output
- AM: log of real component
- probably the features could be more improved
- 4.2 Mutual Information Measures
- discriminative MI between
observation components
- conditioning only on observation components, not on
hidden states
- s. Fig. 2
- Mutual Information experiment results: information is
strongest at syllabel lag 100-150ms
- 4.3 Induced Structures
- MFCC features were used
- Q-values corresponding to words in the EAR measure
- C0 at
time t and C0 at time t-100ms are conditioned
- s. Fig. 4
- AM-FM features were used as possible
conditioning parents for MFCCs
- there is indication that FM features provide significant
discriminative information about MFCCs
- s. Fig. 5
- 4.4 Word Error Rate Results
- baseline systems (GMTK-HMM) specs
- built by emulating HMM with the GMTK in
order to validate the structure learning methods
- then enhanced with discriminative structure
- vocabulary: 11 words with 16 states each
- no parameter tying between states
- additional silence models and short-pause models were
used: 3 silence states and middle state tied to short-pause, models
were strictly left-to-right
- 4 Gaussians/state -> 715 Gaussians total
- only best results which stem from the whole-word system
are presented in the paper
- results in Table
1
- published HP model:
- 3 Gaussians/state -> less than 546 Gaussians total
- results from various structure-induced systems
- relative improvement in word-error-rate for several
structure-induced systems
- performance of a conventional system was much worse under
noisy conditions when its size was increased to that of the GMTK-HMM
- => structure-induction improves performance in a
robust way
- 5. Conclusion
- results of 2001 Johns Hopkins CLSP workshop were described
- graphical models were tested with the GMTK
- improvements of 10-15% on the Aurora 2.0 recognition task
using both MFCC and AM-FM
features
- expectation: discriminative structure
learning techniques will be a good complement for
discriminative parameter learning methods
- 6. References
- Figures
- Figure 1
- recognition of the word "hi"
is shown
- variables in the network:
- in graph, top-to-bottom:
- End-of-Utterance:
assigned value of 1 with a the following probability distribution:if
the final word-transition variable
!= 1, assign 0
- Word: indicates
which word is being spoken (index in the dictionary => 5 means 5th word in the
dictionary)
- Word-transition: set to 0, but in
the last frame a phone-transition in the last position of the word
forces a value of 1
- Word-position:
within-word position
- Phone-transition:
1 when phone transition occurs
- Phone: acoustic
state corresponding to word-position
- Features: the
features deemed important
- Acoustics: the feature vector for a frame
- note: there is an error: the 3rd phone should be /AY/,not /HH/
- Figure 2
- shows basically the same as Fig.1,but
some additional edges between observations were added <=>
expanding the graph over observation feature vectors and adding edges
between those variables
- structures are expanded views of conditioning relationships
among individual entries of the acoustic feature vectors
- Figure 3
- displays the discriminativemutual information as a function
of
- a) parent feature position j
- b) time lag
- feature vector: (C1,
..., C12, C0, log-energy, delta_C1,
..., delta_C12, delta_C0, delta_log-energy,
delta_delta_C1, ..., delta_delta_C12, delta_delta_C0,
delta_delta_log-energy), i.e. there are 3 parts with 14 feature vector
entries each
- Q ranges over word-values
- the color indicates the amount of Mutual Information (MI):
~0.3=white=very high MI, 0.05=black=very low MI
- plotted as a function of either i or j and the time lag t-r
- discriminative
mutual information is strongest when C0 or log-energy are
parent features => above order chosen
- Figure 4
- displays the induced conditioning relationships for the
system of Fig. 3, feature
vector is also the same
- the thicker the line, the greater the strength of the EAR measure
- each feature vector entry was conditioned on 1 other entry
- Figure 5
- displays the induced conditioning relationships for AM-FM features
- Q ranges over word-state values
- feature vector: three parts
with a total of 82 features: 42+20+20=82 features (distribution guessed) (MFCC, AM, FM)
- the MFCCs were conditioned on <=2 parents (up to two
parents)
- Tables
- Table 1
- Word recognition rates as a function of SNR
- Table 2
- word-error-rate improvementin
percent for structure induced systems
- tested structure-induced systems:
- same number of parameters:
- WW: Q ranges
over words
- WWS: Q ranges
over states
- EP: straight
Gaussian system with twice (2 x ...) as many Gaussians as the baseline
- different number of parameters:
Definitions, Terms,
Glossary
- Switchboard
- Conversational Speaking Style: The language model is
effected due to
disfluencies , syntactic and
discourse patterns, and linguistic incoherence of automatic
segmentation. Of particular difficulty are the false starts,
interruptions, and bad grammar.
- Pronunciation Effects:
Speaking rate variability and
reduced pronunciations of conversational speech are difficult
enough for human recognition, and they only compound the already
difficult task of automatic speech recognition. A key issue in
this problem is the poor correspondence between pauses and phrase
boundaries.
- Telephone Bandwidth and Ambient Noise: The
limited bandwidth of telephone degrades the original speech signal
making acoustic discrimination a challenge. Crosstalk, background
speech, and ambient noise like music and television all
contribute to an increased error rate in automatic recognition.
Remarkably, humans do a very good job of filtering out these
types of noise, while our current statistical modeling techniques
fall far short.
- Training Data: A wealth of accurately transcribed
training data is necessary to produce good acoustic and language
models. Unfortunately, the aforementioned problems are not a problem
only during automatic recognition, they also come into play when humans
are transcribing the data. A few good examples of these problems can be
found on the SWITCHBOARD resegmentation project FAQ.
- src: http://www.isip.msstate.edu/projects/switchboard/doc/education/challenges/problem.html
- Mutual Information
- entropy: This gives us
log(N) when all the p_i are equal, which makes sense: then there are no
prefered messages, and the order of asking doesn't make any difference.
The sum is called, variously, the information, the information content,
the self-information, the entropy or the Shannon entropy of the message,
conventionally written H[S].
- joint entropy: H[S, T]
<= H[S]+H[T]
- conditional entropy: the
entropy of T conditioned on S, written H[T|S] = H[T, S] - H[S]
- mutual information: written I(S; T),
which is the amount we learn about T from knowing S, i.e., the number of
questions it saves us from having to ask, i.e., H[T] - H[T|S], which is,
as it happens, always the same as H[S] - H[S|T]. (Hence ``mutual.'')
- src: Information
Theory
- Maximum Mutual
Information
- Maximum Likelihood (ML)
- Consider the HMM probability measure P(O | lambda), where O is
a sequence of observables and lambda is the HMM parameter (the HMM model)
- if the HMM parameter lambda is assumed to be fixed but unknown,
the maximum likelihood estimate for lambda is obtained via solving the
equation
dP(O | lambda) / dlambda = 0
(its derivative is solved for 0)
- if lambda is assumed to be random with an a-priori distribution
P0(lambda), then the Maximum A Posteriori (MAP) estimate
for lambda is obtained by solving
dP(lambda | O) / dlambda = 0
- using Bayes Theorem this
can be rewritten as
P (lambda | O) = P(O | lambda)*P0(lambda) / P(O)
- src: "Fundamentals of Speech Recognition", Lawrence Rabiner and
Biing-Hwang Juang
- discriminative
fashion, structural discriminability
- aim: identify a minimal set of dependencies in class
conditional distributions P(X[1:T] | C[k]) such that there is little or
no degradation in classification accuracy relative to Bayes decision
theory
- goal: learn discriminatively the actual dependency structure
between random variables in class-conditional probabilistic models
- src: paper I-93
- MFCC
- Mel Frequency Cepstral Coefficients
- AM-FM
- AM: Amplitude Modulation
- FM: Frequency Modulation
- Aurora
- Mel-Scale
- feature-extraction
- Bayes Decision Theory
- Bayes' decision theory is a fundamental statistical approach to
the problem of pattern classification. This approach is based on the
assumption that the decision problem is posed in probabilistic terms,
and that all of the relevant probability values are known.
The aim of the decision is to assign an object to a class. This
classification is only carried out by means of measurements taken over
the objects.
- Using Bayes Theorem: P(wj|x) = p(x|wj)*P(wj)/p(x)
- src: http://www.cs.mcgill.ca/~mcleish/644/decision.html
- feature vector
- EAR (Explaining Away
Residual)
- information-theoretic discriminative edge quality measurement
- Q is class random variable, we're adding edges to all the the
elements of the random vector X from all the elements of Y, the EAR
measure is defined as
EAR(X, Y) = I(X, Y | Q) - I(X, Y)
where I(X, Y | Q) is the conditional and I(X, Y) is the unconditional mutual information between vectors X and
Y
- src: paper I-93
- Dynamic Bayesian
Networks <-> HMM
- HMM is special case of DBN, hence a DBN in its most simple form
blends with the HMM framework
- DBN combine most advantages of HMM while avoiding many of their
limitations
- Extensibility: HMM are
limited in their extensibility because the main way of increasing
complexity is to increase the number of states -> awkward when the
overall state of the system is actually composed of a combination of
seperately identifiable factors. DBNs can handle large number of
variables, provided that the graph structure is sparse
- Linearity: both HMMs
and DBNs are well suited to model this
- Nonlinearity: both
HMMs and DBNs are well suited to model this
- Interpretability:
parameters of HMM are labeled as "transition" and "emission"
probabilites, however, states do not always have a clear interpretation,
especially after training, and if there are seperable, clearly
identifiable factors, efficiently modeling those is difficult. In DBNs
each variable expresses a specific concept
- Factorization: DBNs
require exponentially fewer parameters compared to an unfactored HMM
with an equal number of possible states, whilst a factored HMM is
computationally less efficient
- In a DBN, hidden states may be conditioned on Observables
- src: paper "Speech Recognition with Dynamic Bayesian Networks",
G. G. Zweig
- Greedy Algorithm
- The greedy method is a local optimization method. It assumes
that the path to the best global optimum is a series of locally optimal
steps. At each step, one adds the pairing that has the next shortest
distance.
The problem of pairing things up does not have the greedy property, so
the algorithm doesn't work. It does produce pairings of some kind, and
is easy to implement. The running time is O(N9)
src: http://www.sandelman.ottawa.on.ca/People/Michael_Richardson/thesis/subsubsection1.2.0.4.2.1.html
- Definition: An algorithm that always takes the
best immediate, or local, solution while finding an answer. Greedy
algorithms find the overall, or globally, optimal
solution for some optimization
problems, but may find less-than-optimal solutions for some
instances of other problems.
src: http://www.nist.gov/dads/HTML/greedyalgo.html
- An algorithm
is a step-by-step recipe for solving a problem. A greedy
algorithm might also be called a "single-minded" algorithm or
an algorithm that gobbles up all of its favorites first. The idea
behind a greedy algorithm is to perform a single procedure in the recipe
over and over again until it can't be done any more and see what kind of
results it will produce. It may not completely solve the problem, or,
if it produces a solution, it may not be the very best one, but it is
one way of approaching the problem and sometimes yields very good (or
even the best possible) results.
src: http://www.c3.lanl.gov/mega-math/gloss/compute/greedy.html
- log-energy
- cepstrum,
mel-cepstrum, cepstral-mean
- the cepstrum c(m) of a
real sequence x(n) is defined as
c(m) = 1/(2*π*j)*CURVEINTEGRAL_over_C((log X (z))*zm-1*dz)
log X(z) = SUM_over_-infinity<=x<=infinity (c(m)*z-m)
where X(z) is the z-transform of x(n), and C is counterclockwise
closed contour in the region of convergence of log X(z) and encircling
the origin of the z-plane.
src: http://kt-lab.ics.nitech.ac.jp/~tokuda/tips/mgceptr_sa2.pdf
- Frequency-transformed cepstrum, so-called mel-cepstrum, is defined as
c'(m) = 1/(2*π*j)*CURVEINTEGRAL_over_C((log X(z) ) z'm-1*dz')
log X(z) = SUM_over_-infinity<=x<=infinity (c'(m)*z'-m)
z'-1 = psi(z) = (z-1 - alpha) / (1 - alpha*z-1),
-1 < alpha < 1
src: http://kt-lab.ics.nitech.ac.jp/~tokuda/tips/mgceptr_sa2.pdf
- thus the cepstral-mean
is the mean of the cepstrum coefficients
- cepstrum: transforms the
log-spectrum of the speech signal, thus simulating human hearing
above certain frequencies.
src:
http://www.isip.msstate.edu/projects/speech/software/tutorials/production/fundamentals/current/glossary/
- cepstral-mean substraction:
technique addressing distortions. It subtracts the mean cepstral value
from each feature vector and then produces a normalized cepstrum vector
which can better capture the acoustics where recognition occurs.
src: http://www.isip.msstate.edu/projects/speech/software/tutorials/production/fundamentals/current/glossary/
- baseline HMM
- SNR
- Kullback-Leibler
divergence (KL-divergence)
- the KL-divergence between 2 probability model functions p(x)
and q(x) can be seen as the average number of bits that are wasted by
encoding events from a distribution p with a code based on a
not-quite-right distribution
src: http://www.cs.dal.ca/~nat/Courses/NLP-Course/NLP-Lecture-5.ppt
- it is a relative entropy measure
Further recommended
references