I-93
Preparations for the presentation of "Structurally discriminative
graphical models for automatic speech recognition - results from the
2001 John Hopkins Summer Workshop" by G. G. Zweig and others
Main Parts:
Paper Structure
- Text
- Abstract
- former techniques
前のテクニック
- until now main focus was on discriminative parameter
training techniques which are typified by Maximum Mutual Information
training
今まで、 焦点 【しょうてん】 は ディスクリミネイティヴパラメタを 学習
【がくしゅう】するテクニックでした。書き分ける 【かきわける】のは "Maximum Mutual Information
training" です。
- a fixed statistical modeling structure is assumed, only the
associated numerical parameters are optimized
統計的 【とうけいてき】なモデルの 構造 【こうぞう】が 仮定 【かてい】される。 パラメタのオプティマイズ
だけをする。
- this paper's new techniques
この 論文 【ろんぶん】の新しいテクニックは:
- focus is on structure learning
構造 【こうぞう】 を 習う 【ならう】事 【こと】が 焦点 【しょうてん】です。
- fundamental dependency relationships between random
variables are learned discriminatively and separately from the numerical
parameters
確率変数 【かくりつへんすう】 の 間 【あいだ】の 基本 【きほん】の 依存 【いぞん】関係
【かんけい】 はパラメタからディスクリミネイティヴとセパレートに 学習 【がくしゅう】されます。
- before: structure <- not learned (fixed), parameters
<- learned
now: structure <- learned, parameters <- learned
前に:構造 【こうぞう】 は 固定 【こてい】ですが、パラメタを 習います 【ならいます】。
今: 構造 【こうぞう】 と パラメタを 習います 【ならいます】。
- what was done, data sources from 2001 Johns Hopkins Summer
Workshop briefly described
つぎは2001年の"John Hoplins Summer Workshop" の 成行き 【なりゆき】のデータ です。
- adaptation of graphical
model -> application of principles of structural discriminability
possible
- the Graphical Model Toolkit (GMTK) was used
- result:significant gains through the use of discriminative
structural analysis with both MFCC and AM-FM features
- 1. Introduction
"Introduction"
- Maximum Mutual
Information
"Maximum Mutual Information":
- procedure for discriminatively optimizing HMM transition and
observation probabilities
ディスクリミネイティヴにオプティマイズされる "HMM" の 遷移 【せんい】の と 出力 【しゅつりょく】の 確率
【かくりつ】の 為の 【ための】 手順 【てじゅん】です。
- model: fixed model-structure, only numerical parameters are
optimized
パラメタだけがオプティマイズされている 固定 【こてい】されたモデルの 構造 【こうぞう】です。
- difference between structural discriminability
and optimizing parameters emphasized
オプティマイズするパラメタと 構造 【こうぞう】の "discriminability" の 間 【あいだ】の 違い
【ちがい】は
重要 【じゅうよう】である。
- focus is learning of actual dependency structure between
random variables in class-conditional probabilistic models
焦点 【しょうてん】は 条件付き 【じょうけんつき】確率 【かくりつ】の 確率変数 【かくりつへんすう】の 間
【あいだ】の 実際 【じっさい】の 依存 【いぞん】 構造 【こうぞう】を 学習 【がくしゅう】する。
- general pattern classification setup described
どんなパターンかの 記述 【きじゅつ】:
- set of K classes {C1, ... CK}
- representation of each Ck as set of T random
variables X1:T={X1, ..., XT},
1<=k<=T
"K"このクラスから 成る 【なる】 集合 【しゅうごう】があると、クラス 毎 【ごと】 の 表現
【ひょうげん】は "T"この 確率変数 【かくりつへんすう】から 成る 【なる】 集合 【しゅうごう】です。(1から
"T"まで)
- representation of the class-conditional distributions (used
to perform pattern classification) over the random variables by a
probabilistic model P(X1:T | Ck) for each Ck
"k"この クラス 毎 【ごと】に 条件付き 【じょうけんつき】確率
【かくりつ】が 割り当て 【わりあて】られる。
- Bayes Decision Theory
ベイズのデシジョンメーキングのセオリー:
- modulo a 0/1-loss function: the optimal choice is the clas
with the highest posterior probability:
0/1 の 損失 【そんしつ】の 関数 【かんすう】のモジュロでは 一番 【いちばん】高い 【たかい】事後
【じご】 確率 【かくりつ】のクラスを 一番 【いちばん】 いい 選択 【せんたく】とする。
k* = argmax_over_k P(Ck | X1:T) = argmax_over_k P(X1:T
| Ck)*P(Ck)
- aim of
structural discriminability
志向 【しこう】は 構造 【こうぞう】をディスクリミネイティヴにする 事 【こと】 です:
- identify minimum set of dependencies in P(X1:T | Ck)
to obtain results as close to the above formula as possible
結果 【けっか】を 式 【しき】 同じく 【おなじく】 見つける 【みつける】 為に
【ために】負んぶに抱っこ 【おんぶにだっこ】の 最少 【さいしょう】の 集合 【しゅうごう】 を 見分けて
【みわけて】みます。
前 【まえ】の式 【しき】にできるだけ 近い 【ちかい】結果 【けっか】を 得る 【える】為に
【ために】P(X1:T | Ck) の 最小 【さいしょう】の 集合 【しゅうごう】を 見つける 【みつける】。
- general description of Directed Graphical Models /
Bayesian Networks
グラフィカのモデルの 記述 【きじゅつ】:
- nodes represent random variables
ノードは 確率変数 【かくりつへんすう】の 表現 【ひょうげん】をします。
- arcs encode conditional independence assumptions amongst
variables
アークは 確率変数 【かくりつへんすう】の 中 【なか】で 条件付き 【じょうけんつき】で 独立 【どくりつ】に
仮定 【かてい】に エンコードをします。
- Xπi be parent of Xi
"Xπi" は "Xi" の 親 【おや】です。
- a specific value of Xi be xi
"Xi" の 特定 【とくてい】のバリューは "xi" です。
- then we can factor the joint distribution:
つなぎ目 【つなぎめ】の 分布 【ぶんぷ】のファクトライゼションをできます。
P(X1 = x1, ..., Xn = xn) = Πi
P(Xi = xi | Xπi = xi)
- Software used: GMTK
- results obtained from 2001 Johns Hiokins Summer Workshop,
paper structure previewed
- 2. Base Graphical Model
- how to emulate a HMM using a Graphical Model (GM)
- create state and transition variables in each time frame
=> values must refer to the corresponding in an underlying finite
state HMM graph
- set conditional probabilities in the graphical model: each
assignment of its variables corresponds to a path through the HMM graph,
with same probabilities
- explanation of Figure 1
- see figure 1 for the
listing of the variables
- in decoding single
likeliest assignment of variables is found
- Figure 2 briefly described, i.e.
changes to Figure 1
- 3. Discriminative Structural Learning Algorithms
- in the past, the maximum
likelihood principle was emphasized on
- identifying model structures that maximize the probability
of some observed data
- focus of this work
- class conditional
distribution which maximizes the likelihood of the observed data
is no longer required, instead the focus is on finding discriminative
representations which maximize classification
accuracy
- 2 orthogonal
possibilities:
- 1.) discriminative parameter training methods
- 2.) each class-conditional model's underlying structure
only represents that which helps for discrimination
- how can we produce a
discriminatively structured graphical model?
- need to measure discriminative quality of an edge (in a
graph), Xti is the ith component of the tth feature vector, whereas t is the time (s. Fig. 4 & Fig. 5) and i is the ith entry in the vector,
in a HMM a hidden variable representing a phone or some sub-phonetic
unit is always the parent of Xti whilst in a DGM/DBN we add additional
between-observation edges by allowing Xrj also to be a parent of Xti where r<t
- properly choosing additional between-observation edges, Explaining
Away Residual (EAR) briefly introduced
- Q be a class random variable
- we consider adding edges to all the elements of the random
vector X from all elements of Y
- EAR(X, Y) = I(X, Y | Q) - I(X, Y)
- optimizing the EAR is equal to optimizing the Mutual Information function, specified
as I(X, Q | Y) in this case => implies that the goal of the EAR
measure is to choose additional parents of X to increase the MI between
X and as much as possible
- EAR problems
- EAR in its most general form is difficult to compute =>
in many cases only approximation can be used
- here X and Y are scalars instead of vectors and thus only
the pairwise quality of edges can be measured
- hence the utility of multiple edges can not be measured and
single edges were then chosen in a greedy fashion
- other work: switching measure approximation was designed
for each Q=q using I(X, Y | Q=q) - I(X, Y)
- this work: single global discriminative structure using the
average across the possible values of Q, Q can range over either words
or individual states within words
- 4. Experimental Results
- 4.1 Corpora
- data used in the experiment
- Aurora continuous digit recognition task
- TIDigits data passed through telephone channel filters
and subjected to additive noises
- sets A,B and C of the multi-condition set -> 8440
utterances, 70000 test sentences
- database processed in two different ways:
- standard front-end used to produce MFCCs
- delta features and double-delta features added
- cepstral mean substraction
- => 42 dimensional feature vector (s. Fig. 3)
- AM-FM task
- spectrum seperated into 20 equally spaced bands by
using multiplex quadrature band pass filters
- neighbouring pair of filters: higher-band filter output
is multiplied by the conjugate of the lower-band output
- then low-pass filtered and sampled every 10ms
- FM: sine of the angle of the sampled output
- AM: log of real component
- probably the features could be more improved
- 4.2 Mutual Information Measures
- discriminative MI between
observation components
- conditioning only on observation components, not on
hidden states
- s. Fig. 2
- Mutual Information experiment results: information is
strongest at syllabel lag 100-150ms
- 4.3 Induced Structures
- MFCC features were used
- Q-values corresponding to words in the EAR measure
- C0 at
time t and C0 at time t-100ms are conditioned
- s. Fig. 4
- AM-FM features were used as possible
conditioning parents for MFCCs
- there is indication that FM features provide significant
discriminative information about MFCCs
- s. Fig. 5
- 4.4 Word Error Rate Results
- baseline systems (GMTK-HMM) specs
- built by emulating HMM with the GMTK in
order to validate the structure learning methods
- then enhanced with discriminative structure
- vocabulary: 11 words with 16 states each
- no parameter tying between states
- additional silence models and short-pause models were
used: 3 silence states and middle state tied to short-pause, models were
strictly left-to-right
- 4 Gaussians/state -> 715 Gaussians total
- only best results which stem from the whole-word system
are presented in the paper
- results in Table
1
- published HP model:
- 3 Gaussians/state -> less than 546 Gaussians total
- results from various structure-induced systems
- relative improvement in word-error-rate for several
structure-induced systems
- performance of a conventional system was much worse under
noisy conditions when its size was increased to that of the GMTK-HMM
- => structure-induction improves performance in a
robust way
- 5. Conclusion
- results of 2001 Johns Hopkins CLSP workshop were described
- graphical models were tested with the GMTK
- improvements of 10-15% on the Aurora 2.0 recognition task
using both MFCC and AM-FM
features
- expectation: discriminative structure
learning techniques will be a good complement for
discriminative parameter learning methods
- 6. References
- Figures
- Figure 1
- recognition of the word "hi"
is shown
- variables in the network:
- in graph, top-to-bottom:
- End-of-Utterance:
assigned value of 1 with a the following probability distribution:if
the final word-transition variable
!= 1, assign 0
- Word: indicates
which word is being spoken (index in the dictionary => 5 means 5th word in the
dictionary)
- Word-transition: set to 0, but in
the last frame a phone-transition in the last position of the word
forces a value of 1
- Word-position:
within-word position
- Phone-transition:
1 when phone transition occurs
- Phone: acoustic
state corresponding to word-position
- Features: the
features deemed important
- Acoustics: the feature vector for a frame
- note: there is an error: the 3rd phone should be /AY/,not /HH/
- Figure 2
- shows basically the same as Fig.1,but
some additional edges between observations were added <=>
expanding the graph over observation feature vectors and adding edges
between those variables
- structures are expanded views of conditioning relationships
among individual entries of the acoustic feature vectors
- Figure 3
- displays the discriminativemutual information as a function
of
- a) parent feature position j
- b) time lag
- feature vector: (C1,
..., C12, C0, log-energy, delta_C1,
..., delta_C12, delta_C0, delta_log-energy,
delta_delta_C1, ..., delta_delta_C12,
delta_delta_C0, delta_delta_log-energy), i.e. there are 3
parts with 14 feature vector entries each
- Q ranges over word-values
- the color indicates the amount of Mutual Information (MI):
~0.3=white=very high MI, 0.05=black=very low MI
- plotted as a function of either i or j and the time lag t-r
- discriminative
mutual information is strongest when C0 or log-energy are
parent features => above order chosen
- Figure 4
- displays the induced conditioning relationships for the
system of Fig. 3, feature
vector is also the same
- the thicker the line, the greater the strength of the EAR measure
- each feature vector entry was conditioned on 1 other entry
- Figure 5
- displays the induced conditioning relationships for AM-FM features
- Q ranges over word-state values
- feature vector: three parts
with a total of 82 features: 42+20+20=82 features (distribution guessed)
(MFCC, AM, FM)
- the MFCCs were conditioned on <=2 parents (up to two
parents)
- Tables
- Table 1
- Word recognition rates as a function of SNR
- Table 2
- word-error-rate improvementin
percent for structure induced systems
- tested structure-induced systems:
- same number of parameters:
- WW: Q ranges
over words
- WWS: Q ranges
over states
- EP: straight
Gaussian system with twice (2 x ...) as many Gaussians as the baseline
- different number of parameters:
Definitions, Terms,
Glossary
- Switchboard
- Conversational Speaking Style: The language model is
effected due to
disfluencies , syntactic and
discourse patterns, and linguistic incoherence of automatic
segmentation. Of particular difficulty are the false starts,
interruptions, and bad grammar.
- Pronunciation Effects:
Speaking rate variability and
reduced pronunciations of conversational speech are difficult
enough for human recognition, and they only compound the already
difficult task of automatic speech recognition. A key issue in
this problem is the poor correspondence between pauses and phrase
boundaries.
- Telephone Bandwidth and Ambient Noise: The
limited bandwidth of telephone degrades the original speech signal
making acoustic discrimination a challenge. Crosstalk, background
speech, and ambient noise like music and television all
contribute to an increased error rate in automatic recognition.
Remarkably, humans do a very good job of filtering out these
types of noise, while our current statistical modeling techniques
fall far short.
- Training Data: A wealth of accurately transcribed
training data is necessary to produce good acoustic and language
models. Unfortunately, the aforementioned problems are not a problem
only during automatic recognition, they also come into play when humans
are transcribing the data. A few good examples of these problems can be
found on the SWITCHBOARD resegmentation project FAQ.
- src: http://www.isip.msstate.edu/projects/switchboard/doc/education/challenges/problem.html
- Mutual Information
- entropy: This gives us
log(N) when all the p_i are equal, which makes sense: then there are no
prefered messages, and the order of asking doesn't make any difference.
The sum is called, variously, the information, the information content,
the self-information, the entropy or the Shannon entropy of the message,
conventionally written H[S].
- joint entropy: H[S, T]
<= H[S]+H[T]
- conditional entropy: the
entropy of T conditioned on S, written H[T|S] = H[T, S] - H[S]
- mutual information: written I(S; T),
which is the amount we learn about T from knowing S, i.e., the number of
questions it saves us from having to ask, i.e., H[T] - H[T|S], which is,
as it happens, always the same as H[S] - H[S|T]. (Hence ``mutual.'')
- src: Information
Theory
- Maximum Mutual
Information
- Maximum Likelihood (ML)
- Consider the HMM probability measure P(O | lambda), where O is
a sequence of observables and lambda is the HMM parameter (the HMM model)
- if the HMM parameter lambda is assumed to be fixed but unknown,
the maximum likelihood estimate for lambda is obtained via solving the
equation
dP(O | lambda) / dlambda = 0
(its derivative is solved for 0)
- if lambda is assumed to be random with an a-priori distribution
P0(lambda), then the Maximum A Posteriori (MAP) estimate
for lambda is obtained by solving
dP(lambda | O) / dlambda = 0
- using Bayes Theorem this
can be rewritten as
P (lambda | O) = P(O | lambda)*P0(lambda) / P(O)
- src: "Fundamentals of Speech Recognition", Lawrence Rabiner and
Biing-Hwang Juang
- discriminative
fashion, structural discriminability
- aim: identify a minimal set of dependencies in class
conditional distributions P(X[1:T] | C[k]) such that there is little or
no degradation in classification accuracy relative to Bayes decision
theory
- goal: learn discriminatively the actual dependency structure
between random variables in class-conditional probabilistic models
- src: paper I-93
- MFCC
- Mel Frequency Cepstral Coefficients
- AM-FM
- AM: Amplitude Modulation
- FM: Frequency Modulation
- Aurora
- Mel-Scale
- feature-extraction
- Bayes Decision Theory
- Bayes' decision theory is a fundamental statistical approach to
the problem of pattern classification. This approach is based on the
assumption that the decision problem is posed in probabilistic terms,
and that all of the relevant probability values are known.
The aim of the decision is to assign an object to a class. This
classification is only carried out by means of measurements taken over
the objects.
- Using Bayes Theorem: P(wj|x) = p(x|wj)*P(wj)/p(x)
- src: http://www.cs.mcgill.ca/~mcleish/644/decision.html
- feature vector
- EAR (Explaining Away
Residual)
- information-theoretic discriminative edge quality measurement
- Q is class random variable, we're adding edges to all the the
elements of the random vector X from all the elements of Y, the EAR
measure is defined as
EAR(X, Y) = I(X, Y | Q) - I(X, Y)
where I(X, Y | Q) is the conditional and I(X, Y) is the unconditional mutual information between vectors X and
Y
- src: paper I-93
- Dynamic Bayesian
Networks <-> HMM
- HMM is special case of DBN, hence a DBN in its most simple form
blends with the HMM framework
- DBN combine most advantages of HMM while avoiding many of their
limitations
- Extensibility: HMM are
limited in their extensibility because the main way of increasing
complexity is to increase the number of states -> awkward when the
overall state of the system is actually composed of a combination of
seperately identifiable factors. DBNs can handle large number of
variables, provided that the graph structure is sparse
- Linearity: both HMMs
and DBNs are well suited to model this
- Nonlinearity: both
HMMs and DBNs are well suited to model this
- Interpretability:
parameters of HMM are labeled as "transition" and "emission"
probabilites, however, states do not always have a clear interpretation,
especially after training, and if there are seperable, clearly
identifiable factors, efficiently modeling those is difficult. In DBNs
each variable expresses a specific concept
- Factorization: DBNs
require exponentially fewer parameters compared to an unfactored HMM
with an equal number of possible states, whilst a factored HMM is
computationally less efficient
- In a DBN, hidden states may be conditioned on Observables
- src: paper "Speech Recognition with Dynamic Bayesian Networks",
G. G. Zweig
- Greedy Algorithm
- The greedy method is a local optimization method. It assumes
that the path to the best global optimum is a series of locally optimal
steps. At each step, one adds the pairing that has the next shortest
distance.
The problem of pairing things up does not have the greedy property, so
the algorithm doesn't work. It does produce pairings of some kind, and
is easy to implement. The running time is O(N9)
src: http://www.sandelman.ottawa.on.ca/People/Michael_Richardson/thesis/subsubsection1.2.0.4.2.1.html
- Definition: An algorithm that always takes the
best immediate, or local, solution while finding an answer. Greedy
algorithms find the overall, or globally, optimal
solution for some optimization
problems, but may find less-than-optimal solutions for some
instances of other problems.
src: http://www.nist.gov/dads/HTML/greedyalgo.html
- An algorithm
is a step-by-step recipe for solving a problem. A greedy
algorithm might also be called a "single-minded" algorithm or
an algorithm that gobbles up all of its favorites first. The idea
behind a greedy algorithm is to perform a single procedure in the recipe
over and over again until it can't be done any more and see what kind of
results it will produce. It may not completely solve the problem, or,
if it produces a solution, it may not be the very best one, but it is
one way of approaching the problem and sometimes yields very good (or
even the best possible) results.
src: http://www.c3.lanl.gov/mega-math/gloss/compute/greedy.html
- log-energy
- cepstrum,
mel-cepstrum, cepstral-mean
- the cepstrum c(m) of a
real sequence x(n) is defined as
c(m) = 1/(2*π*j)*CURVEINTEGRAL_over_C((log X (z))*zm-1*dz)
log X(z) = SUM_over_-infinity<=x<=infinity (c(m)*z-m)
where X(z) is the z-transform of x(n), and C is counterclockwise
closed contour in the region of convergence of log X(z) and encircling
the origin of the z-plane.
src: http://kt-lab.ics.nitech.ac.jp/~tokuda/tips/mgceptr_sa2.pdf
- Frequency-transformed cepstrum, so-called mel-cepstrum, is defined as
c'(m) = 1/(2*π*j)*CURVEINTEGRAL_over_C((log X(z) ) z'm-1*dz')
log X(z) = SUM_over_-infinity<=x<=infinity (c'(m)*z'-m)
z'-1 = psi(z) = (z-1 - alpha) / (1 - alpha*z-1),
-1 < alpha < 1
src: http://kt-lab.ics.nitech.ac.jp/~tokuda/tips/mgceptr_sa2.pdf
- thus the cepstral-mean
is the mean of the cepstrum coefficients
- cepstrum: transforms the
log-spectrum of the speech signal, thus simulating human hearing
above certain frequencies.
src:
http://www.isip.msstate.edu/projects/speech/software/tutorials/production/fundamentals/current/glossary/
- cepstral-mean substraction:
technique addressing distortions. It subtracts the mean cepstral value
from each feature vector and then produces a normalized cepstrum vector
which can better capture the acoustics where recognition occurs.
src: http://www.isip.msstate.edu/projects/speech/software/tutorials/production/fundamentals/current/glossary/
- baseline HMM
- SNR
- Kullback-Leibler
divergence (KL-divergence)
- the KL-divergence between 2 probability model functions p(x)
and q(x) can be seen as the average number of bits that are wasted by
encoding events from a distribution p with a code based on a
not-quite-right distribution
src: http://www.cs.dal.ca/~nat/Courses/NLP-Course/NLP-Lecture-5.ppt
- it is a relative entropy measure
Further recommended
references