Introduction to Graphical Models,
Hidden Markov Models and
Bayesian Networks

Markus Stengel








Address: Department of Information and Computer Sciences

Toyohashi University of Technology

Toyohashi, 441-8580 Japan


Date: March 7th, 2003


I Table of Contents

I Table of Contents I

II HMM, GM, BN 1

1 Hidden Markov Model 2

1.1 Introduction 2

1.2 Markov Processes 2

1.2.1 Discrete Time Markov Processes 2

1.2.2 Limitations of Discrete Time Markov Processes 4

1.3 Definition of Hidden Markov Models 4

1.4 Problems for HMMs 5

1.4.1 Problems 5

1.4.1.1 Probability Evaluation 5

1.4.1.2 Optimal State Sequence 6

1.4.1.3 Parameter Estimation 6

1.4.2 Solutions 6

1.4.2.1 Probability Evaluation 6

1.4.2.2 Optimal State Sequence 8

1.4.2.3 Parameter Estimation 10

1.5 Continuous Observation Densities in HMMs 13

1.6 Types and Classes of HMMs 15

1.6.1 Types 15

1.6.1.1 Ergodic HMM 15

1.6.1.2 Left-Right HMM 16

1.6.1.3 Variations 17

1.6.2 Classes 17

1.6.2.1 Factorial HMMs 17

1.6.2.2 Tree structured HMMs 18

1.6.2.3 Autoregressive HMMs 18

1.6.2.4 Vector Linear Predictor HMMs 19

1.7 Optimization Criteria 20

1.7.1 Maximum Likelihood (ML) 20

1.7.2 Minimum Discrimination Information (MDI) 20

1.7.3 Maximum Mutual Information (MMI) 21

2 Graphical Models (GM) 23

2.1 Introduction 23

2.2 Undirected Graphical Models 23

2.2.1 Introduction 23

2.2.2 Independence 23

2.3 Directed Graphical Models 23

2.3.1 Introduction 23

2.3.2 Definitions 24

2.3.3 Independence 24

2.3.4 Conditional Probability Distribution (CPD) 25

3 Bayesian Networks (BN) 26

3.1 Introduction 26

3.2 Why “Bayesian”? 26

3.3 Conditional Independence and Representation 26

3.4 Example 27

3.5 Inference 28

3.6 Explaining away 29

3.7 Reasoning 29

3.8 Causality or correlation? 29

3.9 Singly connected and multiply connected networks 29

3.10 Evidence and Belief Propagation 30

3.11 BN with discrete and continuous nodes 31

3.12 Dynamic Bayesian Networks (DBN) 31

3.12.1 Introduction 31

3.12.2 Strengths and Advantages of DBN 32

3.12.3 Hidden Markov Models as DBN 32

3.12.4 Inference 34

3.12.5 Learning 34

4 Comments 35

III Index i

IV Bibliography iv

V Illustrations vi



II HMM, GM, BN

Introduction

This essay is an excerpt from a report I wrote which covers some of the various topics I studied. Since I felt it might be useful to others as well, I decided to make this part publicly available. Due to its origin there maybe some references to the original report within the text, so don't let this confuse you.

This essay starts with an introduction to Hidden Markov Models and continues with a brief explanation of Graphical Models. Thereafter, Bayesian Networks and their relationship to various other models, such as the Hidden Markov Models, is outlined.

The illustration below might aid in understanding the relationship between hidden Markov models, Graphical Models and Bayesian networks.

Frame14
where the abbreviations have the following meaning: GM Graphical Model, UGM Undirected Graphical Model, DGM Directed Graphical Model, BN Bayesian Network, DBN Dynamic Bayesian Network, HMM Hidden Markov Model, KF Kalman Filter and NN Neural Network1.



1 Hidden Markov Model

1.1 Introduction

A traditional approach to speech recognition systems is the use of grammars and templates.

The grammar approach one is a very strict approach and only allows limited flexibility of the recognition system, i.e. only paths that are predefined in the grammar can be taken. It explicitly models all possible combinations of patterns, such as words or utterances. Therefore, it can be used alone only with speech that is always pronounced in exactly the same way (same speed, same pitch, etc.). Hence it is rather useful with speech that has been preprocessed and converted to a machine-readable format.

The template approach is based on the idea of deriving typical sequences of speech frame for a pattern, which can be, but is not limited to a e.g. Word, via some averaging procedure [1]. Then a distance measure function can be used to compare patterns and temporally align them to account for e.g. differences in speaking rates across talkers or repetitions of the word by the same talker. The methodology of the template approach is well developed and provides good recognition performance for a variety of practical applications.

The inherent problem of both approaches is lack of flexibility. To date, it has not been possible for anyone to adequately model natural language to the extent that it can be used in automatic speech recognition systems (ASR). This already puts a stop on the grammar approach. Furthermore, though the template approach is a simplified, non-parametric method which already contains some inherent statistical signal characterization, due to the way the templates have been collected, it is not sufficient. Particularly, it neglects the very important second-order statistics, i.e. covariances.

1.2 Markov Processes

1.2.1 Discrete Time Markov Processes

Consider a system that can be in one of a set of N distinct states indexed by {1, 2, ..., N}. At regularly spaced, discrete times the system undergoes a change of state (that also includes the same state, a repetition) according to a set of probabilities associated with the sate. The time instants associated with state changes are denoted as t = 1, 2, .... The actual state at time t be denoted as qt. In general, a full probabilistic description of the system would require specification of the current state at time t, as well as all predecessor states.

Here, however, we limit ourselves to the special case of a discrete time, first order Markov chain. Therefore the probabilistic dependence can be truncated to just the preceding state2:



(1)



Furthermore, we only consider those processes in which the right-hand side of (1) is independent of time. Thus we get the state-transition probabilities aij:


(2)


Since the aij obey stochastic constraints, they have the following properties:


(3)

and

(4)


Because the output of the process is the set of states at each instant of time and, therefore, each state corresponds to an observable event, this stochastic process could be called an observable Markov model.

A simple example of this model is the following three-state Markov model of the weather (Illustration 2). We assume that once a day the weather is observed as being one of the following:

We demand that the weather on day t is characterized by a single one of the three states above, and that the matrix A of state-transition probabilities aij is:

Frame1


Given the model of Illustration 2, we can now answer questions about the behavior of weather patterns over time. For example, “What is the probability that the weather for eight consecutive days is sun-sun-sun-rain-rain-sun-cloudy-sun?”:

We define the observation sequence O as O = (3, 3, 3, 1, 1, 3, 2, 3) and the probability distribution over the initial state π as

(5)



Then we can directly determine P(O|Model), assuming we assign π =(0, 0, 1):



(6)






1.2.2 Limitations of Discrete Time Markov Processes

Markov models, in which each state corresponds to a deterministically observable event, cannot model random output since the output is tied to the state. These models are far too restrictive too many problems of interest, such as speech recognition.

1.3 Definition of Hidden Markov Models

Hidden Markov Models (HMMs) are a frequently used tool for modeling time series data. They are used in numerous applications such as image recognition, pattern recognition, data compression and speech recognition. It can represent probability distributions over sequences of observations:

Since discrete time Markov processes are too limited and too restrictive, we extend the concept of Markov models in the way that the observation is a probabilistic function of the state. The resulting model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable; it is hidden. The hidden states are the true states of the described system. They are assumed to be discrete3 and can only be observed through another set of stochastic processes that produce the sequence of observations. It is important to note that the number of observed states may differ from the number of hidden states. Furthermore, we assume that the state of this hidden process satisfies the Markov property4. The outputs also satisfy a Markov property with respect to the states: Given qt, ot is independent of the states and observations at all other time indeces.

Taken together, these Markov properties mean that the joint distribution of a sequence of states and observations can be factored in the following way5 [2]:


(7)



To define a probability distribution over sequences of observations, all that is left to specify is the probability distribution over the initial state π as in (5), the NxN state transition matrix A defining P(qt|qt-1) and the output model defining P(ot|qt). It is usually assumed that the state transition matrices as the output models are – except for the initial state – time invariant6. Therefore, if the observables are discrete symbols which can take on one of M values, the output model can be fully specified by a NxM observation matrix B, which is also called the emission matrix. If the observation vectors are real-valued, P(ot|qt) can be modeled in many different forms, e.g. Gaussian7, mixture of Gaussians, neural network etc.

Therefore, an HMM model can be fully specified by the following parameter set:


(8)



For convenience, a more frequently used compact notation is


(9)



In fact, this parameter set defines a probability measure for O, i.e. P(O|λ). So when we use the terminology HMM we are referring to the parameter set λ and its associated probability measure.

1.4 Problems for HMMs

1.4.1 Problems

Once a system can be described by an HMM, three problems need to be addressed:

1.4.1.1 Probability Evaluation

Abstract problem

We have a number of HMMs8 – a set of (A, B, π) triples – which describe different systems, and a sequence of observations O1:T. How can we find out efficiently which HMM most probably generated the sequence?

Example

For example, let's use the weather example as given in section 1.2.1. We extend it in the following way: Depending on the season, the weather transition differs. So we have four hidden states, one for each season: spring, summer, autumn, winter. We now want to determine the season based on the observation of the weather.

Occurrence

This sort of problem occurs in speech recognition when a large number of Markov models is used, i.e. each modeling a particular word . So after the HMM best fitting to an observation sequence formed from a spoken word is identified, the word is identified as well.

1.4.1.2 Optimal State Sequence

Abstract problem

Given the observation sequence O1:T and the HMM λ, how do we choose a corresponding state sequence Q1:T that best explains the observations?

Example

Consider the extended weather example from above. We want to know sequence of the seasons (e.g.: is it winter, autumn, spring, summer?)

Occurrence

This is the probably most important part of a speech recognition system, since this is where the actual recognition takes place. It can also be used, as in e.g. in Natural Language Processing, to tag words with their syntactic class. Then the words in the sentences are the observables, and the syntactic classes are their hidden states.

1.4.1.3 Parameter Estimation

Abstract problem

Given the observation sequence O1:T and the HMM λ, how do we adjust the model parameters λ = (A, B, π) to maximize P(O| λ )?

Example

As before, the extended weather example from above shall be used: After using our system for some time we realized, that it is not as precise as we want it to be. So we want to readjust the probabilities of our weather prediction system.

Occurrence

This is most common when training the speech recognition system.

1.4.2 Solutions

1.4.2.1 Probability Evaluation

Exhaustive search for solution

The most straightforward way to calculate the probability of the observation sequence O1:T given the model λ = (A, B, π) is through enumeration of every possible state sequence of length T, the number of observations. Since there are N possible transitions from each state to another, there are NT such state sequences.

If given one such fixed state sequence S1:t, where s1 is the initial state, assuming statistical independence of observations the probability of the state sequence is


(10)



(11)


The probability of such a state sequence S can be written as


(12)


with the joint probability of O and q


(13)



Hence the probability of O is obtained by summing this joint probability over all possible state sequences S, giving


(14)



This way of calculating is computationally expensive, if not even infeasible. Therefore a more effective algorithm is required.

The Forward Algorithm

We can use the time invariance of the probabilities to reduce the complexity of the problem. Consider the forward variable αt(i) defined as

(15)



αt(i) is the probability of the partial observation sequence O1:t and state i at time t, given the model λ. In other words, these partial probabilities represent the probability of getting to a particular state q at time t. This way we can reduce the complexity of calculating (14).

αt(i) can be solved inductively as follows:

  1. Initialization


(16)


  1. Induction


(17)


  1. Termination


(18)

Frame2

The basic idea of the forward algorithm, also called the forward procedure [1], is to exploit recursion in the calculations to avoid the necessity for exhaustive calculation of all paths. The probability of the observation sequence is found by calculating the partial probabilities at time t=1, 2, ..., T, and adding all α's at t=T.

Using the forward algorithm, it is straightforward to determine e.g. which of a number of HMMs best describes a given observation sequence - the forward algorithm is evaluated for each, and that giving the highest probability selected.



1.4.2.2 Optimal State Sequence

Unlike the probability evaluation in 1.4.2.1, there are several ways of solving this problem. Finding the optimal state sequence first requires definition of what an “optimal state sequence” is.

For example, one could maximize the number of states that are individually most likely to occur at each time t. This optimality criterion maximizes the expected number of correct individual states. Though this seems a viable solution at first glance, given some thought it quickly becomes obvious that there could be some problem with the resulting state sequence. For example, if the HMM has state transition which have probabilities of zero, i.e. those transitions are not possible, the resulting state sequence might not even be a valid one. Naturally, this problem can be eased through various ways, e.g. one could solve for the state sequence that maximizes the expected number of correct tuples of states. However, the most widely used criterion is to find the single best state sequence Q, which means maximizing P(Q|O,λ)=P(Q,O|λ). For this task the Viterbi algorithm [1][4] is used.

Viterbi Algorithm

We first define the partial probability δ, which is the probability of reaching a particular intermediate state in the lattice of observations and states:


(19)



Thus δ is the maximum probability of all sequences ending at state i at time t. Furthermore, the partial best path is the sequence which achieves this maximum probability. In a simplified 3-state model for the first 3 observations this would like this:

By induction we have

(20)


In order to retrieve the state sequences, we need to keep track of the argument that maximized (20), for each t and j. This is achieved by using an array ψt(j). The complete Viterbi algorithm is listed below:

  1. Initialization

(21)


  1. Recursion

(22)


  1. Termination

(23)


  1. Path (state sequence) backtracking

(24)



Except for the backtracking step, the Viterbi algorithm is similar in implementation to the forward calculation of (16)-(18). The main difference is the maximization over previous states in (22) that is used instead of the summation in (17).

Alternative Viterbi Implementation

By taking logarithms of the model parameters, the above Viterbi algorithm can be implemented without the need for any multiplications:

  1. Preprocessing

(25)

  1. Initialization

(26)

  1. Recursion

(27)

  1. Termination

(28)

  1. Backtracking

(29)




1.4.2.3 Parameter Estimation

Also called the learning problem, this is the most difficult problem. It stems from the fact that there are many circumstances in practical problems where the model parameters cannot be directly estimated. Though there is no known way to analytically determine the model parameter set that maximizes the probability of the observation sequence in a closed form [1], we can choose λ=(A,B,π) such that its likelihood P(O|λ) is locally maximized using an iterative procedure such as the Baum-Welch method9 [5] or using gradient techniques [6].

Here the iterative EM algorithm will be explained. First though, we need to define the backward procedure:

Backward Procedure

Similar to the forward procedure in 1.4.2.1, we define the backward variable βt(i), which is the probability of the partial observation sequence from t+1 to the end, given state i at time t and the model λ:


(30)



βt(i) can then be solved inductively10:

  1. Initialization

(31)


  1. Induction

(32)



Further required definitions

We define the probability of being in state i at time t, and in state j at time t+1, given O and λ:


(33)



Using the definition of the forward and the backward variables, we can write ξt(i,j) in the form


(34)



Given ξt(i,j), we can define γt(i), the probability of being in state i at time t, given the entire observation sequence and the model:


(35)



By summing over γt(i) and ξt(i,j), we can get the two quantities:


(36)


(37)



Now we are ready to define the reestimation formulas which are known as

The Forward-Backward Algorithm

If we define the model λ=(A,B,π), we can compute the reestimated model λ*=(A*,B**) using following reestimation formulas:


(38)


(39)


(40)



The final result of this reestimation procedure is a Maximum Likelihood (ML) estimate of the HMM. It is important to realize that the forward-backward algorithm leads to local maxima only. Since the likelihood function is often very complex and can have many local maxima, this needs to be taken into account.

Baum's Auxiliary Function

Instead of deriving from the forward and backward procedure, one can also find the reestimation formulas (38)-(40) by maximizing Baum's auxiliary function over λ:


(41)



Baum's auxiliary function's advantage to gradient techniques is that it provides monotonic improvement in the likelihood, whilst gradient techniques are not guaranteed to do so [1].

Stochastic constraints

An important property of the reestimation procedure is that the stochastic constraints of the HMM parameters

(42)

(43)

(44)

are automatically incorporated at each iteration.



1.5 Continuous Observation Densities in HMMs

Since to this point the observations were characterized as discrete symbols chosen from a finite alphabet, a discrete probability density within each could be used. However, sometimes the observations are continuous signals or vectors. Using vector quantization codebooks and other methods, one can convert, such observations into a sequence of discrete symbols. However, an improper or not so well fit discretization might lead to degradation. Therefore it would be advantageous to model continuous signal representations directly.

In order to use a continuous observation density, the form of the model probability density function (pdf) needs to be restricted to ensure that the parameters of the pdf can be reestimated in a consistent way. The most general representation of the pdf, for which a reestimation procedure has been formulated [1], is a finite mixture of the form


(45)


where <o> is the observation vector11 that is modeled, cjk are the mixture coefficients, or mixture gains, for the kth mixture in state j, and N12 is any log-concave or elliptically symmetric density [6]. The most frequently used density is Gaussian13. To ease the understanding, first the definition of the one-dimensional case [7] is given:


(46)



Here, x is a real-valued random variable, a is the mean value and σ2 is the variance. In the case of a=0 and σ2=1, this is called the standard normal distribution or, equivalently, the standard Gaussian distribution In the multidimensional case, which is the usual used one in ASR systems, the multivariate Gaussian distribution is defined as


(47)



For clarification: X=(x1, x2, ..., xn), A=(a1, a2, ..., an) and Σ2=(σ21, σ22 , ..., σ2n ).

In the following, it is assumed without loss of generality that N in (45) is Gaussian with mean vector µjk and covariance matrix Ujk for the kth mixture component in state j. Since the mixture gains cjk satisfy the stochastic constraint


(48)



the pdf is properly normalized:


(49)



To denote the reestimation formulas for the pdf, we first need to define γt14, the probability of being in state j at time t with the kth mixture component accounting for <o>t, the observation vector at time t:


(50)



Then the reestimation formulas for the coefficients of the mixture density are of the form


(51)


(52)


(53)



It has been shown in [8] that an HMM state with a mixture density is equivalent to a multistate single-mixture density model. The basic idea is to create substates, where the transition coefficients from each state to its substates are the equal to the mixture coefficients. Then the distribution of the composite set of substates (each with a single density) is mathematically equivalent to the composite mixture density within a single state. This is illustrated by illustration 4 below:



Frame3


1.6 Types and Classes of HMMs

1.6.1 Types

1.6.1.1 Ergodic HMM

Ergodic HMM, or fully connected HMM, is the type of HMM we have considered until now: Every state of the model could be reached in a single step from every other state of the model. For example, a three state ergodic HMM would look like illustration 5 and is defined as in (54):



Frame4



(54)



1.6.1.2 Left-Right HMM

The fundamental property of a left-right HMM, or Bakis model, is that as time increases, the state index increases (or stays the same), thus the system proceeds from left to right. It can easily model signals whose properties change over time in a successive manner, i.e. speech.

Frame5
The entries of the transition matrix are

(55)

giving us

(56)


1.6.1.3 Variations

Though there are other models, the ergodic and the left-right HMM are the most important ones. Besides, many HMMs are just variations from the previous two, where they vary in the sense that there are



1.6.2 Classes

1.6.2.1 Factorial HMMs

We generalize the HMM by representing the state St using a collection of discrete state variables


(57)



If each can take on K values15, the state space of the factorial HMM consists of all KM combinations of the St(m) variables, and the transition structure results in a KMxKM transition matrix. Since both the time complexity and sample complexity of estimation algorithm are exponential in M, and interesting structures cannot be discovered due to the arbitrary interaction of all variables, we want to constrain the underlying state transitions. For example, we can let each state variable evolve to its own dynamics and uncouple it from the other state variables:


(58)





Frame6


This is further described in [2].



1.6.2.2 Tree structured HMMs

In factorial HMMs, the state variables at one time step are assumed to be a priori independent given the state variables at the previous time step. If we couple the state variables in a single time step [9], we can relax this assumption.

For example, we can order them, such that St(m) depends on St(n) for 1≤n<m. Furthermore, we can make the state variables and the output dependent on an observable input variable Xt:



Frame7


Tree structured HMMs are useful for modeling time series with both temporal and spatial structure at multiple resolutions.



1.6.2.3 Autoregressive HMMs

The class of autoregressive HMMs16 is particularly applicable to speech processing. The observation vectors are drawn from an autoregression process. The components of the observation vector O0:K-1 are assumed to be from an autoregressive Gaussian source, satisfying the relationship


(59)



where e0≤k<K are Gaussian, independent, identically distributed random variables with zero mean and variance σ2e, and a1≤i≤p are the autoregression or predictor coefficients.

Verbally, since there is “memory” or “feedback”, the system can generate internal dynamics, and the current observation can be estimated by a linear weighted sum of previous observations, where the weights are the autoregression coefficients.

The problem in autoregression (AR) is to derive the best values ai from a given series O0:K-1. Most methods assume the series is linear and stationary. The main two categories are least squares and Burg method [10].The most common least squares method is based upon the Yule-Walker equations. More specific information can be found in [1].



1.6.2.4 Vector Linear Predictor HMMs

When using HMM, it is assumed that successive frames from a given output state are independent and identically distributed (IID). This piece-wise constant assumption about the temporal evolution of speech signals is known not to be a valid assumption [11], in fact, the speech signals are highly temporally correlated.

A possible solution to this is to extend the standard HMM to allow explicit modeling of temporal information by using Vector Linear Predictors (VLP). VLP extended models can capture temporal dynamics whilst maintaining the efficient training and decoding algorithms of the standard HMMs. VLP-HMMs deliver better results than standard HMMs, but poorer than standard HMMs with Dynamic Parameters. Furthermore, VLP-HMMs do not give a better discrimination since the model is not the true speech model, and the selection of the predictor model offsets are arbitrary.

VLP-HMMs are in all aspects the same as standard HMMs except for the definition of the output probability density. For VLP-HMMs, the output probability density for state j of a particular Gaussian model is defined as


(60)



where Cj is the covariance matrix of the prediction error ej(t) for state j at time t, which is defined as

(61)

where Ajp is the pth predictor for state j, qp is the “offset” associated with the pth predictor and µjp is the mean value for vectors at offset qp. The predictors can have arbitrary offsets, and either a full or a diagonal matrix can be used as predictor.

The reestimation formulas for maximum likelihood and maximum mutual information (MMI) can be derived using (60) and (61), though it should be noted that the MMI formulas require a constant D which represents the step size and ensures convergence. This constant is also required for the calculation of the inter-frame covariance matrix between observations from offset qx and qy (the state index j is dropped for clarity):


(62)



In the case of ML, D=0 and ψj(t) is γt(j) (cf. (35)). In the case of MMI D is chosen to ensure convergence and ψj(t)=γt(j)-γt(j)gen, where γt(j)gen is the occupation probability for frame t of sentence r from the confusion lattice.

More can be found in [11]



1.7 Optimization Criteria

Though to this point only the maximum likelihood criterion has been used, there are other criteria such as the maximum mutual information (MMI) and minimum discrimination information (MDI). Brief formal definition of the latter two along with ML are given below:

1.7.1 Maximum Likelihood (ML)

The standard ML criterion is to use a training sequence of observations O to derive the set of model parameters λ, yielding


(63)



All previously discussed reestimation algorithms, as for example the forward-backward algorithm in 1.4.2.3, provide a solution to this problem.

1.7.2 Minimum Discrimination Information (MDI)

The basic idea about statistical models is that if the model parameters are correctly chosen, the signal or the observation sequence can be modeled. As already mentioned 1.6.2.4, the assumed model is sometimes inadequate to model the observed signal; no matter how carefully the model parameters are chosen, the modeling accuracy is limited. One can try to overcome this model mismatch situation with MDI17.

We associate the observed signal O1:T with a sequence of constraints R1:T (e.g. rt be the autocorrelation matrix that characterizes the observation ot). Obviously, depending on R, O is only one of possibly uncountably many observation sequences that satisfy R. Hence there exists a set of probability distributions of the observation sequences that would also satisfy R, which we denote Ω(R). The MDI is a measure of closeness between two probability measures under the given constraint R and is defined by


(64)


(65)


where Pλ is the probability measure in HMM form, and I(Q : Pλ) is the discrimination information – calculated based on the given training set of observations – between distributions Q and Pλ [1], q(*) and p(*|λ) are the pdf corresponding to Q and Pλ respectively.

The MDI criterion tries to choose a model parameter set λ such that is v(R, Pλ) minimized. The MDI can be interpreted that the model parameter λ is chosen so that the model P(O|λ) is as close as possible to a member of the set Ω(R). This is the motivation behind the MDI criterion: Since the closeness is always measured in terms of the discrimination information evaluated on the given observation, the intrinsic characteristics of the training sequences strongly influence the parameter selection, and by emphasizing the measure discrimination, the model estimation is no longer limited by the assumed model form. However, the MDI optimization is not as straightforward as the ML optimization problem, and no simple robust implementation of the procedure is known [1].



1.7.3 Maximum Mutual Information (MMI)

As the MDI criterion, the maximum mutual information criterion (MMI) tries to maximize the discrimination of each model18.

We define P(v) to be the a priori probability for word v, whereas v is taken from a vocabulary of V words. Each v is represented by an HMM, with parameter set λv, v=1, 2, ..., V. Together with the a priori probabilities the set of HMMs Λ={λv} thus defines a probability measure for an arbitrary observation sequence O


(66)



where P(O|λv, v) indicates that it is a probability conditioned on the word v. Now, to train these models utterances of known (labeled) words are used. These labeled training sequences are denoted by Ov, with superscript v denoting that Ov is a rendition of word v. Then the mutual information is defined as


(67)



The MMI criterion to find the entire model set Λ such that the mutual information is maximized:


(68)



Though both ML and MMI are cross-entropy approaches, they differ in the following way:

The problem of MMI is though, that (Λ)MMI is not as straightforward to obtain as (Λ)ML. Furthermore one often has to use general optimization procedures like the descent algorithm to solve (68) which lead to numerical problems in implementations.

2 Graphical Models (GM)

2.1 Introduction

Graphical models (GM) combine probability theory and graph theory. They provide a tool for dealing with the problems of uncertainty and complexity. They play an increasingly important role in the design and analysis of machine learning algorithms. The basic idea of graphical models is the notion of modularity: a complex system is built by combining simpler parts. Where these various parts are combined, probability theory ensures that the system as a whole is consistent and provides ways to interface models to data. Due to its graph theoretic side, humans can easily and intuitively model highly-interacting sets of variables. Furthermore, the resulting data structure not only lends itself naturally to the design of efficient general-purpose algorithms, but also can be used with all the well-developed and highly efficient graph algorithms suited for the specific network topology.

Probabilistic graphical models are graphs in which nodes represent random variables. Arcs, or the lack of arcs, represent conditional independence assumptions. Therefore they provide a compact representation of joint probability distributions [12]. There are two types of graphical models: Undirected graphical models and directed graphical models.

2.2 Undirected Graphical Models

2.2.1 Introduction

As “undirected” already implies, in undirected graphical models, if there is an arc between a node A and a node B, one can always go from A to B and from node B (back) to node A with equal probability, that is, there is no “one-way street”.

Undirected graphical models are an important tool for representing probability distributions. They are most popular with the physics and vision communities, and are of little importance for speech recognition. Their set of semantics differs from those of directed graphical models and can be looked up in [13] and [14]. To illustrate the differences between undirected and directed graphical models, the definition of independence is given:

2.2.2 Independence

Undirected graphical models19 have a simple definition of independence [12]:

Two sets of nodes A and B are conditionally independent given a third set, C, if all paths between the nodes in A and B are separated by a node in C.



2.3 Directed Graphical Models

2.3.1 Introduction

As opposed to undirected graphical models, the arcs of directed graphical models are “one-way streets”. If there is an arc from a node A to a node B, one can only go from B to A when there is an explicit extra arc from B to A20.

Directed graphical models are especially useful for AI and statistics applications and, hence, for speech recognition. They can encode deterministic relationships and are easier to “learn” than undirected graphical models, that is fit to data.

2.3.2 Definitions

Before we continue, we need to state some basic definitions from graph theory. Consider illustration 9 below:



Frame8
The node X is a parent of another node Y if there is a directed arc from X to Y. If so, Y is a child of X. The descendants of a node are its children, its children's children, etc. For example, in illustration 9 B, C and D are A's descendants. By contrast, the ancestors of a node are its parents, its parents' parents etc.

A directed path from a node X to node Y is a sequence of nodes starting from X and ending in Y such that each node in the sequence is a parent of the following node in the sequence, e.g. (A, B, D) is a directed path whilst (C, B, D) is not.

An undirected path from X to Y is a sequence of nodes starting from X and ending in Y such that each node in the sequence is a parent or a child of the following node. Hence a directed path is always also an undirected path, but not vice-versa.



2.3.3 Independence

Directed graphical models21 have a more complicated notion of independence which takes the directionality of the arcs into account [2]:

Each node is conditionally independent from its non-descendants given its parents. More generally, two disjoint sets of nodes A and B are conditionally independent given C, if C d-separates them.

d-separation [2][15] means that if along every undirected path between a node in A and a node in B there is a node D such that one of the following applies:

  1. D has converging arrows22, and neither D nor its descendants are in C

  2. D does not have converging arrows and D is in C

Put differently [2], A is conditionally independent from B given C if


(69)


Though this definition is more complicated, directed graphical models have several advantages. Their most important one is that one can regard an arc from A to B as indicating that A causes B, which aids in constructing the graph structure.

2.3.4 Conditional Probability Distribution (CPD)

In addition to the graph structure we need to specify the parameters of the model. For a directed model, we need to specify the conditional probability distribution (CPD) at each node. If the variables are discrete, the CPD can be represented as a table, the Conditional Probability Table (CPT), which lists the probability that each child node takes on each of its different values for each combination of values of its parents [12].

3 Bayesian Networks (BN)

3.1 Introduction

A Bayesian Network (BN) [15] has the ability to represent a probability distribution over arbitrary sets of random variables [16]. It is possible to factor these distributions in arbitrary ways, and to make arbitrary conditional independence assumptions. This makes it possible to vastly reduce the number of parameters required to represent a probability distribution and, therefore, allows the model parameters to be estimated with greater accuracy from a limited amount of data [17][18].

3.2 Why “Bayesian”?

Bayesian networks are named “Bayesian” because they make use of Bayes' Rule, also called Bayes' Theorem, for inference [12], which is


(70)



where P(B|A) is the a posteriori probability of B given A, and P(A) is the a priori probability of A.

3.3 Conditional Independence and Representation

A BN is a directed graphical model, as introduced in 2.3. It is a graphical way to represent a particular factorization of a joint distribution. A directed arc is drawn from a node A to a node B if B is conditioned on A in the factorization of the joint distribution. For example, consider the BN



Frame9


with the independence relations


(71)



It is easy to see, that e.g. Y is conditioned on W. Furthermore, given the values of X and Y we can use the conditional independence definition from 2.3.3 and show that W and Z are independent:


(72)


The absence of arcs in a BN implies conditional independence relations which can be exploited to obtain efficient algorithms for computing marginal and conditional probabilities [2]. For example, the simplest conditional independence relationship encoded in a Bayesian network can be stated as follows:

A node is independent of its ancestors given its parents, where the ancestor/parent relationship is with respect to some fixed topological ordering of the nodes [12].

3.4 Example

Before we continue, a brief example shall be given:

Frame11
All nodes are binary, i.e. have two possible values: True (T) and false (F). We see that the event “grass is wet” (W=T) has two possible causes: Ether it is raining (R=T) or the water sprinkler is on (W=T). It is shown in the corresponding tables how strongly this relationship is, e.g. we see that P(W=T|S=T, R=F)=0.9 and therefore P(W=F|S=T, R=F)=1-0.9=0.1. Since the C node has no parents, its CPT specifies the prior probability that it is cloudy (0.5 in this case).

By the joint rule of probability, the joint probability of all the nodes in the graph above23 is


(73)



By using conditional independence relationships, we can rewrite this as


(74)



We can simplify P(R|C,S) because R is independent of S given its parent C, and we can simplify P(W|C,S,R) because W is independent of C given its parents S and R.

3.5 Inference

The most common task we wish to solve using BN is probabilistic inference [12]. Consider the example given in 3.4 above. Suppose we observe that the grass is wet. There are two possible causes for this: Either it is raining (R=T) or the sprinkler is on (S=T). We can use Bayes' Rule to compute the posterior probability of each explanation:


(75)


(76)


(77)



So we see that it is more likely that the grass is wet, because it is raining: The likelihood ratio is 0.7079/0.4298 =1.647.

3.6 Explaining away

In the above example, two causes “compete” to “explain” the observed data. Hence even though S and R are marginally independent, they become conditionally independent when their common child W is observed. For example, when the grass is wet and we know that it is raining, the (posterior) probability of the sprinkler being on goes down: P(S=T|W=T, R=T)=0.1945.

This is called explaining away, also known as Berkenson's paradox or selection bias [12].

3.7 Reasoning

There are two ways to infer in a Bayesian network [12]: bottom-up and top-down:

Bottom-up, or diagnostic, is to infer the most likely cause from an evidence of an effect. In the above example, we observed “grass is wet” and inferred that the most likely reason is that it is raining.

Top-down, or generative, is causal reasoning. For example, again using the above example, we can compute the probability that the grass will be wet given that it is cloudy.

3.8 Causality or correlation?

Bayesian networks enable us to put discussions on causality on a solid mathematical basis (cf 2.3.3 and 3.3). However, there may be the problem of distinguishing causality from mere correlation. We can “sometimes” [12] solve this, but we need to measure the relationship between at least three variables, where one acts as a “virtual control” for the relationship between the other two, so we don't always need to do experiments to infer causality.

3.9 Singly connected and multiply connected networks

If there is no undirected path with loops in the underlying graph of a BN, e.g. as in illustration 10, the network is called a singly connected network, and a general algorithm called belief propagation [15][19]24 exists.

For multiply connected networks, in which there can can be more than one undirected path between any two nodes, there exists a more general algorithm known as junction tree algorithm [20][21].

3.10 Evidence and Belief Propagation

The observed value of some variables in the network is called evidence. The goal of belief propagation is to update the marginal probabilities of all the variables in the network to incorporate this new evidence. This is achieved by local message passing:

Each node n sends a message to its parents and to its children. Since the graph is singly connected (cf 3.9), n separates the graph, and therefore the evidence, into two mutually exclusive sets:



Frame10


The message from n to each of its children is the probability of each setting of n given the evidence observed in the set e+(n). If n is a K-valued discrete variable, then this message is a K-dimensional vector. For real-valued variables, the message is the probability density over the domain of values n can take.

The message from n to each of its parents is the probability, given every setting of the parent, of the evidence observed in the set e-(n){n}. The marginal probability of a node is proportional to the product of the messages obtained from its parents, weighted by the conditional probability of the node given its parents, and the message obtained from its children. If the parents of n are {p1, ..., pk} and the children of n are {c1, ..., cl}, then


(78)



where the summation (or, more generally, the integral) extends over all settings of {p1, ..., pk}. For example, for the BN in illustration 10, given the evidence e={X=x, Z=z},


(79)



where P(W) is the message, passed from W to Y since e+(W)={}27, and P(Z=z, X=x|Y) is the message passed from Z to Y. Variables in the evidence set are referred to as observable variables, while those not in the evidence set are referred to as hidden variables.

3.11 BN with discrete and continuous nodes

Though the previous examples used nodes with categorical values, as has already been implied in 3.10 it is also possible to create Bayesian networks with continuous valued nodes. The most common distribution for such variables is the Gaussian. Its implementation is similar to the HMM one (cf 1.5).

3.12 Dynamic Bayesian Networks (DBN)

3.12.1 Introduction

Dynamic Bayesian Networks28 (DBN) are directed graphical models of stochastic processes. Put differently, they are simply BN for modeling time series data. Since we assume in time series modeling that an event can cause another event in the future, but no event in the future can cause one in the past, directed arcs flow forward in time29. Hence the design of the BN is greatly simplified. For example, consider the example above:

Frame12
Illustration 13 is an example of a first-order Markov process, where each of the variables Ot in the observed sequence O1:t is influenced only by the previous variable Ot-1:


(80)



As already mentioned in 1.2.1, one way to extend Markov models is to allow higher interactions between variables, e.g. use a nth-order Markov process where n>1. But we can go much further than that, in fact, we can implement HMMs and other models as a DBN, which can have significant advantages.

3.12.2 Strengths and Advantages of DBN

Dynamic Bayesian networks are ideally suited for modeling temporal processes. They have the following advantages over other models such as Kalman filters [12], Neural networks [16] and HMMs [16]:

  1. nonlinearity:
    By using a tabular representation of conditional probabilities30, it is quite easy to represent arbitrary nonlinear phenomena. Moreover, it is possible to do efficient computation with DBNs even when the variables are continuous and the conditional probabilities are represented by Gaussians [22].

  2. interpretability:
    Each variable represents a specific concept.

  3. factorization:
    The joint distribution is factorized as much as possible. This leads to:

    1. statistical efficiency:
      Compared to an unfactored HMM with an equal number of possible states, a DBN with a factored state representation and sparse connections between variables will require exponentially fewer parameters.

    2. computational efficiency:
      Depending on the exact graph topology, the reduction in model parameters may be reflected in a reduction in running time.

  4. extensibility:
    DBNs can handle large numbers of variables, provided the graph structure is sparse.

  5. semantics:
    DBNs have a precise and well-understood probabilistic semantics, i.e. each variable can model a specific concept.

3.12.3 Hidden Markov Models as DBN

Hidden Markov models are a sub-class of DBN in which we assume that observations are dependent on a hidden variable (state). The following illustration shows the simplest kind of an HMM with one discrete hidden node and one discrete observed node per slice:



Frame13


Unrolled for four time slices, O1:4 are the observed nodes and Q1:4 are the hidden nodes of the above HMM. The structure and parameters are assumed to repeat as the model is unrolled further. Hence to specify a DBN we need to define

Though it is straightforward to create a DBN from an HMM by use of a single hidden variable (which can take as many values as there are states in the HMM) in each time slice, there is no known way to convert an existing HMM to the minimal32 equivalent DBN [16].

However, there is a simple procedure for constructing an HMM from a DBN. What we want to specify are the five quantities33 of an HMM:

  1. the number of states

  2. the number of observation symbols

  3. transition probabilities between states

  4. emission probabilities for observation symbols

  5. probability distribution on the initial states

Given a DBN, these quantities can be derived for an equivalent HMM in the following way:

  1. The number of HMM states is equal to the number of ways the DBN states can be instantiated, so if we have e.g. k binary DBN state notes, there are 2k HMM states.

  2. The number of HMM observation symbols is equal to the number of ways the DBN observation nodes can be instantiated, so if we have e.g. k binary DBN observation notes, there are 2k HMM observation symbols.

  3. To calculate the HMM transition probability from state i to state j:

    1. instantiate the DBN state nodes in a time slice t in the configuration that corresponds to HMM state i

    2. instantiate the state nodes in slice t+1 in the configuration that corresponds to j.

    3. the transition probability is the product of the probabilities of the state nodes in slice t+1 given their parents

  4. Emission probabilities are calculated in a similar manner as the transition probabilities, but only the state and observation nodes in a single time slice are instantiated.

  5. To calculate the probability of being initially in state i:

    1. instantiate the DBN state nodes in time slice 1 in the configuration that corresponds to HMM state i

    2. the product of the probabilities of the state nodes in time slice 1 given their parents34 is the desired prior

3.12.4 Inference

For inference in singly connected networks the belief propagation algorithm has already been introduced in 3.10. However, besides the junction-tree algorithm mentioned in 3.9, for multiply connected DBN cutset conditioning [21] may be the simplest algorithm [16].

3.12.5 Learning

In general, the learning techniques for Bayesian networks are analogous to the learning techniques for HMMs. In both cases, EM35 is used. However, under some circumstances it is necessary to resort to gradient descent:

4 Comments

Due to lack of time and for the sake of brevity many important algorithms and methods, such as e.g. deleted interpolation and corrective training [1], inference in clique trees and inference in general graphs [16], or comparisons to other interesting models such as the Kalman filter, neural networks or dynamic Bayesian multinets [24] have not found their way into this essay. They can be looked up in the literature list, where especially recommended readings are [25], [2], [1], [12] and [16].



III Index

2TBN 33

σ2 3pp., 7p., 10, 13

A 5

advantage of Baum's auxiliary function 12

advantages of DBN 32

advantages of directed graphical models 25

aij 2

all pole filter 18

Alternative Viterbi Implementation 9

ancestor 24

ancestors 24

a posteriori 26

AR 19

arc 25

arcs 23

ASR 2

automatic speech recognition systems 2

Autoregression 19

autoregression coefficients 19

autoregression process 18

Autoregressive HMMs 18

auxiliary function 12

B 5

backward procedure 10

backward variable 10

Bakis model 16

Baum's Auxiliary Function 12

Baum's auxiliary function's advantage 12

Baum-Welch method 10

Bayesian Networks 26

Bayes' Rule 24, 26

Bayes' Theorem 26

Belief Networks 24

belief propagation 29p.

Berkenson's paradox 29

BN 26

BNs 24

bottom-up 29

Burg method 19

causality 29

causal reasoning 29

child 24

cjk 13

clique trees 35

conditional independence 23

conditionally independent 24

conditional probability distribution 25

Conditional Probability Table 25

continuous 13, 31

Continuous Observation Densities 13

continuous valued nodes 31

converging arrows 24

corrective training 35

correlation 29

covariances 2

CPD 25

CPT 25

cross-entropy 22

cross-entropy approach 22

cutset conditioning 34

DBN 31

Definition of Hidden Markov Models 4

descent algorithm 22

diagnostic 29

directed graphical models 23

directed path 24

discrete hidden states 4

Discrete Time Markov Processes 2

discrimination information 21

d-separate 24

d-separation 24

Dynamic Bayesian Multinets 35

Dynamic Bayesian Networks 31

Dynamic Parameters 19

elliptically symmetric density 13

EM 10

emission matrix 5

ergodic 15

Ergodic HMM 15

Evidence 30

expectation-maximization method 10

explaining away 29

exponential family 34

Factorial HMMs 17

first order Markov chain 2

Forward Algorithm 7

Forward-Backward Algorithm 11

forward procedure 8

fully connected HMM 15

Gaussian 13

generative 29

GM 23

gradient descent 34

gradient techniques 10, 12

grammars 2

Graphical Models 23

graphs 23

graph theory 23

hidden 4, 31

Hidden Markov Model 2

hidden variable 32

hidden variables 31

HMM 2

IID 19

IIR 18

independence 23

independent and identically distributed 19

infinite impulse response filter 18

initial state probability distribution 4

inter-slice topology 33

intra-slice topology 33

joint probability distributions 23

joint rule of probability 28

junction tree algorithm 29

Kalman Filter 32

learning problem 10

least squares 19

left-right HMM 16

likelihood ratio 29

Limitations of Discrete Time Markov Processes 4

local maxima 12

local message passing 30

log-concave density 13

Markov networks 23

Markov Processes 2

Markov property 4

Markov Random Fields 23

maximum entropy model 18

Maximum Likelihood 12, 20

Maximum Mutual Information 20p.

MDI 20

mean value 13

message 30

Minimum Discrimination Information 20

mixture 13

mixture coefficients 13

mixture gains 13

ML 12, 20

MMI 20p.

model mismatch situation 20

MRFs 23

multiply connected networks 29

multivariate Gaussian distribution 13

Neural Networks 32

normal distribution 13

observable 31

observable Markov model 3

observable variables 31

observation matrix 5

observation sequence 4

observation vector 13

Optimal State Sequence 6, 8

output probability density 19

parallel path left-right HMM 17

Parameter Estimation 6

parent 24

partial observation sequence 7

partial probabilities 7

partial probability 8

pdf 13

predictor coefficients 19

probability density function 13

Probability Evaluation 5p.

probability theory 23

random variable 13

reestimation formulas 11

second-order statistics 2

selection bias 29

singly connected 29

standard Gaussian distribution 13

standard normal distribution 13

templates 2

temporal Bayesian network 31

time-invariant 5, 31

top-down 29

Tree structured HMMs 18

two-slice temporal BN 33

undirected graphical models 23

undirected path 24

variance 13

Vector Linear Predictor HMMs 19

Vector Linear Predictors 19

vector quantization codebooks 13

virtual control 29

Viterbi Algorithm 8p.

VLP 19

Yule-Walker equations 19



IV Bibliography

1: L.K. Saul, M. Rahim, "Modeling acoustic correlations by factor analysis", 1998

2: K. Murphy, "A Brief Introduction to Graphical Models and Bayesian Networks", 2001

3: A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", 1977

4: L. Rabiner, B.-H. Juang, "Fundamentals of Speech Recognition", 1993

5: Zoubin Ghahramani, "An Introduction to Hidden Markov Models and Bayesian Networks",

6: A.J. Viterbi, "Error bounds f. convolutional codes and asymptotically optim. decod. algor.", 1967

7: L.R. Rabiner, S.E. Levinson, M.M. Sondhi, "Application of the Theory of Probabilistic Functions to ASR", 1983

8: O. Kerner, T. Thode, J. Steffens, R. Voller, J. Maurer, "Vieweg Mathematik Lexikon", 1995

9: B. Juang, S.E. Levinson, M.M. Sondhi, "ML estimation for multivariate mixture observations of Markov chains", 1985

10: L.K. Saul, M.I, Jordan, "Mixed memory Markov models", 1997

11: Paul Bourke, "Autoregression Analysis (AR)", 1998

12: K.K. Chin, P.C. Woodland, "Maximum Mutual Information Training of HMMs with Vector Linear Predictors", 2002

13: J. Besag, "Spatial interaction and the statistical analysis of lattice systems", 1974

14: S. Geman, D. Geman, "Stochastic relaxation, Gibbs distributions, Bayesian restoration of images", 1984

15: J. Pearl, "Prob. Reasoning in Intelligent Systems: Networks of Plausible Inference", 1988

16: G.G. Zweig, "Speech Recognition with Dynamic Bayesian Networks", 1998

17: G.G. Zweig, "A forward-backward algorithm for inference in Bayesian networks", 1996

18: Z. Ghahramani, M.I. Jordan, "Factorial hidden Markov models", 1997

19: J.H. Kim, J. Peal, "A comp. model for causal and diagnostic reasoning in inference systems", 1983

20: F.V. Jensen, S.L. Lauritzen, K.G. Olesen, "Bayesian updating in recursive graphical models by local computations", 1990

21: S.L. Lauritzen, D.J. Spiegelhalter, "Local computations with probabilities on graphical structures and their ...", 1988

22: R.D. Schachter, M.A. Peot, "Simulation approaches to general probabilistic inference on belief networks", 1989

23: J. Binder, D. Koller, S. Russell, K. Kanazawa, "Adaptive probabilistic networks with hidden variables", 1997

24: J.A. Bilmes, "Dynamic Bayesian Multinets", 2000

25: A.P. Dempster, "Covariance Selection", 1972


V Illustrations

  1. the relationships between the various models in this essay 1

  2. Markov model of the weather 3

  3. partial best paths in a simplified 3-state model for the first 3 observations 8

  4. Equivalence of a state with a mixture density to a multistate single-density distribution. i is the current state, w a wait state and s the substates 15

  5. three state ergodic HMM 16

  6. three state left-right HMM 16

  7. three state factorial HMM 18

  8. three state tree structured HMM 18

  9. a simple graph 24

  10. a directed acyclic graph (DAG) consistent with the conditional independence relations in P(W, X, Y, Z) 26

  11. a directed graph modeling the weather 27

  12. separation of evidence in singly connected graphs 30

  13. a Bayesian network representing a first-order Markov process 31

  14. HMM as DBN with one discrete hidden node and one discrete observed node, unrolled for 4 time slices 33



1though the Kalman Filter and Neural Networks are not covered by this essay, they are put in there to ease the understanding of the relationships between models in the dynamic Bayesian networks framwork

2Note: If we wish to model a second, third, ... order Markov process, we need to take the corresponding amount of preceding states into account (2, 3, ...). n order Markov properties means that the current state qt is dependent on only the last n preceding states

3Qt can take on N values which will be denoted by the integers {1, ..., N}

4usually the first-order Markov property, but higher orders can also be used

5for notational convenience O1:t denotes o1, ..., ot

6not dependent on t

7for high-dimensional real-valued observations, a very useful model can be obtained by replacing the Gaussian by a factor analyzer (FA) [3]

8a set of λ

9also known as the expectation-maximization method (EM)

10in the initialization step, βT(i) is initialized with an arbitrary value

11The observation vector differs from the observation sequence in the way that the entries in the observation vector all represent observations at the same time, whilst entries in the observation sequence vector represent consecutive observations. Thus an observation sequence O can have observation vectors as its components.

12not to be confused with N

13also known as normal distribution

14in the case of a simple mixture, γt generalizes to (35)

15naturally this can be extended, so that each state variable St(m) can take on K(m) variables

16an autoregressive model is also known as infinite impulse response filter (IIR), all pole filter and, in physics applications, as a maximum entropy model

17also with MMI, cf. 1.7.3

18the ability to distinguish between observation sequences generated by the model that describes the correct word and

19also called Markov Random Fields (MRFs) or Markov networks [12]

20Of course, for brevity one may define undirected arcs in a directed graphical model to represent two arcs with opposing directions but equal probability.

21Also called Bayesian networks or belief networks (BNs) (cf page 26), it is important to note that despite the name, Bayesian methods don't need to be used at all. Rather they are so called because they use Bayes' rule for inference

22that is, D is a child of both the previous and the following nodes in the path [12]

23compare to the example in 3.3

24cf 3.10

25the nodes for which the undirected path to n goes through a parent of n

26the nodes for which the undirected path to n goes through a child of n

27empty set

28temporal Bayesian network would be a better suited and descriptive name as long as one assumes the usual case that the structure of the BN does not change (time-invariant), but DBN has become entrenched [12]

29of course, they may also point to the same node to model repetitions

30CPT, cf 2.3.4 and the example 3.4

31such a two-slice temporal BN is often called 2TBN

32minimal in terms of the number of parameters or computational requirements

33as defined in (8) in 1.3

34if there are no parents, they are not required

35cf 1.4.2.3 and [5]

36cf 1.7.1-1.7.3 and [1]