Training and Decoding the AN4 Corpus Using HTK

Introduction

This tutorial walks step by step through creating and training an acoustic model in HTK. Refer to section 2 of the HTKBook for the original tutorial and to section 4 for details on each of the programs used here.

Conventions used in this tutorial:

  1. Italics are used to denote commands to run. For most commands, you can add “-T 0001” to the command to output detailed trace information (useful for debugging).
  2. Courier New font denotes text file contents
  3. Boldface is used to denote important filenames

Environment Setup

  1. Obtain the latest HTK 3.0 baseline. Note that you may need to register to download the software.
  2. Unzip and untar the release in your Cygwin root directory:
    /htk
  3. Configure, make, and install HTK (see the included README files for detailed instructions)

Data Preparation

  1. Create a directory called /htktut and copy into it the helper Perl scripts that will be used to process the AN4-specific formats.
  2. Obtain the AN4 corpus data. Make sure to obtain the version with the files in NIST Sphere format.
  3. Extract the corpus to a directory called /an4.
  4. cd /htktut
  5. Create a file containing the allowable grammar for each word and utterance in the AN4 corpus. Author a file called gram with the following contents:

$words = A | AND | APOSTROPHE | APRIL |

AREA | AUGUST | B | C | CODE |

D | DECEMBER | E | EIGHT | EIGHTEEN |

EIGHTEENTH | EIGHTH | EIGHTY | ELEVEN | ELEVENTH |

F | FEBRUARY | FIFTEEN |

FIFTEENTH | FIFTH | FIFTY | FIRST | FIVE |

FORTY | FOUR | FOURTEEN | FOURTH | G |

GO | H | HALF | HUNDRED |

I | J | JANUARY | JULY | JUNE |

K | L | M | MARCH | MAY |

N | NINE | NINETEEN | NINETY | NINTH | NOVEMBER |

O | OCTOBER | OF | OH | ONE | P | Q | R |

S | SECOND | SEPTEMBER |

SEVEN | SEVENTEEN | SEVENTH | SEVENTY |

SIX | SIXTEEN | SIXTEENTH | SIXTH | SIXTY | T | TEN | THIRD |

THIRTEEN | THIRTIETH | THIRTY | THOUSAND | THREE |

TWELFTH | TWELVE | TWENTIETH | TWENTY | TWO |

U | V | W | X | Y | Z | ZERO;

$singleCmd = GO | YES | NO | REPEAT | STOP | ERASE | HELP;

$startCmd = RUBOUT | ENTER;

( silence ( $singleCmd | $startCmd <$words> | <$words> ) silence )

  6. Use the program HParse to create a word network that graphs the words and their allowed transitions using the created grammar:
    HParse.exe gram wdnet
  7. Create the pronunciation dictionary dict for the AN4 corpus:
    cp /an4/etc/an4.dict /htktut
    ./dict_clean.pl > dict
  8. Create the training word MLF (Master Label File). This file contains the transcriptions of all the training data.
    cp /an4/etc/an4_train.transcription /htktut

./an4_mlf_maker.pl > words.mlf
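The helper script an4_mlf_maker.pl is not reproduced in this tutorial. As a rough idea of what such a conversion involves, here is a minimal Python sketch. It assumes each transcription line has the shape "<s> WORDS </s> (utterance-id)"; the function name and the sample utterance ids are made up for illustration.

```python
import re


def transcription_to_mlf(lines):
    """Convert AN4-style transcription lines into HTK word-level MLF text."""
    out = ["#!MLF!#"]  # every MLF starts with this header line
    for line in lines:
        # Assumed line shape: "<s> WORD WORD </s> (utterance-id)"
        match = re.match(r"^(.*)\((\S+)\)\s*$", line.strip())
        if not match:
            continue  # skip blank or unexpected lines
        text, utt_id = match.groups()
        words = [w for w in text.split() if w not in ("<s>", "</s>")]
        out.append(f'"*/{utt_id}.lab"')  # label file pattern for this utterance
        out.extend(words)                # one word per line
        out.append(".")                  # a single period ends each entry
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = ["<s> GO </s> (an251-fash-b)", "ENTER SIX TWO (cen1-fjam-b)"]
    print(transcription_to_mlf(sample), end="")
```

The real script may differ, for example in how it handles sentence markers or upper and lower case.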

  9. Author the following HLEd edit script, mkphones0.led:
    EX

IS sil sil

DE sp

  10. Run HLEd to create the phone-level MLF:
    HLEd.exe -l '*' -d dict -i phones0.mlf mkphones0.led words.mlf
  11. Code the training data to MFCC with the following commands:
    mkdir an4_train_audio an4_train_mfcc

find /an4/wav/[train] -name "*.sph" | xargs -i cp {} an4_train_audio

find /htktut/an4_train_audio > pre_codetr.scp

./make_codetr.pl > codetr.scp
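HCopy reads its script file as pairs of source and target file names, one pair per line. The helper make_codetr.pl is not shown here; the Python sketch below illustrates one way such pairs could be built from the output of find. The function name and the .mfc extension are assumptions.

```python
from pathlib import PurePosixPath


def make_scp_pairs(source_paths, target_dir):
    """Build 'source target' lines for an HCopy script file."""
    pairs = []
    for src in source_paths:
        path = PurePosixPath(src.strip())
        if path.suffix != ".sph":
            continue  # e.g. the directory entry that find prints first
        # Assumed naming: same base name, .mfc extension, in target_dir
        target = PurePosixPath(target_dir) / (path.stem + ".mfc")
        pairs.append(f"{path} {target}")
    return pairs


if __name__ == "__main__":
    listing = [
        "/htktut/an4_train_audio",
        "/htktut/an4_train_audio/an251-fash-b.sph",
    ]
    pairs = make_scp_pairs(listing, "/htktut/an4_train_mfcc")
    print("\n".join(pairs))
    # A list of coded files only (second column), as a training script
    # file such as train.scp would presumably contain.
    print("\n".join(pair.split()[1] for pair in pairs))
```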

  12. Author the file config_hcopy:

# Coding parameters

SOURCEKIND = WAVEFORM

SOURCEFORMAT = NIST

#SOURCERATE = 625

TARGETKIND = MFCC_0_D_A

TARGETRATE = 100000.0

SAVECOMPRESSED = T

SAVEWITHCRC = T

WINDOWSIZE = 250000.0

USEHAMMING = T

PREEMCOEF = 0.97

NUMCHANS = 26

CEPLIFTER = 22

NUMCEPS = 12

ENORMALISE = F

  13. Run the HCopy program to convert the waveform data to MFCC:

HCopy.exe -T 1 -C config_hcopy -S codetr.scp
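The numbers in config_hcopy are easier to read once converted. HTK expresses times in units of 100 ns, and the MFCC_0_D_A target kind stacks the static coefficients (NUMCEPS plus the zeroth coefficient) with their delta and acceleration coefficients. The short Python sketch below works through that arithmetic; the helper names are made up for illustration.

```python
HTK_UNIT_SECONDS = 100e-9  # HTK time values are given in units of 100 ns


def htk_units_to_ms(value):
    """Convert an HTK time value (100 ns units) to milliseconds."""
    return value * HTK_UNIT_SECONDS * 1000.0


def feature_vector_size(num_ceps, use_c0=True, deltas=True, accelerations=True):
    """Size of one feature vector for an MFCC target kind."""
    static = num_ceps + (1 if use_c0 else 0)          # _0 adds C0
    blocks = 1 + (1 if deltas else 0) + (1 if accelerations else 0)
    return static * blocks                            # _D and _A add two blocks


frame_shift_ms = htk_units_to_ms(100000.0)  # TARGETRATE
window_ms = htk_units_to_ms(250000.0)       # WINDOWSIZE
vec_size = feature_vector_size(12)          # NUMCEPS = 12 with MFCC_0_D_A

if __name__ == "__main__":
    print(round(frame_shift_ms, 3), round(window_ms, 3), vec_size)
```

The vector size computed here is the value that must appear as <VecSize> in the model files.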

Create Monophone Context Independent Models:

1. Create the hmm directories. These will store each Baum-Welch re-estimated version of the model.

mkdir hmm0 hmm1 hmm2 hmm3 hmm4 hmm5 hmm6 hmm7 hmm8 hmm9 hmm10 hmm11 hmm12 hmm13 hmm14 hmm15

2. Author proto. This is the initial model architecture for each phone HMM: five states in total, of which the three middle states (2 to 4) are emitting:

~o <VecSize> 39 <MFCC_0_D_A>

~h "proto"

<BeginHMM>

<NumStates> 5

<State> 2

<Mean> 39

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

<Variance> 39

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

<State> 3

<Mean> 39

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

<Variance> 39

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

<State> 4

<Mean> 39

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

<Variance> 39

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

<TransP> 5

0.0 1.0 0.0 0.0 0.0

0.0 0.6 0.4 0.0 0.0

0.0 0.0 0.6 0.4 0.0

0.0 0.0 0.0 0.7 0.3

0.0 0.0 0.0 0.0 0.0

<EndHMM>
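Typing 39 zeros and ones by hand is error-prone, so the prototype can also be generated. The Python sketch below writes the same text as the proto file above; the function name and its parameters are made up for illustration.

```python
def make_proto(vec_size=39, num_states=5, kind="MFCC_0_D_A"):
    """Return the text of a left-to-right prototype HMM definition."""
    zeros = " ".join(["0.0"] * vec_size)
    ones = " ".join(["1.0"] * vec_size)
    lines = [
        f"~o <VecSize> {vec_size} <{kind}>",
        '~h "proto"',
        "<BeginHMM>",
        f"<NumStates> {num_states}",
    ]
    # States 2 .. num_states-1 are emitting; the first and last are not.
    for state in range(2, num_states):
        lines += [
            f"<State> {state}",
            f"<Mean> {vec_size}", zeros,
            f"<Variance> {vec_size}", ones,
        ]
    lines.append(f"<TransP> {num_states}")
    for i in range(num_states):
        row = [0.0] * num_states
        if i == 0:
            row[1] = 1.0                      # entry state jumps to state 2
        elif i < num_states - 1:
            # Initial guesses only, matching the proto file above
            stay = 0.7 if i == num_states - 2 else 0.6
            row[i] = stay
            row[i + 1] = 1.0 - stay
        lines.append(" ".join(f"{p:.1f}" for p in row))
    lines.append("<EndHMM>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_proto(), end="")
```

The transition probabilities are only starting values; re-estimation replaces them.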

3. Author config.

# Coding parameters

TARGETKIND = MFCC_0_D_A

TARGETRATE = 100000.0

SAVECOMPRESSED = T

SAVEWITHCRC = T

WINDOWSIZE = 250000.0

USEHAMMING = T

PREEMCOEF = 0.97

NUMCHANS = 26

CEPLIFTER = 22

NUMCEPS = 12

ENORMALISE = F

4. Use HCompV to create the first template HMM set in hmm0. The script file train.scp lists the coded MFCC training files, one per line.

HCompV.exe -C config -f 0.01 -m -S train.scp -M hmm0 proto

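5. Author hmm0/macros. This file contains common information needed to do model re-estimation: the global options line and the variance floor macro (HCompV writes the variance floors to hmm0/vFloors):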

~o <MFCC_0_D_A> <VecSize> 39

~v varFloor1

<Variance> 39

7.266113e-01 4.722334e-01 9.523441e-01 6.613073e-01 7.113036e-01 5.931423e-01 6.048728e-01 6.435860e-01 4.942584e-01 4.086919e-01 4.240367e-01 2.945997e-01 1.170168e+00 2.929090e-02 2.164918e-02 2.933550e-02 2.839202e-02 3.198905e-02 2.880209e-02 3.227245e-02 3.238301e-02 2.820307e-02 2.370699e-02 2.331906e-02 1.896634e-02 3.761791e-02 4.119497e-03 3.521117e-03 4.239024e-03 4.587236e-03 5.152307e-03 4.821633e-03 5.481833e-03 5.578703e-03 4.909256e-03 4.267938e-03 4.158779e-03 3.423044e-03 5.544049e-03

6. Author the hmm0/hmmdefs Master Macro File (MMF). This file contains the definitions for all HMMs (one for each phone for now).

cp /an4/etc/an4.phone /htktut

./make_mmf.pl > hmm0/hmmdefs

cp an4.phone monophones0

Add the phone “sp” to monophones0 and save the result as a new file, monophones1.
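The helper make_mmf.pl is not reproduced here. In the original HTKBook tutorial, hmmdefs is built by copying the prototype written by HCompV once per phone and relabelling each copy. The Python sketch below shows that idea under the assumption that the prototype text contains a single <BeginHMM> ... <EndHMM> body; the function name is made up for illustration.

```python
def clone_prototype(proto_text, phones):
    """Return hmmdefs text: one copy of the prototype body per phone."""
    lines = proto_text.splitlines()
    # Keep only the HMM body; the ~o options line lives in the macros file.
    start = next(
        i for i, line in enumerate(lines) if line.strip().upper() == "<BEGINHMM>"
    )
    body = "\n".join(lines[start:])
    return "".join(f'~h "{phone}"\n{body}\n' for phone in phones)


if __name__ == "__main__":
    tiny_proto = (
        '~o <VecSize> 2 <MFCC_0>\n~h "proto"\n'
        "<BeginHMM>\n<NumStates> 3\n<EndHMM>\n"
    )
    print(clone_prototype(tiny_proto, ["AA", "sil"]), end="")
```

In practice the input would be the contents of hmm0/proto and the phone list would come from monophones0.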

7. Perform the first re-estimation to create hmm1:

HERest.exe -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones0

8. Create hmm2 and hmm3 (performing more re-estimation and refinement on the CI models).

HERest.exe -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm1/macros -H hmm1/hmmdefs -M hmm2 monophones0

HERest.exe -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm2/macros -H hmm2/hmmdefs -M hmm3 monophones0

9. Author hmm4/hmmdefs by copying hmm3/hmmdefs and appending the following sp model to it:

~h "sp"

<BEGINHMM>

<NUMSTATES> 3

<STATE> 2

<MEAN> 39

-9.510303e+00 -1.681556e+00 -2.859189e+00 -3.372297e+00 -3.174411e+00 -4.109135e+00 -6.625902e+00 -3.097823e+00 -3.832037e+00 -1.941238e+00 -2.021214e+00 -9.669427e-01 4.440228e+01 -6.433553e-03 1.031006e-01 -5.078135e-02 -1.677521e-02 -6.127201e-03 -1.733933e-02 -7.105561e-02 -7.521260e-02 5.238789e-03 3.878872e-02 4.696222e-03 -4.062191e-02 -1.051957e-01 -9.008758e-03 -6.651144e-03 2.551728e-02 3.584296e-03 8.466722e-03 1.126705e-02 2.028633e-02 1.525841e-02 8.087316e-04 -1.189610e-03 -1.932883e-03 1.927950e-03 3.416471e-02

<VARIANCE> 39

5.016994e+00 1.055978e+01 9.199114e+00 1.149608e+01 1.264298e+01 1.406597e+01 1.837638e+01 1.957598e+01 1.790414e+01 1.914055e+01 1.634619e+01 1.472672e+01 1.478365e+01 1.714879e-01 3.757185e-01 5.289578e-01 6.911641e-01 8.839969e-01 1.038755e+00 1.234214e+00 1.296763e+00 1.341149e+00 1.388875e+00 1.319310e+00 1.238895e+00 1.258024e-01 3.459969e-02 6.561287e-02 1.027706e-01 1.376464e-01 1.742430e-01 2.065688e-01 2.438512e-01 2.532034e-01 2.670280e-01 2.767306e-01 2.619496e-01 2.470811e-01 1.865003e-02

<GCONST> 7.528535e+01

<TRANSP> 3

0.000000e+00 1.000000e+00 0.000000e+00

0.000000e+00 9.335565e-01 6.644349e-02

0.000000e+00 0.000000e+00 0.000000e+00

<ENDHMM>

cp hmm3/macros hmm4/macros

10. Author the HHEd script file sil.hed:

AT 2 4 0.2 {sil.transP}

AT 4 2 0.2 {sil.transP}

AT 1 3 0.3 {sp.transP}

TI silst {sil.state[3],sp.state[2]}

11. Create hmm5. This version adds extra transitions to the sil model and ties the emitting state of sp to the centre state of sil, so that pauses in utterances are modelled more accurately.

HHEd.exe -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophones1
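The AT commands in sil.hed each add one transition to a transition matrix; the other transitions leaving the same state are rescaled so that the row still sums to one. The Python sketch below illustrates that effect on a small matrix. It assumes proportional rescaling of the remaining entries and is not HHEd's actual code; the function name is made up.

```python
def add_transition(trans, i, j, prob):
    """Add a transition i -> j (1-based, as in HHEd scripts) with probability prob.

    The remaining entries of row i are rescaled so the row sums to one.
    Returns a new matrix; the input is left unchanged.
    """
    row = list(trans[i - 1])
    row[j - 1] = 0.0
    total = sum(row)
    scale = (1.0 - prob) / total if total > 0 else 0.0
    new_row = [p * scale for p in row]
    new_row[j - 1] = prob
    result = [list(r) for r in trans]
    result[i - 1] = new_row
    return result


if __name__ == "__main__":
    # A three-state model shaped like sp: entry, one emitting state, exit
    sp = [[0.0, 1.0, 0.0],
          [0.0, 0.9, 0.1],
          [0.0, 0.0, 0.0]]
    # Same effect as "AT 1 3 0.3 {sp.transP}": sp can now be skipped entirely
    for row in add_transition(sp, 1, 3, 0.3):
        print(row)
```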

12. Create hmm6 and hmm7 (re-estimate with the new silence models)

HERest.exe -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm5/macros -H hmm5/hmmdefs -M hmm6 monophones1

HERest.exe -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm6/macros -H hmm6/hmmdefs -M hmm7 monophones1

13. Append the following to dict:

silence sil

14. Create the aligned MLF. This version of the phone-level MLF takes into account multiple pronunciations of a single word.

HVite.exe -l '*' -o SWT -b silence -C config -a -H hmm7/macros -H hmm7/hmmdefs -i aligned.mlf -m -t 250.0 -y lab -I words.mlf -S train.scp dict monophones1

15. Create hmm8 and hmm9 using the aligned version of the phone MLF.

HERest.exe -C config -I aligned.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm7/macros -H hmm7/hmmdefs -M hmm8 monophones1

HERest.exe -C config -I aligned.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm8/macros -H hmm8/hmmdefs -M hmm9 monophones1

16. Author gauss_split.hed:

MU 2 {*.state[2-4].mix}

17. Modify hmm9 using the new script to split the Gaussians used in each state.

HHEd.exe -H hmm9/macros -H hmm9/hmmdefs -M hmm9 gauss_split.hed monophones1

18. Retrain hmm9 using the aligned version of the phone MLF.

HERest.exe -C config -I aligned.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm9/macros -H hmm9/hmmdefs -M hmm9 monophones1

HERest.exe -C config -I aligned.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm9/macros -H hmm9/hmmdefs -M hmm9 monophones1
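The MU command raises the number of Gaussian mixture components in each listed state. As the HTKBook describes it, the heaviest component is split repeatedly: its weight is halved and the two copies have their means moved apart by 0.2 standard deviations. The Python sketch below illustrates that idea for diagonal-covariance mixtures. It is a simplification rather than HHEd's implementation, and the function names are made up.

```python
import math


def split_heaviest(mixtures, perturb=0.2):
    """Split the heaviest component of a list of (weight, means, variances)."""
    idx = max(range(len(mixtures)), key=lambda k: mixtures[k][0])
    weight, means, variances = mixtures[idx]
    # Move the two copies apart by `perturb` standard deviations per dimension
    offsets = [perturb * math.sqrt(v) for v in variances]
    up = (weight / 2, [m + o for m, o in zip(means, offsets)], list(variances))
    down = (weight / 2, [m - o for m, o in zip(means, offsets)], list(variances))
    return mixtures[:idx] + [up, down] + mixtures[idx + 1:]


def mix_up(mixtures, target):
    """Split components until the state has `target` of them."""
    while len(mixtures) < target:
        mixtures = split_heaviest(mixtures)
    return mixtures


if __name__ == "__main__":
    single = [(1.0, [0.0], [4.0])]  # one Gaussian: mean 0, variance 4
    for component in mix_up(single, 2):
        print(component)
```

After each split the models are re-estimated, which is why every HHEd call with gauss_split.hed is followed by HERest passes.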

Create Tied-state, Context Dependent Triphone HMM’s

1. Author the mktri.led HLEd script file:

WB sp

WB sil

TC

2. Create the triphone MLF and the triphone listing. These will be the new models, which are context-dependent on the preceding and following phones.

HLEd.exe -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
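With sp and sil declared as word boundaries (WB), the TC command produces word-internal triphones: contexts never cross a boundary symbol, and a word with a single phone stays a monophone. The Python sketch below imitates that conversion on a plain phone sequence. It is only an illustration of the naming scheme, and the function name is made up.

```python
def to_word_internal_triphones(phones, boundaries=("sp", "sil")):
    """Rename phones to l-p+r form without crossing boundary symbols."""
    out, word = [], []

    def flush():
        # Emit the phones collected for the current word with their contexts
        n = len(word)
        for k, phone in enumerate(word):
            name = phone
            if k > 0:
                name = f"{word[k - 1]}-{name}"   # left context
            if k < n - 1:
                name = f"{name}+{word[k + 1]}"   # right context
            out.append(name)
        word.clear()

    for phone in phones:
        if phone in boundaries:
            flush()
            out.append(phone)  # boundary symbols are kept unchanged
        else:
            word.append(phone)
    flush()
    return out


if __name__ == "__main__":
    print(" ".join(to_word_internal_triphones("sil th ih s sp m ae n sp".split())))
```

Single-phone words keep their monophone label, which is one reason plain monophones can still be needed in the model list later on.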

3. Create the mktri.hed HHEd script file using the script maketrihed:

maketrihed monophones1 triphones1
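In the HTKBook tutorial, the generated mktri.hed starts with a CL command that clones the monophones into the triphone list, followed by one TI command per phone that ties the transition matrices of all its triphones. The Python sketch below produces a file of that shape. It is an approximation of what maketrihed does, and the function name is made up.

```python
def make_mktri_hed(monophones, triphone_list="triphones1"):
    """Return the text of an HHEd script that clones and ties transitions."""
    lines = [f"CL {triphone_list}"]  # clone monophones into the triphone set
    for phone in monophones:
        # Tie the transition matrices of every triphone of this phone
        lines.append(
            f"TI T_{phone} {{(*-{phone}+*,{phone}+*,*-{phone}).transP}}"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_mktri_hed(["AH", "B"]), end="")
```

Whether the real script skips sil and sp is not shown in this tutorial.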

4. Create hmm10. This model version will replace the monophone models with their triphone versions.

HHEd.exe -B -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed monophones1

5. Create hmm11 (re-estimation using the new triphone models)

HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm10/macros -H hmm10/hmmdefs -M hmm11 triphones1

6. Create hmm12. Make sure to include the -s flag to generate the stats file:

HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -s stats -S train.scp -H hmm11/macros -H hmm11/hmmdefs -M hmm12 triphones1

7. Author tree.hed file:

RO 100.0 stats

TR 0

QS "R_NONBOUNDARY" { *+* }

QS "R_SILENCE" { *+sil }

QS "R_STOP" { *+P,*+PD,*+B,*+T,*+TD,*+D,*+DD,*+K,*+KD,*+G }

QS "R_NASAL" { *+M,*+N,*+EN,*+NG }

QS "R_FRICATIVE" { *+S,*+SH,*+Z,*+F,*+V,*+CH,*+JH,*+TH,*+DH }

QS "R_LIQUID" { *+L,*+EL,*+R,*+W,*+Y,*+HH }

QS "R_VOWEL" { *+EH,*+IH,*+AO,*+AA,*+UW,*+AH,*+AX,*+ER,*+AY,*+OY,*+EY,*+IY,*+OW }

QS "R_C-FRONT" { *+P,*+PD,*+B,*+M,*+F,*+V,*+W }

QS "R_C-CENTRAL" { *+T,*+TD,*+D,*+DD,*+EN,*+N,*+S,*+Z,*+SH,*+TH,*+DH,*+L,*+EL,*+R }

QS "R_C-BACK" { *+SH,*+CH,*+JH,*+Y,*+K,*+KD,*+G,*+NG,*+HH }

QS "R_V-FRONT" { *+IY,*+IH,*+EH }

QS "R_V-CENTRAL" { *+EH,*+AA,*+ER,*+AO }

QS "R_V-BACK" { *+UW,*+AA,*+AX,*+UH }

QS "R_FRONT" { *+P,*+PD,*+B,*+M,*+F,*+V,*+W,*+IY,*+IH,*+EH }

QS "R_CENTRAL" { *+T,*+TD,*+D,*+DD,*+EN,*+N,*+S,*+Z,*+SH,*+TH,*+DH,*+L,*+EL,*+R,*+EH,*+AA,*+ER,*+AO }

QS "R_BACK" { *+SH,*+CH,*+JH,*+Y,*+K,*+KD,*+G,*+NG,*+HH,*+AA,*+UW,*+AX,*+UH }

QS "R_FORTIS" { *+P,*+PD,*+T,*+TD,*+K,*+KD,*+F,*+TH,*+S,*+SH,*+CH }

QS "R_LENIS" { *+B,*+D,*+DD,*+G,*+V,*+DH,*+Z,*+SH,*+JH }

QS "R_UNFORTLENIS" { *+M,*+N,*+EN,*+NG,*+HH,*+L,*+EL,*+R,*+Y,*+W }

QS "R_CORONAL" { *+T,*+TD,*+D,*+DD,*+N,*+EN,*+TH,*+DH,*+S,*+Z,*+SH,*+CH,*+JH,*+L,*+EL,*+R }

QS "R_NONCORONAL" { *+P,*+PD,*+B,*+M,*+K,*+KD,*+G,*+NG,*+F,*+V,*+HH,*+Y,*+W }

QS "R_ANTERIOR" { *+P,*+PD,*+B,*+M,*+T,*+TD,*+D,*+DD,*+N,*+EN,*+F,*+V,*+TH,*+DH,*+S,*+Z,*+L,*+EL,*+W }

QS "R_NONANTERIOR" { *+K,*+KD,*+G,*+NG,*+SH,*+HH,*+CH,*+JH,*+R,*+Y }

QS "R_CONTINUENT" { *+M,*+N,*+EN,*+NG,*+F,*+V,*+TH,*+DH,*+S,*+Z,*+SH,*+HH,*+L,*+EL,*+R,*+Y,*+W }

QS "R_NONCONTINUENT" { *+P,*+PD,*+B,*+T,*+TD,*+D,*+DD,*+K,*+KD,*+G,*+CH,*+JH }

QS "R_STRIDENT" { *+S,*+Z,*+SH,*+CH,*+JH }

QS "R_NONSTRIDENT" { *+F,*+V,*+TH,*+DH,*+HH }

QS "R_UNSTRIDENT" { *+P,*+PD,*+B,*+M,*+T,*+TD,*+D,*+DD,*+N,*+EN,*+K,*+KD,*+G,*+NG,*+L,*+EL,*+R,*+Y,*+W }

QS "R_GLIDE" { *+HH,*+L,*+EL,*+R,*+Y,*+W }

QS "R_SYLLABIC" { *+EN,*+M,*+L,*+EL,*+ER }

QS "R_UNVOICED-CONS" { *+P,*+PD,*+T,*+TD,*+K,*+KD,*+S,*+SH,*+F,*+TH,*+HH,*+CH }

QS "R_VOICED-CONS" { *+JH,*+B,*+D,*+DD,*+DH,*+G,*+Y,*+L,*+EL,*+M,*+N,*+EN,*+NG,*+R,*+V,*+W,*+Z}

QS "R_UNVOICED-ALL" { *+P,*+PD,*+T,*+TD,*+K,*+KD,*+S,*+SH,*+F,*+TH,*+HH,*+CH,*+sil }

QS "R_LONG" { *+IY,*+AA,*+OW,*+AO,*+UW,*+EN,*+M,*+L,*+EL }

QS "R_SHORT" { *+EH,*+EY,*+AA,*+IH,*+AY,*+OY,*+AH,*+AX,*+UH }

QS "R_DIPTHONG" { *+EY,*+AY,*+OY,*+AA,*+ER,*+EN,*+M,*+L,*+EL }

QS "R_FRONT-START" { *+EY,*+AA,*+ER }

QS "R_FRONTING" { *+AY,*+EY,*+OY }

QS "R_HIGH" { *+IH,*+UW,*+AA,*+AX,*+IY }

QS "R_MEDIUM" { *+EY,*+ER,*+AA,*+AX,*+EH,*+EN,*+M,*+L,*+EL }

QS "R_LOW" { *+EH,*+AY,*+AA,*+AW,*+AO,*+OY }

QS "R_ROUNDED" { *+AO,*+UW,*+AA,*+AX,*+OY,*+W }

QS "R_UNROUNDED" { *+EH,*+IH,*+AA,*+ER,*+AY,*+EY,*+IY,*+AW,*+AH,*+AX,*+EN,*+M,*+HH,*+L,*+EL,*+R,*+Y }

QS "R_NONAFFRICATE" { *+S,*+SH,*+Z,*+F,*+V,*+TH,*+DH }

QS "R_AFFRICATE" { *+CH,*+JH }

QS "R_IVOWEL" { *+IH,*+IY }

QS "R_EVOWEL" { *+EH,*+EY }

QS "R_AVOWEL" { *+EH,*+AA,*+ER,*+AY,*+AW }

QS "R_OVOWEL" { *+AO,*+OY,*+AA }

QS "R_UVOWEL" { *+AA,*+AX,*+EN,*+M,*+L,*+EL,*+UW }

QS "R_VOICED-STOP" { *+B,*+D,*+DD,*+G }

QS "R_UNVOICED-STOP" { *+P,*+PD,*+T,*+TD,*+K,*+KD }

QS "R_FRONT-STOP" { *+P,*+PD,*+B }

QS "R_CENTRAL-STOP" { *+T,*+TD,*+D,*+DD }

QS "R_BACK-STOP" { *+K,*+KD,*+G }

QS "R_VOICED-FRIC" { *+Z,*+SH,*+DH,*+CH,*+V }

QS "R_UNVOICED-FRIC" { *+S,*+SH,*+TH,*+F,*+CH }

QS "R_FRONT-FRIC" { *+F,*+V }

QS "R_CENTRAL-FRIC" { *+S,*+Z,*+TH,*+DH }

QS "R_BACK-FRIC" { *+SH,*+CH,*+JH }

QS "R_AA" { *+AA }

QS "R_AE" { *+AE }

QS "R_AH" { *+AH }

QS "R_AO" { *+AO }

QS "R_AW" { *+AW }

QS "R_AX" { *+AX }

QS "R_AY" { *+AY }

QS "R_B" { *+B }

QS "R_CH" { *+CH }

QS "R_D" { *+D }

QS "R_DD" { *+DD }

QS "R_DH" { *+DH }

QS "R_DX" { *+DX }

QS "R_EH" { *+EH }

QS "R_EL" { *+EL }

QS "R_EN" { *+EN }

QS "R_ER" { *+ER }

QS "R_EY" { *+EY }

QS "R_F" { *+F }

QS "R_G" { *+G }

QS "R_HH" { *+HH }

QS "R_IH" { *+IH }

QS "R_IY" { *+IY }

QS "R_JH" { *+JH }

QS "R_K" { *+K }

QS "R_KD" { *+KD }

QS "R_L" { *+L }

QS "R_M" { *+M }

QS "R_N" { *+N }

QS "R_NG" { *+NG }

QS "R_OW" { *+OW }

QS "R_OY" { *+OY }

QS "R_P" { *+P }

QS "R_PD" { *+PD }

QS "R_R" { *+R }

QS "R_S" { *+S }

QS "R_SH" { *+SH }

QS "R_T" { *+T }

QS "R_TD" { *+TD }

QS "R_TH" { *+TH }

QS "R_TS" { *+TS }

QS "R_UH" { *+UH }

QS "R_UW" { *+UW }

QS "R_V" { *+V }

QS "R_W" { *+W }

QS "R_Y" { *+Y }

QS "R_Z" { *+Z }

QS "L_NONBOUNDARY" { *-* }

QS "L_SILENCE" { sil-* }

QS "L_STOP" { P-*,PD-*,B-*,T-*,TD-*,D-*,DD-*,K-*,KD-*,G-* }

QS "L_NASAL" { M-*,N-*,EN-*,NG-* }

QS "L_FRICATIVE" { S-*,SH-*,Z-*,F-*,V-*,CH-*,JH-*,TH-*,DH-* }

QS "L_LIQUID" { L-*,EL-*,R-*,W-*,Y-*,HH-* }

QS "L_VOWEL" { EH-*,IH-*,AO-*,AA-*,UW-*,AH-*,AX-*,ER-*,AY-*,OY-*,EY-*,IY-*,OW-* }

QS "L_C-FRONT" { P-*,PD-*,B-*,M-*,F-*,V-*,W-* }

QS "L_C-CENTRAL" { T-*,TD-*,D-*,DD-*,EN-*,N-*,S-*,Z-*,SH-*,TH-*,DH-*,L-*,EL-*,R-* }

QS "L_C-BACK" { SH-*,CH-*,JH-*,Y-*,K-*,KD-*,G-*,NG-*,HH-* }

QS "L_V-FRONT" { IY-*,IH-*,EH-* }

QS "L_V-CENTRAL" { EH-*,AA-*,ER-*,AO-* }

QS "L_V-BACK" { UW-*,AA-*,AX-*,UH-* }

QS "L_FRONT" { P-*,PD-*,B-*,M-*,F-*,V-*,W-*,IY-*,IH-*,EH-* }

QS "L_CENTRAL" { T-*,TD-*,D-*,DD-*,EN-*,N-*,S-*,Z-*,SH-*,TH-*,DH-*,L-*,EL-*,R-*,EH-*,AA-*,ER-*,AO-* }

QS "L_BACK" { SH-*,CH-*,JH-*,Y-*,K-*,KD-*,G-*,NG-*,HH-*,AA-*,UW-*,AX-*,UH-* }

QS "L_FORTIS" { P-*,PD-*,T-*,TD-*,K-*,KD-*,F-*,TH-*,S-*,SH-*,CH-* }

QS "L_LENIS" { B-*,D-*,DD-*,G-*,V-*,DH-*,Z-*,SH-*,JH-* }

QS "L_UNFORTLENIS" { M-*,N-*,EN-*,NG-*,HH-*,L-*,EL-*,R-*,Y-*,W-* }

QS "L_CORONAL" { T-*,TD-*,D-*,DD-*,N-*,EN-*,TH-*,DH-*,S-*,Z-*,SH-*,CH-*,JH-*,L-*,EL-*,R-* }

QS "L_NONCORONAL" { P-*,PD-*,B-*,M-*,K-*,KD-*,G-*,NG-*,F-*,V-*,HH-*,Y-*,W-* }

QS "L_ANTERIOR" { P-*,PD-*,B-*,M-*,T-*,TD-*,D-*,DD-*,N-*,EN-*,F-*,V-*,TH-*,DH-*,S-*,Z-*,L-*,EL-*,W-* }

QS "L_NONANTERIOR" { K-*,KD-*,G-*,NG-*,SH-*,HH-*,CH-*,JH-*,R-*,Y-* }

QS "L_CONTINUENT" { M-*,N-*,EN-*,NG-*,F-*,V-*,TH-*,DH-*,S-*,Z-*,SH-*,HH-*,L-*,EL-*,R-*,Y-*,W-* }

QS "L_NONCONTINUENT" { P-*,PD-*,B-*,T-*,TD-*,D-*,DD-*,K-*,KD-*,G-*,CH-*,JH-* }

QS "L_STRIDENT" { S-*,Z-*,SH-*,CH-*,JH-* }

QS "L_NONSTRIDENT" { F-*,V-*,TH-*,DH-*,HH-* }

QS "L_UNSTRIDENT" { P-*,PD-*,B-*,M-*,T-*,TD-*,D-*,DD-*,N-*,EN-*,K-*,KD-*,G-*,NG-*,L-*,EL-*,R-*,Y-*,W-* }

QS "L_GLIDE" { HH-*,L-*,EL-*,R-*,Y-*,W-* }

QS "L_SYLLABIC" { EN-*,M-*,L-*,EL-*,ER-* }

QS "L_UNVOICED-CONS" { P-*,PD-*,T-*,TD-*,K-*,KD-*,S-*,SH-*,F-*,TH-*,HH-*,CH-* }

QS "L_VOICED-CONS" { JH-*,B-*,D-*,DD-*,DH-*,G-*,Y-*,L-*,EL-*,M-*,N-*,EN-*,NG-*,R-*,V-*,W-*,Z-*}

QS "L_UNVOICED-ALL" { P-*,PD-*,T-*,TD-*,K-*,KD-*,S-*,SH-*,F-*,TH-*,HH-*,CH-*,sil-* }

QS "L_LONG" { IY-*,AA-*,OW-*,AO-*,UW-*,EN-*,M-*,L-*,EL-* }

QS "L_SHORT" { EH-*,EY-*,AA-*,IH-*,AY-*,OY-*,AH-*,AX-*,UH-* }

QS "L_DIPTHONG" { EY-*,AY-*,OY-*,AA-*,ER-*,EN-*,M-*,L-*,EL-* }

QS "L_FRONT-START" { EY-*,AA-*,ER-* }

QS "L_FRONTING" { AY-*,EY-*,OY-* }

QS "L_HIGH" { IH-*,UW-*,AA-*,AX-*,IY-* }

QS "L_MEDIUM" { EY-*,ER-*,AA-*,AX-*,EH-*,EN-*,M-*,L-*,EL-* }

QS "L_LOW" { EH-*,AY-*,AA-*,AW-*,AO-*,OY-* }

QS "L_ROUNDED" { AO-*,UW-*,AA-*,AX-*,OY-*,W-* }

QS "L_UNROUNDED" { EH-*,IH-*,AA-*,ER-*,AY-*,EY-*,IY-*,AW-*,AH-*,AX-*,EN-*,M-*,HH-*,L-*,EL-*,R-*,Y-* }

QS "L_NONAFFRICATE" { S-*,SH-*,Z-*,F-*,V-*,TH-*,DH-* }

QS "L_AFFRICATE" { CH-*,JH-* }

QS "L_IVOWEL" { IH-*,IY-* }

QS "L_EVOWEL" { EH-*,EY-* }

QS "L_AVOWEL" { EH-*,AA-*,ER-*,AY-*,AW-* }

QS "L_OVOWEL" { AO-*,OY-*,AA-* }

QS "L_UVOWEL" { AA-*,AX-*,EN-*,M-*,L-*,EL-*,UW-* }

QS "L_VOICED-STOP" { B-*,D-*,DD-*,G-* }

QS "L_UNVOICED-STOP" { P-*,PD-*,T-*,TD-*,K-*,KD-* }

QS "L_FRONT-STOP" { P-*,PD-*,B-* }

QS "L_CENTRAL-STOP" { T-*,TD-*,D-*,DD-* }

QS "L_BACK-STOP" { K-*,KD-*,G-* }

QS "L_VOICED-FRIC" { Z-*,SH-*,DH-*,CH-*,V-* }

QS "L_UNVOICED-FRIC" { S-*,SH-*,TH-*,F-*,CH-* }

QS "L_FRONT-FRIC" { F-*,V-* }

QS "L_CENTRAL-FRIC" { S-*,Z-*,TH-*,DH-* }

QS "L_BACK-FRIC" { SH-*,CH-*,JH-* }

QS "L_AA" { AA-* }

QS "L_AE" { AE-* }

QS "L_AH" { AH-* }

QS "L_AO" { AO-* }

QS "L_AW" { AW-* }

QS "L_AX" { AX-* }

QS "L_AY" { AY-* }

QS "L_B" { B-* }

QS "L_CH" { CH-* }

QS "L_D" { D-* }

QS "L_DD" { DD-* }

QS "L_DH" { DH-* }

QS "L_DX" { DX-* }

QS "L_EH" { EH-* }

QS "L_EL" { EL-* }

QS "L_EN" { EN-* }

QS "L_ER" { ER-* }

QS "L_EY" { EY-* }

QS "L_F" { F-* }

QS "L_G" { G-* }

QS "L_HH" { HH-* }

QS "L_IH" { IH-* }

QS "L_IY" { IY-* }

QS "L_JH" { JH-* }

QS "L_K" { K-* }

QS "L_KD" { KD-* }

QS "L_L" { L-* }

QS "L_M" { M-* }

QS "L_N" { N-* }

QS "L_NG" { NG-* }

QS "L_OW" { OW-* }

QS "L_OY" { OY-* }

QS "L_P" { P-* }

QS "L_PD" { PD-* }

QS "L_R" { R-* }

QS "L_S" { S-* }

QS "L_SH" { SH-* }

QS "L_T" { T-* }

QS "L_TD" { TD-* }

QS "L_TH" { TH-* }

QS "L_TS" { TS-* }

QS "L_UH" { UH-* }

QS "L_UW" { UW-* }

QS "L_V" { V-* }

QS "L_W" { W-* }

QS "L_Y" { Y-* }

QS "L_Z" { Z-* }

TR 2

TB 350.0 "ST_AA_2_" {("AA","*-AA+*","AA+*","*-AA").state[2]}

TB 350.0 "ST_AE_2_" {("AE","*-AE+*","AE+*","*-AE").state[2]}

TB 350.0 "ST_AH_2_" {("AH","*-AH+*","AH+*","*-AH").state[2]}

TB 350.0 "ST_AO_2_" {("AO","*-AO+*","AO+*","*-AO").state[2]}

TB 350.0 "ST_AW_2_" {("AW","*-AW+*","AW+*","*-AW").state[2]}

TB 350.0 "ST_AY_2_" {("AY","*-AY+*","AY+*","*-AY").state[2]}

TB 350.0 "ST_B_2_" {("B","*-B+*","B+*","*-B").state[2]}

TB 350.0 "ST_CH_2_" {("CH","*-CH+*","CH+*","*-CH").state[2]}

TB 350.0 "ST_D_2_" {("D","*-D+*","D+*","*-D").state[2]}

TB 350.0 "ST_EH_2_" {("EH","*-EH+*","EH+*","*-EH").state[2]}

TB 350.0 "ST_ER_2_" {("ER","*-ER+*","ER+*","*-ER").state[2]}

TB 350.0 "ST_EY_2_" {("EY","*-EY+*","EY+*","*-EY").state[2]}

TB 350.0 "ST_F_2_" {("F","*-F+*","F+*","*-F").state[2]}

TB 350.0 "ST_G_2_" {("G","*-G+*","G+*","*-G").state[2]}

TB 350.0 "ST_HH_2_" {("HH","*-HH+*","HH+*","*-HH").state[2]}

TB 350.0 "ST_IH_2_" {("IH","*-IH+*","IH+*","*-IH").state[2]}

TB 350.0 "ST_IY_2_" {("IY","*-IY+*","IY+*","*-IY").state[2]}

TB 350.0 "ST_JH_2_" {("JH","*-JH+*","JH+*","*-JH").state[2]}

TB 350.0 "ST_K_2_" {("K","*-K+*","K+*","*-K").state[2]}

TB 350.0 "ST_L_2_" {("L","*-L+*","L+*","*-L").state[2]}

TB 350.0 "ST_M_2_" {("M","*-M+*","M+*","*-M").state[2]}

TB 350.0 "ST_N_2_" {("N","*-N+*","N+*","*-N").state[2]}

TB 350.0 "ST_OW_2_" {("OW","*-OW+*","OW+*","*-OW").state[2]}

TB 350.0 "ST_P_2_" {("P","*-P+*","P+*","*-P").state[2]}

TB 350.0 "ST_R_2_" {("R","*-R+*","R+*","*-R").state[2]}

TB 350.0 "ST_S_2_" {("S","*-S+*","S+*","*-S").state[2]}

TB 350.0 "ST_sil_2_" {("sil","*-sil+*","sil+*","*-sil").state[2]}

TB 350.0 "ST_sp_2_" {("sp","*-sp+*","sp+*","*-sp").state[2]}

TB 350.0 "ST_T_2_" {("T","*-T+*","T+*","*-T").state[2]}

TB 350.0 "ST_TH_2_" {("TH","*-TH+*","TH+*","*-TH").state[2]}

TB 350.0 "ST_UW_2_" {("UW","*-UW+*","UW+*","*-UW").state[2]}

TB 350.0 "ST_V_2_" {("V","*-V+*","V+*","*-V").state[2]}

TB 350.0 "ST_W_2_" {("W","*-W+*","W+*","*-W").state[2]}

TB 350.0 "ST_Y_2_" {("Y","*-Y+*","Y+*","*-Y").state[2]}

TB 350.0 "ST_Z_2_" {("Z","*-Z+*","Z+*","*-Z").state[2]}

TB 350.0 "ST_AA_3_" {("AA","*-AA+*","AA+*","*-AA").state[3]}

TB 350.0 "ST_AE_3_" {("AE","*-AE+*","AE+*","*-AE").state[3]}

TB 350.0 "ST_AH_3_" {("AH","*-AH+*","AH+*","*-AH").state[3]}

TB 350.0 "ST_AO_3_" {("AO","*-AO+*","AO+*","*-AO").state[3]}

TB 350.0 "ST_AW_3_" {("AW","*-AW+*","AW+*","*-AW").state[3]}

TB 350.0 "ST_AY_3_" {("AY","*-AY+*","AY+*","*-AY").state[3]}

TB 350.0 "ST_B_3_" {("B","*-B+*","B+*","*-B").state[3]}

TB 350.0 "ST_CH_3_" {("CH","*-CH+*","CH+*","*-CH").state[3]}

TB 350.0 "ST_D_3_" {("D","*-D+*","D+*","*-D").state[3]}

TB 350.0 "ST_EH_3_" {("EH","*-EH+*","EH+*","*-EH").state[3]}

TB 350.0 "ST_ER_3_" {("ER","*-ER+*","ER+*","*-ER").state[3]}

TB 350.0 "ST_EY_3_" {("EY","*-EY+*","EY+*","*-EY").state[3]}

TB 350.0 "ST_F_3_" {("F","*-F+*","F+*","*-F").state[3]}

TB 350.0 "ST_G_3_" {("G","*-G+*","G+*","*-G").state[3]}

TB 350.0 "ST_HH_3_" {("HH","*-HH+*","HH+*","*-HH").state[3]}

TB 350.0 "ST_IH_3_" {("IH","*-IH+*","IH+*","*-IH").state[3]}

TB 350.0 "ST_IY_3_" {("IY","*-IY+*","IY+*","*-IY").state[3]}

TB 350.0 "ST_JH_3_" {("JH","*-JH+*","JH+*","*-JH").state[3]}

TB 350.0 "ST_K_3_" {("K","*-K+*","K+*","*-K").state[3]}

TB 350.0 "ST_L_3_" {("L","*-L+*","L+*","*-L").state[3]}

TB 350.0 "ST_M_3_" {("M","*-M+*","M+*","*-M").state[3]}

TB 350.0 "ST_N_3_" {("N","*-N+*","N+*","*-N").state[3]}

TB 350.0 "ST_OW_3_" {("OW","*-OW+*","OW+*","*-OW").state[3]}

TB 350.0 "ST_P_3_" {("P","*-P+*","P+*","*-P").state[3]}

TB 350.0 "ST_R_3_" {("R","*-R+*","R+*","*-R").state[3]}

TB 350.0 "ST_S_3_" {("S","*-S+*","S+*","*-S").state[3]}

TB 350.0 "ST_sil_3_" {("sil","*-sil+*","sil+*","*-sil").state[3]}

TB 350.0 "ST_sp_3_" {("sp","*-sp+*","sp+*","*-sp").state[3]}

TB 350.0 "ST_T_3_" {("T","*-T+*","T+*","*-T").state[3]}

TB 350.0 "ST_TH_3_" {("TH","*-TH+*","TH+*","*-TH").state[3]}

TB 350.0 "ST_UW_3_" {("UW","*-UW+*","UW+*","*-UW").state[3]}

TB 350.0 "ST_V_3_" {("V","*-V+*","V+*","*-V").state[3]}

TB 350.0 "ST_W_3_" {("W","*-W+*","W+*","*-W").state[3]}

TB 350.0 "ST_Y_3_" {("Y","*-Y+*","Y+*","*-Y").state[3]}

TB 350.0 "ST_Z_3_" {("Z","*-Z+*","Z+*","*-Z").state[3]}

TB 350.0 "ST_AA_4_" {("AA","*-AA+*","AA+*","*-AA").state[4]}

TB 350.0 "ST_AE_4_" {("AE","*-AE+*","AE+*","*-AE").state[4]}

TB 350.0 "ST_AH_4_" {("AH","*-AH+*","AH+*","*-AH").state[4]}

TB 350.0 "ST_AO_4_" {("AO","*-AO+*","AO+*","*-AO").state[4]}

TB 350.0 "ST_AW_4_" {("AW","*-AW+*","AW+*","*-AW").state[4]}

TB 350.0 "ST_AY_4_" {("AY","*-AY+*","AY+*","*-AY").state[4]}

TB 350.0 "ST_B_4_" {("B","*-B+*","B+*","*-B").state[4]}

TB 350.0 "ST_CH_4_" {("CH","*-CH+*","CH+*","*-CH").state[4]}

TB 350.0 "ST_D_4_" {("D","*-D+*","D+*","*-D").state[4]}

TB 350.0 "ST_EH_4_" {("EH","*-EH+*","EH+*","*-EH").state[4]}

TB 350.0 "ST_ER_4_" {("ER","*-ER+*","ER+*","*-ER").state[4]}

TB 350.0 "ST_EY_4_" {("EY","*-EY+*","EY+*","*-EY").state[4]}

TB 350.0 "ST_F_4_" {("F","*-F+*","F+*","*-F").state[4]}

TB 350.0 "ST_G_4_" {("G","*-G+*","G+*","*-G").state[4]}

TB 350.0 "ST_HH_4_" {("HH","*-HH+*","HH+*","*-HH").state[4]}

TB 350.0 "ST_IH_4_" {("IH","*-IH+*","IH+*","*-IH").state[4]}

TB 350.0 "ST_IY_4_" {("IY","*-IY+*","IY+*","*-IY").state[4]}

TB 350.0 "ST_JH_4_" {("JH","*-JH+*","JH+*","*-JH").state[4]}

TB 350.0 "ST_K_4_" {("K","*-K+*","K+*","*-K").state[4]}

TB 350.0 "ST_L_4_" {("L","*-L+*","L+*","*-L").state[4]}

TB 350.0 "ST_M_4_" {("M","*-M+*","M+*","*-M").state[4]}

TB 350.0 "ST_N_4_" {("N","*-N+*","N+*","*-N").state[4]}

TB 350.0 "ST_OW_4_" {("OW","*-OW+*","OW+*","*-OW").state[4]}

TB 350.0 "ST_P_4_" {("P","*-P+*","P+*","*-P").state[4]}

TB 350.0 "ST_R_4_" {("R","*-R+*","R+*","*-R").state[4]}

TB 350.0 "ST_S_4_" {("S","*-S+*","S+*","*-S").state[4]}

TB 350.0 "ST_sil_4_" {("sil","*-sil+*","sil+*","*-sil").state[4]}

TB 350.0 "ST_sp_4_" {("sp","*-sp+*","sp+*","*-sp").state[4]}

TB 350.0 "ST_T_4_" {("T","*-T+*","T+*","*-T").state[4]}

TB 350.0 "ST_TH_4_" {("TH","*-TH+*","TH+*","*-TH").state[4]}

TB 350.0 "ST_UW_4_" {("UW","*-UW+*","UW+*","*-UW").state[4]}

TB 350.0 "ST_V_4_" {("V","*-V+*","V+*","*-V").state[4]}

TB 350.0 "ST_W_4_" {("W","*-W+*","W+*","*-W").state[4]}

TB 350.0 "ST_Y_4_" {("Y","*-Y+*","Y+*","*-Y").state[4]}

TB 350.0 "ST_Z_4_" {("Z","*-Z+*","Z+*","*-Z").state[4]}

TR 2

AU "fulllist"

CO "tiedlist"

ST "trees"
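The TB block of tree.hed is highly repetitive: one line per phone and emitting state, all with the same threshold. Rather than typing it by hand, it can be generated. The Python sketch below reproduces the line format used above; the function name is made up for illustration.

```python
def tb_lines(phones, states=(2, 3, 4), threshold=350.0):
    """Return the TB clustering commands for the given phones and states."""
    lines = []
    for state in states:
        for phone in phones:
            # Monophone plus its triphone, left-biphone and right-biphone forms
            items = f'("{phone}","*-{phone}+*","{phone}+*","*-{phone}")'
            lines.append(
                f'TB {threshold} "ST_{phone}_{state}_" '
                f"{{{items}.state[{state}]}}"
            )
    return lines


if __name__ == "__main__":
    print("\n".join(tb_lines(["AA", "AE"])))
```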

8. Author the global.ded script file:

AS sp

RS cmu

MP sil sil sp

TC

9. Create the fulllist of triphones (this file is referenced by tree.hed). Up until now, we have only considered the triphones that occur in the training data. However, the test data may include triphones that have not been encountered yet.

HDMan.exe -b sp -n fulllist -g global.ded -l flog dict-tri dict

Append the following manually to the newly created fulllist:

AY

EY

IY

OW

10. Create hmm13. This version will use the questions and thresholds in the tree.hed file to perform state tying of the triphone models.

HHEd.exe -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones1 > log

11. Modify gauss_split.hed:

MU 4 {*.state[2-4].mix}

12. Modify hmm13 using the updated script to split the Gaussians used in each state.

HHEd.exe -H hmm13/macros -H hmm13/hmmdefs -M hmm13 gauss_split.hed tiedlist

13. Retrain hmm13.

HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm13/macros -H hmm13/hmmdefs -M hmm13 tiedlist

HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm13/macros -H hmm13/hmmdefs -M hmm13 tiedlist

14. Modify gauss_split.hed:

MU 8 {*.state[2-4].mix}

15. Make hmm14 using the updated script to split the Gaussians used in each state.

HHEd.exe -H hmm13/macros -H hmm13/hmmdefs -M hmm14 gauss_split.hed tiedlist

16. Retrain hmm14.

HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm14/macros -H hmm14/hmmdefs -M hmm14 tiedlist

HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm14/macros -H hmm14/hmmdefs -M hmm14 tiedlist

17. Create hmm15, the final model used for recognition (a final re-estimation with the tied models and 8 Gaussians).

HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm14/macros -H hmm14/hmmdefs -M hmm15 tiedlist

Running the HTK Word Recognizer

1. The test data must be imported and coded to MFCC in the same way as the training data. Run the following commands to prepare the data.

mkdir an4_test_audio an4_test_mfcc

find /an4/wav/an4test_clstk -name "*.sph" | xargs -i cp {} an4_test_audio

find /htktut/an4_test_audio > pre_codetest.scp

./make_codetest.pl > codetest.scp

HCopy.exe -T 1 -C config_hcopy -S codetest.scp

2. Create test word MLF and phone-level MLF

cp /an4/etc/an4_test.transcription /htktut

./an4_test_mlf_maker.pl > testwords.mlf

HLEd.exe -l '*' -d dict -i testphones.mlf mkphones0.led testwords.mlf

3. Run the recognizer, HVite, against hmm15. The output file recout.mlf will be a word-level MLF containing the hypotheses of the recognizer.

HVite.exe -H hmm15/macros -H hmm15/hmmdefs -S test.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict tiedlist

4. View the result data statistics by comparing the resulting MLF to the transcription MLF.

HResults -I testwords.mlf tiedlist recout.mlf
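HResults reports word-level figures derived from the number of reference words N and the counts of deletions D, substitutions S and insertions I found by aligning each hypothesis with its reference. Percent correct is (N - D - S) / N and accuracy is (N - D - S - I) / N, both expressed as percentages. The Python sketch below computes the two figures from such counts; the function name and the sample counts are made up.

```python
def word_scores(n_ref, deletions, substitutions, insertions):
    """Return (percent_correct, accuracy) from word alignment counts."""
    hits = n_ref - deletions - substitutions
    percent_correct = 100.0 * hits / n_ref
    # Accuracy also penalises inserted words, so it can fall below %Correct
    accuracy = 100.0 * (hits - insertions) / n_ref
    return percent_correct, accuracy


if __name__ == "__main__":
    corr, acc = word_scores(n_ref=100, deletions=2, substitutions=3, insertions=5)
    print(f"%Corr={corr:.2f} Acc={acc:.2f}")
```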