Training and Decoding the AN4 Corpus Using HTK
Introduction
This tutorial walks step by step through creating and training an acoustic model in HTK. Refer to section 2 of the HTKBook for the original tutorial and to section 4 for details on each of the provided programs.
Conventions used in this tutorial:
- Italics are used to denote commands to run. For most commands, you can add “-T 0001” to the command to output detailed trace information (useful for debugging).
- Courier New font denotes text file contents
- Boldface is used to denote important filenames
Environment Setup
- Obtain the latest HTK 3.0 baseline from Note you may need to register to download the software
- Unzip and untar the release in your Cygwin root directory: /htk
- Configure, make, and install HTK (see the included README files for detailed instructions)
Data Preparation
- Create a directory called /htktut and copy into it the helper Perl scripts that will be used to process AN4-specific formats.
- Obtain the AN4 corpus data from: Make sure to obtain the version with the files in NIST Sphere format.
- Extract the corpus to a directory called /an4.
- cd /htktut
- Create a file containing the allowable grammar for each word and utterance in the AN4 corpus. Author a file called gram with the following contents:
$words = A | AND | APOSTROPHE | APRIL |
AREA | AUGUST | B | C | CODE |
D | DECEMBER | E | EIGHT | EIGHTEEN |
EIGHTEENTH | EIGHTH | EIGHTY | ELEVEN | ELEVENTH |
F | FEBRUARY | FIFTEEN |
FIFTEENTH | FIFTH | FIFTY | FIRST | FIVE |
FORTY | FOUR | FOURTEEN | FOURTH | G |
GO | H | HALF | HUNDRED |
I | J | JANUARY | JULY | JUNE |
K | L | M | MARCH | MAY |
N | NINE | NINETEEN | NINETY | NINTH | NOVEMBER |
O | OCTOBER | OF | OH | ONE | P | Q | R |
S | SECOND | SEPTEMBER |
SEVEN | SEVENTEEN | SEVENTH | SEVENTY |
SIX | SIXTEEN | SIXTEENTH | SIXTH | SIXTY | T | TEN | THIRD |
THIRTEEN | THIRTIETH | THIRTY | THOUSAND | THREE |
TWELFTH | TWELVE | TWENTIETH | TWENTY | TWO |
U | V | W | X | Y | Z | ZERO;
$singleCmd = GO | YES | NO | REPEAT | STOP | ERASE | HELP;
$startCmd = RUBOUT | ENTER;
( silence ( $singleCmd | $startCmd <$words> | <$words> ) silence )
- Use the program HParse to create a word network (wdnet) that graphs the words and their allowed transitions according to the grammar just created:
HParse.exe gram wdnet
- Create the pronunciation dictionary dict for the AN4 corpus. Copy an4.dict from the corpus into /htktut, then run:
./dict_clean.pl > dict
- Create the training word MLF (Master Label File). This file contains the transcriptions of all the training data. Copy an4_train.transcription from the corpus into /htktut, then run:
./an4_mlf_maker.pl > words.mlf
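The helper script an4_mlf_maker.pl is not reproduced in this tutorial. As a rough illustration of the conversion it has to perform, the sketch below turns transcription lines into the HTK MLF layout (a #!MLF!# header, a quoted label-file pattern per utterance, one word per line, and a terminating period). It assumes each input line has the form `<s> WORD ... </s> (utterance-id)`; that input format, the function name transcription_to_mlf, and the sample utterance are assumptions for illustration, not taken from the script itself.

```shell
#!/bin/sh
# Sketch only: convert "<s> WORD ... </s> (utt-id)" lines to a word-level MLF.
# The input format is an assumption about the AN4 transcription files.
transcription_to_mlf() {
    echo '#!MLF!#'
    awk 'NF {
        id = $NF                      # last field is "(utt-id)"
        gsub(/[()]/, "", id)          # strip the parentheses
        printf "\"*/%s.lab\"\n", id   # label-file pattern for this utterance
        for (i = 1; i < NF; i++)      # every field before the id is a word
            if ($i != "<s>" && $i != "</s>") print $i
        print "."                     # end-of-utterance marker
    }' "$@"
}

# Hypothetical sample line, for demonstration only.
transcription_to_mlf <<'EOF'
<s> ENTER SIX TWO </s> (utt001)
EOF
```

Called with a file name instead of standard input (for example `transcription_to_mlf an4_train.transcription`), the function writes the MLF to standard output in the same way.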
- Author the following HLEd script file, mkphones0.led:
EX
IS sil sil
DE sp
- Run HLEd to create the phone-level MLF, phones0.mlf:
HLEd.exe -l '*' -d dict -i phones0.mlf mkphones0.led words.mlf
- Code the training data to MFCC with the following commands (replace [train] with the name of the corpus's training-audio directory):
mkdir an4_train_audio an4_train_mfcc
find /an4/wav/[train] -name "*.sph" | xargs -i cp {} an4_train_audio
find /htktut/an4_train_audio > pre_codetr.scp
./make_codetr.pl > codetr.scp
- Author the file config_hcopy:
# Coding parameters
SOURCEKIND = WAVEFORM
SOURCEFORMAT = NIST
#SOURCERATE = 625
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
- Run the HCopy program to convert the waveform data to MFCC:
HCopy.exe -T 1 -C config_hcopy -S codetr.scp
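The numeric settings in config_hcopy are easier to read once converted to familiar units. HTK expresses times in units of 100 ns, so the short sketch below works out the frame shift and window length in milliseconds and the resulting feature vector size for MFCC_0_D_A. The variable names are illustrative only.

```shell
#!/bin/sh
# Values copied from config_hcopy. HTK time units are 100 ns,
# so 10000 units correspond to 1 ms.
TARGETRATE=100000
WINDOWSIZE=250000
NUMCEPS=12

frame_shift_ms=$((TARGETRATE / 10000))
window_ms=$((WINDOWSIZE / 10000))
# MFCC_0_D_A: NUMCEPS cepstra plus C0, each with delta and acceleration terms.
vec_size=$(( (NUMCEPS + 1) * 3 ))

echo "frame shift:   ${frame_shift_ms} ms"
echo "window length: ${window_ms} ms"
echo "vector size:   ${vec_size}"
```

The vector size of 39 is the value that appears as VecSize in the HMM definitions later in the tutorial.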
Create Monophone Context Independent Models:
1. Create the hmm directories. Each one will store one Baum-Welch re-estimated version of the model.
mkdir hmm0 hmm1 hmm2 hmm3 hmm4 hmm5 hmm6 hmm7 hmm8 hmm9 hmm10 hmm11 hmm12 hmm13 hmm14 hmm15
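As an alternative to typing all sixteen directory names, a small loop can create hmm0 through hmm15. This is only a convenience sketch; the upper bound of 15 matches the final model directory used in this tutorial.

```shell
#!/bin/sh
# Create hmm0 ... hmm15 in the current directory.
# mkdir -p does not complain if a directory already exists.
i=0
while [ "$i" -le 15 ]; do
    mkdir -p "hmm$i"
    i=$((i + 1))
done
```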
2. Author proto. This is the initial model architecture for each phone HMM: three emitting states plus the non-emitting entry and exit states, which is why <NumStates> is 5.
~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
<NumStates> 5
<State> 2
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 4
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<TransP> 5
0.0 1.0 0.0 0.0 0.0
0.0 0.6 0.4 0.0 0.0
0.0 0.0 0.6 0.4 0.0
0.0 0.0 0.0 0.7 0.3
0.0 0.0 0.0 0.0 0.0
<EndHMM>
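Typing 39 zeros and ones by hand is error-prone, so the prototype can also be generated. The sketch below prints the same 26-line definition shown above; the function names repeat and make_proto are illustrative, and the transition values are copied from the listing.

```shell
#!/bin/sh
# Print VALUE repeated COUNT times on one line, separated by spaces.
repeat() {  # usage: repeat VALUE COUNT
    i=0
    line=""
    while [ "$i" -lt "$2" ]; do
        line="$line $1"
        i=$((i + 1))
    done
    echo "${line# }"
}

# Print a 5-state (3 emitting) prototype HMM for the given vector size.
make_proto() {  # usage: make_proto VECSIZE
    n=$1
    echo "~o <VecSize> $n <MFCC_0_D_A>"
    echo '~h "proto"'
    echo "<BeginHMM>"
    echo "<NumStates> 5"
    for s in 2 3 4; do
        echo "<State> $s"
        echo "<Mean> $n"
        repeat 0.0 "$n"
        echo "<Variance> $n"
        repeat 1.0 "$n"
    done
    echo "<TransP> 5"
    echo "0.0 1.0 0.0 0.0 0.0"
    echo "0.0 0.6 0.4 0.0 0.0"
    echo "0.0 0.0 0.6 0.4 0.0"
    echo "0.0 0.0 0.0 0.7 0.3"
    echo "0.0 0.0 0.0 0.0 0.0"
    echo "<EndHMM>"
}

make_proto 39 > proto
```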
3. Author config.
# Coding parameters
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
4. Use HCompV to create the first template HMM-set: hmm0
HCompV.exe -C config -f 0.01 -m -S train.scp -M hmm0 proto
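The HCompV command reads train.scp, which none of the earlier steps creates explicitly. One plausible way to derive it, assuming codetr.scp holds one `source target` pair per line as HCopy expects, is to keep only the target column. The function name scp_targets is illustrative.

```shell
#!/bin/sh
# Print the second whitespace-separated field (the coded target file)
# of every line that has at least two fields.
scp_targets() {  # usage: scp_targets [FILE...]   (reads stdin if no FILE)
    awk 'NF >= 2 { print $2 }' "$@"
}

# Build train.scp only when codetr.scp is present in the current directory.
if [ -f codetr.scp ]; then
    scp_targets codetr.scp > train.scp
fi
```

The same approach applied to codetest.scp would give the test.scp file that the recognition step uses later.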
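5. Author hmm0/macros. This file contains common information needed for model re-estimation: the global options line and the variance floor macro (the ~v varFloor1 block that HCompV -f writes to hmm0/vFloors):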
~o <MFCC_0_D_A> <VecSize> 39
~v varFloor1
<Variance> 39
7.266113e-01 4.722334e-01 9.523441e-01 6.613073e-01 7.113036e-01 5.931423e-01 6.048728e-01 6.435860e-01 4.942584e-01 4.086919e-01 4.240367e-01 2.945997e-01 1.170168e+00 2.929090e-02 2.164918e-02 2.933550e-02 2.839202e-02 3.198905e-02 2.880209e-02 3.227245e-02 3.238301e-02 2.820307e-02 2.370699e-02 2.331906e-02 1.896634e-02 3.761791e-02 4.119497e-03 3.521117e-03 4.239024e-03 4.587236e-03 5.152307e-03 4.821633e-03 5.481833e-03 5.578703e-03 4.909256e-03 4.267938e-03 4.158779e-03 3.423044e-03 5.544049e-03
6. Author the hmm0/hmmdefs Master Macro File (MMF). This file contains the definitions of all HMMs (for now, one for each phone).
cp /an4/etc/an4.phone /htktut
./make_mmf.pl > hmm0/hmmdefs
cp an4.phone monophones0
Add the phone sp to a copy of monophones0 and save it as a new file, monophones1. Leave monophones0 itself unchanged.
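The sp entry can be added with a one-line helper instead of editing by hand. The sketch below copies a phone list, drops any existing copy of the phone so it is not listed twice, and appends it at the end. The function name add_phone is illustrative.

```shell
#!/bin/sh
# Write DEST as SRC plus PHONE on a final line, without duplicating PHONE.
add_phone() {  # usage: add_phone PHONE SRC DEST
    {
        grep -v -x -F "$1" "$2"   # all lines except an exact match of PHONE
        echo "$1"
    } > "$3"
}

# Create monophones1 only when monophones0 is present.
if [ -f monophones0 ]; then
    add_phone sp monophones0 monophones1
fi
```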
7. Perform the first re-estimation to create hmm1:
HERest.exe -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones0
8. Create hmm2 and hmm3 (performing more re-estimation and refinement on the CI models).
HERest.exe -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm1/macros -H hmm1/hmmdefs -M hmm2 monophones0
HERest.exe -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm2/macros -H hmm2/hmmdefs -M hmm3 monophones0
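The three re-estimation passes so far differ only in their source and destination directories, so they can be produced by a loop. The sketch below only prints the commands (a dry run), which makes it safe to inspect before use; the function name herest_cmd is illustrative, and the options are copied from the commands above.

```shell
#!/bin/sh
# Print one HERest invocation that re-estimates hmm<SRC> into hmm<DST>.
herest_cmd() {  # usage: herest_cmd SRC DST MLF HMMLIST
    echo "HERest.exe -C config -I $3 -t 250.0 150.0 1000.0 -S train.scp" \
         "-H hmm$1/macros -H hmm$1/hmmdefs -M hmm$2 $4"
}

# hmm0 -> hmm1 -> hmm2 -> hmm3
for src in 0 1 2; do
    herest_cmd "$src" "$((src + 1))" phones0.mlf monophones0
done
```

Piping the output of the loop to `sh` would run the passes in order; each pass depends on the files written by the previous one.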
9. Author hmm4/hmmdefs by appending the following sp model to a copy of hmm3/hmmdefs:
~h "sp"
<BEGINHMM>
<NUMSTATES> 3
<STATE> 2
<MEAN> 39
-9.510303e+00 -1.681556e+00 -2.859189e+00 -3.372297e+00 -3.174411e+00 -4.109135e+00 -6.625902e+00 -3.097823e+00 -3.832037e+00 -1.941238e+00 -2.021214e+00 -9.669427e-01 4.440228e+01 -6.433553e-03 1.031006e-01 -5.078135e-02 -1.677521e-02 -6.127201e-03 -1.733933e-02 -7.105561e-02 -7.521260e-02 5.238789e-03 3.878872e-02 4.696222e-03 -4.062191e-02 -1.051957e-01 -9.008758e-03 -6.651144e-03 2.551728e-02 3.584296e-03 8.466722e-03 1.126705e-02 2.028633e-02 1.525841e-02 8.087316e-04 -1.189610e-03 -1.932883e-03 1.927950e-03 3.416471e-02
<VARIANCE> 39
5.016994e+00 1.055978e+01 9.199114e+00 1.149608e+01 1.264298e+01 1.406597e+01 1.837638e+01 1.957598e+01 1.790414e+01 1.914055e+01 1.634619e+01 1.472672e+01 1.478365e+01 1.714879e-01 3.757185e-01 5.289578e-01 6.911641e-01 8.839969e-01 1.038755e+00 1.234214e+00 1.296763e+00 1.341149e+00 1.388875e+00 1.319310e+00 1.238895e+00 1.258024e-01 3.459969e-02 6.561287e-02 1.027706e-01 1.376464e-01 1.742430e-01 2.065688e-01 2.438512e-01 2.532034e-01 2.670280e-01 2.767306e-01 2.619496e-01 2.470811e-01 1.865003e-02
<GCONST> 7.528535e+01
<TRANSP> 3
0.000000e+00 1.000000e+00 0.000000e+00
0.000000e+00 9.335565e-01 6.644349e-02
0.000000e+00 0.000000e+00 0.000000e+00
<ENDHMM>
cp hmm3/macros hmm4/macros
10. Author the HHEd script file sil.hed:
AT 2 4 0.2 {sil.transP}
AT 4 2 0.2 {sil.transP}
AT 1 3 0.3 {sp.transP}
TI silst {sil.state[3],sp.state[2]}
11. Create hmm5. The edit script adds extra transitions to the sil model and ties the emitting state of sp to the centre state of sil, so that short pauses between words are modelled.
HHEd.exe -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophones1
12. Create hmm6 and hmm7 (re-estimate with the new silence models)
HERest.exe -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm5/macros -H hmm5/hmmdefs -M hmm6 monophones1
HERest.exe -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm6/macros -H hmm6/hmmdefs -M hmm7 monophones1
13. Append the following entry to dict:
silence sil
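14. Create the aligned MLF, aligned.mlf. HVite force-aligns the training data against the word transcriptions, so where a word has several pronunciations in dict, the resulting phone-level MLF contains the one that best matches the audio.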
HVite.exe -l '*' -o SWT -b silence -C config -a -H hmm7/macros -H hmm7/hmmdefs -i aligned.mlf -m -t 250.0 -y lab -I words.mlf -S train.scp dict monophones1
15. Create hmm8 and hmm9 using the aligned version of the phone MLF.
HERest.exe -C config -I aligned.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm7/macros -H hmm7/hmmdefs -M hmm8 monophones1
HERest.exe -C config -I aligned.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm8/macros -H hmm8/hmmdefs -M hmm9 monophones1
16. Author gauss_split.hed:
MU 2 {*.state[2-4].mix}
17. Modify hmm9 using the new script to split the Gaussians used in each state.
HHEd.exe -H hmm9/macros -H hmm9/hmmdefs -M hmm9 gauss_split.hed monophones1
18. Retrain hmm9 using the aligned version of the phone MLF.
HERest.exe -C config -I aligned.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm9/macros -H hmm9/hmmdefs -M hmm9 monophones1
HERest.exe -C config -I aligned.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm9/macros -H hmm9/hmmdefs -M hmm9 monophones1
Create Tied-state, Context Dependent Triphone HMM’s
1. Author the mktri.led HLEd script file:
WB sp
WB sil
TC
2. Create the triphone MLF (wintri.mlf) and the triphone list (triphones1). Each triphone is a model of a phone that is context-dependent on the phones immediately before and after it.
HLEd.exe -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
3. Create the mktri.hed HHEd script file. It is generated by the script maketrihed:
maketrihed monophones1 triphones1
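The maketrihed script itself is not listed in this tutorial. As described in the HTKBook tutorial, the generated mktri.hed starts with a CL command naming the triphone list and is followed by one TI command per phone that ties the transition matrices of all its triphone variants. The sketch below produces a file of that shape; the function name make_mktri_hed is illustrative and the real script may differ in detail.

```shell
#!/bin/sh
# Print an HHEd script that clones monophones into triphones (CL) and
# ties each phone's transition matrices across its triphones (TI).
make_mktri_hed() {  # usage: make_mktri_hed TRIPHONE_LIST < monophone_list
    echo "CL $1"
    while read -r p; do
        [ -n "$p" ] || continue   # skip blank lines
        echo "TI T_$p {(*-$p+*,$p+*,*-$p).transP}"
    done
}

# Two-phone demonstration; in practice the input would be monophones1.
printf 'AA\nB\n' | make_mktri_hed triphones1
```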
4. Create hmm10. In this version each monophone model is cloned into its triphone variants.
HHEd.exe -B -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed monophones1
5. Create hmm11 (re-estimation using the new triphone models)
HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm10/macros -H hmm10/hmmdefs -M hmm11 triphones1
6. Create hmm12. Make sure to include the -s flag to generate the stats file:
HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -s stats -S train.scp -H hmm11/macros -H hmm11/hmmdefs -M hmm12 triphones1
7. Author tree.hed file:
RO 100.0 stats
TR 0
QS "R_NONBOUNDARY" { *+* }
QS "R_SILENCE" { *+sil }
QS "R_STOP" { *+P,*+PD,*+B,*+T,*+TD,*+D,*+DD,*+K,*+KD,*+G }
QS "R_NASAL" { *+M,*+N,*+EN,*+NG }
QS "R_FRICATIVE" { *+S,*+SH,*+Z,*+F,*+V,*+CH,*+JH,*+TH,*+DH }
QS "R_LIQUID" { *+L,*+EL,*+R,*+W,*+Y,*+HH }
QS "R_VOWEL" { *+EH,*+IH,*+AO,*+AA,*+UW,*+AH,*+AX,*+ER,*+AY,*+OY,*+EY,*+IY,*+OW }
QS "R_C-FRONT" { *+P,*+PD,*+B,*+M,*+F,*+V,*+W }
QS "R_C-CENTRAL" { *+T,*+TD,*+D,*+DD,*+EN,*+N,*+S,*+Z,*+SH,*+TH,*+DH,*+L,*+EL,*+R }
QS "R_C-BACK" { *+SH,*+CH,*+JH,*+Y,*+K,*+KD,*+G,*+NG,*+HH }
QS "R_V-FRONT" { *+IY,*+IH,*+EH }
QS "R_V-CENTRAL" { *+EH,*+AA,*+ER,*+AO }
QS "R_V-BACK" { *+UW,*+AA,*+AX,*+UH }
QS "R_FRONT" { *+P,*+PD,*+B,*+M,*+F,*+V,*+W,*+IY,*+IH,*+EH }
QS "R_CENTRAL" { *+T,*+TD,*+D,*+DD,*+EN,*+N,*+S,*+Z,*+SH,*+TH,*+DH,*+L,*+EL,*+R,*+EH,*+AA,*+ER,*+AO }
QS "R_BACK" { *+SH,*+CH,*+JH,*+Y,*+K,*+KD,*+G,*+NG,*+HH,*+AA,*+UW,*+AX,*+UH }
QS "R_FORTIS" { *+P,*+PD,*+T,*+TD,*+K,*+KD,*+F,*+TH,*+S,*+SH,*+CH }
QS "R_LENIS" { *+B,*+D,*+DD,*+G,*+V,*+DH,*+Z,*+SH,*+JH }
QS "R_UNFORTLENIS" { *+M,*+N,*+EN,*+NG,*+HH,*+L,*+EL,*+R,*+Y,*+W }
QS "R_CORONAL" { *+T,*+TD,*+D,*+DD,*+N,*+EN,*+TH,*+DH,*+S,*+Z,*+SH,*+CH,*+JH,*+L,*+EL,*+R }
QS "R_NONCORONAL" { *+P,*+PD,*+B,*+M,*+K,*+KD,*+G,*+NG,*+F,*+V,*+HH,*+Y,*+W }
QS "R_ANTERIOR" { *+P,*+PD,*+B,*+M,*+T,*+TD,*+D,*+DD,*+N,*+EN,*+F,*+V,*+TH,*+DH,*+S,*+Z,*+L,*+EL,*+W }
QS "R_NONANTERIOR" { *+K,*+KD,*+G,*+NG,*+SH,*+HH,*+CH,*+JH,*+R,*+Y }
QS "R_CONTINUENT" { *+M,*+N,*+EN,*+NG,*+F,*+V,*+TH,*+DH,*+S,*+Z,*+SH,*+HH,*+L,*+EL,*+R,*+Y,*+W }
QS "R_NONCONTINUENT" { *+P,*+PD,*+B,*+T,*+TD,*+D,*+DD,*+K,*+KD,*+G,*+CH,*+JH }
QS "R_STRIDENT" { *+S,*+Z,*+SH,*+CH,*+JH }
QS "R_NONSTRIDENT" { *+F,*+V,*+TH,*+DH,*+HH }
QS "R_UNSTRIDENT" { *+P,*+PD,*+B,*+M,*+T,*+TD,*+D,*+DD,*+N,*+EN,*+K,*+KD,*+G,*+NG,*+L,*+EL,*+R,*+Y,*+W }
QS "R_GLIDE" { *+HH,*+L,*+EL,*+R,*+Y,*+W }
QS "R_SYLLABIC" { *+EN,*+M,*+L,*+EL,*+ER }
QS "R_UNVOICED-CONS" { *+P,*+PD,*+T,*+TD,*+K,*+KD,*+S,*+SH,*+F,*+TH,*+HH,*+CH }
QS "R_VOICED-CONS" { *+JH,*+B,*+D,*+DD,*+DH,*+G,*+Y,*+L,*+EL,*+M,*+N,*+EN,*+NG,*+R,*+V,*+W,*+Z}
QS "R_UNVOICED-ALL" { *+P,*+PD,*+T,*+TD,*+K,*+KD,*+S,*+SH,*+F,*+TH,*+HH,*+CH,*+sil }
QS "R_LONG" { *+IY,*+AA,*+OW,*+AO,*+UW,*+EN,*+M,*+L,*+EL }
QS "R_SHORT" { *+EH,*+EY,*+AA,*+IH,*+AY,*+OY,*+AH,*+AX,*+UH }
QS "R_DIPTHONG" { *+EY,*+AY,*+OY,*+AA,*+ER,*+EN,*+M,*+L,*+EL }
QS "R_FRONT-START" { *+EY,*+AA,*+ER }
QS "R_FRONTING" { *+AY,*+EY,*+OY }
QS "R_HIGH" { *+IH,*+UW,*+AA,*+AX,*+IY }
QS "R_MEDIUM" { *+EY,*+ER,*+AA,*+AX,*+EH,*+EN,*+M,*+L,*+EL }
QS "R_LOW" { *+EH,*+AY,*+AA,*+AW,*+AO,*+OY }
QS "R_ROUNDED" { *+AO,*+UW,*+AA,*+AX,*+OY,*+W }
QS "R_UNROUNDED" { *+EH,*+IH,*+AA,*+ER,*+AY,*+EY,*+IY,*+AW,*+AH,*+AX,*+EN,*+M,*+HH,*+L,*+EL,*+R,*+Y }
QS "R_NONAFFRICATE" { *+S,*+SH,*+Z,*+F,*+V,*+TH,*+DH }
QS "R_AFFRICATE" { *+CH,*+JH }
QS "R_IVOWEL" { *+IH,*+IY }
QS "R_EVOWEL" { *+EH,*+EY }
QS "R_AVOWEL" { *+EH,*+AA,*+ER,*+AY,*+AW }
QS "R_OVOWEL" { *+AO,*+OY,*+AA }
QS "R_UVOWEL" { *+AA,*+AX,*+EN,*+M,*+L,*+EL,*+UW }
QS "R_VOICED-STOP" { *+B,*+D,*+DD,*+G }
QS "R_UNVOICED-STOP" { *+P,*+PD,*+T,*+TD,*+K,*+KD }
QS "R_FRONT-STOP" { *+P,*+PD,*+B }
QS "R_CENTRAL-STOP" { *+T,*+TD,*+D,*+DD }
QS "R_BACK-STOP" { *+K,*+KD,*+G }
QS "R_VOICED-FRIC" { *+Z,*+SH,*+DH,*+CH,*+V }
QS "R_UNVOICED-FRIC" { *+S,*+SH,*+TH,*+F,*+CH }
QS "R_FRONT-FRIC" { *+F,*+V }
QS "R_CENTRAL-FRIC" { *+S,*+Z,*+TH,*+DH }
QS "R_BACK-FRIC" { *+SH,*+CH,*+JH }
QS "R_AA" { *+AA }
QS "R_AE" { *+AE }
QS "R_AH" { *+AH }
QS "R_AO" { *+AO }
QS "R_AW" { *+AW }
QS "R_AX" { *+AX }
QS "R_AY" { *+AY }
QS "R_B" { *+B }
QS "R_CH" { *+CH }
QS "R_D" { *+D }
QS "R_DD" { *+DD }
QS "R_DH" { *+DH }
QS "R_DX" { *+DX }
QS "R_EH" { *+EH }
QS "R_EL" { *+EL }
QS "R_EN" { *+EN }
QS "R_ER" { *+ER }
QS "R_EY" { *+EY }
QS "R_F" { *+F }
QS "R_G" { *+G }
QS "R_HH" { *+HH }
QS "R_IH" { *+IH }
QS "R_IY" { *+IY }
QS "R_JH" { *+JH }
QS "R_K" { *+K }
QS "R_KD" { *+KD }
QS "R_L" { *+L }
QS "R_M" { *+M }
QS "R_N" { *+N }
QS "R_NG" { *+NG }
QS "R_OW" { *+OW }
QS "R_OY" { *+OY }
QS "R_P" { *+P }
QS "R_PD" { *+PD }
QS "R_R" { *+R }
QS "R_S" { *+S }
QS "R_SH" { *+SH }
QS "R_T" { *+T }
QS "R_TD" { *+TD }
QS "R_TH" { *+TH }
QS "R_TS" { *+TS }
QS "R_UH" { *+UH }
QS "R_UW" { *+UW }
QS "R_V" { *+V }
QS "R_W" { *+W }
QS "R_Y" { *+Y }
QS "R_Z" { *+Z }
QS "L_NONBOUNDARY" { *-* }
QS "L_SILENCE" { sil-* }
QS "L_STOP" { P-*,PD-*,B-*,T-*,TD-*,D-*,DD-*,K-*,KD-*,G-* }
QS "L_NASAL" { M-*,N-*,EN-*,NG-* }
QS "L_FRICATIVE" { S-*,SH-*,Z-*,F-*,V-*,CH-*,JH-*,TH-*,DH-* }
QS "L_LIQUID" { L-*,EL-*,R-*,W-*,Y-*,HH-* }
QS "L_VOWEL" { EH-*,IH-*,AO-*,AA-*,UW-*,AH-*,AX-*,ER-*,AY-*,OY-*,EY-*,IY-*,OW-* }
QS "L_C-FRONT" { P-*,PD-*,B-*,M-*,F-*,V-*,W-* }
QS "L_C-CENTRAL" { T-*,TD-*,D-*,DD-*,EN-*,N-*,S-*,Z-*,SH-*,TH-*,DH-*,L-*,EL-*,R-* }
QS "L_C-BACK" { SH-*,CH-*,JH-*,Y-*,K-*,KD-*,G-*,NG-*,HH-* }
QS "L_V-FRONT" { IY-*,IH-*,EH-* }
QS "L_V-CENTRAL" { EH-*,AA-*,ER-*,AO-* }
QS "L_V-BACK" { UW-*,AA-*,AX-*,UH-* }
QS "L_FRONT" { P-*,PD-*,B-*,M-*,F-*,V-*,W-*,IY-*,IH-*,EH-* }
QS "L_CENTRAL" { T-*,TD-*,D-*,DD-*,EN-*,N-*,S-*,Z-*,SH-*,TH-*,DH-*,L-*,EL-*,R-*,EH-*,AA-*,ER-*,AO-* }
QS "L_BACK" { SH-*,CH-*,JH-*,Y-*,K-*,KD-*,G-*,NG-*,HH-*,AA-*,UW-*,AX-*,UH-* }
QS "L_FORTIS" { P-*,PD-*,T-*,TD-*,K-*,KD-*,F-*,TH-*,S-*,SH-*,CH-* }
QS "L_LENIS" { B-*,D-*,DD-*,G-*,V-*,DH-*,Z-*,SH-*,JH-* }
QS "L_UNFORTLENIS" { M-*,N-*,EN-*,NG-*,HH-*,L-*,EL-*,R-*,Y-*,W-* }
QS "L_CORONAL" { T-*,TD-*,D-*,DD-*,N-*,EN-*,TH-*,DH-*,S-*,Z-*,SH-*,CH-*,JH-*,L-*,EL-*,R-* }
QS "L_NONCORONAL" { P-*,PD-*,B-*,M-*,K-*,KD-*,G-*,NG-*,F-*,V-*,HH-*,Y-*,W-* }
QS "L_ANTERIOR" { P-*,PD-*,B-*,M-*,T-*,TD-*,D-*,DD-*,N-*,EN-*,F-*,V-*,TH-*,DH-*,S-*,Z-*,L-*,EL-*,W-* }
QS "L_NONANTERIOR" { K-*,KD-*,G-*,NG-*,SH-*,HH-*,CH-*,JH-*,R-*,Y-* }
QS "L_CONTINUENT" { M-*,N-*,EN-*,NG-*,F-*,V-*,TH-*,DH-*,S-*,Z-*,SH-*,HH-*,L-*,EL-*,R-*,Y-*,W-* }
QS "L_NONCONTINUENT" { P-*,PD-*,B-*,T-*,TD-*,D-*,DD-*,K-*,KD-*,G-*,CH-*,JH-* }
QS "L_STRIDENT" { S-*,Z-*,SH-*,CH-*,JH-* }
QS "L_NONSTRIDENT" { F-*,V-*,TH-*,DH-*,HH-* }
QS "L_UNSTRIDENT" { P-*,PD-*,B-*,M-*,T-*,TD-*,D-*,DD-*,N-*,EN-*,K-*,KD-*,G-*,NG-*,L-*,EL-*,R-*,Y-*,W-* }
QS "L_GLIDE" { HH-*,L-*,EL-*,R-*,Y-*,W-* }
QS "L_SYLLABIC" { EN-*,M-*,L-*,EL-*,ER-* }
QS "L_UNVOICED-CONS" { P-*,PD-*,T-*,TD-*,K-*,KD-*,S-*,SH-*,F-*,TH-*,HH-*,CH-* }
QS "L_VOICED-CONS" { JH-*,B-*,D-*,DD-*,DH-*,G-*,Y-*,L-*,EL-*,M-*,N-*,EN-*,NG-*,R-*,V-*,W-*,Z-*}
QS "L_UNVOICED-ALL" { P-*,PD-*,T-*,TD-*,K-*,KD-*,S-*,SH-*,F-*,TH-*,HH-*,CH-*,sil-* }
QS "L_LONG" { IY-*,AA-*,OW-*,AO-*,UW-*,EN-*,M-*,L-*,EL-* }
QS "L_SHORT" { EH-*,EY-*,AA-*,IH-*,AY-*,OY-*,AH-*,AX-*,UH-* }
QS "L_DIPTHONG" { EY-*,AY-*,OY-*,AA-*,ER-*,EN-*,M-*,L-*,EL-* }
QS "L_FRONT-START" { EY-*,AA-*,ER-* }
QS "L_FRONTING" { AY-*,EY-*,OY-* }
QS "L_HIGH" { IH-*,UW-*,AA-*,AX-*,IY-* }
QS "L_MEDIUM" { EY-*,ER-*,AA-*,AX-*,EH-*,EN-*,M-*,L-*,EL-* }
QS "L_LOW" { EH-*,AY-*,AA-*,AW-*,AO-*,OY-* }
QS "L_ROUNDED" { AO-*,UW-*,AA-*,AX-*,OY-*,W-* }
QS "L_UNROUNDED" { EH-*,IH-*,AA-*,ER-*,AY-*,EY-*,IY-*,AW-*,AH-*,AX-*,EN-*,M-*,HH-*,L-*,EL-*,R-*,Y-* }
QS "L_NONAFFRICATE" { S-*,SH-*,Z-*,F-*,V-*,TH-*,DH-* }
QS "L_AFFRICATE" { CH-*,JH-* }
QS "L_IVOWEL" { IH-*,IY-* }
QS "L_EVOWEL" { EH-*,EY-* }
QS "L_AVOWEL" { EH-*,AA-*,ER-*,AY-*,AW-* }
QS "L_OVOWEL" { AO-*,OY-*,AA-* }
QS "L_UVOWEL" { AA-*,AX-*,EN-*,M-*,L-*,EL-*,UW-* }
QS "L_VOICED-STOP" { B-*,D-*,DD-*,G-* }
QS "L_UNVOICED-STOP" { P-*,PD-*,T-*,TD-*,K-*,KD-* }
QS "L_FRONT-STOP" { P-*,PD-*,B-* }
QS "L_CENTRAL-STOP" { T-*,TD-*,D-*,DD-* }
QS "L_BACK-STOP" { K-*,KD-*,G-* }
QS "L_VOICED-FRIC" { Z-*,SH-*,DH-*,CH-*,V-* }
QS "L_UNVOICED-FRIC" { S-*,SH-*,TH-*,F-*,CH-* }
QS "L_FRONT-FRIC" { F-*,V-* }
QS "L_CENTRAL-FRIC" { S-*,Z-*,TH-*,DH-* }
QS "L_BACK-FRIC" { SH-*,CH-*,JH-* }
QS "L_AA" { AA-* }
QS "L_AE" { AE-* }
QS "L_AH" { AH-* }
QS "L_AO" { AO-* }
QS "L_AW" { AW-* }
QS "L_AX" { AX-* }
QS "L_AY" { AY-* }
QS "L_B" { B-* }
QS "L_CH" { CH-* }
QS "L_D" { D-* }
QS "L_DD" { DD-* }
QS "L_DH" { DH-* }
QS "L_DX" { DX-* }
QS "L_EH" { EH-* }
QS "L_EL" { EL-* }
QS "L_EN" { EN-* }
QS "L_ER" { ER-* }
QS "L_EY" { EY-* }
QS "L_F" { F-* }
QS "L_G" { G-* }
QS "L_HH" { HH-* }
QS "L_IH" { IH-* }
QS "L_IY" { IY-* }
QS "L_JH" { JH-* }
QS "L_K" { K-* }
QS "L_KD" { KD-* }
QS "L_L" { L-* }
QS "L_M" { M-* }
QS "L_N" { N-* }
QS "L_NG" { NG-* }
QS "L_OW" { OW-* }
QS "L_OY" { OY-* }
QS "L_P" { P-* }
QS "L_PD" { PD-* }
QS "L_R" { R-* }
QS "L_S" { S-* }
QS "L_SH" { SH-* }
QS "L_T" { T-* }
QS "L_TD" { TD-* }
QS "L_TH" { TH-* }
QS "L_TS" { TS-* }
QS "L_UH" { UH-* }
QS "L_UW" { UW-* }
QS "L_V" { V-* }
QS "L_W" { W-* }
QS "L_Y" { Y-* }
QS "L_Z" { Z-* }
TR 2
TB 350.0 "ST_AA_2_" {("AA","*-AA+*","AA+*","*-AA").state[2]}
TB 350.0 "ST_AE_2_" {("AE","*-AE+*","AE+*","*-AE").state[2]}
TB 350.0 "ST_AH_2_" {("AH","*-AH+*","AH+*","*-AH").state[2]}
TB 350.0 "ST_AO_2_" {("AO","*-AO+*","AO+*","*-AO").state[2]}
TB 350.0 "ST_AW_2_" {("AW","*-AW+*","AW+*","*-AW").state[2]}
TB 350.0 "ST_AY_2_" {("AY","*-AY+*","AY+*","*-AY").state[2]}
TB 350.0 "ST_B_2_" {("B","*-B+*","B+*","*-B").state[2]}
TB 350.0 "ST_CH_2_" {("CH","*-CH+*","CH+*","*-CH").state[2]}
TB 350.0 "ST_D_2_" {("D","*-D+*","D+*","*-D").state[2]}
TB 350.0 "ST_EH_2_" {("EH","*-EH+*","EH+*","*-EH").state[2]}
TB 350.0 "ST_ER_2_" {("ER","*-ER+*","ER+*","*-ER").state[2]}
TB 350.0 "ST_EY_2_" {("EY","*-EY+*","EY+*","*-EY").state[2]}
TB 350.0 "ST_F_2_" {("F","*-F+*","F+*","*-F").state[2]}
TB 350.0 "ST_G_2_" {("G","*-G+*","G+*","*-G").state[2]}
TB 350.0 "ST_HH_2_" {("HH","*-HH+*","HH+*","*-HH").state[2]}
TB 350.0 "ST_IH_2_" {("IH","*-IH+*","IH+*","*-IH").state[2]}
TB 350.0 "ST_IY_2_" {("IY","*-IY+*","IY+*","*-IY").state[2]}
TB 350.0 "ST_JH_2_" {("JH","*-JH+*","JH+*","*-JH").state[2]}
TB 350.0 "ST_K_2_" {("K","*-K+*","K+*","*-K").state[2]}
TB 350.0 "ST_L_2_" {("L","*-L+*","L+*","*-L").state[2]}
TB 350.0 "ST_M_2_" {("M","*-M+*","M+*","*-M").state[2]}
TB 350.0 "ST_N_2_" {("N","*-N+*","N+*","*-N").state[2]}
TB 350.0 "ST_OW_2_" {("OW","*-OW+*","OW+*","*-OW").state[2]}
TB 350.0 "ST_P_2_" {("P","*-P+*","P+*","*-P").state[2]}
TB 350.0 "ST_R_2_" {("R","*-R+*","R+*","*-R").state[2]}
TB 350.0 "ST_S_2_" {("S","*-S+*","S+*","*-S").state[2]}
TB 350.0 "ST_sil_2_" {("sil","*-sil+*","sil+*","*-sil").state[2]}
TB 350.0 "ST_sp_2_" {("sp","*-sp+*","sp+*","*-sp").state[2]}
TB 350.0 "ST_T_2_" {("T","*-T+*","T+*","*-T").state[2]}
TB 350.0 "ST_TH_2_" {("TH","*-TH+*","TH+*","*-TH").state[2]}
TB 350.0 "ST_UW_2_" {("UW","*-UW+*","UW+*","*-UW").state[2]}
TB 350.0 "ST_V_2_" {("V","*-V+*","V+*","*-V").state[2]}
TB 350.0 "ST_W_2_" {("W","*-W+*","W+*","*-W").state[2]}
TB 350.0 "ST_Y_2_" {("Y","*-Y+*","Y+*","*-Y").state[2]}
TB 350.0 "ST_Z_2_" {("Z","*-Z+*","Z+*","*-Z").state[2]}
TB 350.0 "ST_AA_3_" {("AA","*-AA+*","AA+*","*-AA").state[3]}
TB 350.0 "ST_AE_3_" {("AE","*-AE+*","AE+*","*-AE").state[3]}
TB 350.0 "ST_AH_3_" {("AH","*-AH+*","AH+*","*-AH").state[3]}
TB 350.0 "ST_AO_3_" {("AO","*-AO+*","AO+*","*-AO").state[3]}
TB 350.0 "ST_AW_3_" {("AW","*-AW+*","AW+*","*-AW").state[3]}
TB 350.0 "ST_AY_3_" {("AY","*-AY+*","AY+*","*-AY").state[3]}
TB 350.0 "ST_B_3_" {("B","*-B+*","B+*","*-B").state[3]}
TB 350.0 "ST_CH_3_" {("CH","*-CH+*","CH+*","*-CH").state[3]}
TB 350.0 "ST_D_3_" {("D","*-D+*","D+*","*-D").state[3]}
TB 350.0 "ST_EH_3_" {("EH","*-EH+*","EH+*","*-EH").state[3]}
TB 350.0 "ST_ER_3_" {("ER","*-ER+*","ER+*","*-ER").state[3]}
TB 350.0 "ST_EY_3_" {("EY","*-EY+*","EY+*","*-EY").state[3]}
TB 350.0 "ST_F_3_" {("F","*-F+*","F+*","*-F").state[3]}
TB 350.0 "ST_G_3_" {("G","*-G+*","G+*","*-G").state[3]}
TB 350.0 "ST_HH_3_" {("HH","*-HH+*","HH+*","*-HH").state[3]}
TB 350.0 "ST_IH_3_" {("IH","*-IH+*","IH+*","*-IH").state[3]}
TB 350.0 "ST_IY_3_" {("IY","*-IY+*","IY+*","*-IY").state[3]}
TB 350.0 "ST_JH_3_" {("JH","*-JH+*","JH+*","*-JH").state[3]}
TB 350.0 "ST_K_3_" {("K","*-K+*","K+*","*-K").state[3]}
TB 350.0 "ST_L_3_" {("L","*-L+*","L+*","*-L").state[3]}
TB 350.0 "ST_M_3_" {("M","*-M+*","M+*","*-M").state[3]}
TB 350.0 "ST_N_3_" {("N","*-N+*","N+*","*-N").state[3]}
TB 350.0 "ST_OW_3_" {("OW","*-OW+*","OW+*","*-OW").state[3]}
TB 350.0 "ST_P_3_" {("P","*-P+*","P+*","*-P").state[3]}
TB 350.0 "ST_R_3_" {("R","*-R+*","R+*","*-R").state[3]}
TB 350.0 "ST_S_3_" {("S","*-S+*","S+*","*-S").state[3]}
TB 350.0 "ST_sil_3_" {("sil","*-sil+*","sil+*","*-sil").state[3]}
TB 350.0 "ST_sp_3_" {("sp","*-sp+*","sp+*","*-sp").state[3]}
TB 350.0 "ST_T_3_" {("T","*-T+*","T+*","*-T").state[3]}
TB 350.0 "ST_TH_3_" {("TH","*-TH+*","TH+*","*-TH").state[3]}
TB 350.0 "ST_UW_3_" {("UW","*-UW+*","UW+*","*-UW").state[3]}
TB 350.0 "ST_V_3_" {("V","*-V+*","V+*","*-V").state[3]}
TB 350.0 "ST_W_3_" {("W","*-W+*","W+*","*-W").state[3]}
TB 350.0 "ST_Y_3_" {("Y","*-Y+*","Y+*","*-Y").state[3]}
TB 350.0 "ST_Z_3_" {("Z","*-Z+*","Z+*","*-Z").state[3]}
TB 350.0 "ST_AA_4_" {("AA","*-AA+*","AA+*","*-AA").state[4]}
TB 350.0 "ST_AE_4_" {("AE","*-AE+*","AE+*","*-AE").state[4]}
TB 350.0 "ST_AH_4_" {("AH","*-AH+*","AH+*","*-AH").state[4]}
TB 350.0 "ST_AO_4_" {("AO","*-AO+*","AO+*","*-AO").state[4]}
TB 350.0 "ST_AW_4_" {("AW","*-AW+*","AW+*","*-AW").state[4]}
TB 350.0 "ST_AY_4_" {("AY","*-AY+*","AY+*","*-AY").state[4]}
TB 350.0 "ST_B_4_" {("B","*-B+*","B+*","*-B").state[4]}
TB 350.0 "ST_CH_4_" {("CH","*-CH+*","CH+*","*-CH").state[4]}
TB 350.0 "ST_D_4_" {("D","*-D+*","D+*","*-D").state[4]}
TB 350.0 "ST_EH_4_" {("EH","*-EH+*","EH+*","*-EH").state[4]}
TB 350.0 "ST_ER_4_" {("ER","*-ER+*","ER+*","*-ER").state[4]}
TB 350.0 "ST_EY_4_" {("EY","*-EY+*","EY+*","*-EY").state[4]}
TB 350.0 "ST_F_4_" {("F","*-F+*","F+*","*-F").state[4]}
TB 350.0 "ST_G_4_" {("G","*-G+*","G+*","*-G").state[4]}
TB 350.0 "ST_HH_4_" {("HH","*-HH+*","HH+*","*-HH").state[4]}
TB 350.0 "ST_IH_4_" {("IH","*-IH+*","IH+*","*-IH").state[4]}
TB 350.0 "ST_IY_4_" {("IY","*-IY+*","IY+*","*-IY").state[4]}
TB 350.0 "ST_JH_4_" {("JH","*-JH+*","JH+*","*-JH").state[4]}
TB 350.0 "ST_K_4_" {("K","*-K+*","K+*","*-K").state[4]}
TB 350.0 "ST_L_4_" {("L","*-L+*","L+*","*-L").state[4]}
TB 350.0 "ST_M_4_" {("M","*-M+*","M+*","*-M").state[4]}
TB 350.0 "ST_N_4_" {("N","*-N+*","N+*","*-N").state[4]}
TB 350.0 "ST_OW_4_" {("OW","*-OW+*","OW+*","*-OW").state[4]}
TB 350.0 "ST_P_4_" {("P","*-P+*","P+*","*-P").state[4]}
TB 350.0 "ST_R_4_" {("R","*-R+*","R+*","*-R").state[4]}
TB 350.0 "ST_S_4_" {("S","*-S+*","S+*","*-S").state[4]}
TB 350.0 "ST_sil_4_" {("sil","*-sil+*","sil+*","*-sil").state[4]}
TB 350.0 "ST_sp_4_" {("sp","*-sp+*","sp+*","*-sp").state[4]}
TB 350.0 "ST_T_4_" {("T","*-T+*","T+*","*-T").state[4]}
TB 350.0 "ST_TH_4_" {("TH","*-TH+*","TH+*","*-TH").state[4]}
TB 350.0 "ST_UW_4_" {("UW","*-UW+*","UW+*","*-UW").state[4]}
TB 350.0 "ST_V_4_" {("V","*-V+*","V+*","*-V").state[4]}
TB 350.0 "ST_W_4_" {("W","*-W+*","W+*","*-W").state[4]}
TB 350.0 "ST_Y_4_" {("Y","*-Y+*","Y+*","*-Y").state[4]}
TB 350.0 "ST_Z_4_" {("Z","*-Z+*","Z+*","*-Z").state[4]}
TR 2
AU "fulllist"
CO "tiedlist"
ST "trees"
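The TB section of tree.hed is one line per phone and emitting state, all with the same threshold, so it lends itself to generation from a phone list. The sketch below prints lines in the same layout as the listing above. The function name tb_lines is illustrative, and the two-phone input is only a demonstration; feeding it monophones1 is an assumption about which phones should be covered.

```shell
#!/bin/sh
# Print one TB (tree-based clustering) command per phone and per
# emitting state 2..4, using the threshold given as the first argument.
tb_lines() {  # usage: tb_lines THRESHOLD < phone_list
    phones=$(cat)
    for s in 2 3 4; do
        for p in $phones; do
            printf 'TB %s "ST_%s_%s_" {("%s","*-%s+*","%s+*","*-%s").state[%s]}\n' \
                "$1" "$p" "$s" "$p" "$p" "$p" "$p" "$s"
        done
    done
}

printf 'AA\nB\n' | tb_lines 350.0
```

The output groups all state 2 lines first, then state 3, then state 4, matching the order used in tree.hed.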
8. Author the global.ded HDMan script file:
AS sp
RS cmu
MP sil sil sp
TC
9. Create fulllist, the full list of triphones (referenced by tree.hed). Up to now only the triphones that occur in the training data have been considered; the test data may contain triphones that have not been encountered yet.
HDMan.exe -b sp -n fulllist -g global.ded -l flog dict-tri dict
Manually append the following to the newly created fulllist:
AY
EY
IY
OW
10. Create hmm13. This version uses the decision-tree questions in the tree.hed file to perform state tying of the triphone models.
HHEd.exe -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones1 > log
11. Modify gauss_split.hed to contain:
MU 4 {*.state[2-4].mix}
12. Modify hmm13 using the updated script to split the Gaussians used in each state.
HHEd.exe -H hmm13/macros -H hmm13/hmmdefs -M hmm13 gauss_split.hed tiedlist
13. Retrain hmm13.
HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm13/macros -H hmm13/hmmdefs -M hmm13 tiedlist
HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm13/macros -H hmm13/hmmdefs -M hmm13 tiedlist
14. Modify gauss_split.hed to contain:
MU 8 {*.state[2-4].mix}
15. Create hmm14 using the updated script to split the Gaussians used in each state.
HHEd.exe -H hmm13/macros -H hmm13/hmmdefs -M hmm14 gauss_split.hed tiedlist
16. Retrain hmm14.
HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm14/macros -H hmm14/hmmdefs -M hmm14 tiedlist
HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm14/macros -H hmm14/hmmdefs -M hmm14 tiedlist
17. Create hmm15. hmm15 will be the final model used for recognition (a final re-estimation with the tied models and 8 Gaussians per state).
HERest.exe -C config -I wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm14/macros -H hmm14/hmmdefs -M hmm15 tiedlist
Running the HTK Word Recognizer
1. The test data must be imported and coded to MFCC in the same way as the training data. Run the following commands to prepare it:
mkdir an4_test_audio an4_test_mfcc
find /an4/wav/an4test_clstk -name "*.sph" | xargs -i cp {} an4_test_audio
find /htktut/an4_test_audio > pre_codetest.scp
./make_codetest.pl > codetest.scp
HCopy.exe -T 1 -C config_hcopy -S codetest.scp
2. Create test word MLF and phone-level MLF
cp /an4/etc/an4_test.transcription /htktut
./an4_test_mlf_maker.pl > testwords.mlf
HLEd.exe -l '*' -d dict -i testphones.mlf mkphones0.led testwords.mlf
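3. Run the recognizer, HVite, against hmm15. Because decoding uses the word network wdnet and the dictionary dict, recout.mlf will be a word-level MLF containing the recognizer's hypotheses. Here test.scp lists the coded test files, in the same way that train.scp lists the coded training files.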
HVite.exe -H hmm15/macros -H hmm15/hmmdefs -S test.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict tiedlist
4. View the result data statistics by comparing the resulting MLF to the transcription MLF.
HResults -I testwords.mlf tiedlist recout.mlf
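HResults reports two headline figures computed from the number of reference words N and the counts of deletions D, substitutions S, and insertions I: percent correct, 100 * (N - D - S) / N, and accuracy, 100 * (N - D - S - I) / N. The sketch below reproduces that arithmetic so the report is easier to interpret. The function name htk_scores and the counts passed to it are hypothetical and are not results from this tutorial.

```shell
#!/bin/sh
# Compute percent correct and accuracy from word-level error counts.
htk_scores() {  # usage: htk_scores N D S I
    awk -v n="$1" -v d="$2" -v s="$3" -v i="$4" 'BEGIN {
        h = n - d - s                     # correctly recognised words
        printf "Corr=%.2f Acc=%.2f\n", 100 * h / n, 100 * (h - i) / n
    }'
}

# Hypothetical counts, for illustration only.
htk_scores 100 2 5 3
```

Accuracy is never higher than percent correct because it also penalises inserted words.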