Isolated Word Recognition
EEN 540 - Computer Project III
                by Richard Juszkiewicz
 

This project uses the principles of dynamic time warping (DTW) to develop a speaker dependent word recognition system.  The words required to be recognized are the spoken digits zero through nine.  Part A of this project deals with the recording of words that are used for the remainder of the project.  The next part involves the processing of the recorded words and an implementation of the recognition algorithm.  The third part deals with optional enhancements that were made to the algorithm developed in part B.  Commentary on the results of this project can be found in the final section of this page.  All algorithms were implemented using the MATLAB software package.  When necessary, links to the source code are provided in each part. 

Part A - Test Material

In this part of the project the database of words to be used was created.  Each digit was recorded five times at a sampling rate of 8kHz with a resolution of 16-bits.  Four of the five recordings of each digit were randomly selected and used for the reference database.  The fifth (remaining) recording of each digit  was used as the test utterance.  End point detection was assumed to be done manually during this step while recording the files.  The zip file of the test material used for the project can be downloaded here

Part B - System Development

Before DTW is applied to the words they must be processed.  All audio is cut up into frames of length 512 samples with 50% overlap and windowed using a 512 point Hamming window.  A 1024 (512 non-redundant) point FFT of each of these frames is then taken laying the groundwork for the frame to be divided into sub-bands.  In this case the sub-bands correspond to the 17 critical bands of human hearing between 0 and 4000Hz.  Sub-bands with such a division are often referred to as Bark bands.  The following equation was used to convert from frequency to Bark:

z(i) = 13*atan(0.76*(f/1000)) + 3.5*atan(((f/1000)/7.5).^2)

The results of this equation do not correspond exactly with the table of critical bands that was provided in the project description but the discrepancies introduced are negligible.  All of the aforementioned work was used to make a function named energy_per_frame.  The source code can be downloaded by clicking the name of the function.  The input of this function is a wav file and the output is a matrix of width 17 (corresponding to the number of sub-bands) that varies in length.  The length depends on how many frames are contained in the inputted audio file.

Now that the audio is adequately processed a DTW implementation must be developed to compare two given wav files.  To handle this, a dtw function was coded based on the examples in the class notes.  The input to the function is two matrices, each of which correspond to the outputs of the previously established energy_per_frame function.  One of the two inputs is the reference sound and the other is the test sound.  In this section this function returns just the final distance.  The code can be downloaded here

With the functions described in the prior two paragraphs a word recognition system can now be implemented.  For testing purposes all 10 files were tested against the 40 reference files in one pass of the program.  The output of this program is a 10x40 matrix with each column representing the computed distances for the given test file against all of the reference files.  This matrix is shown below.  The blue numbers across the top are the test sounds and the blue numbers along the side are the reference sounds.  The four smallest distances in each column are highlighted in red providing a good indicator of the accuracy of the system. 

0 1 2 3 4 5 6 7 8 9
0 331.00 1044.50 472.42 500.41 691.78 983.29 545.09 810.46 586.80 870.42
0 750.40 1581.50 1029.40 1221.20 1551.90 1731.00 1512.00 1565.60 1280.30 1584.30
0 405.50 1101.30 587.71 528.68 788.73 1013.40 767.86 889.42 579.53 908.39
0 534.39 1375.30 1104.40 978.04 1253.70 1316.10 1301.20 1176.50 933.05 1194.90
1 1165.30 491.52 1104.00 732.35 878.93 709.24 806.94 691.96 627.86 480.55
1 1695.30 569.95 1724.60 1402.00 1770.40 1655.80 1663.10 1412.90 1196.60 1161.20
1 1126.10 524.47 1165.90 787.36 1040.40 995.75 1026.60 839.74 797.96 687.19
1 1069.70 586.71 1216.90 830.63 1085.30 963.87 1045.20 777.87 716.00 682.59
2 607.36 1233.30 190.72 500.67 888.03 1276.30 677.79 1048.90 837.33 1138.90
2 815.64 1192.50 391.92 428.22 1161.60 1485.50 878.04 1212.30 1006.70 1234.50
2 662.51 1150.40 457.30 644.40 1200.90 1283.30 711.91 1143.70 896.27 1062.40
2 718.13 1437.30 578.23 930.22 1252.90 1527.00 937.45 1383.60 1179.60 1416.40
3 775.68 975.66 613.79 287.53 549.45 614.97 310.37 668.92 288.10 770.74
3 692.70 824.65 564.99 303.67 675.02 641.21 252.10 634.32 356.32 524.23
3 689.10 956.49 503.32 187.50 697.51 785.98 384.50 707.30 408.11 763.48
3 613.47 946.44 518.58 260.18 626.41 777.26 358.66 660.47 380.52 699.00
4 886.75 1388.00 1105.50 924.24 327.81 931.70 871.91 803.67 930.05 1210.80
4 526.35 1261.10 775.31 776.62 405.84 810.94 496.68 788.47 860.98 1052.20
4 697.98 1146.20 984.49 787.73 214.70 721.30 546.71 749.86 678.83 890.14
4 602.03 1179.20 632.06 637.15 345.71 744.33 470.88 745.70 811.50 967.18
5 1329.50 1305.70 1483.20 1128.90 918.46 249.58 545.84 613.76 779.64 888.44
5 1393.30 1392.80 1520.70 1151.10 889.67 291.01 738.77 612.31 966.05 964.31
5 1284.50 1283.80 1463.00 1090.30 852.22 227.46 505.55 558.67 708.76 803.88
5 1282.70 1344.20 1500.00 1111.00 935.40 292.39 486.72 523.29 815.72 857.54
6 815.77 1140.80 702.19 475.72 462.17 517.58 95.51 508.14 519.71 718.46
6 795.14 1237.80 701.56 463.09 526.22 511.59 68.66 520.91 535.73 694.41
6 615.29 1159.40 702.22 496.45 512.81 564.15 118.18 483.66 462.57 750.97
6 691.70 1252.00 616.80 579.73 548.71 560.55 101.71 562.14 584.95 722.65
7 1034.20 1084.00 1073.00 744.38 650.84 407.92 393.29 244.23 608.56 724.21
7 896.59 1048.90 1051.20 693.68 664.20 385.42 400.05 275.35 571.04 646.42
7 1103.60 1283.80 1206.40 838.82 715.90 400.79 394.69 260.14 784.08 831.90
7 989.36 1112.30 1130.80 749.70 715.82 425.95 368.79 291.56 644.26 719.82
8 824.02 808.84 649.96 393.56 1005.30 974.37 787.76 887.75 315.69 650.28
8 718.42 738.85 665.11 340.30 817.61 867.11 648.37 761.43 136.02 565.51
8 788.33 724.72 719.11 381.54 859.69 801.13 637.62 752.01 116.62 468.48
8 684.84 820.53 663.39 380.26 872.74 840.62 570.31 756.95 168.38 642.44
9 1004.60 788.38 1139.40 779.36 997.44 783.23 715.10 713.20 474.80 325.96
9 980.90 789.41 1063.90 706.48 936.13 714.73 714.50 693.62 448.86 316.27
9 1064.20 782.31 1192.50 856.38 1040.80 817.79 861.51 764.38 579.11 364.77
9 983.87 696.82 1218.50 884.70 1107.60 846.75 763.93 810.86 562.99 470.28

The results of the above table can be summarized in a confusion matrix, shown below. 

  Confusion Matrix
Actual
Predicted   0 1 2 3 4 5 6 7 8 9
0 3 0 0 0 1 0 0 0 0 0
1 0 4 0 0 0 0 0 0 0 0
2 1 0 3 0 0 0 0 0 0 0
3 0 0 0 4 0 0 0 0 0 0
4 0 0 0 0 4 0 0 0 0 0
5 0 0 0 0 0 4 0 0 0 0
6 0 0 0 0 0 0 4 0 0 0
7 0 0 0 0 0 0 0 4 0 0
8 0 0 0 0 0 0 0 0 4 0
9 0 0 0 0 0 0 0 0 1 3
Error = 10%

The implementation code for this part is available here.  A 'one at a time' implementation was also developed.  This program prompts the user for the input (the sound to be recognized) and then calculates the DTW against all of the 40 reference sounds.  The detected value is simply the number that corresponds to the shortest distance of the forty.  This code is available here.  Outputting the number with the lowest distance always resulted in correct detection, thus giving it a percentage error of 0. 

Part C - Optional Enhancements

1. The first of two optional enhancements to be implemented is the use of a threshold in recognition.  Using this threshold will allow for the possibility of words outside of the vocabulary to be rejected; it will also allow for the rejection of a word if it is too similar to more than one word in the database.  The threshold of 495 was selected to be the slightly higher than highest low value in the results table shown above.  It is obvious just from the data in the table that there will be plenty of recognition errors with such a high threshold.  The recognition error for known words is actually 90% (only zero is recognized correctly) with this high of a threshold.  Two 'outside' words (ten and eleven) were tested on this system as well.  Ten was detected as eight and eleven was detected as not being similar enough to any in the database.  The latter is correct, the former is obviously not correct.  This gives makes the false acceptance rate 50%.  The code for this part can be found here.  If the threshold was lowered better results would be attained with the sacrifice of one never being able to be detected because it is the reason that such a high threshold is needed.  Using a threshold in the manner just described is not a very good idea.  This sets the stage for the additional enhancement used in the next part. 

2. To complete this part another function was written and the original DTW function from above was modified.  The modification was made to not only return the final distance but to return the distance matrix that is used to obtain the final distance.  The distance matrix is needed to calculate the optimal path which is of great importance in this section.

The new function written was the most complex of the entire project.  It takes as an input a value of 0 through 9 and outputs a single sequence based on the four reference files for the inputted number.  The first step in doing this is to calculate the shortest file for the inputted number and this becomes the target length for the other three files so that they can be averaged to create the final (single) sequence.  The shortest file is then used as the reference in the modified DTW function while the other three are used as test files and the optimal path for each is calculated based on the distance matrix outputted from the DTW function.  Horizontal parts of the optimal path represent frames in the file being tested that can be averaged in order to shorten the length.  This averaging is always done in pairs to combine two frames into one.  After every average operation the new length of the sequence is compared to that of the reference (shortest) sequence.  If the sequence is not short enough the next horizontal part of the optimal path is detected and another average is performed.  This process is repeated until the desired length is obtained.  After all four of the reference sequences are of the same length they are averaged and then that single sequence is returned by the function.  The code for this function can be found here

In the main implementation for this part of the project the previously explained averaging function is called 10 times, once for each of the possible word values.  The results of this operation are compared one at a time to each of the test files thus generating a matrix similar to the one shown above, but of a smaller size.  This one is only 10x10 rather than 40x10 because all four of the references are combined into one thanks to the function explained earlier.  This matrix is shown below with the lowest value in each column highlighted in red. 

0 1 2 3 4 5 6 7 8 9
0 384.35 1222.9 694.57 755.01 1147.3 1222.3 1098.2 1065.7 762.29 1082
1 1180 318.95 1232.7 890.97 1153.1 1026 1127.1 879.4 729.65 668.64
2 558.82 1164.2 244.38 511.83 1029.5 1342.7 757.86 1122.1 844.86 1161.2
3 679.3 871.14 529.42 211.63 593.06 645.77 268.64 632.15 271.21 665.69
4 647.87 1146.7 817.62 668.19 216.34 694.48 520.31 726.82 681.06 938.68
5 1270.4 1247.6 1473.7 1098 866.45 200.82 577.06 604.2 718.91 803.34
6 768.96 1111.5 814.72 504.43 534.38 491.14 100.71 539.46 366.1 668.6
7 990.13 1062.7 1141.6 738.33 665.43 367.62 380.03 295.52 575.84 683.91
8 765.96 755.43 666.21 326.37 863.19 860.1 646.6 783.66 144.36 566.99
9 930.37 709.04 1076.2 729.73 982.3 751.7 656.08 685.1 423.64 249.34

There is a 0% error for this technique.  It proves to be the more successful than the previous methods.  As the system becomes trained with more variations of the same sound this system will most likely increase in accuracy.  The implementation code for this section is available here

Part D - Summary and Discussion of Results

At first the implementation in Part B was done by creating a large three-dimensional array of the frame data for all of the numbers in the reference database.  Storing these values in such a way resulted in frames of zeros being appended on numbers that don't have the same number of frames as the longest file.  The presence of these zeros slightly altered the results of the DTW and made Part C more difficult--this is the reason for creating functions for everything.  It takes longer to run but it is more accurate.  When outputting the number corresponding to the index of the shortest distance this system is 100% accurate.  The only limitation of this process is that the words being inputted must be in the known vocabulary because this system doesn't have any sort of false detection algorithm implemented yet.  When analyzing the results of all four reference values for a given input the system is 90% accurate. 

Adding a threshold to reject words not in the system and to detect when an inputted word is similar to more than one number in the database proved to not be very accurate thus demonstrating the limitations of DTW (See above for percentage values).  DTW is an effective method of word recognition when only a very limited vocabulary is used. 

Exploiting the principles of time warping to combine all of the reference values for a given number into one increases the accuracy of the system while using less computational resources and less memory to store the reference database.  This is the best way to implement DTW.  A threshold may be added to this part and the results would be better than those of the previous section but they would still not be 100% accurate.  Once again, this proves that DTW is best when the user will always input from a finite and small set of known words which in this case are the spoken digits from zero to nine.