This project uses the principles of dynamic time warping (DTW) to
develop a speaker dependent word recognition system. The words
required to be recognized are the spoken digits zero through nine.
Part A of this project deals with the recording of words that are
used for the remainder of the project. The next part involves
the processing of the recorded words and an implementation of the
recognition algorithm. The third part deals with optional
enhancements that were made to the algorithm developed in part B.
Commentary on the results of this project can be found in the final
section of this page. All algorithms were implemented using
the MATLAB software package. When necessary, links to the
source code are provided in each part.
Part A - Test Material
In this part of the project the database of
words to be used was created. Each digit was recorded five
times at a sampling rate of 8kHz with a resolution of 16-bits.
Four of the five recordings of each digit were randomly selected and
used for the reference database. The fifth (remaining)
recording of each digit was used as the test utterance.
End point detection was assumed to be done manually during this step
while recording the files. The zip file of the test material
used for the project can be downloaded here.
Part B - System Development
Before DTW is applied to the words they
must be processed. All audio is cut up into frames of length
512 samples with 50% overlap and windowed using a 512 point Hamming
window. A 1024 (512 non-redundant) point FFT of each of these
frames is then taken laying the groundwork for the frame to be
divided into sub-bands. In this case the sub-bands correspond
to the 17 critical bands of human hearing between 0 and 4000Hz.
Sub-bands with such a division are often referred to as Bark bands.
The following equation was used to convert from frequency to Bark:
z(i) =
13*atan(0.76*(f/1000)) + 3.5*atan(((f/1000)/7.5).^2)
The results of this equation do not
correspond exactly with the table of critical bands that was provided
in the project description but the discrepancies introduced are
negligible. All of the aforementioned work was used to make a
function named energy_per_frame.
The source code can be downloaded by clicking the name of the
function. The input of this function is a wav file and the
output is a matrix of width 17 (corresponding to the number of
sub-bands) that varies in length. The length depends on how
many frames are contained in the inputted audio file.
Now that the audio is adequately processed a DTW implementation must
be developed to compare two given wav files. To handle this, a
dtw function was coded based on the examples in the class notes.
The input to the function is two matrices, each of which correspond
to the outputs of the previously established energy_per_frame
function. One of the two inputs is the reference sound and the
other is the test sound. In this section this function returns
just the final distance. The code can be downloaded
here.
With the functions described in the prior two paragraphs a word
recognition system can now be implemented. For testing
purposes all 10 files were tested against the 40 reference files in
one pass of the program. The output of this program is a 10x40
matrix with each column representing the computed distances for the
given test file against all of the reference files. This
matrix is shown below. The blue numbers across the top are the
test sounds and the blue numbers along the side are the reference
sounds. The four smallest distances in each column are
highlighted in red providing a good indicator of the accuracy of the
system.
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
0 |
331.00 |
1044.50 |
472.42 |
500.41 |
691.78 |
983.29 |
545.09 |
810.46 |
586.80 |
870.42 |
0 |
750.40 |
1581.50 |
1029.40 |
1221.20 |
1551.90 |
1731.00 |
1512.00 |
1565.60 |
1280.30 |
1584.30 |
0 |
405.50 |
1101.30 |
587.71 |
528.68 |
788.73 |
1013.40 |
767.86 |
889.42 |
579.53 |
908.39 |
0 |
534.39 |
1375.30 |
1104.40 |
978.04 |
1253.70 |
1316.10 |
1301.20 |
1176.50 |
933.05 |
1194.90 |
1 |
1165.30 |
491.52 |
1104.00 |
732.35 |
878.93 |
709.24 |
806.94 |
691.96 |
627.86 |
480.55 |
1 |
1695.30 |
569.95 |
1724.60 |
1402.00 |
1770.40 |
1655.80 |
1663.10 |
1412.90 |
1196.60 |
1161.20 |
1 |
1126.10 |
524.47 |
1165.90 |
787.36 |
1040.40 |
995.75 |
1026.60 |
839.74 |
797.96 |
687.19 |
1 |
1069.70 |
586.71 |
1216.90 |
830.63 |
1085.30 |
963.87 |
1045.20 |
777.87 |
716.00 |
682.59 |
2 |
607.36 |
1233.30 |
190.72 |
500.67 |
888.03 |
1276.30 |
677.79 |
1048.90 |
837.33 |
1138.90 |
2 |
815.64 |
1192.50 |
391.92 |
428.22 |
1161.60 |
1485.50 |
878.04 |
1212.30 |
1006.70 |
1234.50 |
2 |
662.51 |
1150.40 |
457.30 |
644.40 |
1200.90 |
1283.30 |
711.91 |
1143.70 |
896.27 |
1062.40 |
2 |
718.13 |
1437.30 |
578.23 |
930.22 |
1252.90 |
1527.00 |
937.45 |
1383.60 |
1179.60 |
1416.40 |
3 |
775.68 |
975.66 |
613.79 |
287.53 |
549.45 |
614.97 |
310.37 |
668.92 |
288.10 |
770.74 |
3 |
692.70 |
824.65 |
564.99 |
303.67 |
675.02 |
641.21 |
252.10 |
634.32 |
356.32 |
524.23 |
3 |
689.10 |
956.49 |
503.32 |
187.50 |
697.51 |
785.98 |
384.50 |
707.30 |
408.11 |
763.48 |
3 |
613.47 |
946.44 |
518.58 |
260.18 |
626.41 |
777.26 |
358.66 |
660.47 |
380.52 |
699.00 |
4 |
886.75 |
1388.00 |
1105.50 |
924.24 |
327.81 |
931.70 |
871.91 |
803.67 |
930.05 |
1210.80 |
4 |
526.35 |
1261.10 |
775.31 |
776.62 |
405.84 |
810.94 |
496.68 |
788.47 |
860.98 |
1052.20 |
4 |
697.98 |
1146.20 |
984.49 |
787.73 |
214.70 |
721.30 |
546.71 |
749.86 |
678.83 |
890.14 |
4 |
602.03 |
1179.20 |
632.06 |
637.15 |
345.71 |
744.33 |
470.88 |
745.70 |
811.50 |
967.18 |
5 |
1329.50 |
1305.70 |
1483.20 |
1128.90 |
918.46 |
249.58 |
545.84 |
613.76 |
779.64 |
888.44 |
5 |
1393.30 |
1392.80 |
1520.70 |
1151.10 |
889.67 |
291.01 |
738.77 |
612.31 |
966.05 |
964.31 |
5 |
1284.50 |
1283.80 |
1463.00 |
1090.30 |
852.22 |
227.46 |
505.55 |
558.67 |
708.76 |
803.88 |
5 |
1282.70 |
1344.20 |
1500.00 |
1111.00 |
935.40 |
292.39 |
486.72 |
523.29 |
815.72 |
857.54 |
6 |
815.77 |
1140.80 |
702.19 |
475.72 |
462.17 |
517.58 |
95.51 |
508.14 |
519.71 |
718.46 |
6 |
795.14 |
1237.80 |
701.56 |
463.09 |
526.22 |
511.59 |
68.66 |
520.91 |
535.73 |
694.41 |
6 |
615.29 |
1159.40 |
702.22 |
496.45 |
512.81 |
564.15 |
118.18 |
483.66 |
462.57 |
750.97 |
6 |
691.70 |
1252.00 |
616.80 |
579.73 |
548.71 |
560.55 |
101.71 |
562.14 |
584.95 |
722.65 |
7 |
1034.20 |
1084.00 |
1073.00 |
744.38 |
650.84 |
407.92 |
393.29 |
244.23 |
608.56 |
724.21 |
7 |
896.59 |
1048.90 |
1051.20 |
693.68 |
664.20 |
385.42 |
400.05 |
275.35 |
571.04 |
646.42 |
7 |
1103.60 |
1283.80 |
1206.40 |
838.82 |
715.90 |
400.79 |
394.69 |
260.14 |
784.08 |
831.90 |
7 |
989.36 |
1112.30 |
1130.80 |
749.70 |
715.82 |
425.95 |
368.79 |
291.56 |
644.26 |
719.82 |
8 |
824.02 |
808.84 |
649.96 |
393.56 |
1005.30 |
974.37 |
787.76 |
887.75 |
315.69 |
650.28 |
8 |
718.42 |
738.85 |
665.11 |
340.30 |
817.61 |
867.11 |
648.37 |
761.43 |
136.02 |
565.51 |
8 |
788.33 |
724.72 |
719.11 |
381.54 |
859.69 |
801.13 |
637.62 |
752.01 |
116.62 |
468.48 |
8 |
684.84 |
820.53 |
663.39 |
380.26 |
872.74 |
840.62 |
570.31 |
756.95 |
168.38 |
642.44 |
9 |
1004.60 |
788.38 |
1139.40 |
779.36 |
997.44 |
783.23 |
715.10 |
713.20 |
474.80 |
325.96 |
9 |
980.90 |
789.41 |
1063.90 |
706.48 |
936.13 |
714.73 |
714.50 |
693.62 |
448.86 |
316.27 |
9 |
1064.20 |
782.31 |
1192.50 |
856.38 |
1040.80 |
817.79 |
861.51 |
764.38 |
579.11 |
364.77 |
9 |
983.87 |
696.82 |
1218.50 |
884.70 |
1107.60 |
846.75 |
763.93 |
810.86 |
562.99 |
470.28 |
The results of the above table can be
summarized in a confusion matrix, shown below.
|
Confusion Matrix |
|
Actual |
Predicted |
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
0 |
3 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
4 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
2 |
1 |
0 |
3 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
3 |
0 |
0 |
0 |
4 |
0 |
0 |
0 |
0 |
0 |
0 |
4 |
0 |
0 |
0 |
0 |
4 |
0 |
0 |
0 |
0 |
0 |
5 |
0 |
0 |
0 |
0 |
0 |
4 |
0 |
0 |
0 |
0 |
6 |
0 |
0 |
0 |
0 |
0 |
0 |
4 |
0 |
0 |
0 |
7 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4 |
0 |
0 |
8 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
4 |
0 |
9 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
3 |
Error = 10%
|
The implementation code for this part is
available here. A 'one at a
time' implementation was also developed. This program prompts
the user for the input (the sound to be recognized) and then
calculates the DTW against all of the 40 reference sounds. The
detected value is simply the number that corresponds to the shortest
distance of the forty. This code is available
here. Outputting
the number with the lowest distance always resulted in correct
detection, thus giving it a percentage error of 0.
Part C - Optional
Enhancements
1. The first of two optional enhancements
to be implemented is the use of a threshold in recognition.
Using this threshold will allow for the possibility of words outside
of the vocabulary to be rejected; it will also allow for the
rejection of a word if it is too similar to more than one word in
the database. The threshold of 495 was selected to be the
slightly higher than highest low value in the results table shown
above. It is obvious just from the data in the table that
there will be plenty of recognition errors with such a high
threshold. The recognition error for known words is actually
90% (only zero is recognized correctly) with this high of a
threshold. Two 'outside' words (ten and eleven) were tested on
this system as well. Ten was detected as eight and eleven was
detected as not being similar enough to any in the database.
The latter is correct, the former is obviously not correct.
This gives makes the false acceptance rate 50%. The code for
this part can be found
here. If
the threshold was lowered better results would be attained with the
sacrifice of one never being able to be detected because it is the
reason that such a high threshold is needed. Using a threshold
in the manner just described is not a very good idea. This
sets the stage for the additional enhancement used in the next part.
2. To complete this part another function was written and the
original DTW function from above was modified. The
modification was made to not only return the final distance but to
return the distance matrix that is used to obtain the final
distance. The distance matrix is needed to calculate the
optimal path which is of great importance in this section.
The new function written was the most complex of the entire project.
It takes as an input a value of 0 through 9 and outputs a single
sequence based on the four reference files for the inputted number.
The first step in doing this is to calculate the shortest file for
the inputted number and this becomes the target length for the other
three files so that they can be averaged to create the final
(single) sequence. The shortest file is then used as the
reference in the modified DTW function
while the other three are used as test files and the optimal path
for each is calculated based on the distance matrix outputted from
the DTW function. Horizontal parts of the optimal path
represent frames in the file being tested that can be averaged in
order to shorten the length. This averaging is always done in
pairs to combine two frames into one. After every average
operation the new length of the sequence is compared to that of the
reference (shortest) sequence. If the sequence is not short
enough the next horizontal part of the optimal path is detected and
another average is performed. This process is repeated until
the desired length is obtained. After all four of the
reference sequences are of the same length they are averaged and
then that single sequence is returned by the function. The
code for this function can be found here.
In the main implementation for this part of the project the
previously explained averaging function is called 10 times, once for
each of the possible word values. The results of this
operation are compared one at a time to each of the test files thus
generating a matrix similar to the one shown above, but of a smaller
size. This one is only 10x10 rather than 40x10 because all
four of the references are combined into one thanks to the function
explained earlier. This matrix is shown below with the lowest
value in each column highlighted in red.
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
0 |
384.35 |
1222.9 |
694.57 |
755.01 |
1147.3 |
1222.3 |
1098.2 |
1065.7 |
762.29 |
1082 |
1 |
1180 |
318.95 |
1232.7 |
890.97 |
1153.1 |
1026 |
1127.1 |
879.4 |
729.65 |
668.64 |
2 |
558.82 |
1164.2 |
244.38 |
511.83 |
1029.5 |
1342.7 |
757.86 |
1122.1 |
844.86 |
1161.2 |
3 |
679.3 |
871.14 |
529.42 |
211.63 |
593.06 |
645.77 |
268.64 |
632.15 |
271.21 |
665.69 |
4 |
647.87 |
1146.7 |
817.62 |
668.19 |
216.34 |
694.48 |
520.31 |
726.82 |
681.06 |
938.68 |
5 |
1270.4 |
1247.6 |
1473.7 |
1098 |
866.45 |
200.82 |
577.06 |
604.2 |
718.91 |
803.34 |
6 |
768.96 |
1111.5 |
814.72 |
504.43 |
534.38 |
491.14 |
100.71 |
539.46 |
366.1 |
668.6 |
7 |
990.13 |
1062.7 |
1141.6 |
738.33 |
665.43 |
367.62 |
380.03 |
295.52 |
575.84 |
683.91 |
8 |
765.96 |
755.43 |
666.21 |
326.37 |
863.19 |
860.1 |
646.6 |
783.66 |
144.36 |
566.99 |
9 |
930.37 |
709.04 |
1076.2 |
729.73 |
982.3 |
751.7 |
656.08 |
685.1 |
423.64 |
249.34 |
There is a 0% error for this
technique. It proves to be the more successful than the
previous methods. As the system becomes trained with more
variations of the same sound this system will most likely increase
in accuracy. The implementation code for this section is
available here.
Part D - Summary and Discussion
of Results
At first the implementation in Part
B was done by creating a large three-dimensional array of the frame
data for all of the numbers in the reference database. Storing
these values in such a way resulted in frames of zeros being
appended on numbers that don't have the same number of frames as the
longest file. The presence of these zeros slightly altered the
results of the DTW and made Part C more difficult--this is the
reason for creating functions for everything. It takes longer
to run but it is more accurate. When outputting the number
corresponding to the index of the shortest distance this system is
100% accurate. The only limitation of this process is that the
words being inputted must be in the known vocabulary because this
system doesn't have any sort of false detection algorithm
implemented yet. When analyzing the results of all four
reference values for a given input the system is 90% accurate.
Adding a threshold to reject words not in the system and to detect
when an inputted word is similar to more than one number in the
database proved to not be very accurate thus demonstrating the
limitations of DTW (See above for percentage values). DTW is
an effective method of word recognition when only a very limited
vocabulary is used.
Exploiting the principles of time warping
to combine all of the reference values for a given number into one
increases the accuracy of the system while using less computational
resources and less memory to store the reference database.
This is the best way to implement DTW. A threshold may be
added to this part and the results would be better than those of the
previous section but they would still not be 100% accurate.
Once again, this proves that DTW is best when the user will always
input from a finite and small set of known words which in this case
are the spoken digits from zero to nine.
|