Speech recognition Algorithm:

X(t)  = Input speech signal.

This input speech signal is in the time domain. We convert it into the frequency domain because in the frequency domain sounds can be more easily differentiated from one another.

When we convert it into the frequency domain, speech attributes that are not visible in the time domain become visible.

However, once the signal is in the frequency domain, not all of the frequencies are helpful.

MFCC:

We start by taking the STFT of the signal:

X(t) ===> short-time Fourier transform (STFT)

Basically, we perform a Fourier transform. Computed directly from its definition, however, it is a very lengthy process.

$$X(\omega) = \sum_{n} x[n] \, e^{-j \omega n}$$

The fast Fourier transform (FFT) is simply an algorithm for computing the Fourier transform. We use it because it is an efficient implementation: it gives the same result with far less computation.
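To make the difference concrete, here is a minimal sketch that computes the same transform two ways: directly from the sum, and with a simple recursive radix-2 FFT. The function names and the 8-sample toy signal are assumptions made for illustration, and real code would normally call a library FFT instead.

```python
import cmath
import math


def dft(samples):
    """Direct DFT: X[k] = sum_n x[n] * e^(-j*2*pi*k*n/N). Cost grows as N^2."""
    n_samples = len(samples)
    spectrum = []
    for k in range(n_samples):
        total = 0j
        for n, x in enumerate(samples):
            total += x * cmath.exp(-2j * math.pi * k * n / n_samples)
        spectrum.append(total)
    return spectrum


def fft(samples):
    """Recursive radix-2 FFT. len(samples) must be a power of two."""
    n_samples = len(samples)
    if n_samples == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])  # transform of even-indexed samples
    odd = fft(samples[1::2])   # transform of odd-indexed samples
    half = n_samples // 2
    result = [0j] * n_samples
    for k in range(half):
        twiddle = cmath.exp(-2j * math.pi * k / n_samples) * odd[k]
        result[k] = even[k] + twiddle
        result[k + half] = even[k] - twiddle
    return result


# Toy signal (assumed): a cosine that completes 2 cycles in 8 samples.
signal = [math.cos(2 * math.pi * 2 * n / 8) for n in range(8)]
slow = dft(signal)
fast = fft(signal)
```

Both lists hold the same values up to rounding error. The FFT gets there by splitting the problem in half at every level instead of evaluating the full double loop.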

 

Let us say that we have a sound.

Sound ====> She has gone

If we take the Fourier transform of this sound signal, it gives us the overall frequencies contained in the whole signal. In speech recognition, not all of these frequencies are helpful, because our aim is not to find out how many frequencies the speech signal contains.

We are only interested in which frequencies go with which part of the utterance:

                                     she has gone

                                     fff   fff    ffff

                                   Required frequencies

Secondly, we need time information along with these frequencies. The Fourier transform gives us the combined frequencies of the whole sound with no time information, that is, it does not tell us which frequency occurs at what time.

Frequency together with time information is very helpful for understanding which sound is spoken at which time.

 

STFT:

Short time means that instead of taking the Fourier transform of the whole sound signal, we just pick a piece of 10 ms duration from the above sound signal

                           S            H             E

                           |

                         10ms

and take the Fourier transform of it, so we get the frequencies in that region.

                            S             H

                               10ms

Then we take the next 10 ms, covering H, and take its Fourier transform, so we get the frequencies in that region, and likewise for E.

This process is called framing.
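As a rough sketch of framing, the function below cuts a list of samples into back-to-back 10 ms pieces. The 16 kHz sample rate and the function name are assumptions for illustration; the text does not say what sample rate is used.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Cut the signal into consecutive, non-overlapping frames of frame_ms."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    n_frames = len(samples) // frame_len        # drop any incomplete tail
    return [
        samples[i * frame_len:(i + 1) * frame_len]
        for i in range(n_frames)
    ]


# 100 ms of dummy samples at an assumed 16 kHz sample rate.
sample_rate = 16000
samples = list(range(1600))
frames = split_into_frames(samples, sample_rate)
```

With these numbers each frame holds 160 samples, and taking the Fourier transform of each frame separately is what gives us frequencies per time slot.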

Framing has some advantages and also some disadvantages.

The advantage is that we get some time information. The disadvantage is that when we cut out a sharp 10 ms piece, the signal is chopped abruptly at the frame edges, which causes what is called leakage.

      

diagram

 

These abrupt edges give us additional frequencies that are not really present in the speech, which we can call wrong frequencies.

To eliminate these wrong frequencies, that is, the sharp edges or discontinuities at the start and end of each piece of signal, windowing is performed.

In windowing, rather than 10 ms we take a 25 ms duration and take the Fourier transform of that segment. The frequencies found from this Fourier transform are then associated with the frame in the middle of the window.

The window duration is usually taken to be about double the frame duration. The basic purpose of windowing is to smooth the sharp edges: a tapered window makes the 25 ms segment fade out toward its ends instead of being cut off abruptly.

There is also a limit on the window length. If we make it much more than double the frame duration, the characteristics of neighbouring sounds are also counted in. So the window is usually taken as double the frame duration.

   Window duration  =  2 x Frame duration
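A minimal sketch of this idea, assuming a 16 kHz sample rate and a Hamming window (one common tapered window; the text does not name a specific one). Each 25 ms segment is multiplied by the window, and the start of the segment moves forward by one 10 ms frame at a time.

```python
import math


def hamming(length):
    """Hamming window: close to 1 in the middle, tapering to 0.08 at the ends."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def windowed_segments(samples, sample_rate, frame_ms=10, window_ms=25):
    """Take overlapping window_ms segments, one per frame_ms step."""
    hop = sample_rate * frame_ms // 1000       # step between segments
    win_len = sample_rate * window_ms // 1000  # samples per segment
    window = hamming(win_len)
    segments = []
    start = 0
    while start + win_len <= len(samples):
        chunk = samples[start:start + win_len]
        segments.append([s * w for s, w in zip(chunk, window)])
        start += hop
    return segments


# 100 ms of a constant signal, so the window shape is easy to see.
samples = [1.0] * 1600
segments = windowed_segments(samples, 16000)
```

Because the input here is constant, each segment is just the window shape itself: small at both edges and close to 1 in the middle, which is how the sharp cut is avoided.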

STFT:

This is the same Fourier transform, only taken over a short time.

Why do we take a short duration?

Speech is a continuously varying signal. If we do not consider time, we lose the continuous variation in the signal.

The frequencies we obtain are in hertz (Hz).
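To show where the Hz values come from, the sketch below takes one 25 ms segment, computes its magnitude spectrum with a direct DFT, and converts the strongest bin to hertz. The 16 kHz sample rate, the 1000 Hz test tone, and the helper names are all assumptions for illustration.

```python
import cmath
import math


def magnitude_spectrum(segment):
    """Magnitudes of DFT bins 0 .. N/2 (direct sum, fine for a short demo)."""
    n = len(segment)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(segment)
        )
        mags.append(abs(total))
    return mags


def bin_to_hz(k, sample_rate, n_samples):
    """Bin k of an N-point transform sits at k * sample_rate / N hertz."""
    return k * sample_rate / n_samples


sample_rate = 16000
# One 25 ms segment (400 samples) of a 1000 Hz sine tone.
segment = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(400)]
mags = magnitude_spectrum(segment)
peak_bin = max(range(len(mags)), key=lambda k: mags[k])
peak_hz = bin_to_hz(peak_bin, sample_rate, len(segment))
```

With 400 samples at 16 kHz the bins are 40 Hz apart, so the 1000 Hz tone lands on bin 25.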

     diagram

 

As we know, the human ear does not respond on the Hz scale. It responds on the mel scale.

Human ear working:

Suppose we start with the frequencies from 0 to 500 Hz, and call the change we perceive in the sound over this range (the change in its perceived pitch) L.

500 Hz === L

Secondly,

if we keep increasing the frequency beyond 500 Hz, equal increases in Hz do not give equal perceived changes. The next change of size L comes at some frequency above 500 Hz,

let us say

700 Hz

500 Hz == L

700 Hz == L

Similarly, the third change of size L comes at 1000 Hz:

500 Hz == L

700 Hz == L

1000 Hz == L

Suppose we plot this Fourier spectrum in hertz (Hz).

 

diagram

 

This shows the perceived change from 0 Hz to 500 Hz.

As frequency increases, sensitivity decreases:

Frequency (increase) = 1/Sensitivity (decrease)

At high frequencies the perceived change is not proportional to the change in Hz, so we convert it into something linear: we convert the Hz scale into the mel scale. What is the mel scale conversion? We simply map 500 Hz to a frequency f1, 700 Hz to a frequency f2, and 1000 Hz to a frequency f3.

On the Hz scale above, the equal perceived changes were not at equal distances.

On the mel scale, the mapping is done so that they are: equal perceived changes sit at equal distances.

So on the mel scale the perceived change is linearly proportional to f, and we can say that the mel scale is perceptually linear while the Hz scale is not.

How to get from the Hz spectrum to MFCC:

1: Take the short-time Fourier transform.

2: Map the spectrum, which is in hertz, onto the mel scale and take the log of it.

      Log(spectrum): when we take the log of the spectrum, the result is known as a log spectrum.

3: Take the discrete cosine transform (DCT) of the log(spectrum).

Then we get the MFCC.
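The steps above can be sketched for a single segment as follows. This is only an illustration under several assumptions that are not in the text: a 16 kHz sample rate, a 440 Hz test tone, a bank of 26 triangular filters to do the mel mapping (a common choice), no window applied, a direct DFT in place of an FFT, and an unscaled DCT-II. All function names are made up for this example.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(segment):
    """Step 1: squared DFT magnitudes for bins 0 .. N/2."""
    n = len(segment)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(segment))) ** 2
        for k in range(n // 2 + 1)
    ]


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2)
    # n_filters + 2 edge points, equally spaced in mel, converted back to Hz.
    edges_hz = [mel_to_hz(top_mel * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))  # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, n_out):
    """Step 3: unscaled DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n)
            for i, v in enumerate(values))
        for k in range(n_out)
    ]


def mfcc(segment, sample_rate, n_filters=26, n_coeffs=13):
    power = power_spectrum(segment)
    bank = mel_filterbank(n_filters, len(power), sample_rate)
    energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
    # Step 2: log of the mel-mapped spectrum (small floor avoids log(0)).
    log_energies = [math.log(e + 1e-10) for e in energies]
    return dct2(log_energies, n_coeffs)


sample_rate = 16000
segment = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(400)]
coeffs = mfcc(segment, sample_rate)
```

Here `coeffs` is a list of 13 numbers for this one segment. Running the same function on every windowed segment of an utterance gives one 13-value feature vector per frame.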

The coefficients in MFCC are mel-scale coefficients (MFCC stands for mel-frequency cepstral coefficients). As a standard, usually 13 of them are used.

The first 13 coefficients are taken.

When we take the log of a spectrum and then take the DCT of that log, the result is known as a cepstrum. Here it still describes the frequency components, but expressed as mel-frequency coefficients.

HMM:

We picked 13 features from the MFCC. It is assumed that these 13 features are the most discriminative ones. Take the example of a person's eyes, hands, etc.: they are features too, but they are not discriminative, meaning they do not help to tell one person from another. In the same way, speech has many more features, but we take only the 13 that are considered the most discriminative.

We pick these 13 features and train an HMM (hidden Markov model) on them. Basically, in the HMM approach a model is made for each phoneme.

For example:

For A, a thirteen-feature model is designed. By model we mean that for A the thirteen features have values such as 1,2,3,4,6---------13, as an example.

For B 2,4,8,9,11,13

1,2,3,4--------13

These are called model components.

As many phonemes as we have, that many models are made, and each model has 13 speech components.

In this way, each phoneme has a model.

When new test data arrives, 13 features are extracted from it and compared with these models. After the comparison, the HMM finds whether it is A, B, C, D, and so on.

Now, how does it compare?

In reality, we designed this model for one person and took the 13 features of that person.

If a second person says the same thing, say A, it will not be exactly the same. Suppose it comes out as:

0.1      1 2 3 4 5------13

0.2      2 4 6 8 10 10 11

0.4
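As a very simplified picture of the comparison step, the sketch below stores one 13-value vector per phoneme and picks the model closest to the test features by Euclidean distance. This is a stand-in chosen to keep the example short, not how an HMM actually scores data (an HMM works with probabilities over sequences of frames), and every number and name here is made up for illustration.

```python
import math

# Hypothetical 13-value models, one per phoneme (made-up numbers).
models = {
    "A": [float(i) for i in range(1, 14)],      # 1, 2, ..., 13
    "B": [float(2 * i) for i in range(1, 14)],  # 2, 4, ..., 26
}


def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_phoneme(features, models):
    """Return the name of the model nearest to the given features."""
    return min(models, key=lambda name: distance(features, models[name]))


# A second speaker saying A: close to model A, but not identical.
test_features = [v + 0.3 for v in models["A"]]
best = closest_phoneme(test_features, models)
```

Even though the second speaker's values are shifted, they are still much nearer to model A than to model B, so A is chosen.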

 

The mel scale, named by Stevens, Volkmann, and Newman in 1937, is a perceptual scale of pitches judged by listeners to be equal in distance from one another.

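Popular formula:

To convert f in Hz into m in mel:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

Let f = 20 Hz:

$$m = 2595 \log_{10}\left(1 + \frac{20}{700}\right)$$

$$m = 2595 \log_{10}\left(\frac{720}{700}\right) \approx 31.75$$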
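The same formula can be checked in a few lines of code. The inverse function `mel_to_hz` is not given in the text; it is obtained here by rearranging the formula, and the function names and sample mel values are assumptions for illustration.

```python
import math


def hz_to_mel(f_hz):
    """m = 2595 * log10(1 + f / 700), the formula given above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel, found by solving the formula for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# The worked example: 20 Hz comes out at roughly 31.75 mel.
m_20 = hz_to_mel(20.0)

# Equal steps on the mel scale correspond to growing steps in Hz.
mel_points = [500.0, 1000.0, 1500.0, 2000.0]
hz_points = [mel_to_hz(m) for m in mel_points]
hz_steps = [b - a for a, b in zip(hz_points, hz_points[1:])]
```

Each entry in `hz_steps` is larger than the one before it, which is the point of the mel scale: at higher frequencies a bigger change in Hz is needed for the same perceived change.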

 

