Speech recognition learning summary

2022-07-05 08:30:00 【... Manmu mountains and rivers】

Learning summary

After a semester of study, I have gained a basic understanding of speech recognition. I am writing this blog post to organize what I have learned. The content may be a little disorganized, but it is mainly written for myself, and I will keep updating it as my understanding improves.

Speech recognition process

1. Traditional speech recognition

First, a microphone picks up the sound. Sound is a wave that propagates through vibration, so the sound wave makes the microphone element vibrate with varying amplitude, which in turn produces a varying electrical current. This analog signal is then sampled and quantized into a digital signal: a one-dimensional sequence in the time domain, which can be plotted as a waveform. The computer processes this waveform, filters out irrelevant information, extracts the useful information, and produces a text sequence.

The human ear distinguishes sounds by their frequency content, and similar pronunciations can still produce very different waveforms, so it is hard to find pronunciation patterns directly in the waveform. The waveform therefore needs further processing: a Fourier transform converts the time-domain waveform into a frequency-domain representation (a spectrum), and the patterns are learned from these frequency-domain features. Because speech is a short-time stationary signal, it is split into short segments called frames before processing, and the state of the sound within one frame can be treated as constant.

The frames are then recognized as states, several states are combined into a phoneme, and the phonemes are combined into the pronunciation of a word. In Chinese speech recognition, for example, the phonemes correspond to the initials and finals of a character's syllable. Finally, the text is predicted from the word pronunciations, and the recognized text is spliced into a sentence, which completes the recognition of one utterance.
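To make the framing and Fourier transform steps concrete, here is a minimal sketch using only the Python standard library. The toy 1000 Hz tone, the 8 kHz sample rate, the 25 ms frame length, the 10 ms hop, the Hamming window, and all function names are my own assumptions for illustration; they do not come from any particular toolkit, and a real front end would use an FFT library and go on to compute features such as filter banks or MFCCs.

```python
import cmath
import math


def split_into_frames(samples, frame_len, hop_len):
    """Cut a 1-D signal into overlapping frames of frame_len samples."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


def hamming(frame):
    """Apply a Hamming window to reduce spectral leakage at the frame edges."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def magnitude_spectrum(frame):
    """Naive O(n^2) discrete Fourier transform; magnitudes for bins 0..n/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


# Toy input (an assumption): 0.1 s of a 1000 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
FRAME_LEN = 200   # 25 ms at 8 kHz
HOP_LEN = 80      # 10 ms at 8 kHz
signal = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(800)]

frames = split_into_frames(signal, FRAME_LEN, HOP_LEN)
spectrum = magnitude_spectrum(hamming(frames[0]))

# The strongest frequency bin of the first frame, converted to Hz.
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(frames), peak_hz)
```

The tone is hard to describe from the raw samples, but in the frequency domain it shows up as a single peak at 1000 Hz, which is the reason for moving to frequency-domain features.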
The traditional pipeline described above usually requires three separate models:
1. Acoustic model: recognizes each frame as a state, then combines three states into a phoneme.
2. Pronunciation model: combines phonemes into the pronunciation of the corresponding word.
3. Language model: predicts the corresponding text from the word pronunciations.
These three models are trained independently, and the training process is complicated, which makes speech recognition harder to get started with.
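The division of labor between the three models can be sketched with a toy example. Everything here is hypothetical: the frame-level states, the two-entry lexicon, and the bigram scores are hand-written lookup tables standing in for trained models, and I assume every syllable is exactly one initial plus one final. Real systems use statistical models and a search procedure rather than tables like these.

```python
import itertools

# Hypothetical pronunciation lexicon: (initial, final) -> candidate characters.
LEXICON = {
    ("n", "i"): ["你", "尼"],
    ("h", "ao"): ["好", "号"],
}

# Hypothetical bigram scores standing in for a trained language model.
BIGRAM_SCORE = {
    ("你", "好"): 0.9,
    ("尼", "好"): 0.05,
    ("你", "号"): 0.03,
    ("尼", "号"): 0.02,
}


def states_to_phonemes(frame_states):
    """Acoustic stage: merge repeated frame states, then group three
    states (indices 0, 1, 2) of the same phoneme into one phoneme."""
    merged = []
    for state in frame_states:
        if not merged or merged[-1] != state:
            merged.append(state)
    phonemes = []
    for i in range(0, len(merged), 3):
        group = merged[i:i + 3]
        names = {name for name, _ in group}
        indices = [idx for _, idx in group]
        if len(names) != 1 or indices != [0, 1, 2]:
            raise ValueError(f"incomplete phoneme at position {i}: {group}")
        phonemes.append(group[0][0])
    return phonemes


def phonemes_to_candidates(phonemes):
    """Pronunciation stage: look up each (initial, final) pair in the lexicon."""
    return [LEXICON[tuple(phonemes[i:i + 2])]
            for i in range(0, len(phonemes), 2)]


def pick_text(candidates):
    """Language-model stage: choose the character sequence with the best
    product of bigram scores (unseen pairs get a small default score)."""
    def score(seq):
        total = 1.0
        for prev, cur in zip(seq, seq[1:]):
            total *= BIGRAM_SCORE.get((prev, cur), 1e-6)
        return total
    return "".join(max(itertools.product(*candidates), key=score))


# One (phoneme, state_index) label per frame, as an acoustic model might emit.
frame_states = [
    ("n", 0), ("n", 0), ("n", 1), ("n", 2), ("n", 2),
    ("i", 0), ("i", 1), ("i", 1), ("i", 2),
    ("h", 0), ("h", 1), ("h", 2),
    ("ao", 0), ("ao", 0), ("ao", 1), ("ao", 2),
]

phonemes = states_to_phonemes(frame_states)
text = pick_text(phonemes_to_candidates(phonemes))
print(phonemes, text)
```

Note how the pronunciation "n i" alone is ambiguous between two characters; only the language-model stage, which looks at neighboring characters, resolves it.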

2. End-to-end speech recognition

In recent years, thanks to advances in neural networks, improvements in software and hardware, and the availability of large speech corpora, end-to-end systems have emerged. They are called end-to-end because, to simplify the pipeline, a single model converts speech directly into text. The general idea is to perform speech recognition with one jointly optimized model, which simplifies the training process. The model's input is speech and its output is the corresponding text, where the text units can be letters, subwords, or words. The main techniques used in end-to-end speech recognition include CTC (connectionist temporal classification), RNNs, and attention.
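As one illustration of how a single model's frame-level outputs become text, here is a minimal sketch of CTC greedy (best-path) decoding: take the most likely label in every frame, merge consecutive repeats, and drop the blank symbol. The label set and the per-frame probabilities are made up for the example, and greedy decoding is just the simplest choice; practical systems often use beam search instead.

```python
BLANK = "_"  # CTC blank symbol (the choice of "_" is an assumption)


def ctc_greedy_decode(frame_probs, labels):
    """Best-path CTC decoding.

    frame_probs: one list of probabilities per frame, aligned with labels.
    labels: the output symbols, including BLANK.
    """
    # Step 1: pick the most likely label in each frame.
    best_path = [labels[max(range(len(p)), key=p.__getitem__)]
                 for p in frame_probs]
    # Step 2: merge consecutive repeats, then remove blanks.
    decoded = []
    prev = None
    for sym in best_path:
        if sym != prev and sym != BLANK:
            decoded.append(sym)
        prev = sym
    return "".join(decoded)


labels = [BLANK, "a", "c", "t"]
# Hypothetical model output for 7 frames; the best path is "cc_aa_t".
frame_probs = [
    [0.10, 0.10, 0.70, 0.10],
    [0.20, 0.10, 0.60, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]

text = ctc_greedy_decode(frame_probs, labels)
print(text)
```

The blank is what lets the model output a genuinely doubled letter: the path "a_a" decodes to "aa", while "aa" with no blank in between collapses to a single "a".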
My next task is to read the code of the front-end beamformer, learn how to prepare multi-channel data, and build a baseline multi-channel speech recognition system.
This is my first article on CSDN. I am not yet proficient with Markdown, so the formatting is fairly simple and there is not much content yet. I will keep studying and keep writing on CSDN. Next time, I will write about how to prepare multi-channel data.
