Speech recognition learning summary
2022-07-05 08:30:00 【... Manmu mountains and rivers】
Learning summary
After a semester of study, I have gained a basic understanding of speech recognition. I am writing this blog post to organize what I have learned. The content may be a bit messy, but it is mainly written for myself, and I will keep updating it as my professional level improves.
Speech recognition process
1. Traditional speech recognition
First, a microphone picks up the sound. Sound is a wave that propagates through vibration, so sound waves make the microphone element vibrate. Vibrations of different amplitudes produce different current values, which gives an analog electrical signal. Sampling and quantizing that signal turns it into a digital signal: a one-dimensional sequence in the time domain that can be plotted as a waveform. The computer then processes the waveform, filters out useless information, extracts useful information, and produces a text sequence.
The human ear distinguishes sounds mainly by their frequency content, and the waveforms produced by similar pronunciations can still look very different, so it is hard to find pronunciation patterns directly in the waveform. The waveform therefore needs further processing: a Fourier transform converts the time-domain signal into a frequency-domain representation, and patterns are learned from the frequency-domain features. Because speech is a short-time stationary signal, it is cut into short segments called frames, and within one frame the state of the sound can be treated as unchanged.
Each frame is then recognized as a state, several states are combined into a phoneme, and phonemes are combined into the pronunciation of a word. In Chinese speech recognition, for example, the phonemes correspond to the initials and finals of a word. The pronunciation of each word is then used to predict the corresponding text, and the recognized text is spliced into a sentence, which completes the recognition of one utterance.
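As a minimal sketch of the framing and Fourier-transform steps described above, the following splits a signal into short overlapping frames and computes the magnitude spectrum of one frame. Everything specific here is an assumption made for illustration: the 8 kHz sampling rate, the synthetic 1 kHz tone, the 25 ms frame length, the 10 ms hop, the Hamming window, and the function names. It uses a naive DFT from the standard library so that it runs anywhere; in practice one would more likely use an FFT routine such as `numpy.fft.rfft`.

```python
import cmath
import math


def split_into_frames(samples, frame_len, hop_len):
    """Cut a 1-D sample list into overlapping frames of frame_len samples."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


def hamming(frame_len):
    """Hamming window: tapers the frame edges to reduce spectral leakage."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), fine for a demo)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


# Assumed demo values: 8 kHz sampling, a pure 1 kHz tone, 0.1 s long.
SAMPLE_RATE = 8000
TONE_HZ = 1000
signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE)
          for t in range(800)]

FRAME_LEN = 200  # 25 ms at 8 kHz (assumed)
HOP_LEN = 80     # 10 ms at 8 kHz (assumed)
frames = split_into_frames(signal, FRAME_LEN, HOP_LEN)
window = hamming(FRAME_LEN)

# Window the first frame, then look for the strongest frequency bin.
first = [s * w for s, w in zip(frames[0], window)]
spectrum = magnitude_spectrum(first)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(frames), peak_hz)
```

The time-domain samples of the tone change constantly, but in the frequency domain the frame reduces to a single dominant peak, which is why frequency-domain features are easier to learn from.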
Completing the traditional speech recognition process described above usually requires three independent models:
1. Acoustic model: recognizes each frame as a state, then combines three states into a phoneme.
2. Pronunciation model: combines phonemes into the pronunciation of the corresponding word.
3. Language model: predicts the corresponding text from the pronunciation of the words.
These three models are trained independently and the training process is complicated, which raises the barrier to entry for speech recognition.
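The data flow through the three models can be sketched with a toy example. This is only an illustration of how frame-level states become phonemes and then words: the lexicon entries, the frame labels, and the function names are all hypothetical, and the probabilistic scoring that real acoustic and language models perform is left out entirely (here each pronunciation maps to exactly one word).

```python
from itertools import groupby

STATES_PER_PHONEME = 3  # three states per phoneme, as in the list above

# Hypothetical pronunciation lexicon: phoneme sequence -> word.
# Initials and finals are used as phonemes; tones are ignored.
LEXICON = {
    ("n", "i"): "你",
    ("h", "ao"): "好",
}


def states_to_phonemes(frame_states):
    """Collapse per-frame (phoneme, state_index) labels into phonemes.

    Consecutive identical labels are merged, and a phoneme is emitted
    each time its last state is reached.
    """
    merged = [label for label, _ in groupby(frame_states)]
    return [ph for ph, idx in merged if idx == STATES_PER_PHONEME - 1]


def phonemes_to_words(phonemes):
    """Greedy longest-match lookup of phoneme runs in the toy lexicon."""
    words, i = [], 0
    max_len = max(len(key) for key in LEXICON)
    while i < len(phonemes):
        for size in range(min(max_len, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + size])
            if key in LEXICON:
                words.append(LEXICON[key])
                i += size
                break
        else:
            raise ValueError(f"no lexicon entry starting at phoneme {i}")
    return words


# Made-up frame labels, as if produced by an acoustic model.
frame_states = [
    ("n", 0), ("n", 0), ("n", 1), ("n", 2), ("n", 2),
    ("i", 0), ("i", 1), ("i", 1), ("i", 2),
    ("h", 0), ("h", 1), ("h", 2),
    ("ao", 0), ("ao", 0), ("ao", 1), ("ao", 2),
]

phonemes = states_to_phonemes(frame_states)
words = phonemes_to_words(phonemes)
print(phonemes, "".join(words))
```

In a real system many words share the same pronunciation, and choosing among them is exactly the job of the language model that this sketch omits.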
2. End-to-end speech recognition
In recent years, thanks to the development of neural networks, improvements in software and hardware, and the availability of large speech corpora, end-to-end systems have emerged. To simplify the pipeline, such a system converts speech directly into text within a single model, which is why it is called end-to-end. The general idea is to use one jointly optimized model for speech recognition, which simplifies the training process: the input of the model is speech and the output is the corresponding text, where the text units can be letters, subwords, or words. The main techniques used in end-to-end speech recognition include CTC, RNNs, and attention.
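To give a feel for how one model can map frames straight to text, here is a minimal sketch of greedy (best-path) CTC decoding: pick the most likely symbol in every frame, merge consecutive repeats, then drop the blank symbol. The alphabet, the `_` blank symbol, the probability values, and the function name are assumptions made up for this example; a real system would get the per-frame probabilities from a trained network and would often use beam search instead of greedy decoding.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this example


def ctc_greedy_decode(frame_probs, alphabet):
    """Greedy CTC decoding.

    frame_probs: one probability list per frame, aligned with alphabet.
    Steps: argmax per frame, merge consecutive repeats, remove blanks.
    """
    best_path = [alphabet[max(range(len(p)), key=p.__getitem__)]
                 for p in frame_probs]
    collapsed = [sym for sym, _ in groupby(best_path)]
    return "".join(sym for sym in collapsed if sym != BLANK)


ALPHABET = [BLANK, "a", "b"]

# Made-up per-frame probabilities whose best path is: a a _ a b b
frame_probs = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.30],
    [0.10, 0.20, 0.70],
    [0.20, 0.10, 0.70],
]

print(ctc_greedy_decode(frame_probs, ALPHABET))
```

The blank between the second and third frames is what lets the output keep two separate "a" symbols; without it the repeated frames would merge into one.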
My next task is to read the code of the front-end beamformer, learn how to prepare multi-channel data, and build a baseline multi-channel speech recognition system.
This is my first time writing an article on CSDN. I am not yet proficient with Markdown, so the typesetting is fairly simple and the content is still short. I will keep studying and keep writing on CSDN. Next time, I will write about the preparation process for multi-channel data.