Speech recognition learning summary
2022-07-05 08:30:00 【... Manmu mountains and rivers】
Learning summary
After a semester of study, I have a basic understanding of the field of speech recognition. I am writing this blog post to sort out what I have learned. The content may be messy, but I am writing it mainly for myself, and I will keep updating it to improve my professional level.
Speech recognition process
1. Traditional speech recognition
First, the sound is captured by a microphone. Sound is a wave that propagates by vibration, so sound waves make the microphone element vibrate with varying amplitude, which produces a varying electrical current. This analog signal is then sampled and converted into a digital signal: a one-dimensional sequence in the time domain that can be plotted as a waveform. The computer then processes the waveform, filters out useless information, extracts the useful information, and produces a text sequence.
The human ear distinguishes sounds by their frequency content, and similar pronunciations can still produce very different waveforms, so it is hard to find pronunciation patterns directly in the waveform. The waveform therefore needs further processing: a Fourier transform converts the time-domain signal into a frequency-domain representation, and the frequency-domain features are then processed to learn patterns from them.
Because speech is a short-time stationary signal, it is divided into small segments for processing. Each segment is a frame, and the state of the sound can be treated as unchanged within one frame. The frames are recognized as states, several states are combined into a phoneme, and the phonemes are combined into the pronunciation of a word. In Chinese speech recognition, for example, the phonemes correspond to the initials and finals of a character. The pronunciation is then used to predict the corresponding text, and the recognized text is spliced into a sentence, which completes the recognition of one utterance.
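The framing and Fourier-transform steps described above can be sketched in a few lines. This is only an illustration under assumed values: the 8000 Hz sample rate, the 1000 Hz test tone, the 25 ms frame length, the 10 ms hop, and the Hamming window are all my own choices, and the function names are hypothetical. A real front end would use an FFT library instead of this naive DFT.

```python
import cmath
import math


def split_into_frames(signal, frame_len, hop_len):
    """Cut a 1-D signal into overlapping frames of frame_len samples."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append(signal[start:start + frame_len])
    return frames


def magnitude_spectrum(frame):
    """Naive DFT of one frame; returns magnitudes for bins 0..N/2."""
    n = len(frame)
    # A Hamming window (assumed choice) reduces leakage at the frame edges.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc))
    return spectrum


# Toy input: 0.1 s of a 1000 Hz sine sampled at 8000 Hz (assumed values).
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(800)]

# 25 ms frames with a 10 ms hop are 200 and 80 samples at 8 kHz.
frame_len = 200
hop_len = 80
frames = split_into_frames(signal, frame_len, hop_len)

# Find the strongest frequency bin in the first frame.
spectrum = magnitude_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
peak_hz = peak_bin * sample_rate / frame_len
print(len(frames), peak_bin, peak_hz)
```

In the time domain the frame is just 200 oscillating numbers, but in the frequency domain the energy concentrates in the bin that matches the tone, which is why features are usually taken from the spectrum.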
Completing the traditional speech recognition process above usually requires three separate models:
1. Acoustic model: recognizes each frame as the corresponding state, then combines three states into a phoneme
2. Pronunciation model: combines phonemes into the pronunciation of the corresponding word
3. Language model: predicts the corresponding text from the pronunciation of the word
These three models are trained independently and the training process is complicated, which raises the barrier to entry for speech recognition.
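To make the hand-off between the three models concrete, here is a toy sketch of the pipeline. Everything in it is hypothetical: the state labels, the lexicon, the candidate characters, and the scores are made up for illustration, and plain table lookups stand in for what would be trained probabilistic models in a real system.

```python
from itertools import groupby, product

# Hypothetical per-frame state labels, as an acoustic model might emit them.
frame_states = ["n1", "n1", "n2", "n3", "n3", "i1", "i2", "i2", "i3",
                "h1", "h2", "h2", "h3", "ao1", "ao2", "ao3", "ao3"]

# Hypothetical lexicon: (initial, final) -> syllable.
LEXICON = {("n", "i"): "ni", ("h", "ao"): "hao"}

# Hypothetical homophone candidates for each syllable.
CANDIDATES = {"ni": ["你", "尼"], "hao": ["好", "号"]}

# Hypothetical language-model scores for two-character sequences.
BIGRAM_SCORE = {("你", "好"): 0.9, ("你", "号"): 0.05,
                ("尼", "好"): 0.04, ("尼", "号"): 0.01}


def states_to_phonemes(states):
    """Acoustic side: merge repeated frames, then join three states into a phoneme."""
    collapsed = [s for s, _ in groupby(states)]
    phonemes = []
    for i in range(0, len(collapsed), 3):
        first_state = collapsed[i]
        phonemes.append(first_state[:-1])  # "n1" -> "n", "ao1" -> "ao"
    return phonemes


def phonemes_to_syllables(phonemes):
    """Pronunciation side: pair an initial and a final into a syllable."""
    return [LEXICON[(phonemes[i], phonemes[i + 1])]
            for i in range(0, len(phonemes), 2)]


def pick_text(syllables):
    """Language side: choose the candidate sequence with the highest toy score."""
    options = product(*(CANDIDATES[s] for s in syllables))
    best = max(options, key=lambda seq: BIGRAM_SCORE.get(seq, 0.0))
    return "".join(best)


phonemes = states_to_phonemes(frame_states)
syllables = phonemes_to_syllables(phonemes)
text = pick_text(syllables)
print(phonemes, syllables, text)
```

Each stage only sees the output of the previous one, which is also why the three models can be trained separately and why errors made early in the chain are hard to correct later.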
2. End-to-end speech recognition
In recent years, thanks to the development of neural networks, improvements in software and hardware, and the availability of large speech corpora, end-to-end systems have emerged. To simplify the pipeline, speech is converted directly into text within a single model, which is why such a system is called end-to-end. The general idea is to use one jointly optimized model for speech recognition, which simplifies the training process. The input of the model is speech and the output is the corresponding text, which can be letters, subwords, or words. The main techniques used in end-to-end speech recognition include CTC, RNN, and Attention.
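As one small example of how an end-to-end model goes from frames straight to text, below is a sketch of greedy CTC decoding: take the most likely label in each frame, merge consecutive repeats, and drop the blank symbol. The label set, the blank symbol "_", and the probabilities are made up for illustration. This covers only the decoding rule, not the network or the CTC training loss.

```python
BLANK = "_"
LABELS = [BLANK, "c", "a", "t"]  # hypothetical output alphabet


def ctc_greedy_decode(frame_probs, labels=LABELS, blank=BLANK):
    """Pick the best label per frame, merge repeats, then remove blanks."""
    best_path = [labels[max(range(len(p)), key=p.__getitem__)]
                 for p in frame_probs]
    decoded = []
    previous = None
    for label in best_path:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# Made-up per-frame probabilities over LABELS; the best path is "cc_aat_".
frame_probs = [
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.90, 0.05, 0.03, 0.02],
]
print(ctc_greedy_decode(frame_probs))
```

The blank is what lets the model output a repeated letter: the path "a_a" decodes to "aa", while "aa" without a blank in between decodes to a single "a".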
The next task is to read the code of the front-end beamformer, learn how to prepare multi-channel data, and build a baseline multi-channel speech recognition system.
This is my first article on CSDN. I am not yet proficient with Markdown, so the formatting is simple and there is not much content. I will keep studying and keep writing on CSDN. Next time I will write about the preparation process for multi-channel data.