Industry insight | Does speech recognition really surpass the human ear?
2022-07-28 03:09:00 【Magic Data】

In recent years, advances in artificial intelligence have significantly improved the performance of speech recognition. Many companies claim that their speech recognition accuracy has reached 98% or higher. But does speech recognition really outperform the human ear?
That conclusion does not follow. After all, the human brain is the most precise instrument in the world. As a popular saying on the Internet puts it, "quoting accuracy without naming the test set is cheating." In a quiet environment, recognition accuracy is about 98%, but in a noisy environment it drops rapidly.
At a party, for example, it is hard for a speech recognition system to pick out the target speaker from overlapping speech, and harder still to recognize that speech accurately. This is a classic problem in the field of speech recognition: the Cocktail Party Problem. Picking out the voice you want to attend to from a mix of sounds is human instinct. For machines, however, it is a "disaster": the target speech must first be isolated with speech separation technology before it can be recognized.
Speech separation algorithms based on neural networks
Speech separation is the first step toward solving the "cocktail party" problem in speech recognition. Adding speech separation to the front end of a speech recognition system, so that the target speaker's voice is separated from other interference, improves the system's robustness. The cocktail party problem refers to the situation where the captured audio contains, besides the main speaker, interference from other people's voices and from noise. The goal of speech separation is to separate the main speaker's speech from this interference.
Today's mainstream speech separation algorithms are based on neural networks. The network's main job is to learn an ideal binary mask (Ideal Binary Mask, IBM), which determines in which time-frequency units (Time-Frequency Units) of the spectrogram the target signal dominates. If an audio signal is represented along two dimensions, time and frequency, it can be written as a two-dimensional matrix, and each element of that matrix is called a time-frequency unit. If the target signal does not need to be resolved more finely than an either/or decision (the unit belongs either to the target sound source or to the background noise), each time-frequency unit can be quantized to 2 values, such as 0 and 1, which is why the mask is called binary. Seen through the ideal binary mask, speech separation becomes a supervised learning (Supervised Learning) classification problem.
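To make the ideal binary mask concrete, here is a minimal sketch in Python with NumPy. It assumes the clean target and the interference are both available separately, which is only true when preparing training labels. The helper names `stft_mag` and `ideal_binary_mask`, the 0 dB threshold, the frame sizes, and the two sine tones standing in for speech are all illustrative assumptions, not something the article specifies.

```python
import numpy as np

def stft_mag(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def ideal_binary_mask(target_mag, interference_mag, threshold_db=0.0):
    """1 where the target exceeds the interference by threshold_db, else 0."""
    eps = 1e-12  # avoids division by zero and log of zero
    local_snr_db = 20.0 * np.log10((target_mag + eps) / (interference_mag + eps))
    return (local_snr_db > threshold_db).astype(np.float32)

# Toy stand-ins for speech: a 440 Hz "target" and a 2000 Hz "interferer".
sr = 8000
t = np.arange(sr) / sr
target = np.sin(2 * np.pi * 440 * t)
interference = 0.8 * np.sin(2 * np.pi * 2000 * t)
mixture = target + interference

# The mask is the label a supervised network would be trained to predict
# from the mixture alone; each entry is one time-frequency unit.
mask = ideal_binary_mask(stft_mag(target), stft_mag(interference))

# Applying the mask keeps only the units where the target dominates.
masked_mixture = mask * stft_mag(mixture)
```

In a real system the network only sees the mixture and has to predict `mask`; the mask computed here from the separated sources would serve as the training target for that classification.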
Speech separation algorithms based on multimodal fusion
Besides the audio-only separation described above, many recent papers tackle the cocktail party problem with multimodal methods. Google collected 100,000 high-quality lecture and talk videos from YouTube to generate training samples and, by analyzing about 2,000 hours of video clips, trained a model based on a multi-stream convolutional neural network (CNN) that splits synthetic cocktail party segments into a separate audio stream for each speaker in the video. In the experiments, the input is a video with one or more speakers whose speech is disturbed by other speakers or by a noisy background. The output decomposes the input video's audio track into clean audio tracks, each matched to its speaker.
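One common way to build synthetic cocktail party segments is to add clean recordings together at a controlled signal-to-noise ratio (SNR). The sketch below shows that idea only; it is not Google's actual pipeline. The function name `mix_at_snr`, the 5 dB setting, and the random-noise arrays standing in for real speech are assumptions made for illustration.

```python
import numpy as np

def mix_at_snr(target, interference, snr_db):
    """Scale the interference so the target-to-interference power ratio
    equals snr_db, then add the two signals."""
    target_power = np.mean(target ** 2)
    interference_power = np.mean(interference ** 2)
    gain = np.sqrt(target_power / (interference_power * 10 ** (snr_db / 10)))
    return target + gain * interference, gain

rng = np.random.default_rng(0)
speaker_a = rng.standard_normal(16000)  # stand-in for clean speech of speaker A
speaker_b = rng.standard_normal(16000)  # stand-in for a competing speaker
mixture, gain = mix_at_snr(speaker_a, speaker_b, snr_db=5.0)
```

Because the clean sources are known when the mixture is built, they can serve as ground truth for training a separation model.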
Whether multimodal or unimodal, speech separation algorithms cannot do without speech data, and multi-speaker speech data is expensive to collect and difficult to annotate. As a world-leading AI data service provider, Magic Data can supply algorithm engineers with a wide range of high-quality data, providing a testbed for solving the cocktail party problem.
Edward Colin Cherry wrote in his 1957 book On Human Communication: "So far, no machine algorithm can solve the 'cocktail party' problem." Surprisingly, this assertion has still not been completely overturned.