Industry insight | Is speech recognition really better than the human ear?
2022-07-28 03:09:00 【Magic Data】

In recent years, advances in artificial intelligence have significantly improved the performance of speech recognition. Many companies claim that the accuracy of their speech recognition technology has reached 98% or higher. Is speech recognition really better than the human ear?
Of course, that conclusion cannot be drawn. After all, the human brain is the most precise instrument in the world. As a popular saying on the Internet puts it, "quoting accuracy without naming the test set is just bluffing." In a quiet environment, recognition accuracy is about 98%, but in a noisy environment it drops rapidly.
At a party, it is difficult for a speech recognition system to pick out the target speaker's speech from overlapping speech, let alone recognize it accurately. This is a classic problem in the field of speech recognition: the cocktail party problem (Cocktail Party Problem). Picking out the voice you want to attend to from a mix of sounds is instinctive for humans. For machines, however, it is overwhelming: the target speech must first be isolated with speech separation technology before it can be recognized.
Speech separation algorithms based on neural networks
Speech separation is the first step in solving the "cocktail party" problem in speech recognition. Adding speech separation to the front end of a speech recognition system, so that the target speaker's voice is separated from other interference, improves the system's robustness. The cocktail party problem refers to the situation in which the captured audio signal contains, besides the main speaker, interference from other people's voices and from noise. The goal of speech separation is to separate the main speaker's speech from this interference.
At present, mainstream speech separation algorithms are based on neural networks. The network's main task is to learn to estimate an ideal binary mask (Ideal Binary Mask, IBM), which marks the time-frequency units (Time-frequency Units) of the spectrogram in which the target signal dominates. If an auditory signal is represented along the two dimensions of time and frequency, it can be written as a two-dimensional matrix, and each element of that matrix is called a time-frequency unit. If the target signal does not need to be described in fine detail, and an either-or decision is enough (the unit belongs either to the target source or to the background noise), then each time-frequency unit can be quantized to one of two values, such as 0 and 1. This is what "binary" means. Seen through the ideal binary mask, the problem becomes a supervised learning (Supervised Learning) classification problem.
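The ideal binary mask described above can be sketched in a few lines. This is only a minimal illustration, assuming NumPy, toy magnitude spectrograms, and a 0 dB local criterion; the function name `ideal_binary_mask` and the parameter `lc_db` are hypothetical and not taken from any particular library.

```python
import numpy as np

def ideal_binary_mask(target_mag, interference_mag, lc_db=0.0):
    """Return 1.0 where the target dominates a time-frequency unit, else 0.0.

    lc_db is an assumed local criterion: the target must exceed the
    interference by more than this many dB for the unit to be kept.
    """
    eps = 1e-12  # avoids division by zero and log of zero
    local_snr_db = 20.0 * np.log10((target_mag + eps) / (interference_mag + eps))
    return (local_snr_db > lc_db).astype(np.float32)

# Toy magnitude spectrograms: rows are frequency bins, columns are time frames.
target = np.array([[4.0, 0.1, 3.0],
                   [0.2, 0.1, 2.5]])
interference = np.array([[1.0, 2.0, 0.5],
                         [1.5, 0.4, 0.3]])

mask = ideal_binary_mask(target, interference)

# Simplification: treat the mixture magnitude as the sum of the two magnitudes.
mixture = target + interference
separated = mask * mixture  # units dominated by interference are zeroed out
print(mask)
```

Computing this mask requires the clean target and the interference separately, so it is only available when the mixture is built from known sources. That is why it works as a training label: the network sees only the mixture and is trained to predict the 0/1 value of each time-frequency unit.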
Speech separation algorithms based on multimodal fusion
Besides speech separation from audio alone, many recent papers tackle the cocktail party problem with multimodal methods. Google collected 100,000 high-quality lecture and talk videos from YouTube to generate training samples. Using about 2,000 hours of video clips, it trained a model based on a multi-stream convolutional neural network (CNN) that splits synthetic cocktail party mixtures into a separate audio stream for each speaker in the video. In the experiments, the input is a video in which one or more people are speaking while being disturbed by other speakers or a noisy background. The output decomposes the audio track of the input video into clean audio tracks, each assigned to the corresponding speaker.
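A synthetic cocktail party mixture of the kind mentioned above is commonly built by adding an interfering signal to clean target speech at a chosen signal-to-noise ratio. The sketch below shows that idea under stated assumptions: it uses NumPy, random noise stands in for real recordings, and `mix_at_snr` is a hypothetical helper. It is not a description of the exact procedure used in the work cited here.

```python
import numpy as np

def mix_at_snr(target, interference, snr_db):
    """Scale `interference` so the target-to-interference power ratio equals snr_db.

    Returns the mixture and the scale factor applied to the interference.
    """
    target_power = np.mean(target ** 2)
    interference_power = np.mean(interference ** 2)
    # Solve target_power / (scale**2 * interference_power) = 10 ** (snr_db / 10)
    scale = np.sqrt(target_power / (interference_power * 10.0 ** (snr_db / 10.0)))
    return target + scale * interference, scale

# Stand-ins for one second of audio at 16 kHz (assumed sample rate).
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interference = rng.standard_normal(16000)

mixture, scale = mix_at_snr(target, interference, snr_db=5.0)

# Check the ratio that was actually achieved.
achieved_snr_db = 10.0 * np.log10(
    np.mean(target ** 2) / np.mean((scale * interference) ** 2)
)
print(round(achieved_snr_db, 3))
```

Because the clean target is known for every mixture built this way, each synthetic sample comes with its own ground truth for training and evaluation.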
Whether multimodal or unimodal, speech separation algorithms depend on speech data, and multi-speaker speech data is costly to collect and difficult to annotate. Magic Data, as a world-leading AI data service provider, can supply algorithm engineers with a large amount of high-quality data, providing the experimental material for solving the cocktail party problem.
Edward Colin Cherry wrote in his 1957 book On Human Communication: "So far, no machine algorithm can solve the 'cocktail party' problem." Surprisingly, that assertion has still not been completely overturned.