Expert Face-to-Face | Dr. Chen Guoguo talks about intelligent voice
2022-06-24 03:10:00 【Dark Blue College】
Intelligent voice has been a hot topic in recent years, and its commercial applications keep growing. At the Dark Blue College face-to-face event on October 10, we invited Dr. Chen Guoguo, a leading figure in the speech industry, to share and discuss current problems in the speech field.
Contents: 1. Guest introduction 2. Livestream highlights 3. Selected audience questions
1. Guest introduction
Chen Guoguo is a co-founder of Seasalt.AI. He holds a PhD from Johns Hopkins University and is an alumnus of Tsinghua University.
He studied under Daniel Povey, the creator of Kaldi, the most popular open-source speech recognition toolkit, and Professor Sanjeev Khudanpur of the Human Language Technology Center of Excellence (HLTCOE) and the Center for Language and Speech Processing (CLSP) at Johns Hopkins. His main research areas are speech recognition and keyword retrieval.
During his PhD he developed for Google the prototype of its wake word "Okay Google", which has since been used on hundreds of millions of Android devices. He also took part in developing the open-source speech recognition system Kaldi and the open-source neural network toolkit CNTK.
After graduating, he co-founded KITT.AI, which focused on research and development of voice wake-up and natural voice interaction technology. The company was selected by CB Insights for its first AI 100 list. After KITT.AI was acquired by Baidu in 2017, he joined Baidu's Duer business unit as chief architect. He left Baidu in 2020 and co-founded Seasalt.AI and Vobil.com, which focus on enterprise services related to speech recognition and natural language processing. Also in 2020 he launched the volunteer organization SpeechColab, which published the GigaSpeech dataset. It includes 10,000 hours of labeled English speech recognition data, plus 33,000 hours of English speech data for semi-supervised and unsupervised training.
2. Livestream highlights
1. Current progress in the speech field (including speech recognition and wake-up), and the difficulties encountered in real-world deployment
Recognition and wake-up have developed quite differently. On wake-up: when I was first at Google, I built a DNN-based wake-up engine and then deployed it on Android phones. There was relatively little work on wake-up at the time, and there were many challenges in getting it to work, such as how to reduce the probability of false wake-ups. After so many years of development, however, wake-up is now quite mature.
First, the accuracy is very good: even when the device is woken frequently, false wake-ups can be kept to a very low probability. Second, hardware power consumption keeps falling. Our early work needed a mobile phone or a high-performance chip; today, battery-powered low-power devices can keep the wake-up function running normally. So, personally, I think wake-up has become very mature.
Speech recognition is similar: if you look at the past ten years, you will find that progress has been very fast.
I started my PhD in 2010. At that time, companies such as Google and Microsoft had some products, but speech recognition quality was still very poor. As for jobs, after the 2008 financial crisis, openings in this area were in relatively short supply. But since Siri's arrival around 2012, there have been more and more jobs, many companies have invested more and more, and recognition quality has improved very quickly. This has led to the view that speech recognition is a solved problem, because in many scenarios it already achieves high accuracy.
But if you dig deeper, you will find that speech recognition still has many challenging tasks. First, in terms of accuracy: in noisy scenes such as parties, recognition quality is still very poor, and the output is often laughably wrong. Second, in terms of computing resources and deployment: how do we protect users' privacy? When applying large models, how do we prune them to fit on small chips while still guaranteeing good accuracy? How do we collect data back from the field and iterate on it? These are all unsolved problems.
Overall, I think wake-up is basically a solved problem, while optimizing speech recognition for complex scenes and porting it to low-power devices still leave plenty worth studying and discussing.
2. When intelligent voice is deployed on embedded devices, what special considerations are there compared with the server side?
I think data backflow is a headache. When users' data can be sent back to the server, we can train the model iteratively while protecting users' privacy, so the model keeps getting better.
But if the model is deployed on a low-power chip, the data is hard to send back to the server, which means it is hard for the vendor to obtain data for training. This is a big challenge: how do you improve the model when the data cannot be returned? Personally, I think federated learning is a good direction, but the work done so far is not yet mature.
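To make the federated learning idea concrete, here is a minimal sketch of the weighted parameter averaging used in federated averaging. It is an illustration under stated assumptions, not something described in the talk: the function name `federated_average`, the flat list-of-floats parameter format, and the use of per-device sample counts as weights are all hypothetical choices. The point is only that devices send back model parameters rather than raw audio.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Combine locally trained parameters without seeing any raw user data.

    client_weights: one flat parameter vector per device (hypothetical format).
    client_sizes:   number of local training samples behind each vector.
    """
    total = sum(client_sizes)
    if total == 0:
        raise ValueError("at least one client must contribute samples")
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same model shape")
        # Devices with more local data get proportionally more influence.
        for k in range(dim):
            averaged[k] += weights[k] * size / total
    return averaged


if __name__ == "__main__":
    # Two devices: the second one trained on three times as much data.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In a real system each round would also involve pushing the averaged model back to devices and training locally again; that loop is omitted here.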
The second is power consumption. On-device resources are usually limited, and sometimes the device runs on battery, so low-power products are naturally preferred, and that takes a lot of work. For example, to implement wake-up on a headset, we need to prune and compress the model and optimize for the instruction set; another example is implementing operations such as the FFT with efficient assembly instructions. Personally, I think that although deploying speech recognition on embedded devices still faces problems such as non-uniform standards, it remains a trend in the development of this technology.
3. What advice do you have for researchers working in the speech field and for students still in school?
The speech field is developing very fast, and knowledge turns over quickly. I think that for a student, being able to build a usable speech recognition system is very valuable training.
My advice to students is to get involved in more practical work, which helps a great deal with both job hunting and research. Don't limit yourself to ad hoc tweaks on a few datasets and open-source solutions, because these things are often hard to put into production.
For example, in the Kaldi work we did a lot of parallelization to make the system more practical and usable. We also noticed a problem: the current difference between companies and schools is that companies have large amounts of computing resources and data, while the lack of resources at schools may leave students unable to carry out research. So we are also trying to address this through work such as GigaSpeech.
4. An introduction to the GigaSpeech speech recognition dataset
In fact, back at Baidu we originally wanted to build a large open-source Chinese speech dataset, but for various reasons it was never released. So later we decided to build a more general dataset with our partners, and to ensure its generality we chose English as the language of the dataset.
Why did we want to do GigaSpeech?
One reason is "performance": speech recognition algorithms have already been optimized very well on the LibriSpeech dataset, and recognition accuracy there is very high, so we hope to provide a new dataset as a training and testing option. The second reason is that in recent years industry has tended to train on large-scale in-house datasets, while the datasets used by academia are smaller. Our original intention was also to provide academia and industry with a large-scale, open-source dataset, such as GigaSpeech, that has already been sufficiently optimized.
How was this dataset built?
The first step is collecting data. At first we wanted to extract audio and the corresponding text from podcasts, but there were not enough podcast sources, so we also took a lot of data from audiobooks; another source is a variety of YouTube videos. Our requirement for the audio is that it come with manually produced text; if the text was automatically generated by an algorithm, we filter it out.
The second step is text normalization, for example adjusting letter case, removing special characters, and converting numbers to words.
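The three normalization operations named above can be sketched in a few lines of Python. This is a minimal illustration, not the GigaSpeech pipeline itself: the function names, the choice of upper case as the target, the set of characters kept, and the restriction to integers below 1000 are all assumptions made for the example.

```python
import re

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (illustrative range only)."""
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("" if ones == 0 else " " + _ONES[ones])
    hundreds, rest = divmod(n, 100)
    words = _ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def normalize_text(text: str) -> str:
    """Apply number spelling, case folding, and special-character removal."""
    # Spell out standalone numbers of up to three digits; longer digit
    # strings are left untouched in this sketch.
    text = re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)
    # Fold case (upper case is an assumption of this example).
    text = text.upper()
    # Keep letters, digits, apostrophes, and spaces; replace the rest.
    text = re.sub(r"[^A-Z0-9' ]+", " ", text)
    # Collapse repeated whitespace.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_text("Hello, World! I have 42 apples."))
    # prints: HELLO WORLD I HAVE FORTY TWO APPLES
```

A production normalizer would also need to handle ordinals, decimals, years, currency, and abbreviations, which is where most of the effort in this step usually goes.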
The third step is forced alignment. This is very important, because much of the audio is not fully aligned with its subtitles. The approach we eventually adopted was to concatenate the audio and the text separately and then run forced alignment, so as to obtain a timestamp for each word.
The fourth step is segmentation. For example, if silence exceeds a certain duration, or there is a sudden pause in the middle of speech, we split the sentence there. We also remove abnormal data, such as single sentences that run too long or audio that is too noisy.
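The segmentation rule can be illustrated on top of the word-level timestamps produced by forced alignment. The sketch below is an assumption-laden example rather than the project's actual code: the `(word, start, end)` tuple format, the function name `split_on_silence`, and the default thresholds of 0.5 seconds of silence and 20 seconds of maximum duration are all hypothetical. Noise-based filtering is not shown.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (word, start_sec, end_sec), assumed format


def split_on_silence(words: List[Word],
                     max_silence: float = 0.5,
                     max_duration: float = 20.0) -> List[List[Word]]:
    """Cut a word sequence at long pauses and drop overlong segments."""
    segments: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Gap between the end of the previous word and the start of this one.
        if current and word[1] - current[-1][2] > max_silence:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    # Discard abnormal segments whose total duration is too long.
    return [seg for seg in segments if seg[-1][2] - seg[0][1] <= max_duration]


if __name__ == "__main__":
    aligned = [("hello", 0.0, 0.4), ("world", 0.5, 0.9),
               ("good", 2.0, 2.3), ("morning", 2.4, 2.9)]
    for segment in split_on_silence(aligned):
        print(" ".join(w for w, _, _ in segment))
```

The thresholds interact: a smaller `max_silence` yields shorter segments, so fewer are discarded by the duration filter.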
The fifth step is verification. After forced alignment with a simple decoder, many sentences contain errors, for example at the pauses around filler words, or with phrases such as "I mean" and "you know", where the transcription may be wrong. So we later applied a decoding graph of our own design. Its advantage is that during forced alignment it can allow some predefined filler words and garbage words. When the final decoding result matches the reference, we keep the utterance.
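The keep-or-drop decision can be approximated after the fact by comparing token sequences. The sketch below is a deliberate simplification: the source applies the filler allowance inside a decoding graph, whereas this example only checks whether the decoded tokens equal the reference once extra filler tokens are ignored. The function name and the filler list are hypothetical.

```python
from typing import Iterable, List, Set

# Hypothetical filler inventory; the real predefined list is not given.
FILLERS: Set[str] = {"UH", "UM", "HMM"}


def matches_allowing_fillers(reference: List[str],
                             hypothesis: Iterable[str],
                             fillers: Set[str] = FILLERS) -> bool:
    """True if the hypothesis equals the reference plus optional fillers."""
    i = 0  # index of the next reference token still to be matched
    for token in hypothesis:
        if i < len(reference) and token == reference[i]:
            i += 1
        elif token not in fillers:
            # An extra token that is not an allowed filler: reject.
            return False
    # Every reference token must have been matched, in order.
    return i == len(reference)


if __name__ == "__main__":
    ref = ["I", "THINK", "SO"]
    print(matches_allowing_fillers(ref, ["UH", "I", "THINK", "UM", "SO"]))
    print(matches_allowing_fillers(ref, ["I", "THINK", "NOT"]))
```

Utterances for which the check fails would be dropped from the training set rather than corrected, which trades some data volume for cleaner transcripts.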
The next task is evaluation. We process the manually annotated test set, analyze the classification results at the frame level, and tune the parameters. To make sure we end up with 10,000 hours of usable data, we need to keep the word error rate (WER) at about 4%.
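For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference words. A minimal sketch follows; the function name is an assumption, and whitespace tokenization is used for simplicity.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))  # prints: 0.25
```

At a 4% WER, roughly one word in twenty-five differs from the reference, so a 25-word utterance would contain about one error on average.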
After the data was finished, we needed to manually annotate the test set: 40 hours in total, which is quite large. To avoid overlap with the LibriSpeech test sets, our test set does not include audiobooks. In addition, we maintain a leaderboard to showcase the best-performing models.
As for the future of GigaSpeech as a volunteer project: first, we plan to add more languages; second, we plan to open up more evaluation data, so that everyone has more data for a fair evaluation of experimental results; third and fourth, we hope to release some pre-trained and fine-tuned models so that people can use them more conveniently. We also hope to share some useful decoders, and we maintain a code base called PySpeechColab. So far it implements downloading and installing the GigaSpeech dataset; other features are still under discussion.
3. Selected audience questions
1. When new graduates choose a company in the speech field during autumn recruitment, what should they focus on?
I don't have much experience here, because I have never actually looked for a job through autumn campus recruitment. Based on my own impressions, I think the team and the manager are important.
You need to consider whether the team culture fits your personality and whether the team will keep investing in the direction you are interested in. Be careful with companies that suddenly decide to build voice products and then hire aggressively: voice may not be closely tied to their core products, so they may well abandon it later.
In addition, I pay close attention to whether the department has a good manager and whether, after talking with them, their plans for the future line up with yours. That matters too.
2. Are you preparing to work on a Chinese speech dataset?
We are working on this as well. For data cleaning and annotation, our processing pipeline is actually already well developed. What concerns us more is the source of the data: whether the dataset can include more, and more varied, data sources, such as telephone speech, is what we want to improve. Everyone is welcome to offer suggestions and help solve this problem together.
3. What is the future direction of speech recognition, and what are the likely commercialization prospects?
As I understand it, speech is more of a tool. Future development will probably treat speech as a convenient, easy-to-use tool, which means the barrier to using speech recognition needs to keep falling and it must become ever more convenient. At present, the main commercial application of speech is the call center, which many companies are willing to pay for.
As for intelligent voice, there are product forms such as smart speakers and APIs; other ways to make money include government intelligence projects and some cloud services. Of course, more and richer business models may emerge in the future, and that is hard to predict.