"One year after graduation, I won ACL best paper"
2022-07-06 16:55:00 【ByteDance Technology】
At ACL 2021, the top academic conference in natural language processing (NLP), which concluded not long ago, ByteDance AI Lab researcher Xu Jingjing gave her talk.
After presenting at the global conference, Xu Jingjing was delighted: “I never expected such low-level research to attract so much interest. Our months of hard work paid off.”
That's right: this 「low-level research」 is the paper that won this year's ACL Best Paper award, 《Vocabulary Learning via Optimal Transport for Neural Machine Translation》. Its five authors, Xu Jingjing, Zhou Hao, Gan Chun, Zheng Zaixiang, and Li Lei, are all from ByteDance AI Lab.
ACL is the top international conference in natural language processing. Each summer it draws the attention of world-renowned research institutions and major global technology companies alike. This year ACL received 3,350 submissions, and only one is selected as Best Paper, the conference's highest award.
In addition, ByteDance AI Lab had a total of 11 papers at this year's ACL.
「Vocabulary」: low-level NLP research
The award-winning research focuses on the 「vocabulary」.
A vocabulary is the set of units that a complete sentence is broken into. Text can be split by word, by letter, or by syllable, and each way of splitting yields units with different meanings and interpretations.
For example, in Chinese 「上课」 (to attend class) means one thing, while the two characters 「上」 (on) and 「课」 (lesson) each mean something different when taken alone.
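The different ways of splitting a sentence can be sketched in a few lines of Python. This is only a toy illustration, not the method from the paper: the function names, the sample sentence, and the three-entry `toy_vocabulary` are all hypothetical, and the subword split uses a simple greedy longest-match rule chosen for brevity.

```python
def split_by_word(sentence):
    """Split on whitespace: every word is one unit."""
    return sentence.split()


def split_by_character(sentence):
    """Split into single characters, ignoring whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def split_by_subword(sentence, vocabulary):
    """Greedy longest-match split of each word into pieces from `vocabulary`.

    A single character is always accepted as a fallback, so the loop
    terminates even when a piece is missing from the vocabulary.
    """
    pieces = []
    for word in sentence.split():
        start = 0
        while start < len(word):
            for end in range(len(word), start, -1):
                if word[start:end] in vocabulary or end == start + 1:
                    pieces.append(word[start:end])
                    start = end
                    break
    return pieces


sentence = "lower lowest"
toy_vocabulary = {"low", "er", "est"}  # hypothetical subword vocabulary

print(split_by_word(sentence))
print(split_by_character(sentence))
print(split_by_subword(sentence, toy_vocabulary))
```

The same sentence yields 2 units at the word level, 11 at the character level, and 4 with the toy subword vocabulary, which is why the choice of vocabulary changes what a model actually sees.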
The NLP applications we know well, such as machine translation, text correction, and chatbots, all rest on the vocabulary. It is the basic data of machine learning, the nourishment that makes all kinds of AI functions possible.
In other words, the vocabulary is the 「foundation」 of every NLP application; a well-built vocabulary can improve performance across many different NLP tasks.
In this paper, the ByteDance AI Lab team established experimentally several relationships among vocabulary size, the amount of information a vocabulary carries, and the training of machine learning models. These regularities help the NLP research community move closer to answering the question 「what makes a good vocabulary」.
Building on this, the ByteDance researchers also proposed a new vocabulary learning method, 「VOLT」. On the common English-German and English-French translation tasks and on multilingual translation, VOLT not only achieves better translation results than traditional methods but also uses a much smaller vocabulary.
In English-German translation, for example, the new method cuts the vocabulary size required by traditional methods by 70%.
[Figure: the bottom two rows are VOLT, which uses a smaller vocabulary than traditional methods.]
In addition, a range of ByteDance's NLP research has been put to use in the translation features of products such as Volcano Translation, Xigua Video, and Feishu. Whether in the office communication of ByteDance employees and Feishu customers or when users watch foreign-language videos, this research keeps improving the user experience at the most basic level.
First work after graduation wins Best Paper
The first author of this research, Xu Jingjing, graduated from Peking University in 2020. This vocabulary research is also her first piece of work since joining ByteDance through campus recruitment.
After joining ByteDance AI Lab, Xu Jingjing found the atmosphere highly self-driven: “In our group, the leader doesn't hand you a research direction. Instead, you find the directions that interest you and propose them; if a direction is truly important, you can devote yourself to it.”
The vocabulary is the first step of all kinds of NLP research, and the directions Xu Jingjing proposed included it: “For the vocabulary, earlier work established a method and we all simply followed it. No one had thoroughly studied whether the current method is actually optimal.”
Seeing that Xu Jingjing wanted to do basic research on the vocabulary, her leader was immediately on board: the company runs the 「Volcano Translation」 business, and machine translation is hard technology. The better the translation technology, the more customers will recognize the product. Research on a basic component like the vocabulary can improve machine translation quality and so matters a great deal to the business.
This is how Xu Jingjing found the intersection of her personal interests and the company's overall direction.
But how to find the best vocabulary was a hard problem no one had tackled before. She first collected a large amount of vocabulary data and ran repeated experiments on the relationship between different vocabularies and specific training tasks, arriving at some preliminary regularities.
With these regularities in hand, one can use them to search for the best vocabulary, like the prince carrying Cinderella's glass slipper and searching the whole world for Cinderella herself.
But there are thousands of girls in the kingdom and countless vocabularies in the world. How could the best one be found? Xu Jingjing's research stalled.
Experiment followed experiment, day after day, yet the flash of insight never came. Just when she was at a loss, an internal team sharing session gave Xu Jingjing an idea.
At ByteDance AI Lab, colleagues from different backgrounds regularly share what they are good at: some have excellent mathematical thinking, some have rich multilingual backgrounds, and some have done deep theoretical research in NLP. At this session, a colleague who had majored in statistics gave a talk on mathematical theory related to machine learning. With the help of that theory, Xu Jingjing realized that the regularities she had found could be written as an objective function: by introducing the economics concept of 「marginal utility」 and searching for the best vocabulary through discrete optimization, she could narrow down where 「Cinderella」 was.
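The marginal-utility idea can be illustrated with a small sketch. It is not the authors' implementation and does not include the discrete optimization step. It assumes, based on my reading of the linked paper, that the benefit of a vocabulary is measured by the entropy of the token distribution divided by the average token length, and that marginal utility is the negative change in that quantity per vocabulary entry added. The function names and the toy token sequences are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(tokens):
    """Entropy of the token distribution divided by average token length.

    Assumption: the average length is taken over distinct vocabulary entries.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    average_length = sum(len(token) for token in counts) / len(counts)
    return entropy / average_length


def marginal_utility(tokens_small, tokens_large):
    """Negative change in normalized entropy per vocabulary entry added.

    Both arguments are the same text tokenized with a smaller and a
    larger vocabulary, respectively.
    """
    added_entries = len(set(tokens_large)) - len(set(tokens_small))
    if added_entries <= 0:
        raise ValueError("the second tokenization must use a larger vocabulary")
    change = normalized_entropy(tokens_large) - normalized_entropy(tokens_small)
    return -change / added_entries


# The same toy text "lowlool" under two hypothetical vocabularies.
character_level = list("lowlool")            # entries: l, o, w
with_one_merge = ["lo", "w", "lo", "o", "l"]  # entries: lo, w, o, l

print(marginal_utility(character_level, with_one_merge))
```

Comparing this value across candidate vocabulary sizes gives a way to ask whether growing the vocabulary further is still worth it, which is the kind of trade-off the objective function is meant to capture.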
Xu Jingjing felt fortunate afterwards: “If that colleague hadn't given the talk when they did, our vocabulary research might have ended there. The diversity of backgrounds on the team really does broaden everyone's thinking and inspire research.”
Even with this new inspiration, the search for 「Cinderella」 in a vast sea of people remained full of hardship. For months, Xu Jingjing was caught every day in an endless loop: propose a solution, run the experiments, find that the method doesn't work, talk with colleagues in search of inspiration.
“You can ask others, but the colleagues in the group are not the ones leading this work. They will help me with modeling, offer suggestions, or solve other problems, but the core work you have to think through yourself: thinking about the problem, reflecting on why an experiment failed. Most of the time I was still talking to myself, and I had to endure the loneliness.”
The low periods kept recurring. “I was depressed for long stretches, but research is like that. You spend a long time in a very low state where you can't figure things out. It's like being stuck on a math problem, and it is painful.”
Xu Jingjing even considered giving up and studying something else, but her colleagues told her: “The vocabulary is very valuable fundamental NLP research. This direction is very promising, and you've already done so much work. Don't give up halfway!”
Encouraged by her leader, Xu Jingjing persisted for another month. Then one day, after yet another failed experiment, she walked dejectedly to the canteen. She looked at the food in front of her, but her mind was full of experimental ideas and procedures. Suddenly an idea came: what if the previous method were simplified?
After the meal she hurried back to the office and reran the experiments following the new idea. The results showed that this research, half a year in the making, had succeeded.
But good experimental results are often only half of success; the work still has to be written up as a formal paper for the research community. Xu Jingjing checked the calendar: the submission deadline for the machine learning conference ICLR 2021 was approaching, and she had only 7 days left.
She wrote the paper quickly, but the time was too short, and unsurprisingly ICLR rejected it. Still, the ICLR reviewers gave many conscientious comments, suggesting that she add more explanation and experimental evidence.
Basic theoretical research is often difficult and obscure, and the author team entered a cycle of repeatedly revising the paper. They would often 「split」 themselves in two: one self as the researcher, describing what the study did; the other as a reviewer, trying to understand what the paper was saying.
After 3 months of major revisions, Xu Jingjing submitted the paper to the top NLP conference, ACL 2021. Under ACL's rules, 3 double-blind reviewers (authors and reviewers do not know each other's identities) evaluate each paper, with a maximum score of 5. Most accepted papers score around 3 to 3.5, and scores above 4 are quite rare. Two reviewers gave this paper a straight 5, and the third reviewer's score was also close to full marks. With such high scores, the ByteDance AI Lab team's paper was recommended and ultimately won this year's ACL Best Paper Award.
The secret to winning: long-term investment in low-level research
Winning a top-conference Best Paper so soon after graduation, Xu Jingjing believes, owes much to the team's support: “Our team has many backgrounds. Some are strong in math, some have strong engineering skills, some have a strong NLP background. Diverse backgrounds can inspire research ideas, and there are ample training resources to support large-scale experiments.”
Beyond the diverse backgrounds, what Xu Jingjing values even more is the team's 「immersive」 research atmosphere: “We could win Best Paper first of all because our direction is important. We didn't choose the mainstream route of improving a single task; we chose a relatively niche track that is basic but that few people study. The lack of basic research is a problem the whole industry faces, because it requires long periods of deep thinking and the payoff is not immediate. The atmosphere on our team happens to be relaxed. Nobody pushes you to deliver results in a short time, so you can devote yourself to important things for the long term.”
In NLP, if you study a specific task such as translation or dialogue, optimizing for a specific scenario works better there; basic components, however, can be used in every field. An improvement in basic research can therefore lift every specific scenario.
In Xu Jingjing's view, the whole NLP field needs some innovative work so that basic research and specific tasks can both improve and develop: “The significance of our paper is to make everyone rethink the vocabulary and see that there is still much room for improvement.”
Link to the award-winning paper:
https://arxiv.org/abs/2012.15671
GitHub repository:
https://github.com/Jingjing-NLP/VOLT