League of Legends | Honor of Kings | CrossFire: sharing our BGM AI soundtrack competition solution
2022-07-07 00:33:00 [weixin_42001089]
Preface
Recently my colleagues and I took part in a multimodal competition and ended up winning first place, so here is a write-up. One anecdote first: we led throughout the preliminary round, were suddenly overtaken on the last day, and then came back during the final defense. The final score was computed from four dimensions: technical approach (30%), theoretical depth (30%), live presentation (10%), and accuracy (30%, converted from the preliminary-round score).
Without further ado, let's get started.
The task
Put simply, the task is to automatically match background music (BGM) to videos. Three games are involved: Honor of Kings (HoK), League of Legends (LoL), and CrossFire (CF). Given a game video, the goal is to output an embedding for the video together with embeddings for every track in the candidate BGM library.
The evaluation metric is as follows:
Because of licensing issues the dataset is not provided here, but I found similar data online (contact the author for removal in case of infringement) so you can get a feel for it. CSDN does not allow uploading it; if you want to see it, check the author's Zhihu.
EDA
We first explored the data thoroughly.
(1) Example video frames:
Frames from the same game are visibly very similar to one another.
(2) Duration
Training set: long-tailed distribution, median 67.6s, mean 94.6s. Test set: the duration is 32s.
(3) Separated streamer-commentary audio is available, but the BGM left after separation is very noisy.
(4) There is no text data (titles, ASR transcripts, etc.).
Method
Overall framework
Overall it is a simple two-tower model. The input side has three modalities: text, audio, and video.
From our preliminary investigation, we concluded that the video frames are very similar to each other, so unless representative frames such as scene transitions can be extracted accurately, randomly sampled frames do not help much. The streamer's commentary, on the other hand, is critical: the tone (audio) directly reflects excitement, humor and so on, and the content of the commentary (text) matters as well.
Note that this is not to say video is unimportant. The point is cost-effectiveness: within a limited time, focusing on the latter two modalities gives a higher return. If the goal were technical depth, the video modality would certainly deserve careful exploration.
To keep this focused, only the strategies that actually helped are listed here.
(1) Audio
The first step is audio feature extraction. VGGish is the common choice at the moment, and most of the solutions I looked at used it, but we adopted HuBERT, an audio pre-training model trained along the lines of BERT. The framework is as follows:
For more details, see: Light Sea: HuBERT: BERT-based self-supervised speech representation learning
(2) Data augmentation
The official dataset is very small, so making full use of it was a real consideration. We augmented the data by segmenting the videos: each video is cut at 20s intervals, and the corresponding BGM is cut synchronously. We also experimented with segment lengths of 5s, 20s, and 32s; 20s worked best.
Key point: use the same segment length for training and prediction. If the two strategies are inconsistent, the results are very poor.
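As a minimal sketch of the synchronous cutting described above, the helper below computes shared segment boundaries for a video and its BGM. The function names, the non-overlapping windows, and the choice to drop the trailing remainder by default are assumptions for illustration; the post does not say how tails or clips shorter than the window were handled.

```python
from typing import Dict, List, Tuple

Bounds = Tuple[float, float]


def segment_bounds(duration_s: float, window_s: float = 20.0,
                   keep_tail: bool = False) -> List[Bounds]:
    """Return (start, end) times of consecutive, non-overlapping windows.

    Assumption: the remainder shorter than one window is dropped unless
    keep_tail is True.
    """
    if duration_s <= 0 or window_s <= 0:
        raise ValueError("duration_s and window_s must be positive")
    n_full = int(duration_s // window_s)
    bounds = [(i * window_s, (i + 1) * window_s) for i in range(n_full)]
    tail_start = n_full * window_s
    if keep_tail and tail_start < duration_s:
        bounds.append((tail_start, duration_s))
    return bounds


def paired_segments(video_duration_s: float, bgm_duration_s: float,
                    window_s: float = 20.0) -> List[Dict[str, Bounds]]:
    """Cut a video and its BGM with identical boundaries so pairs stay aligned."""
    shared = min(video_duration_s, bgm_duration_s)
    return [{"video": b, "bgm": b} for b in segment_bounds(shared, window_s)]


if __name__ == "__main__":
    # A clip of median training length (67.6s) yields three aligned 20s pairs.
    for pair in paired_segments(67.6, 67.6):
        print(pair)
```

Using the same `window_s` at training and prediction time is what keeps the two strategies consistent.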
(3) Loss
The required output is an embedding, so our guess was that results are most likely judged by Euclidean distance. We therefore used Triplet Loss, a contrastive-learning-style objective.
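For readers unfamiliar with it, here is a rough pure-Python illustration of a triplet loss on Euclidean distance, loss = max(0, d(a, p) - d(a, n) + margin), plus the nearest-neighbour ranking it is meant to improve. This is not the competition code: the function names, the margin value, and the toy vectors are assumptions, and in a real training loop a framework implementation such as PyTorch's `torch.nn.TripletMarginLoss` would normally be used instead.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Hinge on the gap between anchor-positive and anchor-negative distances.

    anchor: video embedding; positive: its matched BGM embedding;
    negative: some other BGM embedding (how negatives are mined is not
    covered here).
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


def rank_candidates(video_emb: Sequence[float],
                    bgm_embs: Sequence[Sequence[float]]) -> List[int]:
    """Indices of candidate BGM embeddings, nearest first."""
    return sorted(range(len(bgm_embs)),
                  key=lambda i: euclidean(video_emb, bgm_embs[i]))


if __name__ == "__main__":
    video = [0.0, 0.0]
    library = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]  # toy values
    print(rank_candidates(video, library))
    print(triplet_loss(video, library[1], library[0]))
```

The loss is zero once the matched BGM is closer to the video than the mismatched one by at least the margin, which is exactly the ordering a distance-based ranking rewards.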
Core code :
Final :
Takeaways
That actually covers all the tricks. Feels simple, doesn't it? However, the author believes the above is not the most important part. What matters most is the thinking and the experimental details along the way.
On the video modality: we expected from the start that it would not bring much gain, and the experiment confirmed it. Extracting 10 frames from each video segment and fusing them with the commentary-audio modality gave no improvement.
We also ran ASR on the commentary audio and extracted features from the transcripts with BERT, and this did not improve results either. That was unexpected. A likely reason is poor ASR quality: we used an open-source model off the shelf, and its output has no punctuation. For lack of time we did not try further.
Adding game-category prediction as an auxiliary task did not help either. One possible explanation is that BGM is largely game-agnostic.
Another anecdote: we were far ahead on the leaderboard early on, but on the last night another team suddenly jumped 14 places to second, and in the final hour took first. That made us very curious about what their trick was. We later learned that the key was simply using two strong audio backbones. Many participants designed small tricks, but those were eclipsed by a sufficiently strong model. This shows how important the commentary audio is for this task, and how important the choice of baseline model is.
There are also some minor, scattered thoughts that I will not go into here; contact the author if you are interested.
Future directions
Here are some possible optimizations for the future.
(1) Text modality: if text such as titles and tags could be provided, along with good ASR transcripts of the commentary, the ceiling here is still very high.
(2) How can visual features be captured better? For example, scene switches or key frames of in-game moves and special effects.
(3) Use separate backbones for commentary audio and BGM audio. Our framework currently shares one, but in theory the distributions of commentary and BGM are different.
(4) How can features from different modalities interact better?
If you are interested, feel free to get in touch and discuss.
Follow
Thanks for reading, and see you next time.
WeChat official account:
GitHub:
Mryangkaitong · GitHub https://github.com/Mryangkaitong
Zhihu: