League of Legends | Honor of Kings | CrossFire: BGM AI Soundtrack Competition Write-Up
2022-07-07 00:33:00 【weixin_42001089】
Preface
Recently I entered a multimodal competition with my colleagues, and we ended up winning first place. One anecdote worth sharing: we led throughout the preliminary round, were suddenly overtaken on the last day, and then came back during the defense. The final score was computed from four dimensions: technical approach (30%), theoretical depth (30%), live presentation (10%), and accuracy (30%, converted from the preliminary-round score).
Without further ado, let's get to the topic.
Competition task
Simply put, the task is to automatically match BGM to videos. Three games are involved: Honor of Kings (HoK), League of Legends (LOL), and CrossFire (CF). Given a game video, the goal is to output its embedding, together with the embeddings of all candidate BGM tracks in the library.
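To make the input/output contract concrete, here is a minimal sketch of how such embeddings could be used to pick BGM. It assumes candidates are ranked by Euclidean distance to the video embedding (the post itself only guesses that scoring works this way); the function names, track ids, and toy vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_bgm(video_emb, bgm_library):
    """Return BGM ids sorted from nearest to farthest from the video embedding.

    bgm_library maps a (hypothetical) track id to its embedding.
    """
    return sorted(bgm_library, key=lambda track_id: euclidean(video_emb, bgm_library[track_id]))


# Toy embeddings, invented for illustration only.
library = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [5.0, 5.0]}
ranking = rank_bgm([0.9, 0.1], library)
```

With this layout the model only has to place a video close to its matching BGM in the shared embedding space; the ranking itself is a plain nearest-neighbour lookup.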
The evaluation metrics are as follows:
For licensing reasons the dataset is not provided here, but I found some similar data online 【it will be removed on request if it infringes; contact the author】 so you can get a feel for it (CSDN does not allow uploading it; if you want to see it, check the author's Zhihu):
EDA
We first took a thorough look at the data.
(1) Example video frames:
As you can see, frames from the same game are quite similar to one another.
(2) Duration
Training set: long-tailed distribution, median 67.6s, mean 94.6s. Test set: the duration is 32s.
(3) Separated streamer-commentary audio is available, but the BGM noise after separation is too loud.
(4) Text data (such as titles, ASR transcripts, etc.) is missing.
Method
Overall framework
Overall it is a simple two-tower model whose input side has three modalities: text, audio, and video.
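A minimal sketch of the two-tower idea, under stated assumptions: each tower concatenates its input features, applies one linear projection, and L2-normalizes the result so that both sides land in the same embedding space. The `Tower` class, the feature sizes, the random weights, and the choice of plain concatenation for fusion are all hypothetical; the post does not describe the actual layers.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class Tower:
    """One tower: concatenate modality features, project linearly, normalize."""

    def __init__(self, in_dim, out_dim, seed):
        rng = random.Random(seed)
        self.in_dim = in_dim
        # Random weights stand in for trained parameters.
        self.weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

    def __call__(self, *modalities):
        fused = [x for feats in modalities for x in feats]  # simple concatenation
        if len(fused) != self.in_dim:
            raise ValueError("unexpected total feature size")
        projected = [sum(w * x for w, x in zip(row, fused)) for row in self.weights]
        return l2_normalize(projected)


# Hypothetical feature sizes for text, audio, and video features.
TEXT_DIM, AUDIO_DIM, VIDEO_DIM, EMB_DIM = 4, 6, 5, 3
video_tower = Tower(TEXT_DIM + AUDIO_DIM + VIDEO_DIM, EMB_DIM, seed=0)
bgm_tower = Tower(AUDIO_DIM, EMB_DIM, seed=1)

video_emb = video_tower([0.1] * TEXT_DIM, [0.2] * AUDIO_DIM, [0.3] * VIDEO_DIM)
bgm_emb = bgm_tower([0.5] * AUDIO_DIM)
```

Because both towers emit vectors of the same size, a video embedding can be compared directly with every BGM embedding in the library.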
Based on our preliminary investigation, we believed the frames of these videos are very similar, so unless representative frames such as transitions can be extracted accurately, randomly sampled frames are not much help. The streamer's commentary, on the other hand, is critical: first, the tone (audio) directly conveys excitement, humor, and so on; second, the content of the commentary (text) is also crucial.
Note that this does not mean video is unimportant. The point here is cost-effectiveness: within a limited time, focusing on the latter two is more likely to pay off. If the goal is technical depth, the video modality certainly deserves thorough exploration.
To keep to the point, only the strategies that actually helped are listed here.
(1) Audio
The first step is audio feature extraction. VGGish is the common choice today, and most of the teams I saw used it, but we adopted HuBERT, an audio pre-trained model built on BERT-style training. The framework is as follows:
For more details, see: Light Sea: HuBERT: BERT-based self-supervised speech representation learning
(2) Data augmentation
The official dataset is very small, so making full use of it matters. We segmented the videos: each video is cut at 20s intervals, and the video and its corresponding BGM are cut synchronously, which augments the data. We also experimented with the segment length (5s, 20s, 32s); 20s worked best.
Key point: use the same segment length for training and prediction and keep the strategy consistent, otherwise the results are very poor.
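The synchronous cutting can be sketched as follows. This is a minimal illustration that assumes both tracks are already aligned sample sequences at the same sample rate and that a tail shorter than one window is dropped; the post does not say how leftover tails were handled, and the function name and toy data are hypothetical.

```python
def cut_synchronously(video_audio, bgm_audio, sample_rate, window_s=20):
    """Cut a video's track and its BGM into aligned fixed-length windows.

    Returns a list of (video_segment, bgm_segment) pairs covering the same
    time span, so each pair can serve as one training example.
    """
    window = int(sample_rate * window_s)
    if window <= 0:
        raise ValueError("window must cover at least one sample")
    # Assumption: keep full windows only; a shorter tail is discarded.
    n_windows = min(len(video_audio), len(bgm_audio)) // window
    return [
        (video_audio[i * window:(i + 1) * window], bgm_audio[i * window:(i + 1) * window])
        for i in range(n_windows)
    ]


# Toy tracks: 10 samples at 2 samples per second, cut into 2-second windows.
video_track = list(range(10))
bgm_track = list(range(100, 110))
pairs = cut_synchronously(video_track, bgm_track, sample_rate=2, window_s=2)
```

The same function with the same `window_s` would be applied at prediction time, in line with the key point about keeping training and prediction lengths consistent.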
(3) Loss
The final deliverable is an embedding, so the results are most likely judged by Euclidean similarity. We therefore used Triplet Loss, a contrastive-learning-based objective.
Core code:
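As a stand-in illustration (not the authors' listing), a minimal sketch of a triplet loss over Euclidean distance might look like this. The margin value, the function names, and the toy vectors are assumptions made for illustration; a real training loop would typically use a framework equivalent such as PyTorch's `torch.nn.TripletMarginLoss`.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin).

    anchor:   video embedding
    positive: embedding of the BGM that belongs to the video
    negative: embedding of some other BGM
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def batch_triplet_loss(triplets, margin=1.0):
    """Mean triplet loss over a list of (anchor, positive, negative) tuples."""
    return sum(triplet_loss(a, p, n, margin) for a, p, n in triplets) / len(triplets)


# Toy embeddings, invented for illustration only.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # positive is already much closer
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])  # negative is closer than positive
```

The loss is zero once the matching BGM is closer to the video than the non-matching one by at least the margin, which is exactly the ordering a Euclidean-distance ranking needs.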
Final :
Takeaways
That is actually all of the tricks. Seems simple, doesn't it? However, the author believes the above is not the most important part. What matters most is the thinking and the experimental details throughout the process.
On the video modality: although we expected little gain from the start, we still ran the experiment, extracting 10 frames from each video segment and fusing them with the commentary-audio modality. The result was indeed no gain.
We also ran ASR on the commentary and extracted features with BERT, but the results did not improve. This was quite unexpected. The likely cause is poor ASR quality: we applied an open-source model directly, and the resulting text had no punctuation. For lack of time, we did not try again.
Adding game-category prediction as an auxiliary task did not improve the results either. A possible explanation is that BGM is universal across games.
Another anecdote: we were far ahead on the leaderboard early on, but on the last night one competitor suddenly climbed 14 places to second, and in the final hour took first. This made us very curious about what the trick was. In the end we learned that the key was simply using two strong audio backbones. Many participants designed small tricks, but these were eclipsed by an outright stronger model. This shows how important the commentary audio is for this task, and how important the choice of baseline model is.
There are also a few scattered thoughts that I won't go into here; contact the author if you are interested.
Future directions
Here are some possible optimization points for the future.
(1) Text modality: if text such as titles and tags were available, along with good ASR transcripts of the commentary, the ceiling here is still very high.
(2) How can visual features be captured better? For example, scene switches and key frames of in-game moves and special effects.
(3) Use separate backbones for commentary audio and BGM audio. Our framework currently shares one, but in theory the distributions of commentary and BGM are different.
(4) How can features from different modalities interact better?
If you are interested, feel free to get in touch and discuss ~
Follow
Thanks for reading, and see you next time ~
WeChat official account:
GitHub:
Mryangkaitong · GitHub: https://github.com/Mryangkaitong
Zhihu: