
League of Legends ｜ Honor of Kings ｜ CrossFire: Sharing Our BGM AI Soundtrack Competition Experience

2022-07-07 00:33:00 【weixin_42001089】

Preface

My colleagues and I recently took part in a multimodal competition and ended up winning first place, so I would like to share the experience. There was a twist along the way: we led throughout the preliminary round, were suddenly overtaken on the last day, and then turned things around during the defense. The final score was computed from four dimensions: technical approach (30%), theoretical depth (30%), live performance (10%), and accuracy (30%, converted from the preliminary-round score).

Without further ado, let's get to the topic.

The task

Simply put, the task is to automatically match BGM to videos. Three games are involved: Honor of Kings (HoK), League of Legends (LoL), and CrossFire (CF). Given a game video, the goal is to output its embedding together with the embeddings of all candidate BGM tracks in the library.
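To make the task concrete, here is a minimal sketch of how a video embedding could be matched against a BGM library. It assumes that candidates are ranked by Euclidean distance, which is only a guess about the scoring rule; the function `rank_bgm` and the toy vectors are hypothetical and not part of the competition code.

```python
import math


def rank_bgm(video_emb, bgm_library):
    """Return BGM ids sorted from closest to farthest from the video embedding.

    Assumption: closeness is measured by Euclidean distance.
    """
    distances = {
        bgm_id: math.dist(video_emb, emb) for bgm_id, emb in bgm_library.items()
    }
    return sorted(distances, key=distances.get)


# Toy 3-dimensional embeddings, purely illustrative.
video_emb = [0.9, 0.1, 0.0]
bgm_library = {
    "bgm_a": [1.0, 0.0, 0.0],
    "bgm_b": [0.0, 1.0, 0.0],
    "bgm_c": [0.5, 0.5, 0.0],
}
ranking = rank_bgm(video_emb, bgm_library)
```

In this toy case `bgm_a` is the nearest track, so it would be proposed first for the video.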

The evaluation metric is as follows:

Due to licensing restrictions, the dataset cannot be provided here, but I found similar data online that gives a feel for it [it will be removed on request if it infringes; please contact the author]. (CSDN does not allow the upload; if you want to see it, check the author's Zhihu:)

EDA

We first took a close look at the data.

（1） Example video frames:

As you can see, frames from the same game look quite similar to one another.

（2） Duration

Training set: long-tailed distribution, median 67.6 s, mean 94.6 s. Test set: the duration is 32 s.

（3） Separated streamer-commentary audio is available, but after separation the BGM is too noisy.

（4） Text data (titles, ASR transcripts, etc.) is missing.

Method

Overall framework

Overall, this is a simple two-tower model whose input side takes three modalities: text, audio, and video.
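The two-tower idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the competition model: each tower is reduced to one untrained linear projection, the three video-side modalities are fused by plain concatenation (the post does not say how fusion was done), and all names and feature sizes are hypothetical. A real implementation would use a deep learning framework and trained encoders.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class Tower:
    """One tower: a single linear projection followed by L2 normalization."""

    def __init__(self, in_dim, out_dim, seed):
        rng = random.Random(seed)
        # Randomly initialised weights; a real tower would be trained.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)
        ]

    def __call__(self, features):
        projected = [sum(w * x for w, x in zip(row, features)) for row in self.weights]
        return l2_normalize(projected)


def fuse(text_feat, audio_feat, video_feat):
    """Assumed fusion: simple concatenation of the three modality features."""
    return text_feat + audio_feat + video_feat


# Hypothetical feature sizes: 4-dim text, 6-dim commentary audio, 5-dim video.
video_tower = Tower(in_dim=4 + 6 + 5, out_dim=8, seed=0)
bgm_tower = Tower(in_dim=6, out_dim=8, seed=1)

video_emb = video_tower(fuse([0.1] * 4, [0.2] * 6, [0.3] * 5))
bgm_emb = bgm_tower([0.5] * 6)
```

Both towers map into the same 8-dimensional space, which is what lets a video embedding be compared directly with BGM embeddings.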

Based on our preliminary investigation, we concluded that the video frames are very similar to each other, so unless representative frames such as scene transitions can be extracted accurately, randomly sampled frames are of little help. The streamer's commentary, on the other hand, is critical: the tone (audio) directly reflects emotions such as excitement or humor, and the content of the commentary (text) is also crucial.

Note that this does not mean video is unimportant. The point is cost-effectiveness: within a limited time, focusing on the latter two yields larger gains more easily. For technical depth, the video modality definitely deserves thorough exploration.

To keep things focused, only the strategies that actually helped are listed here.

（1） Audio

The first step is audio feature extraction. VGGish is the common choice today, and I found that most solutions basically use VGGish as well. We instead adopted HuBERT, an audio pre-training model built on BERT-style ideas. Its framework is as follows:

For more details, see: Light Sea: HuBERT: BERT-based self-supervised speech representation learning

（2） Data augmentation

The official dataset is very small, so making full use of it was also a consideration. We therefore segmented the videos: each video was cut at 20 s intervals, with the video and its corresponding BGM cut synchronously, as a form of data augmentation. We also experimented with clip lengths of 5 s, 20 s, and 32 s; 20 s worked best.

Key point: use the same clip length for training and prediction and keep the strategy consistent; otherwise the results are very poor.
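The synchronized cutting can be sketched as computing one list of time windows and applying it to both the video and its BGM track. This is a minimal sketch: the helper `segment_windows` is hypothetical, and dropping the trailing remainder shorter than one window is an assumption, since the post does not say how leftovers were handled.

```python
def segment_windows(duration_s, window_s=20.0):
    """Split [0, duration_s) into consecutive windows of window_s seconds.

    Assumption: a trailing remainder shorter than window_s is dropped so
    that every clip has the same length.
    """
    windows = []
    start = 0.0
    while start + window_s <= duration_s:
        windows.append((start, start + window_s))
        start += window_s
    return windows


# The same (start, end) pairs are applied to the video and to its BGM track,
# so each video clip stays aligned with its BGM clip.
clips = segment_windows(67.6)  # 67.6 s is the median training duration
```

With the median training duration, one source video yields three aligned 20 s training pairs instead of one.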

（3） Loss

Since the final deliverable is an embedding, we guessed that results are most likely judged by Euclidean-distance similarity, so we used Triplet Loss, a contrastive-learning-based method.
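For readers unfamiliar with it, the standard triplet loss is max(d(a, p) - d(a, n) + margin, 0), where a is the anchor, p a positive, n a negative, and d a distance. Below is a minimal sketch with Euclidean distance. It is not the author's code: the margin value and the choice of negatives are assumptions, and the toy vectors are illustrative only.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance d."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# anchor: video embedding, positive: its own BGM, negative: another BGM.
video = [0.0, 0.0]
own_bgm = [0.0, 1.0]
other_bgm = [3.0, 4.0]

# The negative is already much farther than the positive: no penalty.
easy = triplet_loss(video, own_bgm, other_bgm)
# Roles swapped: the "positive" is farther than the "negative", so the loss is large.
hard = triplet_loss(video, other_bgm, own_bgm)
```

The loss is zero once the matching BGM is closer to the video than a non-matching one by at least the margin, which pushes the embedding space toward the distance-based ranking the task appears to require.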

Core code ：

Final ：

Takeaways

At this point the tricks have all been covered. Feels pretty simple, doesn't it? However, I believe the above is not the most important part. What matters most is the thinking and the experimental details throughout the whole process!

About the video modality: we expected from the start that it would not bring much gain, and the experiment confirmed it. We extracted 10 frames from each video segment and fused them with the commentary-audio modality, and there was indeed no gain.

We also ran ASR on the commentary and extracted features from the transcripts with BERT, but this did not improve results either. That was quite unexpected. The likely cause is poor ASR quality: we used an open-source model directly, and its text output had no punctuation. Due to time constraints we did not try again.

Adding game-category prediction as an auxiliary task did not improve results. One possible explanation is that BGM is universal across games.

Another twist: we were far ahead on the leaderboard early on, but on the last night one participant suddenly climbed 14 places to second, and in the final hour reached first. This made us very curious about what the trick was. Haha, in the end we learned that the key was using two strong audio backbones. Although many participants designed various small tricks, these were eclipsed by sheer model strength. This shows both the importance of the commentary audio for this task and the importance of baseline model selection.

There are also some smaller, scattered thoughts that I won't go into here. If you are interested, feel free to contact the author.

Future directions

Here are some possible optimization points for the future.

（1） Text modality: if text such as titles and tags, plus good ASR transcripts of the commentary, could be provided, the ceiling here would still be very high.

（2） How can visual features be captured better? For example, scene switches and key frames showing in-game moves and special effects.

（3） Use separate backbones for the commentary audio and the BGM audio. Our framework currently shares one, but in theory the distributions of commentary and BGM are different.

（4） How can features from different modalities interact better?

If you are interested, feel free to get in touch and discuss～