当前位置:网站首页>Paper reading - beat tracking by dynamic programming
Paper reading - beat tracking by dynamic programming
2022-06-13 02:13:00 【zjuPeco】
List of articles
1 summary
When splicing short videos with background music , If the splicing point of the two videos happens to be on a beat point of the background music , So the composite video looks , Sound , Very comfortable , This is a bonus item for short video synthesis , This kind of video is what we often call "stuck video" . The premise of making a video clip is to find a point in the background music that can be stuck ,beats It is one of the points that can be stuck , This article is to use the vernacular to talk about the thesis Beat Tracking by Dynamic Programming How to find it beats Of . Common audio signal processing library librosa Medium librosa.beat.beat_track That's how it works . The name of this task is beat tracking, This will be called in the following paragraphs .
2 General framework
The paper introduces beat tracking It can be divided into three steps :
(1) Calculation Onset Strength Envelope(Onset Energy envelope of )
(2) Calculate the global Tempo
(3) Based on dynamic programming beats
I saw these three steps for the first time , It's also confused , If it's someone who specializes in signal processing , You may already know what happened , But the target audience of this article is like me , A little friend who has learned or understood signal processing but has forgotten about it . therefore , These three steps will be analyzed in detail below .
If you just want to know how to use it ,librosa Everything has been packed for us , Three lines of code .
import librosa
y, sr = librosa.load(“your/music/file/path”)
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
3. Calculation Onset Strength Envelope
onset It refers to the starting point of a note , For example, the moment you press the piano key , Another example is the moment when the strings are plucked , The picture below shows onset The location of . In this part , Is to put all of the onset Find out the energy envelope of . The audio given may be a drum track , Piano tracks , The violin track , A mix of vocal tracks, etc . No matter which track , Will be found by us .
look for onset The method of energy envelope is not proposed in this paper , Using an existing common method , be called crude perceptual model. The steps are as follows :
(1) use 8kHz Read audio files at the sampling rate of .
(2) use 32ms Of window_size and 4ms Of step_size, Short time Fourier transform (STFT), Get the picture 2 The top picture in .
(3) Map the spectrum onto the Mel spectrum , The vertical axis is divided into 40 individual bands, Get the picture 2 Figure in the middle of .
(4) Along the timeline for each band First order difference , And set all negative values to 0, And then put all of them at each time point band The positive value after the difference is accumulated .
(5) Yes (4) High pass filtering is performed on the results obtained in , To filter out 0.4kHz The following information , Make its local zero mean , Reuse window_size by 20ms Gaussian window for smoothing , Get the picture 2 The bottom figure in , That is to say onset strength envelope, Write it down as O ( t ) O(t) O(t).
O ( t ) O(t) O(t) The local peaks in the are where the energy increases abruptly , Why does the energy change dramatically ? Of course it's because a new note sounds . Be careful , I don't think the crest here is onsets The exact time , It is onsets The time of the wave crest produced . But different articles seem to refer to this onsets The candidate for . But we don't care onsets Where exactly , As long as there is O ( t ) O(t) O(t) That's enough , So don't worry about the definition .
The other one is , I wonder why the author of the paper should use (5) Such normalization , That's what we got O ( t ) O(t) O(t) There will be many negative values , But we don't want these negative values .Tempo and Beat Tracking I think the normalization method in is more reasonable , Direct subtraction local average, That's all right. . The figure below 3 The top red line in is local average, The result obtained after subtraction is figure 3 The picture in the middle , chart 3 The icons below show local peaks and onsets.
4 Calculate the global Tempo
Tempo It refers to the beat of music , Usually use bpm(beats per minute) To measure , such as 120bpm It means one minute 120 individual beats, The period of the beat is 0.5s. In this paper, we use the period of the beat to express Tempo. A musical Tempo It may change over time , But it's rare , We only discuss the whole audio here tempo Are consistent . Changing Tempo Inspection can be found in predominant local pulse, The general idea is to segment , It's not discussed here .
The beat is a tune that keeps cycling , We have to find out the period of this cycle . Autocorrelation function is suitable for finding period , We calculate different delay times O ( t ) O(t) O(t) The value of the autocorrelation function of , The highest value corresponds to a delay time beat The length of .
But there is a problem , Pictured 4 Of raw autocorrelation Shown , The value of autocorrelation function of periodic function on the multiple of its base period is very large . To solve this problem , This paper introduces a weight coefficient , Make the periodic result tend to a certain empirical value . The formula for calculating the autocorrelation coefficient is
T P S ( τ ) = W ( τ ) ∑ t O ( t ) O ( t − τ ) TPS(\tau)=W(\tau)\sum_t O(t)O(t-\tau) TPS(τ)=W(τ)t∑O(t)O(t−τ)
among , τ \tau τ Is the delay time ,TPS by Tempo Period Strength Abbreviation , It is the name given to this autocorrelation calculation method in this paper , bring T P S ( τ ) TPS(\tau) TPS(τ) The biggest one τ \tau τ, That's the cycle we're looking for .
W ( τ ) W(\tau) W(τ) Is a Gaussian weight coefficient , Expressed as
W ( τ ) = e x p { − 1 2 ( l o g 2 τ / τ 0 σ τ ) 2 } W(\tau) = exp\{-\frac{1}{2}(\frac{log_2 \tau / \tau_0}{\sigma_\tau})^2\} W(τ)=exp{ −21(στlog2τ/τ0)2}
among , τ 0 \tau_0 τ0 Is the default biased cycle size , σ τ \sigma_\tau στ Is a coefficient indicating the degree of bias . τ 0 \tau_0 τ0 and σ τ \sigma_\tau στ All experience values , The paper starts from MIREX-06 Beat Tracking From the centralized statistics of training . The statistical method is to fill in different τ 0 \tau_0 τ0 and σ τ \sigma_\tau στ, bring T P S ( τ ) TPS(\tau) TPS(τ) The maximum value obtained from and marked in the data Tempo The group with the highest consistency τ 0 \tau_0 τ0 and σ τ \sigma_\tau στ Is it .
The final result is τ 0 \tau_0 τ0 by 0.5s, That is to say 120bpm, σ τ \sigma_\tau στ by 1.4. Under this group of parameters T P S ( τ ) TPS(\tau) TPS(τ) Pictured 4 As shown in the figure at the bottom of . In the picture Primary Tempo Period Is the final cycle .
librosa.beat.beat_track One of the input parameters in is start_bpm, Refers to τ 0 \tau_0 τ0, It can be modified manually , The default is 120.
In actual use , Would be right T P S ( τ ) TPS(\tau) TPS(τ) Do some optimization , become
T P S 2 ( τ ) = T P S ( τ ) + 0.5 T P S ( 2 τ ) + 0.25 T P S ( 2 τ − 1 ) + 0.25 T P S ( 2 τ + 1 ) TPS2(\tau) = TPS(\tau) + 0.5 TPS(2\tau) + 0.25 TPS(2\tau - 1) + 0.25 TPS(2\tau + 1) TPS2(τ)=TPS(τ)+0.5TPS(2τ)+0.25TPS(2τ−1)+0.25TPS(2τ+1)
or
T P S 3 ( τ ) = T P S ( τ ) + 0.33 T P S ( 3 τ ) + 0.33 T P S ( 3 τ − 1 ) + 0.33 T P S ( 3 τ + 1 ) TPS3(\tau) = TPS(\tau) + 0.33 TPS(3\tau) + 0.33 TPS(3\tau - 1) + 0.33 TPS(3\tau + 1) TPS3(τ)=TPS(τ)+0.33TPS(3τ)+0.33TPS(3τ−1)+0.33TPS(3τ+1)
Either way , T P S ( τ ) TPS(\tau) TPS(τ) Corresponding to the peak of τ \tau τ That's the cycle we're looking for .
5 Based on dynamic programming beats
The paper 4ms One step (250Hz) Divide the time into , Used the second 3 Section and section 4 The results of this section , The following objective function is designed :
C ( { t i } ) = ∑ i = 1 N O ( t ) + α ∑ i = 2 N F ( t i − t i − 1 , τ p ) C(\{t_i\}) = \sum_{i=1}^N O(t) + \alpha \sum_{i=2}^N F(t_i - t_{i-1}, \tau_p) C({ ti})=i=1∑NO(t)+αi=2∑NF(ti−ti−1,τp)
among , { t i } \{t_i\} { ti} Found for N N N individual beats; O ( t ) O(t) O(t) That is the first. 3 Section Onset Strenghth Envelope; α \alpha α Is the coefficient that balances the two objective terms ; τ p \tau_p τp That is the first. 4 Period obtained in section ; F ( △ t , τ p ) F(\triangle t, \tau_p) F(△t,τp) Is used to measure every two adjacent beats Spacing and τ p \tau_p τp The gap between , This can be defined by itself , In the paper, it is expressed as
F ( △ t , τ p ) = − ( l o g △ t τ p ) 2 F(\triangle t, \tau_p) = -(log \frac{\triangle t}{\tau_p})^2 F(△t,τp)=−(logτp△t)2
Visible when △ t \triangle t △t and τ p \tau_p τp The closer the , F ( △ t , τ p ) F(\triangle t, \tau_p) F(△t,τp) The bigger it is , The maximum is 0, Otherwise, the smaller . besides , The formula is log symmetrical , such as F ( k τ p , τ p ) = F ( τ p / k , τ p ) F(k\tau_p, \tau_p) = F(\tau_p / k, \tau_p) F(kτp,τp)=F(τp/k,τp).
Our goal is to make C ( { t i } ) C(\{t_i\}) C({ ti}) The bigger the better , Look at the , When α = 0 \alpha=0 α=0 when , Select all the time points , C ( { t i } ) C(\{t_i\}) C({ ti}) It's the biggest ; When α = + ∞ \alpha=+\infty α=+∞ when , Select a time interval of τ p \tau_p τp A set of points can make C ( { t i } ) C(\{t_i\}) C({ ti}) The biggest . It's not hard to see. , With α \alpha α Balance , The resulting column of points is spaced at τ p \tau_p τp Fine tuning left and right , And make the point fall O ( t ) O(t) O(t) A group of points near a local peak .
This paper uses the method of dynamic programming , With linear time complexity , That solved the problem . First defined C ∗ ( t ) C^*(t) C∗(t), It means that only the time point not greater than the time is considered t t t when , In time t t t For one beat Of C ( { t i } ) C(\{t_i\}) C({ ti}) The maximum of , This is the state transition function in dynamic programming , The expression is :
C ∗ ( t ) = O ( t ) + max τ = 0... t − 4 m s { α F ( t − τ , τ p ) + C ∗ ( τ ) } C^*(t) = O(t) + \max_{\tau=0...t-4ms} \{\alpha F(t-\tau, \tau_p) + C^*(\tau)\} C∗(t)=O(t)+τ=0...t−4msmax{ αF(t−τ,τp)+C∗(τ)}
This is a little different from the standard dynamic programming , Have a good taste , I've been thinking about this place for a long time , This approach can avoid calculating state transitions , The previous best beats The sequence changes . This C ∗ ( t ) C^*(t) C∗(t) One more benefit , It can be used to find the ending , When C ∗ ( t ) C^*(t) C∗(t) When the increment of , It's just that the music gets weaker , That's the end .
meanwhile , It will also record what makes C ∗ ( t ) C^*(t) C∗(t) The largest sequence is t t t The previous one beat by
P ∗ ( t ) = a r g max τ = 0... t − 4 m s { α F ( t − τ , τ p ) + C ∗ ( τ ) } P^*(t) = arg\max_{\tau=0...t-4ms} \{\alpha F(t-\tau, \tau_p) + C^*(\tau)\} P∗(t)=argτ=0...t−4msmax{ αF(t−τ,τp)+C∗(τ)}
That is to say, take τ \tau τ How much is the . adopt P ∗ ( t ) P^*(t) P∗(t) Can backtrack the choice of beats route , To get the final beats Sequence .
In the actual search τ \tau τ When , Not from 0 To t-4ms To do , This will calculate many unnecessary situations . Yes F The punishment is , We just search τ = t − 2 τ p . . . t − τ p / 2 \tau = t - 2\tau_p ... t - \tau_p / 2 τ=t−2τp...t−τp/2.
We will 0 To the total time , All the time t t t Corresponding C ∗ ( t ) C^*(t) C∗(t) After all the calculations , Take the largest one C ∗ ( t N ) C^*(t_N) C∗(tN), t N t_N tN It's the last one beat spot , then t N − 1 = P ∗ ( t N ) t_{N-1} = P^*(t_N) tN−1=P∗(tN), Then the whole sequence is traced all the way back { t i } \{t_i\} { ti}. One thing is for sure , t N t_N tN Must be in [ total when Long − τ p , total when Long ] [ Total duration -\tau_p, Total duration ] [ total when Long −τp, total when Long ] Within the scope of , Or you can add another one beat.
At this point, the text has been finished . Let's talk about it briefly , Why not follow the standard dynamic planning approach , Otherwise, I may have to think about it for a long time next time . According to standard practice , Will define C ∗ ( t j ) C^*(t_j) C∗(tj) The time point is not greater than the time t j t_j tj when , C ( { t i } ) C(\{t_i\}) C({ ti}) The maximum of . The state transition function is
C ∗ ( t j ) = max { O ( t j ) + max τ = 0... t j − 1 { α F ( t j − P ∗ ( τ ) , τ p ) + C ∗ ( τ ) } , C ∗ ( t j − 1 ) } C^*(t_j) = \max \{ O(t_j) + \max_{\tau=0...t_{j-1}} \{\alpha F(t_j-P^*(\tau), \tau_p) + C^*(\tau)\}, C^*(t_{j-1}) \} C∗(tj)=max{ O(tj)+τ=0...tj−1max{ αF(tj−P∗(τ),τp)+C∗(τ)},C∗(tj−1)}
Explain that , There are two options , The former one is to t j t_j tj As a beat, The other is not to t j t_j tj As a beat. That's the problem P ∗ ( τ ) P^*(\tau) P∗(τ), When we put t j t_j tj As a beat, bring C ∗ ( t j ) C^*(t_j) C∗(tj) The previous one as big as possible beat Is not necessarily P ∗ ( τ ) P^*(\tau) P∗(τ). In other words , hold t j t_j tj As a beat after , The best in front beats May change . If we follow the standard practice , Many time points will be missed as beat The situation of . Fine products , Fine products .
6 reference
[1] Beat Tracking by Dynamic Programming
[2] Tempo and Beat Tracking
边栏推荐
- Use of Arduino series pressure sensors and detected data displayed by OLED (detailed tutorial)
- Easydl related documents and codes
- [51nod.3210] binary Statistics (bit operation)
- Review the history of various versions of ITIL, and find the key points for the development of enterprise operation and maintenance
- [work notes] xr872 codec driver migration and application program example (with chip debugging method)
- 16 embedded C language interview questions (Classic)
- Vscode configuration header file -- Take opencv and its own header file as an example
- 华为设备配置双反射器优化虚拟专用网骨干层
- 回顾ITIL各版本历程,找到企业运维发展的关键点
- 1、 Set up Django automation platform (realize one click SQL execution)
猜你喜欢

一、搭建django自动化平台(实现一键执行sql)

Sqlserver2008 denied select permission on object'***** '(database'*****', schema'dbo')

Ctrip reshapes new Ctrip

Vscode configuration header file -- Take opencv and its own header file as an example

Build MySQL environment under mac

I didn't expect that the index occupies several times as much space as the data MySQL queries the space occupied by each table in the database, and the space occupied by data and indexes. It is used i
![[work notes] the problem of high leakage current in standby mode of dw7888 motor driver chip](/img/d1/c4776e3db1b7560331fa569c40831a.jpg)
[work notes] the problem of high leakage current in standby mode of dw7888 motor driver chip

C语言压缩字符串保存到二进制文件,从二进制文件读取压缩字符串后解压。
![[pytorch]fixmatch code explanation (super detailed)](/img/22/66703bea0f8ee40eceb0687fcb3ad2.jpg)
[pytorch]fixmatch code explanation (super detailed)

Record: how to solve the problem of "the system cannot find the specified path" in the picture message uploaded by transferto() of multipartfile class [valid through personal test]
随机推荐
SWD debugging mode of stm32
拍拍贷母公司信也季报图解:营收24亿 净利5.3亿同比降10%
CCF 201409-1: adjacent number pairs (100 points + problem solving ideas)
[learning notes] xr872 GUI littlevgl 8.0 migration (display part)
万字讲清 synchronized 和 ReentrantLock 实现并发中的锁
[the second day of actual combat of smart lock project based on stm32f401ret6 in 10 days] (lighting with library function and register respectively)
Viewing the ambition of Xiaodu technology from intelligent giant screen TV v86
Parameter measurement method of brushless motor
SQL server deletes all tables and all stored procedures in the database
Sensorless / inductive manufacturing of brushless motor drive board based on stm32
[keras] data of 3D u-net source code analysis py
Ten thousand words make it clear that synchronized and reentrantlock implement locks in concurrency
4.11 introduction to firmware image package
传感器:MQ-5燃气模块测量燃气值(底部附代码)
Bluetooth module: use problem collection
js-dom
SQL Server 删除数据库所有表和所有存储过程
[the fourth day of actual combat of stm32f401ret6 smart lock project in 10 days] voice control is realized by externally interrupted keys
[programming idea] communication interface of data transmission and decoupling design of communication protocol
Uniapp preview function