Some suggestions on writing code and reproducing papers!

2022-06-12 00:50:00 【Datawhale】

Sometimes you have a good idea but can't turn it into concrete code, or the code you write is not efficient enough.

In fact, everyone runs into situations like these:

Scenario 1: You come up with a new feature during a competition, but the pandas implementation is too slow and its time complexity is too high.
Scenario 2: You face a new problem in research or at work, in a field that is new to you, and don't know where to start.
Scenario 3: You try to reproduce someone else's deep learning paper, but it just doesn't work.

Scenario 1: The code is too slow

Whether in a competition or in everyday data processing, you will run into large files. If your code is not efficient enough, it will run very slowly and will not meet the requirements.

Step 1: Write basic code

Try out your idea on a small amount of data. The code does not need to be optimized yet; just get it written. Once it works, wrap it in a function so that it is easy to call.
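As a minimal sketch of this step, the snippet below wraps a first, unoptimized draft of an idea in a function and tries it on a tiny DataFrame. The function name `binarize_test` and the sample data are hypothetical and chosen only for illustration.

```python
import pandas as pd


def binarize_test(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where every 'test' value other than 1 becomes 0.

    Deliberately unoptimized: a first draft that is easy to check by eye.
    """
    out = df.copy()  # work on a copy so the caller's data stays untouched
    col = out.columns.get_loc("test")  # integer position of the 'test' column
    for i in range(len(out)):
        if out.iloc[i, col] != 1:
            out.iloc[i, col] = 0
    return out


# Hypothetical toy data, small enough to verify by hand.
small = pd.DataFrame({"test": [1, 3, 1, 0, 7]})
result = binarize_test(small)
print(result["test"].tolist())  # [1, 0, 1, 0, 0]
```

Having the logic behind one function name means that a faster implementation can later replace the body without touching the calling code.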

Step 2: Optimize the code logic

As the amount of data grows, you will find the code getting slower until it reaches the limit of what you can accept. At that point, you should try to optimize it.

Optimizing code comes down to two basic questions:

Is the code itself efficient enough?
Does the code make use of all available CPU/GPU resources?

For example, when using Pandas, if you are not familiar with its syntax, it is easy to write everything as a for loop. The following progression, roughly from slower to faster approaches, can serve as a reference.

Subscript loop

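df1 = df.copy()  # copy, so that df itself is not modified (df1 = df only creates an alias)
col = df1.columns.get_loc('test')  # integer position of the 'test' column
for i in range(len(df)):
    if df.iloc[i, col] != 1:
        # one .iloc call; the chained form df1.iloc[i]['test'] = 0 writes to a temporary copy
        df1.iloc[i, col] = 0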

Iterrows loop

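df1 = df.copy()  # copy, so that df itself is not modified
col = df1.columns.get_loc('test')  # integer position of the 'test' column
for i, (ind, row) in enumerate(df.iterrows()):
    if row['test'] != 1:
        # one .iloc call; the chained form df1.iloc[i]['test'] = 0 writes to a temporary copy
        df1.iloc[i, col] = 0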

Apply loop

df1['test'] = df['test'].apply(lambda x: x if x == 1 else 0)

Built-in functions

res = df.sum()

Numpy function

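import numpy as np

df_values = df.values  # underlying NumPy array
res = np.sum(df_values)  # total over all elements; pass axis=0 for per-column sums like df.sum()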
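The built-in and NumPy snippets above use a sum as their example. For the same "set everything other than 1 to 0" rule used in the loop versions, a vectorized equivalent might look like the sketch below. The sample data is hypothetical; `Series.where` and `np.where` are the standard pandas and NumPy calls.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name 'test' comes from the text.
df = pd.DataFrame({"test": [1, 2, 1, 0, 5]})

# pandas built-in: keep values where the condition holds, otherwise use 0.
with_where = df["test"].where(df["test"] == 1, 0)

# NumPy: the same rule applied to the underlying array.
with_numpy = np.where(df["test"].to_numpy() == 1, 1, 0)

# Reference result from the apply version, for comparison.
with_apply = df["test"].apply(lambda x: x if x == 1 else 0)

print(with_where.tolist())  # [1, 0, 1, 0, 0]
print(with_numpy.tolist())  # [1, 0, 1, 0, 0]
```

Both vectorized forms evaluate the condition on the whole column at once instead of calling a Python function per row, which is where the speedup over `apply` usually comes from.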

Step 3: Improve resource utilization

As you become more familiar with Pandas and NumPy, you will find your code running faster and faster. If the final code is implemented with built-in functions, it is usually already quite fast.

It can still be optimized further, though. Many Pandas operations run serially on a single thread, so you can manually start multiple threads to parallelize the computation and use all CPU cores, or use cuDF to accelerate it on the GPU.
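One way to sketch the multi-threaded approach is to split the DataFrame into row chunks and process them in a thread pool from the standard library. The helper names `process_chunk` and `parallel_apply`, the chunking scheme, and the sample data are all assumptions made for illustration. Threads only help when the per-chunk work releases Python's global interpreter lock, as many NumPy-backed operations do; for pure-Python work, `ProcessPoolExecutor` is the usual substitute.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-chunk work: set every 'test' value other than 1 to 0."""
    out = chunk.copy()
    out["test"] = out["test"].where(out["test"] == 1, 0)
    return out


def parallel_apply(df: pd.DataFrame, n_workers: int = 4) -> pd.DataFrame:
    """Split df into row chunks, process them in a thread pool, and reassemble."""
    # Boundaries of n_workers roughly equal, contiguous row ranges.
    bounds = np.linspace(0, len(df), n_workers + 1, dtype=int)
    chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in submission order, so row order is preserved.
        results = list(pool.map(process_chunk, chunks))
    return pd.concat(results)


df = pd.DataFrame({"test": [1, 2, 1, 0, 5, 1, 9, 1]})
out = parallel_apply(df, n_workers=3)
print(out["test"].tolist())  # [1, 0, 1, 0, 0, 1, 0, 1]
```

On data this small the pool overhead outweighs any gain; the pattern only pays off when each chunk carries substantial work.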

Scenario 2: Not knowing where to start in a new field

Start by reading the work that already exists in the new field, and try to stand on the shoulders of giants.

Read the top-conference papers in the target field from the last 3-5 years, especially survey papers.
Collect public competitions or leaderboards and study the top-ranked solutions, including their ideas and code.

There is no other way: collect more, organize more, and learn the ideas and common patterns of the field.

Scenario 3: Reproducing other people's papers

Scientific research does not go from 0 to 1. Be sure to learn as much as you can about the existing work and the existing paper code. After reading a paper's code, you can reproduce it step by step as follows:

Step 1: Find papers with open-source code

Look on GitHub for earlier papers that come with code. Although these projects are relatively old, they are of great reference value.

Step 2: Work out how the dataset is loaded

Figure out how the dataset is built, how it is loaded, how it is fed to the model, how the computation runs, and what is output. Also work out how the data is preprocessed and how it is encoded.
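To make the loading step concrete, the sketch below builds a small in-memory dataset and iterates over it in mini-batches with plain NumPy. The function `iterate_batches`, the array shapes, and the labels are hypothetical; a real reproduction would follow the paper's own data format and usually the loader of whichever framework its code uses.

```python
import numpy as np


def iterate_batches(features, labels, batch_size, shuffle=True, seed=0):
    """Yield (features, labels) mini-batches from two aligned arrays."""
    assert len(features) == len(labels), "features and labels must align"
    indices = np.arange(len(features))
    if shuffle:
        # Fixed seed so that the batch order is repeatable between runs.
        np.random.default_rng(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield features[batch_idx], labels[batch_idx]


# Hypothetical dataset: 10 samples with 4 features each, binary labels.
X = np.arange(40, dtype=np.float32).reshape(10, 4)
y = np.arange(10) % 2

for xb, yb in iterate_batches(X, y, batch_size=4):
    print(xb.shape, yb.shape)  # two batches of 4 samples, then one of 2
```

Printing the shape of every batch is a cheap way to confirm what actually reaches the model before any training starts.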

Step 3: Build the paper's model

Work out the model structure from the ideas in the paper: how many layers there are, and the details and dimensions of each layer. Build it up step by step, and make sure the model can train and predict normally.
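A simple way to check layer dimensions while building the model is to push a dummy batch through it and assert the shape after each layer. The sketch below does this for a small fully connected network in plain NumPy. The layer sizes, the ReLU activation, and the weight initialization are assumptions for illustration, not taken from any particular paper.

```python
import numpy as np

# Hypothetical layer sizes: input 4 -> hidden 8 -> hidden 8 -> output 3.
layer_dims = [4, 8, 8, 3]

rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.1, size=(d_in, d_out))
           for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:])]
biases = [np.zeros(d_out) for d_out in layer_dims[1:]]


def forward(x):
    """Forward pass with ReLU on hidden layers, checking each layer's output width."""
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = x @ w + b
        assert x.shape[1] == layer_dims[i + 1], f"layer {i} has wrong width"
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x


batch = rng.normal(size=(5, 4))  # dummy batch: 5 samples, 4 features
logits = forward(batch)
print(logits.shape)  # (5, 3)
```

The same idea carries over to any deep learning framework: run one dummy batch end to end and compare every intermediate shape with the dimensions stated in the paper.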

Step 4: Identify the training details

From the experimental section of the paper, identify the specific batch size, number of epochs, learning rate, and optimizer, and make sure the training process runs without problems.
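It can help to collect these settings in one place so that they are easy to compare against the paper. The sketch below uses a dataclass for that purpose. The class name `TrainConfig` and all default values are placeholders, not recommendations; they should be replaced with the values the paper reports.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters to copy from the paper; the values below are placeholders."""
    batch_size: int = 32
    epochs: int = 10
    learning_rate: float = 1e-3
    optimizer: str = "adam"
    seed: int = 0

    def steps_per_epoch(self, num_samples: int) -> int:
        """Number of optimizer steps per epoch, counting a final partial batch."""
        return math.ceil(num_samples / self.batch_size)

    def total_steps(self, num_samples: int) -> int:
        """Total number of optimizer steps over the whole run."""
        return self.epochs * self.steps_per_epoch(num_samples)


config = TrainConfig()
print(config)
print(config.total_steps(num_samples=1000))  # 10 * ceil(1000 / 32) = 320
```

Knowing the total step count up front is useful when the paper describes a learning rate schedule in steps rather than epochs.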