Using GAN-based oversampling to improve model accuracy for COVID-19 mortality prediction on imbalanced data

Author | THILAKADIBOINA
Compiled by | Flin
Source | analyticsvidhya

Introduction

This article introduces the use of generative adversarial networks (GANs) as an oversampling technique on real COVID-19 data used to predict mortality. It shows how data preparation steps, such as handling imbalanced data, can improve model performance.

The data and core model in this article come from a recent study (July 2020) by Celestine Iwendi, Ali Kashif Bashir, and Atharva Peshkar, "COVID-19 Patient Health Prediction Using Boosted Random Forest Algorithm". That study used a random forest algorithm boosted by AdaBoost to predict the mortality of individual patients, with an accuracy of 94%. This article keeps the same model and model parameters and analyzes how GAN-based oversampling improves on the existing model.

For aspiring data scientists, one of the best ways to learn good practice is to take part in hackathons on different platforms, such as Analytics Vidhya, Kaggle, or others.

Another is to take solved cases and data from these platforms or from published research, understand how they were solved, and try to improve the accuracy or reduce the error with additional steps. This builds a solid foundation and makes us think more deeply about where the other techniques we have learned fit in the data science value chain.

The model in the study was trained on 222 patient records with 13 features. The data is skewed: 159 records (72%) belong to class 0, the "recovered" class. Because of this skew, various undersampling and oversampling techniques can be applied to the data. Skewed data can lead to overfitting of the prediction model.
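The imbalance figures can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text (222 records, 159 of them in class 0); the variable names are illustrative and not taken from the study.

```python
# Counts taken from the text; variable names are illustrative.
total_records = 222
majority_count = 159                               # class 0, recovered
minority_count = total_records - majority_count    # class 1, died

majority_share = majority_count / total_records
# Synthetic minority records needed so that both classes have the same size
records_to_generate = majority_count - minority_count

print(f"minority records: {minority_count}")
print(f"majority share: {majority_share:.0%}")
print(f"records to generate: {records_to_generate}")
```

The minority class therefore has 63 records, and 96 more are needed to match the 159 majority records.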

To overcome this limitation, many studies use oversampling to balance the data set so that the model can be trained more accurately. Oversampling compensates for data set imbalance by increasing the number of samples in the minority class.
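As a point of reference, the simplest form of oversampling duplicates existing minority rows at random. The following is a minimal sketch of that idea on toy data, assuming a binary label and a minority class that is smaller than the majority; the function name `random_oversample` is hypothetical and this is not the method used in the study.

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = len(majority_idx) - len(minority_idx)
    # Sample with replacement from the existing minority rows
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    X_balanced = np.vstack([X, X[extra_idx]])
    y_balanced = np.concatenate([y, y[extra_idx]])
    return X_balanced, y_balanced

# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1)
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X_toy, y_toy, minority_label=1)
print(np.bincount(y_bal))
```

Because the added rows are exact copies, this approach adds no new information, which is one reason it tends to overfit.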

Conventional methods include random oversampling (ROS) and the synthetic minority oversampling technique (SMOTE), among others. For more information about handling imbalanced classes with conventional methods, see:

https://www.analyticsvidhya.c...

Recently, a machine learning model based on the concept of adversarial learning has been proposed: the generative adversarial network (GAN). GANs are easy to apply to oversampling, because a neural network trained adversarially can generate artificial data that resembles the original data. GAN-based oversampling overcomes limitations of conventional methods, such as overfitting, and makes it possible to build high-accuracy prediction models on imbalanced data.

How are synthetic data generated?

Two neural networks compete with each other to learn the target distribution and generate artificial data:

Generator network G: imitates the training samples to fool the discriminator

Discriminator network D: distinguishes training samples from generated samples

A GAN is a scenario from game theory in which the generator network must compete against an adversary. Because GANs learn to imitate the distribution of the data, they are used in many fields, such as music, video, and natural language, and more recently to deal with imbalanced data.
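The competition between G and D is usually written as a two-player minimax game. The formula below is the standard GAN objective from the general literature, not something stated in the article; here x is a real sample, z is a noise vector, D(x) is the discriminator's estimated probability that x is real, and G(z) is a generated sample.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

D tries to make both terms large by scoring real samples near 1 and generated samples near 0, while G tries to make the second term small by producing samples that D scores near 1.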

The data and basic models used in the study can be found here

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
# The Keras imports use the tensorflow.keras namespace; the older
# keras.layers.advanced_activations and keras.layers.merge paths are not available in recent Keras releases
from tensorflow.keras.layers import Input, Dense, Reshape, Flatten, Dropout, BatchNormalization, Embedding, LeakyReLU, concatenate
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical, plot_model
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder
import scipy.stats
import datetime as dt
import pydot
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

df = pd.read_csv('Covid_Train_Oct32020.csv')  
df = df.drop('id',axis=1)  
df = df.fillna(np.nan,axis=0)  
df['age'] = df['age'].fillna(value=df['age'].mean())  

df['sym_on'] = pd.to_datetime(df['sym_on'])  
df['hosp_vis'] = pd.to_datetime(df['hosp_vis'])  
df['sym_on']= df['sym_on'].map(dt.datetime.toordinal)  
df['hosp_vis']= df['hosp_vis'].map(dt.datetime.toordinal)  
df['diff_sym_hos']= df['hosp_vis'] - df['sym_on']  
df=df.drop(['sym_on', 'hosp_vis'], axis=1)  
# Treat every categorical feature as a string
categorical_columns = ['location', 'country', 'gender', 'vis_wuhan', 'from_wuhan',
                       'symptom1', 'symptom2', 'symptom3', 'symptom4', 'symptom5', 'symptom6']
df[categorical_columns] = df[categorical_columns].astype(str)
df.dtypes

Data description

Column	Description	Values (for categorical variables)	Type
id	Patient number	N/A	Numeric
location	Location the patient belongs to	Cities around the world	String, categorical
country	Country of the patient	Several countries	String, categorical
gender	Gender of the patient	Male, female	String, categorical
age	Age of the patient	N/A	Numeric
sym_on	Date the patient first noticed symptoms	N/A	Date
hosp_vis	Date the patient visited the hospital	N/A	Date
vis_wuhan	Whether the patient visited Wuhan, China	Yes (1), no (0)	Numeric, categorical
from_wuhan	Whether the patient is from Wuhan, China	Yes (1), no (0)	Numeric, categorical
death	Whether the patient died from COVID-19	Yes (1), no (0)	Numeric, categorical
Recov	Whether the patient recovered	Yes (1), no (0)	Numeric, categorical
symptom1, symptom2, symptom3, symptom4, symptom5, symptom6	Symptoms noticed by the patient	Various symptoms noticed by patients	String, categorical

The study considered 11 categorical input features and 2 numerical input features. The target variable is death/recovery. A new column, "diff_sym_hos", was added to hold the number of days between the date symptoms were first noticed and the date of the hospital visit.
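The derived column is simply a difference between two dates, expressed in days. The sketch below shows this on two made-up records; the column names follow the data description, and it uses datetime subtraction rather than the ordinal-number approach in the article's code, which gives the same result.

```python
import pandas as pd

# Hypothetical toy records; only the column names come from the article.
toy = pd.DataFrame({
    "sym_on": ["2020-01-10", "2020-01-20"],
    "hosp_vis": ["2020-01-13", "2020-01-20"],
})
toy["sym_on"] = pd.to_datetime(toy["sym_on"])
toy["hosp_vis"] = pd.to_datetime(toy["hosp_vis"])

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days
toy["diff_sym_hos"] = (toy["hosp_vis"] - toy["sym_on"]).dt.days
print(toy["diff_sym_hos"].tolist())
```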

The focus is on augmenting the minority class, death == 1, so that subset is extracted from the training data. The subset is split into categorical and numerical features and passed to the GAN model.

df_minority_data=df.loc[df['death'] == 1]  
 
#Subsetting input features without target variable
df_minority_data_withouttv=df_minority_data.loc[:, df_minority_data.columns != 'death']
numerical_df = df_minority_data_withouttv.select_dtypes("number")  
categorical_df = df_minority_data_withouttv.select_dtypes("object")  
scaling = MinMaxScaler()  
numerical_df_rescaled = scaling.fit_transform(numerical_df)  
get_dummy_df = pd.get_dummies(categorical_df)    

#Separating Each Category
location_dummy_col = [col for col in get_dummy_df.columns if 'location' in col]  
location_dummy = get_dummy_df[location_dummy_col]  
country_dummy_col = [col for col in get_dummy_df.columns if 'country' in col]  
country_dummy = get_dummy_df[country_dummy_col]  
gender_dummy_col = [col for col in get_dummy_df.columns if 'gender' in col]  
gender_dummy = get_dummy_df[gender_dummy_col]  
vis_wuhan_dummy_col = [col for col in get_dummy_df.columns if 'vis_wuhan' in col]  
vis_wuhan_dummy = get_dummy_df[vis_wuhan_dummy_col]  
from_wuhan_dummy_col = [col for col in get_dummy_df.columns if 'from_wuhan' in col]  
from_wuhan_dummy = get_dummy_df[from_wuhan_dummy_col]  
symptom1_dummy_col = [col for col in get_dummy_df.columns if 'symptom1' in col]  
symptom1_dummy = get_dummy_df[symptom1_dummy_col]  
symptom2_dummy_col = [col for col in get_dummy_df.columns if 'symptom2' in col]  
symptom2_dummy = get_dummy_df[symptom2_dummy_col]  
symptom3_dummy_col = [col for col in get_dummy_df.columns if 'symptom3' in col]  
symptom3_dummy = get_dummy_df[symptom3_dummy_col]  
symptom4_dummy_col = [col for col in get_dummy_df.columns if 'symptom4' in col]  
symptom4_dummy = get_dummy_df[symptom4_dummy_col]  
symptom5_dummy_col = [col for col in get_dummy_df.columns if 'symptom5' in col]  
symptom5_dummy = get_dummy_df[symptom5_dummy_col]  
symptom6_dummy_col = [col for col in get_dummy_df.columns if 'symptom6' in col]  
symptom6_dummy = get_dummy_df[symptom6_dummy_col]

Defining the generator

The generator takes input from the latent space and generates a new synthetic sample. The leaky rectified linear unit (LeakyReLU) is the activation function used in both the generator and discriminator models; it lets a small fraction of each negative value through instead of setting it to zero.

It is used with the recommended default slope of 0.2 and the matching weight initializer, "he_uniform". In addition, batch normalization is used between layers to standardize the activations of the previous layer (zero mean and unit variance) and stabilize training.
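What these two building blocks do to a batch of activations can be shown in plain NumPy. This is an illustrative sketch only: the helper names are hypothetical, and the Keras BatchNormalization layer additionally learns a scale and shift and keeps moving averages, which are omitted here.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    """Pass positive values through; scale negative values by alpha."""
    return np.where(x > 0, x, alpha * x)

def standardize_batch(x, eps=1e-3):
    """Normalize each column of a batch to roughly zero mean and unit variance."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# A made-up batch of 3 samples with 2 activations each
activations = np.array([[-2.0, 1.0], [0.5, -1.0], [3.0, 4.0]])
after_leaky = leaky_relu(activations)
after_norm = standardize_batch(after_leaky)

print(after_leaky)
print(after_norm.mean(axis=0))
```

With a slope of 0.2, an input of -2.0 becomes -0.4 rather than 0, so the gradient for negative inputs does not vanish.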

In the output layer, the softmax activation function is used for categorical variables and the sigmoid function for continuous variables.

def define_generator(catsh1,catsh2,catsh3,catsh4,catsh5,catsh6,catsh7,catsh8,catsh9,catsh10,catsh11,numerical):
    #Inputting noise from latent space; the size must match latent_dim used in training (100)
    noise = Input(shape = (100,))
    hidden_1 = Dense(8, kernel_initializer = "he_uniform")(noise)
    hidden_1 = LeakyReLU(0.2)(hidden_1)
    hidden_1 = BatchNormalization(momentum = 0.8)(hidden_1)
    hidden_2 = Dense(16, kernel_initializer = "he_uniform")(hidden_1)
    hidden_2 = LeakyReLU(0.2)(hidden_2)
    hidden_2 = BatchNormalization(momentum = 0.8)(hidden_2)

    #Branches 1 to 11, one per categorical feature (location, country, ...).
    #The original shows only branch 1 and states that the remaining 10 are defined
    #likewise, so the same layout is assumed for every categorical branch.
    categorical_outputs = []
    for n_categories in (catsh1,catsh2,catsh3,catsh4,catsh5,catsh6,catsh7,catsh8,catsh9,catsh10,catsh11):
        branch = Dense(32, kernel_initializer = "he_uniform")(hidden_2)
        branch = LeakyReLU(0.2)(branch)
        branch = BatchNormalization(momentum = 0.8)(branch)
        branch = Dense(64, kernel_initializer = "he_uniform")(branch)
        branch = LeakyReLU(0.2)(branch)
        branch = BatchNormalization(momentum = 0.8)(branch)
        #Output layer: softmax over the dummy columns of this feature
        categorical_outputs.append(Dense(n_categories, activation = "softmax")(branch))

    #Branch 12 for generating numerical data
    branch_12 = Dense(64, kernel_initializer = "he_uniform")(hidden_2)
    branch_12 = LeakyReLU(0.2)(branch_12)
    branch_12 = BatchNormalization(momentum = 0.8)(branch_12)
    branch_12 = Dense(128, kernel_initializer = "he_uniform")(branch_12)
    branch_12 = LeakyReLU(0.2)(branch_12)
    branch_12 = BatchNormalization(momentum = 0.8)(branch_12)

    #Output Layer12: sigmoid, because the numerical features are scaled to [0, 1]
    branch_12_output = Dense(numerical, activation = "sigmoid")(branch_12)

    #Combined output
    combined_output = concatenate(categorical_outputs + [branch_12_output])

    #Return model
    return Model(inputs = noise, outputs = combined_output)


generator = define_generator(location_dummy.shape[1],country_dummy.shape[1],gender_dummy.shape[1],vis_wuhan_dummy.shape[1],from_wuhan_dummy.shape[1],symptom1_dummy.shape[1],symptom2_dummy.shape[1],symptom3_dummy.shape[1],symptom4_dummy.shape[1],symptom5_dummy.shape[1],symptom6_dummy.shape[1],numerical_df_rescaled.shape[1])
generator.summary()
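The choice of output activations can be illustrated in plain NumPy. In this sketch the logits are made up: softmax turns the raw outputs of a categorical branch into probabilities that sum to 1 (so the most likely category can be picked), while sigmoid squeezes a numerical output into the range (0, 1), matching data scaled with MinMaxScaler.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps the exponentials stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw outputs: 2 samples, one categorical feature with 3 categories
cat_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
cat_probs = softmax(cat_logits)
# Hypothetical raw outputs for one numerical feature
num_scaled = sigmoid(np.array([[-1.0], [2.0]]))

print(cat_probs.sum(axis=1))     # each row sums to 1
print(cat_probs.argmax(axis=1))  # index of the most likely category per sample
print(num_scaled.ravel())
```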

Defining the discriminator

The discriminator model takes a sample from our data (a vector) and outputs a classification prediction of whether the sample is real or fake. This is a binary classification problem, so the output layer uses the sigmoid activation function and the model is compiled with the binary cross-entropy loss. The Adam optimizer is used with a learning rate of 0.0002 and the suggested beta1 momentum of 0.5.

def define_discriminator(inputs_n):  
    #Input from generator
    d_input = Input(shape = (inputs_n,))    
    d = Dense(128, kernel_initializer="he_uniform")(d_input)  
    d = LeakyReLU(0.2)(d)  
    d = Dense(64, kernel_initializer="he_uniform")(d)  
    d = LeakyReLU(0.2)(d)  
    d = Dense(32, kernel_initializer="he_uniform")(d)  
    d = LeakyReLU(0.2)(d)  
    d = Dense(16, kernel_initializer="he_uniform")(d)  
    d = LeakyReLU(0.2)(d)  
    d = Dense(8, kernel_initializer="he_uniform")(d)  
    d = LeakyReLU(0.2)(d)  

    #Output Layer
    d_output = Dense(1, activation = "sigmoid")(d)  

    #compile and return model
    model = Model(inputs = d_input, outputs = d_output)  
    # `learning_rate` replaces the deprecated `lr` argument name
    model.compile(loss = "binary_crossentropy", optimizer = Adam(learning_rate=0.0002, beta_1=0.5), metrics = ["accuracy"])
    return model  


inputs_n = location_dummy.shape[1]+country_dummy.shape[1]+gender_dummy.shape[1]+vis_wuhan_dummy.shape[1]+from_wuhan_dummy.shape[1]+symptom1_dummy.shape[1]+symptom2_dummy.shape[1]+symptom3_dummy.shape[1]+symptom4_dummy.shape[1]+symptom5_dummy.shape[1]+symptom6_dummy.shape[1]+numerical_df_rescaled.shape[1]  
discriminator = define_discriminator(inputs_n)  
discriminator.summary()
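To make the binary cross-entropy loss concrete, the sketch below evaluates it by hand for some made-up discriminator outputs. The function `binary_cross_entropy` is a hypothetical helper that mirrors the usual definition, not a Keras internal. Real samples are labeled 1 and generated samples 0; the generator's loss uses label 1 on generated samples because it wants them to be judged real.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# Hypothetical discriminator outputs (probability that a sample is real)
pred_on_real = np.array([0.9, 0.8, 0.7])
pred_on_fake = np.array([0.2, 0.1, 0.4])

d_loss_real = binary_cross_entropy(np.ones(3), pred_on_real)
d_loss_fake = binary_cross_entropy(np.zeros(3), pred_on_fake)
d_loss = 0.5 * (d_loss_real + d_loss_fake)

# The generator is rewarded when fake samples are scored as real (label 1)
g_loss = binary_cross_entropy(np.ones(3), pred_on_fake)

print(f"discriminator loss: {d_loss:.4f}, generator loss: {g_loss:.4f}")
```

In this example the discriminator is doing well, so its loss is low and the generator's loss is high; during training the two are expected to push against each other.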

Combine the generator and discriminator into a GAN model and run the training. Training runs for 7,000 epochs and uses the complete minority-class training data.

def define_complete_gan(generator, discriminator):
    discriminator.trainable = False  
    gan_output = discriminator(generator.output)  
    
    #Initialize gan
    model = Model(inputs = generator.input, outputs = gan_output)  

    #Model Compilation
    # `learning_rate` replaces the deprecated `lr` argument name
    model.compile(loss = "binary_crossentropy", optimizer = Adam(learning_rate=0.0002, beta_1=0.5))
    return model  

completegan = define_complete_gan(generator, discriminator)  

def gan_train(gan, generator, discriminator, catsh1,catsh2,catsh3,catsh4,catsh5,catsh6,catsh7,catsh8,catsh9,catsh10,catsh11,numerical, latent_dim, n_epochs, n_batch, n_eval):  
    #Update discriminator with half batch size
    half_batch = int(n_batch / 2)  
    discriminator_loss = []  
    generator_loss = []  
    #generate class labels for fake and real
    valid = np.ones((half_batch, 1))  
    y_gan = np.ones((n_batch, 1))  
    fake = np.zeros((half_batch, 1))  

    #training
    for i in range(n_epochs):  
        #select random batch from real categorical and numerical data
        idx = np.random.randint(0, catsh1.shape[0], half_batch)       
        location_real = catsh1[idx]  
        country_real = catsh2[idx]  
        gender_real = catsh3[idx]  
        vis_wuhan_real = catsh4[idx]  
        from_wuhan_real = catsh5[idx]  
        symptom1_real = catsh6[idx]  
        symptom2_real = catsh7[idx]  
        symptom3_real = catsh8[idx]  
        symptom4_real = catsh9[idx]   
        symptom5_real = catsh10[idx]  
        symptom6_real = catsh11[idx]          
        numerical_real = numerical[idx]  # use the function argument rather than the global variable

        #concatenate categorical and numerical data for the discriminator
        real_data = np.concatenate([location_real, country_real, gender_real,vis_wuhan_real,from_wuhan_real,symptom1_real,symptom2_real,symptom3_real,symptom4_real,symptom5_real,symptom6_real,numerical_real], axis = 1)  
  
        #generate fake samples from the noise
        noise = np.random.normal(0, 1, (half_batch, latent_dim))  
        fake_data = generator.predict(noise)  
     
        #train the discriminator and return losses and acc
        d_loss_real, da_real = discriminator.train_on_batch(real_data, valid)  
        d_loss_fake, da_fake = discriminator.train_on_batch(fake_data, fake)  
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)  
        discriminator_loss.append(d_loss)  
       
        #generate noise for generator input and train the generator (to have the discriminator label samples as valid)
        noise = np.random.normal(0, 1, (n_batch, latent_dim))  
        g_loss = gan.train_on_batch(noise, y_gan)  
        generator_loss.append(g_loss)  

        #evaluate progress
        if (i+1) % n_eval == 0:  
            print ("Epoch: %d [Discriminator loss: %f] [Generator loss: %f]" % (i + 1, d_loss, g_loss))  
            
    plt.figure(figsize = (20, 10))  
    plt.plot(generator_loss, label = "Generator loss")  
    plt.plot(discriminator_loss, label = "Discriminator loss")     
    plt.title("Stats from training GAN")  
    plt.grid()  
    plt.legend()  

latent_dim = 100  
gan_train(completegan, generator, discriminator, location_dummy.values,country_dummy.values,gender_dummy.values,vis_wuhan_dummy.values,from_wuhan_dummy.values,symptom1_dummy.values,symptom2_dummy.values,symptom3_dummy.values,symptom4_dummy.values,symptom5_dummy.values,symptom6_dummy.values,numerical_df_rescaled, latent_dim, n_epochs = 7000, n_batch = 63, n_eval = 200)

The trained model is used to generate 96 additional minority-class records so that both classes are the same size (159). The generated numerical data is then compared with the original data on mean, standard deviation, and variance, and the categorical data is compared by the count of each category.

noise = np.random.normal(0, 1, (96, 100))  
generated_mixed_data = generator.predict(noise)  
columns=list(location_dummy.columns)+list(country_dummy.columns)+list(gender_dummy.columns)+list(vis_wuhan_dummy.columns)+list(from_wuhan_dummy.columns)+list(symptom1_dummy.columns)+list(symptom2_dummy.columns)+list(symptom3_dummy.columns)+list(symptom4_dummy.columns)+list(symptom5_dummy.columns)+list(symptom6_dummy.columns)+list(numerical_df.columns)  
mixed_gen_df = pd.DataFrame(data = generated_mixed_data, columns = columns)  
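# Round every one-hot (categorical) column; the last two columns are the numerical features
mixed_gen_df.iloc[:,:-2] = np.round(mixed_gen_df.iloc[:,:-2])
# Map the numerical features from [0, 1] back to their original scale
mixed_gen_df.iloc[:,-2:] = scaling.inverse_transform(mixed_gen_df.iloc[:,-2:])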

#Original Data
original_df = pd.concat([location_dummy,country_dummy,gender_dummy,vis_wuhan_dummy,from_wuhan_dummy,symptom1_dummy,symptom2_dummy,symptom3_dummy,symptom4_dummy,symptom5_dummy,symptom6_dummy,numerical_df], axis = 1)  
def normal_distribution(org, noise):  
    org_x = np.linspace(org.min(), org.max(), len(org))  
    noise_x = np.linspace(noise.min(), noise.max(), len(noise))  
    org_y = scipy.stats.norm.pdf(org_x, org.mean(), org.std())  
    noise_y = scipy.stats.norm.pdf(noise_x, noise.mean(), noise.std())  
    n, bins, patches = plt.hist([org, noise], density = True, alpha = 0.5, color = ["green", "red"])  
    xmin, xmax = plt.xlim()  
    plt.plot(org_x, org_y, color = "green", label = "Original data", alpha = 0.5)  
    plt.plot(noise_x, noise_y, color = "red", label = "Generated data", alpha = 0.5)  
    title = f"Original data mean {np.round(org.mean(), 4)}, Original data std {np.round(org.std(), 4)}, Original data var {np.round(org.var(), 4)}\nGenerated data mean {np.round(noise.mean(), 4)}, Generated data std {np.round(noise.std(), 4)}, Generated data var {np.round(noise.var(), 4)}"
    plt.title(title)  
    plt.legend()  
    plt.grid()  
    plt.show()  

numeric_columns = numerical_df.columns

# Compare each numerical column of the original data with its generated counterpart
for column in numeric_columns:
    print(column, "Comparison between Original Data and Generated Data")
    normal_distribution(original_df[column], mixed_gen_df[column])

[Figure: age comparison between original data and generated data]

[Figure: comparison between original data and generated data]

Category comparison between original data and generated data

Feature	Original data (0)	Original data (1)	Generated data (0)	Generated data (1)
location_Hokkaido	61	2	95	1
gender_female	49	14	60	36
symptom2_ cough	62	1	96	0

The data generated by GAN oversampling is very close to the original data, differing from it by about 1%. For some rare categories, not every category value is generated.
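The article does not say how the 1% figure was computed. One plausible reading is the relative difference between summary statistics of the original and generated columns, which can be sketched as follows. The arrays are made-up values for illustration, not the study's data, and `relative_error` is a hypothetical helper.

```python
import numpy as np

def relative_error(original, generated):
    """Relative difference between a generated and an original statistic, as a fraction."""
    return abs(generated - original) / abs(original)

# Hypothetical age samples, for illustration only
original_age = np.array([60.0, 70.0, 80.0, 66.0, 74.0])
generated_age = np.array([61.0, 69.0, 82.0, 65.0, 76.5])

mean_error = relative_error(original_age.mean(), generated_age.mean())
std_error = relative_error(original_age.std(), generated_age.std())

print(f"mean error: {mean_error:.2%}, std error: {std_error:.2%}")
```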

Follow the same data preparation steps as in the original study to see how GAN oversampling improves model performance compared with the original approach. The one-hot encoded data of the generated samples is converted back to the original data frame format.
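The conversion back from one-hot columns amounts to picking, for each row, the dummy column with the largest value and stripping the feature prefix. A toy sketch of that step, with hypothetical generator output for a single feature:

```python
import pandas as pd

# Hypothetical generator output for one categorical feature, already rounded to 0/1
generated = pd.DataFrame({
    "gender_female": [0.0, 1.0, 0.0],
    "gender_male": [1.0, 0.0, 1.0],
})

# idxmax(axis=1) returns, per row, the name of the column holding the largest value
decoded = generated.idxmax(axis=1).str.replace("gender_", "", regex=False)
print(decoded.tolist())
```

If a row contains no 1 at all after rounding, idxmax falls back to the first column, so rare categories can be misassigned.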

# Getting Back Categorical Data in Original Format from Dummies
df_generated_data = pd.DataFrame()
categorical_features = ['location', 'country', 'gender', 'vis_wuhan', 'from_wuhan',
                        'symptom1', 'symptom2', 'symptom3', 'symptom4', 'symptom5', 'symptom6']
for feature in categorical_features:
    prefix = feature + '_'
    feature_cols = [col for col in mixed_gen_df if col.startswith(prefix)]
    # idxmax picks, for each row, the dummy column with the largest value
    decoded = mixed_gen_df[feature_cols].idxmax(axis=1)
    # Strip the feature prefix to recover the original category value
    df_generated_data[feature] = decoded.str.replace(prefix, '', regex=False)

df_generated_data['death']=1  

df_generated_data[['age','diff_sym_hos']]=mixed_gen_df[['age','diff_sym_hos']]  
df_generated_data = df_generated_data.fillna(np.nan,axis=0)  

#Encoding Data
# LabelEncoder is imported directly from sklearn.preprocessing at the top
encoder_location = LabelEncoder()
encoder_country = LabelEncoder()
encoder_gender = LabelEncoder()
encoder_symptom1 = LabelEncoder()
encoder_symptom2 = LabelEncoder()
encoder_symptom3 = LabelEncoder()
encoder_symptom4 = LabelEncoder()
encoder_symptom5 = LabelEncoder()
encoder_symptom6 = LabelEncoder()

# Loading and Preparing Data
df = pd.read_csv('Covid_Train_Oct32020.csv')  
df = df.drop('id',axis=1)  
df = df.fillna(np.nan,axis=0)  
df['age'] = df['age'].fillna(value=df['age'].mean())
df['sym_on'] = pd.to_datetime(df['sym_on'])  
df['hosp_vis'] = pd.to_datetime(df['hosp_vis'])  
df['sym_on']= df['sym_on'].map(dt.datetime.toordinal)  
df['hosp_vis']= df['hosp_vis'].map(dt.datetime.toordinal)  
df['diff_sym_hos']= df['hosp_vis'] - df['sym_on']  
df = df.drop(['sym_on','hosp_vis'],axis=1)  
      
df['location'] = encoder_location.fit_transform(df['location'].astype(str))  
df['country'] = encoder_country.fit_transform(df['country'].astype(str))  
df['gender'] = encoder_gender.fit_transform(df['gender'].astype(str))  
df['symptom1'] = encoder_symptom1.fit_transform(df['symptom1'].astype(str))
df['symptom2'] = encoder_symptom2.fit_transform(df['symptom2'].astype(str))
df['symptom3'] = encoder_symptom3.fit_transform(df['symptom3'].astype(str))
df['symptom4'] = encoder_symptom4.fit_transform(df['symptom4'].astype(str))
df['symptom5'] = encoder_symptom5.fit_transform(df['symptom5'].astype(str))
df['symptom6'] = encoder_symptom6.fit_transform(df['symptom6'].astype(str))

# Encoding Generated Data  
df_generated_data['location'] = encoder_location.transform(df_generated_data['location'].astype(str))  
df_generated_data['country'] = encoder_country.transform(df_generated_data['country'].astype(str))  
df_generated_data['gender'] = encoder_gender.transform(df_generated_data['gender'].astype(str))  
df_generated_data['symptom1'] = encoder_symptom1.transform(df_generated_data['symptom1'].astype(str))
df_generated_data['symptom2'] = encoder_symptom2.transform(df_generated_data['symptom2'].astype(str))
df_generated_data['symptom3'] = encoder_symptom3.transform(df_generated_data['symptom3'].astype(str))
df_generated_data['symptom4'] = encoder_symptom4.transform(df_generated_data['symptom4'].astype(str))
df_generated_data['symptom5'] = encoder_symptom5.transform(df_generated_data['symptom5'].astype(str))
df_generated_data['symptom6'] = encoder_symptom6.transform(df_generated_data['symptom6'].astype(str))
df_generated_data['diff_sym_hos'] = df_generated_data['diff_sym_hos'].astype(int)

Model comparison

Split the original data into training and test sets, then add the GAN-generated data to the training set to compare performance against the base model. Model performance is measured on the actual (original) held-out test data.
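The key point of this setup is that synthetic rows go into the training set only, so the test set contains real records exclusively. A minimal NumPy sketch of that split, using random stand-in arrays rather than the study's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 10 original rows and 4 synthetic minority rows
X_original = rng.normal(size=(10, 3))
y_original = np.array([0] * 7 + [1] * 3)
X_synthetic = rng.normal(size=(4, 3))
y_synthetic = np.ones(4, dtype=int)

# Hold out 20% of the ORIGINAL rows for testing before any augmentation
indices = rng.permutation(len(X_original))
n_test = int(0.2 * len(X_original))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_test, y_test = X_original[test_idx], y_original[test_idx]
# Synthetic rows are appended to the training portion only
X_train = np.vstack([X_original[train_idx], X_synthetic])
y_train = np.concatenate([y_original[train_idx], y_synthetic])

print(X_train.shape, X_test.shape)
```

One caveat about the article's own code: the GAN is trained on the full minority subset before the split, so the generated rows may carry some information from records that later land in the test set.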

import numpy as np
import pandas as pd
import datetime as dt
import sklearn
from scipy import stats
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score as rs
from sklearn.metrics import precision_score as ps
from sklearn.metrics import f1_score as fs
from sklearn.metrics import balanced_accuracy_score as bas
from sklearn.metrics import confusion_matrix as cm
from sklearn.metrics import log_loss

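# max_features='auto' (equivalent to 'sqrt' for classifiers) and min_impurity_split
# are not accepted by recent scikit-learn releases, so 'sqrt' is used and
# min_impurity_split is dropped
rf = RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                            criterion='gini', max_depth=2, max_features='sqrt',
                            max_leaf_nodes=None, max_samples=None,
                            min_impurity_decrease=0.0,
                            min_samples_leaf=2, min_samples_split=2,
                            min_weight_fraction_leaf=0.0, n_estimators=100,
                            n_jobs=None, oob_score=False, random_state=None,
                            verbose=0, warm_start=False)
# Keyword form of the original positional arguments (rf, 50, 0.01, 'SAMME.R', 10).
# 'SAMME.R' is deprecated in recent scikit-learn releases; it is kept to match the original study.
classifier = AdaBoostClassifier(rf, n_estimators=50, learning_rate=0.01, algorithm='SAMME.R', random_state=10)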

#Separate target variable in generated data
X1 = df_generated_data.loc[:, df_generated_data.columns != 'death']  
Y1 = df_generated_data['death']  

#Separate target variable in original data
X = df.loc[:, df.columns != 'death']  
Y = df['death']  

#Splitting Original Data
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=0)  

#Appending Generated Data to X_train
# pd.concat is used because DataFrame.append was removed in pandas 2.0
X_train1 = pd.concat([X_train, X1], sort=False)
Y_train1 = pd.concat([Y_train, Y1])
classifier.fit(X_train1, Y_train1)
pred = np.array(classifier.predict(X_test))

recall = rs(Y_test,pred)
precision = ps(Y_test,pred)
f1 = fs(Y_test,pred)
ma = classifier.score(X_test,Y_test)
print('*** Evaluation metrics for test dataset ***\n')
print('Recall Score: ',recall)
print('Precision Score: ',precision)
print('F1 Score: ',f1)
print('Accuracy: ',ma)

Metric	Base model score*	Score with GAN-generated data added
Recall	0.75	0.83
Precision	1	1
F1 score	0.86	0.9
Accuracy	0.9	0.95

*Source: Table 3, base model metrics

https://www.ncbi.nlm.nih.gov/...
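As a quick consistency check on the table, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the other two rows. The helper name `f1_from` is hypothetical; the inputs are the rounded values from the table.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

base_f1 = f1_from(precision=1.0, recall=0.75)
augmented_f1 = f1_from(precision=1.0, recall=0.83)

print(round(base_f1, 2), round(augmented_f1, 2))
```

The recomputed values, about 0.86 and about 0.91, agree with the reported 0.86 and 0.9 within rounding.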

Conclusion

Compared with the base model, the proposed model gives more accurate and reliable results, which indicates that GAN-based oversampling overcomes the limitations of imbalanced data and appropriately augments the minority class.

Link to the original text: https://www.analyticsvidhya.c...
