Default Risk Early Warning for Bond-Issuing Enterprises: Preliminary-Round Solution [AI Competition]
2022-07-24 07:50:00 【Ahang 626】
- 1. The competition
- 2. Interpreting the problem
- 3. Score-boosting strategies
- 4. Solution walkthrough
- 4.1 Imports
- 4.2 Reading the data
- 4.3 Time handling
- 4.4 Constructing negative samples
- 4.5 Feature extraction and processing
- 4.6 Preparing the data
- 4.7 Neural network model
- 4.8 Helper functions
- 4.9 Network training
- 4.10 NN model output
- 4.11 LGB / XGB / CAT model building
- 4.12 LGB / XGB / CAT model training
- 4.13 Ensemble code
- 5. Key points of the solution
1. The competition
- Problem page: Early warning of default risk of bond-issuing enterprises
- Organizer: Guotai Junan
- Training set: the competition provides default data of bond-issuing enterprises for 2019-2020 for model training, in order to predict each issuer's probability of default risk in 2021; the universe of issuers is the set of enterprises that issued bonds during 2019-2021. The universe provided for prediction is the same in the preliminary and semi-final rounds; on top of the preliminary data, the semi-final adds shareholder data, outbound-investment data, and public-opinion data of related enterprises. (CSDN download, Kaggle download)
  - Basic enterprise information (bond issuers only): ent_info.csv
  - Financial indicator data for 2018-2020: ent_financial_indicator.csv
  - Public-opinion (news) records for 2018-2020 (bond issuers only): ent_news.csv
  - Default records for 2019-2020: ent_default.csv
- Test set: the enterprises listed in the provided submission example. (CSDN download, Kaggle download)
- Submission format: the result file must be named answer.csv; note that the separator is '|', not the usual ','. Submit the predicted probability, not the class label, e.g. 12345|0.89
- Background: since the myth of "rigid payment" in China's bond market was broken in 2014, bond defaults have kept heating up. In 2018 alone, 160 bonds defaulted, involving 44 issuing enterprises, with a default balance as high as 150.525 billion yuan, the most severe year on record. Against this backdrop of accelerating credit-risk exposure and increasingly routine defaults, effectively evaluating and predicting the default risk of bond issuers in advance has become an important regulatory problem. Because information is incomplete, financial data alone can no longer fully explain the default-risk premium, so making effective use of non-financial data, such as issuers' public-opinion data and upstream/downstream equity data, is of great significance for predicting default risk.
- Task: use machine learning, deep learning and related methods to train a model that learns from the information about bond-issuing enterprises and predicts whether an issuer will default in the future. The difficulty is that the dataset contains a large amount of information about each issuer (shareholder information, outbound-investment information, public-opinion information, etc.), so extracting effective features and predicting risk from them is the key problem. The numbers of positive and negative samples also differ enormously, and the organizers expect contestants to address this.
- Evaluation:
  - Preliminary-round metric: AUC
  - Semi-final metric: precision (Precision, P), recall (Recall, R) and the F1 value (F1-measure) are used to evaluate the default predictions. Precision is the probability that a sample predicted as positive is actually positive; recall is the probability that an actually positive sample is predicted as positive. F1 is computed as F1 = 2PR / (P + R).
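A minimal sketch (not from the original write-up) of how the two evaluation metrics behave, using sklearn on hypothetical labels and predicted probabilities:

```python
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 0, 1]                  # hypothetical ground-truth labels
y_prob = [0.1, 0.4, 0.8, 0.3, 0.2, 0.9]      # hypothetical predicted default probabilities

# Preliminary round: AUC is computed directly on the probabilities.
print("AUC:", roc_auc_score(y_true, y_prob))

# Semi-final: P, R and F1 need a hard threshold (0.5 here) to turn probabilities into labels.
y_pred = [int(p >= 0.5) for p in y_prob]
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print("P:", p, "R:", r, "F1 = 2PR/(P+R):", 2 * p * r / (p + r), "=", f1_score(y_true, y_pred))
```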
2. Interpreting the problem
Video walkthrough and baseline program (score 0.9696)
2.1 The problem distilled
- A binary classification problem
- Extract effective features from a large amount of information and predict with them
- The positive and negative samples are extremely imbalanced
- Evaluated by AUC
- Candidate models: logistic regression, random forest, GBDT, graph neural networks
2.2 Feature extraction
2.2.1 Categorical features
- Encodings (a pandas sketch of the encodings and statistics follows this list)
  - Natural-number (label) encoding
  - One-hot encoding
  - Count encoding (can stand in for the raw category)
  - Target encoding
- Statistics
  - count
  - nunique ("width")
  - ratio ("preference")
2.2.2 Numerical features
- Cross statistics
  - Row-wise crossing: mean, median, min/max
  - Column-wise crossing
- Discretization
  - Binning
  - Binarization (0/1)
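A small pandas sketch (not from the original write-up) of the encodings and statistics listed above, on a hypothetical frame with an id column 'ent_id', a categorical column 'city' and a binary target 'label':

```python
import pandas as pd

df = pd.DataFrame({'ent_id': ['a', 'a', 'b', 'b', 'c'],
                   'city':   ['SH', 'BJ', 'SH', 'SH', 'GZ'],
                   'label':  [0, 1, 0, 1, 0]})

# Natural-number (label) encoding
df['city_le'] = df['city'].map({v: i for i, v in enumerate(df['city'].unique())})
# One-hot encoding
df = pd.concat([df, pd.get_dummies(df['city'], prefix='city')], axis=1)
# Count encoding (can stand in for the raw category)
df['city_count'] = df['city'].map(df['city'].value_counts())
# Target encoding (per-category mean of the label; compute it out-of-fold in practice)
df['city_target'] = df['city'].map(df.groupby('city')['label'].mean())

# Group statistics per enterprise: count, nunique ("width"), ratio ("preference")
grp = df.groupby('ent_id')['city'].agg(cnt='count', nuniq='nunique').reset_index()
df = df.merge(grp, on='ent_id', how='left')
df['city_ratio'] = df.groupby(['ent_id', 'city'])['city'].transform('count') / df['cnt']

# Numerical row-wise crossing: aggregate across several numeric columns
num_cols = ['city_le', 'city_count']            # stand-ins for real numeric features
df['row_mean'] = df[num_cols].mean(axis=1)
df['row_max'] = df[num_cols].max(axis=1)
```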
2.3 Feature selection
- Filter methods (all three families are sketched after this list)
  - Correlation coefficient
  - Chi-square test
  - Mutual information
- Wrapper methods
  - Forward search
  - Backward search
- Embedded methods
  - Feature ranking based on the learned model
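A hedged sketch (not from the original write-up) of the three families on hypothetical data; the filter scores each feature independently, the wrapper searches over subsets, and the embedded method reads importances off a fitted model:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2, RFE
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))            # hypothetical non-negative features (chi2 needs non-negative values)
y = rng.integers(0, 2, 200)          # hypothetical binary labels

# Filter: univariate scores (mutual information, chi-square)
mi_sel = SelectKBest(mutual_info_classif, k=5).fit(X, y)
chi_sel = SelectKBest(chi2, k=5).fit(X, y)

# Wrapper: recursive feature elimination, roughly a backward search
rfe = RFE(LGBMClassifier(n_estimators=50), n_features_to_select=5).fit(X, y)

# Embedded: rank features by the importance learned by the model itself
model = LGBMClassifier(n_estimators=50).fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]
```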
2.4 Time-based validation
- Split the training and validation sets in time order (sketched after this list)
- Or use plain K-fold cross validation
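A minimal sketch (with a hypothetical frame; the real one is built in section 4) of what a time-ordered split looks like: earlier years train, the latest labelled year validates, instead of mixing years in a random K-fold:

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
df = pd.DataFrame({'year': np.repeat([2019, 2020], 100),
                   'f1': rng.random(200),
                   'label': rng.integers(0, 2, 200)})

train_part, valid_part = df[df.year == 2019], df[df.year == 2020]
clf = LGBMClassifier(n_estimators=100).fit(train_part[['f1']], train_part['label'])
val_prob = clf.predict_proba(valid_part[['f1']])[:, 1]
print('time-ordered AUC:', roc_auc_score(valid_part['label'], val_prob))
```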
2.5 Model selection
- XGBoost, LightGBM: few feature-processing requirements, friendly to both categorical and continuous features, and missing values need not be filled
- NN model: used for ensembling
- Fusion methods: the fused members should differ in features, samples, and models (an averaging/stacking sketch follows this list)
  - Bagging
  - Boosting
  - Voting
  - Stacking
  - Averaging
- Details worth digging into
  - The raw news title is useless on its own (segment it into words and turn it into a vector, or run BERT sentiment analysis)
  - Hyperparameters: random search, grid search (tree depth, maximum tree width, etc.)
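A hedged sketch (not from the original write-up) of the two fusion styles used later: weighted averaging of predicted probabilities, and a one-layer stacking where out-of-fold predictions of the base models feed a logistic-regression meta learner; all arrays are hypothetical placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Weighted averaging of three base-model probability vectors
p_lgb, p_xgb, p_cat = np.random.rand(3, 100)
p_avg = 0.4 * p_lgb + 0.2 * p_xgb + 0.4 * p_cat

# Stacking: out-of-fold predictions on the training set become meta features
oof = np.column_stack([p_lgb, p_xgb, p_cat])        # shape (n_train, n_models)
y = np.random.randint(0, 2, 100)                    # hypothetical labels
meta = LogisticRegression().fit(oof, y)
test_meta = np.column_stack([p_lgb, p_xgb, p_cat])  # base-model predictions on the test set
p_stack = meta.predict_proba(test_meta)[:, 1]
```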
3. Score-boosting strategies
- News-title handling: jieba word segmentation, then concatenation, then dimension reduction (sketched after this list)
- Model fusion: averaging, and stacking (single-layer, multi-layer, or combined with other techniques); e.g. feed the outputs of LGB, CAT, XGB and the like into SVM, LR and Bayesian models, then fuse them with a meta learner
- Pseudo labels: treat samples whose predicted probability is below 0.01 as negative samples and add them to the training data (a way of enlarging the data); adversarial validation on the features and the label can also be used when building pseudo labels
- Automated machine learning offers some feature tricks as well
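A hedged sketch (not from the original write-up) of the news-title pipeline described above: jieba segmentation, per-enterprise concatenation, TF-IDF, then SVD dimension reduction; the titles and ids are hypothetical and the jieba package is assumed to be installed:

```python
import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

news = pd.DataFrame({'ent_id': ['a', 'a', 'b', 'c'],
                     'title': ['公司债券违约', '评级下调公告', '经营活动正常', '新发行债券获批']})
news['tokens'] = news['title'].apply(lambda t: ' '.join(jieba.cut(t)))   # word segmentation
docs = news.groupby('ent_id')['tokens'].agg(' '.join).reset_index()      # concatenate per enterprise

tfidf = TfidfVectorizer().fit_transform(docs['tokens'])                  # sparse TF-IDF matrix
svd_feas = TruncatedSVD(n_components=2, random_state=42).fit_transform(tfidf)  # dense low-dimensional features
```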
4. Solution walkthrough
4.1 Imports
```python
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import shuffle
from gensim.models import Word2Vec
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

import os, sys

class HiddenPrints:
    """Context manager that silences print output inside a `with` block."""
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
# usage: with HiddenPrints(): ...   (nothing is printed inside the block)

import torch
import torch.nn as nn                                # neural-network layers
import torch.nn.functional as F                      # functional ops: relu, softmax, ...
import torchvision                                   # vision models and datasets
import torchvision.transforms as transforms          # image transforms
import torch.optim as optim                          # optimizers
from torch.utils.tensorboard import SummaryWriter    # writes TensorBoard-readable logs
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

gpu_flag = torch.cuda.is_available()
if gpu_flag:
    print("using gpu", torch.version.cuda, "...")
else:
    print("using cpu...")
print("Pytorch:", torch.__version__)
print("torchvision:", torchvision.__version__)
torch.set_printoptions(linewidth=120)                # line width for printed tensors
```
4.2 Reading the data

```python
# Read the raw data
news_list = []
for idx, line in enumerate(open('../input/guotaichusai/ent_news.csv', encoding='utf-8')):  # enterprise public-opinion records
    if idx == 0:
        cols = line.split('|')
    else:
        line_list = line.split('|')
        # the data has only 9 columns; glue any extra '|' fragments back into the last one
        line_list = line_list[:8] + [''.join(line_list[8:]).replace('\n', '')]
        news_list.append(line_list)
news_df = pd.DataFrame(news_list, columns=cols)

ent_default = pd.read_csv('../input/guotaichusai/ent_default.csv', sep='|')            # default records of bond issuers
ent_fina = pd.read_csv('../input/guotaichusai/ent_financial_indicator.csv', sep='|')   # financial indicator data
ent_info = pd.read_csv('../input/guotaichusai/ent_info.csv', sep='|')                  # basic enterprise information
answer = pd.read_csv('../input/guotaichusai/answer.csv', sep='|')                      # test set (submission template)
```
4.3 Time handling

```python
### Time handling
# Keep only the year
ent_default['year'] = ent_default['acu_date'].apply(lambda x: x // 10000)
ent_fina['year'] = ent_fina['report_period'].apply(lambda x: x // 10000)
news_df['year'] = news_df['publishdate'].apply(lambda x: int(x) // 10000)

ent_default['ent_id_year'] = ent_default['ent_id'] + '_' + (ent_default['year'] - 1).astype(str)
ent_fina['ent_id_year'] = ent_fina['ent_id'] + '_' + ent_fina['year'].astype(str)
news_df['ent_id_year'] = news_df['ent_id'] + '_' + news_df['year'].astype(str)
answer['ent_id_year'] = answer['ent_id'].apply(lambda x: x + '_2020')
del ent_fina['year'], news_df['year']

# Deduplicate
ent_default_new = ent_default.drop_duplicates(subset=['ent_id_year'], keep='last')

# Merge
ent_default_new['default_score'] = 1
answer['year'] = 2021
data = pd.concat([ent_default_new[['ent_id', 'ent_id_year', 'year', 'default_score']], answer],
                 axis=0, ignore_index=True)
del ent_default_new
```
4.4 Constructing negative samples

```python
# Construct negative samples: enterprises that did NOT default in a given year
# ent_ids = [i for i in answer['ent_id'].unique() if i not in ent_default[ent_default['year']==2019]['ent_id'].unique().tolist()]
ent_ids = [i for i in ent_info['ent_id'].unique()
           if i not in ent_default[ent_default['year'] == 2019]['ent_id'].unique().tolist()]
ent_ids_df = pd.DataFrame({'ent_id': ent_ids})
ent_ids_df['year'] = 2019
ent_ids_df['default_score'] = 0
ent_ids_df['ent_id_year'] = ent_ids_df['ent_id'].apply(lambda x: x + '_2018')
data = pd.concat([data, ent_ids_df], axis=0, ignore_index=True)

# ent_ids = [i for i in answer['ent_id'].unique() if i not in ent_default[ent_default['year']==2020]['ent_id'].unique().tolist()]
ent_ids = [i for i in ent_info['ent_id'].unique()
           if i not in ent_default[ent_default['year'] == 2020]['ent_id'].unique().tolist()]
ent_ids_df = pd.DataFrame({'ent_id': ent_ids})
ent_ids_df['year'] = 2020
ent_ids_df['default_score'] = 0
ent_ids_df['ent_id_year'] = ent_ids_df['ent_id'].apply(lambda x: x + '_2019')
data = pd.concat([data, ent_ids_df], axis=0, ignore_index=True)
```
4.5 Feature extraction and processing

```python
# Feature extraction
# ent_info.csv (basic enterprise information)
ent_info_cat_cols = ['industryphy', 'industryco', 'enttype', 'entstatus', 'prov', 'city', 'county', 'is_bondissuer']
ent_info_num_cols = ['regcap']
ent_info_time_cols = ['opfrom', 'opto', 'esdate', 'apprdate']  # operating-period start/end, establishment date, approval date
# ent_financial_indicator.csv (financial indicator data)
ent_fina_num_cols = [f for f in ent_fina.columns if f not in ['ent_id', 'report_period', 'ent_id_year']]
ent_fina_time_cols = ['report_period']  # reporting period

# Natural-number (label) encoding
def label_encode(series):
    unique = list(series.unique())
    # unique.sort()
    return series.map(dict(zip(unique, range(series.nunique()))))

for col in ent_info_cat_cols:
    ent_info[col] = label_encode(ent_info[col])

# Merge ent_info and ent_fina into the sample table
print(data.shape)
ent_info_new = ent_info.drop_duplicates()
data = data.merge(ent_info_new, on=['ent_id'], how='left')
print(data.shape)
ent_fina_new = ent_fina.sort_values('report_period').drop_duplicates(subset=['ent_id_year'], keep='last')
data = data.merge(ent_fina_new, on=['ent_id', 'ent_id_year'], how='left')
print(data.shape)

# Time-related features
data['opfrom_year'] = data['opfrom'].fillna('0000').apply(lambda x: int(x[:4]))
data['opto_year'] = data['opto'].fillna('0000').apply(lambda x: int(x[:4]))
data['esdate_year'] = data['esdate'].fillna('0000').apply(lambda x: int(x[:4]))
# Missing operating-period start dates and approval dates are replaced by the establishment date
data.loc[data.opfrom.isnull(), 'opfrom'] = data.loc[data.opfrom.isnull(), 'esdate']
data.loc[data.apprdate.isnull(), 'apprdate'] = data.loc[data.apprdate.isnull(), 'esdate']
# Time-difference features
data['opfrom_esdate_diff'] = data['opfrom'].apply(lambda x: int(x[:4])) - data['esdate'].apply(lambda x: int(x[:4]))
data['apprdate_esdate_diff'] = data['apprdate'].apply(lambda x: int(x[:4])) - data['esdate'].apply(lambda x: int(x[:4]))
data['year_opfrom_diff'] = data['year'] - data['opfrom'].apply(lambda x: int(x[:4]))
data['year_esdate_diff'] = data['year'] - data['esdate'].apply(lambda x: int(x[:4]))
data['year_apprdate_diff'] = data['year'] - data['apprdate'].apply(lambda x: int(x[:4]))
# Reporting period of the financial indicators
data['report_period_year'] = data['report_period'].apply(lambda x: x // 10000)
data['report_period_month'] = data['report_period'].apply(lambda x: x % 10000 // 100)

# News-related features
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD, SparsePCA

tmp_df = news_df.groupby(['ent_id_year'])['newssource'].agg([list]).reset_index()
tmp_df['list'] = tmp_df['list'].apply(lambda x: ' '.join([i for i in x]))
tfidf = TfidfVectorizer()
tf = tfidf.fit_transform(tmp_df['list'].fillna('##').values)
decom = TruncatedSVD(n_components=128, random_state=1024)
decom_x = decom.fit_transform(tf)
decom_feas = pd.DataFrame(decom_x)
decom_feas.columns = ['newssource_svd_' + str(i) for i in range(decom_feas.shape[1])]
decom_feas['ent_id_year'] = tmp_df['ent_id_year']
data = data.merge(decom_feas, on=['ent_id_year'], how='left')

for col in ['indextype', 'index']:
    tmp_df = news_df.groupby(['ent_id_year'])[col].agg([list]).reset_index()
    tmp_df['list'] = tmp_df['list'].apply(lambda x: ' '.join([str(i) for i in x]))
    countv = CountVectorizer()
    cv = countv.fit_transform(tmp_df['list'].fillna("##").values)
    cv_df = pd.DataFrame(cv.toarray())
    cv_df.columns = [col + '_cv_' + str(i) for i in range(cv_df.shape[1])]
    cv_df['ent_id_year'] = tmp_df['ent_id_year']
    data = data.merge(cv_df, on=['ent_id_year'], how='left')
    del cv_df

# Other count features
# ent_default: number of defaults in the previous year
tmp_df = ent_default.groupby(['ent_id', 'year'])['ent_id_year'].agg(['count']).reset_index().sort_values('count')
tmp_df.columns = ['ent_id', 'year', 'ent_default_last_year_cnts']
tmp_df['year'] = tmp_df['year'] + 1
data = data.merge(tmp_df, on=['ent_id', 'year'], how='left')
# data['ent_default_last_year_cnts'] = data['ent_default_last_year_cnts'].fillna(0)
# ent_info: record count per enterprise
tmp_df = ent_info.groupby(['ent_id'])['ent_id'].agg(['count']).reset_index().sort_values('count')
tmp_df.columns = ['ent_id', 'ent_info_cnts']
data = data.merge(tmp_df, on=['ent_id'], how='left')
# ent_fina: number of financial reports per enterprise-year
tmp_df = ent_fina.groupby(['ent_id_year'])['ent_id_year'].agg(['count']).reset_index().sort_values('count')
tmp_df.columns = ['ent_id_year', 'ent_fina_last_year_cnts']
data = data.merge(tmp_df, on=['ent_id_year'], how='left')
```
4.6 Preparing the data

```python
# Prepare the training and test data
features = [f for f in data.columns
            if f not in ['ent_id', 'ent_id_year', 'default_score', 'is_bondissuer'] + ent_info_time_cols + ent_fina_time_cols]

# Shuffle the data (disabled)
# print(data)
# data = shuffle(data).reset_index(drop=True)
# print(data)

tijiao = 0  # 0 means one fifth of the data is split off for self-testing

train_all = data[data.year != 2021].reset_index(drop=True)  # .reset_index(drop=True) resets the index
print("All labelled data:", train_all.shape)

train_all_p = train_all[train_all.default_score == 1].reset_index(drop=True)  # positive samples
mp = train_all_p.shape[0]  # 151
# training part of the positives
train_list = [j for j in range(train_all_p.shape[0]) if (j + 1) % 5 != tijiao]
train_jicheng_p = shuffle(train_all_p.iloc[train_list].reset_index(drop=True)).reset_index(drop=True)  # train positives
print(train_jicheng_p.shape)
# test part of the positives
train_test_list = [j for j in range(train_all_p.shape[0]) if (j + 1) % 5 == 0]
test_jicheng_p = shuffle(train_all_p.iloc[train_test_list].reset_index(drop=True)).reset_index(drop=True)  # test positives
print(test_jicheng_p.shape)
print("Positive samples in the labelled data:", mp)

train_all_n = train_all[train_all.default_score == 0].reset_index(drop=True)  # negative samples
mn = train_all_n.shape[0]  # 19921
print("Negative samples in the labelled data:", mn)
recall_r = mp / mn
print("Rescaling ratio:", recall_r)
print(type(train_all))

# Rescaling function: maps a probability predicted on the resampled data back to the original class prior
def recall(y, r):
    y_r = y / (r + y * (1 - r))
    return y_r

# print(train_all)
train_list = [i for i in range(train_all.shape[0]) if (i + 1) % 5 != tijiao]
# print(train.iloc[train_list])
train = train_all.iloc[train_list].reset_index(drop=True)
print(train.shape)
# print(train)
train_test_list = [i for i in range(train_all.shape[0]) if (i + 1) % 5 == 0]
test = train_all.iloc[train_test_list].reset_index(drop=True)
print(test.shape)
test_final = data[data.year == 2021].reset_index(drop=True)  # the real test set (2021)
print(test_final.shape)

x_train = train[features]
x_test = test[features]
y_test = test['default_score']
x_test_final = test_final[features]
y_train = train['default_score']

# Ensemble-learning data
x_train_jicheng = []
y_train_jicheng = []
x_test_jicheng = []
y_test_jicheng = []
# NN data
x_train_jicheng_nn = []
y_train_jicheng_nn = []
x_test_jicheng_nn = []
y_test_jicheng_nn = []

# Split the negatives into 120 groups
train_all_n = shuffle(shuffle(train_all_n).reset_index(drop=True)).reset_index(drop=True)
# print(train_all_n)
jicheng_r = 1 / 1  # /120
jicheng_num = int(120 * jicheng_r)
train_all_n_jicheng_list = []
train_all_n_jicheng = []
train_jicheng = []
test_jicheng = []
train_jicheng_nn = []
# test_jicheng_nn = []
for i in range(jicheng_num):
    # divide the negatives into `jicheng_num` groups
    # print(i)
    train_all_n_jicheng_list.append([j for j in range(train_all_n.shape[0]) if (j + 1) % jicheng_num == i])
    # print(train_all_n_jicheng_list[i])
    # print(train_all_n.shape)
    train_all_n_jicheng.append(train_all_n.iloc[train_all_n_jicheng_list[i]].reset_index(drop=True))
    # print(train_all_n_jicheng[i].shape)
    # within each group:
    # training negatives
    train_list = [j for j in range(train_all_n_jicheng[i].shape[0]) if (j + 1) % 5 != tijiao]
    train_jicheng.append(pd.concat([train_all_n_jicheng[i].iloc[train_list].reset_index(drop=True),
                                    train_jicheng_p]).reset_index(drop=True))  # training data for this group
    # Interleave the two classes evenly to make life easier for the NN model (disabled)
    cnn_num_n = 0
    cnn_num_p = 0
    '''
    for c in range(train_jicheng[i].shape[0]):
        if (c+1) % (1*jicheng_r) == 0:
            cnn_train = pd.concat([cnn_train, train_jicheng_p[cnn_num]]).reset_index(drop=True)
            cnn_num_p = cnn_num_p + 1
        elif c == 0:
            cnn_train = train_jicheng[i].iloc[0]
            print(type(cnn_train), train_jicheng[i].iloc[128])
        else:
            cnn_train = pd.concat([cnn_train, train_jicheng[i].iloc[cnn_num_n]]).reset_index(drop=True)
            cnn_num_n = cnn_num_n + 1
    # train_jicheng_nn.append(cnn_train)
    train_jicheng_nn.append([torch.tensor([cnn_train[features].values], dtype=torch.float16),
                             torch.tensor([cnn_train['default_score']])])
    # print("f:+++++++++++++++++++", cnn_train[features], nn_x_train_jicheng, nn_y_train_jicheng)
    # del cnn_train
    #'''
    # the next two lines shuffle this group's training data
    train_jicheng[i] = shuffle(train_jicheng[i]).reset_index(drop=True)
    train_jicheng[i] = shuffle(train_jicheng[i]).reset_index(drop=True)
    # print(train_jicheng[i].shape)
    # test negatives
    train_test_list = [j for j in range(train_all_n_jicheng[i].shape[0]) if (j + 1) % 5 == 0]
    # pd.merge(df1, df2, on='B', how='left')
    test_jicheng.append(pd.concat([train_all_n_jicheng[i].iloc[train_test_list].reset_index(drop=True),
                                   test_jicheng_p]).reset_index(drop=True))  # test data for this group
    # uncomment the next two lines to also shuffle the test data
    # test_jicheng[i] = shuffle(test_jicheng[i]).reset_index(drop=True)
    # test_jicheng[i] = shuffle(test_jicheng[i]).reset_index(drop=True)
    # print(test_jicheng[i].shape)
    # store per-group data; each i corresponds to one batch of data
    x_train_jicheng.append(train_jicheng[i][features])
    y_train_jicheng.append(train_jicheng[i]['default_score'])
    x_train_jicheng_nn.append(torch.tensor(x_train_jicheng[i].values, dtype=torch.float64))
    y_train_jicheng_nn.append(torch.tensor(y_train_jicheng[i].values))
    x_test_jicheng.append(test_jicheng[i][features])
    y_test_jicheng.append(test_jicheng[i]['default_score'])
    x_test_jicheng_nn.append(torch.tensor(x_test_jicheng[i].values, dtype=torch.float64))
    y_test_jicheng_nn.append(torch.tensor(y_test_jicheng[i].values))
    # nn_x_train_jicheng, nn_y_train_jicheng = torch.tensor(x_train_jicheng[i].values, dtype=torch.float64), torch.tensor(y_train_jicheng[i].values)
'''
# print(type(x_train_jicheng[0]))
i = 0
# print(x_train_jicheng[i], y_train_jicheng[i])
# print(x_train_jicheng_nn[i], y_train_jicheng_nn[i])
print(x_train_jicheng_nn[0].shape)
# replace NaN with -10000
nn_x_train_jicheng, nn_y_train_jicheng = torch.where(torch.isnan(x_train_jicheng_nn[i]),
                                                     torch.full_like(x_train_jicheng_nn[i], -10000),
                                                     x_train_jicheng_nn[i]), y_train_jicheng_nn[i]
nn_x_test_jicheng, nn_y_test_jicheng = torch.where(torch.isnan(x_test_jicheng_nn[i]),
                                                   torch.full_like(x_test_jicheng_nn[i], -10000),
                                                   x_test_jicheng_nn[i]), y_test_jicheng_nn[i]
# print(nn_x_train_jicheng, nn_y_train_jicheng)
print(nn_x_train_jicheng.shape)  # 317, 435
print(nn_x_test_jicheng.shape)   # 63, 435
#'''
```
4.7 Neural network model

```python
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(in_features=1 * 435, out_features=250)
        self.fc2 = nn.Linear(in_features=250, out_features=120)
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=6, kernel_size=5)   # weights are randomly initialized
        self.conv2 = nn.Conv1d(in_channels=6, out_channels=12, kernel_size=5)
        self.fc3 = nn.Linear(in_features=12 * 27, out_features=120)
        self.fc4 = nn.Linear(in_features=120, out_features=60)
        self.fc5 = nn.Linear(in_features=60, out_features=10)
        self.out = nn.Linear(in_features=10, out_features=2)

    def forward(self, t):
        # (batch, 1, 435)
        t = self.fc1(t)
        t = F.tanh(t)
        # (batch, 1, 250)
        t = self.fc2(t)
        t = F.tanh(t)
        # (batch, 1, 120)
        t = F.tanh(self.conv1(t))                              # (batch, 6, 116)
        t = F.max_pool1d(t, kernel_size=2, stride=2)           # (batch, 6, 58), rounds down
        t = F.tanh(self.conv2(t))                              # (batch, 12, 54)
        t = F.max_pool1d(t, kernel_size=2, stride=2)           # (batch, 12, 27)
        t = F.tanh(self.fc3(t.reshape(-1, 12 * t.shape[2])))   # (batch, 120)
        t = F.tanh(self.fc4(t))
        t = F.tanh(self.fc5(t))
        t = self.out(t)
        t = F.softmax(t, dim=1)
        t = recall(t[:, 0], jicheng_r)   # rescale the predicted probability (column 0 of the softmax output)
        return t
```
4.8 Helper functions

```python
def pre(pres):
    # threshold the probabilities at 0.5 to get hard labels
    pres[pres >= 0.5] = 1
    pres[pres < 0.5] = 0
    return pres

def get_correct_num(pres, labels):
    # predictions greater than 0.5 count as class 1
    # print(pres.shape)
    pres = pre(pres)
    right_num = pres.eq(labels).sum()  # eq(labels) compares element-wise: 1 where equal, 0 otherwise
    return right_num
```
4.9 Network training

```python
# Build the network
torch.set_grad_enabled(True)  # gradient tracking can be disabled for inference-only passes to save memory
network = CNN()

# Prepare the data for the final answer
x_test_nn = torch.tensor(x_test_final.values, dtype=torch.float64)
x_test_nn = torch.where(torch.isnan(x_test_nn), torch.full_like(x_test_nn, -10000), x_test_nn)
nn_x_test = x_test_nn.reshape(x_test_nn.shape[0], 1, -1).to(torch.float32)  # (8963, 1, 435)
pre_test_nn_all = torch.zeros(network(nn_x_test).shape)
pre_val_nn_all = torch.zeros_like(network(nn_x_test))
# print(pre_val_nn_all)

lun = 20  # number of groups / rounds
for i in range(lun - 20, lun):
    # Re-initialize the network for each group
    torch.set_grad_enabled(True)
    network = CNN()
    # Convert this group's training and test data (replace NaN with -10000)
    nn_x_train_jicheng, nn_y_train_jicheng = torch.where(torch.isnan(x_train_jicheng_nn[i]),
                                                         torch.full_like(x_train_jicheng_nn[i], -10000),
                                                         x_train_jicheng_nn[i]), y_train_jicheng_nn[i]
    nn_x_test_jicheng, nn_y_test_jicheng = torch.where(torch.isnan(x_test_jicheng_nn[i]),
                                                       torch.full_like(x_test_jicheng_nn[i], -10000),
                                                       x_test_jicheng_nn[i]), y_test_jicheng_nn[i]
    # Self-test (validation) data
    nn_x_val = nn_x_test_jicheng.reshape(nn_x_test_jicheng.shape[0], 1, -1).to(torch.float32)  # (63, 1, 435)
    # print(nn_x_val.shape)
    # Load the training data
    train_jicheng_data = TensorDataset(nn_x_train_jicheng, nn_y_train_jicheng)
    # dataloaders
    batch_size = 6
    # make sure to SHUFFLE your data
    train_jicheng_loader = DataLoader(train_jicheng_data, shuffle=True, batch_size=batch_size)
    '''
    batch = next(iter(train_jicheng_loader))
    nn_x, nn_y = batch
    print(nn_x.shape, nn_y.shape)   # torch.Size([1, 435]) torch.Size([1])
    nn_x = nn_x.reshape(nn_x.shape[0], 1, -1)  # .unsqueeze(0)
    nn_x_test = nn_x_test_jicheng.reshape(nn_x_test_jicheng.shape[0], 1, -1).to(torch.float32)
    print(nn_x.shape)               # torch.Size([batch_size, 1, 435])
    '''
    # Train the network
    epoch_num = 400
    for epoch in range(epoch_num):
        batch_id = 0
        total_loss = 0
        total_correct = 0
        for batch in train_jicheng_loader:
            nn_x, nn_y = batch
            nn_x = nn_x.reshape(nn_x.shape[0], 1, -1)
            nn_x = nn_x.to(torch.float32)
            pres = network(nn_x)
            loss_f = nn.SmoothL1Loss()                    # ("Boolean value of Tensor with more than one value is ambiguous" otherwise)
            loss = loss_f(pres, nn_y.to(torch.float32))   # ("Found dtype Long but expected Float" without the cast)
            optimizer = optim.Adam(network.parameters(), lr=0.001)  # optimizer
            optimizer.zero_grad()   # PyTorch accumulates gradients, so clear them before each backward pass
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            # print(pres)
            total_correct += get_correct_num(pres, nn_y)
            # print(pres)
            batch_id += 1
            # if batch_id % 20 == 0:
            #     print("batch_id:", batch_id, ",total_correct:", total_correct, ",total_loss:", total_loss)
        correct_rate = total_correct / len(train_jicheng_data)
        pre_val_nn = network(nn_x_val)
        num = get_correct_num(pre_val_nn, nn_y_test_jicheng)  # out of 63
        # print("epoch:", epoch, ",correct_rate:", correct_rate, ",total_loss:", total_loss,
        #       "test_correct_num:", num, "test_shape:", nn_y_test_jicheng.shape)
        # num += num
    pre_test_nn_all += network(nn_x_test)
    print("i:", i, "pre:", pre_test_nn_all, "grade:", correct_rate, ",total_loss:", total_loss,
          "test_correct_num:", num, "test_shape:", nn_y_test_jicheng.shape)
    # num_all = nn_y_test_jicheng.shape[0]
    # grate = num / num_all
    # print("grate:", grate)
```
4.10 NN model output

```python
pre_test_nn_all = pre_test_nn_all / 20   # average over the 20 group models
test_final['default_score'] = pre_test_nn_all.cpu().detach().numpy()
answer_cnn = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_cnn[['ent_id', 'default_score']].to_csv('answer_fcnn_19_400.csv', header=True, index=False, sep='|')
```
4.11 LGB / XGB / CAT model building

```python
# Build the models
def cv_model(clf, train_x, train_y, test_x, test_y, test_final_x, clf_name, flag):
    folds = train_x.shape[0] - 1
    print(folds)
    folds = 10
    seed = 2022
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    test_final = np.zeros(test_final_x.shape[0])
    cv_scores = []
    cv_scores_test = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                # 10-fold, lr 0.001, 900: lgb_score_mean: 0.9231728810742007  lgb_score_mean_test: 0.9432300033467202
                # 10-fold, lr 0.01,  900: lgb_score_mean: 0.9365468438097789  lgb_score_mean_test: 0.9592670682730924
                # 10-fold, lr 0.1,   900: lgb_score_mean: 0.9362564399139585  lgb_score_mean_test: 0.9548477242302542
                # 10-fold, lr 0.05,  900: lgb_score_mean: 0.9369092381980579  lgb_score_mean_test: 0.9602484939759035 ***
                # 10-fold, lr 0.05,  900, baseline negatives: lgb_score_mean: 0.9473846153846154  lgb_score_mean_test: 0.9585261089275688
                # 10-fold, lr 0.01,  900, baseline negatives: lgb_score_mean: 0.9452154745838957  lgb_score_mean_test: 0.9586047164514317
                # lgb_score_mean: 0.9365468438097789  lgb_score_mean_test: 0.9592628848728246 ++++
                # lgb_score_mean: 0.9580903014889222  lgb_score_mean_test: 0.9497874029139975
                # 5-fold, lr 0.001, 900: lgb_score_mean: 0.9251487496189362  lgb_score_mean_test: 0.95645749665328
                # 5-fold, lr 0.005, 900: lgb_score_mean: 0.9262832002448256  lgb_score_mean_test: 0.9572573627844712
                # 5-fold, lr 0.01,  900: lgb_score_mean: 0.9231927998284016  lgb_score_mean_test: 0.9573862115127177
                # 8-fold, lr 0.01,  900: lgb_score_mean: 0.9306092662273949  lgb_score_mean_test: 0.958592704149933 +++
                # 8-fold, lr 0.05,  900: lgb_score_mean: 0.9321498906176219  lgb_score_mean_test: 0.9570511211512718 ***
                # 8-fold, lr 0.05, 1200: lgb_score_mean: 0.9324803792387731  lgb_score_mean_test: 0.9570751757028113
                # 0.9615
                'learning_rate': 0.01,
                'seed': 2022,
                'n_jobs': -1,
                'verbose': -1,
            }
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                              categorical_feature=[], verbose_eval=500, early_stopping_rounds=900)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            if flag == 1:
                val_pred = recall(val_pred, recall_r)
            if flag == 2:
                val_pred = recall(val_pred, jicheng_r)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            if flag == 1:
                test_pred = recall(test_pred, recall_r)
            if flag == 2:
                test_pred = recall(test_pred, jicheng_r)
            test_final_pred = model.predict(test_final_x, num_iteration=model.best_iteration)
            if flag == 1:
                test_final_pred = recall(test_final_pred, recall_r)
            if flag == 2:
                test_final_pred = recall(test_final_pred, jicheng_r)
            # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            test_final_matrix = clf.DMatrix(test_final_x)
            params = {
                'booster': 'gbtree',
                'objective': 'binary:logistic',
                'eval_metric': 'auc',
                'gamma': 1,
                'min_child_weight': 1.5,
                'max_depth': 5,
                'lambda': 10,
                'subsample': 0.7,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 0.7,
                # 8-fold, eta 0.005, 800: xgb_score_mean: 0.9316989067398125  xgb_score_mean_test: 0.9527265311244979
                # xgb_score_mean: 0.9375689969683867  xgb_score_mean_test: 0.951910140562249
                'eta': 0.01,
                'tree_method': 'exact',
                'seed': 2022,
                'nthread': 36
            }
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist,
                              verbose_eval=500, early_stopping_rounds=800)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            if flag == 1:
                val_pred = recall(val_pred, recall_r)
            if flag == 2:
                val_pred = recall(val_pred, jicheng_r)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
            if flag == 1:
                test_pred = recall(test_pred, recall_r)
            if flag == 2:
                test_pred = recall(test_pred, jicheng_r)
            test_final_pred = model.predict(test_final_matrix, ntree_limit=model.best_ntree_limit)
            if flag == 1:
                test_final_pred = recall(test_final_pred, recall_r)
            if flag == 2:
                test_final_pred = recall(test_final_pred, jicheng_r)
        if clf_name == "cat":
            # 8-fold, lr 0.05, 800: cat_score_mean: 0.8828433432624545  cat_score_mean_test: 0.9101478831994645
            # cat_score_mean: 0.8703723803584509  cat_score_mean_test: 0.9054350736278447
            params = {
                'learning_rate': 0.01,
                'depth': 5,
                'l2_leaf_reg': 10,
                # 'bootstrap_type': ''
                'od_type': "Iter",
                'od_wait': 50,
                'random_seed': 11,
                # 'allow_writing_files': True
            }
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y), cat_features=[], use_best_model=True, verbose=800)
            val_pred = model.predict(val_x)
            if flag == 1:
                val_pred = recall(val_pred, recall_r)
            if flag == 2:
                val_pred = recall(val_pred, jicheng_r)
            test_pred = model.predict(test_x)
            if flag == 1:
                test_pred = recall(test_pred, recall_r)
            if flag == 2:
                test_pred = recall(test_pred, jicheng_r)
            test_final_pred = model.predict(test_final_x)
            if flag == 1:
                test_final_pred = recall(test_final_pred, recall_r)
            if flag == 2:
                test_final_pred = recall(test_final_pred, jicheng_r)
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        test_final += test_final_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        cv_scores_test.append(roc_auc_score(test_y, test))
        print("cv_scores:", cv_scores, "cv_scores_test:", cv_scores_test)
    print("%s_scotrainre_list:" % clf_name, cv_scores, "%s_scotrainre_list_test:" % clf_name, cv_scores_test)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores), "%s_score_mean_test:" % clf_name, np.mean(cv_scores_test))
    print("%s_score_std:" % clf_name, np.std(cv_scores), "%s_score_std_test:" % clf_name, np.std(cv_scores_test))
    return train, test, test_final

# flag: 0 - no rescaling; 1 - rescale with recall_r; 2 - rescale with jicheng_r
def lgb_model(x_train, y_train, x_test, y_test, x_test_final, flag):
    lgb_train, lgb_test, lgb_test_final = cv_model(lgb, x_train, y_train, x_test, y_test, x_test_final, "lgb", flag)
    return lgb_train, lgb_test, lgb_test_final

def xgb_model(x_train, y_train, x_test, y_test, x_test_final, flag):
    xgb_train, xgb_test, xgb_test_final = cv_model(xgb, x_train, y_train, x_test, y_test, x_test_final, "xgb", flag)
    return xgb_train, xgb_test, xgb_test_final

def cat_model(x_train, y_train, x_test, y_test, x_test_final, flag):
    cat_train, cat_test, cat_test_final = cv_model(CatBoostRegressor, x_train, y_train, x_test, y_test, x_test_final, "cat", flag)
    return cat_train, cat_test, cat_test_final
```
4.12 LGB / XGB / CAT model training

```python
lgb_train, lgb_test, lgb_test_final = lgb_model(x_train, y_train, x_test, y_test, x_test_final, 1)
test_final['default_score'] = lgb_test_final
answer_lgbrecall_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_lgbrecall_all[['ent_id', 'default_score']].to_csv('answer_lgbrecall_all_0506.csv', header=True, index=False, sep='|')

xgb_train, xgb_test, xgb_test_final = xgb_model(x_train, y_train, x_test, y_test, x_test_final, 1)
test_final['default_score'] = xgb_test_final
answer_xgbrecall_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_xgbrecall_all[['ent_id', 'default_score']].to_csv('answer_xgbrecall_all_ptp_0507.csv', header=True, index=False, sep='|')

cat_train, cat_test, cat_test_final = cat_model(x_train, y_train, x_test, y_test, x_test_final, 1)
test_final['default_score'] = cat_test_final
answer_catrecall_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_catrecall_all[['ent_id', 'default_score']].to_csv('answer_catrecall_all_0506.csv', header=True, index=False, sep='|')
```

Ensemble learning (one model per negative-sample group, averaged):

```python
# LGB ensemble
a = b = c = 0
for i in range(jicheng_num):
    # silence print output
    sys.stdout = open(os.devnull, 'w')
    if (i + 1) % 30 == 0:
        # restore print output
        sys.stdout = sys.__stdout__
    lgb_train_jicheng, lgb_test_jicheng, lgb_test_final_jicheng = \
        lgb_model(x_train_jicheng[i], y_train_jicheng[i], x_test_jicheng[i], y_test_jicheng[i], x_test_final, 0)
    # a += lgb_train_jicheng
    # b += lgb_test_jicheng
    c += lgb_test_final_jicheng
# lgb_train_jicheng, lgb_test_jicheng, lgb_test_final_jicheng = a/jicheng_num, b/jicheng_num, c/jicheng_num
lgb_test_final_jicheng = c / jicheng_num
test_final['default_score'] = lgb_test_final_jicheng
answer_lgbjicheng_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_lgbjicheng_all[['ent_id', 'default_score']].to_csv('answer_lgbjicheng_all_0506.csv', header=True, index=False, sep='|')

# XGB ensemble
a = b = c = 0
for i in range(jicheng_num):
    # silence print output
    sys.stdout = open(os.devnull, 'w')
    if (i + 1) % 30 == 0:
        # restore print output
        sys.stdout = sys.__stdout__
    xgb_train_jicheng, xgb_test_jicheng, xgb_test_final_jicheng = \
        xgb_model(x_train_jicheng[i], y_train_jicheng[i], x_test_jicheng[i], y_test_jicheng[i], x_test_final, 0)
    # a += xgb_train_jicheng
    # b += xgb_test_jicheng
    c += xgb_test_final_jicheng
# xgb_train_jicheng, xgb_test_jicheng, xgb_test_final_jicheng = a/jicheng_num, b/jicheng_num, c/jicheng_num
xgb_test_final_jicheng = c / jicheng_num
test_final['default_score'] = xgb_test_final_jicheng
answer_xgbjicheng_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_xgbjicheng_all[['ent_id', 'default_score']].to_csv('answer_xgbjicheng_all_0506.csv', header=True, index=False, sep='|')

# CAT ensemble
a = b = c = 0
for i in range(jicheng_num):
    # silence print output
    sys.stdout = open(os.devnull, 'w')
    if (i + 1) % 30 == 0:
        # restore print output
        sys.stdout = sys.__stdout__
    cat_train_jicheng, cat_test_jicheng, cat_test_final_jicheng = \
        cat_model(x_train_jicheng[i], y_train_jicheng[i], x_test_jicheng[i], y_test_jicheng[i], x_test_final, 0)
    # a += cat_train_jicheng
    # b += cat_test_jicheng
    c += cat_test_final_jicheng
# cat_train_jicheng, cat_test_jicheng, cat_test_final_jicheng = a/jicheng_num, b/jicheng_num, c/jicheng_num
cat_test_final_jicheng = c / jicheng_num
test_final['default_score'] = cat_test_final_jicheng
answer_catjicheng_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_catjicheng_all[['ent_id', 'default_score']].to_csv('answer_catjicheng_all_recall_shuffle_20_0508.csv', header=True, index=False, sep='|')
```
4.13 Ensemble code

Train the models separately (they can run at the same time), write each model's predictions to its own file, then blend the files directly.

```python
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import shuffle
from gensim.models import Word2Vec
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

import os, sys

class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
# usage: with HiddenPrints(): ...   (nothing is printed inside the block)

answer = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
answer_050705 = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
answer_recall = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
answer_jicheng = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
answer_jicheng_recall_11 = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
answer_jicheng_recall_20 = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
# answer_cnn = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
'''
lgb_recall_all = pd.read_csv('../input/xintaijicheng/answer_lgbrecall_all_0506.csv', sep='|')
xgb_recsll_all = pd.read_csv('../input/xintaijicheng/answer_xgbrecall_all_0506.csv', sep='|')
cat_recall_all = pd.read_csv('../input/xintaijicheng/answer_catrecall_all_0506.csv', sep='|')
lgb_jicheng_all = pd.read_csv('../input/xintaijicheng/answer_lgbjicheng_all_0506.csv', sep='|')
xgb_jicheng_all = pd.read_csv('../input/xintaijicheng/answer_xgbjicheng_all_0506.csv', sep='|')
cat_jicheng_all = pd.read_csv('../input/xintaijicheng/answer_catjicheng_all_0506.csv', sep='|')
lgb_jicheng_all_recall_11_0507 = pd.read_csv('../input/xintaijicheng/answer_lgbjicheng_all_recall_11_0507.csv', sep='|')
xgb_jicheng_all_recall_11_0507 = pd.read_csv('../input/xintaijicheng/answer_xgbjicheng_all_recall__11_0507.csv', sep='|')
cat_jicheng_all_recall_11_0507 = pd.read_csv('../input/xintaijicheng/answer_catjicheng_all_recall_11_0507.csv', sep='|')
'''
lgb_jicheng_all_recall_20_0507 = pd.read_csv('../input/xintaijicheng/answer_lgbjicheng_all_recall_20_0507.csv', sep='|')   # 0.991451
xgb_jicheng_all_recall_20_0507 = pd.read_csv('../input/xintaijicheng/answer_xgbjicheng_all_recall_20_0507.csv', sep='|')   # 0.991832
cat_jicheng_all_recall_shuffle_30_0508 = pd.read_csv('../input/xintaijicheng/answer_catjicheng_all_recall__shuffle_30_0508.csv', sep='|')  # 0.991635
cnn = pd.read_csv('../input/xintaijicheng/answer_cnn.csv', sep='|')
cnn_981987 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.001_shuffle.csv', sep='|')
cnn_19 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc0019.csv', sep='|')
cnn_39 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc0039.csv', sep='|')
cnn_59 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc0059.csv', sep='|')
cnn_79 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc0079.csv', sep='|')
cnn_99 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc0099.csv', sep='|')
cnn_119 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc00119.csv', sep='|')
fcnn = pd.read_csv('../input/xintaijicheng/answer_fcnn_19.csv', sep='|')
jicnn = pd.read_csv('../input/xintaijicheng/answer_jcnn_19.csv', sep='|')

# answer_recall['default_score'] = (lgb_recall_all['default_score']*0.33343 + xgb_recsll_all['default_score']*0.33348 + cat_recall_all['default_score']*0.33309)
# answer_jicheng['default_score'] = (lgb_jicheng_all['default_score']*0.333 + xgb_jicheng_all['default_score']*0.3333 + cat_jicheng_all['default_score']*0.3337)
# answer_jicheng_recall_11['default_score'] = (lgb_jicheng_all_recall_11_0507['default_score']*0.333144 + xgb_jicheng_all_recall_11_0507['default_score']*0.333481 + cat_jicheng_all_recall_11_0507['default_score']*0.333375)
# answer_jicheng_recall_20['default_score'] = (lgb_jicheng_all_recall_20_0507['default_score']*0.333 + xgb_jicheng_all_recall_20_0507['default_score']*0.334 + cat_jicheng_all_recall_shuffle_30_0508['default_score']*0.333)
answer_jicheng_recall_20['default_score'] = (lgb_jicheng_all_recall_20_0507['default_score'] * 0.4 +
                                             xgb_jicheng_all_recall_20_0507['default_score'] * 0.2 +
                                             cat_jicheng_all_recall_shuffle_30_0508['default_score'] * 0.4)
# answer_050705['default_score'] = answer_recall['default_score']*0.405 + answer_jicheng_recall_11['default_score']*0.595
answer['default_score'] = (cnn_19['default_score'] * 0.16 +
                           cnn_39['default_score'] * 0.17 +
                           cnn_59['default_score'] * 0.17 +
                           cnn_79['default_score'] * 0.16 +
                           cnn_99['default_score'] * 0.17 +
                           cnn_119['default_score'] * 0.17)
answer['default_score'] = (cnn['default_score'] * 0.45 +
                           cnn_981987['default_score'] * 0.45 +
                           answer['default_score'] * 0.001 +
                           answer_jicheng_recall_20['default_score'] * 0.099)
answer['default_score'] = (fcnn['default_score'] * 0.2 +
                           jicnn['default_score'] * 0.2 +
                           answer['default_score'] * 0.6)
'''
answer['default_score'] = (cnn['default_score']*0.45 +
                           cnn_981987['default_score']*0.45 +
                           answer_jicheng_recall_20['default_score']*0.1)*0.5 + answer['default_score']*0.5
answer['default_score'] = (answer['default_score']*0.97 +
                           cnn_19['default_score']*0.01 +
                           cnn_39['default_score']*0.01 +
                           cnn_59['default_score']*0.01)
'''
# print(answer['default_score'], answer_recall['default_score'], answer_jicheng_recall_11['default_score'])
answer[['ent_id', 'default_score']].to_csv('answer_051207.csv', header=True, index=False, sep='|')
```
5. Key points of the solution
Final score: 0.993535
5.1 Feature engineering
- Inspect the data manually and extract/construct features by hand, so the models have useful signals to learn from
5.2 Hyperparameter tuning
- Split off a validation set for choosing hyperparameters; make sure the training and validation sets follow the same distribution, which matters especially when the positive and negative samples are extremely imbalanced
- Tuning mainly concerned the choice of k for k-fold cross validation and the learning rate (a small search sketch follows this list)
- Final choice: 10-fold cross validation with a learning rate of 0.01
- After selection, retrain on all of the data
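A hedged sketch (not from the original write-up, with hypothetical X and y) of this kind of tuning: pick the number of folds and the learning rate by stratified cross-validated AUC (stratification keeps the positive ratio identical in every fold), then refit on all the data with the chosen values:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 8))
y = (rng.random(500) < 0.05).astype(int)       # heavily imbalanced hypothetical labels

best = None
for k in (5, 8, 10):
    for lr in (0.001, 0.01, 0.05, 0.1):
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=2022)
        auc = cross_val_score(LGBMClassifier(learning_rate=lr, n_estimators=300),
                              X, y, cv=cv, scoring='roc_auc').mean()
        if best is None or auc > best[0]:
            best = (auc, k, lr)

# retrain on all of the data with the selected learning rate
final_model = LGBMClassifier(learning_rate=best[2], n_estimators=300).fit(X, y)
```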
5.3 Handling the positive/negative imbalance
- Probability rescaling (the recall function) plus ensemble learning
- Split the majority class into multiple groups, each about three times the size of the minority class, and pair every group with all of the minority samples to form one training set; train one model per group, then ensemble the models by taking the mean (a compact sketch follows this list)
- Advantage: the minority samples are reused many times and every group is large enough, so the models train well
- Disadvantage: more complex code and a long training time
- Results:
  - LGB score: 0.990894
  - XGB score: 0.991045
  - CAT score: 0.98988
  - Mean fusion of the three models: 0.991641
- Auxiliary run: change the positive/negative multiple used for the recall rescaling and repeat
  - LGB score: 0.990894
  - XGB score: 0.991045
  - CAT score: 0.98988
  - Mean fusion of the three models: 0.991516
- Fusing the main and auxiliary models
  - Weighted fusion: 0.991756 (fusion experience: giving the higher-scoring model a somewhat smaller weight often works better)
  - 0.6 : 0.4 → 0.99115
  - 0.4 : 0.6 → 0.991751
  - 0.3 : 0.7 → 0.991738
  - 0.405 : 0.595 → 0.991756
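A compact sketch (with hypothetical arrays) of the undersampling ensemble described above: the majority class is split into K groups, each group is paired with all of the minority samples, one model is trained per group, and the K predicted probabilities are averaged:

```python
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X_pos, y_pos = rng.random((150, 5)), np.ones(150)        # minority class (defaults)
X_neg, y_neg = rng.random((3000, 5)), np.zeros(3000)     # majority class
X_test = rng.random((500, 5))

K = 20
neg_groups = np.array_split(rng.permutation(len(X_neg)), K)
pred = np.zeros(len(X_test))
for idx in neg_groups:
    X = np.vstack([X_pos, X_neg[idx]])                   # all positives + one group of negatives
    y = np.concatenate([y_pos, y_neg[idx]])
    model = LGBMClassifier(n_estimators=200).fit(X, y)
    pred += model.predict_proba(X_test)[:, 1] / K        # average over the K group models
```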
5.4 Reusing the held-out validation split for training
- LGB score: 0.991451; after shuffling the data: 0.991502
- XGB score: 0.991832; after shuffling the data: 0.990766
- CAT score: 0.991325; after shuffling the data: 0.991635
- Fusion of the three models (LGB, XGB, shuffled CAT): 0.992045 (fusion experience: giving the higher-scoring model a somewhat smaller weight often works better)
  - Equal weights: 0.992034
  - 0.2 : 0.5 : 0.3 → 0.992023
  - 0.3 : 0.4 : 0.3 → 0.991996
  - 0.4 : 0.2 : 0.4 → 0.992045
  - 0.43 : 0.15 : 0.42 → 0.992041
5.5 Adding the NN model
- Work out the size of every layer by hand
- NN score before shuffling the data: 0.977279 (only the DataLoader shuffles automatically)
- NN score after an explicit shuffle: 0.979368
- Fusing the two NNs with the 0.992045 model: 0.993535 (weights 0.45 : 0.45 : 0.1) (fusion experience: giving the higher-scoring model a somewhat smaller weight often works better)
- Different weightings move the score up or down by about 0.0005
5.6 Other takeaways
- Details matter
- Hyperparameters matter a lot, but don't get lost in endless tuning: the best parameters and nearby ones perform almost the same
- Find teammates: it broadens your thinking and is also better for model fusion
- The mysterious fusion rule of thumb: giving the higher-scoring model a somewhat smaller weight often works better