Default Risk Early Warning for Bond-Issuing Enterprises: Preliminary-Round Solution [AI Competition]
2022-07-24 07:50:00 【Ahang 626】
- 1. The competition
- 2. Interpreting the problem
- 3. Score-boosting strategies
- 4. Solution walkthrough
- 4.1 Imports
- 4.2 Reading the data
- 4.3 Time handling
- 4.4 Constructing negative samples
- 4.5 Feature extraction and processing
- 4.6 Preparing the data
- 4.7 Neural network model
- 4.8 Helper functions
- 4.9 Network training
- 4.10 NN model output
- 4.11 LGB / XGB / CAT model building
- 4.12 LGB / XGB / CAT model training
- 4.13 Ensemble code
- 5. Key points of the solution
1. The competition
- Problem page: Early warning of default risk of bond-issuing enterprises
- Organizer: Guotai Junan
- Training set: the competition provides default data of bond-issuing enterprises for 2019-2020 for model training, in order to predict each issuer's probability of default risk in 2021; the universe of issuers is the set of enterprises that issued bonds during 2019-2021. The universe provided for prediction is the same in the preliminary and semi-final rounds; on top of the preliminary data, the semi-final adds shareholder data, outbound-investment data, and public-opinion data of related enterprises. (CSDN download, Kaggle download)
  - Basic enterprise information (bond issuers only): ent_info.csv
  - Financial indicator data for 2018-2020: ent_financial_indicator.csv
  - Public-opinion (news) records for 2018-2020 (bond issuers only): ent_news.csv
  - Default records for 2019-2020: ent_default.csv
- Test set: the enterprises listed in the provided submission example. (CSDN download, Kaggle download)
- Submission format: the result file must be named answer.csv; note that the separator is '|', not the usual ','. Submit the predicted probability, not the class label, e.g. 12345|0.89
- Background: since the myth of "rigid payment" in China's bond market was broken in 2014, bond defaults have kept heating up. In 2018 alone, 160 bonds defaulted, involving 44 issuing enterprises, with a default balance as high as 150.525 billion yuan, the most severe year on record. Against this backdrop of accelerating credit-risk exposure and increasingly routine defaults, effectively evaluating and predicting the default risk of bond issuers in advance has become an important regulatory problem. Because information is incomplete, financial data alone can no longer fully explain the default-risk premium, so making effective use of non-financial data, such as issuers' public-opinion data and upstream/downstream equity data, is of great significance for predicting default risk.
- Task: use machine learning, deep learning and related methods to train a model that learns from the information about bond-issuing enterprises and predicts whether an issuer will default in the future. The difficulty is that the dataset contains a large amount of information about each issuer (shareholder information, outbound-investment information, public-opinion information, etc.), so extracting effective features and predicting risk from them is the key problem. The numbers of positive and negative samples also differ enormously, and the organizers expect contestants to address this.
- Evaluation:
  - Preliminary-round metric: AUC
  - Semi-final metric: precision (Precision, P), recall (Recall, R) and the F1 value (F1-measure) are used to evaluate the default predictions. Precision is the probability that a sample predicted as positive is actually positive; recall is the probability that an actually positive sample is predicted as positive. F1 is computed as F1 = 2PR / (P + R).
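A minimal sketch (not from the original write-up) of how the two evaluation metrics behave, using sklearn on hypothetical labels and predicted probabilities:

```python
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 0, 1]                  # hypothetical ground-truth labels
y_prob = [0.1, 0.4, 0.8, 0.3, 0.2, 0.9]      # hypothetical predicted default probabilities

# Preliminary round: AUC is computed directly on the probabilities.
print("AUC:", roc_auc_score(y_true, y_prob))

# Semi-final: P, R and F1 need a hard threshold (0.5 here) to turn probabilities into labels.
y_pred = [int(p >= 0.5) for p in y_prob]
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print("P:", p, "R:", r, "F1 = 2PR/(P+R):", 2 * p * r / (p + r), "=", f1_score(y_true, y_pred))
```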
2. Interpreting the problem
Video walkthrough and baseline program (score 0.9696)
2.1 The problem distilled
- A binary classification problem
- Extract effective features from a large amount of information and predict with them
- The positive and negative samples are extremely imbalanced
- Evaluated by AUC
- Candidate models: logistic regression, random forest, GBDT, graph neural networks
2.2 Feature extraction
2.2.1 Categorical features
- Encodings (a pandas sketch of the encodings and statistics follows this list)
  - Natural-number (label) encoding
  - One-hot encoding
  - Count encoding (can stand in for the raw category)
  - Target encoding
- Statistics
  - count
  - nunique ("width")
  - ratio ("preference")
2.2.2 Numerical features
- Cross statistics
  - Row-wise crossing: mean, median, min/max
  - Column-wise crossing
- Discretization
  - Binning
  - Binarization (0/1)
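A small pandas sketch (not from the original write-up) of the encodings and statistics listed above, on a hypothetical frame with an id column 'ent_id', a categorical column 'city' and a binary target 'label':

```python
import pandas as pd

df = pd.DataFrame({'ent_id': ['a', 'a', 'b', 'b', 'c'],
                   'city':   ['SH', 'BJ', 'SH', 'SH', 'GZ'],
                   'label':  [0, 1, 0, 1, 0]})

# Natural-number (label) encoding
df['city_le'] = df['city'].map({v: i for i, v in enumerate(df['city'].unique())})
# One-hot encoding
df = pd.concat([df, pd.get_dummies(df['city'], prefix='city')], axis=1)
# Count encoding (can stand in for the raw category)
df['city_count'] = df['city'].map(df['city'].value_counts())
# Target encoding (per-category mean of the label; compute it out-of-fold in practice)
df['city_target'] = df['city'].map(df.groupby('city')['label'].mean())

# Group statistics per enterprise: count, nunique ("width"), ratio ("preference")
grp = df.groupby('ent_id')['city'].agg(cnt='count', nuniq='nunique').reset_index()
df = df.merge(grp, on='ent_id', how='left')
df['city_ratio'] = df.groupby(['ent_id', 'city'])['city'].transform('count') / df['cnt']

# Numerical row-wise crossing: aggregate across several numeric columns
num_cols = ['city_le', 'city_count']            # stand-ins for real numeric features
df['row_mean'] = df[num_cols].mean(axis=1)
df['row_max'] = df[num_cols].max(axis=1)
```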
2.3 Feature selection
- Filter methods (all three families are sketched after this list)
  - Correlation coefficient
  - Chi-square test
  - Mutual information
- Wrapper methods
  - Forward search
  - Backward search
- Embedded methods
  - Feature ranking based on the learned model
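A hedged sketch (not from the original write-up) of the three families on hypothetical data; the filter scores each feature independently, the wrapper searches over subsets, and the embedded method reads importances off a fitted model:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2, RFE
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))            # hypothetical non-negative features (chi2 needs non-negative values)
y = rng.integers(0, 2, 200)          # hypothetical binary labels

# Filter: univariate scores (mutual information, chi-square)
mi_sel = SelectKBest(mutual_info_classif, k=5).fit(X, y)
chi_sel = SelectKBest(chi2, k=5).fit(X, y)

# Wrapper: recursive feature elimination, roughly a backward search
rfe = RFE(LGBMClassifier(n_estimators=50), n_features_to_select=5).fit(X, y)

# Embedded: rank features by the importance learned by the model itself
model = LGBMClassifier(n_estimators=50).fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]
```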
2.4 Time-based validation
- Split the training and validation sets in time order (sketched after this list)
- Or use plain K-fold cross validation
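A minimal sketch (with a hypothetical frame; the real one is built in section 4) of what a time-ordered split looks like: earlier years train, the latest labelled year validates, instead of mixing years in a random K-fold:

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
df = pd.DataFrame({'year': np.repeat([2019, 2020], 100),
                   'f1': rng.random(200),
                   'label': rng.integers(0, 2, 200)})

train_part, valid_part = df[df.year == 2019], df[df.year == 2020]
clf = LGBMClassifier(n_estimators=100).fit(train_part[['f1']], train_part['label'])
val_prob = clf.predict_proba(valid_part[['f1']])[:, 1]
print('time-ordered AUC:', roc_auc_score(valid_part['label'], val_prob))
```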
2.5 Model selection
- XGBoost, LightGBM: few feature-processing requirements, friendly to both categorical and continuous features, and missing values need not be filled
- NN model: used for ensembling
- Fusion methods: the fused members should differ in features, samples, and models (an averaging/stacking sketch follows this list)
  - Bagging
  - Boosting
  - Voting
  - Stacking
  - Averaging
- Details worth digging into
  - The raw news title is useless on its own (segment it into words and turn it into a vector, or run BERT sentiment analysis)
  - Hyperparameters: random search, grid search (tree depth, maximum tree width, etc.)
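A hedged sketch (not from the original write-up) of the two fusion styles used later: weighted averaging of predicted probabilities, and a one-layer stacking where out-of-fold predictions of the base models feed a logistic-regression meta learner; all arrays are hypothetical placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Weighted averaging of three base-model probability vectors
p_lgb, p_xgb, p_cat = np.random.rand(3, 100)
p_avg = 0.4 * p_lgb + 0.2 * p_xgb + 0.4 * p_cat

# Stacking: out-of-fold predictions on the training set become meta features
oof = np.column_stack([p_lgb, p_xgb, p_cat])        # shape (n_train, n_models)
y = np.random.randint(0, 2, 100)                    # hypothetical labels
meta = LogisticRegression().fit(oof, y)
test_meta = np.column_stack([p_lgb, p_xgb, p_cat])  # base-model predictions on the test set
p_stack = meta.predict_proba(test_meta)[:, 1]
```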
3. Score-boosting strategies
- News-title handling: jieba word segmentation, then concatenation, then dimension reduction (sketched after this list)
- Model fusion: averaging, and stacking (single-layer, multi-layer, or combined with other techniques); e.g. feed the outputs of LGB, CAT, XGB and the like into SVM, LR and Bayesian models, then fuse them with a meta learner
- Pseudo labels: treat samples whose predicted probability is below 0.01 as negative samples and add them to the training data (a way of enlarging the data); adversarial validation on the features and the label can also be used when building pseudo labels
- Automated machine learning offers some feature tricks as well
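A hedged sketch (not from the original write-up) of the news-title pipeline described above: jieba segmentation, per-enterprise concatenation, TF-IDF, then SVD dimension reduction; the titles and ids are hypothetical and the jieba package is assumed to be installed:

```python
import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

news = pd.DataFrame({'ent_id': ['a', 'a', 'b', 'c'],
                     'title': ['公司债券违约', '评级下调公告', '经营活动正常', '新发行债券获批']})
news['tokens'] = news['title'].apply(lambda t: ' '.join(jieba.cut(t)))   # word segmentation
docs = news.groupby('ent_id')['tokens'].agg(' '.join).reset_index()      # concatenate per enterprise

tfidf = TfidfVectorizer().fit_transform(docs['tokens'])                  # sparse TF-IDF matrix
svd_feas = TruncatedSVD(n_components=2, random_state=42).fit_transform(tfidf)  # dense low-dimensional features
```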
4. Solution walkthrough
4.1 Imports
```python
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import shuffle
from gensim.models import Word2Vec
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

import os, sys

class HiddenPrints:
    """Context manager that silences print output inside a `with` block."""
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
# usage: with HiddenPrints(): ...   (nothing is printed inside the block)

import torch
import torch.nn as nn                                # neural-network layers
import torch.nn.functional as F                      # functional ops: relu, softmax, ...
import torchvision                                   # vision models and datasets
import torchvision.transforms as transforms          # image transforms
import torch.optim as optim                          # optimizers
from torch.utils.tensorboard import SummaryWriter    # writes TensorBoard-readable logs
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

gpu_flag = torch.cuda.is_available()
if gpu_flag:
    print("using gpu", torch.version.cuda, "...")
else:
    print("using cpu...")
print("Pytorch:", torch.__version__)
print("torchvision:", torchvision.__version__)
torch.set_printoptions(linewidth=120)                # line width for printed tensors
```
4.2 Reading the data

```python
# Read the raw data
news_list = []
for idx, line in enumerate(open('../input/guotaichusai/ent_news.csv', encoding='utf-8')):  # enterprise public-opinion records
    if idx == 0:
        cols = line.split('|')
    else:
        line_list = line.split('|')
        # the data has only 9 columns; glue any extra '|' fragments back into the last one
        line_list = line_list[:8] + [''.join(line_list[8:]).replace('\n', '')]
        news_list.append(line_list)
news_df = pd.DataFrame(news_list, columns=cols)

ent_default = pd.read_csv('../input/guotaichusai/ent_default.csv', sep='|')            # default records of bond issuers
ent_fina = pd.read_csv('../input/guotaichusai/ent_financial_indicator.csv', sep='|')   # financial indicator data
ent_info = pd.read_csv('../input/guotaichusai/ent_info.csv', sep='|')                  # basic enterprise information
answer = pd.read_csv('../input/guotaichusai/answer.csv', sep='|')                      # test set (submission template)
```
4.3 Time handling

```python
### Time handling
# Keep only the year
ent_default['year'] = ent_default['acu_date'].apply(lambda x: x // 10000)
ent_fina['year'] = ent_fina['report_period'].apply(lambda x: x // 10000)
news_df['year'] = news_df['publishdate'].apply(lambda x: int(x) // 10000)

ent_default['ent_id_year'] = ent_default['ent_id'] + '_' + (ent_default['year'] - 1).astype(str)
ent_fina['ent_id_year'] = ent_fina['ent_id'] + '_' + ent_fina['year'].astype(str)
news_df['ent_id_year'] = news_df['ent_id'] + '_' + news_df['year'].astype(str)
answer['ent_id_year'] = answer['ent_id'].apply(lambda x: x + '_2020')
del ent_fina['year'], news_df['year']

# Deduplicate
ent_default_new = ent_default.drop_duplicates(subset=['ent_id_year'], keep='last')

# Merge
ent_default_new['default_score'] = 1
answer['year'] = 2021
data = pd.concat([ent_default_new[['ent_id', 'ent_id_year', 'year', 'default_score']], answer],
                 axis=0, ignore_index=True)
del ent_default_new
```
4.4 Constructing negative samples

```python
# Construct negative samples: enterprises that did NOT default in a given year
# ent_ids = [i for i in answer['ent_id'].unique() if i not in ent_default[ent_default['year']==2019]['ent_id'].unique().tolist()]
ent_ids = [i for i in ent_info['ent_id'].unique()
           if i not in ent_default[ent_default['year'] == 2019]['ent_id'].unique().tolist()]
ent_ids_df = pd.DataFrame({'ent_id': ent_ids})
ent_ids_df['year'] = 2019
ent_ids_df['default_score'] = 0
ent_ids_df['ent_id_year'] = ent_ids_df['ent_id'].apply(lambda x: x + '_2018')
data = pd.concat([data, ent_ids_df], axis=0, ignore_index=True)

# ent_ids = [i for i in answer['ent_id'].unique() if i not in ent_default[ent_default['year']==2020]['ent_id'].unique().tolist()]
ent_ids = [i for i in ent_info['ent_id'].unique()
           if i not in ent_default[ent_default['year'] == 2020]['ent_id'].unique().tolist()]
ent_ids_df = pd.DataFrame({'ent_id': ent_ids})
ent_ids_df['year'] = 2020
ent_ids_df['default_score'] = 0
ent_ids_df['ent_id_year'] = ent_ids_df['ent_id'].apply(lambda x: x + '_2019')
data = pd.concat([data, ent_ids_df], axis=0, ignore_index=True)
```
4.5 Feature extraction and processing

```python
# Feature extraction
# ent_info.csv (basic enterprise information)
ent_info_cat_cols = ['industryphy', 'industryco', 'enttype', 'entstatus', 'prov', 'city', 'county', 'is_bondissuer']
ent_info_num_cols = ['regcap']
ent_info_time_cols = ['opfrom', 'opto', 'esdate', 'apprdate']  # operating-period start/end, establishment date, approval date
# ent_financial_indicator.csv (financial indicator data)
ent_fina_num_cols = [f for f in ent_fina.columns if f not in ['ent_id', 'report_period', 'ent_id_year']]
ent_fina_time_cols = ['report_period']  # reporting period

# Natural-number (label) encoding
def label_encode(series):
    unique = list(series.unique())
    # unique.sort()
    return series.map(dict(zip(unique, range(series.nunique()))))

for col in ent_info_cat_cols:
    ent_info[col] = label_encode(ent_info[col])

# Merge ent_info and ent_fina into the sample table
print(data.shape)
ent_info_new = ent_info.drop_duplicates()
data = data.merge(ent_info_new, on=['ent_id'], how='left')
print(data.shape)
ent_fina_new = ent_fina.sort_values('report_period').drop_duplicates(subset=['ent_id_year'], keep='last')
data = data.merge(ent_fina_new, on=['ent_id', 'ent_id_year'], how='left')
print(data.shape)

# Time-related features
data['opfrom_year'] = data['opfrom'].fillna('0000').apply(lambda x: int(x[:4]))
data['opto_year'] = data['opto'].fillna('0000').apply(lambda x: int(x[:4]))
data['esdate_year'] = data['esdate'].fillna('0000').apply(lambda x: int(x[:4]))
# Missing operating-period start dates and approval dates are replaced by the establishment date
data.loc[data.opfrom.isnull(), 'opfrom'] = data.loc[data.opfrom.isnull(), 'esdate']
data.loc[data.apprdate.isnull(), 'apprdate'] = data.loc[data.apprdate.isnull(), 'esdate']
# Time-difference features
data['opfrom_esdate_diff'] = data['opfrom'].apply(lambda x: int(x[:4])) - data['esdate'].apply(lambda x: int(x[:4]))
data['apprdate_esdate_diff'] = data['apprdate'].apply(lambda x: int(x[:4])) - data['esdate'].apply(lambda x: int(x[:4]))
data['year_opfrom_diff'] = data['year'] - data['opfrom'].apply(lambda x: int(x[:4]))
data['year_esdate_diff'] = data['year'] - data['esdate'].apply(lambda x: int(x[:4]))
data['year_apprdate_diff'] = data['year'] - data['apprdate'].apply(lambda x: int(x[:4]))
# Reporting period of the financial indicators
data['report_period_year'] = data['report_period'].apply(lambda x: x // 10000)
data['report_period_month'] = data['report_period'].apply(lambda x: x % 10000 // 100)

# News-related features
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD, SparsePCA

tmp_df = news_df.groupby(['ent_id_year'])['newssource'].agg([list]).reset_index()
tmp_df['list'] = tmp_df['list'].apply(lambda x: ' '.join([i for i in x]))
tfidf = TfidfVectorizer()
tf = tfidf.fit_transform(tmp_df['list'].fillna('##').values)
decom = TruncatedSVD(n_components=128, random_state=1024)
decom_x = decom.fit_transform(tf)
decom_feas = pd.DataFrame(decom_x)
decom_feas.columns = ['newssource_svd_' + str(i) for i in range(decom_feas.shape[1])]
decom_feas['ent_id_year'] = tmp_df['ent_id_year']
data = data.merge(decom_feas, on=['ent_id_year'], how='left')

for col in ['indextype', 'index']:
    tmp_df = news_df.groupby(['ent_id_year'])[col].agg([list]).reset_index()
    tmp_df['list'] = tmp_df['list'].apply(lambda x: ' '.join([str(i) for i in x]))
    countv = CountVectorizer()
    cv = countv.fit_transform(tmp_df['list'].fillna("##").values)
    cv_df = pd.DataFrame(cv.toarray())
    cv_df.columns = [col + '_cv_' + str(i) for i in range(cv_df.shape[1])]
    cv_df['ent_id_year'] = tmp_df['ent_id_year']
    data = data.merge(cv_df, on=['ent_id_year'], how='left')
    del cv_df

# Other count features
# ent_default: number of defaults in the previous year
tmp_df = ent_default.groupby(['ent_id', 'year'])['ent_id_year'].agg(['count']).reset_index().sort_values('count')
tmp_df.columns = ['ent_id', 'year', 'ent_default_last_year_cnts']
tmp_df['year'] = tmp_df['year'] + 1
data = data.merge(tmp_df, on=['ent_id', 'year'], how='left')
# data['ent_default_last_year_cnts'] = data['ent_default_last_year_cnts'].fillna(0)
# ent_info: record count per enterprise
tmp_df = ent_info.groupby(['ent_id'])['ent_id'].agg(['count']).reset_index().sort_values('count')
tmp_df.columns = ['ent_id', 'ent_info_cnts']
data = data.merge(tmp_df, on=['ent_id'], how='left')
# ent_fina: number of financial reports per enterprise-year
tmp_df = ent_fina.groupby(['ent_id_year'])['ent_id_year'].agg(['count']).reset_index().sort_values('count')
tmp_df.columns = ['ent_id_year', 'ent_fina_last_year_cnts']
data = data.merge(tmp_df, on=['ent_id_year'], how='left')
```
4.6 Preparing the data

```python
# Prepare the training and test data
features = [f for f in data.columns
            if f not in ['ent_id', 'ent_id_year', 'default_score', 'is_bondissuer'] + ent_info_time_cols + ent_fina_time_cols]

# Shuffle the data (disabled)
# print(data)
# data = shuffle(data).reset_index(drop=True)
# print(data)

tijiao = 0  # 0 means one fifth of the data is split off for self-testing

train_all = data[data.year != 2021].reset_index(drop=True)  # .reset_index(drop=True) resets the index
print("All labelled data:", train_all.shape)

train_all_p = train_all[train_all.default_score == 1].reset_index(drop=True)  # positive samples
mp = train_all_p.shape[0]  # 151
# training part of the positives
train_list = [j for j in range(train_all_p.shape[0]) if (j + 1) % 5 != tijiao]
train_jicheng_p = shuffle(train_all_p.iloc[train_list].reset_index(drop=True)).reset_index(drop=True)  # train positives
print(train_jicheng_p.shape)
# test part of the positives
train_test_list = [j for j in range(train_all_p.shape[0]) if (j + 1) % 5 == 0]
test_jicheng_p = shuffle(train_all_p.iloc[train_test_list].reset_index(drop=True)).reset_index(drop=True)  # test positives
print(test_jicheng_p.shape)
print("Positive samples in the labelled data:", mp)

train_all_n = train_all[train_all.default_score == 0].reset_index(drop=True)  # negative samples
mn = train_all_n.shape[0]  # 19921
print("Negative samples in the labelled data:", mn)
recall_r = mp / mn
print("Rescaling ratio:", recall_r)
print(type(train_all))

# Rescaling function: maps a probability predicted on the resampled data back to the original class prior
def recall(y, r):
    y_r = y / (r + y * (1 - r))
    return y_r

# print(train_all)
train_list = [i for i in range(train_all.shape[0]) if (i + 1) % 5 != tijiao]
# print(train.iloc[train_list])
train = train_all.iloc[train_list].reset_index(drop=True)
print(train.shape)
# print(train)
train_test_list = [i for i in range(train_all.shape[0]) if (i + 1) % 5 == 0]
test = train_all.iloc[train_test_list].reset_index(drop=True)
print(test.shape)
test_final = data[data.year == 2021].reset_index(drop=True)  # the real test set (2021)
print(test_final.shape)

x_train = train[features]
x_test = test[features]
y_test = test['default_score']
x_test_final = test_final[features]
y_train = train['default_score']

# Ensemble-learning data
x_train_jicheng = []
y_train_jicheng = []
x_test_jicheng = []
y_test_jicheng = []
# NN data
x_train_jicheng_nn = []
y_train_jicheng_nn = []
x_test_jicheng_nn = []
y_test_jicheng_nn = []

# Split the negatives into 120 groups
train_all_n = shuffle(shuffle(train_all_n).reset_index(drop=True)).reset_index(drop=True)
# print(train_all_n)
jicheng_r = 1 / 1  # /120
jicheng_num = int(120 * jicheng_r)
train_all_n_jicheng_list = []
train_all_n_jicheng = []
train_jicheng = []
test_jicheng = []
train_jicheng_nn = []
# test_jicheng_nn = []
for i in range(jicheng_num):
    # divide the negatives into `jicheng_num` groups
    # print(i)
    train_all_n_jicheng_list.append([j for j in range(train_all_n.shape[0]) if (j + 1) % jicheng_num == i])
    # print(train_all_n_jicheng_list[i])
    # print(train_all_n.shape)
    train_all_n_jicheng.append(train_all_n.iloc[train_all_n_jicheng_list[i]].reset_index(drop=True))
    # print(train_all_n_jicheng[i].shape)
    # within each group:
    # training negatives
    train_list = [j for j in range(train_all_n_jicheng[i].shape[0]) if (j + 1) % 5 != tijiao]
    train_jicheng.append(pd.concat([train_all_n_jicheng[i].iloc[train_list].reset_index(drop=True),
                                    train_jicheng_p]).reset_index(drop=True))  # training data for this group
    # Interleave the two classes evenly to make life easier for the NN model (disabled)
    cnn_num_n = 0
    cnn_num_p = 0
    '''
    for c in range(train_jicheng[i].shape[0]):
        if (c+1) % (1*jicheng_r) == 0:
            cnn_train = pd.concat([cnn_train, train_jicheng_p[cnn_num]]).reset_index(drop=True)
            cnn_num_p = cnn_num_p + 1
        elif c == 0:
            cnn_train = train_jicheng[i].iloc[0]
            print(type(cnn_train), train_jicheng[i].iloc[128])
        else:
            cnn_train = pd.concat([cnn_train, train_jicheng[i].iloc[cnn_num_n]]).reset_index(drop=True)
            cnn_num_n = cnn_num_n + 1
    # train_jicheng_nn.append(cnn_train)
    train_jicheng_nn.append([torch.tensor([cnn_train[features].values], dtype=torch.float16),
                             torch.tensor([cnn_train['default_score']])])
    # print("f:+++++++++++++++++++", cnn_train[features], nn_x_train_jicheng, nn_y_train_jicheng)
    # del cnn_train
    #'''
    # the next two lines shuffle this group's training data
    train_jicheng[i] = shuffle(train_jicheng[i]).reset_index(drop=True)
    train_jicheng[i] = shuffle(train_jicheng[i]).reset_index(drop=True)
    # print(train_jicheng[i].shape)
    # test negatives
    train_test_list = [j for j in range(train_all_n_jicheng[i].shape[0]) if (j + 1) % 5 == 0]
    # pd.merge(df1, df2, on='B', how='left')
    test_jicheng.append(pd.concat([train_all_n_jicheng[i].iloc[train_test_list].reset_index(drop=True),
                                   test_jicheng_p]).reset_index(drop=True))  # test data for this group
    # uncomment the next two lines to also shuffle the test data
    # test_jicheng[i] = shuffle(test_jicheng[i]).reset_index(drop=True)
    # test_jicheng[i] = shuffle(test_jicheng[i]).reset_index(drop=True)
    # print(test_jicheng[i].shape)
    # store per-group data; each i corresponds to one batch of data
    x_train_jicheng.append(train_jicheng[i][features])
    y_train_jicheng.append(train_jicheng[i]['default_score'])
    x_train_jicheng_nn.append(torch.tensor(x_train_jicheng[i].values, dtype=torch.float64))
    y_train_jicheng_nn.append(torch.tensor(y_train_jicheng[i].values))
    x_test_jicheng.append(test_jicheng[i][features])
    y_test_jicheng.append(test_jicheng[i]['default_score'])
    x_test_jicheng_nn.append(torch.tensor(x_test_jicheng[i].values, dtype=torch.float64))
    y_test_jicheng_nn.append(torch.tensor(y_test_jicheng[i].values))
    # nn_x_train_jicheng, nn_y_train_jicheng = torch.tensor(x_train_jicheng[i].values, dtype=torch.float64), torch.tensor(y_train_jicheng[i].values)
'''
# print(type(x_train_jicheng[0]))
i = 0
# print(x_train_jicheng[i], y_train_jicheng[i])
# print(x_train_jicheng_nn[i], y_train_jicheng_nn[i])
print(x_train_jicheng_nn[0].shape)
# replace NaN with -10000
nn_x_train_jicheng, nn_y_train_jicheng = torch.where(torch.isnan(x_train_jicheng_nn[i]),
                                                     torch.full_like(x_train_jicheng_nn[i], -10000),
                                                     x_train_jicheng_nn[i]), y_train_jicheng_nn[i]
nn_x_test_jicheng, nn_y_test_jicheng = torch.where(torch.isnan(x_test_jicheng_nn[i]),
                                                   torch.full_like(x_test_jicheng_nn[i], -10000),
                                                   x_test_jicheng_nn[i]), y_test_jicheng_nn[i]
# print(nn_x_train_jicheng, nn_y_train_jicheng)
print(nn_x_train_jicheng.shape)  # 317, 435
print(nn_x_test_jicheng.shape)   # 63, 435
#'''
```
4.7 Neural network model

```python
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(in_features=1 * 435, out_features=250)
        self.fc2 = nn.Linear(in_features=250, out_features=120)
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=6, kernel_size=5)   # weights are randomly initialized
        self.conv2 = nn.Conv1d(in_channels=6, out_channels=12, kernel_size=5)
        self.fc3 = nn.Linear(in_features=12 * 27, out_features=120)
        self.fc4 = nn.Linear(in_features=120, out_features=60)
        self.fc5 = nn.Linear(in_features=60, out_features=10)
        self.out = nn.Linear(in_features=10, out_features=2)

    def forward(self, t):
        # (batch, 1, 435)
        t = self.fc1(t)
        t = F.tanh(t)
        # (batch, 1, 250)
        t = self.fc2(t)
        t = F.tanh(t)
        # (batch, 1, 120)
        t = F.tanh(self.conv1(t))                              # (batch, 6, 116)
        t = F.max_pool1d(t, kernel_size=2, stride=2)           # (batch, 6, 58), rounds down
        t = F.tanh(self.conv2(t))                              # (batch, 12, 54)
        t = F.max_pool1d(t, kernel_size=2, stride=2)           # (batch, 12, 27)
        t = F.tanh(self.fc3(t.reshape(-1, 12 * t.shape[2])))   # (batch, 120)
        t = F.tanh(self.fc4(t))
        t = F.tanh(self.fc5(t))
        t = self.out(t)
        t = F.softmax(t, dim=1)
        t = recall(t[:, 0], jicheng_r)   # rescale the predicted probability (column 0 of the softmax output)
        return t
```
4.8 Helper functions

```python
def pre(pres):
    # threshold the probabilities at 0.5 to get hard labels
    pres[pres >= 0.5] = 1
    pres[pres < 0.5] = 0
    return pres

def get_correct_num(pres, labels):
    # predictions greater than 0.5 count as class 1
    # print(pres.shape)
    pres = pre(pres)
    right_num = pres.eq(labels).sum()  # eq(labels) compares element-wise: 1 where equal, 0 otherwise
    return right_num
```
4.9 Network training

```python
# Build the network
torch.set_grad_enabled(True)  # gradient tracking can be disabled for inference-only passes to save memory
network = CNN()

# Prepare the data for the final answer
x_test_nn = torch.tensor(x_test_final.values, dtype=torch.float64)
x_test_nn = torch.where(torch.isnan(x_test_nn), torch.full_like(x_test_nn, -10000), x_test_nn)
nn_x_test = x_test_nn.reshape(x_test_nn.shape[0], 1, -1).to(torch.float32)  # (8963, 1, 435)
pre_test_nn_all = torch.zeros(network(nn_x_test).shape)
pre_val_nn_all = torch.zeros_like(network(nn_x_test))
# print(pre_val_nn_all)

lun = 20  # number of groups / rounds
for i in range(lun - 20, lun):
    # Re-initialize the network for each group
    torch.set_grad_enabled(True)
    network = CNN()
    # Convert this group's training and test data (replace NaN with -10000)
    nn_x_train_jicheng, nn_y_train_jicheng = torch.where(torch.isnan(x_train_jicheng_nn[i]),
                                                         torch.full_like(x_train_jicheng_nn[i], -10000),
                                                         x_train_jicheng_nn[i]), y_train_jicheng_nn[i]
    nn_x_test_jicheng, nn_y_test_jicheng = torch.where(torch.isnan(x_test_jicheng_nn[i]),
                                                       torch.full_like(x_test_jicheng_nn[i], -10000),
                                                       x_test_jicheng_nn[i]), y_test_jicheng_nn[i]
    # Self-test (validation) data
    nn_x_val = nn_x_test_jicheng.reshape(nn_x_test_jicheng.shape[0], 1, -1).to(torch.float32)  # (63, 1, 435)
    # print(nn_x_val.shape)
    # Load the training data
    train_jicheng_data = TensorDataset(nn_x_train_jicheng, nn_y_train_jicheng)
    # dataloaders
    batch_size = 6
    # make sure to SHUFFLE your data
    train_jicheng_loader = DataLoader(train_jicheng_data, shuffle=True, batch_size=batch_size)
    '''
    batch = next(iter(train_jicheng_loader))
    nn_x, nn_y = batch
    print(nn_x.shape, nn_y.shape)   # torch.Size([1, 435]) torch.Size([1])
    nn_x = nn_x.reshape(nn_x.shape[0], 1, -1)  # .unsqueeze(0)
    nn_x_test = nn_x_test_jicheng.reshape(nn_x_test_jicheng.shape[0], 1, -1).to(torch.float32)
    print(nn_x.shape)               # torch.Size([batch_size, 1, 435])
    '''
    # Train the network
    epoch_num = 400
    for epoch in range(epoch_num):
        batch_id = 0
        total_loss = 0
        total_correct = 0
        for batch in train_jicheng_loader:
            nn_x, nn_y = batch
            nn_x = nn_x.reshape(nn_x.shape[0], 1, -1)
            nn_x = nn_x.to(torch.float32)
            pres = network(nn_x)
            loss_f = nn.SmoothL1Loss()                    # ("Boolean value of Tensor with more than one value is ambiguous" otherwise)
            loss = loss_f(pres, nn_y.to(torch.float32))   # ("Found dtype Long but expected Float" without the cast)
            optimizer = optim.Adam(network.parameters(), lr=0.001)  # optimizer
            optimizer.zero_grad()   # PyTorch accumulates gradients, so clear them before each backward pass
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            # print(pres)
            total_correct += get_correct_num(pres, nn_y)
            # print(pres)
            batch_id += 1
            # if batch_id % 20 == 0:
            #     print("batch_id:", batch_id, ",total_correct:", total_correct, ",total_loss:", total_loss)
        correct_rate = total_correct / len(train_jicheng_data)
        pre_val_nn = network(nn_x_val)
        num = get_correct_num(pre_val_nn, nn_y_test_jicheng)  # out of 63
        # print("epoch:", epoch, ",correct_rate:", correct_rate, ",total_loss:", total_loss,
        #       "test_correct_num:", num, "test_shape:", nn_y_test_jicheng.shape)
        # num += num
    pre_test_nn_all += network(nn_x_test)
    print("i:", i, "pre:", pre_test_nn_all, "grade:", correct_rate, ",total_loss:", total_loss,
          "test_correct_num:", num, "test_shape:", nn_y_test_jicheng.shape)
    # num_all = nn_y_test_jicheng.shape[0]
    # grate = num / num_all
    # print("grate:", grate)
```
4.10 NN model output

```python
pre_test_nn_all = pre_test_nn_all / 20   # average over the 20 group models
test_final['default_score'] = pre_test_nn_all.cpu().detach().numpy()
answer_cnn = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_cnn[['ent_id', 'default_score']].to_csv('answer_fcnn_19_400.csv', header=True, index=False, sep='|')
```
4.11 LGB / XGB / CAT model building

```python
# Build the models
def cv_model(clf, train_x, train_y, test_x, test_y, test_final_x, clf_name, flag):
    folds = train_x.shape[0] - 1
    print(folds)
    folds = 10
    seed = 2022
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    test_final = np.zeros(test_final_x.shape[0])
    cv_scores = []
    cv_scores_test = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                # 10-fold, lr 0.001, 900: lgb_score_mean: 0.9231728810742007  lgb_score_mean_test: 0.9432300033467202
                # 10-fold, lr 0.01,  900: lgb_score_mean: 0.9365468438097789  lgb_score_mean_test: 0.9592670682730924
                # 10-fold, lr 0.1,   900: lgb_score_mean: 0.9362564399139585  lgb_score_mean_test: 0.9548477242302542
                # 10-fold, lr 0.05,  900: lgb_score_mean: 0.9369092381980579  lgb_score_mean_test: 0.9602484939759035 ***
                # 10-fold, lr 0.05,  900, baseline negatives: lgb_score_mean: 0.9473846153846154  lgb_score_mean_test: 0.9585261089275688
                # 10-fold, lr 0.01,  900, baseline negatives: lgb_score_mean: 0.9452154745838957  lgb_score_mean_test: 0.9586047164514317
                # lgb_score_mean: 0.9365468438097789  lgb_score_mean_test: 0.9592628848728246 ++++
                # lgb_score_mean: 0.9580903014889222  lgb_score_mean_test: 0.9497874029139975
                # 5-fold, lr 0.001, 900: lgb_score_mean: 0.9251487496189362  lgb_score_mean_test: 0.95645749665328
                # 5-fold, lr 0.005, 900: lgb_score_mean: 0.9262832002448256  lgb_score_mean_test: 0.9572573627844712
                # 5-fold, lr 0.01,  900: lgb_score_mean: 0.9231927998284016  lgb_score_mean_test: 0.9573862115127177
                # 8-fold, lr 0.01,  900: lgb_score_mean: 0.9306092662273949  lgb_score_mean_test: 0.958592704149933 +++
                # 8-fold, lr 0.05,  900: lgb_score_mean: 0.9321498906176219  lgb_score_mean_test: 0.9570511211512718 ***
                # 8-fold, lr 0.05, 1200: lgb_score_mean: 0.9324803792387731  lgb_score_mean_test: 0.9570751757028113
                # 0.9615
                'learning_rate': 0.01,
                'seed': 2022,
                'n_jobs': -1,
                'verbose': -1,
            }
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                              categorical_feature=[], verbose_eval=500, early_stopping_rounds=900)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            if flag == 1:
                val_pred = recall(val_pred, recall_r)
            if flag == 2:
                val_pred = recall(val_pred, jicheng_r)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            if flag == 1:
                test_pred = recall(test_pred, recall_r)
            if flag == 2:
                test_pred = recall(test_pred, jicheng_r)
            test_final_pred = model.predict(test_final_x, num_iteration=model.best_iteration)
            if flag == 1:
                test_final_pred = recall(test_final_pred, recall_r)
            if flag == 2:
                test_final_pred = recall(test_final_pred, jicheng_r)
            # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            test_final_matrix = clf.DMatrix(test_final_x)
            params = {
                'booster': 'gbtree',
                'objective': 'binary:logistic',
                'eval_metric': 'auc',
                'gamma': 1,
                'min_child_weight': 1.5,
                'max_depth': 5,
                'lambda': 10,
                'subsample': 0.7,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 0.7,
                # 8-fold, eta 0.005, 800: xgb_score_mean: 0.9316989067398125  xgb_score_mean_test: 0.9527265311244979
                # xgb_score_mean: 0.9375689969683867  xgb_score_mean_test: 0.951910140562249
                'eta': 0.01,
                'tree_method': 'exact',
                'seed': 2022,
                'nthread': 36
            }
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist,
                              verbose_eval=500, early_stopping_rounds=800)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            if flag == 1:
                val_pred = recall(val_pred, recall_r)
            if flag == 2:
                val_pred = recall(val_pred, jicheng_r)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
            if flag == 1:
                test_pred = recall(test_pred, recall_r)
            if flag == 2:
                test_pred = recall(test_pred, jicheng_r)
            test_final_pred = model.predict(test_final_matrix, ntree_limit=model.best_ntree_limit)
            if flag == 1:
                test_final_pred = recall(test_final_pred, recall_r)
            if flag == 2:
                test_final_pred = recall(test_final_pred, jicheng_r)
        if clf_name == "cat":
            # 8-fold, lr 0.05, 800: cat_score_mean: 0.8828433432624545  cat_score_mean_test: 0.9101478831994645
            # cat_score_mean: 0.8703723803584509  cat_score_mean_test: 0.9054350736278447
            params = {
                'learning_rate': 0.01,
                'depth': 5,
                'l2_leaf_reg': 10,
                # 'bootstrap_type': ''
                'od_type': "Iter",
                'od_wait': 50,
                'random_seed': 11,
                # 'allow_writing_files': True
            }
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y), cat_features=[], use_best_model=True, verbose=800)
            val_pred = model.predict(val_x)
            if flag == 1:
                val_pred = recall(val_pred, recall_r)
            if flag == 2:
                val_pred = recall(val_pred, jicheng_r)
            test_pred = model.predict(test_x)
            if flag == 1:
                test_pred = recall(test_pred, recall_r)
            if flag == 2:
                test_pred = recall(test_pred, jicheng_r)
            test_final_pred = model.predict(test_final_x)
            if flag == 1:
                test_final_pred = recall(test_final_pred, recall_r)
            if flag == 2:
                test_final_pred = recall(test_final_pred, jicheng_r)
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        test_final += test_final_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        cv_scores_test.append(roc_auc_score(test_y, test))
        print("cv_scores:", cv_scores, "cv_scores_test:", cv_scores_test)
    print("%s_scotrainre_list:" % clf_name, cv_scores, "%s_scotrainre_list_test:" % clf_name, cv_scores_test)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores), "%s_score_mean_test:" % clf_name, np.mean(cv_scores_test))
    print("%s_score_std:" % clf_name, np.std(cv_scores), "%s_score_std_test:" % clf_name, np.std(cv_scores_test))
    return train, test, test_final

# flag: 0 - no rescaling; 1 - rescale with recall_r; 2 - rescale with jicheng_r
def lgb_model(x_train, y_train, x_test, y_test, x_test_final, flag):
    lgb_train, lgb_test, lgb_test_final = cv_model(lgb, x_train, y_train, x_test, y_test, x_test_final, "lgb", flag)
    return lgb_train, lgb_test, lgb_test_final

def xgb_model(x_train, y_train, x_test, y_test, x_test_final, flag):
    xgb_train, xgb_test, xgb_test_final = cv_model(xgb, x_train, y_train, x_test, y_test, x_test_final, "xgb", flag)
    return xgb_train, xgb_test, xgb_test_final

def cat_model(x_train, y_train, x_test, y_test, x_test_final, flag):
    cat_train, cat_test, cat_test_final = cv_model(CatBoostRegressor, x_train, y_train, x_test, y_test, x_test_final, "cat", flag)
    return cat_train, cat_test, cat_test_final
```
4.12 LGB / XGB / CAT model training

```python
lgb_train, lgb_test, lgb_test_final = lgb_model(x_train, y_train, x_test, y_test, x_test_final, 1)
test_final['default_score'] = lgb_test_final
answer_lgbrecall_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_lgbrecall_all[['ent_id', 'default_score']].to_csv('answer_lgbrecall_all_0506.csv', header=True, index=False, sep='|')

xgb_train, xgb_test, xgb_test_final = xgb_model(x_train, y_train, x_test, y_test, x_test_final, 1)
test_final['default_score'] = xgb_test_final
answer_xgbrecall_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_xgbrecall_all[['ent_id', 'default_score']].to_csv('answer_xgbrecall_all_ptp_0507.csv', header=True, index=False, sep='|')

cat_train, cat_test, cat_test_final = cat_model(x_train, y_train, x_test, y_test, x_test_final, 1)
test_final['default_score'] = cat_test_final
answer_catrecall_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_catrecall_all[['ent_id', 'default_score']].to_csv('answer_catrecall_all_0506.csv', header=True, index=False, sep='|')
```

Ensemble learning (one model per negative-sample group, averaged):

```python
# LGB ensemble
a = b = c = 0
for i in range(jicheng_num):
    # silence print output
    sys.stdout = open(os.devnull, 'w')
    if (i + 1) % 30 == 0:
        # restore print output
        sys.stdout = sys.__stdout__
    lgb_train_jicheng, lgb_test_jicheng, lgb_test_final_jicheng = \
        lgb_model(x_train_jicheng[i], y_train_jicheng[i], x_test_jicheng[i], y_test_jicheng[i], x_test_final, 0)
    # a += lgb_train_jicheng
    # b += lgb_test_jicheng
    c += lgb_test_final_jicheng
# lgb_train_jicheng, lgb_test_jicheng, lgb_test_final_jicheng = a/jicheng_num, b/jicheng_num, c/jicheng_num
lgb_test_final_jicheng = c / jicheng_num
test_final['default_score'] = lgb_test_final_jicheng
answer_lgbjicheng_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_lgbjicheng_all[['ent_id', 'default_score']].to_csv('answer_lgbjicheng_all_0506.csv', header=True, index=False, sep='|')

# XGB ensemble
a = b = c = 0
for i in range(jicheng_num):
    # silence print output
    sys.stdout = open(os.devnull, 'w')
    if (i + 1) % 30 == 0:
        # restore print output
        sys.stdout = sys.__stdout__
    xgb_train_jicheng, xgb_test_jicheng, xgb_test_final_jicheng = \
        xgb_model(x_train_jicheng[i], y_train_jicheng[i], x_test_jicheng[i], y_test_jicheng[i], x_test_final, 0)
    # a += xgb_train_jicheng
    # b += xgb_test_jicheng
    c += xgb_test_final_jicheng
# xgb_train_jicheng, xgb_test_jicheng, xgb_test_final_jicheng = a/jicheng_num, b/jicheng_num, c/jicheng_num
xgb_test_final_jicheng = c / jicheng_num
test_final['default_score'] = xgb_test_final_jicheng
answer_xgbjicheng_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_xgbjicheng_all[['ent_id', 'default_score']].to_csv('answer_xgbjicheng_all_0506.csv', header=True, index=False, sep='|')

# CAT ensemble
a = b = c = 0
for i in range(jicheng_num):
    # silence print output
    sys.stdout = open(os.devnull, 'w')
    if (i + 1) % 30 == 0:
        # restore print output
        sys.stdout = sys.__stdout__
    cat_train_jicheng, cat_test_jicheng, cat_test_final_jicheng = \
        cat_model(x_train_jicheng[i], y_train_jicheng[i], x_test_jicheng[i], y_test_jicheng[i], x_test_final, 0)
    # a += cat_train_jicheng
    # b += cat_test_jicheng
    c += cat_test_final_jicheng
# cat_train_jicheng, cat_test_jicheng, cat_test_final_jicheng = a/jicheng_num, b/jicheng_num, c/jicheng_num
cat_test_final_jicheng = c / jicheng_num
test_final['default_score'] = cat_test_final_jicheng
answer_catjicheng_all = pd.merge(answer[['ent_id']], test_final[['ent_id', 'default_score']], on=['ent_id'], how='left')
answer_catjicheng_all[['ent_id', 'default_score']].to_csv('answer_catjicheng_all_recall_shuffle_20_0508.csv', header=True, index=False, sep='|')
```
4.13 Ensemble code

Train the models separately (they can run at the same time), write each model's predictions to its own file, then blend the files directly.

```python
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import shuffle
from gensim.models import Word2Vec
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

import os, sys

class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
# usage: with HiddenPrints(): ...   (nothing is printed inside the block)

answer = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
answer_050705 = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
answer_recall = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
answer_jicheng = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
answer_jicheng_recall_11 = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
answer_jicheng_recall_20 = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
# answer_cnn = pd.read_csv('../input/xintaijicheng/answer.csv', sep='|')
'''
lgb_recall_all = pd.read_csv('../input/xintaijicheng/answer_lgbrecall_all_0506.csv', sep='|')
xgb_recsll_all = pd.read_csv('../input/xintaijicheng/answer_xgbrecall_all_0506.csv', sep='|')
cat_recall_all = pd.read_csv('../input/xintaijicheng/answer_catrecall_all_0506.csv', sep='|')
lgb_jicheng_all = pd.read_csv('../input/xintaijicheng/answer_lgbjicheng_all_0506.csv', sep='|')
xgb_jicheng_all = pd.read_csv('../input/xintaijicheng/answer_xgbjicheng_all_0506.csv', sep='|')
cat_jicheng_all = pd.read_csv('../input/xintaijicheng/answer_catjicheng_all_0506.csv', sep='|')
lgb_jicheng_all_recall_11_0507 = pd.read_csv('../input/xintaijicheng/answer_lgbjicheng_all_recall_11_0507.csv', sep='|')
xgb_jicheng_all_recall_11_0507 = pd.read_csv('../input/xintaijicheng/answer_xgbjicheng_all_recall__11_0507.csv', sep='|')
cat_jicheng_all_recall_11_0507 = pd.read_csv('../input/xintaijicheng/answer_catjicheng_all_recall_11_0507.csv', sep='|')
'''
lgb_jicheng_all_recall_20_0507 = pd.read_csv('../input/xintaijicheng/answer_lgbjicheng_all_recall_20_0507.csv', sep='|')   # 0.991451
xgb_jicheng_all_recall_20_0507 = pd.read_csv('../input/xintaijicheng/answer_xgbjicheng_all_recall_20_0507.csv', sep='|')   # 0.991832
cat_jicheng_all_recall_shuffle_30_0508 = pd.read_csv('../input/xintaijicheng/answer_catjicheng_all_recall__shuffle_30_0508.csv', sep='|')  # 0.991635
cnn = pd.read_csv('../input/xintaijicheng/answer_cnn.csv', sep='|')
cnn_981987 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.001_shuffle.csv', sep='|')
cnn_19 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc0019.csv', sep='|')
cnn_39 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc0039.csv', sep='|')
cnn_59 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc0059.csv', sep='|')
cnn_79 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc0079.csv', sep='|')
cnn_99 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc0099.csv', sep='|')
cnn_119 = pd.read_csv('../input/xintaijicheng/answer_cnn_1_52_10_0.01_shuffle_jc00119.csv', sep='|')
fcnn = pd.read_csv('../input/xintaijicheng/answer_fcnn_19.csv', sep='|')
jicnn = pd.read_csv('../input/xintaijicheng/answer_jcnn_19.csv', sep='|')

# answer_recall['default_score'] = (lgb_recall_all['default_score']*0.33343 + xgb_recsll_all['default_score']*0.33348 + cat_recall_all['default_score']*0.33309)
# answer_jicheng['default_score'] = (lgb_jicheng_all['default_score']*0.333 + xgb_jicheng_all['default_score']*0.3333 + cat_jicheng_all['default_score']*0.3337)
# answer_jicheng_recall_11['default_score'] = (lgb_jicheng_all_recall_11_0507['default_score']*0.333144 + xgb_jicheng_all_recall_11_0507['default_score']*0.333481 + cat_jicheng_all_recall_11_0507['default_score']*0.333375)
# answer_jicheng_recall_20['default_score'] = (lgb_jicheng_all_recall_20_0507['default_score']*0.333 + xgb_jicheng_all_recall_20_0507['default_score']*0.334 + cat_jicheng_all_recall_shuffle_30_0508['default_score']*0.333)
answer_jicheng_recall_20['default_score'] = (lgb_jicheng_all_recall_20_0507['default_score'] * 0.4 +
                                             xgb_jicheng_all_recall_20_0507['default_score'] * 0.2 +
                                             cat_jicheng_all_recall_shuffle_30_0508['default_score'] * 0.4)
# answer_050705['default_score'] = answer_recall['default_score']*0.405 + answer_jicheng_recall_11['default_score']*0.595
answer['default_score'] = (cnn_19['default_score'] * 0.16 +
                           cnn_39['default_score'] * 0.17 +
                           cnn_59['default_score'] * 0.17 +
                           cnn_79['default_score'] * 0.16 +
                           cnn_99['default_score'] * 0.17 +
                           cnn_119['default_score'] * 0.17)
answer['default_score'] = (cnn['default_score'] * 0.45 +
                           cnn_981987['default_score'] * 0.45 +
                           answer['default_score'] * 0.001 +
                           answer_jicheng_recall_20['default_score'] * 0.099)
answer['default_score'] = (fcnn['default_score'] * 0.2 +
                           jicnn['default_score'] * 0.2 +
                           answer['default_score'] * 0.6)
'''
answer['default_score'] = (cnn['default_score']*0.45 +
                           cnn_981987['default_score']*0.45 +
                           answer_jicheng_recall_20['default_score']*0.1)*0.5 + answer['default_score']*0.5
answer['default_score'] = (answer['default_score']*0.97 +
                           cnn_19['default_score']*0.01 +
                           cnn_39['default_score']*0.01 +
                           cnn_59['default_score']*0.01)
'''
# print(answer['default_score'], answer_recall['default_score'], answer_jicheng_recall_11['default_score'])
answer[['ent_id', 'default_score']].to_csv('answer_051207.csv', header=True, index=False, sep='|')
```
5. Key points of the solution
Final score: 0.993535
5.1 Feature engineering
- Inspect the data manually and extract/construct features by hand, so the models have useful signals to learn from
5.2 Hyperparameter tuning
- Split off a validation set for choosing hyperparameters; make sure the training and validation sets follow the same distribution, which matters especially when the positive and negative samples are extremely imbalanced
- Tuning mainly concerned the choice of k for k-fold cross validation and the learning rate (a small search sketch follows this list)
- Final choice: 10-fold cross validation with a learning rate of 0.01
- After selection, retrain on all of the data
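A hedged sketch (not from the original write-up, with hypothetical X and y) of this kind of tuning: pick the number of folds and the learning rate by stratified cross-validated AUC (stratification keeps the positive ratio identical in every fold), then refit on all the data with the chosen values:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 8))
y = (rng.random(500) < 0.05).astype(int)       # heavily imbalanced hypothetical labels

best = None
for k in (5, 8, 10):
    for lr in (0.001, 0.01, 0.05, 0.1):
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=2022)
        auc = cross_val_score(LGBMClassifier(learning_rate=lr, n_estimators=300),
                              X, y, cv=cv, scoring='roc_auc').mean()
        if best is None or auc > best[0]:
            best = (auc, k, lr)

# retrain on all of the data with the selected learning rate
final_model = LGBMClassifier(learning_rate=best[2], n_estimators=300).fit(X, y)
```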
5.3 Handling the positive/negative imbalance
- Probability rescaling (the recall function) plus ensemble learning
- Split the majority class into multiple groups, each about three times the size of the minority class, and pair every group with all of the minority samples to form one training set; train one model per group, then ensemble the models by taking the mean (a compact sketch follows this list)
- Advantage: the minority samples are reused many times and every group is large enough, so the models train well
- Disadvantage: more complex code and a long training time
- Results:
  - LGB score: 0.990894
  - XGB score: 0.991045
  - CAT score: 0.98988
  - Mean fusion of the three models: 0.991641
- Auxiliary run: change the positive/negative multiple used for the recall rescaling and repeat
  - LGB score: 0.990894
  - XGB score: 0.991045
  - CAT score: 0.98988
  - Mean fusion of the three models: 0.991516
- Fusing the main and auxiliary models
  - Weighted fusion: 0.991756 (fusion experience: giving the higher-scoring model a somewhat smaller weight often works better)
  - 0.6 : 0.4 → 0.99115
  - 0.4 : 0.6 → 0.991751
  - 0.3 : 0.7 → 0.991738
  - 0.405 : 0.595 → 0.991756
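A compact sketch (with hypothetical arrays) of the undersampling ensemble described above: the majority class is split into K groups, each group is paired with all of the minority samples, one model is trained per group, and the K predicted probabilities are averaged:

```python
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X_pos, y_pos = rng.random((150, 5)), np.ones(150)        # minority class (defaults)
X_neg, y_neg = rng.random((3000, 5)), np.zeros(3000)     # majority class
X_test = rng.random((500, 5))

K = 20
neg_groups = np.array_split(rng.permutation(len(X_neg)), K)
pred = np.zeros(len(X_test))
for idx in neg_groups:
    X = np.vstack([X_pos, X_neg[idx]])                   # all positives + one group of negatives
    y = np.concatenate([y_pos, y_neg[idx]])
    model = LGBMClassifier(n_estimators=200).fit(X, y)
    pred += model.predict_proba(X_test)[:, 1] / K        # average over the K group models
```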
5.4 Reusing the held-out validation split for training
- LGB score: 0.991451; after shuffling the data: 0.991502
- XGB score: 0.991832; after shuffling the data: 0.990766
- CAT score: 0.991325; after shuffling the data: 0.991635
- Fusion of the three models (LGB, XGB, shuffled CAT): 0.992045 (fusion experience: giving the higher-scoring model a somewhat smaller weight often works better)
  - Equal weights: 0.992034
  - 0.2 : 0.5 : 0.3 → 0.992023
  - 0.3 : 0.4 : 0.3 → 0.991996
  - 0.4 : 0.2 : 0.4 → 0.992045
  - 0.43 : 0.15 : 0.42 → 0.992041
5.5 Adding the NN model
- Work out the size of every layer by hand
- NN score before shuffling the data: 0.977279 (only the DataLoader shuffles automatically)
- NN score after an explicit shuffle: 0.979368
- Fusing the two NNs with the 0.992045 model: 0.993535 (weights 0.45 : 0.45 : 0.1) (fusion experience: giving the higher-scoring model a somewhat smaller weight often works better)
- Different weightings move the score up or down by about 0.0005
5.6 Other takeaways
- Details matter
- Hyperparameters matter a lot, but don't get lost in endless tuning: the best parameters and nearby ones perform almost the same
- Find teammates: it broadens your thinking and is also better for model fusion
- The mysterious fusion rule of thumb: giving the higher-scoring model a somewhat smaller weight often works better