Kaggle-Titanic
2022-07-07 22:17:00 【r1ch4rd】
I have recently been studying Kaggle and worked through the Titanic project; this post records the process.
Kaggle link
Environment: Anaconda, Python 2.7
GitHub source link
A look at the dataset shows this is a binary classification problem, so a logistic regression model is the natural first thought.
Data visualization
First, load the dataset and inspect it visually to find out which features are usable.
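The plotting code is not included in the post; as a minimal sketch of such a check (the train.csv path and the choice of Sex and Pclass are my assumptions):
import pandas as pd
import matplotlib.pyplot as plt
# hypothetical sketch: survival rate by two candidate features
train_df = pd.read_csv('train.csv')
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
train_df.groupby('Sex')['Survived'].mean().plot(kind='bar', ax=axes[0], title='Survival rate by Sex')
train_df.groupby('Pclass')['Survived'].mean().plot(kind='bar', ax=axes[1], title='Survival rate by Pclass')
plt.tight_layout()
plt.show()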
Feature Engineering (very important!)
1、 Preprocessing
Much of the data in the dataset is incomplete, so we need various methods to complete it.
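To decide which strategy each feature needs, a quick missing-value count helps (a minimal sketch, assuming the combined_train_test frame used throughout the post):
# hypothetical sketch: count missing values per column
print(combined_train_test.isnull().sum().sort_values(ascending=False))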
① Mode method
Features with only a few missing values can be filled in with the mode.
For example, Embarked:
#1)Embarked
combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0], inplace=True)
② Random forest prediction
When a feature has many missing values, we can predict them from other features and use the predictions as fill values.
For example, Age:
##6)Age
### Random forest prediction
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
missing_age_df = pd.DataFrame(combined_train_test[['Age', 'Embarked', 'Sex', 'Title', 'Name_length', 'Fare', 'Fare_bin_id', 'Pclass']])
missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]
#missing_age_test.info()
def fill_missing_age(missing_age_train, missing_age_test):
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)
    # model 1: gradient boosting
    gbm_reg = GradientBoostingRegressor(random_state=42)
    gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [4], 'learning_rate': [0.01], 'max_features': [3]}
    gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
    print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
    print('GB Train Error for "Age" Feature Regressor:' + str(gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_GB'][:4])
    # model 2: random forest
    rf_reg = RandomForestRegressor()
    rf_reg_param_grid = {'n_estimators': [200], 'max_depth': [5], 'random_state': [0]}
    rf_reg_grid = model_selection.GridSearchCV(rf_reg, rf_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    rf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best RF Params:' + str(rf_reg_grid.best_params_))
    print('Age feature Best RF Score:' + str(rf_reg_grid.best_score_))
    print('RF Train Error for "Age" Feature Regressor:' + str(rf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_RF'] = rf_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_RF'][:4])
    # merge the two models: average the predictions row by row
    print('shape1', missing_age_test['Age'].shape, missing_age_test[['Age_GB', 'Age_RF']].mode(axis=1).shape)
    # missing_age_test['Age'] = missing_age_test[['Age_GB', 'Age_LR']].mode(axis=1)
    # axis=0 is required here; without it np.mean collapses both columns to one scalar
    missing_age_test.loc[:, 'Age'] = np.mean([missing_age_test['Age_GB'], missing_age_test['Age_RF']], axis=0)
    print(missing_age_test['Age'][:4])
    missing_age_test.drop(['Age_GB', 'Age_RF'], axis=1, inplace=True)
    return missing_age_test
# assign only the filled 'Age' column back to the combined frame
combined_train_test.loc[combined_train_test['Age'].isnull(), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)['Age']
③ Mean method
For features with only one or two missing values, fill them in with the mean:
combined_train_test['Fare'] = combined_train_test[['Fare']].fillna(combined_train_test.groupby('Pclass').transform(np.mean))
2、 Numeric type conversion
Because sklearn requires numeric inputs, non-numeric types must be converted:
①dummy
For categorical variables such as Embarked, which takes only the three values S, C, and Q:
emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'], prefix=combined_train_test[['Embarked']].columns[0])
②factorize
dummy does not handle features like Cabin that take many distinct values well. factorize() instead creates numeric codes, mapping each category to an ID; this mapping ultimately produces only a single feature, rather than one feature per category as dummy does. (To be improved)
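The post leaves this step unfinished; as a minimal sketch of the idea (applying pd.factorize directly to the raw Cabin column is my assumption, not the author's final treatment):
# hypothetical sketch: map each Cabin category to a single numeric ID
combined_train_test['Cabin_id'] = pd.factorize(combined_train_test['Cabin'])[0]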
③scaling
Scaling is a mapping that compresses large values into a smaller range, such as (-1, 1).
The range of Age is much larger than that of the other attributes, which would give Age a larger weight, so we need scaling (feature scaling). Note that the StandardScaler used below standardizes to zero mean and unit variance rather than a fixed interval:
from sklearn import preprocessing
assert np.size(df['Age']) == 891  # sanity check: the training set has 891 rows
scaler = preprocessing.StandardScaler()
df['Age_scaled'] = scaler.fit_transform(df['Age'].values.reshape(-1, 1))
④Binning
The Fare attribute can be handled with scaling as above, or with binning, a method that discretizes continuous data by dividing the values into preset ranges (buckets). pd.qcut splits on quantiles, so each bucket receives roughly the same number of samples:
combined_train_test['Fare_bin'] = pd.qcut(combined_train_test['Fare'], 5)
Of course, after the data is binned it must still be factorized or dummied:
combined_train_test['Fare_bin_id'] = pd.factorize(combined_train_test['Fare_bin'])[0]
fare_bin_dummies_df = pd.get_dummies(combined_train_test['Fare_bin_id']).rename(columns=lambda x: 'Fare_' + str(x))
combined_train_test = pd.concat([combined_train_test, fare_bin_dummies_df], axis=1)
combined_train_test.drop(['Fare_bin'], axis=1, inplace=True)
3、 Discard useless features
Drop the raw attributes that have already been processed into new features, the intermediate labels produced along the way, and any labels useless to the model.
After correlation analysis and cross-validation, useful features can be added back.
# Discard useless features
combined_data_backup = combined_train_test
combined_train_test.drop(['PassengerId', 'Embarked', 'Sex', 'Name', 'Title', 'Fare_bin_id', 'Pclass_Fare_Category',
                          'Parch', 'SibSp', 'Ticket', 'Family_Size_Category'], axis=1, inplace=True)
Build a model
Build a simple logistic regression model:
from sklearn import linear_model
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6)
clf.fit(titanic_train_data_X.values, titanic_train_data_Y.values)
#print clf
predictions = clf.predict(titanic_test_data_X)
result = pd.DataFrame({'PassengerId': test_df_org['PassengerId'].values, 'Survived': predictions.astype(np.int32)})
result.to_csv("~/Documents/data/base_line_predictions.csv", index=False)
Cross validation
Cross-validate with the original dataset. (To be improved)
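Since the post leaves this step as a stub, here is a minimal sketch of what it might look like, reusing titanic_train_data_X and titanic_train_data_Y from above; the 5-fold split and accuracy metric are my assumptions:
from sklearn import linear_model, model_selection
# hypothetical sketch: score the baseline model with 5-fold cross-validation
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6)
scores = model_selection.cross_val_score(clf, titanic_train_data_X.values, titanic_train_data_Y.values, cv=5, scoring='accuracy')
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))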
Correlation analysis
To be filled in later.
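As a placeholder sketch of what this analysis could look like (assuming the training rows of combined_train_test still carry the Survived target; this is my assumption, not code from the post):
# hypothetical sketch: rank numeric features by correlation with the target
train_part = combined_train_test[combined_train_test['Survived'].notnull()]
correlations = train_part.corr()['Survived'].drop('Survived')
print(correlations.abs().sort_values(ascending=False))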
Model fusion
Model fusion with the bagging method:
from sklearn.ensemble import BaggingRegressor
# wrap the logistic regression base model in BaggingRegressor and fit
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6)
bagging_clf = BaggingRegressor(clf, n_estimators=20, max_samples=0.8, max_features=1.0, bootstrap=True,
                               bootstrap_features=False, n_jobs=-1)
bagging_clf.fit(titanic_train_data_X.values, titanic_train_data_Y.values)
predictions = bagging_clf.predict(titanic_test_data_X)
# note: astype(np.int32) truncates the averaged votes, so Survived becomes 1 only
# when every base estimator predicts 1; np.rint(predictions) would give majority voting
result = pd.DataFrame({'PassengerId': test_df_org['PassengerId'].values, 'Survived': predictions.astype(np.int32)})
result.to_csv("~/Documents/data/base_bagging_predictions.csv", index=False)