Kaggle-Titanic
2022-07-07 22:17:00 【r1ch4rd】
I recently started studying Kaggle and worked through the Titanic project. This post records what I did.
Kaggle link
Environment: Anaconda, Python 2.7
GitHub source link
Looking at the data set, this is a binary classification problem, so logistic regression is a natural first model to try.
Data visualization
After getting the data set, inspect it visually first to find out which features are usable.
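As a rough illustration of this inspection step, the sketch below computes the survival rate per group, which is the kind of summary a bar chart would show. The toy rows are made up for the example and are not taken from the Titanic data; only the column names Sex and Survived follow the data set.

```python
import pandas as pd

# Hypothetical toy rows; the real data comes from Kaggle's train.csv.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Survived': [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column per group is the survival rate of that group.
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
print(rate_by_sex)
```

A feature whose groups show clearly different rates is a candidate for the model. Calling `rate_by_sex.plot(kind='bar')` would draw the same numbers if matplotlib is installed.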
Feature engineering (very important!)
1、 Preprocessing
Many values in the data set are missing, so we need several methods to fill them in.
① Mode
Features with only a few missing values can be filled with the mode, for example Embarked:
# 1) Embarked
# Assign the result back instead of calling fillna(inplace=True) on a column selection
combined_train_test['Embarked'] = combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0])
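To make the mode fill concrete, here is a minimal sketch on a toy column. The values are invented for the example; only the column name Embarked comes from the data set.

```python
import pandas as pd

# Hypothetical stand-in for the Embarked column, with two missing entries.
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', None, 'Q', 'S', None]})

# mode() returns a Series because ties are possible; take the first entry.
most_common = toy['Embarked'].mode().iloc[0]
toy['Embarked'] = toy['Embarked'].fillna(most_common)

print(most_common)
print(toy['Embarked'].tolist())
```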
② Random forest prediction
When a feature has many missing values, we can predict it from the other features and use the predictions as the fill.
For example Age, where the code below averages the predictions of a gradient boosting regressor and a random forest regressor:
## 6) Age
### Predict the missing ages with GBM and random forest regressors
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

# Embarked, Sex and Title must already be numerically encoded at this point
missing_age_df = pd.DataFrame(combined_train_test[['Age', 'Embarked', 'Sex', 'Title', 'Name_length', 'Fare', 'Fare_bin_id', 'Pclass']])
missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
# copy() so that the new columns below are not written to a view of missing_age_df
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()].copy()
# missing_age_test.info()

def fill_missing_age(missing_age_train, missing_age_test):
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)

    # model 1: gradient boosting
    gbm_reg = GradientBoostingRegressor(random_state=42)
    gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [4], 'learning_rate': [0.01], 'max_features': [3]}
    gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1,
                                                scoring='neg_mean_squared_error')
    gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
    print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
    print('GB Train Error for "Age" Feature Regressor:' + str(
        gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_GB'][:4])

    # model 2: random forest
    rf_reg = RandomForestRegressor()
    rf_reg_param_grid = {'n_estimators': [200], 'max_depth': [5], 'random_state': [0]}
    rf_reg_grid = model_selection.GridSearchCV(rf_reg, rf_reg_param_grid, cv=10, n_jobs=25, verbose=1,
                                               scoring='neg_mean_squared_error')
    rf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best RF Params:' + str(rf_reg_grid.best_params_))
    print('Age feature Best RF Score:' + str(rf_reg_grid.best_score_))
    print('RF Train Error for "Age" Feature Regressor' + str(
        rf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_RF'] = rf_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_RF'][:4])

    # merge the two models: row-wise mean of the two predictions
    # (np.mean over the list without an axis would collapse everything to a single scalar)
    print('shape1', missing_age_test['Age'].shape, missing_age_test[['Age_GB', 'Age_RF']].mode(axis=1).shape)
    missing_age_test.loc[:, 'Age'] = missing_age_test[['Age_GB', 'Age_RF']].mean(axis=1)
    print(missing_age_test['Age'][:4])
    missing_age_test.drop(['Age_GB', 'Age_RF'], axis=1, inplace=True)
    return missing_age_test

combined_train_test.loc[(combined_train_test.Age.isnull()), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)
③ Mean
When only one or two values are missing, fill them with the mean. Here the missing Fare is filled with the mean fare of the passenger's Pclass:
combined_train_test['Fare'] = combined_train_test['Fare'].fillna(combined_train_test.groupby('Pclass')['Fare'].transform('mean'))
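The group-wise mean fill can be checked on a small example. This is a sketch on made-up numbers; only the column names Pclass and Fare come from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical fares with one missing value in each class.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Fare': [80.0, 100.0, np.nan, 8.0, 10.0, np.nan],
})

# transform('mean') returns, for every row, the mean Fare of that row's Pclass
# (NaN values are skipped when the mean is computed).
class_mean = toy.groupby('Pclass')['Fare'].transform('mean')
toy['Fare'] = toy['Fare'].fillna(class_mean)

print(toy['Fare'].tolist())
```

Filling per class rather than with the global mean keeps a first-class passenger from receiving a fare dominated by third-class tickets.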
2、 Numeric type conversion
sklearn requires all inputs to be numeric, so non-numeric features must be converted:
① dummy
For categorical variables such as Embarked, which takes only three values (S, C, Q):
emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'], prefix='Embarked')
②factorize
Dummy encoding does not handle features like Cabin well, because they take many distinct values. factorize() maps each category to an integer ID, so the mapping produces a single feature instead of the multiple columns that dummy encoding generates. (To be expanded.)
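The difference between the two encodings is easiest to see side by side. The sketch below uses a few invented cabin labels, not the real Cabin column.

```python
import pandas as pd

# Hypothetical cabin labels with repeats.
cabins = pd.Series(['C85', 'E46', 'C85', 'G6', 'E46'])

# factorize: one integer ID per distinct value, in order of first appearance.
codes, uniques = pd.factorize(cabins)
print(list(codes))
print(list(uniques))

# get_dummies: one indicator column per distinct value.
dummies = pd.get_dummies(cabins, prefix='Cabin')
print(dummies.shape)
```

With three distinct values the dummy frame has three columns, while factorize still yields a single column. Note that `pd.factorize` assigns -1 to missing values by default, which matters for a sparsely populated feature like Cabin.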
③ Scaling
Scaling maps values from a large range into a smaller one, such as (-1, 1).
Age has a much larger range than the other features, which would give it disproportionate weight in the model, so it needs feature scaling. The StandardScaler used here standardizes Age to zero mean and unit variance rather than to a fixed interval:
from sklearn import preprocessing
assert np.size(df['Age']) == 891
scaler = preprocessing.StandardScaler()
df['Age_scaled'] = scaler.fit_transform(df['Age'].values.reshape(-1, 1))
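For comparison, the sketch below applies both standardization and a min-max mapping to (-1, 1) on a handful of invented ages. MinMaxScaler is not used in the original code; it is shown only as one way to get the fixed range mentioned above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical ages; sklearn scalers expect a 2-D array of shape (n_samples, n_features).
ages = np.array([2.0, 22.0, 38.0, 54.0, 80.0]).reshape(-1, 1)

# Standardization: subtract the mean, divide by the (population) standard deviation.
standardized = StandardScaler().fit_transform(ages)

# Min-max scaling: map the smallest value to -1 and the largest to 1.
ranged = MinMaxScaler(feature_range=(-1, 1)).fit_transform(ages)

print(standardized.ravel())
print(ranged.ravel())
```

In a real pipeline the scaler should be fitted on the training rows only and then applied to the test rows with `transform`, so that test statistics do not leak into training.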
④Binning
Fare can be scaled as above, or it can be binned. Binning discretizes a continuous feature by assigning each value to one of a set of ranges (buckets).
combined_train_test['Fare_bin'] = pd.qcut(combined_train_test['Fare'], 5)
After binning, the bins must of course be factorized or dummy-encoded.
combined_train_test['Fare_bin_id'] = pd.factorize(combined_train_test['Fare_bin'])[0]
fare_bin_dummies_df = pd.get_dummies(combined_train_test['Fare_bin_id']).rename(columns = lambda x: 'Fare_' + str(x))
combined_train_test = pd.concat([combined_train_test,fare_bin_dummies_df], axis=1)
combined_train_test.drop(['Fare_bin'], axis=1, inplace=True)
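The binning steps can be traced on a short list of fares. The numbers below are chosen for the example so that each of the five quantile bins receives two values.

```python
import pandas as pd

# Hypothetical fares, already sorted to make the bins easy to read.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 53.1, 71.28, 83.47, 151.55, 512.33])

# qcut splits at quantiles, so each bin holds roughly the same number of rows.
fare_bin = pd.qcut(fares, 5)

# factorize turns the interval labels into integer bin IDs.
fare_bin_id = pd.factorize(fare_bin)[0]

print(fare_bin.value_counts().tolist())
print(list(fare_bin_id))
```

Because `qcut` uses quantiles, the extreme fare of 512.33 lands in the top bin without stretching the others, which is the main advantage over equal-width bins for a skewed feature. Passing `labels=False` to `pd.qcut` would return integer bin IDs directly.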
3、 Discard useless features
Drop the raw features that have already been processed, the intermediate features created along the way, and the features that do not help the model.
After correlation analysis and cross-validation, add the useful features back.
# Discard useless features
# copy() is needed: a plain assignment would alias the frame, and the in-place drop would change the backup too
combined_data_backup = combined_train_test.copy()
combined_train_test.drop(['PassengerId', 'Embarked', 'Sex', 'Name', 'Title', 'Fare_bin_id', 'Pclass_Fare_Category',
                          'Parch', 'SibSp', 'Ticket', 'Family_Size_Category'], axis=1, inplace=True)
Build a model
Build a simple logistic regression model:
from sklearn import linear_model
# The L1 penalty needs a solver that supports it; liblinear does
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)
clf.fit(titanic_train_data_X.values, titanic_train_data_Y.values)
# print(clf)
predictions = clf.predict(titanic_test_data_X.values)
result = pd.DataFrame({
    'PassengerId': test_df_org['PassengerId'].values, 'Survived': predictions.astype(np.int32)})
result.to_csv("~/Documents/data/base_line_predictions.csv", index=False)
Cross validation
Use the original data set for cross-validation (to be expanded).
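Since this part is not written out, here is one possible sketch of k-fold cross-validation with the same logistic regression settings. It runs on synthetic data from `make_classification` as a stand-in for the processed Titanic features, so the scores say nothing about the real task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (an assumption, not the Titanic set).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)

# 5-fold CV: fit on 4 folds, score accuracy on the held-out fold, repeat 5 times.
scores = cross_val_score(clf, X, y, cv=5)
print(scores)
print(scores.mean())
```

On the real data, `X` and `y` would be replaced by the training features and the Survived column. The spread of the five scores gives a rough idea of how stable the model is.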
Correlation analysis
To be filled in.
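As a placeholder for this section, the sketch below shows one common way to start: the Pearson correlation of each numeric feature with the target. The rows and the `Sex_female` column are invented for the example.

```python
import pandas as pd

# Hypothetical, already numeric features plus the target.
toy = pd.DataFrame({
    'Survived':   [1, 0, 1, 0, 1, 0],
    'Sex_female': [1, 0, 1, 0, 1, 1],
    'Pclass':     [1, 3, 1, 3, 2, 3],
})

# corr() gives the pairwise Pearson correlation matrix; keep the target column only.
corr_with_target = toy.corr()['Survived'].drop('Survived').sort_values()
print(corr_with_target)
```

Features with a correlation near zero are candidates for removal, although a linear correlation can miss non-linear relationships, so it is better used as a hint than as a filter.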
Model fusion
Model fusion with bagging:
from sklearn.ensemble import BaggingClassifier
# Wrap the logistic regression in a bagging ensemble.
# BaggingClassifier returns class labels directly; a BaggingRegressor would average the 0/1
# predictions, and astype(np.int32) would then truncate a value such as 0.95 to 0.
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)
bagging_clf = BaggingClassifier(clf, n_estimators=20, max_samples=0.8, max_features=1.0, bootstrap=True,
                                bootstrap_features=False, n_jobs=-1)
bagging_clf.fit(titanic_train_data_X.values, titanic_train_data_Y.values)
predictions = bagging_clf.predict(titanic_test_data_X.values)
result = pd.DataFrame({
    'PassengerId': test_df_org['PassengerId'].values, 'Survived': predictions.astype(np.int32)})
result.to_csv("~/Documents/data/base_bagging_predictions.csv", index=False)
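A self-contained version of the bagging step, for trying out the idea without the Titanic files, might look like the sketch below. It uses synthetic data and `BaggingClassifier` rather than `BaggingRegressor`, since the target is a class label; both choices are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the processed train and test features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

base = LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)

# 20 logistic regressions, each fitted on a bootstrap sample of 80% of the rows.
bagging = BaggingClassifier(base, n_estimators=20, max_samples=0.8, max_features=1.0,
                            bootstrap=True, bootstrap_features=False, random_state=0)
bagging.fit(X_train, y_train)
predictions = bagging.predict(X_test)

print(predictions[:10])
```

Bagging mostly reduces variance, so the gain over a single logistic regression is often small; it tends to help more with unstable base models such as deep decision trees.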