Machine Learning Series, Chapter 6: The Random Forest Model
2022-06-11 06:02:00 【Forward ing】
- Machine Learning Series, Chapter 6: The Random Forest Model
- Preface
- I. The principle and code implementation of the random forest model
- II. Case study: a stock price movement prediction model
- Summary
Preface
This chapter introduces ensemble learning models. Ensemble learning follows a simple idea: combine multiple models to produce a more powerful one. This chapter explains a very typical ensemble learning model, the random forest model, and consolidates the knowledge through an application in the financial field: a stock price movement prediction model.
Tips: the following is the main body of this article; the cases below can be used as reference.
I. The principle and code implementation of the random forest model
1. Introduction to ensemble models
An ensemble learning model uses a series of weak learners (also called base models or individual learners) and combines their results to obtain a better learning effect than any single learner. The common algorithms of ensemble learning are the Bagging algorithm and the Boosting algorithm. A typical machine learning model based on the Bagging algorithm is the random forest model discussed in this chapter, while typical models based on the Boosting algorithm are AdaBoost, GBDT, XGBoost and LightGBM, which will be discussed in later chapters.
- (1) The Bagging algorithm
The principle of the Bagging algorithm is similar to voting: every weak learner has one vote, and the final prediction is produced from the votes of all weak learners according to the principle of “the minority is subordinate to the majority”. Specifically, in classification problems the final prediction is decided by the votes of the n weak learners, while in regression problems the average of the n weak learners is taken as the final result.
- (2) The Boosting algorithm
The essence of the Boosting algorithm is to promote weak learners into a strong learner. Its difference from the Bagging algorithm is that Bagging treats all weak learners equally, while Boosting treats them “differentially”; roughly speaking, it “cultivates the elite” and “pays attention to mistakes”.
“Cultivating the elite” means that after each round of training, the weak learners with more accurate predictions are given larger weights, while the weights of poorly performing weak learners are reduced. In the final prediction, the “excellent models” therefore carry large weights, as if they could cast multiple votes, while a “mediocre model” casts only one vote or none.
“Paying attention to mistakes” means changing the weights or probability distribution of the training set after each round of training: the weights of the samples mispredicted by the previous round's weak learner are increased, and the weights of the correctly predicted samples are decreased. This raises the weak learners' attention to the mispredicted data and thus improves the overall prediction effect of the model. A minimal sketch contrasting the two approaches follows this list.
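The contrast can be illustrated with scikit-learn. This is only a minimal sketch under assumed settings: the synthetic data set, n_estimators=10 and the decision-tree base learner are arbitrary choices for illustration, not part of the original example.
# Minimal sketch: Bagging gives every tree an equal vote, Boosting reweights samples and learners
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=200, n_features=5, random_state=123)
# Bagging: each tree is trained on a bootstrap sample and gets one equal vote
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=123)
bagging.fit(X, y)
# Boosting (AdaBoost): mispredicted samples get larger weights in the next round,
# and more accurate learners get larger weights in the final vote
boosting = AdaBoostClassifier(n_estimators=10, random_state=123)
boosting.fit(X, y)
print(bagging.predict(X[:3]), boosting.predict(X[:3]))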
2. The basic principle of the random forest model
Random forest (Random Forest) is a classic Bagging model whose weak learners are decision tree models. The random forest randomly samples the original data set to form n different sample data sets, builds n different decision tree models on these data sets, and finally obtains the result from the average of these decision tree models (for regression models) or from their vote (for classification models). To ensure the generalization ability of the model, the random forest follows two basic principles when building each tree: “data randomness” and “feature randomness”.
- Data randomness: the training data of each decision tree model is selected at random from all of the data. For example, with 1000 original records, sampling with replacement 1000 times forms a new data set used to train one decision tree model.
- Feature randomness: if the feature dimension of each sample is M, a constant k < M is specified and k features are randomly selected from the M features. When constructing a random forest classification model in Python, the default number of selected features k is roughly √M. A small sketch illustrating both kinds of randomness follows this list.
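The following is a minimal numpy sketch of the two kinds of randomness; the sizes (1000 records, M = 16 features) are just the illustrative numbers from above, and scikit-learn performs the equivalent sampling internally (feature sampling is actually applied at every split via the max_features parameter).
# Minimal numpy sketch of the two randomness principles (illustrative sizes only)
import numpy as np
rng = np.random.default_rng(123)
n_samples, M = 1000, 16            # 1000 original records, M features each
k = int(np.sqrt(M))                # roughly sqrt(M) features considered
# Data randomness: draw 1000 row indices with replacement to train one tree
bootstrap_rows = rng.integers(0, n_samples, size=n_samples)
# Feature randomness: pick k of the M features at random (without replacement)
feature_subset = rng.choice(M, size=k, replace=False)
print(bootstrap_rows[:10], feature_subset)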
3. Random forest model code implementation
# Random forest classification model; the weak learners are classification decision trees
from sklearn.ensemble import RandomForestClassifier
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [0, 0, 0, 1, 1]
model = RandomForestClassifier(n_estimators=10, random_state=123)
model.fit(X, y)
print(model.predict([[5, 5]]))
# Random forest regression model; the weak learners are regression decision trees
from sklearn.ensemble import RandomForestRegressor
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [1, 2, 3, 4, 5]
model = RandomForestRegressor(n_estimators=10, random_state=123)
model.fit(X, y)
print(model.predict([[5, 5]]))
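As a follow-up to the classification example above, predict_proba() exposes the voting behind the prediction: with 10 trees, the returned probabilities are roughly the fraction of trees voting for each class. A minimal, self-contained sketch reusing the same toy data:
# Sketch: inspect the per-class vote shares of the random forest classifier above
from sklearn.ensemble import RandomForestClassifier
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [0, 0, 0, 1, 1]
clf = RandomForestClassifier(n_estimators=10, random_state=123)
clf.fit(X, y)
print(clf.predict_proba([[5, 5]]))  # one probability column per class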
II. Case study: a stock price movement prediction model
1. Stock basic data acquisition
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import tushare as ts
# 1. Get daily market data
df = ts.get_hist_data("000002",start='2018-01-01',end="2019-01-31")
# print(df.head())
# 2. Get minute level data
df = ts.get_hist_data("000002",ktype='5')
# print(df.head())
# 3. Get real-time market data
df = ts.get_realtime_quotes("000002")
# print(df.head())
df = df[["code","name","price","bid","ask","volume","amount","time"]]
# print(df)
# 4. Obtain tick (transaction-by-transaction) data
# Get historical tick data, i.e., the record of each individual transaction
df = ts.get_tick_data("000002",date="2018-12-12",src="tt")
# print(df)
# Get today's tick data
df = ts.get_today_ticks("000002")
# print(df)
# Get index data
df = ts.get_index()
# print(df.head())
2. Generating derived stock variables
# 1. Get basic stock data
df = ts.get_k_data("000002",start='2015-01-01',end="2019-12-31")
df = df.set_index('date')
# print(df.head())
# 2. Generate simple derived variables
df["close-open"] = (df["close"] - df["open"]) / df["open"]
df["high-low"] = (df["high"] - df["low"]) / df["low"]
df["pre_close"] = df["close"].shift(1)
df["price_change"] = df["close"] - df["pre_close"]
df["p_change"] = (df["close"]-df["pre_close"]) / df["pre_close"]*100
# print(df.head())
# 3. Generate the moving average (MA) indicators
df["MA5"]=df["close"].rolling(5).mean()
df["MA10"]=df["close"].rolling(10).mean()
df.dropna(inplace=True) # Drop rows with null values; equivalently, df = df.dropna()
df["MA5"] = df["close"].sort_index().rolling(5).mean()
# print(df.head())
# 4. TA-Lib, a library for generating derived stock variables (must be installed before importing)
import talib
# 5. Use the TA-Lib library to generate the relative strength index (RSI)
df["RSI"] = talib.RSI(df["close"],timeperiod=12)
# The RSI value reflects the strength of price rises relative to declines over the short term, which helps to judge the rising or falling trend of the stock price. The larger the RSI value, the stronger the rises relative to the declines; the smaller the value, the weaker the rises relative to the declines.
# 6. Use the TA-Lib library to generate the momentum indicator (MOM)
df["MoM"] = talib.MOM(df["close"],timeperiod=5)
# MOM reflects the speed of the stock price's rise and fall over a period of time
# 7. Use the TA-Lib library to generate exponential moving averages (EMA)
df["EMA12"] = talib.EMA(df["close"],timeperiod=12) # 12-day exponential moving average
df["EMA26"] = talib.EMA(df["close"],timeperiod=26) # 26-day exponential moving average
# EMA is similar to the moving average (MA) value, but its formula is more complicated; EMA is a trend indicator.
# 8. Use the TA-Lib library to generate the MACD indicator
df["MACD"],df["MACDsignal"],df["MACDhist"] = talib.MACD(df["close"],fastperiod=6,slowperiod=12,signalperiod=9)
# MACD is a common indicator in the stock market; it is a derived variable based on EMA values.
# After generating all the derived variables and dropping null values, print tail() to view the last 5 rows of the table.
print(df.tail())
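As a rough cross-check of the TA-Lib values, pandas can compute an exponentially weighted mean directly. This is only an approximate sketch assuming df is the DataFrame built above; TA-Lib seeds the first EMA with a simple average, so the earliest rows can differ slightly from pandas' ewm.
# Rough cross-check of TA-Lib's EMA and MACD line using pandas' ewm (assumes df from above)
import pandas as pd
ema12_pd = df["close"].ewm(span=12, adjust=False).mean()
# The MACD line above is the fast EMA minus the slow EMA (6- and 12-period in that call)
dif_pd = df["close"].ewm(span=6, adjust=False).mean() - ema12_pd
print(pd.concat([df["EMA12"], ema12_pd.rename("EMA12_pandas"),
                 df["MACD"], dif_pd.rename("MACD_pandas")], axis=1).tail())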
3. Multi-factor model construction
With the right data in hand, the model can be built. The model in this case is built from multiple features, which in the field of quantitative finance is called a multi-factor model. Stock data is time-series data, so some of the related data processing differs slightly from the models discussed before. We first do some simple data processing and then build the model.
1. Import the required libraries
import tushare as ts # Library for basic stock data
import numpy as np # Scientific computing library
import pandas as pd # Data analysis library
import talib # Library for derived stock variables (technical indicators)
import matplotlib.pyplot as plt # Plotting library
from sklearn.ensemble import RandomForestClassifier # Random forest classification model
from sklearn.metrics import accuracy_score # Prediction accuracy scoring function
2. Get the data
# 1. Stock basic data acquisition
df = ts.get_k_data("000002",start='2015-01-01',end='2020-11-14')
df = df.set_index("date")
# 2. Simple derived variable data construction
df["close-open"] = (df["close"] - df["open"]) / df["open"]
df["high-low"] = (df["high"] - df["low"]) / df["low"]
df["pre_close"] = df["close"].shift(1)
df["price_change"] = df["close"] - df["pre_close"]
df["p_change"] = (df["close"]-df["pre_close"]) / df["pre_close"]*100
# 3. Moving average related data construction
df["MA5"] = df["close"].rolling(5).mean()
df["MA10"] = df["close"].rolling(10).mean()
df.dropna(inplace=True)
# 4. Construct derived variables with the TA-Lib library
df["RSI"] = talib.RSI(df["close"],timeperiod=12)
df["MoM"] = talib.MOM(df["close"],timeperiod=5)
df["EMA12"] = talib.EMA(df["close"],timeperiod=12) # 12 Moving average of daily index
df["EMA26"] = talib.EMA(df["close"],timeperiod=26) # 26 Moving average of daily index
df["MACD"],df["MACDsignal"],df["MACDhist"] = talib.MACD(df["close"],fastperiod=6,slowperiod=12,signalperiod=9)
df.dropna(inplace=True)
3. Extract feature variables and target variables
X = df[["close","volume","close-open","MA5","MA10","high-low","RSI","MoM","EMA12","MACD","MACDsignal","MACDhist"]]
y = np.where(df["price_change"].shift(-1)>0,1,-1)
4. Divide the training set and the test set
# Split chronologically (no shuffling), since stock data is time-series data
X_length = X.shape[0]
split = int(X_length * 0.5)
X_train,X_test = X[:split],X[split:]
y_train,y_test = y[:split],y[split:]
5. Build a model
model = RandomForestClassifier(max_depth=3,n_estimators=10,min_samples_leaf=10,random_state=1)
model.fit(X_train,y_train)
4. Model use and evaluation
1. Predict the next day's rise or fall of the stock price
y_pred = model.predict(X_test)
a = pd.DataFrame()
a["prediction"] = list(y_pred)
a["actual"] = list(y_test)
# predict_proba() returns the predicted probability of each class
y_pred_proba = model.predict_proba(X_test)
2. Model accuracy evaluation
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test,y_pred)  # fraction of correctly predicted up/down moves
model.score(X_test,y_test)  # equivalent: accuracy computed directly from the model
3. Analyze the importance of the feature variables
features = X.columns
importances = model.feature_importances_
a = pd.DataFrame()
a["feature"] = features
a["importance"] = importances
a = a.sort_values("importance",ascending=False)
print(a)
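To make the ranking easier to read, the importances can also be plotted with the matplotlib module imported earlier; this is a minimal sketch assuming the a DataFrame built above.
# Sketch: visualize the feature importances computed above as a horizontal bar chart
a.plot.barh(x="feature", y="importance", legend=False)
plt.tight_layout()
plt.show()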
5. Plot the backtest curve
Here we mainly use the cumulative product function cumprod().
X_test["prediction"] = model.predict(X_test)
X_test["p_change"] = (X_test["close"] - X_test["close"].shift(1)) / X_test["close"].shift(1)
X_test["origin"] = (X_test["p_change"]+1).cumprod()
X_test["strategy"] = (X_test["prediction"].shift(1) * X_test["p_change"]+1).cumprod()
X_test[["strategy","origin"]].dropna().plot()
plt.gcf().autofmt_xdate()
plt.show()
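As a quick illustration of what cumprod() does here: each daily return r becomes a growth factor (1 + r), and the cumulative product of those factors tracks the value of one unit of capital over time. A toy example:
# Toy illustration of cumprod(): three daily returns of +1%, -2%, +3%
import pandas as pd
daily_returns = pd.Series([0.01, -0.02, 0.03])
print((daily_returns + 1).cumprod())  # 1.0100, 0.9898, 1.019494 -> about +1.95% overall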
Summary
Reference book: 《Python Big Data Analysis and Machine Learning Business Case Practice》