Data preprocessing - Data Mining 1

2022-07-03 10:24:00 【Why】

Fill in the missing values, which are marked with "?" in the dataset.
Use mean imputation: replace the missing values in each column with the mean of that column.

# Import pandas to work with the spreadsheet data
import pandas as pd
# Read the dataset; missing entries are marked with "?" (half- or full-width), so parse them as NaN
df = pd.read_excel(' Homework 1_ Data preprocessing dataset .xls', na_values=['?', '？'])
# Fill each column's missing values with the mean of that column
df.fillna(value=df.mean(numeric_only=True), inplace=True)
# Export the filled dataset to another Excel file
df.to_excel(" Homework 1_ Missing values filled .xlsx", index=False)
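
The same mean-imputation step can be tried without the homework file. The sketch below is only an illustration: the toy table and its column names are hypothetical, and it assumes that every non-numeric entry (such as "?") should count as missing.

```python
import pandas as pd

# Hypothetical toy data; "?" marks a missing entry, as in the homework file.
raw = pd.DataFrame({
    "points": [100, "?", 300],
    "fare": ["?", 50.0, 70.0],
})

# Coerce every column to numbers; anything non-numeric (here "?") becomes NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")

# Mean imputation: each NaN is replaced by the mean of its own column.
filled = clean.fillna(clean.mean())

print(filled)
```

Note that `errors="coerce"` is broader than matching "?" alone, so it is worth checking which entries were turned into NaN before filling them.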

Compute the quartiles of each numeric attribute and draw a box plot.
Computing the quartiles of each numeric attribute:
Method 1:

# Read the dataset file with the missing values filled in
df = pd.read_excel(' Homework 1_ Missing values filled .xlsx')
# Compute the quartiles of each numeric attribute (Q1 = 0.25 quantile, Q3 = 0.75 quantile)
print("Total base points in the observation window")
print("Lower quartile (Q1): " + str(df.iloc[:,2].quantile(0.25)))
print("Upper quartile (Q3): " + str(df.iloc[:,2].quantile(0.75)))
print("\nTotal ticket price in the second year")
print("Lower quartile (Q1): " + str(df.iloc[:,3].quantile(0.25)))
print("Upper quartile (Q3): " + str(df.iloc[:,3].quantile(0.75)))
print("\nTotal flight kilometers in the observation window")
print("Lower quartile (Q1): " + str(df.iloc[:,4].quantile(0.25)))
print("Upper quartile (Q3): " + str(df.iloc[:,4].quantile(0.75)))
print("\nTotal weighted flight kilometers in the observation window (Σ cabin discount × leg distance)")
print("Lower quartile (Q1): " + str(df.iloc[:,5].quantile(0.25)))
print("Upper quartile (Q3): " + str(df.iloc[:,5].quantile(0.75)))
print("\nQuarterly average base points accumulated in the observation window")
print("Lower quartile (Q1): " + str(df.iloc[:,6].quantile(0.25)))
print("Upper quartile (Q3): " + str(df.iloc[:,6].quantile(0.75)))

Method 2: describe() reports the 25%, 50%, and 75% quantiles of every numeric column in one call.

df = pd.read_excel(' Homework 1_ Missing values filled .xlsx')
print(df.describe())
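
A middle ground between the two methods is to compute only the quartiles, for all numeric columns at once. The following is a minimal sketch under assumptions: `quartile_table` is a hypothetical helper name and the toy table stands in for the airline dataset.

```python
import pandas as pd

def quartile_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Return Q1, median, Q3 and IQR for every numeric column."""
    numeric = frame.select_dtypes(include="number")
    # quantile() with a list returns one row per quantile; transpose to one row per column.
    q = numeric.quantile([0.25, 0.5, 0.75]).T
    q.columns = ["Q1", "median", "Q3"]
    q["IQR"] = q["Q3"] - q["Q1"]
    return q

# Hypothetical toy data standing in for the airline dataset.
toy = pd.DataFrame({"points": [1, 2, 3, 4, 5], "km": [10, 20, 30, 40, 50]})
table = quartile_table(toy)
print(table)
```

Selecting columns by dtype avoids hard-coding positions such as `iloc[:,2]`, at the cost of also including any numeric ID column the file may contain.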

Box plot:

The box plot is drawn with matplotlib.pyplot, Python's plotting package.
Result analysis: the data are concentrated mainly in the interval [0, 25000], the distribution is uneven, and there are outliers.

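import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MultipleLocator
# Use the SimHei font so that Chinese text in the figures renders correctly
mpl.rcParams['font.sans-serif'] = ['SimHei']

df = pd.read_excel(' Homework 1_ Missing values filled .xlsx')
df.plot.box(title="Airline customer data")
plt.grid(linestyle="--", alpha=0.3)
plt.show()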
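
The outliers seen in a box plot can also be listed numerically. By default matplotlib draws the whiskers out to 1.5 × IQR beyond the quartiles and marks anything farther away as an outlier. The sketch below applies that rule to a single column; `iqr_outliers` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical sample: six ordinary values and one extreme value.
toy = pd.Series([10, 12, 11, 13, 12, 11, 300])
outliers = iqr_outliers(toy)
print(outliers)
```

Applied to a real column, for example `iqr_outliers(df.iloc[:,2])`, this returns the rows that the box plot would draw as individual points.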


Draw a histogram, a Q-Q plot, and a scatter plot for each numeric attribute.
Histogram:
Result analysis: the data are concentrated mainly between 0 and 50000, the distribution is uneven, and the attributes are distributed over broadly similar ranges.

df = pd.read_excel(' Homework 1_ Missing values filled .xlsx')
plt.hist(df.iloc[:,6], bins=[0,50000,100000,150000,250000])
# Label the horizontal axis
plt.xlabel("Value range")
# Label the vertical axis
plt.ylabel("Frequency")
# Set the figure title
plt.title("Histogram of quarterly average base points in the observation window")
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
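
A statement such as "mainly concentrated between 0 and 50000" can be backed with counts per bin. The sketch below uses `numpy.histogram` with the same bin edges as the `plt.hist` call; the sample values are hypothetical and only illustrate the mechanics.

```python
import numpy as np

# Hypothetical values standing in for one column of the dataset.
values = np.array([1200, 8000, 30000, 49000, 60000, 120000, 240000])
# Same bin edges as the plt.hist call; the last bin also includes its right edge.
edges = [0, 50000, 100000, 150000, 250000]

counts, _ = np.histogram(values, bins=edges)
share = counts / counts.sum()

for lo, hi, c, s in zip(edges[:-1], edges[1:], counts, share):
    print(f"{lo} to {hi}: {c} values ({s:.0%})")
```

Values outside the outermost edges are not counted at all, so the shares describe only the data inside the binned range.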


Q-Q plot (the code draws the sorted values against their f-values, i.e. a quantile plot):
Result analysis: the data are concentrated mainly between 0 and 20000 and are unevenly distributed.

df = pd.read_excel(' Homework 1_ Missing values filled .xlsx')
# Sort the column and pair the i-th smallest value with its f-value i/N
values = df.iloc[:,2].sort_values().to_numpy()
n = len(values)
plt.scatter((np.arange(n)+1)/n, values, s=0.5)
x_major_locator = MultipleLocator(0.25)
ax = plt.gca()
ax.xaxis.set_major_locator(x_major_locator)
plt.xlim(0,1)
# Mark Q1, the median, and Q3 at the corresponding quantiles
plt.text(0.25, np.quantile(values, 0.25), "Q1", color="r")
plt.text(0.50, np.quantile(values, 0.50), "Median", color="r")
plt.text(0.75, np.quantile(values, 0.75), "Q3", color="r")
# Set the figure title and the axis names
plt.title("Quantile plot of total base points in the observation window")
plt.xlabel("f-value")
plt.ylabel("Value")
plt.show()
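
If a Q-Q plot in the stricter sense is wanted, the sample quantiles are plotted against the quantiles of a reference distribution rather than against f-values. The sketch below computes such point pairs against the standard normal distribution. This is an assumption: the source does not say which reference distribution is intended, and `normal_qq_points` is a hypothetical helper.

```python
from statistics import NormalDist

import numpy as np

def normal_qq_points(values):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    sample = np.sort(np.asarray(values, dtype=float))
    n = len(sample)
    # Plotting positions (i - 0.5) / n keep every probability strictly inside (0, 1).
    probs = (np.arange(1, n + 1) - 0.5) / n
    std_normal = NormalDist()
    theoretical = np.array([std_normal.inv_cdf(p) for p in probs])
    return theoretical, sample

# Hypothetical sample; with real data, pass a column such as df.iloc[:,2].
theo, samp = normal_qq_points([3.0, 1.0, 2.0, 5.0, 4.0])
print(theo)
print(samp)
```

Calling `plt.scatter(theo, samp)` would then draw the plot; points that stay close to a straight line suggest an approximately normal attribute.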


Scatter plot:
The data are visibly concentrated between 0 and 50000, with a small number of outliers.

df = pd.read_excel(' Homework 1_ Missing values filled .xlsx')
# Plot each record's value against its row index
# iloc[:,2] is the column labeled "total base points" in Method 1; use iloc[:,6] if the quarterly average named in the labels is intended
plt.scatter(np.arange(len(df)), df.iloc[:,2], edgecolor='blue', s=2)
# Label the vertical axis
plt.ylabel("Quarterly average base points in the observation window")
# Set the figure title
plt.title("Scatter plot of quarterly average base points in the observation window")
plt.axis([0,2000,0,300000])
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()


Normalize each attribute with min-max normalization and z-score normalization.
Using Python, apply the formulas x' = (x - min) / (max - min) for min-max normalization and z = (x - mean) / std for z-score normalization.
Min-max normalization:

# Min-max normalization: rescale each attribute to the range [0, 1]
a1=(df.iloc[:,2] - df.iloc[:,2].min())/(df.iloc[:,2].max() - df.iloc[:,2].min())
print("Min-max normalization of total base points: "+str(a1))
a2=(df.iloc[:,3] - df.iloc[:,3].min())/(df.iloc[:,3].max() - df.iloc[:,3].min())
print("Min-max normalization of total ticket price in the second year: "+str(a2))
a3=(df.iloc[:,4] - df.iloc[:,4].min())/(df.iloc[:,4].max() - df.iloc[:,4].min())
print("Min-max normalization of total flight kilometers: "+str(a3))
a4=(df.iloc[:,5] - df.iloc[:,5].min())/(df.iloc[:,5].max() - df.iloc[:,5].min())
print("Min-max normalization of total weighted flight kilometers: "+str(a4))
a5=(df.iloc[:,6] - df.iloc[:,6].min())/(df.iloc[:,6].max() - df.iloc[:,6].min())
print("Min-max normalization of quarterly average base points: "+str(a5))
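
The five near-identical statements can be folded into one vectorized call. The following is a minimal sketch, not the author's code: `min_max_normalize` is a hypothetical helper, and mapping constant columns to 0 is an assumed convention for the case where max equals min.

```python
import pandas as pd

def min_max_normalize(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every numeric column to [0, 1]; constant columns become 0."""
    numeric = frame.select_dtypes(include="number")
    span = numeric.max() - numeric.min()
    # Avoid division by zero for columns whose values are all identical.
    span = span.replace(0, 1)
    return (numeric - numeric.min()) / span

# Hypothetical toy data: one ordinary column and one constant column.
toy = pd.DataFrame({"points": [0, 50, 100], "km": [10, 10, 10]})
scaled = min_max_normalize(toy)
print(scaled)
```

Without the guard, a constant column would produce 0/0 and come back as NaN.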

Because the dataset is large, pandas prints only an abbreviated view of each result.
z-score normalization:

# Zero-mean (z-score) normalization; pandas' std() uses the sample standard deviation (ddof=1) by default
b1=(df.iloc[:,2] - df.iloc[:,2].mean())/df.iloc[:,2].std()
print("z-score normalization of total base points: "+str(b1))
b2=(df.iloc[:,3] - df.iloc[:,3].mean())/df.iloc[:,3].std()
print("z-score normalization of total ticket price in the second year: "+str(b2))
b3=(df.iloc[:,4] - df.iloc[:,4].mean())/df.iloc[:,4].std()
print("z-score normalization of total flight kilometers: "+str(b3))
b4=(df.iloc[:,5] - df.iloc[:,5].mean())/df.iloc[:,5].std()
print("z-score normalization of total weighted flight kilometers: "+str(b4))
b5=(df.iloc[:,6] - df.iloc[:,6].mean())/df.iloc[:,6].std()
print("z-score normalization of quarterly average base points: "+str(b5))
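
The z-score step can be vectorized in the same way, and the choice of standard deviation matters for the result. The sketch below exposes it as a parameter; `z_score_normalize` is a hypothetical helper and the toy column is made up.

```python
import pandas as pd

def z_score_normalize(frame: pd.DataFrame, ddof: int = 1) -> pd.DataFrame:
    """Subtract each numeric column's mean and divide by its standard deviation."""
    numeric = frame.select_dtypes(include="number")
    return (numeric - numeric.mean()) / numeric.std(ddof=ddof)

# Hypothetical toy column with mean 4.
toy = pd.DataFrame({"points": [2.0, 4.0, 6.0]})
z_sample = z_score_normalize(toy)               # sample std (ddof=1), the pandas default
z_population = z_score_normalize(toy, ddof=0)   # population std (ddof=0)
print(z_sample)
print(z_population)
```

With 2000 rows the two variants differ only slightly, but results computed with `ddof=0` (the convention of NumPy's `std` and scikit-learn's `StandardScaler`) will not match the pandas default exactly.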

Because the dataset is large, pandas prints only an abbreviated view of each result.
