Numpy -- data cleaning

2022-07-07 15:50:00 【madkeyboard】

Data cleaning

Dirty data

The data we get is rarely 100% error free. Typical problems include missing values, values that are abnormally large or small, formatting errors, and errors in dependent data, where related fields contradict each other.

The following code generates a small set of dirty data. The resulting array has dtype = object. (dtype is the NumPy data type of the array. Note that dtype = object means the list was converted as-is and the data cannot take part in numeric calculations directly; only numeric types such as int and float can.)

import numpy as np

raw_data = [
    ["Name", "StudentID", "Age", "AttendClass", "Score"],
    ["Xiao Ming", 20131, 10, 1, 67],
    ["floret", 20132, 11, 1, 88],
    ["Side dish", 20133, None, 1, "98"],
    ["Xiao Qi", 20134, 8, 1, 110],
    ["Cauliflower", 20134, 98, 0, None],
    ["Liu Xin", 20136, 12, 0, 12]
]

data = np.array(raw_data)  # strings, ints and None mixed together, so dtype=object
print(data)
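To see why dtype = object matters, here is a minimal sketch. The two-row `rows` list is made up for illustration and is not the article's full table. It suggests that numeric functions such as np.isnan reject an object array, while the same values converted to float64 work, with None turned into nan.

```python
import numpy as np

# Illustrative values only: one missing age stored as None.
rows = [[20131, 10], [20133, None]]

mixed = np.array(rows, dtype=object)   # object dtype, like the raw table
print(mixed.dtype)

try:
    np.isnan(mixed)                    # numeric ufuncs do not accept object arrays
    isnan_ok = True
except TypeError:
    isnan_ok = False
print("isnan works on object array:", isnan_ok)

numeric = np.array(rows, dtype=np.float64)  # None is converted to nan
print(numeric.dtype, int(np.isnan(numeric).sum()))
```

Checking `arr.dtype` right after loading is a cheap way to catch this before any calculation fails.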

Data preprocessing

pre_data = []
for i in range(len(raw_data)):
    if i == 0:  # skip the first row (the string header)
        continue
    pre_data.append(raw_data[i][1:])  # drop the first column (names)

# np.float was removed in NumPy 1.24; use np.float64 instead.
# A float dtype is needed because the data contains None, which only float can represent (as nan).
data = np.array(pre_data, dtype=np.float64)
print(data)
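The same preprocessing can also be written with slicing instead of an index loop. This is only a sketch of an equivalent idiom, using a shortened copy of the table (three of the original rows) so that it runs on its own.

```python
import numpy as np

# Shortened copy of the article's table, for illustration.
raw_data = [
    ["Name", "StudentID", "Age", "AttendClass", "Score"],
    ["Xiao Ming", 20131, 10, 1, 67],
    ["Side dish", 20133, None, 1, "98"],
    ["Cauliflower", 20134, 98, 0, None],
]

# raw_data[1:] skips the header row, row[1:] skips the name column.
pre_data = [row[1:] for row in raw_data[1:]]
data = np.array(pre_data, dtype=np.float64)  # None -> nan, "98" -> 98.0

print(data.shape)
print(data)
```

Note that the string "98" is parsed into the number 98.0 during conversion, which quietly fixes the format error in the Score column.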

Data cleaning

Clean out all illogical data. In the data entered above, the first column contains an obviously repeated student ID, which is illogical. np.unique() returns the unique values, and with return_counts=True it also reports how many times each value occurs. Here 20134 appears twice, which tells us that 20135 was never recorded, so the data can be corrected.

fcow = data[:, 0]  # first column: all student IDs
print(fcow)
unique, counts = np.unique(fcow, return_counts=True)  # return_counts also returns how often each value occurs

print("Unique student IDs:", unique)
print("Occurrences of each ID:", counts)
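The article identifies the duplicate and the missing ID by eye. A sketch of how that check and correction might be automated follows. It rests on two assumptions that are not in the original: that student IDs form an unbroken run, and that the later of the two duplicate rows is the mislabelled one.

```python
import numpy as np

ids = np.array([20131, 20132, 20133, 20134, 20134, 20136], dtype=np.float64)

unique, counts = np.unique(ids, return_counts=True)
duplicated = unique[counts > 1]              # values seen more than once

# Assumption: IDs should run without gaps from the smallest to the largest.
expected = np.arange(ids.min(), ids.max() + 1)
missing = np.setdiff1d(expected, unique)     # expected but never seen

print("Duplicated:", duplicated)
print("Missing:", missing)

# Assumption: the later of the two duplicate rows carries the wrong ID.
if duplicated.size == 1 and missing.size == 1:
    second = np.flatnonzero(ids == duplicated[0])[1]
    ids[second] = missing[0]
print("Corrected:", ids)
```

With real data the correction step would normally need confirmation from the source records; the guard on `size == 1` keeps the sketch from guessing when the situation is ambiguous.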

Next, look at the second column. One value is clearly missing. We can fill in a missing value like this with the average of the existing data.

is_nan = np.isnan(data[:, 1])  # True where the second column (Age) is nan, i.e. was None
nan_idx = np.argwhere(is_nan)
print("Index of the missing value:", nan_idx)

mean_age = data[~np.isnan(data[:, 1]), 1].mean()  # ~ inverts the Boolean mask, so only the non-missing ages are averaged
print("Average age:", mean_age)
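As a side note, NumPy also provides np.nanmean, which skips nan values internally. The sketch below compares it with the Boolean-mask approach on the Age column from the article, copied into a standalone array.

```python
import numpy as np

ages = np.array([10, 11, np.nan, 8, 98, 12], dtype=np.float64)

masked_mean = ages[~np.isnan(ages)].mean()  # Boolean mask drops the nan entry
nan_mean = np.nanmean(ages)                 # ignores nan internally

print(masked_mean, nan_mean)
print(np.isnan(ages.mean()))                # a plain mean is poisoned by nan
```

Both approaches give the same value; the mask form is more flexible because further conditions can be combined into the same mask.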

This result is puzzling: the average age of primary school students is about 28? Looking at the data, one entry is 98, which is obviously wrong. This is abnormal data. So both bad entries, the missing value and the outlier, need to be discarded and replaced with the average of the remaining data.

normal_idx = ~np.isnan(data[:, 1]) & (data[:, 1] < 13)  # True where the age is present and plausible (under 13)
print("Rows marked False need to be changed:", normal_idx)

mean_age = data[normal_idx, 1].mean()  # average over the normal ages only
print("Average age:", mean_age)

data[~normal_idx, 1] = mean_age
print("Data after cleaning:", np.floor(data[:, 1]))  # ages have no decimals, so round down
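The same fill can be done without modifying the original array by using np.where. This is a sketch on a standalone copy of the Age column; `MAX_AGE` is a hypothetical name for the cutoff of 13 used in the article.

```python
import numpy as np

ages = np.array([10, 11, np.nan, 8, 98, 12], dtype=np.float64)
MAX_AGE = 13  # hypothetical name for the article's "< 13" cutoff

# nan < 13 already evaluates to False; the explicit isnan check documents the intent.
normal = ~np.isnan(ages) & (ages < MAX_AGE)
fill = np.floor(ages[normal].mean())     # mean of 10, 11, 8, 12, rounded down

cleaned = np.where(normal, ages, fill)   # keep normal entries, fill the rest
print(cleaned)
```

Keeping `ages` untouched makes it easy to compare the column before and after cleaning.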

Finally, look at the last two columns. In AttendClass, 0 and 1 indicate whether the student attended class. A student who did not attend cannot have a score, so the last row is a problem. Also, the maximum score in primary school is generally 100, so any score above 100 is abnormal data.

data = np.array(pre_data, dtype=np.float64)  # rebuild from pre_data; float is needed because the data contains None, which becomes nan

data[data[:, 2] == 0, 3] = np.nan  # students who did not attend class have no score

data[:, 3] = np.clip(data[:, 3], 0, 100)  # clip scores that fall outside the valid range 0 to 100
print(data[:, 2:])  # print the last two columns
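The two steps above interact with nan in a way worth checking. The sketch below repeats them on standalone copies of the AttendClass and Score columns and suggests that np.clip leaves nan entries as nan while pulling 110 down to 100.

```python
import numpy as np

attend = np.array([1, 1, 1, 1, 0, 0])
scores = np.array([67, 88, 98, 110, np.nan, 12], dtype=np.float64)

scores[attend == 0] = np.nan          # no attendance, so no valid score
clipped = np.clip(scores, 0, 100)     # 110 becomes 100; nan stays nan

print(clipped)
```

Clipping treats 110 as a capped 100 rather than as an unknown value. Whether that is appropriate depends on the data; marking it as nan would be the more cautious alternative.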

Finally, compare the data before and after cleaning. The data set here is very small, but it already calls for several cleaning methods. Once you are familiar with them, much larger data sets become easier to handle and to keep clean.

Copyright notice
This article was written by [madkeyboard]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/188/202207071326431723.html