
NumPy -- data cleaning

2022-07-07 15:50:00 madkeyboard

Data cleaning

Dirty data

Usually the data we get is not 100% error-free. There are often problems such as missing values, values that are abnormally large or small, formatting errors, and inconsistencies between related fields.

The following code builds a small set of dirty data. Printing the resulting array shows dtype = object (dtype is NumPy's description of the element type). Note that when dtype = object, the data converted directly from the list cannot take part in calculations; only numeric types such as int and float can.

import numpy as np

raw_data = [
    ["Name", "StudentID", "Age", "AttendClass", "Score"],
    ["Xiao Ming", 20131, 10, 1, 67],
    ["Xiao Hua", 20132, 11, 1, 88],
    ["Xiao Cai", 20133, None, 1, "98"],
    ["Xiao Qi", 20134, 8, 1, 110],
    ["Hua Cai", 20134, 98, 0, None],
    ["Liu Xin", 20136, 12, 0, 12],
]

data = np.array(raw_data)
print(data)
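
To make the dtype = object point concrete, here is a minimal sketch (using the data array built above): an object array cannot be summed or averaged directly, while a purely numeric array can.

print(data.dtype)                 # object -- the list mixes strings, ints and None
# data[1:, 2].mean()              # would raise TypeError: object elements cannot be averaged directly
print(np.array([1, 2.5]).dtype)   # float64 -- purely numeric input stays numeric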


Data preprocessing

pre_data = []
for i in range(len(raw_data)):
    if i == 0:          # skip the header row
        continue
    pre_data.append(raw_data[i][1:])   # drop the Name column

data = np.array(pre_data, dtype=np.float64)  # float64 is needed because the data contains None; only a float array can hold it (as NaN)
print(data)
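
As a side note on why a float dtype is required: None has no integer representation, but a float array stores it as np.nan. A minimal sketch:

print(np.array([10, None], dtype=np.float64))   # [10. nan] -- None becomes NaN
# np.array([10, None], dtype=np.int64)          # would raise TypeError: an integer array cannot represent None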


Data cleaning

Next, clean out the data that is logically impossible. In the data entered above, the first column clearly contains a duplicated student ID, which should not happen. np.unique() returns the distinct values, and with return_counts=True it also reports how many times each value occurs. In the output below, 20134 appears twice while 20135 never appears, so we can reasonably assume that one of the 20134 records is really 20135 and correct the data (see the sketch after the code).

student_ids = data[:, 0]   # the first column: all student IDs
print(student_ids)
unique, counts = np.unique(student_ids, return_counts=True)  # return_counts=True reports how often each value occurs

print("Unique student IDs:", unique)
print("Occurrences of each ID:", counts)


Looking at the second column (Age), one value is obviously missing. A common way to fill such a gap is to use the mean of the existing values.

is_nan = np.isnan(data[:, 1])    # boolean mask: True where Age is NaN
nan_idx = np.argwhere(is_nan)    # row indices of the missing ages
print("Rows with missing age:", nan_idx)

mean_age = data[~np.isnan(data[:, 1]), 1].mean()  # ~ negates the mask, so only rows whose Age is not NaN are averaged
print("Average age:", mean_age)


The result is puzzling at first: an average age of 28 for primary school students? Looking at the data again, one age is 98, which is clearly wrong -- an outlier. So we need to exclude both the missing value and this abnormal value, and then replace them with the mean of the remaining data.

normal_idx = ~np.isnan(data[:, 1]) & (data[:, 1] < 13)  # mask of valid ages: not NaN and plausibly small (< 13)
print("Rows marked False need to be fixed:", normal_idx)

mean_age = data[normal_idx, 1].mean()   # average over the valid ages only
print("Average age:", mean_age)

data[~normal_idx, 1] = mean_age         # overwrite the invalid ages with the mean
print("Ages after cleaning:", np.floor(data[:, 1]))   # ages have no decimals, so round down


Finally, look at the last two columns. Here AttendClass is 0 or 1, indicating whether the student attended class. A student who never attended class cannot have a score, so the last row is problematic. Also, a primary-school test is normally scored out of 100, so any score above 100 is abnormal data.

data = np.array(pre_data, dtype=np.float64)  # rebuild the array; float64 is used because the data contains None, which only a float array can hold (as NaN)

data[data[:, 2] == 0, 3] = np.nan           # students who did not attend class get no score (NaN)

data[:, 3] = np.clip(data[:, 3], 0, 100)    # cap scores to the valid range 0-100
print(data[:, 2:])                          # print the last two columns
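
If capping at 100 is not the behaviour you want, an alternative (a sketch, not part of the original post) is to treat out-of-range scores as missing instead of clipping them; this works on a freshly rebuilt score column, before any np.clip call:

scores = np.array(pre_data, dtype=np.float64)[:, 3]   # raw scores, rebuilt from pre_data
scores[(scores < 0) | (scores > 100)] = np.nan        # mark impossible scores as missing
print(scores)                                         # 110 becomes nan instead of being cut down to 100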


Finally, compare the data before and after cleaning. Although the dataset here is tiny, it already required several different cleaning techniques. Once you are familiar with them, it is easy to handle much larger datasets and keep your data clean and tidy.


Copyright notice
This article was written by [madkeyboard]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/188/202207071326431723.html
