Numpy -- data cleaning
2022-07-07 15:50:00 【madkeyboard】
Data cleaning
Dirty data
The data we receive is rarely 100% error-free. Typical problems include missing values, abnormally large or small values, format errors, and errors in dependent fields.
The following code builds a small set of dirty data. The resulting array has dtype = object. (dtype is the NumPy data type. Note that when dtype = object, the data converted directly from the list cannot be used in calculations; only numeric types such as int and float can take part in them.)
import numpy as np

raw_data = [
    ["Name", "StudentID", "Age", "AttendClass", "Score"],
    ["Xiao Ming", 20131, 10, 1, 67],
    ["floret", 20132, 11, 1, 88],
    ["Side dish", 20133, None, 1, "98"],
    ["Xiao Qi", 20134, 8, 1, 110],
    ["Cauliflower", 20134, 98, 0, None],
    ["Liu Xin", 20136, 12, 0, 12]
]
data = np.array(raw_data)
print(data)
print(data.dtype)  # object
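To make the point about dtype = object concrete, here is a minimal sketch on a two-row table (the rows are illustrative, not the full data set). It checks the dtype and shows that a numeric function such as np.isnan refuses to work on an object array.

```python
import numpy as np

# Two illustrative rows that mix strings, integers and None.
rows = [
    ["Xiao Ming", 20131, 10],
    ["Side dish", 20133, None],
]
obj = np.array(rows)
print(obj.dtype)  # object

# np.isnan is not defined for object arrays, so this call raises TypeError.
try:
    np.isnan(obj[:, 2])
    isnan_failed = False
except TypeError:
    isnan_failed = True
print("np.isnan failed on the object column:", isnan_failed)
```

This is why the numeric columns are converted to float before any cleaning is done.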
Data preprocessing
pre_data = []
for i in range(len(raw_data)):
    if i == 0:  # Skip the header row of strings
        continue
    pre_data.append(raw_data[i][1:])  # Drop the first column (names)
# float is used because the data contains None; only a float dtype can hold it (as nan).
# np.float was removed in NumPy 1.24, so np.float64 is used instead.
data = np.array(pre_data, dtype=np.float64)
print(data)
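The same preprocessing can be written more compactly with slicing. The sketch below uses a shortened copy of the table so that it runs on its own; it also shows that None becomes nan and that the string "98" is parsed as a number during the conversion.

```python
import numpy as np

# Shortened copy of the raw table: header row plus two records.
raw_data = [
    ["Name", "StudentID", "Age", "AttendClass", "Score"],
    ["Xiao Ming", 20131, 10, 1, 67],
    ["Side dish", 20133, None, 1, "98"],
]

# raw_data[1:] skips the header row, row[1:] drops the name column.
pre_data = [row[1:] for row in raw_data[1:]]
data = np.array(pre_data, dtype=np.float64)
print(data)
print("Missing age in second record:", np.isnan(data[1, 1]))
print("Score parsed from a string:", data[1, 3])
```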
Data cleaning
The first step is to clean out all illogical data. In the data entered above, the first column clearly contains a repeated student ID, which is illogical. np.unique() returns the unique values, and with return_counts=True it also reports how many times each value occurs. The output shows that 20134 appears twice, so we can tell that 20135 was not recorded, and the data can then be corrected.
fcow = data[:, 0]  # Take all student IDs from the first column
print(fcow)
unique, counts = np.unique(fcow, return_counts=True)  # return_counts also returns how often each value occurs
print("Data after cleaning:", unique)
print("Number of occurrences of each value:", counts)
Now look at the second column. It is immediately clear that a value is missing. One way to fill in a missing value is to use the average of the existing data.
is_nan = np.isnan(data[:, 1])  # Find the missing (nan) entries in the second column
nan_idx = np.argwhere(is_nan)
print("Index:", nan_idx, "is nan")
# ~ negates the Boolean mask returned by isnan, so only non-missing values are selected and averaged
mean_age = data[~np.isnan(data[:, 1]), 1].mean()
print("Average age:", mean_age)
This result is puzzling: the average age of primary school students is 28? Looking at the data, one entry is 98, which is obviously wrong. This is abnormal data. So both bad entries, the missing value and the 98, need to be discarded and replaced with the average of the remaining data.
normal_idx = ~np.isnan(data[:, 1]) & (data[:, 1] < 13)  # True where the age is present and plausible
print("(False) marks the data that needs to be changed:", normal_idx)
mean_age = data[normal_idx, 1].mean()  # Average only the valid ages
print("Average age:", mean_age)
data[~normal_idx, 1] = mean_age  # ~ negates the mask, selecting the missing and abnormal entries
print("Data after cleaning:", np.floor(data[:, 1]))  # Ages have no decimals, so round down
Finally, look at the last two columns. In the attendance column, 0 and 1 indicate whether the student attended class. A student who did not attend cannot have a score, so the last row is wrong. In addition, the maximum score in primary school is generally 100, so any score above 100 is abnormal data.
# float is used because the data contains None; only a float dtype can hold it (as nan)
data = np.array(pre_data, dtype=np.float64)
data[data[:, 2] == 0, 3] = np.nan  # Students who did not attend have no score, so set it to nan
data[:, 3] = np.clip(data[:, 3], 0, 100)  # Clip scores that fall outside the valid range 0 to 100
print(data[:, 2:])  # Output the last two columns
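Before overwriting anything, it can help to report which rows break each rule. The sketch below works on copies of the last two columns (variable names are illustrative) and then applies the same two fixes; note that np.clip leaves nan values as nan.

```python
import numpy as np

# Last two columns of the table: attendance flag and score.
attended = np.array([1, 1, 1, 1, 0, 0], dtype=np.float64)
scores = np.array([67, 88, 98, 110, np.nan, 12], dtype=np.float64)

# Rule 1: a student who did not attend should not have a score.
score_without_class = (attended == 0) & ~np.isnan(scores)

# Rule 2: scores must lie between 0 and 100 (nan compares as False).
with np.errstate(invalid="ignore"):
    out_of_range = (scores < 0) | (scores > 100)

print("Rows with a score but no attendance:", np.flatnonzero(score_without_class))
print("Rows with a score outside 0 to 100:", np.flatnonzero(out_of_range))

# Apply the fixes: blank out scores without attendance, then clip the rest.
cleaned = np.where(attended == 0, np.nan, scores)
cleaned = np.clip(cleaned, 0, 100)
print(cleaned)
```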
Finally, compare the data before and after cleaning. Although this data set is very small, it already calls for several different cleaning methods. Once you are familiar with them, it becomes much easier to handle larger data sets and keep your data clean.