"Dive into Deep Learning" Chapter 2 (Preliminaries), 2.2 Data Preprocessing: study notes and exercise answers
2022-07-08 02:10:00 【coder_ sure】
2.2 Data preprocessing
author github link : github link
Exercises
Create a raw dataset with more rows and columns.
Delete the column with the most missing values.
Convert the preprocessed dataset to tensor format.
Solution
To solve real-world problems with deep learning, we often start by preprocessing raw data rather than with data already prepared in tensor format. Among the data analysis tools commonly used in Python, we usually use the pandas package. Like many other extension packages in the vast Python ecosystem, pandas works together with tensors. This section briefly introduces how to preprocess raw data with pandas and how to convert it to tensor format. More data preprocessing techniques are covered in later chapters.
Reading the dataset
As an example, we first create an artificial dataset and store it in the CSV (comma-separated values) file ../data/house_tiny.csv. Data stored in other formats can be processed in a similar way. Below we write the dataset to the CSV file row by row.
import os
os.makedirs(os.path.join('..', 'data'), exist_ok=True)  # Create the data folder if it does not exist
# CSV (comma-separated values): each row is one record, and columns are separated by commas
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price,area\n')  # Column names (I added an area column)
    f.write('NA,Pave,127500,120\n')  # Each row represents one data sample
    f.write('2,NA,106000,100\n')
    f.write('4,NA,178100,170\n')
    f.write('NA,NA,140000,140\n')
    f.write('NA,NA,140000,130\n')
To load the raw dataset from the CSV file we just created, we import the pandas package and call the read_csv function. The dataset has five rows and four columns. Each row describes the number of rooms ("NumRooms"), the alley type ("Alley"), the house price ("Price"), and the floor area ("area").
# If pandas is not installed, uncomment the following line to install it
# !pip install pandas
import pandas as pd
data = pd.read_csv(data_file)
data
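Output:
   NumRooms Alley   Price  area
0       NaN  Pave  127500   120
1       2.0   NaN  106000   100
2       4.0   NaN  178100   170
3       NaN   NaN  140000   140
4       NaN   NaN  140000   130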
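Before choosing how to handle missing values, it can help to count them per column. The sketch below is only an illustration: it rebuilds the same table in memory with io.StringIO so that it runs without the CSV file, and the variable names csv_text, missing_per_column and worst_column are my own. It relies on read_csv treating the string NA as a missing value by default.

```python
import io

import pandas as pd

# Same content as house_tiny.csv, kept in memory so the sketch is self-contained
csv_text = (
    "NumRooms,Alley,Price,area\n"
    "NA,Pave,127500,120\n"
    "2,NA,106000,100\n"
    "4,NA,178100,170\n"
    "NA,NA,140000,140\n"
    "NA,NA,140000,130\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# isna() marks missing entries; summing the booleans counts them per column
missing_per_column = data.isna().sum()

# idxmax() returns the label of the column with the largest count
worst_column = missing_per_column.idxmax()

print(data.shape)
print(missing_per_column)
print(worst_column)
```

For this dataset the column with the most missing values is Alley, which is the column the second exercise asks us to delete.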
Handling missing values
Note that "NaN" entries represent missing values. Typical ways to handle missing data are imputation and deletion: imputation replaces a missing value with a substitute value, while deletion simply discards it. Following the second exercise, we delete the column with the most missing values and then fill the remaining missing values with the column mean.
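# Drop the column with the most missing values (Alley). This step is not shown in the original post; this is one way to do it.
data = data.drop(columns=data.isna().sum().idxmax())
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = inputs.fillna(inputs.mean())  # Replace "NaN" entries with the mean of the same column
print(inputs)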
Output:
   NumRooms   Price
0       3.0  127500
1       2.0  106000
2       4.0  178100
3       3.0  140000
4       3.0  140000
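To make the difference between deletion and imputation concrete, here is a minimal sketch that applies both to a small table with the same NumRooms and Price values as above. The table is rebuilt by hand so the snippet runs on its own, and the names deleted and imputed are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000, 140000],
})

# Deletion: discard every row that contains a missing value
deleted = data.dropna()

# Imputation: keep every row and substitute the column mean for NaN
imputed = data.fillna(data.mean())

print(len(deleted), len(imputed))
print(imputed)
```

Deletion keeps only the two complete rows, while imputation keeps all five and fills NumRooms with the mean of the observed values (3.0). On a dataset this small, deleting rows throws away most of the data, which is one reason imputation is used here.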
Convert to tensor format
Now that all entries in inputs and outputs are numeric, they can be converted to tensor format. Once the data is in tensor format, it can be manipulated further with the tensor functions introduced in Section 2.1.
import torch
X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
X, y
Output:
(tensor([[3.0000e+00, 1.2750e+05],
         [2.0000e+00, 1.0600e+05],
         [4.0000e+00, 1.7810e+05],
         [3.0000e+00, 1.4000e+05],
         [3.0000e+00, 1.4000e+05]], dtype=torch.float64),
 tensor([120, 100, 170, 140, 130]))
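In the output above, X is float64 and y is an integer tensor, because torch.tensor keeps the dtype of the underlying NumPy array. Many PyTorch layers use float32 by default, so it may be convenient to cast during conversion. The following is a sketch under that assumption, not part of the original answer; the inputs and outputs objects are rebuilt by hand so the snippet runs on its own.

```python
import pandas as pd
import torch

# Hand-built copies of the preprocessed inputs and outputs shown above
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0, 3.0],
    "Price": [127500, 106000, 178100, 140000, 140000],
})
outputs = pd.Series([120, 100, 170, 140, 130], name="area")

# to_numpy(dtype=...) casts while exporting, so both tensors end up float32
X = torch.tensor(inputs.to_numpy(dtype="float32"))
y = torch.tensor(outputs.to_numpy(dtype="float32"))

print(X.dtype, X.shape, y.shape)
```

Whether to cast the target as well depends on the task: a regression loss usually expects floating-point targets, while class labels are normally kept as integers.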