"Dive into Deep Learning", Chapter 2 (Preliminaries), 2.2 Data Preprocessing: Study Notes and Exercise Answers
2022-07-08 02:10:00 【coder_ sure】
2.2 Data preprocessing
Author's GitHub: github link
Exercises
Create a raw dataset with more rows and columns.
Delete the column with the most missing values.
Convert the preprocessed dataset into tensor format.
Solution
To apply deep learning to real-world problems, we often begin by preprocessing raw data rather than with data that is already in tensor format. Among the data analysis tools commonly used in Python, we usually rely on the pandas package. Like many other extension packages in the vast Python ecosystem, pandas is compatible with tensors. This section briefly walks through preprocessing raw data with pandas and converting it into tensor format. More data preprocessing techniques are introduced in later chapters.
Reading the dataset
As an example, we first create an artificial dataset and store it in the CSV (comma-separated values) file ../data/house_tiny.csv. Data stored in other formats can be processed in a similar way. Below we write the dataset to the CSV file row by row.
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)  # Create the data folder if it does not exist
data_file = os.path.join('..', 'data', 'house_tiny.csv')  # CSV: each row is one sample, columns are separated by commas
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price,area\n')  # Column names (an extra "area" column was added)
    f.write('NA,Pave,127500,120\n')  # Each row represents one data sample
    f.write('2,NA,106000,100\n')
    f.write('4,NA,178100,170\n')
    f.write('NA,NA,140000,140\n')
    f.write('NA,NA,140000,130\n')
To load the raw dataset from the CSV file we just created, we import the pandas package and call the read_csv function. The dataset has five rows and four columns. Each row describes the number of rooms ("NumRooms"), the alley type ("Alley"), the house price ("Price"), and the floor area ("area").
# If pandas is not installed, uncomment the following line to install it
# !pip install pandas
import pandas as pd
data = pd.read_csv(data_file)
data
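Output:
   NumRooms Alley   Price  area
0       NaN  Pave  127500   120
1       2.0   NaN  106000   100
2       4.0   NaN  178100   170
3       NaN   NaN  140000   140
4       NaN   NaN  140000   130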
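A quick way to confirm the size of the dataset is to inspect its shape. The sketch below rebuilds the same rows in memory rather than reading ../data/house_tiny.csv, so it runs on its own; the names csv_text and check are illustrative and not part of the original post. It relies on read_csv treating the string "NA" as a missing value by default.

```python
import io

import pandas as pd

# Same content as house_tiny.csv, kept in memory so the snippet is self-contained.
csv_text = (
    "NumRooms,Alley,Price,area\n"
    "NA,Pave,127500,120\n"
    "2,NA,106000,100\n"
    "4,NA,178100,170\n"
    "NA,NA,140000,140\n"
    "NA,NA,140000,130\n"
)

check = pd.read_csv(io.StringIO(csv_text))  # "NA" is parsed as NaN by default
n_rows, n_cols = check.shape
print(n_rows, n_cols)       # prints: 5 4
print(list(check.columns))  # column names taken from the header row
```

The header line is not counted as a row, which is why five f.write data lines give five samples.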
Handling missing values
Note that "NaN" entries denote missing values. To handle missing data, typical methods are imputation and deletion: imputation replaces a missing value with a substitute value, while deletion simply discards missing values. Here we delete the column with the most missing values and fill the remaining gaps with the column mean.
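# Drop the column with the most missing values (here "Alley"), as the output below implies
data = data.drop(columns=data.isna().sum().idxmax())
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = inputs.fillna(inputs.mean())  # Replace the remaining "NaN" entries with the mean of the same column
print(inputs)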
Output:
   NumRooms   Price
0       3.0  127500
1       2.0  106000
2       4.0  178100
3       3.0  140000
4       3.0  140000
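The exercise of deleting the column with the most missing values can be sketched as follows. This is one possible approach rather than necessarily the author's exact code: it counts NaN entries per column with isna().sum(), picks the column with the largest count using idxmax(), and drops it before imputing the rest. The frame df and the names missing_per_column, worst_column, reduced, and filled are illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the dataset above, built directly so the snippet is self-contained.
df = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000, 140000],
    "area": [120, 100, 170, 140, 130],
})

missing_per_column = df.isna().sum()        # NaN count for each column
worst_column = missing_per_column.idxmax()  # label of the column with the most NaNs
reduced = df.drop(columns=worst_column)     # deletion: remove that column entirely

# Imputation: fill the remaining gaps with each column's mean.
filled = reduced.fillna(reduced.mean())
print(worst_column)
print(filled)
```

Here "Alley" has four missing entries against three for "NumRooms", so it is the column that gets removed. If two columns tied, idxmax() would return the first of them.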
Convert to tensor format
Now that all entries in inputs and outputs are numeric, they can be converted to tensor format. Once the data is in tensor format, it can be further manipulated with the tensor functions introduced in :numref:sec_ndarray.
import torch
X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
X, y
Output:
(tensor([[3.0000e+00, 1.2750e+05],
         [2.0000e+00, 1.0600e+05],
         [4.0000e+00, 1.7810e+05],
         [3.0000e+00, 1.4000e+05],
         [3.0000e+00, 1.4000e+05]], dtype=torch.float64),
 tensor([120, 100, 170, 140, 130]))
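The conversion above keeps the NumPy dtypes, so X is float64 while y is an integer tensor. PyTorch's default floating-point type is float32, so a common follow-up (an assumption about later use, not something the original post does) is to request the dtype explicitly. A minimal sketch, with the preprocessed values rebuilt by hand so it runs on its own:

```python
import pandas as pd
import torch

# Preprocessed values from the section above, rebuilt here so the snippet is self-contained.
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0, 3.0],
    "Price": [127500, 106000, 178100, 140000, 140000],
})
outputs = pd.Series([120, 100, 170, 140, 130], name="area")

# to_numpy() returns one float64 array because the columns mix float and int.
X = torch.tensor(inputs.to_numpy(), dtype=torch.float32)
y = torch.tensor(outputs.to_numpy(), dtype=torch.float32)
print(X.shape, X.dtype)
print(y.shape, y.dtype)
```

Whether float32 is appropriate depends on the model that consumes the data; float64 is equally valid when the extra precision matters.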