"Dive into Deep Learning" Chapter 2 (Preliminaries), 2.2 Data Preprocessing: Study Notes and Exercise Answers

2022-07-08 02:10:00 【coder_ sure】

Contents

2.2 Data preprocessing
- Exercises

2.2 Data preprocessing

Author's GitHub link: github link

Exercises

Create a raw dataset with more rows and columns.
Delete the column with the most missing values.
Convert the preprocessed dataset to tensor format.

Solution

To solve real-world problems with deep learning, we often start by preprocessing raw data rather than with data that is already in tensor format. Among the data analysis tools commonly used in Python, we usually rely on the pandas package. Like many other extension packages in the vast Python ecosystem, pandas is compatible with tensors. This section briefly walks through the steps for preprocessing raw data with pandas and converting it to tensor format. More data preprocessing techniques are introduced in later chapters.

Reading data sets

As an example, we first create an artificial dataset and store it in the CSV (comma-separated values) file ../data/house_tiny.csv. Data stored in other formats can be processed in a similar way. Below, we write the dataset to the CSV file row by row.

import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)  # Create the data folder if it does not exist
# Path of the CSV file. In CSV format each row is one record and columns are separated by commas.
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price,area\n')  # Column names (an extra "area" column was added for this exercise)
    f.write('NA,Pave,127500,120\n')  # Each row represents one data sample
    f.write('2,NA,106000,100\n')
    f.write('4,NA,178100,170\n')
    f.write('NA,NA,140000,140\n')
    f.write('NA,NA,140000,130\n')

To load the raw dataset from the CSV file we just created, we import the pandas package and call the read_csv function. The dataset has five rows and four columns. Each row describes the number of rooms ("NumRooms"), the alley type ("Alley"), the house price ("Price"), and the floor area ("area").

# If pandas is not installed, uncomment the following line to install it
# !pip install pandas
import pandas as pd

data = pd.read_csv(data_file)
data

Output: (figure showing the loaded DataFrame)
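Before deciding how to handle missing values, it can help to check the shape of the loaded data and count the missing entries per column. The following is a minimal sketch under the assumption that the CSV content matches the rows written above. It reads the same text from memory (via io.StringIO) so that it runs on its own, and the names csv_text, df, and missing_per_column are chosen here for illustration only.

```python
import io

import pandas as pd

# Same rows as house_tiny.csv above, kept in memory so the sketch is self-contained
csv_text = (
    "NumRooms,Alley,Price,area\n"
    "NA,Pave,127500,120\n"
    "2,NA,106000,100\n"
    "4,NA,178100,170\n"
    "NA,NA,140000,140\n"
    "NA,NA,140000,130\n"
)

# read_csv treats the string "NA" as a missing value by default
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape              # number of rows and columns
missing_per_column = df.isna().sum()   # count of NaN entries in each column

print(n_rows, n_cols)
print(missing_per_column)
```

For this dataset, "Alley" has the most missing entries (4 of 5), followed by "NumRooms" (3 of 5), while "Price" and "area" have none.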

Handling missing values

Note that "NaN" entries represent missing values. To handle missing data, typical methods include imputation and deletion: imputation replaces a missing value with a substitute value, while deletion simply ignores missing values. Here, following the exercise, we delete the column with the most missing values and fill the remaining gaps with the column mean.

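# Drop the column with the most missing values (here "Alley", which has 4 NaN entries)
data = data.drop(columns=data.isna().sum().idxmax())
# After the drop, the columns are NumRooms, Price, area
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = inputs.fillna(inputs.mean())  # Replace the remaining "NaN" entries with the mean of the same column
print(inputs)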

Output:

   NumRooms   Price
0       3.0  127500
1       2.0  106000
2       4.0  178100
3       3.0  140000
4       3.0  140000
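The difference between imputation and deletion can be sketched on a small example. The frame below is hypothetical (it is not the article's dataset), and the variable names toy, imputed, worst, dropped_col, and dropped_rows are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to contrast the two strategies
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7, 8, 9],
})

# Imputation: fill each NaN with the mean of its own column
imputed = toy.fillna(toy.mean())

# Deletion by column: drop the single column with the most NaN entries
worst = toy.isna().sum().idxmax()
dropped_col = toy.drop(columns=worst)

# Deletion by row: drop every row that contains at least one NaN
dropped_rows = toy.dropna(axis=0)

print(imputed)
print(dropped_col)
print(dropped_rows)
```

Imputation keeps every row and column but introduces estimated values. Deletion keeps only observed values but discards data, which can be costly when the dataset is as small as the one used here.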

Convert to tensor format

Now that all entries in inputs and outputs are numeric, they can be converted to tensor format. Once the data is in tensor format, it can be further manipulated with the tensor functions introduced in Section 2.1 (Data Manipulation).

import torch

X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
X, y

Output:

(tensor([[3.0000e+00, 1.2750e+05],
         [2.0000e+00, 1.0600e+05],
         [4.0000e+00, 1.7810e+05],
         [3.0000e+00, 1.4000e+05],
         [3.0000e+00, 1.4000e+05]], dtype=torch.float64),
 tensor([120, 100, 170, 140, 130]))
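In the output above, X is float64 because pandas upcasts the mixed float and integer columns to a common dtype, while y stays an integer tensor. PyTorch's default floating-point dtype is float32, so it may be convenient to convert explicitly. The following is a minimal sketch, assuming small stand-in inputs and outputs objects that mirror the first three rows of the preprocessed data. The choice of float32 is an assumption for illustration, not part of the original solution.

```python
import pandas as pd
import torch

# Stand-ins for the preprocessed inputs/outputs (first three rows only)
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0],
    "Price": [127500, 106000, 178100],
})
outputs = pd.Series([120, 100, 170], name="area")

# Ask pandas for float32 arrays, then wrap them as tensors
X = torch.tensor(inputs.to_numpy(dtype="float32"))
y = torch.tensor(outputs.to_numpy(dtype="float32"))

print(X.dtype, X.shape)
print(y.dtype, y.shape)
```

Converting at the pandas stage avoids carrying a float64 tensor into later code that expects float32 parameters and inputs.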