How to Load Data from CSV (Data Preparation Part)
2022-06-11 20:56:00 【梦想家DBA】
- How to load a CSV file.
- How to convert strings from a file to floating point numbers.
- How to convert class values from a file to integers.
1.2 Tutorial
- Load a file.
- Load a file and convert strings to floats.
- Load a file and convert strings to integers.
# Function for loading a CSV
# Load a CSV file
from csv import reader

def load_csv(filename):
    # Read every row of the file into a list of lists;
    # the with block closes the file once reading is done
    with open(filename, "r") as file:
        lines = reader(file)
        dataset = list(lines)
    return dataset

load_csv('pima-indians-diabetes.data.csv')
# Example of Loading the Pima Indians Diabetes Dataset CSV File
# Example of loading Pima Indians CSV dataset
from csv import reader

# Load a CSV file
def load_csv(filename):
    with open(filename, "r") as file:
        lines = reader(file)
        dataset = list(lines)
    return dataset

# Load dataset
filename = 'pima-indians-diabetes.data.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
Sample output from loading the Pima Indians Diabetes dataset CSV file.

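A limitation of this function is that it loads empty lines from the data file and adds them to our list of rows as empty lists. We can avoid this by adding rows to the dataset one at a time and skipping any that are empty. Below is the updated example with the improved version of the load_csv() function.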
# Improved Example of Loading the Pima Indians Diabetes Dataset CSV File
# Example of loading Pima Indians CSV dataset
from csv import reader

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            # Skip empty rows
            if not row:
                continue
            dataset.append(row)
    return dataset

# Load dataset
filename = 'pima-indians-diabetes.data.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
Sample output from loading the Pima Indians Diabetes dataset CSV file.
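To see what the empty-row check changes, the sketch below writes a tiny, made-up CSV file containing blank lines to a temporary location and loads it twice: once with a plain csv reader and once with the improved load_csv(). The file contents and the names sample_path, raw_rows and clean_rows are illustrative assumptions, not part of the Pima dataset.

```python
import csv
import os
import tempfile

def load_csv(filename):
    """Load a CSV file into a list of rows, skipping empty lines."""
    dataset = []
    with open(filename, 'r', newline='') as file:
        for row in csv.reader(file):
            if not row:
                continue
            dataset.append(row)
    return dataset

# Write a small, made-up CSV with one blank line in the middle and one at the end
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('6,148,72\n\n1,85,66\n\n')
    sample_path = tmp.name

# A plain reader returns each blank line as an empty list
with open(sample_path, 'r', newline='') as file:
    raw_rows = list(csv.reader(file))

# The improved loader drops those empty lists
clean_rows = load_csv(sample_path)
os.remove(sample_path)

print(len(raw_rows), raw_rows)
print(len(clean_rows), clean_rows)
```

Skipping empty rows at load time also protects later steps such as len(dataset[0]) and column-wise conversion, which would otherwise fail on a row with no columns.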

1.2 Convert String to Floats
Most, if not all, machine learning algorithms prefer to work with numbers. Specifically, floating point numbers are preferred. Our code for loading a CSV file returns a dataset as a list of lists, but each value is a string. We can see this if we print out one record from the dataset:
print(dataset[0])
We can write a small function to convert specific columns of our loaded dataset to floating point values. Below is this function, called str_column_to_float(). It converts a given column in the dataset to floating point values, taking care to strip any whitespace from the value before making the conversion.
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
We can test this function by combining it with our load CSV function above and converting all of the numeric data in the Pima Indians dataset to floating point values. The complete example is below.
# Example of converting string variables to float
from csv import reader

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Load pima-indians-diabetes dataset
filename = 'pima-indians-diabetes.data.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
print(dataset[0])
# Convert string columns to float
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
print(dataset[0])
Running this example, we see the first row of the dataset printed both before and after the conversion. We can see that the values in each column have been converted from strings to numbers.
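The same before-and-after comparison can be reproduced without the data file. The sketch below applies str_column_to_float() to a couple of made-up rows shaped like the loader's output (the values and the names sample and before are assumptions for illustration, not the actual dataset contents).

```python
def str_column_to_float(dataset, column):
    """Convert one column of a list-of-lists dataset from strings to floats, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())

# Made-up rows: every value is a string, some with stray whitespace
sample = [
    ['6', ' 148', '33.6 ', '1'],
    ['1', '85', ' 26.6', '0'],
]

# Keep a copy of the original strings, because the conversion mutates the rows
before = [list(row) for row in sample]

for i in range(len(sample[0])):
    str_column_to_float(sample, i)

print(before[0])   # strings, as loaded
print(sample[0])   # floats, after conversion
```

Note that the conversion happens in place, so the original strings are gone unless a copy is kept. A value that is not a valid number, such as an empty string, makes float() raise a ValueError, so columns with missing or text values need to be handled before calling the function.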

Some machine learning algorithms prefer all values to be numeric, including the outcome or predicted value. We can convert the class value in the iris flowers dataset to an integer by creating a map.
- First, we locate all of the unique class values, which happen to be: Iris-setosa, Iris-versicolor and Iris-virginica.
- Next, we assign an integer value to each, such as: 0, 1 and 2.
- Finally, we replace all occurrences of class string values with their corresponding integer values.
Below is a function to do just that, called str_column_to_int(). Like the previously introduced str_column_to_float(), it operates on a single column in the dataset, and it returns the lookup dictionary it used so the mapping can be inspected.
# Example of integer encoding string class values
from csv import reader

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    # Collect the unique class values in the column
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    # Assign an integer to each unique value
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    # Replace each class string with its integer
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Load iris dataset
filename = 'iris.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
print(dataset[0])
# Convert the four numeric columns to float
for i in range(4):
    str_column_to_float(dataset, i)
# Convert class column to int
lookup = str_column_to_int(dataset, 4)
print(dataset[0])
print(lookup)
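One thing to be aware of: str_column_to_int() enumerates a set, and the iteration order of a set of strings is not guaranteed, so the integer assigned to each class can differ from one run to the next. If a reproducible mapping matters, one option is to sort the unique values first. The sketch below is a hypothetical variant (the name str_column_to_int_sorted and the sample rows are assumptions for illustration), and it also builds an inverse dictionary for turning integers back into class names.

```python
def str_column_to_int_sorted(dataset, column):
    """Hypothetical variant of str_column_to_int() with a reproducible mapping."""
    # Sorting the unique values fixes the order in which integers are assigned
    unique = sorted(set(row[column] for row in dataset))
    lookup = {value: i for i, value in enumerate(unique)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# A few sample rows shaped like iris records, with the class name in column 4
sample = [
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
    [7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'],
    [6.3, 3.3, 6.0, 2.5, 'Iris-virginica'],
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
]

lookup = str_column_to_int_sorted(sample, 4)
# Inverse mapping, useful for reporting predictions as class names
inverse = {i: name for name, i in lookup.items()}

print(lookup)
print([row[4] for row in sample])
print(inverse[2])
```

With alphabetical ordering, Iris-setosa maps to 0, Iris-versicolor to 1 and Iris-virginica to 2 on every run.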