How to use datasets in machine learning?
2022-06-30 21:49:00 【Program Yuanke】
How are datasets used in machine learning? A dataset is a collection of instances that share a common property. A machine learning system usually involves several different datasets, each playing a different role in the system.

How to use datasets in machine learning?
When an experienced data scientist works on an ML-related project, about 60% of the work goes into analyzing the dataset, which we call exploratory data analysis (EDA). This shows how important data is in machine learning. In the real world we often need to process huge amounts of data, which makes it impractical to read and compute over the data with plain pandas: it takes more time, and our computing resources are usually limited. To make this feasible, many AI researchers have come up with solutions and identified different techniques and ways of handling large datasets.
Now I will share the following techniques with some examples. For the practical implementation here I am using Google Colab, which has a RAM capacity of 12.72 GB.
Let's consider a dataset created with random numbers from 0 (inclusive) to 10 (exclusive), with 1,000,000 rows and 400 columns.
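The original code is only shown as a screenshot in the source, so here is a minimal sketch of what it might look like (the variable and column names are my assumptions):

```python
import numpy as np
import pandas as pd

# 1,000,000 rows x 400 columns of random integers in [0, 10)
df = pd.DataFrame(
    np.random.randint(0, 10, size=(1_000_000, 400)),
    columns=[f"col_{i}" for i in range(400)],
)
print(df.shape)  # (1000000, 400)
```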
The CPU time and wall time for executing the above code are as follows:

Now, let's convert this data frame to a CSV file.
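A sketch of that step, assuming the file name dataset.csv (the hdf5 file name mentioned later in the article suggests this name):

```python
# Write the DataFrame to disk; the resulting CSV is roughly 763 MB.
df.to_csv("dataset.csv", index=False)
```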
The CPU time and wall time for executing the above code are as follows:

Now, let's load the dataset we just generated (nearly 763 MB) with pandas and see what happens.
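The naive way of loading it, sketched below, reads the whole file into memory at once:

```python
import pandas as pd

# Parse the entire ~763 MB CSV in a single call; once pandas materialises
# it in memory, the limited Colab RAM can be exhausted.
df = pd.read_csv("dataset.csv")
```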
When you execute the above code, the notebook crashes because there is no RAM left available. Here I used a relatively small dataset of about 763 MB; now imagine situations where you need to process much larger amounts of data. What is the plan for solving this problem?
Techniques for handling large datasets:
1. Read the CSV file in chunks:

When we specify chunk_size while reading a large CSV file, the original data frame is broken into chunks and stored in a pandas parser object. We then iterate over this object and concatenate the pieces to rebuild the original data frame, which takes less time.
The CSV file generated above contains 1,000,000 rows and 400 columns, so if we read the CSV file with a chunk size of 100,000 rows, then:
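A minimal sketch of the chunked read (file and variable names are assumptions):

```python
import pandas as pd

# With chunksize set, read_csv returns an iterator (TextFileReader)
# instead of loading the whole file into one DataFrame.
chunks = pd.read_csv("dataset.csv", chunksize=100_000)
```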
The CPU time and wall time for executing the above code are as follows:

Now we need to iterate over the chunks, store them in a list, and concatenate them to form the complete dataset.
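A sketch of that step, continuing from the chunked reader above:

```python
import pandas as pd

pieces = []
for chunk in pd.read_csv("dataset.csv", chunksize=100_000):
    pieces.append(chunk)  # each chunk is an ordinary 100,000-row DataFrame

df = pd.concat(pieces, ignore_index=True)
print(df.shape)  # (1000000, 400)
```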
The CPU time and wall time for executing the above code are as follows:

We can observe a significant improvement in reading time. In this way, we can read large datasets with a shorter reading time and sometimes avoid system crashes.
2. Change the size of the data types:
If you want to improve performance when working with a large dataset and avoid the extra time it causes, you can change the data types of some columns, for example (int64 → int32) and (float64 → float32), to reduce the space they occupy, and then save the result to a CSV file for further use.
For example, if we apply this to the data frame obtained after chunked loading and compare the memory usage before and after, the memory footprint is reduced to roughly half, which eventually also reduces the CPU time.
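A minimal sketch of the downcasting, assuming df is the data frame from the chunked load above:

```python
# Report memory before downcasting
print(df.memory_usage(deep=True).sum() / 1024**3, "GB before")

# Convert 64-bit columns to their 32-bit counterparts
for col in df.columns:
    if df[col].dtype == "int64":
        df[col] = df[col].astype("int32")
    elif df[col].dtype == "float64":
        df[col] = df[col].astype("float32")

print(df.memory_usage(deep=True).sum() / 1024**3, "GB after")
```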
The memory usage before and after the data type conversion is as follows:


Here we can clearly observe that the memory usage is 3 GB before the data type conversion and 1.5 GB after it. If we then check performance, for example by computing the mean over the data frame before and after the conversion, the CPU time is reduced, and our goal is achieved.
3. Remove unwanted columns from the data frame:
We can delete unwanted columns from the dataset to reduce the memory usage of the loaded data frame, which also improves CPU performance.
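A minimal sketch, with placeholder column names:

```python
import pandas as pd

# df is the DataFrame loaded earlier; drop columns that are not needed.
df = df.drop(columns=["col_398", "col_399"])

# Or, better, avoid loading them at all by listing only the columns you need.
subset = pd.read_csv("dataset.csv", usecols=["col_0", "col_1", "col_2"])
```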
4. Change the data format:
Is your data stored as ASCII text, for example in a CSV file?
Maybe you can speed up data loading and use less memory by switching to another data format. A good example is a binary format such as GRIB, NetCDF or HDF. There are many command-line tools you can use to convert one data format into another without loading the entire dataset into memory. Another format may also let you store the data in a more compact representation that saves memory, for example 2-byte integers or 4-byte floats.
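As one hedged example of a binary format that pandas itself can write, here is HDF5 via to_hdf; this requires the optional tables package and is only a sketch, not necessarily the tooling the article had in mind:

```python
import pandas as pd

# df is the DataFrame from the earlier steps.
# Store it as HDF5 (binary) instead of CSV; requires the `tables` package.
df.to_hdf("dataset.h5", key="data", mode="w")

# Reading the binary file back is usually much faster than re-parsing CSV text.
df = pd.read_hdf("dataset.h5", key="data")
```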
5. Use the correct data types to reduce object size:
Usually, you can reduce the memory usage of a data frame by converting its columns to the correct data types. Almost all datasets include object data types, which are usually strings and are inefficient in memory. Dates and categorical features (such as region, city or place name) take up a lot of memory as strings, so if you convert them to the corresponding data types (such as DateTime or category), the memory usage can drop by a factor of ten or more.
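A small hypothetical example (the column names and values are made up) that shows the effect:

```python
import pandas as pd

# A frame with string-typed (object) columns
df = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Delhi", "Pune"] * 250_000,
    "date": ["2022-06-30"] * 1_000_000,
})
print(df.memory_usage(deep=True).sum())     # bytes used with object columns

df["city"] = df["city"].astype("category")  # low-cardinality strings -> category
df["date"] = pd.to_datetime(df["date"])     # string dates -> datetime64
print(df.memory_usage(deep=True).sum())     # noticeably smaller
```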
6. Use a fast-loading library such as Vaex:
Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate statistics such as the mean, sum, count and standard deviation on an N-dimensional grid at over one billion (10^9) samples/rows per second. Visualization is done using histograms, density plots and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy and lazy computation for best performance (no memory is wasted).
Now, let's apply the vaex library to the randomly generated dataset above and observe the performance.
1. First, we need to install the vaex library from the command prompt / shell, depending on the operating system you use.
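The basic installation command (exact package extras may vary by platform), followed by the import:

```python
# From a shell, or a Colab cell prefixed with "!":
#   pip install vaex
import vaex
```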
2. Then, we need to use the vaex library to convert the CSV file to an hdf5 file.
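A sketch using vaex.from_csv with convert=True, which streams the CSV in chunks and writes an HDF5 copy next to it (the chunk_size value is an assumption):

```python
import vaex

# convert=True writes dataset.csv.hdf5 alongside the original CSV,
# processing chunk_size rows at a time so memory use stays bounded.
dv = vaex.from_csv("dataset.csv", convert=True, chunk_size=100_000)
```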
After executing the above code, a dataset.csv.hdf5 file will be generated in your working directory. The execution time of the conversion is as follows:

As you can see, converting the CSV to an hdf5 file took nearly 39 seconds, which is fairly short considering the file size.
3. Read the hdf5 file using vaex:
Now we need to open the hdf5 file with the open function of the vaex library.
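A minimal sketch:

```python
import vaex

# Opening the HDF5 file memory-maps it, so even a multi-GB file
# opens almost instantly without loading everything into RAM.
dv = vaex.open("dataset.csv.hdf5")
print(len(dv))  # 1000000 rows
```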
Looking at the output of the above code, it takes about 697 milliseconds to read the hdf5 file, which shows how fast a 3 GB hdf5 file can be opened. This is the practical advantage of the vaex library.

By using vaex, we can perform different operations on large data frames, for example:
- Expression system
- Out-of-core DataFrames
- Fast group-by / aggregations
- Fast and efficient joins
If you want to explore more information about the vaex library, please click here.
Conclusion:
In this way, we can follow these techniques when dealing with large datasets in machine learning. If you found this useful, please like and support.