How to use data sets in machine learning?
2022-06-30 21:49:00 【Program Yuanke】
How are datasets used in machine learning? A dataset is a collection of instances that share a common set of properties. A machine learning project usually involves several different datasets, each playing a distinct role in the system.
How to use datasets in machine learning?
When an experienced data scientist works on an ML project, roughly 60% of the effort goes into analyzing the dataset, a phase we call exploratory data analysis (EDA). This means data plays a central role in machine learning. In the real world we often have to process huge amounts of data, which makes it impractical to load and compute everything with plain pandas: it takes too long and our working resources (RAM, CPU) are limited. To make this feasible, many AI researchers and practitioners have come up with different techniques for handling large datasets.
Below I share these techniques with some examples. For the practical implementation I am using Google Colab, which provides about 12.72 GB of RAM.
Let's consider a dataset created from random integers from 0 (inclusive) to 10 (exclusive), with 1,000,000 rows and 400 columns.
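The post's original code is not reproduced here, so the snippet below is a minimal sketch of how such a DataFrame could be generated with NumPy and pandas; the col_0 … col_399 column names are my own illustrative assumption. In Colab, the %%time cell magic can be used to measure the CPU and wall time of each step.

```python
# Minimal sketch (not the post's original code): build a 1,000,000 x 400
# DataFrame of random integers in [0, 10).
import numpy as np
import pandas as pd

data = np.random.randint(0, 10, size=(1_000_000, 400))
df = pd.DataFrame(data, columns=[f"col_{i}" for i in range(400)])  # column names assumed
print(df.shape)  # (1000000, 400)
```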
The CPU time and wall time for executing the code above are as follows:

Now let's write this DataFrame to a CSV file.
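A sketch of the export step; the file name dataset.csv is an assumption, chosen to match the dataset.csv.hdf5 file mentioned later in the post.

```python
# Write the DataFrame to disk as a CSV file (~763 MB for this data).
df.to_csv("dataset.csv", index=False)
```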
The CPU time and wall time for executing the code above are as follows:

Now let's load the generated dataset (nearly 763 MB) with pandas and see what happens.
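A sketch of the naive load the post describes; this single call is what can exhaust the ~12.7 GB of Colab RAM.

```python
import pandas as pd

# Load the whole CSV into memory at once -- this is the step that crashes the session.
df = pd.read_csv("dataset.csv")
```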
When you execute the code above, the notebook session crashes because not enough RAM is available. Here I used a relatively small dataset of about 763 MB; now consider situations where you need to process far larger amounts of data. So what is the plan to solve this problem?
Techniques for handling large datasets:
1. Read the CSV file in chunks:

When we read a large CSV file with a specified chunk size, the data is broken into chunks and returned as a pandas parser (iterator) object. We then iterate over this object and concatenate the chunks to rebuild the full DataFrame, which takes less time.
The generated CSV file contains 1,000,000 rows and 400 columns, so if we read the CSV file with a chunk size of 100,000 rows, we get the following (see the sketch below):
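A sketch of the chunked read; note that the pandas keyword is chunksize, and that the call returns a TextFileReader iterator rather than a DataFrame.

```python
import pandas as pd

# Returns an iterator (TextFileReader) that yields DataFrames of 100,000 rows each.
reader = pd.read_csv("dataset.csv", chunksize=100_000)
```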
The CPU time and wall time for executing the code above are as follows:

Now we need to iterate over the chunks, store them in a list, and concatenate them to form the complete dataset.
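A sketch of rebuilding the full DataFrame from the chunks (the iterator is re-created here so the snippet is self-contained).

```python
import pandas as pd

# Iterate over the chunks, collect them in a list, then concatenate them.
chunks = []
for chunk in pd.read_csv("dataset.csv", chunksize=100_000):
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
print(df.shape)  # (1000000, 400)
```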
The CPU time and wall time for executing the code above are as follows:

We can observe a significant improvement in reading time. In this way we can read large datasets with a shorter loading time and sometimes avoid crashing the system.
2. Change the size of the data types:
To improve performance when doing anything with a large dataset and avoid the extra time it costs, we can shrink the data types of some columns, for example int64→int32 or float64→float32, to reduce the space they occupy, and then save the result back to a CSV file for further use.
For example, if we apply this to the DataFrame obtained after chunked reading and compare the memory usage before and after, the memory footprint is cut roughly in half, which ultimately also reduces CPU time.
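A sketch of the downcasting step and the before/after memory comparison; the exact code in the post is not shown, so this only illustrates the idea.

```python
import numpy as np

# The values are all in [0, 10), so int32 (or even int8) is more than enough.
df_small = df.astype(np.int32)

print(df.memory_usage(deep=True).sum() / 1024**3, "GB")        # roughly 3 GB
print(df_small.memory_usage(deep=True).sum() / 1024**3, "GB")  # roughly 1.5 GB
```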
The memory usage before and after the data type conversion is as follows:


Here we can clearly see that memory usage is about 3 GB before the data type conversion and about 1.5 GB after it. If we then measure performance, for example by computing the mean over the DataFrame before and after the conversion, the CPU time drops, which is exactly the goal.
3. Remove unwanted columns from the DataFrame:
We can drop unwanted columns from the dataset to reduce the memory usage of the loaded DataFrame, which also improves CPU performance.
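A one-line sketch; the column names are hypothetical and stand in for whatever columns your analysis does not need.

```python
# Drop columns that are not needed for the analysis (names are illustrative).
df = df.drop(columns=["col_398", "col_399"])
```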
4. Change the data format:
Is your data stored as ASCII text, such as a CSV file?
Perhaps you can speed up data loading and use less memory by switching to another data format. Good examples are binary formats such as GRIB, NetCDF, or HDF. There are many command-line tools that can convert one data format into another without loading the entire dataset into memory. Using another format may also let you store the data in a more compact form that saves memory, such as 2-byte integers or 4-byte floats.
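As a hedged illustration of switching formats, pandas itself can round-trip a DataFrame through binary HDF5; this requires the tables package and is not the post's own example.

```python
import pandas as pd

# Store the DataFrame in binary HDF5 instead of CSV, then read it back.
df.to_hdf("dataset.h5", key="data", mode="w")
df2 = pd.read_hdf("dataset.h5", key="data")
```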
5. Use the correct data types to reduce object size:
Usually you can reduce the memory usage of a DataFrame by converting its columns to the correct data types. Almost every dataset contains object columns, which are stored as strings and are memory-inefficient. Dates and categorical features (such as region, city, or place names) take up much more memory as plain strings, so converting them to the appropriate types (such as datetime or category) can reduce their memory usage by a factor of 10 or more.
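A sketch of the idea on a small made-up DataFrame; the generated dataset above is purely numeric, so the city and date columns here are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune", "Chennai"] * 250_000,                    # categorical feature
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"] * 250_000,  # dates as strings
})

print(sales.memory_usage(deep=True).sum() / 1024**2, "MB")  # object dtypes

sales["city"] = sales["city"].astype("category")  # few unique values -> small integer codes
sales["date"] = pd.to_datetime(sales["date"])     # datetime64[ns] instead of strings

print(sales.memory_usage(deep=True).sum() / 1024**2, "MB")  # much smaller
```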
6. Use a fast loading library such as Vaex:
Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to pandas) that lets you visualize and explore large tabular datasets. It can compute statistics such as the mean, sum, count, and standard deviation on an N-dimensional grid at a rate of more than a billion (10^9) samples/rows per second. Visualization is done with histograms, density plots, and 3D volume rendering, which allows interactive exploration of big data. Vaex achieves this performance without wasting memory through memory mapping, a zero-memory-copy policy, and lazy computation.
Now let's apply the vaex library to the randomly generated dataset above and observe its performance.
1. First, we need to install the vaex library from the command prompt / shell, depending on the operating system you use (for example with pip install vaex).
2. Then we need to use the vaex library to convert the CSV file to an hdf5 file.
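A sketch of the conversion step: vaex.from_csv with convert=True writes an .hdf5 file next to the CSV; the chunk_size value here is an assumption.

```python
# pip install vaex   (run once in the shell or a Colab cell)
import vaex

# Converts dataset.csv chunk by chunk and writes dataset.csv.hdf5 to disk.
vaex_df = vaex.from_csv("dataset.csv", convert=True, chunk_size=5_000_000)
```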
After executing the code above, a dataset.csv.hdf5 file will be generated in your working directory. The time taken by the conversion is as follows:

As we can see, converting the CSV to an hdf5 file took nearly 39 seconds, which is fairly short relative to the file size.
3. Use vaex to read the hdf5 file:
Now we need to open the hdf5 file with the open function from the vaex library.
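A sketch of opening the converted file; vaex memory-maps it, so the call returns almost immediately instead of loading everything into RAM.

```python
import vaex

vaex_df = vaex.open("dataset.csv.hdf5")
print(vaex_df)  # prints a preview without materializing the data in RAM
```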
Looking at the output of the code above, it takes only about 697 milliseconds to read the hdf5 file, which shows how quickly a roughly 3 GB hdf5 file can be opened. This is the practical advantage of the vaex library.

By using vaex, we can perform various operations on large DataFrames, for example (a short sketch follows the list below):
- Expression system
- Out-of-core DataFrames
- Fast grouping / aggregations
- Fast and efficient joins
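A brief, hedged taste of these features on the opened DataFrame; the column names follow the col_0 … naming assumed earlier.

```python
import vaex

# Expression system: a lazy virtual column, no data is copied.
vaex_df["col_sum"] = vaex_df.col_0 + vaex_df.col_1

# Out-of-core statistics computed directly from the memory-mapped file.
print(vaex_df.col_0.mean())

# Fast group-by / aggregation.
print(vaex_df.groupby(by="col_0", agg=vaex.agg.count()))
```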
If you want to explore more about the vaex library, please click here.
Conclusion:
In this way, we can apply these techniques when dealing with large datasets in machine learning. If you found this useful, please like and support.