How to use data sets in machine learning?
2022-06-30 21:49:00 【Program Yuanke】
How are datasets used in machine learning? A dataset is a collection of instances that share a common set of properties. A machine learning project usually involves several different datasets, each playing a distinct role in the system.
How to use datasets in machine learning?
When an experienced data scientist works on an ML project, roughly 60% of the effort goes into analyzing the dataset, a phase we call exploratory data analysis (EDA). This means data plays a central role in machine learning. In the real world we often have to process huge amounts of data, which makes it impractical to load and compute everything with plain pandas: it takes too long and our working resources (RAM, CPU) are limited. To make this feasible, many AI researchers and practitioners have come up with different techniques for handling large datasets.
Below I share these techniques with some examples. For the practical implementation I am using Google Colab, which provides about 12.72 GB of RAM.
Let's consider a dataset created from random integers from 0 (inclusive) to 10 (exclusive), with 1,000,000 rows and 400 columns.
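The post's original code is not reproduced here, so the snippet below is a minimal sketch of how such a DataFrame could be generated with NumPy and pandas; the col_0 … col_399 column names are my own illustrative assumption. In Colab, the %%time cell magic can be used to measure the CPU and wall time of each step.

```python
# Minimal sketch (not the post's original code): build a 1,000,000 x 400
# DataFrame of random integers in [0, 10).
import numpy as np
import pandas as pd

data = np.random.randint(0, 10, size=(1_000_000, 400))
df = pd.DataFrame(data, columns=[f"col_{i}" for i in range(400)])  # column names assumed
print(df.shape)  # (1000000, 400)
```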
The CPU time and wall time for executing the code above are as follows:

Now let's write this DataFrame to a CSV file.
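A sketch of the export step; the file name dataset.csv is an assumption, chosen to match the dataset.csv.hdf5 file mentioned later in the post.

```python
# Write the DataFrame to disk as a CSV file (~763 MB for this data).
df.to_csv("dataset.csv", index=False)
```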
The CPU time and wall time for executing the code above are as follows:

Now let's load the generated dataset (nearly 763 MB) with pandas and see what happens.
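A sketch of the naive load the post describes; this single call is what can exhaust the ~12.7 GB of Colab RAM.

```python
import pandas as pd

# Load the whole CSV into memory at once -- this is the step that crashes the session.
df = pd.read_csv("dataset.csv")
```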
When you execute the code above, the notebook session crashes because not enough RAM is available. Here I used a relatively small dataset of about 763 MB; now consider situations where you need to process far larger amounts of data. So what is the plan to solve this problem?
Techniques for handling large datasets:
1. Read the CSV file in chunks:

When we read a large CSV file with a specified chunk size, the data is broken into chunks and returned as a pandas parser (iterator) object. We then iterate over this object and concatenate the chunks to rebuild the full DataFrame, which takes less time.
The generated CSV file contains 1,000,000 rows and 400 columns, so if we read the CSV file with a chunk size of 100,000 rows, we get the following (see the sketch below):
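A sketch of the chunked read; note that the pandas keyword is chunksize, and that the call returns a TextFileReader iterator rather than a DataFrame.

```python
import pandas as pd

# Returns an iterator (TextFileReader) that yields DataFrames of 100,000 rows each.
reader = pd.read_csv("dataset.csv", chunksize=100_000)
```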
The CPU time and wall time for executing the code above are as follows:

Now we need to iterate over the chunks, store them in a list, and concatenate them to form the complete dataset.
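A sketch of rebuilding the full DataFrame from the chunks (the iterator is re-created here so the snippet is self-contained).

```python
import pandas as pd

# Iterate over the chunks, collect them in a list, then concatenate them.
chunks = []
for chunk in pd.read_csv("dataset.csv", chunksize=100_000):
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
print(df.shape)  # (1000000, 400)
```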
The CPU time and wall time for executing the code above are as follows:

We can observe a significant improvement in reading time. In this way we can read large datasets with a shorter loading time and sometimes avoid crashing the system.
2. Change the size of the data types:
To improve performance when doing anything with a large dataset and avoid the extra time it costs, we can shrink the data types of some columns, for example int64→int32 or float64→float32, to reduce the space they occupy, and then save the result back to a CSV file for further use.
For example, if we apply this to the DataFrame obtained after chunked reading and compare the memory usage before and after, the memory footprint is cut roughly in half, which ultimately also reduces CPU time.
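A sketch of the downcasting step and the before/after memory comparison; the exact code in the post is not shown, so this only illustrates the idea.

```python
import numpy as np

# The values are all in [0, 10), so int32 (or even int8) is more than enough.
df_small = df.astype(np.int32)

print(df.memory_usage(deep=True).sum() / 1024**3, "GB")        # roughly 3 GB
print(df_small.memory_usage(deep=True).sum() / 1024**3, "GB")  # roughly 1.5 GB
```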
The memory usage before and after the data type conversion is as follows:


Here we can clearly see that memory usage is about 3 GB before the data type conversion and about 1.5 GB after it. If we then measure performance, for example by computing the mean over the DataFrame before and after the conversion, the CPU time drops, which is exactly the goal.
3. Remove unwanted columns from the DataFrame:
We can drop unwanted columns from the dataset to reduce the memory usage of the loaded DataFrame, which also improves CPU performance.
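A one-line sketch; the column names are hypothetical and stand in for whatever columns your analysis does not need.

```python
# Drop columns that are not needed for the analysis (names are illustrative).
df = df.drop(columns=["col_398", "col_399"])
```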
4. Change the data format:
Is your data stored as ASCII text, such as a CSV file?
Perhaps you can speed up data loading and use less memory by switching to another data format. Good examples are binary formats such as GRIB, NetCDF, or HDF. There are many command-line tools that can convert one data format into another without loading the entire dataset into memory. Using another format may also let you store the data in a more compact form that saves memory, such as 2-byte integers or 4-byte floats.
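As a hedged illustration of switching formats, pandas itself can round-trip a DataFrame through binary HDF5; this requires the tables package and is not the post's own example.

```python
import pandas as pd

# Store the DataFrame in binary HDF5 instead of CSV, then read it back.
df.to_hdf("dataset.h5", key="data", mode="w")
df2 = pd.read_hdf("dataset.h5", key="data")
```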
5. Use the correct data types to reduce object size:
Usually you can reduce the memory usage of a DataFrame by converting its columns to the correct data types. Almost every dataset contains object columns, which are stored as strings and are memory-inefficient. Dates and categorical features (such as region, city, or place names) take up much more memory as plain strings, so converting them to the appropriate types (such as datetime or category) can reduce their memory usage by a factor of 10 or more.
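A sketch of the idea on a small made-up DataFrame; the generated dataset above is purely numeric, so the city and date columns here are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune", "Chennai"] * 250_000,                    # categorical feature
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"] * 250_000,  # dates as strings
})

print(sales.memory_usage(deep=True).sum() / 1024**2, "MB")  # object dtypes

sales["city"] = sales["city"].astype("category")  # few unique values -> small integer codes
sales["date"] = pd.to_datetime(sales["date"])     # datetime64[ns] instead of strings

print(sales.memory_usage(deep=True).sum() / 1024**2, "MB")  # much smaller
```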
6. Use a fast loading library such as Vaex:
Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to pandas) that lets you visualize and explore large tabular datasets. It can compute statistics such as the mean, sum, count, and standard deviation on an N-dimensional grid at a rate of more than a billion (10^9) samples/rows per second. Visualization is done with histograms, density plots, and 3D volume rendering, which allows interactive exploration of big data. Vaex achieves this performance without wasting memory through memory mapping, a zero-memory-copy policy, and lazy computation.
Now let's apply the vaex library to the randomly generated dataset above and observe its performance.
1. First, we need to install the vaex library from the command prompt / shell, depending on the operating system you use (for example with pip install vaex).
2. Then we need to use the vaex library to convert the CSV file to an hdf5 file.
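A sketch of the conversion step: vaex.from_csv with convert=True writes an .hdf5 file next to the CSV; the chunk_size value here is an assumption.

```python
# pip install vaex   (run once in the shell or a Colab cell)
import vaex

# Converts dataset.csv chunk by chunk and writes dataset.csv.hdf5 to disk.
vaex_df = vaex.from_csv("dataset.csv", convert=True, chunk_size=5_000_000)
```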
After executing the code above, a dataset.csv.hdf5 file will be generated in your working directory. The time taken by the conversion is as follows:

As we can see, converting the CSV to an hdf5 file took nearly 39 seconds, which is fairly short relative to the file size.
3. Use vaex to read the hdf5 file:
Now we need to open the hdf5 file with the open function from the vaex library.
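A sketch of opening the converted file; vaex memory-maps it, so the call returns almost immediately instead of loading everything into RAM.

```python
import vaex

vaex_df = vaex.open("dataset.csv.hdf5")
print(vaex_df)  # prints a preview without materializing the data in RAM
```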
Looking at the output of the code above, it takes only about 697 milliseconds to read the hdf5 file, which shows how quickly a roughly 3 GB hdf5 file can be opened. This is the practical advantage of the vaex library.

By using vaex, we can perform various operations on large DataFrames, for example (a short sketch follows the list below):
- Expression system
- Out-of-core DataFrames
- Fast grouping / aggregations
- Fast and efficient joins
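A brief, hedged taste of these features on the opened DataFrame; the column names follow the col_0 … naming assumed earlier.

```python
import vaex

# Expression system: a lazy virtual column, no data is copied.
vaex_df["col_sum"] = vaex_df.col_0 + vaex_df.col_1

# Out-of-core statistics computed directly from the memory-mapped file.
print(vaex_df.col_0.mean())

# Fast group-by / aggregation.
print(vaex_df.groupby(by="col_0", agg=vaex.agg.count()))
```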
If you want to explore more about the vaex library, please click here.
Conclusion:
In this way, we can apply these techniques when dealing with large datasets in machine learning. If you found this useful, please like and support.