Machine learning - Data Science Library Day 3 - Notes

2022-07-01 12:04:00 【weixin_ forty-five million six hundred and forty-nine thousand 】

What is numpy?

numpy is a foundational library for scientific computing in Python that focuses on numerical computation. Most of Python's scientific computing libraries are built on it, and it is mostly used to perform numeric operations on large, multidimensional arrays.

Axis (axis)

In numpy an axis can be understood as a direction, denoted by the numbers 0, 1, 2, ... A one-dimensional array has only axis 0. A 2-dimensional array (shape (2, 2)) has axis 0 and axis 1. A three-dimensional array (shape (2, 2, 3)) has axes 0, 1 and 2.

With the concept of an axis, calculations become more convenient. For example, to compute the mean of a 2-dimensional array, you must specify the direction along which the numbers are averaged.
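As a minimal sketch of the point above, the following averages a small 2-dimensional array along each axis. The array values are made up for illustration.

```python
import numpy as np

# A small 2-D array with shape (2, 3); the values are illustrative only.
t = np.arange(6).reshape((2, 3))   # [[0 1 2], [3 4 5]]

mean_axis0 = t.mean(axis=0)   # average down the rows: one value per column
mean_axis1 = t.mean(axis=1)   # average across the columns: one value per row
mean_all = t.mean()           # no axis given: average of every element

print(mean_axis0)  # [1.5 2.5 3.5]
print(mean_axis1)  # [1. 4.]
print(mean_all)    # 2.5
```

The axis that is named is the one that disappears from the result: averaging a (2, 3) array along axis 0 leaves 3 values, and along axis 1 leaves 2 values.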
Creating arrays

Modifying the shape of an array

Operations between arrays

Transpose matrix
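The original screenshots for the four topics above are not available, so here is a minimal sketch of what each one typically looks like. The variable names and values are assumptions chosen for illustration, not the author's original examples.

```python
import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4, 5, 6])
b = np.arange(6)                 # 0, 1, ..., 5

# Modifying the shape: reshape returns a reshaped array, a keeps its own shape
a2 = a.reshape((2, 3))
b2 = b.reshape((2, 3))

# Operations between arrays of the same shape are applied element by element
total = a2 + b2
scaled = a2 * 2                  # a scalar is applied to every element

# Transposing swaps the axes, so shape (2, 3) becomes (3, 2)
a2_t = a2.T

print(a.shape, a2.shape, a2_t.shape)  # (6,) (2, 3) (3, 2)
print(total)
print(scaled)
```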

Reading data with numpy
CSV: Comma-Separated Values, a comma-separated values file
Displayed as: a table
Source file: formatted text in which newlines separate the rows and commas separate the columns; each row of data represents one record
Because CSV is easy to display, read and write, it is used in many places to store and transfer small and medium-sized data. For teaching convenience we will often work with CSV files, but working with data in a database is just as easy.
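A minimal sketch of reading CSV data with np.loadtxt follows. To keep it self-contained it reads from an in-memory string instead of a file on disk; the CSV content is hypothetical.

```python
import io
import numpy as np

# Hypothetical CSV content, used here in place of a file on disk.
csv_text = "10,20,30\n40,50,60\n"

# Each line of the source becomes one row of the array
rows = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype="int")

# unpack=True transposes the result, so each column of the source becomes one row
cols = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype="int", unpack=True)

print(rows)
print(cols)
```

With a real file, the first argument would be the file path instead of the StringIO object.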

Reading and storing data with numpy

# coding=utf-8
import numpy as np
us_file_path = "./youtube_video_data/US_video_data_numbers.csv"
uk_file_path = "./youtube_video_data/GB_video_data_numbers.csv"
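# With unpack=True, loadtxt transposes the result so that each column of the file becomes a row
# t1 = np.loadtxt(us_file_path,delimiter=",",dtype="int",unpack=True)
t2 = np.loadtxt(us_file_path,delimiter=",",dtype="int")
# print(t1)
print(t2)
print("*"*100)

# Take rows 2 to 4 and columns 1 to 3 (a contiguous block)
b = t2[2:5,1:4]
# print(b)
# Take several non-adjacent points
# The selected elements are at (0,0), (2,1), (2,3)
c = t2[[0,2,2],[0,1,3]]
print(c)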

Boolean indexing in numpy
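The original screenshot is not available. As a minimal sketch, boolean indexing usually looks like the following; the array and the thresholds are assumptions chosen for illustration.

```python
import numpy as np

t = np.arange(12).reshape((3, 4))

# A comparison produces a boolean array with the same shape as t
mask = t > 6

# Indexing with the mask returns the selected elements as a 1-D array (a copy)
selected = t[mask]

# Assigning through a boolean index changes only the selected elements
t[t < 4] = 0

print(mask)
print(selected)
print(t)
```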

The ternary operator in numpy
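The original screenshot is not available. The numpy counterpart of a ternary expression is most likely np.where, so this sketch assumes that function; the array and the replacement values are illustrative.

```python
import numpy as np

t = np.arange(12).reshape((3, 4))

# np.where(condition, x, y): take x where the condition holds and y elsewhere
result = np.where(t < 6, 0, 10)

print(result)
```

np.where returns a new array and leaves t unchanged.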

nan and inf in numpy

nan (NAN, Nan): not a number.

nan appears when we read a local file as float and a value is missing.
It also appears as the result of an invalid calculation (for example, infinity (inf) minus infinity).
inf (-inf, inf): infinity. inf is positive infinity and -inf is negative infinity.
When does inf (either -inf or +inf) appear?
For example, when a nonzero number is divided by 0 (plain Python raises an error directly, while numpy returns inf or -inf).
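A minimal sketch of the two cases described above. np.errstate is used here only to silence the division warning that numpy would otherwise print; the sample values are illustrative.

```python
import numpy as np

# nan and inf are ordinary Python floats
print(type(np.nan), type(np.inf))

# inf minus inf is not a meaningful number, so the result is nan
diff = np.inf - np.inf

# Dividing a nonzero array element by zero gives inf or -inf instead of raising
with np.errstate(divide="ignore"):
    pos = np.array([1.0]) / 0
    neg = np.array([-1.0]) / 0

print(diff, pos, neg)
```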

Points to note about nan in numpy

1. Two nan values are not equal to each other
2. np.nan != np.nan
3. Using this property, you can count the number of nan values in an array
4. To check whether a value is nan, use np.isnan(a)
5. Any calculation involving nan results in nan
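The five points above can be sketched as follows; the sample array is made up for illustration.

```python
import numpy as np

t = np.array([1.0, np.nan, 3.0, np.nan])

print(np.nan == np.nan)   # False
print(np.nan != np.nan)   # True

# t != t is True only at the nan positions, so counting the True values counts the nan
nan_count = np.count_nonzero(t != t)

# np.isnan marks the same positions explicitly
nan_mask = np.isnan(t)

# nan propagates through calculations, so the sum is nan
total = t.sum()

print(nan_count, nan_mask, total)
```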

Statistical functions commonly used in numpy
Sum: t.sum(axis=None)
Mean: t.mean(axis=None), strongly affected by outliers
Median: np.median(t,axis=None)
Maximum: t.max(axis=None)
Minimum: t.min(axis=None)
Range: np.ptp(t,axis=None), that is, the difference between the maximum and the minimum
Standard deviation: t.std(axis=None)
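A minimal sketch of the functions listed above, applied to a small made-up array. The value 60 is a deliberate outlier, included to show that it pulls the mean far more than the median.

```python
import numpy as np

t = np.array([[1, 2, 3],
              [4, 5, 60]])

col_sums = t.sum(axis=0)      # sum of each column
mean_all = t.mean()           # mean of all elements, pulled up by the outlier 60
median_all = np.median(t)     # median of all elements, barely affected by it
row_max = t.max(axis=1)       # maximum of each row
row_min = t.min(axis=1)       # minimum of each row
value_range = np.ptp(t)       # maximum minus minimum over the whole array
std_all = t.std()             # standard deviation of all elements

print(col_sums, mean_all, median_all)
print(row_max, row_min, value_range, std_all)
```

With axis=None (the default) each function works on the whole array; passing axis=0 or axis=1 gives one result per column or per row.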

Filling nan in numpy

# coding=utf-8
import numpy as np


def fill_ndarray(t1):
    """Replace the nan values in each column with the mean of that column's non-nan values (in place)."""
    for i in range(t1.shape[1]):  # iterate over the columns
        temp_col = t1[:, i]  # the current column (a view into t1)
        nan_num = np.count_nonzero(temp_col != temp_col)  # nan != nan, so this counts the nan values
        if nan_num != 0:  # non-zero means the current column contains nan
            temp_not_nan_col = temp_col[temp_col == temp_col]  # the non-nan values of the current column
            # assign the mean of the non-nan values to the nan positions
            temp_col[np.isnan(temp_col)] = temp_not_nan_col.mean()
    return t1


if __name__ == '__main__':
    t1 = np.arange(24).reshape((4, 6)).astype("float")
    t1[1, 2:] = np.nan
    print(t1)
    t1 = fill_ndarray(t1)
    print(t1)

【Hands on】 Using the data for 1000 YouTube videos each from the UK and the US, together with matplotlib from the earlier notes, draw a histogram of the number of comments

import numpy as np
from matplotlib import pyplot as plt
us_file_path = "./youtube_video_data/US_video_data_numbers.csv"
uk_file_path = "./youtube_video_data/GB_video_data_numbers.csv"
# t1 = np.loadtxt(us_file_path,delimiter=",",dtype="int",unpack=True)
t_us = np.loadtxt(us_file_path,delimiter=",",dtype="int")
# Take the comment counts (the last column)
t_us_comments = t_us[:,-1]
# Keep only the videos with at most 5000 comments
t_us_comments = t_us_comments[t_us_comments<=5000]
print(t_us_comments.max(),t_us_comments.min())
d = 50  # bin width
bin_nums = (t_us_comments.max()-t_us_comments.min())//d
# Plot the histogram
plt.figure(figsize=(20,8),dpi=80)
plt.hist(t_us_comments,bin_nums)
plt.show()

【Hands on】 We want to understand the relationship between the number of comments and the number of likes on YouTube videos. How should the plot be drawn?

import numpy as np
from matplotlib import  pyplot as plt
us_file_path = "./youtube_video_data/US_video_data_numbers.csv"
uk_file_path = "./youtube_video_data/GB_video_data_numbers.csv"
t_uk = np.loadtxt(uk_file_path,delimiter=",",dtype="int")
# Keep only the videos with at most 500,000 likes
t_uk = t_uk[t_uk[:,1]<=500000]
t_uk_comment = t_uk[:,-1]
t_uk_like = t_uk[:,1]
plt.figure(figsize=(20,8),dpi=80)
plt.scatter(t_uk_like,t_uk_comment)
plt.show()

Row and column swapping of arrays

Concatenating arrays horizontally or vertically is very simple, but what should we pay attention to before concatenating?
When concatenating vertically, each column must represent the same thing in both arrays! Otherwise the rows that end up stacked together will not match.
If the columns have different meanings, swap the columns of one data set so that they line up with the other.
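A minimal sketch of swapping rows and columns with index lists; the array is made up for illustration.

```python
import numpy as np

t = np.arange(12).reshape((3, 4))

# Swap rows 0 and 2: the right-hand side is copied before it is assigned
t[[0, 2], :] = t[[2, 0], :]

# Swap columns 1 and 3 in the same way
t[:, [1, 3]] = t[:, [3, 1]]

print(t)
```

Because indexing with a list returns a copy, the right-hand side is fully evaluated before the assignment, so no temporary variable is needed.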

【Hands on】 Now we want to study and analyze the data of the two countries from the previous case together, while keeping the country information (which country each record comes from). What should we do?

import numpy as np
us_file_path = "./youtube_video_data/US_video_data_numbers.csv"
uk_file_path = "./youtube_video_data/GB_video_data_numbers.csv"
# Load the data for each country
us_data = np.loadtxt(us_file_path,delimiter=",",dtype=int)
uk_data = np.loadtxt(uk_file_path,delimiter=",",dtype=int)
# Add the country information
# Build a column of all 0s (US) and a column of all 1s (UK)
zeros_data = np.zeros((us_data.shape[0],1)).astype(int)
ones_data = np.ones((uk_data.shape[0],1)).astype(int)
# Append the column of 0s or 1s to each data set
us_data = np.hstack((us_data,zeros_data))
uk_data = np.hstack((uk_data,ones_data))
# Concatenate the two data sets vertically
final_data = np.vstack((us_data,uk_data))
print(final_data)

More handy numpy methods

1. Get the positions of the maximum and minimum values:
np.argmax(t,axis=0)
np.argmin(t,axis=1)
2. Create an array of all zeros: np.zeros((3,4))
3. Create an array of all ones: np.ones((3,4))
4. Create a square array (matrix) with ones on the diagonal: np.eye(3)
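The methods in the list above can be tried on a small array. The values below are made up for illustration.

```python
import numpy as np

t = np.array([[3, 9, 1],
              [7, 2, 8]])

max_pos = np.argmax(t, axis=0)   # row index of the maximum in each column
min_pos = np.argmin(t, axis=1)   # column index of the minimum in each row

zeros = np.zeros((3, 4))         # 3 x 4 array of 0.0
ones = np.ones((3, 4))           # 3 x 4 array of 1.0
identity = np.eye(3)             # 3 x 3 array with 1.0 on the diagonal

print(max_pos)  # [1 0 1]
print(min_pos)  # [2 1]
print(identity)
```

Note that np.zeros, np.ones and np.eye produce float arrays by default; pass dtype=int if integers are needed.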

Copyright notice
This article was written by [weixin_ forty-five million six hundred and forty-nine thousand ]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/02/202202160037193761.html