
[Hard-core practical guide] Which is better for data analysis: Pandas or SQL?

2022-07-05 19:33:00 Xinyi 2002

Another week has gone by. Today I want to talk about the syntactic differences between Pandas and SQL. For many data analysts, both the Pandas module and SQL are tools used constantly in daily study and work. We can also run SQL statements from within Pandas by calling the read_sql() method.

To get the source code for this tutorial, reply 【20220704】 in the official account's backend.

Building a database

First we create the tables of a new database with SQL statements. The basic syntax should be familiar to everyone:

CREATE TABLE table_name (
   column_name data_type ...
)

Let's take a look at the specific code

import pandas as pd
import sqlite3
connector = sqlite3.connect('public.db')
my_cursor = connector.cursor()
my_cursor.executescript("""
CREATE TABLE sweets_types
(
    id integer NOT NULL,
    name character varying NOT NULL,
    PRIMARY KEY (id)
);
-- ... remaining tables omitted for space; see the source code for details ...
""")

We also insert data into these new tables. The code is as follows

my_cursor.executescript("""
INSERT INTO sweets_types(name) VALUES
    ('waffles'),
    ('candy'),
    ('marmalade'),
    ('cookies'),
    ('chocolate');
-- ... remaining inserts omitted for space; see the source code for details ...
""")

We can query a new table with the following code and load the result as a DataFrame

df_sweets = pd.read_sql("SELECT * FROM sweets;", connector)


We built 5 datasets in total, mainly covering desserts, dessert types, and processing and storage data. The dessert dataset, for example, mainly includes each dessert's weight, sugar content, production and expiration dates, cost, and other fields. Next is the manufacturers dataset

df_manufacturers = pd.read_sql("SELECT * FROM manufacturers", connector)


The manufacturers dataset includes each factory's main contact person and contact information, while the warehouse dataset includes each warehouse's detailed address, city, and so on

df_storehouses = pd.read_sql("SELECT * FROM storehouses", connector)


And finally the dessert types dataset

df_sweets_types = pd.read_sql("SELECT * FROM sweets_types;", connector)

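The create, insert, and read steps above can be tried end to end without touching a file on disk. The following is a minimal sketch, assuming an in-memory SQLite database; it uses only the sweets_types table and rows shown in this article, and the variable names (conn, df_types) are chosen here for illustration.

```python
import sqlite3

import pandas as pd

# An in-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Create the table and insert the five rows shown in the article.
conn.executescript("""
CREATE TABLE sweets_types
(
    id integer NOT NULL,
    name character varying NOT NULL,
    PRIMARY KEY (id)
);
INSERT INTO sweets_types(name) VALUES
    ('waffles'),
    ('candy'),
    ('marmalade'),
    ('cookies'),
    ('chocolate');
""")

# read_sql runs the query and returns the result as a DataFrame.
df_types = pd.read_sql("SELECT * FROM sweets_types ORDER BY id", conn)
print(df_types)

conn.close()
```

Because id is a single-column integer primary key, SQLite fills it in automatically (1 to 5 here) even though the INSERT only lists name.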

Filtering data

Filtering on a simple condition

Next, let's filter some data, for example the names of desserts whose weight equals 300. In Pandas the code looks like this

# Convert the column to a numeric type
df_sweets['weight'] = pd.to_numeric(df_sweets['weight'])
# Output the result
df_sweets[df_sweets.weight == 300].name

output

1      Mikus
6     Soucus
11     Macus
Name: name, dtype: object

We can also run the equivalent SQL statement through the read_sql() method in pandas

pd.read_sql("SELECT name FROM sweets WHERE weight = '300'", connector)


Let's look at a similar case: filtering for the names of desserts whose cost equals 100. The code is as follows

# Pandas
df_sweets['cost'] = pd.to_numeric(df_sweets['cost'])
df_sweets[df_sweets.cost == 100].name

# SQL
pd.read_sql("SELECT name FROM sweets WHERE cost = '100'", connector)

output

Milty
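To compare the two styles side by side, here is a minimal sketch, assuming a hypothetical miniature sweets table with numeric columns (the names Alpha, Beta, Gamma and their values are invented for illustration and are not the author's data). It also passes the filter value through the params argument of read_sql instead of writing it into the SQL string.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# Hypothetical sample rows, not taken from the article's database.
conn.executescript("""
CREATE TABLE sweets (id INTEGER PRIMARY KEY, name TEXT, weight INTEGER, cost INTEGER);
INSERT INTO sweets(name, weight, cost) VALUES
    ('Alpha', 300, 100),
    ('Beta', 250, 150),
    ('Gamma', 300, 150);
""")
df = pd.read_sql("SELECT * FROM sweets", conn)

# Pandas: build a boolean mask, then pick the name column.
pandas_names = df.loc[df["weight"] == 300, "name"].tolist()

# SQL: the ? placeholder is filled in from params by the sqlite3 driver.
sql_names = pd.read_sql(
    "SELECT name FROM sweets WHERE weight = ?", conn, params=(300,)
)["name"].tolist()

print(pandas_names, sql_names)
conn.close()
```

Placeholders keep values out of the query text, which is the safer habit once the value comes from user input.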

For text columns, we can also filter on string patterns. The code is as follows

# Pandas
df_sweets[df_sweets.name.str.startswith('M')].name

# SQL
pd.read_sql("SELECT name FROM sweets WHERE name LIKE 'M%'", connector)

output

Milty
Mikus
Mivi
Mi
Misa
Maltik
Macus

As for wildcards in SQL statements, % matches any sequence of zero or more characters, while _ matches exactly one character. The difference looks like this

# SQL
pd.read_sql("SELECT name FROM sweets WHERE name LIKE 'M%'", connector)


pd.read_sql("SELECT name FROM sweets WHERE name LIKE 'M_'", connector)

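The two wildcards and their rough Pandas counterparts can be sketched as follows. This is an illustration under assumptions: the table holds only a name column, four of the names come from the outputs in this article, and 'mango' is a hypothetical row added to show that SQLite's LIKE ignores ASCII case by default while str.startswith() does not.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# 'mango' is a hypothetical row added to show the case difference.
conn.executescript("""
CREATE TABLE sweets (name TEXT);
INSERT INTO sweets(name) VALUES ('Milty'), ('Mi'), ('Mivi'), ('Sor'), ('mango');
""")
df = pd.read_sql("SELECT name FROM sweets", conn)

# % matches zero or more characters; _ matches exactly one character.
like_any = pd.read_sql("SELECT name FROM sweets WHERE name LIKE 'M%'", conn)["name"].tolist()
like_one = pd.read_sql("SELECT name FROM sweets WHERE name LIKE 'M_'", conn)["name"].tolist()

# Pandas: startswith is case-sensitive; a regex covers the one-character case.
starts_m = df[df["name"].str.startswith("M")]["name"].tolist()
m_plus_one = df[df["name"].str.match(r"M.$")]["name"].tolist()

# Lower-casing first reproduces the case-insensitive behaviour of LIKE.
starts_m_any_case = df[df["name"].str.lower().str.startswith("m")]["name"].tolist()

print(like_any, like_one, starts_m, m_plus_one, starts_m_any_case)
conn.close()
```

If the data mixes upper and lower case, the SQL and Pandas filters can therefore return different rows unless one side is adjusted.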

Filtering on complex conditions

Now let's look at filtering on multiple conditions, for example the names of desserts whose weight equals 300 and whose cost equals 150. The code is as follows

# Pandas
df_sweets[(df_sweets.cost == 150) & (df_sweets.weight == 300)].name

# SQL
pd.read_sql("SELECT name FROM sweets WHERE cost = '150' AND weight = '300'", connector)

output

Mikus

Or the names of desserts whose cost is between 200 and 300 (both bounds are included in Pandas and in SQL). The code is as follows

# Pandas
df_sweets[df_sweets['cost'].between(200, 300)].name

# SQL
pd.read_sql("SELECT name FROM sweets WHERE cost BETWEEN '200' AND '300'", connector)

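Whether the quoted bounds in the SQL above compare as numbers or as text depends on the column types in the author's schema, which the article does not show. The sketch below assumes a hypothetical table whose cost column is declared TEXT (all rows are invented for illustration) and shows why casting to a number, as pd.to_numeric() does on the Pandas side, matters for range filters.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# Hypothetical rows; cost is stored as TEXT on purpose.
conn.executescript("""
CREATE TABLE sweets (name TEXT, cost TEXT);
INSERT INTO sweets VALUES
    ('A', '25'), ('B', '200'), ('C', '250'), ('D', '300'), ('E', '1000');
""")

# Text comparison: '25' sorts between '200' and '300' character by character.
as_text = pd.read_sql(
    "SELECT name FROM sweets WHERE cost BETWEEN '200' AND '300' ORDER BY name", conn
)["name"].tolist()

# Numeric comparison after an explicit cast.
as_number = pd.read_sql(
    "SELECT name FROM sweets WHERE CAST(cost AS INTEGER) BETWEEN 200 AND 300 ORDER BY name",
    conn,
)["name"].tolist()

# Pandas: convert first, then between() includes both bounds by default.
df = pd.read_sql("SELECT * FROM sweets", conn)
df["cost"] = pd.to_numeric(df["cost"])
in_pandas = df[df["cost"].between(200, 300)]["name"].tolist()

print(as_text, as_number, in_pandas)
conn.close()
```

With a numeric column type the cast is unnecessary, so checking the schema first is usually the simpler fix.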

For sorting, SQL uses the ORDER BY clause. The code is as follows

# SQL
pd.read_sql("SELECT name FROM sweets ORDER BY id DESC", connector)


In Pandas, the counterpart is the sort_values() method. The code is as follows

# Pandas
df_sweets.sort_values(by='id', ascending=False).name

output

11     Macus
10    Maltik
9        Sor
8         Co
7     Soviet
6     Soucus
5     Soltic
4       Misa
3         Mi
2       Mivi
1      Mikus
0      Milty
Name: name, dtype: object
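Sorting on more than one key works the same way in both tools. A minimal sketch, assuming a hypothetical table with invented rows, that sorts by cost descending and breaks ties by name ascending:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# Hypothetical rows for illustration only.
conn.executescript("""
CREATE TABLE sweets (name TEXT, cost INTEGER);
INSERT INTO sweets VALUES ('Alpha', 100), ('Gamma', 150), ('Beta', 150), ('Delta', 120);
""")
df = pd.read_sql("SELECT * FROM sweets", conn)

# Pandas: one ascending flag per sort key.
pandas_sorted = df.sort_values(by=["cost", "name"], ascending=[False, True])["name"].tolist()

# SQL: one direction per ORDER BY term.
sql_sorted = pd.read_sql(
    "SELECT name FROM sweets ORDER BY cost DESC, name ASC", conn
)["name"].tolist()

print(pandas_sorted, sql_sorted)
conn.close()
```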

To select the name of the dessert with the highest cost, the Pandas code looks like this

df_sweets[df_sweets.cost == df_sweets.cost.max()].name

output

11    Macus
Name: name, dtype: object

In SQL, we first find the maximum cost with a subquery and then filter on it. The code is as follows

pd.read_sql("SELECT name FROM sweets WHERE cost = (SELECT MAX(cost) FROM sweets)", connector)
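One detail worth knowing is how ties are handled. The sketch below assumes hypothetical rows in which two desserts share the highest cost: the mask and the subquery return every tied row, while idxmax() and ORDER BY ... LIMIT 1 return a single row.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# Hypothetical rows; Beta and Gamma tie for the highest cost.
conn.executescript("""
CREATE TABLE sweets (name TEXT, cost INTEGER);
INSERT INTO sweets VALUES ('Alpha', 100), ('Beta', 150), ('Gamma', 150);
""")
df = pd.read_sql("SELECT * FROM sweets", conn)

# Mask against the maximum: keeps all tied rows.
all_max_pandas = df[df["cost"] == df["cost"].max()]["name"].tolist()

# idxmax returns the label of the first maximum only.
first_max_pandas = df.loc[df["cost"].idxmax(), "name"]

# Subquery: keeps all tied rows, like the mask.
all_max_sql = pd.read_sql(
    "SELECT name FROM sweets WHERE cost = (SELECT MAX(cost) FROM sweets) ORDER BY name", conn
)["name"].tolist()

# LIMIT 1: a single row, with name used here as the tie-breaker.
one_max_sql = pd.read_sql(
    "SELECT name FROM sweets ORDER BY cost DESC, name ASC LIMIT 1", conn
)["name"].tolist()

print(all_max_pandas, first_max_pandas, all_max_sql, one_max_sql)
conn.close()
```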

To see which cities have warehouses, in Pandas we call the unique() method

df_storehouses['city'].unique()

output

array(['Moscow', 'Saint-petersburg', 'Yekaterinburg'], dtype=object)

In SQL, the counterpart is the DISTINCT keyword

pd.read_sql("SELECT DISTINCT city FROM storehouses", connector)
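The two approaches return different kinds of objects. A minimal sketch, assuming a hypothetical storehouses table with repeated cities (the city names come from the output above, the rows themselves are invented): unique() gives a NumPy array in order of first appearance, drop_duplicates() keeps a Pandas object, and DISTINCT gives a one-column result whose order is only fixed if ORDER BY is added.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# Hypothetical rows with repeated cities.
conn.executescript("""
CREATE TABLE storehouses (id INTEGER PRIMARY KEY, city TEXT);
INSERT INTO storehouses(city) VALUES
    ('Moscow'), ('Saint-petersburg'), ('Moscow'), ('Yekaterinburg'), ('Saint-petersburg');
""")
df = pd.read_sql("SELECT * FROM storehouses ORDER BY id", conn)

# unique(): NumPy array, values in order of first appearance.
cities_array = df["city"].unique()

# drop_duplicates(): same values, but still a Series with its index.
cities_series = df["city"].drop_duplicates()

# DISTINCT with ORDER BY for a deterministic order.
cities_sql = pd.read_sql(
    "SELECT DISTINCT city FROM storehouses ORDER BY city", conn
)["city"].tolist()

print(cities_array, cities_series.tolist(), cities_sql)
conn.close()
```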

Grouping and aggregation

In Pandas, grouped statistics generally call the groupby() method followed by an aggregation function, such as mean() for the average or sum() for the total. For example, to find the manufacturers that operate in more than one city, we count the rows per name in the manufacturers dataset and keep the names that occur more than once. The code is as follows

name_counts = df_manufacturers.groupby('name').name.count()
name_counts[name_counts > 1]

output

name
Mishan    2
Name: name, dtype: int64

In SQL, grouping is done with GROUP BY, and a condition on the aggregated result uses the HAVING keyword. The code is as follows

pd.read_sql("""
SELECT name, COUNT(name) AS name_count FROM manufacturers
GROUP BY name HAVING COUNT(name) > 1
""", connector)
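The relationship between the two forms is that the boolean filter applied after the aggregation plays the role of HAVING. A minimal sketch, assuming a hypothetical manufacturers table with name and city columns (the name Mishan appears in the output above, the other rows are invented):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# Hypothetical rows; only 'Mishan' appears more than once.
conn.executescript("""
CREATE TABLE manufacturers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
INSERT INTO manufacturers(name, city) VALUES
    ('Mishan', 'Moscow'),
    ('Mishan', 'Yekaterinburg'),
    ('Alpha', 'Moscow');
""")
df = pd.read_sql("SELECT * FROM manufacturers", conn)

# Pandas: aggregate once, then filter the aggregated Series (the HAVING step).
counts = df.groupby("name").size()
repeated_pandas = counts[counts > 1].to_dict()

# SQL: GROUP BY aggregates, HAVING filters the groups.
sql_df = pd.read_sql(
    """
    SELECT name, COUNT(*) AS name_count FROM manufacturers
    GROUP BY name HAVING COUNT(*) > 1
    """,
    conn,
)
repeated_sql = dict(zip(sql_df["name"], sql_df["name_count"]))

print(repeated_pandas, repeated_sql)
conn.close()
```

Storing the aggregated result in a variable avoids running the same groupby twice.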

Merging data

When two or more datasets need to be merged, in Pandas we can call the merge() method. For example, let's merge the df_sweets and df_sweets_types datasets, where sweets_types_id in df_sweets is the foreign key that refers to the id of df_sweets_types

df_sweets.head()


df_sweets_types.head()


The merge itself looks like this

df_sweets_1 = df_sweets.merge(df_sweets_types, left_on='sweets_types_id', right_on='id')


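Next we filter for desserts of the chocolate type. Because both datasets have a name column, merge() renames them name_x (the dessert) and name_y (the type). The code is as follows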

df_sweets_1.query('name_y == "chocolate"').name_x

output

10    Misa
11     Sor
Name: name_x, dtype: object

The SQL statement is comparatively simple. The code is as follows

# SQL
pd.read_sql("""
SELECT sweets.name FROM sweets
JOIN sweets_types ON sweets.sweets_types_id = sweets_types.id
WHERE sweets_types.name = 'chocolate';
""", connector)

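The default _x and _y suffixes can be replaced with more readable ones through the suffixes argument of merge(). The sketch below assumes hypothetical dessert rows (Alpha, Beta, Gamma are invented) joined to the sweets_types rows shown earlier, and checks that the default inner merge returns the same names as the SQL JOIN.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# sweets_types rows come from the article; the sweets rows are hypothetical.
conn.executescript("""
CREATE TABLE sweets_types (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
INSERT INTO sweets_types(name) VALUES
    ('waffles'), ('candy'), ('marmalade'), ('cookies'), ('chocolate');
CREATE TABLE sweets (id INTEGER PRIMARY KEY, name TEXT, sweets_types_id INTEGER);
INSERT INTO sweets(name, sweets_types_id) VALUES ('Alpha', 5), ('Beta', 2), ('Gamma', 5);
""")
df_sweets = pd.read_sql("SELECT * FROM sweets", conn)
df_types = pd.read_sql("SELECT * FROM sweets_types", conn)

# Inner merge (the default), with explicit suffixes for the clashing columns.
merged = df_sweets.merge(
    df_types,
    left_on="sweets_types_id",
    right_on="id",
    suffixes=("_sweet", "_type"),
)
chocolate_pandas = merged.loc[merged["name_type"] == "chocolate", "name_sweet"].tolist()

# SQL: JOIN without a qualifier is an inner join as well.
chocolate_sql = pd.read_sql(
    """
    SELECT sweets.name FROM sweets
    JOIN sweets_types ON sweets.sweets_types_id = sweets_types.id
    WHERE sweets_types.name = 'chocolate'
    ORDER BY sweets.name
    """,
    conn,
)["name"].tolist()

print(chocolate_pandas, chocolate_sql)
conn.close()
```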

Dataset structure

Let's look at the structure of the dataset. In Pandas, checking the shape attribute is enough. The code is as follows

df_sweets.shape

output

(12, 10)

In SQL, count(*) returns the number of rows, which corresponds to the first element of shape

pd.read_sql("SELECT count(*) FROM sweets;", connector)

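count(*) covers only the row count. To also recover the column count on the SQL side, one option in SQLite is PRAGMA table_info, which returns one row per column. A minimal sketch, assuming a hypothetical three-column sweets table with invented rows:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
# Hypothetical table with 3 columns and 2 rows.
conn.executescript("""
CREATE TABLE sweets (id INTEGER PRIMARY KEY, name TEXT, cost INTEGER);
INSERT INTO sweets(name, cost) VALUES ('Alpha', 100), ('Beta', 150);
""")
df = pd.read_sql("SELECT * FROM sweets", conn)

# Row count from SQL.
n_rows = int(pd.read_sql("SELECT count(*) AS n FROM sweets", conn)["n"].iloc[0])

# Column count: PRAGMA table_info yields one row per column (SQLite-specific).
n_cols = len(conn.execute("PRAGMA table_info(sweets)").fetchall())

print(df.shape, (n_rows, n_cols))
conn.close()
```

PRAGMA statements are specific to SQLite; other databases expose column metadata differently.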


Original site

Copyright notice
This article was written by [Xinyi 2002]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/186/202207051913501548.html