当前位置:网站首页>How to customize sorting for pandas dataframe
How to customize sorting for pandas dataframe
2020-11-06 01:28:00 【Artificial intelligence meets pioneer】
author |B. Chen compile |VK source |Towards Data Science

Pandas DataFrame There's a built-in method sort_values(), You can sort values according to a given variable . The method itself is quite simple to use , But it doesn't work with custom sort , for example ,
-
t T-shirt size :XS、S、M、L and XL
-
month : January 、 February 、 March 、 April, etc
-
What day : Monday 、 Tuesday 、 Wednesday 、 Thursday 、 Friday 、 Saturday and Sunday .
In this paper , We will learn how to deal with Pandas DataFrame Custom sort .
Please check my Github repo To get the source code :https://github.com/BindiChen/machine-learning/blob/master/data-analysis/017-pandas-custom-sort/pandas-custom-sort.ipynb
problem
Suppose we have a data set about clothing stores :
df = pd.DataFrame({
'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

We can see , Each piece of cloth has a size value , The data should be sorted in the following order :
-
XS For extra large
-
S For Trumpet
-
M For medium
-
L For big
-
XL For extra large
however , When calling sort_values('size') when , You will get the following output .

The output is not what we want , But it's technically correct . actually ,sort_values() It is to sort numerical data in numerical order , Sort the object data in alphabetical order .
Here are two common solutions :
-
Create a new column for a custom sort
-
Use CategoricalDtype Cast data to an ordered category type
Create a new column for a custom sort
In this solution , A mapping data frame is needed to represent a custom sort , Then create a new column from the map , Finally, we can sort the data by new columns . Let's take an example to see how this works .
First , Let's create a mapping data frame to represent a custom sort .
df_mapping = pd.DataFrame({
'size': ['XS', 'S', 'M', 'L', 'XL'],
})
sort_mapping = df_mapping.reset_index().set_index('size')

after , Use sort_mapping Create a new column with the mapping values in size_num.
df['size_num'] = df['size'].map(sort_mapping['index'])
Last , Sort values by new column size .
df.sort_values('size_num')

This, of course, is our job . But it creates an alternate column , Efficiency may be reduced when dealing with large data sets .
We can use CategoricalDtype To solve this problem more effectively .
Use CategoricalDtype Cast data to an ordered category type
CategoricalDtype Is a type of categorical data with a category and order [1]. It's very useful for creating custom sorts [2]. Let's take an example to see how this works .
First , Let's import CategoricalDtype.
from pandas.api.types import CategoricalDtype
then , Create a custom category type cat_size_order
-
The first parameter is set to ['XS'、'S'、'M'、'L'、'XL'] As a unique value of size .
-
The second parameter ordered=True, Think of this variable as ordered .
cat_size_order = CategoricalDtype(
['XS', 'S', 'M', 'L', 'XL'],
ordered=True
)
then , call astype(cat_size_order) Cast size data to a custom category type . By running df['size'], We can see size Column has been converted to a category type , The order is [XS<S<M<L<XL].
>>> df['size'] = df['size'].astype(cat_size_order)
>>> df['size']
0 S
1 XL
2 M
3 XS
4 L
5 S
Name: size, dtype: category
Categories (5, object): [XS < S < M < L < XL]
Last , We can call the same method to sort the values .
df.sort_values('size')

It works better . Let's see what the principle is .
Use cat Of codes Attribute access
Now? size Column has been converted to category type , We can use .cat Accessor to view the classification properties . Behind the scenes , It USES codes Property to represent the size of an ordered variable .
Let's create a new column code , So we can compare size and code values side by side .
df['codes'] = df['size'].cat.codes
df

We can see XS、S、M、L and XL The codes for are 0、1、2、3、4 and 5.codes Is the actual value of the category . By running df.info(), We can see that it's actually int8.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cloth_id 6 non-null int64
1 size 6 non-null category
2 codes 6 non-null int8
dtypes: category(1), int64(1), int8(1)
memory usage: 388.0 bytes
Sort by multiple variables
Next , Let's make things a little more complicated . here , We will sort the data frames by multiple variables .
df = pd.DataFrame({
'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],
'customer_id': [10, 12, 12, 12, 10, 10, 10],
'month': ['Feb', 'Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
'day_of_week': ['Mon', 'Wed', 'Sun', 'Tue', 'Sat', 'Mon', 'Thu'],
})
Similarly , Let's create two custom category types cat_day_of_week and cat_month, And pass them on to astype().
cat_day_of_week = CategoricalDtype(
['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
ordered=True
)
cat_month = CategoricalDtype(
['Jan', 'Feb', 'Mar', 'Apr'],
ordered=True,
)
df['day_of_week'] = df['day_of_week'].astype(cat_day_of_week)
df['month'] = df['month'].astype(cat_month)
To sort by multiple variables , We just need to pass a list instead of sort_values(). for example , Press month and day_of_week Sort .
df.sort_values(['month', 'day_of_week'])

Press ustomer_id,month and day_of_week Sort .
df.sort_values(['customer_id', 'month', 'day_of_week'])

That's it , Thanks for reading .
In my, please Github Export the notebook to get the source code :https://github.com/BindiChen/machine-learning/blob/master/data-analysis/017-pandas-custom-sort/pandas-custom-sort.ipynb
Reference
- [1] Pandas.CategoricalDtype API(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.CategoricalDtype.html)
- [2] Pandas Categorical CategoricalDtype tutorial (https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-categoricaldtype)
Link to the original text :https://towardsdatascience.com/how-to-do-a-custom-sort-on-pandas-dataframe-ac18e7ea5320
Welcome to join us AI Blog station : http://panchuang.net/
sklearn Machine learning Chinese official documents : http://sklearn123.com/
Welcome to pay attention to pan Chuang blog resource summary station : http://docs.panchuang.net/
版权声明
本文为[Artificial intelligence meets pioneer]所创,转载请带上原文链接,感谢
边栏推荐
- Network security engineer Demo: the original * * is to get your computer administrator rights! 【***】
- IPFS/Filecoin合法性:保护个人隐私不被泄露
- 6.5 request to view name translator (in-depth analysis of SSM and project practice)
- Arrangement of basic knowledge points
- How to use Python 2.7 after installing anaconda3?
- 至联云解析:IPFS/Filecoin挖矿为什么这么难?
- htmlcss
- Examples of unconventional aggregation
- Network security engineer Demo: the original * * is to get your computer administrator rights! 【***】
- Summary of common string algorithms
猜你喜欢

Three Python tips for reading, creating and running multiple files

Python download module to accelerate the implementation of recording

Subordination judgment in structured data

一篇文章带你了解CSS 渐变知识

Network security engineer Demo: the original * * is to get your computer administrator rights! 【***】

一篇文章教会你使用HTML5 SVG 标签

PN8162 20W PD快充芯片,PD快充充电器方案

Aprelu: cross border application, adaptive relu | IEEE tie 2020 for machine fault detection

Network security engineer Demo: the original * * is to get your computer administrator rights! 【***】

前端工程师需要懂的前端面试题(c s s方面)总结(二)
随机推荐
华为云“四个可靠”的方法论
PHP应用对接Justswap专用开发包【JustSwap.PHP】
深度揭祕垃圾回收底層,這次讓你徹底弄懂她
Existence judgment in structured data
[event center azure event hub] interpretation of error information found in event hub logs
Advanced Vue component pattern (3)
Nodejs crawler captures ancient books and records, a total of 16000 pages, experience summary and project sharing
Python + appium automatic operation wechat is enough
使用 Iceberg on Kubernetes 打造新一代云原生数据湖
EOS创始人BM: UE,UBI,URI有什么区别?
中小微企业选择共享办公室怎么样?
每个前端工程师都应该懂的前端性能优化总结:
Face to face Manual Chapter 16: explanation and implementation of fair lock of code peasant association lock and reentrantlock
[C / C + + 1] clion configuration and running C language
Vue 3 responsive Foundation
Classical dynamic programming: complete knapsack problem
嘗試從零開始構建我的商城 (二) :使用JWT保護我們的資訊保安,完善Swagger配置
(2)ASP.NET Core3.1 Ocelot路由
Using consult to realize service discovery: instance ID customization
合约交易系统开发|智能合约交易平台搭建