当前位置:网站首页>How to customize sorting for pandas dataframe
How to customize sorting for pandas dataframe
2020-11-06 01:28:00 【Artificial intelligence meets pioneer】
author |B. Chen compile |VK source |Towards Data Science

Pandas DataFrame There's a built-in method sort_values(), You can sort values according to a given variable . The method itself is quite simple to use , But it doesn't work with custom sort , for example ,
-
t T-shirt size :XS、S、M、L and XL
-
month : January 、 February 、 March 、 April, etc
-
What day : Monday 、 Tuesday 、 Wednesday 、 Thursday 、 Friday 、 Saturday and Sunday .
In this paper , We will learn how to deal with Pandas DataFrame Custom sort .
Please check my Github repo To get the source code :https://github.com/BindiChen/machine-learning/blob/master/data-analysis/017-pandas-custom-sort/pandas-custom-sort.ipynb
problem
Suppose we have a data set about clothing stores :
df = pd.DataFrame({
'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

We can see , Each piece of cloth has a size value , The data should be sorted in the following order :
-
XS For extra large
-
S For Trumpet
-
M For medium
-
L For big
-
XL For extra large
however , When calling sort_values('size') when , You will get the following output .

The output is not what we want , But it's technically correct . actually ,sort_values() It is to sort numerical data in numerical order , Sort the object data in alphabetical order .
Here are two common solutions :
-
Create a new column for a custom sort
-
Use CategoricalDtype Cast data to an ordered category type
Create a new column for a custom sort
In this solution , A mapping data frame is needed to represent a custom sort , Then create a new column from the map , Finally, we can sort the data by new columns . Let's take an example to see how this works .
First , Let's create a mapping data frame to represent a custom sort .
df_mapping = pd.DataFrame({
'size': ['XS', 'S', 'M', 'L', 'XL'],
})
sort_mapping = df_mapping.reset_index().set_index('size')

after , Use sort_mapping Create a new column with the mapping values in size_num.
df['size_num'] = df['size'].map(sort_mapping['index'])
Last , Sort values by new column size .
df.sort_values('size_num')

This, of course, is our job . But it creates an alternate column , Efficiency may be reduced when dealing with large data sets .
We can use CategoricalDtype To solve this problem more effectively .
Use CategoricalDtype Cast data to an ordered category type
CategoricalDtype Is a type of categorical data with a category and order [1]. It's very useful for creating custom sorts [2]. Let's take an example to see how this works .
First , Let's import CategoricalDtype.
from pandas.api.types import CategoricalDtype
then , Create a custom category type cat_size_order
-
The first parameter is set to ['XS'、'S'、'M'、'L'、'XL'] As a unique value of size .
-
The second parameter ordered=True, Think of this variable as ordered .
cat_size_order = CategoricalDtype(
['XS', 'S', 'M', 'L', 'XL'],
ordered=True
)
then , call astype(cat_size_order) Cast size data to a custom category type . By running df['size'], We can see size Column has been converted to a category type , The order is [XS<S<M<L<XL].
>>> df['size'] = df['size'].astype(cat_size_order)
>>> df['size']
0 S
1 XL
2 M
3 XS
4 L
5 S
Name: size, dtype: category
Categories (5, object): [XS < S < M < L < XL]
Last , We can call the same method to sort the values .
df.sort_values('size')

It works better . Let's see what the principle is .
Use cat Of codes Attribute access
Now? size Column has been converted to category type , We can use .cat Accessor to view the classification properties . Behind the scenes , It USES codes Property to represent the size of an ordered variable .
Let's create a new column code , So we can compare size and code values side by side .
df['codes'] = df['size'].cat.codes
df

We can see XS、S、M、L and XL The codes for are 0、1、2、3、4 and 5.codes Is the actual value of the category . By running df.info(), We can see that it's actually int8.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cloth_id 6 non-null int64
1 size 6 non-null category
2 codes 6 non-null int8
dtypes: category(1), int64(1), int8(1)
memory usage: 388.0 bytes
Sort by multiple variables
Next , Let's make things a little more complicated . here , We will sort the data frames by multiple variables .
df = pd.DataFrame({
'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],
'customer_id': [10, 12, 12, 12, 10, 10, 10],
'month': ['Feb', 'Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
'day_of_week': ['Mon', 'Wed', 'Sun', 'Tue', 'Sat', 'Mon', 'Thu'],
})
Similarly , Let's create two custom category types cat_day_of_week and cat_month, And pass them on to astype().
cat_day_of_week = CategoricalDtype(
['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
ordered=True
)
cat_month = CategoricalDtype(
['Jan', 'Feb', 'Mar', 'Apr'],
ordered=True,
)
df['day_of_week'] = df['day_of_week'].astype(cat_day_of_week)
df['month'] = df['month'].astype(cat_month)
To sort by multiple variables , We just need to pass a list instead of sort_values(). for example , Press month and day_of_week Sort .
df.sort_values(['month', 'day_of_week'])

Press ustomer_id,month and day_of_week Sort .
df.sort_values(['customer_id', 'month', 'day_of_week'])

That's it , Thanks for reading .
In my, please Github Export the notebook to get the source code :https://github.com/BindiChen/machine-learning/blob/master/data-analysis/017-pandas-custom-sort/pandas-custom-sort.ipynb
Reference
- [1] Pandas.CategoricalDtype API(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.CategoricalDtype.html)
- [2] Pandas Categorical CategoricalDtype tutorial (https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-categoricaldtype)
Link to the original text :https://towardsdatascience.com/how-to-do-a-custom-sort-on-pandas-dataframe-ac18e7ea5320
Welcome to join us AI Blog station : http://panchuang.net/
sklearn Machine learning Chinese official documents : http://sklearn123.com/
Welcome to pay attention to pan Chuang blog resource summary station : http://docs.panchuang.net/
版权声明
本文为[Artificial intelligence meets pioneer]所创,转载请带上原文链接,感谢
边栏推荐
- Swagger 3.0 天天刷屏,真的香嗎?
- Brief introduction and advantages and disadvantages of deepwalk model
- git rebase的時候捅婁子了,怎麼辦?線上等……
- 6.6.1 localeresolver internationalization parser (1) (in-depth analysis of SSM and project practice)
- 教你轻松搞懂vue-codemirror的基本用法:主要实现代码编辑、验证提示、代码格式化
- Can't be asked again! Reentrantlock source code, drawing a look together!
- What is the side effect free method? How to name it? - Mario
- ES6学习笔记(五):轻松了解ES6的内置扩展对象
- How to use parameters in ES6
- Skywalking series blog 5-apm-customize-enhance-plugin
猜你喜欢

keras model.compile Loss function and optimizer

The difference between Es5 class and ES6 class

Filecoin最新动态 完成重大升级 已实现四大项目进展!

The road of C + + Learning: from introduction to mastery

带你学习ES5中新增的方法

Did you blog today?

NLP model Bert: from introduction to mastery (1)

PHP应用对接Justswap专用开发包【JustSwap.PHP】

Vue 3 responsive Foundation

vue任意关系组件通信与跨组件监听状态 vue-communication
随机推荐
Python crawler actual combat details: crawling home of pictures
一篇文章教会你使用HTML5 SVG 标签
Brief introduction of TF flags
axios学习笔记(二):轻松弄懂XHR的使用及如何封装简易axios
Three Python tips for reading, creating and running multiple files
6.4 viewresolver view parser (in-depth analysis of SSM and project practice)
use Asponse.Words Working with word templates
Process analysis of Python authentication mechanism based on JWT
加速「全民直播」洪流,如何攻克延时、卡顿、高并发难题?
H5 makes its own video player (JS Part 2)
Mongodb (from 0 to 1), 11 days mongodb primary to intermediate advanced secret
Swagger 3.0 天天刷屏,真的香嗎?
Skywalking series blog 5-apm-customize-enhance-plugin
小程序入门到精通(二):了解小程序开发4个重要文件
5.5 controlleradvice notes - SSM in depth analysis and project practice
Filecoin的经济模型与未来价值是如何支撑FIL币价格破千的
How to use Python 2.7 after installing anaconda3?
The difference between Es5 class and ES6 class
采购供应商系统是什么?采购供应商管理平台解决方案
Synchronous configuration from git to consult with git 2consul