当前位置:网站首页>JSON data flattening pd json_ normalize
JSON data flattening pd json_ normalize
2022-07-07 07:57:00 【Ouchen Eli】
pandas
There's a built-in feature called .json_normalize
.
pandas
It is mentioned in the document of : Will be semi-structured JSON
The data is normalized to a flat table .
def _json_normalize(
data: Union[Dict, List[Dict]],
record_path: Optional[Union[str, List]] = None,
meta: Optional[Union[str, List[Union[str, List[str]]]]] = None,
meta_prefix: Optional[str] = None,
record_prefix: Optional[str] = None,
errors: str = "raise",
sep: str = ".",
max_level: Optional[int] = None,
) -> "DataFrame":
"""
Normalize semi-structured JSON data into a flat table.
Parameters
----------
data : dict or list of dicts
Unserialized JSON objects.
record_path : str or list of str, default None
Path in each object to list of records. If not passed, data will be
assumed to be an array of records.
meta : list of paths (str or list of str), default None
Fields to use as metadata for each record in resulting table.
meta_prefix : str, default None
If True, prefix records with dotted (?) path, e.g. foo.bar.field if
meta is ['foo', 'bar'].
record_prefix : str, default None
If True, prefix records with dotted (?) path, e.g. foo.bar.field if
path to records is ['foo', 'bar'].
errors : {'raise', 'ignore'}, default 'raise'
Configures error handling.
* 'ignore' : will ignore KeyError if keys listed in meta are not
always present.
* 'raise' : will raise KeyError if keys listed in meta are not
always present.
sep : str, default '.'
Nested records will generate names separated by sep.
e.g., for sep='.', {'foo': {'bar': 0}} -> foo.bar.
max_level : int, default None
Max number of levels(depth of dict) to normalize.
if None, normalizes all levels.
.. versionadded:: 0.25.0
Returns
-------
frame : DataFrame
Normalize semi-structured JSON data into a flat table.
Examples
--------
>>> data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
... {'name': {'given': 'Mose', 'family': 'Regner'}},
... {'id': 2, 'name': 'Faye Raker'}]
>>> pd.json_normalize(data)
id name.first name.last name.given name.family name
0 1.0 Coleen Volk NaN NaN NaN
1 NaN NaN NaN Mose Regner NaN
2 2.0 NaN NaN NaN NaN Faye Raker
>>> data = [{'id': 1,
... 'name': "Cole Volk",
... 'fitness': {'height': 130, 'weight': 60}},
... {'name': "Mose Reg",
... 'fitness': {'height': 130, 'weight': 60}},
... {'id': 2, 'name': 'Faye Raker',
... 'fitness': {'height': 130, 'weight': 60}}]
>>> pd.json_normalize(data, max_level=0)
id name fitness
0 1.0 Cole Volk {'height': 130, 'weight': 60}
1 NaN Mose Reg {'height': 130, 'weight': 60}
2 2.0 Faye Raker {'height': 130, 'weight': 60}
Normalizes nested data up to level 1.
>>> data = [{'id': 1,
... 'name': "Cole Volk",
... 'fitness': {'height': 130, 'weight': 60}},
... {'name': "Mose Reg",
... 'fitness': {'height': 130, 'weight': 60}},
... {'id': 2, 'name': 'Faye Raker',
... 'fitness': {'height': 130, 'weight': 60}}]
>>> pd.json_normalize(data, max_level=1)
id name fitness.height fitness.weight
0 1.0 Cole Volk 130 60
1 NaN Mose Reg 130 60
2 2.0 Faye Raker 130 60
>>> data = [{'state': 'Florida',
... 'shortname': 'FL',
... 'info': {'governor': 'Rick Scott'},
... 'counties': [{'name': 'Dade', 'population': 12345},
... {'name': 'Broward', 'population': 40000},
... {'name': 'Palm Beach', 'population': 60000}]},
... {'state': 'Ohio',
... 'shortname': 'OH',
... 'info': {'governor': 'John Kasich'},
... 'counties': [{'name': 'Summit', 'population': 1234},
... {'name': 'Cuyahoga', 'population': 1337}]}]
>>> result = pd.json_normalize(data, 'counties', ['state', 'shortname',
... ['info', 'governor']])
>>> result
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
>>> data = {'A': [1, 2]}
>>> pd.json_normalize(data, 'A', record_prefix='Prefix.')
Prefix.0
0 1
1 2
Returns normalized data with columns prefixed with the given string.
"""
def _pull_field(
js: Dict[str, Any], spec: Union[List, str]
) -> Union[Scalar, Iterable]:
"""Internal function to pull field"""
result = js
if isinstance(spec, list):
for field in spec:
result = result[field]
else:
result = result[spec]
return result
def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
"""
Internal function to pull field for records, and similar to
_pull_field, but require to return list. And will raise error
if has non iterable value.
"""
result = _pull_field(js, spec)
# GH 31507 GH 30145, GH 26284 if result is not list, raise TypeError if not
# null, otherwise return an empty list
if not isinstance(result, list):
if pd.isnull(result):
result = []
else:
raise TypeError(
f"{js} has non list value {result} for path {spec}. "
"Must be list or null."
)
return result
if isinstance(data, list) and not data:
return DataFrame()
# A bit of a hackjob
if isinstance(data, dict):
data = [data]
if record_path is None:
if any([isinstance(x, dict) for x in y.values()] for y in data):
# naive normalization, this is idempotent for flat records
# and potentially will inflate the data considerably for
# deeply nested structures:
# {VeryLong: { b: 1,c:2}} -> {VeryLong.b:1 ,VeryLong.c:@}
#
# TODO: handle record value which are lists, at least error
# reasonably
data = nested_to_record(data, sep=sep, max_level=max_level)
return DataFrame(data)
elif not isinstance(record_path, list):
record_path = [record_path]
if meta is None:
meta = []
elif not isinstance(meta, list):
meta = [meta]
_meta = [m if isinstance(m, list) else [m] for m in meta]
# Disastrously inefficient for now
records: List = []
lengths = []
meta_vals: DefaultDict = defaultdict(list)
meta_keys = [sep.join(val) for val in _meta]
def _recursive_extract(data, path, seen_meta, level=0):
if isinstance(data, dict):
data = [data]
if len(path) > 1:
for obj in data:
for val, key in zip(_meta, meta_keys):
if level + 1 == len(val):
seen_meta[key] = _pull_field(obj, val[-1])
_recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)
else:
for obj in data:
recs = _pull_records(obj, path[0])
recs = [
nested_to_record(r, sep=sep, max_level=max_level)
if isinstance(r, dict)
else r
for r in recs
]
# For repeating the metadata later
lengths.append(len(recs))
for val, key in zip(_meta, meta_keys):
if level + 1 > len(val):
meta_val = seen_meta[key]
else:
try:
meta_val = _pull_field(obj, val[level:])
except KeyError as e:
if errors == "ignore":
meta_val = np.nan
else:
raise KeyError(
"Try running with errors='ignore' as key "
f"{e} is not always present"
) from e
meta_vals[key].append(meta_val)
records.extend(recs)
_recursive_extract(data, record_path, {}, level=0)
result = DataFrame(records)
if record_prefix is not None:
result = result.rename(columns=lambda x: f"{record_prefix}{x}")
# Data types, a problem
for k, v in meta_vals.items():
if meta_prefix is not None:
k = meta_prefix + k
if k in result:
raise ValueError(
f"Conflicting metadata name {k}, need distinguishing prefix "
)
result[k] = np.array(v, dtype=object).repeat(lengths)
return result
json_normalize() Function parameters
Parameter name | explain |
---|---|
data | Unresolved Json object , It can also be Json List objects |
record_path | List or string , If Json Nested lists in objects are not set here , After parsing, the whole list will be directly stored in one column for display |
meta | Json Object key , Nested tags can also be used when there are multiple layers of data |
meta_prefix | The prefix of the key |
record_prefix | Prefix of nested list |
errors | error message , Can be set to ignore , Said if key If not, ignore the error , It can also be set to raise , Said if key If it does not exist, an error will be reported to prompt . The default value is raise |
sep | Multi-storey key Separator between , The default value is . ( One point ) |
max_level | analysis Json The maximum number of layers of the object , It is suitable for multi-layer nested Json object |
Before the code demonstration, import the corresponding dependent Libraries , Not installed pandas Please install the library by yourself ( This code is in Jupyter Notebook Running in the environment ).
from pandas import json_normalize
import pandas as pd
1. Analyze the most basic Json
a. Parsing general Json object
a_dict = {
'school': 'ABC primary school',
'location': 'London',
'ranking': 2
}
pd.json_normalize(a_dict)
The output is :
b. Analyze a Json The object list
json_list = [
{'class': 'Year 1', 'student number': 20, 'room': 'Yellow'},
{'class': 'Year 2', 'student number': 25, 'room': 'Blue'}
]
pd.json_normalize(json_list)
The output is :
2. Parse a with multiple layers of data Json
a. Parse a with multiple layers of data Json object
json_obj = {
'school': 'ABC primary school',
'location': 'London',
'ranking': 2,
'info': {
'president': 'John Kasich',
'contacts': {
'email': {
'admission': '[email protected]',
'general': '[email protected]'
},
'tel': '123456789',
}
}
}
pd.json_normalize(json_obj)
The output is :
Multi-storey key Use points to separate , Shows all the data , This has been resolved 3 layer , The above writing is similar to pd.json_normalize(json_obj, max_level=3)
Equivalent .
If you set max_level=1
, The output result is as shown in the figure below ,contacts
Part of the data collection is integrated into a column
If you set max_level=2
, The output result is as shown in the figure below ,contacts
Under the email
Part of the data collection is integrated into a column
b. Parse a with multiple layers of data Json The object list
json_list = [
{
'class': 'Year 1',
'student count': 20,
'room': 'Yellow',
'info': {
'teachers': {
'math': 'Rick Scott',
'physics': 'Elon Mask'
}
}
},
{
'class': 'Year 2',
'student count': 25,
'room': 'Blue',
'info': {
'teachers': {
'math': 'Alan Turing',
'physics': 'Albert Einstein'
}
}
}
]
pd.json_normalize(json_list)
The output is :
If you separate max_level
Set to 2
and 3
, What should the output results be ? Please try it yourself ~
3. Resolve a with nested lists Json
json_obj = {
'school': 'ABC primary school',
'location': 'London',
'ranking': 2,
'info': {
'president': 'John Kasich',
'contacts': {
'email': {
'admission': '[email protected]',
'general': '[email protected]'
},
'tel': '123456789',
}
},
'students': [
{'name': 'Tom'},
{'name': 'James'},
{'name': 'Jacqueline'}
],
}
pd.json_normalize(json_obj)
In this case students
The value corresponding to the key is a list
, Use []
Cover up . The above method is directly used for analysis , The results are as follows :
students
Part of the data was not successfully parsed , It can be record_path
Set the value , The call mode is pd.json_normalize(json_obj, record_path='students')
, In this calling mode , The results obtained only include name
Part of the data .
To add information for other fields , It needs to be meta
parameter assignment , For example, under the following call mode , The result is as follows :
pd.json_normalize(json_obj, record_path='students', meta=['school', 'location', ['info', 'contacts', 'tel'], ['info', 'contacts', 'email', 'general']])
4. When Key
How to ignore the system error when it does not exist
data = [
{
'class': 'Year 1',
'student count': 20,
'room': 'Yellow',
'info': {
'teachers': {
'math': 'Rick Scott',
'physics': 'Elon Mask',
}
},
'students': [
{ 'name': 'Tom', 'sex': 'M' },
{ 'name': 'James', 'sex': 'M' },
]
},
{
'class': 'Year 2',
'student count': 25,
'room': 'Blue',
'info': {
'teachers': {
# no math teacher
'physics': 'Albert Einstein'
}
},
'students': [
{ 'name': 'Tony', 'sex': 'M' },
{ 'name': 'Jacqueline', 'sex': 'F' },
]
},
]
pd.json_normalize(
data,
record_path =['students'],
meta=['class', 'room', ['info', 'teachers', 'math']]
)
stay class
be equal to Year 2
Of Json In the object ,teachers
Under the math
The key doesn't exist , Running the above code directly will report the following error , Tips math
Keys don't always exist , And the corresponding suggestions are given :Try running with errors='ignore'
.
add to errors
After the condition , The results of re running are shown in the figure below , No, math
The part of the key uses NaN
Filled .
pd.json_normalize(
data,
record_path =['students'],
meta=['class', 'room', ['info', 'teachers', 'math']],
errors='ignore'
)
5. Use sep
Parameters are nested Json Of Key
Set separator
stay 2.a
Case study , It can be noted that the output result has multiple layers key
The header of the data column is .
To multilayer key Separating , It can be for sep
Assign a value to change the separator .
json_obj = {
'school': 'ABC primary school',
'location': 'London',
'ranking': 2,
'info': {
'president': 'John Kasich',
'contacts': {
'email': {
'admission': '[email protected]',
'general': '[email protected]'
},
'tel': '123456789',
}
}
}
pd.json_normalize(json_obj, sep='->')
The output is :
6. Prefix nested list data and metadata
stay 3
In the output of example , Each column name has no prefix , for example name
I don't know if this column is the data obtained by metadata parsing , Or through student
Data from nested lists , Therefore record_prefix
and meta_prefix
The parameters are assigned respectively , You can add the corresponding prefix to the output result .
json_obj = {
'school': 'ABC primary school',
'location': 'London',
'ranking': 2,
'info': {
'president': 'John Kasich',
'contacts': {
'email': {
'admission': '[email protected]',
'general': '[email protected]'
},
'tel': '123456789',
}
},
'students': [
{'name': 'Tom'},
{'name': 'James'},
{'name': 'Jacqueline'}
],
}
pd.json_normalize(json_obj, record_path='students',
meta=['school', 'location', ['info', 'contacts', 'tel'], ['info', 'contacts', 'email', 'general']],
record_prefix='students->',
meta_prefix='meta->',
sep='->')
In this case , Add... To nested list data students->
Prefix , Add... For metadata meta->
Prefix , Will be nested key Change the separator between to ->
, The output is :
7. adopt URL obtain Json And analyze the data
adopt URL Getting data requires requests
library , Please install the corresponding library by yourself .
import requests
from pandas import json_normalize
# Through the weather API, Get Shenzhen near 7 Days of the weather
url = 'https://tianqiapi.com/free/week'
# Pass in url, And set the corresponding params
r = requests.get(url, params={"appid":"59257444", "appsecret":"uULlTGV9 ", 'city':' Shenzhen '})
# Convert the obtained value to json object
result = r.json()
df = json_normalize(result, meta=['city', 'cityid', 'update_time'], record_path=['data'])
df
result The results are as follows , among data
For a nested list :
{'cityid': '101280601',
'city': ' Shenzhen ',
'update_time': '2021-08-09 06:39:49',
'data': [{'date': '2021-08-09',
'wea': ' Moderate rain to thunderstorm ',
'wea_img': 'yu',
'tem_day': '32',
'tem_night': '26',
'win': ' No sustained wind direction ',
'win_speed': '<3 level '},
{'date': '2021-08-10',
'wea': ' thunder shower ',
'wea_img': 'yu',
'tem_day': '32',
'tem_night': '27',
'win': ' No sustained wind direction ',
'win_speed': '<3 level '},
{'date': '2021-08-11',
'wea': ' thunder shower ',
'wea_img': 'yu',
'tem_day': '31',
'tem_night': '27',
'win': ' No sustained wind direction ',
'win_speed': '<3 level '},
{'date': '2021-08-12',
'wea': ' cloudy ',
'wea_img': 'yun',
'tem_day': '33',
'tem_night': '27',
'win': ' No sustained wind direction ',
'win_speed': '<3 level '},
{'date': '2021-08-13',
'wea': ' cloudy ',
'wea_img': 'yun',
'tem_day': '33',
'tem_night': '27',
'win': ' No sustained wind direction ',
'win_speed': '<3 level '},
{'date': '2021-08-14',
'wea': ' cloudy ',
'wea_img': 'yun',
'tem_day': '32',
'tem_night': '27',
'win': ' No sustained wind direction ',
'win_speed': '<3 level '},
{'date': '2021-08-15',
'wea': ' cloudy ',
'wea_img': 'yun',
'tem_day': '32',
'tem_night': '27',
'win': ' No sustained wind direction ',
'win_speed': '<3 level '}]}
The output result after parsing is :
8. To explore the : Resolution with Multiple nested lists
Of Json
When one Json When there is more than one nested list in an object or object list ,record_path
Cannot include all nested lists , Because it can only receive one key value . here , We need to start with... Based on multiple nested lists key
take Json It's resolved into multiple DataFrame
, And I'll put these DataFrame
Spliced according to the actual correlation conditions , And remove duplicates .
json_obj = {
'school': 'ABC primary school',
'location': 'shenzhen',
'ranking': 2,
'info': {
'president': 'John Kasich',
'contacts': {
'email': {
'admission': '[email protected]',
'general': '[email protected]'
},
'tel': '123456789',
}
},
'students': [
{'name': 'Tom'},
{'name': 'James'},
{'name': 'Jacqueline'}
],
# add to university Nested list , add students, The JSON There are... In the object 2 A nested list
'university': [
{'university_name': 'HongKong university shenzhen'},
{'university_name': 'zhongshan university shenzhen'},
{'university_name': 'shenzhen university'}
],
}
# Try to record_path Write the names of two nested lists in , namely record_path = ['students', 'university], The result is of no avail
# So I decided to analyze it twice , Separately record_path Set to university and students, In the end 2 Results combined
df1 = pd.json_normalize(json_obj, record_path=['university'],
meta=['school', 'location', ['info', 'contacts', 'tel'],
['info', 'contacts', 'email', 'general']],
record_prefix='university->',
meta_prefix='meta->',
sep='->')
df2 = pd.json_normalize(json_obj, record_path=['students'],
meta=['school', 'location', ['info', 'contacts', 'tel'],
['info', 'contacts', 'email', 'general']],
record_prefix='students->',
meta_prefix='meta->',
sep='->')
# According to index Associate and remove duplicate Columns
df1.merge(df2, how='left', left_index=True, right_index=True, suffixes=['->', '->']).T.drop_duplicates().T
The output is :
The part marked in the red box on the way is Json Object .
summary
json_normalize()
Method is extremely powerful , It covers almost all parsing JSON Scene , When it comes to more complex scenes , Can give the existing functions for divergent Integration , for example 8. To explore the
The same thing happened in .
With this powerful Json Parsing library , I'll never be afraid to encounter complex Json Data. !
Reference resources :
边栏推荐
- Value sequence (subsequence contribution problem)
- 【webrtc】m98 screen和window采集
- [UVM foundation] what is transaction
- Solve could not find or load the QT platform plugin "xcb" in "
- Technology cloud report: from robot to Cobot, human-computer integration is creating an era
- 2022年茶艺师(中级)考试试题及模拟考试
- C语言通信行程卡后台系统
- 即刻报名|飞桨黑客马拉松第三期等你挑战
- Linux server development, redis source code storage principle and data model
- 【数字IC验证快速入门】17、SystemVerilog学习之基本语法4(随机化Randomization)
猜你喜欢
2022年茶艺师(中级)考试试题及模拟考试
Mysql高低版本切换需要修改的配置5-8(此处以aicode为例)
《动手学深度学习》(四) -- 卷积神经网络 CNN
mysql多列索引(组合索引)特点和使用场景
今日现货白银操作建议
开源生态|打造活力开源社区,共建开源新生态!
Detailed explanation of Kalman filter for motion state estimation
MySQL multi column index (composite index) features and usage scenarios
Linux server development, detailed explanation of redis related commands and their principles
Thinkcmf6.0安装教程
随机推荐
快速使用 Jacoco 代码覆盖率统计
Regular e-commerce problems part1
【webrtc】m98 screen和window采集
A bit of knowledge - about Apple Certified MFI
Most elements
[advanced digital IC Verification] command query method and common command interpretation of VCs tool
Qt学习27 应用程序中的主窗口
[SUCTF 2019]Game
[mathematical notes] radian
Resource create package method
Figure out the working principle of gpt3
pytest+allure+jenkins安装问题:pytest: error: unrecognized arguments: --alluredir
Redis technology leak detection and filling (II) - expired deletion strategy
[Matlab] Simulink 自定义函数中的矩阵乘法工作不正常时可以使用模块库中的矩阵乘法模块代替
Visualization Document Feb 12 16:42
QT learning 28 toolbar in the main window
Pytest+allure+jenkins environment -- completion of pit filling
padavan手动安装php
IO stream file
2022年茶艺师(中级)考试试题及模拟考试