Neat trick! Turn nested JSON into a DataFrame in seconds

First published on the official account: Python Data Science
Author: Dongge takes off

APIs and document databases often return nested JSON objects. When we use Python to turn keys in the nested structure into columns, loading the data into pandas usually looks like this:

import pandas as pd
df = pd.DataFrame.from_records(results["issues"], columns=["key", "fields"])

Note: here results is a large dictionary and issues is one of its keys. The value of issues is a list of nested JSON objects (dictionaries). The nested JSON structure is shown further down.
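To make the starting point concrete, here is a minimal sketch. The results dictionary is a hypothetical, trimmed-down stand-in for the real API response (one issue only, a few keys kept), so the exact contents are an assumption.

```python
import pandas as pd

# Hypothetical, trimmed-down stand-in for the API response.
results = {
    "issues": [
        {
            "id": "11861",
            "key": "CAE-160",
            "fields": {
                "summary": "Recovered data collection Defraglar $MFT problem",
                "status": {"name": "Resolved", "statusCategory": {"name": "Done"}},
            },
        }
    ]
}

df = pd.DataFrame.from_records(results["issues"], columns=["key", "fields"])

# "key" is a plain value, but each cell of "fields" is still a whole dict.
print(df.columns.tolist())
print(type(df.loc[0, "fields"]))
```

The key column is usable right away, while everything else is still packed inside the fields dictionaries.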

The problem is that the API returns a nested JSON structure, and the keys we care about sit at different levels inside the objects.

The nested JSON structure looks like the sample response shown further below.

What we want instead is a flat table with one column per field.

Take the data returned by one API as an example. APIs usually include metadata about the fields. Suppose these are the fields we want:

key: the JSON key, at the first level.
summary: inside the second-level "fields" object.
status name: at the third level.
statusCategory name: at the fourth nesting level.

As listed above, the fields we want to extract sit at 4 different nesting levels in the JSON structure of the issues list, each one nested inside the previous one.
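The four levels can be spelled out with plain dictionary access. This is only an illustrative sketch: the issue dictionary below is a hypothetical single record trimmed from the sample response, and the variable names are made up for the example.

```python
# Hypothetical single issue, trimmed from the sample response.
issue = {
    "key": "CAE-160",
    "fields": {
        "summary": "Recovered data collection Defraglar $MFT problem",
        "status": {
            "name": "Resolved",
            "statusCategory": {"name": "Done"},
        },
    },
}

key = issue["key"]                                                    # level 1
summary = issue["fields"]["summary"]                                  # level 2
status_name = issue["fields"]["status"]["name"]                       # level 3
category_name = issue["fields"]["status"]["statusCategory"]["name"]   # level 4

print(key, summary, status_name, category_name, sep=" | ")
```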

{
  "expand": "schema,names",
  "issues": [
    {
      "fields": {
        "issuetype": {
          "avatarId": 10300,
          "description": "",
          "id": "10005",
          "name": "New Feature",
          "subtask": false
        },
        "status": {
          "description": "A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.",
          "id": "5",
          "name": "Resolved",
          "statusCategory": {
            "colorName": "green",
            "id": 3,
            "key": "done",
            "name": "Done"
          }
        },
        "summary": "Recovered data collection Defraglar $MFT problem"
      },
      "id": "11861",
      "key": "CAE-160"
    },
    {
      "fields": { 
... more issues],
  "maxResults": 5,
  "startAt": 0,
  "total": 160
}

A bad solution

One option is to hand-roll the code: write a function that looks up a specific field. The problem is that you have to call this function for every nested field and then use .apply to add each result to the DataFrame as a new column.
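A hand-rolled lookup of this kind might look like the sketch below. The helper name get_nested and the sample results dictionary are assumptions made for illustration, not part of the original code.

```python
import pandas as pd

# Hypothetical, trimmed-down stand-in for the API response.
results = {
    "issues": [
        {
            "key": "CAE-160",
            "fields": {
                "summary": "Recovered data collection Defraglar $MFT problem",
                "status": {"name": "Resolved", "statusCategory": {"name": "Done"}},
            },
        }
    ]
}


def get_nested(obj, path):
    """Follow a list of keys into nested dicts; return None if a key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


df = pd.DataFrame.from_records(results["issues"], columns=["key", "fields"])

# One .apply call per nested field: this is the repetitive part.
df["status_name"] = df["fields"].apply(lambda f: get_nested(f, ["status", "name"]))
df["status_category_name"] = df["fields"].apply(
    lambda f: get_nested(f, ["status", "statusCategory", "name"])
)
print(df[["key", "status_name", "status_category_name"]])
```

It works, but every extra field means another path to spell out and another .apply call.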

To get the few fields we want, we first expand the keys of the fields object into columns:

df = (
    df["fields"]
    .apply(pd.Series)
    .merge(df, left_index=True, right_index = True)
)

The resulting table shows that only summary is directly usable; issuetype, status, and the others are still buried in nested objects.

Here is one way to extract the name from issuetype.

# Extract the name of issuetype into a new column "issue_type_name"
df_issue_type = (
    df["issuetype"]
    .apply(pd.Series)
    .rename(columns={"name": "issue_type_name"})["issue_type_name"]
)
df = df.assign(issue_type_name = df_issue_type)

As shown above, when there are many levels of nesting you end up hand-writing a recursion, because every nesting level needs a call like the one above that parses the object and adds a new column.
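For comparison, a hand-written recursion could be sketched roughly as follows. The flatten function and the sample issues list are hypothetical, written only to show what rolling it yourself involves.

```python
import pandas as pd

# Hypothetical, trimmed-down list of issues.
issues = [
    {
        "key": "CAE-160",
        "fields": {
            "summary": "Recovered data collection Defraglar $MFT problem",
            "status": {"name": "Resolved", "statusCategory": {"name": "Done"}},
        },
    }
]


def flatten(obj, prefix="", sep="."):
    """Recursively flatten nested dicts into one dict with joined key names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Descend one level and carry the joined name along as the prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


df = pd.DataFrame([flatten(issue) for issue in issues])
print(df.columns.tolist())
```

Even this small version has to handle prefixes and separators by hand, and it does not cover lists nested inside the objects.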

For readers with a weaker programming background, hand-rolling this is a real hassle. That is especially true for data analysts who are in a hurry and just want structured data to analyze.

So here is a solution that is built into pandas.

Built-in solution

pandas has a built-in function called pd.json_normalize.

The pandas documentation describes it as follows: normalize semi-structured JSON data into a flat table.

All the code from the previous approach shrinks to just 3 lines with this built-in function. The steps are simple:

First, identify the fields we want, using the . symbol to join the names of nested objects.

Then, pass the nested list you want to process (here results["issues"]) as the argument to pd.json_normalize.

Finally, filter the result by the FIELDS list we defined.

FIELDS = ["key", "fields.summary", "fields.issuetype.name", "fields.status.name", "fields.status.statusCategory.name"]
df = pd.json_normalize(results["issues"])
df[FIELDS]

That's right, it's that simple.
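To try the three lines end to end, the sketch below runs them on a hypothetical results dictionary that holds a single issue trimmed from the sample response. The variable name flat is introduced only for this example.

```python
import pandas as pd

# Hypothetical, trimmed-down stand-in for the API response.
results = {
    "issues": [
        {
            "id": "11861",
            "key": "CAE-160",
            "fields": {
                "issuetype": {"id": "10005", "name": "New Feature"},
                "status": {
                    "id": "5",
                    "name": "Resolved",
                    "statusCategory": {"key": "done", "name": "Done"},
                },
                "summary": "Recovered data collection Defraglar $MFT problem",
            },
        }
    ]
}

FIELDS = [
    "key",
    "fields.summary",
    "fields.issuetype.name",
    "fields.status.name",
    "fields.status.statusCategory.name",
]

df = pd.json_normalize(results["issues"])  # every nested key becomes a column
flat = df[FIELDS]                          # keep only the columns we care about
print(flat.iloc[0].to_dict())
```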

Other operations

Record path

Besides passing the results["issues"] list directly as above, we can also use the record_path parameter to specify the path to the list inside the JSON object.

# Use a path instead of passing results["issues"] directly
pd.json_normalize(results, record_path="issues")[FIELDS]
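As a quick check that the two call styles agree, the sketch below compares them on a hypothetical results dictionary with one trimmed issue. The names by_list and by_path are made up for the example.

```python
import pandas as pd

# Hypothetical, trimmed-down stand-in for the API response.
results = {
    "issues": [
        {
            "key": "CAE-160",
            "fields": {
                "summary": "Recovered data collection Defraglar $MFT problem",
                "status": {"name": "Resolved", "statusCategory": {"name": "Done"}},
            },
        }
    ],
    "total": 160,
}

FIELDS = ["key", "fields.summary", "fields.status.name", "fields.status.statusCategory.name"]

by_list = pd.json_normalize(results["issues"])[FIELDS]
by_path = pd.json_normalize(results, record_path="issues")[FIELDS]

# Both calls should produce the same flat table for the selected fields.
print(by_list.equals(by_path))
```

With record_path, the whole response dictionary is passed in and pandas locates the list itself.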

Custom delimiter

You can also use the sep parameter to define the separator that joins the nested key names. For example, the following replaces the default "." with "-".

# Use "-" in place of the default "."
FIELDS = ["key", "fields-summary", "fields-issuetype-name", "fields-status-name", "fields-status-statusCategory-name"]
pd.json_normalize(results["issues"], sep = "-")[FIELDS]
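Rather than retyping every column name, the custom-separator names can be derived from the dotted ones. The sketch below assumes a hypothetical one-issue list, and DOTTED_FIELDS and SEP are names invented for the example.

```python
import pandas as pd

# Hypothetical, trimmed-down list of issues.
issues = [
    {
        "key": "CAE-160",
        "fields": {
            "summary": "Recovered data collection Defraglar $MFT problem",
            "status": {"name": "Resolved", "statusCategory": {"name": "Done"}},
        },
    }
]

SEP = "-"
DOTTED_FIELDS = ["key", "fields.summary", "fields.status.name", "fields.status.statusCategory.name"]

# Build the column names that match the custom separator.
FIELDS = [name.replace(".", SEP) for name in DOTTED_FIELDS]

df = pd.json_normalize(issues, sep=SEP)[FIELDS]
print(df.columns.tolist())
```

This shortcut is only safe when the key names themselves contain no dots.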

Control recursion

If you don't want to recurse into every child object, you can use the max_level parameter to control the depth. In this case, because the statusCategory.name field sits at the 4th level of the JSON object, it is not flattened into its own column in the resulting DataFrame; the statusCategory object stays unexpanded instead.

#  Only drill down to the second level of nesting 
pd.json_normalize(results, record_path="issues", max_level = 2)
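The effect of max_level can be seen by comparing the column names with and without it. The sketch below uses a hypothetical one-issue list, and full and shallow are names chosen for the example.

```python
import pandas as pd

# Hypothetical, trimmed-down list of issues.
issues = [
    {
        "key": "CAE-160",
        "fields": {
            "summary": "Recovered data collection Defraglar $MFT problem",
            "status": {"name": "Resolved", "statusCategory": {"name": "Done"}},
        },
    }
]

full = pd.json_normalize(issues)                  # flatten every level
shallow = pd.json_normalize(issues, max_level=2)  # stop after two levels

print(full.columns.tolist())
print(shallow.columns.tolist())

# The unexpanded statusCategory object is kept as a dict in a single column.
print(shallow.loc[0, "fields.status.statusCategory"])
```

With max_level=2, fields.status.name is still flattened, while statusCategory is left as a dictionary in one column.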

If anything is still unclear, the official pandas documentation for pd.json_normalize is worth reading on your own. That's all from Dongge for this time.