Extracting Specific Keys/Values From A Messed-Up JSON File (Python)

What to do when a messy JSON file gives you a massive headache

When we call certain APIs, the format and structure of the returned JSON object can be pretty messy and confusing. For instance (this example is considered mild, BTW):

data = {
    "type": "video",
    "videoID": "vid001",
    "links": [
        {"type":"video", "videoID":"vid002", "links":[]},
        {   "type":"video",
            "videoID":"vid003",
            "links": [
            {"type": "video", "videoID":"vid004"},
            {"type": "video", "videoID":"vid005"},
            ]
        },
        {"type":"video", "videoID":"vid006"},
        {   "type":"video",
            "videoID":"vid007",
            "links": [
            {"type":"video", "videoID":"vid008", "links": [
                {   "type":"video",
                    "videoID":"vid009",
                    "links": [{"type":"video", "videoID":"vid010"}]
                }
            ]}
        ]},
    ]
}

Unfortunately this happens more often than I wish it did.

Python Code To Extract Specific Key-Value Pairs

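def extract(data, keys):
    out = []          # key-value pairs found so far
    queue = [data]    # data structures still to be searched
    while len(queue) > 0:
        current = queue.pop(0)
        if isinstance(current, dict):
            # record every wanted key present in this dictionary
            for key in keys:
                if key in current:
                    out.append({key:current[key]})

            # queue nested lists and dictionaries to search later
            for val in current.values():
                if isinstance(val, (list, dict)):
                    queue.append(val)
        elif isinstance(current, list):
            queue.extend(current)
    return out

x = extract(data, ["videoID"])
print(x)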

Here, we wish to extract all videoIDs from the messy dictionary, so we pass ["videoID"] as the keys argument. Because the function searches level by level, IDs at shallower nesting levels come out first. The output:

[{'videoID': 'vid001'}, {'videoID': 'vid002'}, {'videoID': 'vid003'},
 {'videoID': 'vid006'}, {'videoID': 'vid007'}, {'videoID': 'vid004'},
 {'videoID': 'vid005'}, {'videoID': 'vid008'}, {'videoID': 'vid009'},
 {'videoID': 'vid010'}]

The Logic Behind The Code

We need to keep track of two lists: 1) out, which contains our output, and 2) queue, which contains the data structures we still need to search. We first initialize out as an empty list, and queue as a list containing our entire JSON data.

  1. Remove the first element from queue, and assign it to current.
  2. If current is a dictionary, search it for the keys that we want, and add any found key-value pairs into out.
  3. Then add all of its values that are either lists or dictionaries into queue so we can search them later.
  4. If current is instead a list, add everything inside current into queue, so we can search the individual elements later. This can be done using the .extend method.
  5. Repeat steps 1–4 until queue is empty.
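
The steps above can be sketched as follows, with each step marked in a comment. This is only an illustrative variant, not a replacement for the function above: the name extract_bfs and the sample data are made up for this example, and it swaps the plain list for collections.deque, on the assumption that removing items from the front of a long list with pop(0) could become slow on large inputs.

```python
from collections import deque


def extract_bfs(data, keys):
    """Hypothetical deque-based variant of the same level-by-level search."""
    out = []
    queue = deque([data])
    while queue:                               # step 5: repeat until queue is empty
        current = queue.popleft()              # step 1: take the first element
        if isinstance(current, dict):
            for key in keys:                   # step 2: collect the wanted keys
                if key in current:
                    out.append({key: current[key]})
            for val in current.values():       # step 3: queue nested containers
                if isinstance(val, (list, dict)):
                    queue.append(val)
        elif isinstance(current, list):
            queue.extend(current)              # step 4: queue each list element
    return out


# Made-up sample: "d" is nested one level deeper than "c".
sample = {
    "videoID": "a",
    "links": [
        {"videoID": "b", "links": [{"videoID": "d"}]},
        {"videoID": "c"},
    ],
}

print(extract_bfs(sample, ["videoID"]))
# [{'videoID': 'a'}, {'videoID': 'b'}, {'videoID': 'c'}, {'videoID': 'd'}]
```

Note that "c" is reported before "d" even though "d" appears earlier in the data: the queue finishes each nesting level before moving on to the next one.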

Extending Its Functionality

def extract(data, keys):
    out = []
    queue = [data]
    while len(queue) > 0:
        current = queue.pop(0)
        if isinstance(current, dict):
            for key in keys:        # CHANGE THIS BLOCK
                if key in current:
                    out.append({key:current[key]})

            for val in current.values():
                if isinstance(val, (list, dict)):
                    queue.append(val)
        elif isinstance(current, list):
            queue.extend(current)
    return out

x = extract(data, ["videoID"])
print(x)

To change the behaviour of this function, change this block of code:

for key in keys:
    if key in current:
        out.append({key:current[key]})

Currently, this block of code simply adds ANY key-value pair whose key appears in keys into our output. If you wish to change the way this works, e.g. to add certain key-value pairs only when a condition holds, simply change this block of code to suit your needs.
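
As one possible way to do this, here is a minimal sketch that only keeps a pair when a caller-supplied condition is true. The names extract_if, predicate and has_even_number, as well as the small sample, are hypothetical and exist only for this illustration; the rest of the function follows the same queue logic as before.

```python
def extract_if(data, keys, predicate):
    """Like extract, but keep a pair only if predicate(key, value) is true.

    extract_if and predicate are hypothetical names used for this sketch.
    """
    out = []
    queue = [data]
    while len(queue) > 0:
        current = queue.pop(0)
        if isinstance(current, dict):
            for key in keys:
                # the changed block: an extra condition on the pair
                if key in current and predicate(key, current[key]):
                    out.append({key: current[key]})

            for val in current.values():
                if isinstance(val, (list, dict)):
                    queue.append(val)
        elif isinstance(current, list):
            queue.extend(current)
    return out


def has_even_number(key, value):
    # Assumes IDs end in three digits, as in "vid004".
    return int(value[-3:]) % 2 == 0


sample = {
    "type": "video",
    "videoID": "vid001",
    "links": [
        {"type": "video", "videoID": "vid002"},
        {"type": "video", "videoID": "vid003",
         "links": [{"type": "video", "videoID": "vid004"}]},
    ],
}

print(extract_if(sample, ["videoID"], has_even_number))
# [{'videoID': 'vid002'}, {'videoID': 'vid004'}]
```

Passing the condition in as a function keeps the search loop untouched, so different filters can be tried without editing the function each time.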
