
When we call certain APIs, the format and structure of the returned JSON object can be pretty messy and confusing. For instance (this example is considered mild BTW):
data = {
    "type": "video",
    "videoID": "vid001",
    "links": [
        {"type": "video", "videoID": "vid002", "links": []},
        {
            "type": "video",
            "videoID": "vid003",
            "links": [
                {"type": "video", "videoID": "vid004"},
                {"type": "video", "videoID": "vid005"},
            ],
        },
        {"type": "video", "videoID": "vid006"},
        {
            "type": "video",
            "videoID": "vid007",
            "links": [
                {
                    "type": "video",
                    "videoID": "vid008",
                    "links": [
                        {
                            "type": "video",
                            "videoID": "vid009",
                            "links": [{"type": "video", "videoID": "vid010"}],
                        }
                    ],
                }
            ],
        },
    ],
}
Unfortunately this happens more often than I wish it did.
Python Code To Extract Specific Key-Value Pairs
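def extract(data, keys):
    out = []
    queue = [data]  # structures still waiting to be searched
    while len(queue) > 0:
        current = queue.pop(0)  # take the oldest item first
        if isinstance(current, dict):
            # collect every requested key present in this dictionary
            for key in keys:
                if key in current:
                    out.append({key: current[key]})
            # queue nested lists and dictionaries to search later
            for val in current.values():
                if isinstance(val, (list, dict)):
                    queue.append(val)
        elif isinstance(current, list):
            queue.extend(current)
    return out

x = extract(data, ["videoID"])
print(x)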
Here, we wish to extract all videoIDs from the messy dictionary, so we pass ["videoID"] as the keys argument. Because the queue is processed first-in, first-out, the search is breadth-first: shallower videoIDs appear before more deeply nested ones. The output (wrapped here for readability):
[{'videoID': 'vid001'}, {'videoID': 'vid002'}, {'videoID': 'vid003'},
{'videoID': 'vid006'}, {'videoID': 'vid007'}, {'videoID': 'vid004'},
{'videoID': 'vid005'}, {'videoID': 'vid008'}, {'videoID': 'vid009'},
{'videoID': 'vid010'}]
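Note that extract walks the data breadth-first, so its results are grouped by nesting level rather than by where they sit in the document. If you would rather get matches roughly in the order they appear, a recursive depth-first walk is one option. The sketch below is an alternative technique, not the function above; the name extract_in_order and the small sample dictionary are hypothetical and used only for illustration.

```python
def extract_in_order(data, keys):
    """Depth-first variant: a dict's own matches come first, then its nested values."""
    out = []
    if isinstance(data, dict):
        for key in keys:
            if key in data:
                out.append({key: data[key]})
        # descend into each value immediately instead of queueing it
        for val in data.values():
            out.extend(extract_in_order(val, keys))
    elif isinstance(data, list):
        for item in data:
            out.extend(extract_in_order(item, keys))
    return out  # strings, numbers, etc. fall through and contribute nothing

# Illustrative sample, not taken from a real API
sample = {
    "videoID": "a",
    "links": [{"videoID": "b", "links": [{"videoID": "d"}]}, {"videoID": "c"}],
}
print(extract_in_order(sample, ["videoID"]))
```

Here "d" is reported right after its parent "b", before "c". Since this version recurses, extremely deep nesting could hit Python's recursion limit, which the queue-based version avoids.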
The Logic Behind The Code
We need to keep track of two lists: 1) out, which contains our output, and 2) queue, which contains the data structures we still need to search. We first initialize out as an empty list, and queue as a list containing our entire JSON data.
1. Remove the first element from the queue, and assign it to current.
2. If current is a dictionary, search it for the keys that we want, and add any found key-value pairs into out.
3. Then add all values that are either lists or dictionaries back into queue so we can search them again later.
4. If current is a list, we add everything inside current back into queue, so we can search the individual elements later. This can be done using the .extend method.
5. Repeat steps 1–4 until queue is empty.
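The steps above remove items from the front of a plain list with pop(0), which has to shift every remaining element each time. For large responses, one possible alternative is collections.deque from the standard library, whose popleft() removes from the front in constant time. The sketch below follows the same steps under that assumption; the name extract_deque and the sample dictionary are made up for illustration.

```python
from collections import deque

def extract_deque(data, keys):
    """Same breadth-first search as the steps above, with a deque as the queue."""
    out = []
    queue = deque([data])
    while queue:
        current = queue.popleft()  # step 1: take the first element
        if isinstance(current, dict):
            for key in keys:  # step 2: collect matching pairs
                if key in current:
                    out.append({key: current[key]})
            for val in current.values():  # step 3: queue nested containers
                if isinstance(val, (list, dict)):
                    queue.append(val)
        elif isinstance(current, list):
            queue.extend(current)  # step 4: queue each list element
    return out

# Illustrative sample, not taken from a real API
sample = {
    "videoID": "a",
    "links": [{"videoID": "b", "links": [{"videoID": "d"}]}, {"videoID": "c"}],
}
print(extract_deque(sample, ["videoID"]))
```

The results come back in the same breadth-first order as the list-based version; only the cost of removing from the front of the queue changes.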
Extending Its Functionality
def extract(data, keys):
    out = []
    queue = [data]
    while len(queue) > 0:
        current = queue.pop(0)
        if isinstance(current, dict):
            for key in keys:  # CHANGE THIS BLOCK
                if key in current:
                    out.append({key: current[key]})
            for val in current.values():
                if isinstance(val, (list, dict)):
                    queue.append(val)
        elif isinstance(current, list):
            queue.extend(current)
    return out

x = extract(data, ["videoID"])
print(x)
To change the behaviour of this function, change this block of code:
for key in keys:
    if key in current:
        out.append({key: current[key]})
Currently, this block simply adds any key-value pair whose key appears in keys to the output. If you wish to change how this works, e.g. to add certain key-value pairs only when a condition holds, change this block to suit your needs.
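As one illustration of such a change, the sketch below adds a pair only when a caller-supplied test passes. The function name extract_if, the predicate parameter, and the sample data are assumptions made for this example; treat it as a starting point rather than the one right way to do it.

```python
def extract_if(data, keys, predicate):
    """Like extract, but keeps a pair only when predicate(key, value) is true."""
    out = []
    queue = [data]
    while queue:
        current = queue.pop(0)
        if isinstance(current, dict):
            for key in keys:
                # the modified block: add the pair only if it passes the test
                if key in current and predicate(key, current[key]):
                    out.append({key: current[key]})
            for val in current.values():
                if isinstance(val, (list, dict)):
                    queue.append(val)
        elif isinstance(current, list):
            queue.extend(current)
    return out

# Illustrative sample, not taken from a real API
sample = {
    "videoID": "vid001",
    "links": [{"videoID": "pl002"}, {"videoID": "vid003"}],
}
only_vids = extract_if(sample, ["videoID"], lambda k, v: v.startswith("vid"))
print(only_vids)
```

With this sample, the pair whose value is "pl002" is skipped because it fails the startswith("vid") test, while the other two are kept.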