When we call certain APIs, the format and structure of the returned JSON object can be pretty messy and confusing. For instance (this example is considered mild, BTW):
data = {
    "type": "video",
    "videoID": "vid001",
    "links": [
        {"type":"video", "videoID":"vid002", "links":[]},
        { "type":"video",
          "videoID":"vid003",
          "links": [
              {"type": "video", "videoID":"vid004"},
              {"type": "video", "videoID":"vid005"},
          ]
        },
        {"type":"video", "videoID":"vid006"},
        { "type":"video",
          "videoID":"vid007",
          "links": [
              {"type":"video", "videoID":"vid008", "links": [
                  { "type":"video",
                    "videoID":"vid009",
                    "links": [{"type":"video", "videoID":"vid010"}]
                  }
              ]}
          ]},
    ]
}
Unfortunately this happens more often than I wish it did.
Python Code To Extract Specific Key-Value Pairs
def extract(data, keys):
    out = []
    queue = [data]
    while len(queue) > 0:
        current = queue.pop(0)
        if type(current) == dict:
            # collect every wanted key found in this dictionary
            for key in keys:
                if key in current:
                    out.append({key:current[key]})
            # queue nested lists and dictionaries to search later
            for val in current.values():
                if type(val) in [list, dict]:
                    queue.append(val)
        elif type(current) == list:
            queue.extend(current)
    return out

x = extract(data, ["videoID"])
print(x)
Here, we wish to extract all videoIDs from the messy dictionary, so we pass ["videoID"] as the keys argument. The output:
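[{'videoID': 'vid001'}, {'videoID': 'vid002'}, {'videoID': 'vid003'},
{'videoID': 'vid006'}, {'videoID': 'vid007'}, {'videoID': 'vid004'},
{'videoID': 'vid005'}, {'videoID': 'vid008'}, {'videoID': 'vid009'},
{'videoID': 'vid010'}]
Note that the IDs come out level by level rather than in the order they appear in the source: vid006 and vid007 sit at the same depth as vid002 and vid003, so they are found before the more deeply nested vid004 and vid005.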
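The result is a list of single-key dictionaries. If only the values are needed, one option is to flatten it with a list comprehension. This is a small sketch: the names pairs and video_ids are made up for illustration, and pairs is a shortened stand-in for what extract returns.

```python
# A shortened stand-in for the list returned by extract(data, ["videoID"]).
pairs = [{"videoID": "vid001"}, {"videoID": "vid002"}, {"videoID": "vid003"}]

# Pull the value out of each single-key dictionary.
video_ids = [value for pair in pairs for value in pair.values()]
print(video_ids)  # ['vid001', 'vid002', 'vid003']
```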
The Logic Behind The Code
We need to keep track of two lists: 1) out, which contains our output, and 2) queue, which contains the data structures we wish to search. We first initialize out as an empty list, and queue to contain our entire JSON data.
1. Remove the first element from queue, and assign it to current.
2. If current is a dictionary, search it for the keys that we want, and add any found key-value pairs into out.
3. Then add all of its values that are either lists or dictionaries back into queue so we can search them later.
4. If current is a list, add everything inside current back into queue, so we can search the individual elements later. This can be done using the .extend method.
5. Repeat steps 1 to 4 until queue is empty.
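The steps above can also be sketched with collections.deque from the standard library, which removes the first element in constant time (list.pop(0) has to shift every remaining element). This is only an illustrative variant, not the function from this article: the name extract_deque and the sample data are made up, and it uses isinstance, which, unlike the type(...) == checks, also accepts subclasses of dict and list.

```python
from collections import deque

def extract_deque(data, keys):
    """Breadth-first search for the given keys (hypothetical variant of extract)."""
    out = []
    queue = deque([data])              # start with the whole structure
    while queue:                       # repeat until nothing is left to search
        current = queue.popleft()      # take the first element off the queue
        if isinstance(current, dict):
            for key in keys:           # collect the key-value pairs we want
                if key in current:
                    out.append({key: current[key]})
            for val in current.values():
                if isinstance(val, (list, dict)):
                    queue.append(val)  # nested containers get searched later
        elif isinstance(current, list):
            queue.extend(current)      # every list element gets searched later
    return out

# Made-up sample: "c" is one level down, "d" is two levels down.
sample = {
    "videoID": "a",
    "links": [{"videoID": "b", "links": [{"videoID": "d"}]}, {"videoID": "c"}],
}
print(extract_deque(sample, ["videoID"]))
# [{'videoID': 'a'}, {'videoID': 'b'}, {'videoID': 'c'}, {'videoID': 'd'}]
```

Because the search goes level by level, "c" is found before the more deeply nested "d", even though "d" appears earlier in the source.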
Extending Its Functionality
def extract(data, keys):
    out = []
    queue = [data]
    while len(queue) > 0:
        current = queue.pop(0)
        if type(current) == dict:
            # CHANGE THIS BLOCK
            for key in keys:
                if key in current:
                    out.append({key:current[key]})
            for val in current.values():
                if type(val) in [list, dict]:
                    queue.append(val)
        elif type(current) == list:
            queue.extend(current)
    return out

x = extract(data, ["videoID"])
print(x)
To change the behaviour of this function, change this block of code:
for key in keys:
    if key in current:
        out.append({key:current[key]})
Currently, this block of code simply adds ANY key-value pair whose key appears in keys into our output. If you wish to change the way this works, e.g. to conditionally add certain key-value pairs into the output, simply change this block of code to suit your needs.
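As one possible way to do this, the condition can be passed in as a function. The sketch below is an assumption about how you might structure it, not part of the original function: the name extract_if, the predicate parameter and the sample data are all made up for illustration.

```python
def extract_if(data, keys, predicate):
    """Like extract, but keeps a pair only when predicate(key, value) is true.

    Hypothetical variant; predicate is any function taking (key, value).
    """
    out = []
    queue = [data]
    while queue:
        current = queue.pop(0)
        if isinstance(current, dict):
            for key in keys:
                # Modified block: the extra predicate check decides what is kept.
                if key in current and predicate(key, current[key]):
                    out.append({key: current[key]})
            for val in current.values():
                if isinstance(val, (list, dict)):
                    queue.append(val)
        elif isinstance(current, list):
            queue.extend(current)
    return out

# Made-up sample data in the same shape as the example above.
sample = {
    "videoID": "vid001",
    "links": [{"videoID": "vid002"}, {"videoID": "vid013", "links": []}],
}

# Keep only IDs whose numeric part is below 10.
small = extract_if(sample, ["videoID"], lambda key, value: int(value[3:]) < 10)
print(small)  # [{'videoID': 'vid001'}, {'videoID': 'vid002'}]
```

Passing the condition as an argument means the search loop itself does not need to be edited each time the filtering rule changes.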