
A Quick Trick to Make DataFrames with Uneven Array Lengths

A combination of two simple techniques I learned in the past two weeks.


Introduction

These past few weeks, I have been extensively using Pandas after a many-month hiatus. In doing so, I have of course come across a plethora of errors that a younger version of me could likely have solved without a second thought. But alas, I was forced to scour the internet for a solution.

One of the most relatable issues I had was poorly formatted data. It is no secret that real data is almost never clean, and the majority of a data scientist's work consists of cleaning and organizing, not pushing out fancy ML models on a daily basis.

In particular, I dealt with an issue where I had data made up of multiple arrays of different lengths. I naively assumed that I could still easily define a DataFrame, but much to my chagrin, I was met with the following complaint from the Python interpreter:

>>> import pandas as pd
>>> my_data = {"A": [1, 2, 3], "B": [1, 2, 3, 4, 5, 6], "C": [1, 2]}
>>> pd.DataFrame(my_data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\murta\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\frame.py", line 614, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "C:\Users\murta\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\internals\construction.py", line 464, in dict_to_mgr
    return arrays_to_mgr(
  File "C:\Users\murta\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\internals\construction.py", line 119, in arrays_to_mgr
    index = _extract_index(arrays)
  File "C:\Users\murta\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\internals\construction.py", line 635, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

And so, I set out in search of a solution to my problem. After searching the recesses of the internet, I came upon the following nifty trick:

>>> my_data = dict([(k, pd.Series(v)) for k, v in my_data.items()])
>>> my_data
{'A': 0    1
1    2
2    3
dtype: int64, 'B': 0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64, 'C': 0    1
1    2
dtype: int64}
>>> pd.DataFrame(my_data)
     A  B    C
0  1.0  1  1.0
1  2.0  2  2.0
2  3.0  3  NaN
3  NaN  4  NaN
4  NaN  5  NaN
5  NaN  6  NaN

Let's break down what the above code does:

  • We turn our original dictionary into a list of tuples, where the first item in each tuple is the key and the second item is the original list (that key's value) converted into a Pandas Series. A list comprehension lets us do this transformation in one line.
  • Then, we convert this list of tuples back into a dictionary, where the keys are still our original letters, but each value is now a Pandas Series.

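Wrapping each list in a Series works because every Series carries its own index: when Pandas builds the DataFrame, it aligns the columns on the union of those indexes and fills the missing positions with NaN. That is also why columns A and C now show floats, since NaN is a floating-point value. And with that, we are able to successfully define our DataFrame!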
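As a side note, the same conversion can be sketched with a dict comprehension instead of a list of tuples. This is just an equivalent spelling, not a different technique; the names series_data and df are my own, chosen for illustration.

```python
import pandas as pd

my_data = {"A": [1, 2, 3], "B": [1, 2, 3, 4, 5, 6], "C": [1, 2]}

# Build the dictionary of Series directly, skipping the intermediate
# list of (key, Series) tuples.
series_data = {key: pd.Series(values) for key, values in my_data.items()}

# The Series are aligned on their indexes; shorter ones are padded with NaN.
df = pd.DataFrame(series_data)

print(df.shape)         # rows = length of the longest list, one column per key
print(df.isna().sum())  # number of NaN padding cells in each column
```

Counting the NaNs per column like this is a quick way to see how much padding each array needed.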

But wait, what now?

You may have noticed this is not a perfect solution. More specifically, the NaNs that pad out the shorter columns are a nuisance, because they will likely interfere with whatever analysis you want to conduct next. There are multiple possible solutions (the most common one simply being to replace the NaNs with a fixed numerical value), but I would like to discuss a lesser-known one that fit my needs better.
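For reference, here is a minimal sketch of that common fixed-value approach, using the DataFrame's fillna method. The placeholder value 0 is an arbitrary assumption on my part; whether any single value is appropriate depends entirely on your data.

```python
import pandas as pd

my_data = {"A": [1, 2, 3], "B": [1, 2, 3, 4, 5, 6], "C": [1, 2]}
df = pd.DataFrame({k: pd.Series(v) for k, v in my_data.items()})

# Replace every NaN with one fixed placeholder value.
# 0 is an arbitrary choice made only for this illustration.
filled = df.fillna(0)

print(filled)
```

Note that the placeholder becomes indistinguishable from real data once it is in the DataFrame, which is exactly why it can distort a chart.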

My goal was to use my data to design a visualization with the declarative visualization library Altair. NaNs led to an error, and using a fixed replacement value led to an inaccurate graphic.

In search of this evasive solution, I did something I rarely do: I asked my own question on Stack Overflow. My query was swiftly answered with the following effective technique:

>>> import numpy as np
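>>> def f(x):
...     # Keep only the non-null values of this column.
...     vals = x[~x.isnull()].values
...     # Repeat those values cyclically until the column is full again.
...     vals = np.resize(vals, len(x))
...     return vals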
...
>>> my_df = pd.DataFrame(my_data)
>>> my_df = my_df.apply(f, axis=0)
>>> my_df
     A  B    C
0  1.0  1  1.0
1  2.0  2  2.0
2  3.0  3  1.0
3  1.0  4  2.0
4  2.0  5  1.0
5  3.0  6  2.0

Brilliant! NumPy's resize function automatically repeats values, cycling through the original array from the start, whenever the new array is larger than the original. And so, the above code 1) reduces each column to just its non-null values, 2) resizes those values back to the original column length, repeating them as needed, and 3) runs this function on every column via the DataFrame's apply method, which returns a new DataFrame built from the filled columns.
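To see the repeating behavior in isolation, here is a small sketch that calls np.resize on plain arrays, outside of any DataFrame. The variable names are my own.

```python
import numpy as np

short = np.array([1, 2, 3])

# The target length is a multiple of the original length:
# the values are repeated a whole number of times.
longer = np.resize(short, 6)
print(longer)

# The target length is not a multiple of the original length:
# the last repetition is simply cut off partway through.
uneven = np.resize(np.array([1, 2, 3, 4]), 6)
print(uneven)

# Caution: the array *method* ndarray.resize behaves differently;
# it pads with zeros instead of repeating values.
```

In the second case the cycle stops wherever the target length is reached, so a column whose length does not divide evenly into the longest one will end mid-repetition.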

Conclusion

This worked perfectly for me because it resulted in Altair just layering the duplicated marks on top of one another (it was a bubble plot, akin to a scatter plot in appearance but with much larger dots), and I was able to counteract the "heavier" appearance of the stacked marks by simply changing the opacity of all the bubbles so that they were uniformly darker. And so, my goal was achieved!

In this way, the above two tricks (combined into one) have been very helpful for me, and I hope they are helpful for you as well.

Until next time, folks!



