
How to Split a Large CSV File with Python

A guide to splitting a large CSV file based on input parameters.


As a Data Engineer, Data Scientist, or Data Analyst, each of us has at some point needed a split utility to break a CSV or other data file into smaller files in order to investigate it or test scenarios.

This piece of Python code helps split a CSV file into smaller files, either randomly or equally, based on input parameters.

It is easy to split files using pandas in Python. pandas has powerful options for picking a number of rows and skipping a number of rows, and this piece of code orchestrates how many rows to skip and how many to pick according to the total number of rows in the file.
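As a small, self-contained illustration of those two options (the in-memory CSV below is just a stand-in for a real file on disk):

```python
import io

import pandas as pd

## A tiny in-memory CSV: a header line plus 10 data rows
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(10))

## Skip the first 3 data rows (keeping the header), then pick the next 4
chunk = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 4), nrows=4)
print(chunk["id"].tolist())  # [3, 4, 5, 6]
```

Note that skiprows accepts either a plain count or an iterable of line indices; passing range(1, 4) leaves line 0, the header, intact.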

After the user enters the input file path and name, the code runs through the following steps:

Step 1 (Using Pandas): Find the number of rows from the files.

Here we read the CSV with pandas and take the row count from the dataframe index. The +1 counts the header line, so number_of_rows is the total number of lines in the file:

## find number of lines using Pandas
import pandas as pd

pd_dataframe = pd.read_csv(split_source_file, header=0)
number_of_rows = len(pd_dataframe.index) + 1

Step 1 (Using Traditional Python): Find the number of rows from the files.

Here we open the file and enumerate the data using a loop to find the number of rows:

## find number of lines using traditional python
with open(split_source_file, 'r') as fh:
    for count, line in enumerate(fh):
        pass
py_number_of_rows = count + 1  ## enumerate is zero-based, so add 1

Step 2: Ask the user for the number of lines per file (a range) and generate a random number.

Here the code asks the user to provide the minimum and maximum number of rows per file, and generates a random number between them:

import random

## In case of an equal split, provide the same number for min and max
min_rows = int(input("Minimum Number of rows per file? : "))
max_rows = int(input("Maximum Number of rows per file? : "))

## This value changes per file
number_of_rows_perfile = random.randint(min_rows, max_rows)

In case you want an equal split, provide the same number for min and max. For example, if you want every file to contain 5,000 rows, enter 5000 for both min_rows and max_rows.

Step 3: Read data from the file to pick and skip a number of rows using pandas.

This is the tricky part. To use pandas, we create a dataframe from the source file. While reading the data, the number of rows to read is the randomly generated number from the previous step, and the number of lines already consumed (the header plus the rows written to earlier files) is the number to skip.

## Read CSV file, picking a number of rows and skipping the lines
## already consumed (skip_rows starts at 1 so the header is skipped)
df = pd.read_csv(split_source_file, header=None,
                 nrows=number_of_rows_perfile, skiprows=skip_rows)

Step 4: Write Data from the dataframe to a CSV file using pandas.

In this step, we write the data from the dataframe created in Step 3 into a file. The filename is generated from the source file name and a running count of the files created.

## Write to csv
df.to_csv(split_target_file, index=False, header=False, mode='a', chunksize=number_of_rows_perfile)
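The snippet above references split_target_file without showing how it is built. One plausible sketch, where the file_count counter and the naming pattern are assumptions rather than the article's exact code:

```python
import os

def make_target_filename(split_source_file, file_count):
    ## e.g. "sales.csv" + counter 3 -> "sales_3.csv" (pattern assumed)
    base, ext = os.path.splitext(os.path.basename(split_source_file))
    return f"{base}_{file_count}{ext}"

print(make_target_filename("sales.csv", 3))  # sales_3.csv
```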

Step 5: Update the skipped rows and generate the row count for the next file after each small file is written.

After every small file is generated, the code needs to move ahead past the rows it has already read. In addition, the last part of the file usually doesn't match the randomly generated row count, so the final chunk has to be capped at whatever rows remain.

## Move past the rows already written
skip_rows += number_of_rows_perfile

## Last file handler: cap the final chunk at the rows that remain
number_of_rows_perfile = random.randint(min_rows, max_rows)
if skip_rows + number_of_rows_perfile > number_of_rows:
    number_of_rows_perfile = number_of_rows - skip_rows

Now, let's look at all of these pieces together.
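Stitched together, the steps above can be sketched as a single function. Variable names follow the article, but the wrapping function and the file-name pattern are assumptions, not the author's exact code:

```python
import os
import random

import pandas as pd

def split_csv(split_source_file, min_rows, max_rows):
    ## Step 1: total number of lines (+1 counts the header line)
    pd_dataframe = pd.read_csv(split_source_file, header=0)
    number_of_rows = len(pd_dataframe.index) + 1

    skip_rows = 1  # start just past the header line
    file_count = 0
    base, ext = os.path.splitext(split_source_file)

    while skip_rows < number_of_rows:
        ## Step 2: random chunk size (equal split if min_rows == max_rows)
        number_of_rows_perfile = random.randint(min_rows, max_rows)

        ## Step 5 / last file: cap the chunk at the rows that remain
        if skip_rows + number_of_rows_perfile > number_of_rows:
            number_of_rows_perfile = number_of_rows - skip_rows

        ## Step 3: pick the next chunk, skipping lines already consumed
        df = pd.read_csv(split_source_file, header=None,
                         nrows=number_of_rows_perfile, skiprows=skip_rows)

        ## Step 4: write the chunk (file-name pattern is an assumption)
        file_count += 1
        df.to_csv(f"{base}_{file_count}{ext}", index=False, header=False)

        ## Step 5: move past the rows just written
        skip_rows += number_of_rows_perfile

    return file_count
```

Calling split_csv("data.csv", 5000, 5000) would, under these assumptions, produce equal 5,000-row files named data_1.csv, data_2.csv, and so on.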

And there you have it. Thank you for reading.

We hope you found this article insightful. If you did, please share the link with your friends, family, and colleagues.

Would you like to encourage us to share insights on more topics? Please support us with a cup of coffee.



