python - Split pandas dataframe in two if it has more than 10 rows

41

votes

I have a huge CSV with many tables with many rows. I would like to simply split each dataframe into 2 if it contains more than 10 rows.

If true, I would like the first dataframe to contain the first 10 and the rest in the second dataframe.

Is there a convenient function for this? I've looked around but found nothing useful...

i.e. split_dataframe(df, 2(if > 10))?

pythonpandassplitdataframe

Im building a PowerPoint deck and pasting the data from the df to a chart on each slide, and charts with more then 10 rows become unreadable. - Boosted_d16

28

votes

This will return the split DataFrames if the condition is met, otherwise return the original and None (which you would then need to handle separately). Note that this assumes the splitting only has to happen one time per df and that the second part of the split (if it is longer than 10 rows (meaning that the original was longer than 20 rows)) is OK.

df_new1, df_new2 = df[:10, :], df[10:, :] if len(df) > 10 else df, None

Note you can also use df.head(10) and df.tail(len(df) - 10) to get the front and back according to your needs. You can also use various indexing approaches: you can just provide the first dimensions index if you want, such as df[:10] instead of df[:10, :] (though I like to code explicitly about the dimensions you are taking). You can can also use df.iloc and df.ix to index in similar ways.

Be careful about using df.loc however, since it is label-based and the input will never be interpreted as an integer position. .loc would only work "accidentally" in the case when you happen to have index labels that are integers starting at 0 with no gaps.

But you should also consider the various options that pandas provides for dumping the contents of the DataFrame into HTML and possibly also LaTeX to make better designed tables for the presentation (instead of just copying and pasting). Simply Googling how to convert the DataFrame to these formats turns up lots of tutorials and advice for exactly this application.

34

votes

I used a List Comprehension to cut a huge DataFrame into blocks of 100'000:

size = 100000
list_of_dfs = [df.loc[i:i+size-1,:] for i in range(0, len(df),size)]

or as generator:

list_of_dfs = (df.loc[i:i+size-1,:] for i in range(0, len(df),size))

18

votes

There is no specific convenience function.

You'd have to do something like:

first_ten = pd.DataFrame()
rest = pd.DataFrame()

if df.shape[0] > 10: # len(df) > 10 would also work
    first_ten = df[:10]
    rest = df[10:]

6

votes

A method based on np.split:

df = pd.DataFrame({    'A':[2,4,6,8,10,2,4,6,8,10],
                       'B':[10,-10,0,20,-10,10,-10,0,20,-10],
                       'C':[4,12,8,0,0,4,12,8,0,0],
                      'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})

listOfDfs = [df.loc[idx] for idx in np.split(df.index,5)]

A small function that uses a modulo could take care of cases where the split is not even (e.g. np.split(df.index,4) will throw an error).

(Yes, I am aware that the original question was somewhat more specific than this. However, this is supposed to answer the question in the title.)

3

votes

If you have a large data frame and need to divide into a variable number of sub data frames rows, like for example each sub dataframe has a max of 4500 rows, this script could help:

max_rows = 4500
dataframes = []
while len(df) > max_rows:
    top = df[:max_rows]
    dataframes.append(top)
    df = df[max_rows:]
else:
    dataframes.append(df)

You could then save out these data frames:

for _, frame in enumerate(dataframes):
    frame.to_csv(str(_)+'.csv', index=False)

Hope this helps someone!

2

votes

Below is a simple function implementation which splits a DataFrame to chunks and a few code examples:

import pandas as pd

def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []

    while True:
        if count > df_len-1:
            break

        start = count
        count += n
        #print("%s : %s" % (start, count))
        dfs.append(df.iloc[start : count])
    return dfs


# Create a DataFrame with 10 rows
df = pd.DataFrame([i for i in range(10)])

# Split the DataFrame to chunks of maximum size 2
split_df_to_chunks_of_2 = split_dataframe_to_chunks(df, 2)
print([len(i) for i in split_df_to_chunks_of_2])
# prints: [2, 2, 2, 2, 2]

# Split the DataFrame to chunks of maximum size 3
split_df_to_chunks_of_3 = split_dataframe_to_chunks(df, 3)
print([len(i) for i in split_df_to_chunks_of_3])
# prints [3, 3, 3, 1]

2

votes

def split_and_save_df(df, name, size, output_dir):
    """
    Split a df and save each chunk in a different csv file.

    Parameters:
        df : pandas df to be splitted
        name : name to give to the output file
        size : chunk size
        output_dir : directory where to write the divided df
    """
    import os
    for i in range(0, df.shape[0],size):
        start  = i
        end    = min(i+size-1, df.shape[0]) 
        subset = df.loc[start:end] 
        output_path = os.path.join(output_dir,f"{name}_{start}_{end}.csv")
        print(f"Going to write into {output_path}")
        subset.to_csv(output_path)
        output_size = os.stat(output_path).st_size
        print(f"Wrote {output_size} bytes")

1

votes

You can use the DataFrame head and tail methods as syntactic sugar instead of slicing/loc here. I use a split size of 3; for your example use headSize=10

def split(df, headSize) :
    hd = df.head(headSize)
    tl = df.tail(len(df)-headSize)
    return hd, tl

df = pd.DataFrame({    'A':[2,4,6,8,10,2,4,6,8,10],
                       'B':[10,-10,0,20,-10,10,-10,0,20,-10],
                       'C':[4,12,8,0,0,4,12,8,0,0],
                      'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})

# Split dataframe into top 3 rows (first) and the rest (second)
first, second = split(df, 3)

0

votes

The method based on list comprehension and groupby, which stores all the split dataframes in a list variable and can be accessed using the index.

Example:

ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]***
ans[0]
ans[0].column_name

python - Split pandas dataframe in two if it has more than 10 rows

9 Answers