3
votes

I have a question about pandas dataframes in Python. I have a large dataframe df that I split into two subsets, df1 and df2. Together, df1 and df2 do not make up all of df; they are just two mutually exclusive subsets of it. I want to plot them in ggplot2 with rpy2 and display the variables in the plot based on whether they come from df1 or df2. ggplot2 requires a melted dataframe, so I have to create a new dataframe with a column saying whether each entry came from df1 or df2, so that this column can be passed to ggplot. I tried doing it like this:

# add labels to df1, df2
df1["label"] = len(df1.index) * ["df1"]
df2["label"] = len(df2.index) * ["df2"]
# combine the dfs together
melted_df = pandas.concat([df1, df2])

Now it can be plotted as in:

# plot parameters from melted_df and colour them by df1 or df2
ggplot2.ggplot(melted_df) + ggplot2.aes_string(..., colour="label")

My question is whether there's an easier, shorthand way of doing this. ggplot requires constantly melting and unmelting dfs, and it seems cumbersome to always manually add the melted form to distinct subsets of df. Thanks.

1
One shortcut would be to replace df1["label"] = len(df1.index) * ["df1"] with df1["label"] = "df1". – beardc

1 Answer

2
votes

Certainly you can simplify by using:

df1['label'] = 'df1'

(rather than df1["label"] = len(df1.index) * ["df1"].)
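As a quick check that the two forms are equivalent, here is a minimal sketch. The toy dataframe and the explicit .copy() are assumptions for illustration (the copy avoids pandas warning about writing to a slice of df); they are not from the question.

```python
import pandas as pd

# Toy data, assumed purely for illustration
df = pd.DataFrame({"x": [1, 2, 3, 4]})
df1 = df[df["x"] <= 2].copy()  # copy so the subset can be modified safely

# Short form: a scalar is broadcast to every row
df1["label"] = "df1"
by_scalar = df1["label"].tolist()

# Long form from the question: build a list of the right length
df1["label"] = len(df1.index) * ["df1"]
by_list = df1["label"].tolist()
```

Both assignments should leave the same values in the label column.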

If you find yourself doing this a lot, why not write your own function? Something like this:

def plot_dfs(dfs):
    for i, df in enumerate(dfs):
        # parenthesise i + 1: % binds tighter than +, so 'df%s' % i+1 raises a TypeError
        df['label'] = 'df%s' % (i + 1)  # note: this *changes* df
    melted_df = pd.concat(dfs)

    # plot parameters from melted_df and colour them by df1 or df2
    ggplot2.ggplot(melted_df) + ggplot2.aes_string(..., colour="label")

    return # the melted_df or ggplot ?
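If changing the input dataframes is a concern, one possible variant builds the labelled copies with DataFrame.assign, which returns a new dataframe instead of modifying its input. This is only a sketch: the helper name label_and_concat, its labels parameter, and the toy data are hypothetical, and the plotting step is left out.

```python
import pandas as pd

def label_and_concat(dfs, labels=None):
    """Return one dataframe with a 'label' column naming each row's source."""
    if labels is None:
        # Default labels follow the df1, df2, ... naming used above
        labels = ["df%d" % (i + 1) for i in range(len(dfs))]
    # assign() returns a new dataframe, so the inputs are left unchanged
    labelled = [df.assign(label=name) for df, name in zip(dfs, labels)]
    return pd.concat(labelled)

# Toy data, assumed purely for illustration
df = pd.DataFrame({"x": range(6), "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
df1 = df[df["x"] < 2]   # two mutually exclusive subsets that
df2 = df[df["x"] > 3]   # do not cover all of df

melted_df = label_and_concat([df1, df2])
```

Here melted_df can be handed to ggplot as before, while df1 and df2 keep their original columns.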