2
votes

I've been working with Power Query for a bit to process some data and I was impressed at how flexible the framework is. Currently, I am looking to replicate a conditional column step in Pandas as I would like to include it in an automated data cleaning script pipeline.

Power Query demo

In this case, Power Query creates a new column called acc_col, looks at each column in the dataset (Tags.1, Tags.2, etc) and if the string in that column matches the beginning of the value (Acceleration-) then it outputs that value into the new column, else if no match is found it outputs Unknown Acc. This is how it looks from the Editor

 #"Added Conditional Column" = Table.AddColumn(#"Replaced Value", "acc_col", each if Text.StartsWith([Tags.1], "Acceleration-") then [Tags.1] else if Text.StartsWith([Tags.2], "Acceleration-") then [Tags.2] else if Text.StartsWith([Tags.3], "Acceleration-") then [Tags.3] else if Text.StartsWith([Tags.4], "Acceleration-") then [Tags.4] else if Text.StartsWith([Tags.5], "Acceleration-") then [Tags.5] else "Unknown Acc")

I have tried some things with Pandas but my knowledge is a little bit limited. I managed to read one of the Tag columns using the following

Tags0 Tags1 Tags2
Alumni-2017,Acceleration-2016 None None
Alumni Acceleration-2017 None
Acceleration-2015 None None
Alumni-2017 Acceleration-2015 None
Alumni-2017 Acceleration-2014 None
df['acc_col'] = df['Tags0'].where(df['Tags0'].str.contains('Acceleration', na=False), )
Tags0 Tags1 Tags2 acc_col
Alumni-2017,Acceleration-2016 None None Acceleration-2016
Alumni Acceleration-2017 None None
Acceleration-2015 None None Acceleration-2015
Alumni-2017 Acceleration-2015 None None
Alumni-2017 Acceleration-2014 None None

I see that the output took all those that contained the keyword but if I wish to do the same with the other columns it overwrites the previous results. I need them all to be on the same column as it reads through one by one.

 df['acc_col'] = df['Tags1'].where(df['Tags1'].str.contains('Acceleration', na=False), )
Tags0 Tags1 Tags2 acc_col
Alumni-2017,Acceleration-2016 None None None
Alumni Acceleration-2017 None Acceleration-2017
Acceleration-2015 None None None
Alumni-2017 Acceleration-2015 None Acceleration-2015
Alumni-2017 Acceleration-2014 None Acceleration-2014

I feel like I'm close but I just need a bit more assistance.

2
kindly add sample dataframe, with expected output. kindly add data, not pics, or redirectssammywemmy
@sammywemmy hey i edited the look so that im including the dataframes. I wanted to add the pic as a reference to what I do in powerBIErnesto

2 Answers

1
votes

I think I managed to answer my own question. I simply had to add the outcome of the condition to another pd.where statement and continue until I have scanned all columns.

df['Acceleration'] = df['Tags0'].where(df['Tags0'].str.contains('Acceleration', na=False), 
                                                       df['Tags1'].where(df['Tags1'].str.contains('Acceleration', na=False),
                                                       df['Tags2'].where(df['Tags2'].str.contains('Acceleration', na=False),
                                                       df['Tags3'].where(df['Tags3'].str.contains('Acceleration', na=False),
                                                       df['Tags4'].where(df['Tags4'].str.contains('Acceleration', na=False),'Unknown')))))
0
votes

Try this and see if it is what you are after:

transform

Assumption here is that all columns are strings; if not, you can do some modification by selecting dtypes first:

condition = df.transform(lambda x: x.str.contains("Acceleration", na=False))

Create the new column using the where function on the entire dataframe, and forward filling on the columns axis to get your desired column:

df.assign(acc_col=df.where(condition).ffill(axis = 'columns').iloc[:, -1])

                           Tags0              Tags1 Tags2                        acc_col
0  Alumni-2017,Acceleration-2016               None  None  Alumni-2017,Acceleration-2016
1                         Alumni  Acceleration-2017  None              Acceleration-2017
2              Acceleration-2015               None  None              Acceleration-2015
3                    Alumni-2017  Acceleration-2015  None              Acceleration-2015
4                    Alumni-2017  Acceleration-2014  None              Acceleration-2014

If there are any nulls in acc_col, you can use the replace function or fillna to fill it with your preferred value ('Unknown').

In Power Query, I think you could possibly convert the columns into a list, and run the search/extraction within the list, (possibly use List.Contains) combined with a single if statements, rather than individual if/else checks. This may require you to write M code. My power query skills has not been called into action in a while, so I may be wrong on the suggestion.