I've been working with Power Query for a bit to process some data and I was impressed at how flexible the framework is. Currently, I am looking to replicate a conditional column step in Pandas as I would like to include it in an automated data cleaning script pipeline.
In this case, Power Query creates a new column called acc_col, looks at each column in the dataset (Tags.1, Tags.2, etc) and if the string in that column matches the beginning of the value (Acceleration-) then it outputs that value into the new column, else if no match is found it outputs Unknown Acc. This is how it looks from the Editor
#"Added Conditional Column" = Table.AddColumn(#"Replaced Value", "acc_col", each if Text.StartsWith([Tags.1], "Acceleration-") then [Tags.1] else if Text.StartsWith([Tags.2], "Acceleration-") then [Tags.2] else if Text.StartsWith([Tags.3], "Acceleration-") then [Tags.3] else if Text.StartsWith([Tags.4], "Acceleration-") then [Tags.4] else if Text.StartsWith([Tags.5], "Acceleration-") then [Tags.5] else "Unknown Acc")
I have tried some things with Pandas but my knowledge is a little bit limited. I managed to read one of the Tag columns using the following
Tags0 | Tags1 | Tags2 |
---|---|---|
Alumni-2017,Acceleration-2016 | None | None |
Alumni | Acceleration-2017 | None |
Acceleration-2015 | None | None |
Alumni-2017 | Acceleration-2015 | None |
Alumni-2017 | Acceleration-2014 | None |
df['acc_col'] = df['Tags0'].where(df['Tags0'].str.contains('Acceleration', na=False), )
Tags0 | Tags1 | Tags2 | acc_col |
---|---|---|---|
Alumni-2017,Acceleration-2016 | None | None | Acceleration-2016 |
Alumni | Acceleration-2017 | None | None |
Acceleration-2015 | None | None | Acceleration-2015 |
Alumni-2017 | Acceleration-2015 | None | None |
Alumni-2017 | Acceleration-2014 | None | None |
I see that the output took all those that contained the keyword but if I wish to do the same with the other columns it overwrites the previous results. I need them all to be on the same column as it reads through one by one.
df['acc_col'] = df['Tags1'].where(df['Tags1'].str.contains('Acceleration', na=False), )
Tags0 | Tags1 | Tags2 | acc_col |
---|---|---|---|
Alumni-2017,Acceleration-2016 | None | None | None |
Alumni | Acceleration-2017 | None | Acceleration-2017 |
Acceleration-2015 | None | None | None |
Alumni-2017 | Acceleration-2015 | None | Acceleration-2015 |
Alumni-2017 | Acceleration-2014 | None | Acceleration-2014 |
I feel like I'm close but I just need a bit more assistance.