I'm an avid R user and am learning Python along the way. One piece of example code that I can easily run in R is perplexing me in Python.
Here's the original data (constructed within R):
library(tidyverse)
df <- tribble(
  ~name,   ~age, ~gender, ~height_in,
  "john",  20,   'm',     66,
  'mary',  NA,   'f',     62,
  NA,      38,   'f',     68,
  'larry', NA,   NA,      NA
)
The output of this looks like this:
df
# A tibble: 4 x 4
  name    age gender height_in
  <chr> <dbl> <chr>      <dbl>
1 john     20 m             66
2 mary     NA f             62
3 NA       38 f             68
4 larry    NA NA            NA
I want to do 3 things:
- I want to replace the NA values in columns that are characters with the value "zz"
- I want to replace the NA values in columns that are numeric with the value 0
- I want to convert the character columns to factors.
Here's how I did it in R (again, using the tidyverse package):
tmp <- df %>%
  mutate_if(is.character, function(x) ifelse(is.na(x), "zz", x)) %>%
  mutate_if(is.character, as.factor) %>%
  mutate_if(is.numeric, function(x) ifelse(is.na(x), 0, x))
Here's the output of the dataframe tmp:
tmp
# A tibble: 4 x 4
  name    age gender height_in
  <fct> <dbl> <fct>      <dbl>
1 john     20 m             66
2 mary      0 f             62
3 zz       38 f             68
4 larry     0 zz             0
I'm familiar with if and else statements in Python. What I don't know is the correct and most readable way to write the above in Python. I'm guessing that pandas has no direct equivalent of mutate_if. My question is: what Python code mimics the mutate_if, is.character, is.numeric, and as.factor functions from R and the tidyverse?
On a side note, I care less about speed or efficiency of execution than about readability, which is why I really enjoy the tidyverse. I would be grateful for any tips or suggestions.
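Since readability is the goal, one direction I can imagine is a small helper that imitates mutate_if. The sketch below is only an assumption about how that could look: `mutate_if` and `is_character` are hypothetical names I made up (pandas does not ship them), while `is_numeric_dtype`, `is_object_dtype`, `is_string_dtype`, `DataFrame.pipe`, `Series.fillna`, and `astype("category")` are existing pandas calls. The "category" dtype is the closest thing to an R factor that I know of.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_object_dtype, is_string_dtype


def is_character(col):
    """Hypothetical stand-in for R's is.character: True for text-like columns."""
    # Text columns are stored as object dtype or a dedicated string dtype,
    # depending on the pandas version, so both are checked here.
    return is_object_dtype(col) or is_string_dtype(col)


def mutate_if(df, predicate, func):
    """Hypothetical helper (not part of pandas): apply func to each column
    for which predicate(column) is true, and return a new DataFrame."""
    out = df.copy()
    for name in out.columns:
        if predicate(out[name]):
            out[name] = func(out[name])
    return out


df = pd.DataFrame({
    "name": ["john", "mary", np.nan, "larry"],
    "age": [20, np.nan, 38, np.nan],
    "gender": ["m", "f", "f", np.nan],
    "height_in": [66, 62, 68, np.nan],
})

# Mirrors the three mutate_if steps from the R pipeline.
tmp = (
    df.pipe(mutate_if, is_character, lambda col: col.fillna("zz"))
      .pipe(mutate_if, is_character, lambda col: col.astype("category"))
      .pipe(mutate_if, is_numeric_dtype, lambda col: col.fillna(0))
)

print(tmp)
print(tmp.dtypes)
```

Chaining with `pipe` keeps the top-to-bottom reading order of the `%>%` version, at the cost of maintaining a helper that is not part of pandas.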
Edit 1: adding code to create a pandas dataframe
Here is the code I used to create the dataframe within Python. This may assist others in getting started.
import pandas as pd
import numpy as np
my_dict = {
    'name': ['john', 'mary', np.nan, 'larry'],
    'age': [20, np.nan, 38, np.nan],
    'gender': ['m', 'f', 'f', np.nan],
    'height_in': [66, 62, 68, np.nan],
}
df = pd.DataFrame(my_dict)
The output should look similar to the R tibble:
print(df)
    name   age gender  height_in
0   john  20.0      m       66.0
1   mary   NaN      f       62.0
2    NaN  38.0      f       68.0
3  larry   NaN    NaN        NaN
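For reference, here is a minimal sketch of what the three steps might look like on this dataframe using only built-in pandas methods. It rests on two assumptions of mine: that `select_dtypes` is an acceptable substitute for the is.numeric / is.character checks, and that every non-numeric column should be treated as a character column (true for this data, not in general). The names `num_cols` and `chr_cols` are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "mary", np.nan, "larry"],
    "age": [20, np.nan, 38, np.nan],
    "gender": ["m", "f", "f", np.nan],
    "height_in": [66, 62, 68, np.nan],
})

# Split the column names by type; "number" matches all numeric dtypes.
num_cols = df.select_dtypes(include="number").columns
# Assumption: anything that is not numeric is a character column.
chr_cols = df.select_dtypes(exclude="number").columns

tmp = df.copy()
tmp[num_cols] = tmp[num_cols].fillna(0)      # NA -> 0 in numeric columns
tmp[chr_cols] = tmp[chr_cols].fillna("zz")   # NA -> "zz" in character columns
# Convert the character columns to pandas' categorical dtype,
# which plays roughly the role of an R factor.
tmp = tmp.astype({col: "category" for col in chr_cols})

print(tmp)
print(tmp.dtypes)
```

Excluding numeric columns, rather than including a specific text dtype, sidesteps the question of whether a given pandas version stores strings as object or as a dedicated string dtype.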