5
votes

I'm an avid R user and am learning Python along the way. One piece of example code that I can easily run in R is perplexing me in Python.

Here's the original data (constructed within R):

library(tidyverse)


df <- tribble(~name, ~age, ~gender, ~height_in,
        "john",20,'m',66,
        'mary',NA,'f',62,
        NA,38,'f',68,
        'larry',NA,NA,NA
)

The output of this looks like this:

df

# A tibble: 4 x 4
  name    age gender height_in
  <chr> <dbl> <chr>      <dbl>
1 john     20 m             66
2 mary     NA f             62
3 NA       38 f             68
4 larry    NA NA            NA

I want to do 3 things:

  1. Replace the NA values in character columns with the value "zz"
  2. Replace the NA values in numeric columns with the value 0
  3. Convert the character columns to factors

Here's how I did it in R (again, using the tidyverse package):

tmp <- df %>%
  mutate_if(is.character, function(x) ifelse(is.na(x),"zz",x)) %>%
  mutate_if(is.character, as.factor) %>%
  mutate_if(is.numeric, function(x) ifelse(is.na(x), 0, x))

Here's the output of the dataframe tmp:

tmp

# A tibble: 4 x 4
  name    age gender height_in
  <fct> <dbl> <fct>      <dbl>
1 john     20 m             66
2 mary      0 f             62
3 zz       38 f             68
4 larry     0 zz             0

I'm familiar with if and else statements in Python. What I don't know is the correct and most readable way to write the above in Python. I'm guessing that there is no mutate_if equivalent in the pandas package. My question is: what similar code can I use in Python to mimic the mutate_if, is.character, is.numeric, and as.factor functions found in tidyverse and R?

On a side note, I'm not as interested in speed/efficiency of code execution, but rather readability - which is why I really enjoy tidyverse. I would be grateful for any tips or suggestions.

Edit 1: adding code to create a pandas dataframe

Here is the code I used to create the dataframe within Python. This may assist others in getting started.

import pandas as pd
import numpy as np

my_dict = {
    'name' : ['john','mary', np.nan, 'larry'],
    'age' : [20, np.nan, 38,  np.nan],
    'gender' : ['m','f','f', np.nan],
    'height_in' : [66, 62, 68, np.nan]
}

df = pd.DataFrame(my_dict)

The output of this should be similar:

print(df)
    name   age gender  height_in
0   john  20.0      m       66.0
1   mary   NaN      f       62.0
2    NaN  38.0      f       68.0
3  larry   NaN    NaN        NaN
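
As a quick check on how pandas sees these columns, here is a minimal sketch, assuming the same my_dict as above. The names missing and numeric_cols are only illustrative. It shows why age prints as 20.0: a numeric column containing NaN is stored as float64.

```python
import numpy as np
import pandas as pd

my_dict = {
    'name': ['john', 'mary', np.nan, 'larry'],
    'age': [20, np.nan, 38, np.nan],
    'gender': ['m', 'f', 'f', np.nan],
    'height_in': [66, 62, 68, np.nan],
}
df = pd.DataFrame(my_dict)

# One dtype per column; numeric columns holding NaN are stored as float64.
print(df.dtypes)

# Number of missing values in each column.
missing = df.isna().sum()
print(missing)

# Rough counterpart of R's is.numeric applied across columns.
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```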

2 Answers

0
votes

Well, after some sleep, I think I have it figured out.

Here's the code I used to take the pandas dataframe and apply the comparable mutate_if functions I mentioned earlier to get the same results.

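# fill in the missing values (similar to mutate_if from tidyverse)
# numeric columns: NaN -> 0 ('number' also covers integer columns)
df1 = df.select_dtypes(include=['number']).fillna(0)
# text columns: NaN -> 'zz', then convert to category (like as.factor)
df2 = df.select_dtypes(include=['object']).fillna('zz').astype('category')

# recombine; both pieces keep the original index, so the rows line up
df = pd.concat([df2, df1], axis = 1)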

print(df)
    name gender   age  height_in
0   john      m  20.0       66.0
1   mary      f   0.0       62.0
2     zz      f  38.0       68.0
3  larry     zz   0.0        0.0

# check again for the data types
df.dtypes
name         category
gender       category
age           float64
height_in     float64
dtype: object

The catch is that I had to 'break' apart the original dataframe, apply the changes (i.e., fill in the missing values and change data types), and then recombine the columns (i.e., put the data frame back together).
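
The split-and-recombine step can probably be avoided. Below is a minimal sketch, not a definitive recipe, that builds one dict of fill values and one dict of target dtypes, then passes them to DataFrame.fillna and DataFrame.astype. The names num_cols, txt_cols, and fill_values are illustrative, and treating every non-numeric column as text is an assumption that happens to hold for this data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'mary', np.nan, 'larry'],
    'age': [20, np.nan, 38, np.nan],
    'gender': ['m', 'f', 'f', np.nan],
    'height_in': [66, 62, 68, np.nan],
})

# Column labels by kind. Assumption: non-numeric columns are text.
num_cols = df.select_dtypes(include='number').columns
txt_cols = df.select_dtypes(exclude='number').columns

# One fill value per column: 0 for numeric columns, 'zz' for text columns.
fill_values = {col: 0 for col in num_cols}
fill_values.update({col: 'zz' for col in txt_cols})

# Fill everything in one pass, then convert only the text columns.
tmp = df.fillna(fill_values).astype({col: 'category' for col in txt_cols})

print(tmp)
print(tmp.dtypes)
```

Because nothing is split off, the columns stay in their original order (name, age, gender, height_in), and the original df is left unchanged.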

0
votes

How about an approach that stays close to the tidyverse style:

>>> from datar import f
>>> from datar.tibble import tribble
>>> from datar.base import NA, is_na, is_numeric, is_character, as_factor
>>> from datar.dplyr import mutate, across, where
>>> from datar.tidyr import replace_na
>>> # or if you are lazy
>>> # from datar.all import *
>>> 
>>> df = tribble(
...     f.name, f.age, f.gender, f.height_in,
...     "john", 20,    'm',      66,
...     'mary', NA,    'f',      62,
...     NA,     38,    'f',      68,
...     'larry',NA,    NA,       NA
... )
>>> 
>>> tmp = df >> \
...   mutate(across(where(is_character), replace_na, "zz")) >> \
...   mutate(across(where(is_character), as_factor)) >> \
...   mutate(across(where(is_numeric), replace_na, 0))
>>> 
>>> tmp
        name       age     gender  height_in
  <category> <float64> <category>  <float64>
0       john      20.0          m       66.0
1       mary       0.0          f       62.0
2         zz      38.0          f       68.0
3      larry       0.0         zz        0.0
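
For comparison, the same three-step pipeline can be sketched in plain pandas without datar. Note that mutate_if and is_text below are hypothetical helpers written for this example and are not part of pandas; only DataFrame.pipe and pandas.api.types.is_numeric_dtype are library functions. The sketch assumes that any non-numeric column counts as a character column.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def mutate_if(frame, predicate, func):
    """Hypothetical helper: apply func to each column for which predicate is true."""
    out = frame.copy()  # leave the caller's frame untouched
    for col in out.columns:
        if predicate(out[col]):
            out[col] = func(out[col])
    return out


def is_text(col):
    # Assumption: every non-numeric column is treated as "character".
    return not is_numeric_dtype(col)


df = pd.DataFrame({
    'name': ['john', 'mary', np.nan, 'larry'],
    'age': [20, np.nan, 38, np.nan],
    'gender': ['m', 'f', 'f', np.nan],
    'height_in': [66, 62, 68, np.nan],
})

# Same order of steps as the R pipeline in the question.
tmp = (
    df
    .pipe(mutate_if, is_text, lambda s: s.fillna('zz'))
    .pipe(mutate_if, is_text, lambda s: s.astype('category'))
    .pipe(mutate_if, is_numeric_dtype, lambda s: s.fillna(0))
)

print(tmp)
print(tmp.dtypes)
```

Each .pipe call plays the role of one %>% step, so the chain reads top to bottom much like the original mutate_if version.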

I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.