1
votes

I'm working with a HUGE data.frame in R. My dataset follows the pattern:

    +-------+------+------+------+------+
    |       | Col1 | Col2 | Col3 | Col4 |
    +-------+------+------+------+------+
    | Line1 |   43 | a    | b    |   56 |
    | Line2 |  103 | c    | d    |   85 |
    | Line3 |    7 | F    | E    |  115 |
    | Line4 |    8 | g    | h    |   0  |
    +-------+------+------+------+------+ 

I have to do the following:


For each row:
    If Col2 > Col3 (check alphabetical order of the values)
        Swap values of Col 2 and Col 3

I need the following result:

    +-------+------+------+------+------+
    |       | Col1 | Col2 | Col3 | Col4 |
    +-------+------+------+------+------+
    | Line1 |   43 | a    | b    |   56 |
    | Line2 |  103 | c    | d    |   85 |
    | Line3 |    7 | E    | F    |  115 |
    | Line4 |    8 | g    | h    |   0  |
    +-------+------+------+------+------+ 

I wrote a for loop for this, but it takes a LONG LONG time! Is there a more efficient way to do this with R?

Thank you!

2

2 Answers

1
votes

You can use transform function.

> df
      col1 col2 col3 col4
Line1   48    a    b   56
Line2  103    c    d   85
Line3    7    f    e  115
Line4    8    g    h    0

> df <- transform(df, col2 = pmin(col2, col3), col3=pmax(col2, col3))

> df
      col1 col2 col3 col4
Line1   48    a    b   56
Line2  103    c    d   85
Line3    7    e    f  115
Line4    8    g    h    0
0
votes

This might not be the most efficient or fastest solution, but it takes less than 90 seconds with 1E8 rows on my machine:

set.seed(123)
nr <- 1E8
df <- data.frame(
    Col1 = sample(1:200, nr, replace = TRUE),
    Col2 = sample(c(LETTERS, letters), nr, replace=TRUE),
    Col3 = sample(c(LETTERS, letters), nr, replace=TRUE),
    Col4 = sample(1:200, nr, replace = TRUE)
)

library(data.table)
setDT(df)
df[]
#>            Col1 Col2 Col3 Col4
#>         1:  159    P    L   70
#>         2:  179    L    o  159
#>         3:   14    w    O  168
#>         4:  195    K    H  193
#>         5:  170    r    X  116
#>        ---                    
#>  99999996:  117    O    F  163
#>  99999997:   82    Q    q  179
#>  99999998:  128    t    U   60
#>  99999999:   40    X    o   79
#> 100000000:  185    E    o  133
system.time({
df[, `:=` (Col2=fifelse(Col2 < Col3, Col2, Col3),
           Col3=fifelse(Col2 < Col3, Col3, Col2))]
})
#>    user  system elapsed 
#>  17.326   0.368  17.694
df[]
#>            Col1 Col2 Col3 Col4
#>         1:  159    L    P   70
#>         2:  179    L    o  159
#>         3:   14    O    w  168
#>         4:  195    H    K  193
#>         5:  170    r    X  116
#>        ---                    
#>  99999996:  117    F    O  163
#>  99999997:   82    q    Q  179
#>  99999998:  128    t    U   60
#>  99999999:   40    o    X   79
#> 100000000:  185    E    o  133

Created on 2020-06-04 by the reprex package (v0.3.0)