1
votes

I have a dataset with 20 variables and quite a bit of missing data. I am trying to add a new variable whose value in each row depends on the value of another variable in that row. Below is code and a smaller dataset that gives the same errors as my larger dataset. Any suggestions?

A=seq(1,6); B=seq(2,4)
length(A)=7; length(B)=7
m=cbind(A,B)

I do not fully understand what converting a matrix to a data frame does.

df=as.data.frame(m)
df

First, I try to create a categorical variable to use when assigning the value of the new variable:

df$Acat=cut(df$A,
              breaks=c(-Inf,2.5,4.5,Inf),
              labels=c("low","mod","hi"))
df$Acat

The code below is where I get the error "argument is of length zero":

if (df$Acat.=="low"){
  df$C=1
}else if (df$Acat.=="mod"){
  df$C=2
}else if(df$Acat.=="hi"){
  df$C=3
}else {
  df$C=NA
}
df$C

I also tried it this way, using the numeric variable for assigning the value of the new variable but I am getting this error:

the condition has length > 1 and only the first element will be used

if (df$A<2.5){
  df$D=1
} else if (df$A>=2.5 && df$A<4.5){
  df$D=2
} else if (df$A>=4.5){
  df$D=3
} else {
  df$D=NA
}
df$D
2
Try: df$C <- match(df$Acat, c("low","mod","hi")) (comment by GKi)
and: df$D <- findInterval(df$A, c(-Inf,2.5,4.5,Inf)) (comment by GKi)

2 Answers

0
votes

You seem to be new to R. You will find out, as you go on, that some things are done quite differently in R than in other languages.

For instance, to set the column C according to your conditions, you would do:

df$C = ifelse(df$Acat == "low", 1,
       ifelse(df$Acat == "mod", 2,
       ifelse(df$Acat == "hi",  3, NA)))

If you are working with the tidyverse, you can also use case_when() from dplyr.
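
For illustration, here is a minimal sketch of what the case_when version might look like. It assumes the dplyr package is installed (hence the requireNamespace guard, which is my addition), and it rebuilds a small df like the one in the question so that it runs on its own. Rows that match none of the conditions, including the NA row, should come out as NA.

```r
# small stand-in for the question's data: A runs 1..6 plus one missing value
df <- data.frame(A = c(1:6, NA))
df$Acat <- cut(df$A,
               breaks = c(-Inf, 2.5, 4.5, Inf),
               labels = c("low", "mod", "hi"))

# assumption: dplyr is available; the guard only skips the step if it is not
if (requireNamespace("dplyr", quietly = TRUE)) {
  df$C <- dplyr::case_when(
    df$Acat == "low" ~ 1,
    df$Acat == "mod" ~ 2,
    df$Acat == "hi"  ~ 3
  )
}
```

Compared with nested ifelse calls, each condition sits on its own line, which tends to be easier to read once there are more than two or three categories.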

0
votes

Here are a few pointers. In R, it is common to assign values to names using the <- operator rather than =. To be fair, I didn't even know you could assign a length to a variable, so I learned something new.

A <- seq(1, 6)
length(A) <- 7
B <- seq(2, 4)
length(B) <- 7

m <- cbind(A, B)

The difference between a matrix and a data.frame is that a matrix is an atomic vector (all elements of one type) with a dim attribute specifying the dimensions (also true for arrays), whereas a data.frame is a list of columns, each a vector of equal length (the number of rows).

What this means in practice is that data.frames can have anything in different columns, e.g. one might be a character and another an integer, whereas matrices can only contain data of the same type.

> attributes(m)
$dim
[1] 7 2

$dimnames
$dimnames[[1]]
NULL

$dimnames[[2]]
[1] "A" "B"
> df <- as.data.frame(m)
> attributes(df)
$names
[1] "A" "B"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5 6 7

> is.list(m)
[1] FALSE
> is.list(df)
[1] TRUE
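
To make the point about column types concrete, here is a small sketch with made-up data (the names m_mixed, df_mixed, id and grp are only for illustration). Binding a numeric and a character column into a matrix coerces everything to character, while a data.frame keeps each column's own type.

```r
# a matrix holds one type only, so the numbers are coerced to character
m_mixed <- cbind(id = 1:3, grp = c("a", "b", "c"))
typeof(m_mixed)          # "character"

# a data.frame stores each column separately and keeps its type
# (stringsAsFactors = FALSE is spelled out for R versions before 4.0)
df_mixed <- data.frame(id = 1:3, grp = c("a", "b", "c"),
                       stringsAsFactors = FALSE)
sapply(df_mixed, class)  # one class per column
```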

The if-else statements you are using to assign values to a column do not work because if is not vectorised: it requires a single TRUE or FALSE, not a vector of logicals. You can see that the condition has more than one element by evaluating it and asking for its length:

> df$Acat == "low"
[1]  TRUE  TRUE FALSE FALSE FALSE FALSE    NA

> length(df$Acat == "low")
[1] 7
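
As far as I can tell, the two error messages in the question have slightly different causes, and the sketch below tries to show both. It rebuilds a small df so that it runs on its own. The first snippet in the question refers to df$Acat. with a trailing dot, which matches no column and is therefore NULL; comparing NULL with a string gives a logical vector of length zero. The second snippet passes a whole column to if, which gives a condition of length 7. Note that in R 4.2.0 and later a condition of length greater than one is an error rather than a warning.

```r
# small stand-in for the question's data
df <- data.frame(A = c(1:6, NA))
df$Acat <- cut(df$A,
               breaks = c(-Inf, 2.5, 4.5, Inf),
               labels = c("low", "mod", "hi"))

# "Acat." (with the dot) is not a column, so `$` returns NULL
is.null(df$Acat.)            # TRUE
length(df$Acat. == "low")    # 0, hence "argument is of length zero"

# a comparison on a full column yields one logical per row
length(df$A < 2.5)           # 7, hence "the condition has length > 1"
```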

Instead, you can build a named vector with the values you want, and use a subsetting operation to get them to the right place:

df$Acat <- cut(df$A,
            breaks=c(-Inf,2.5,4.5,Inf),
            labels=c("low","mod","hi"))

named_vec <- c("low" = 1, "mod" = 2, "hi" = 3)
# index by label: a bare factor would be used as integer codes, not names
df$C <- unname(named_vec[as.character(df$Acat)])

Which gives you this data.frame:

> df
   A  B Acat  C
1  1  2  low  1
2  2  3  low  1
3  3  4  mod  2
4  4 NA  mod  2
5  5 NA   hi  3
6  6 NA   hi  3
7 NA NA <NA> NA

There are several other ways to get the same result, but I find subsetting by name relatively intuitive.
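
For comparison, here is a sketch of three of those other options, two of which were suggested in the comments on the question. The names c1, c2 and c3 are only for illustration. One caveat: cut uses right-closed intervals by default while findInterval uses left-closed ones, so the two can disagree for values that fall exactly on a break (none do in this example).

```r
# small stand-in for the question's data
df <- data.frame(A = c(1:6, NA))
df$Acat <- cut(df$A,
               breaks = c(-Inf, 2.5, 4.5, Inf),
               labels = c("low", "mod", "hi"))

# option 1: the factor's integer codes follow the order of `labels`
c1 <- as.integer(df$Acat)

# option 2: position of each label in a lookup vector
c2 <- match(df$Acat, c("low", "mod", "hi"))

# option 3: bin the numeric column directly, skipping the factor
c3 <- findInterval(df$A, c(-Inf, 2.5, 4.5, Inf))

# all three agree here, with NA kept for the missing row
identical(c1, c2) && identical(c2, c3)
```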