1
votes

I have following data on student scores on several pretests before their true exam.

a<-(c("2013-02-25","2013-03-13","2013-04-24","2013-05-12","2013-07-12","2013-08-11","actual_exam_date"))
b<-c(300,230,400,NA,NA,NA,"2013-04-30")
c<-c(NA,260,410,420,NA,NA,"2013-05-30")
d<-c(300,230,400,NA,370,390,"2013-08-30")
df<-as.data.frame(rbind(b,c,d))
colnames(df)<-a
rownames(df)<-(c("student 1","student 2","student 3"))

The actual datasheet is much larger. Since the dates vary so much, and the timing between the pretests and to the exam are relatively similar, I would rather convert the true dates into the number of days before the exam, so that they are the new column names, not dates. I understand that this will merge some of the columns which is OK. How would I be able to do that?

2
There are 11 different 'days before exam', and as more rows get added, I think the number of columns will keep changing/increasing. Is this what you want? Perhaps, you can show what your end goal is and the desired output is so a different approach can be suggested? - Gopala
@user3949008 my end goal is to plot trendlines for each student how the pretest scores change with time, on the x axis I will plot days before the exam. One curve per student. I want to stack the lines below each other, and end of line will mean exam. - Oposum

2 Answers

1
votes

Here is one way to approach this one. I am sure there are many others. I commented the code to explain what is going on at each step:

# Load two libraries you need
library(tidyr)
library(dplyr)

# Construct data frame you provided
a <- (c("2013-02-25","2013-03-13","2013-04-24","2013-05-12","2013-07-12","2013-08-11","actual_exam_date"))
b <- c(300,230,400,NA,NA,NA,"2013-04-30")
c <- c(NA,260,410,420,NA,NA,"2013-05-30")
d <- c(300,230,400,NA,370,390,"2013-08-30")
df <- as.data.frame(rbind(b,c,d))
colnames(df) <- a

# Add student IDs as a column instead of row names and move them to first position
df$StudentID <- row.names(df)
row.names(df) <- NULL
df <- select(df, StudentID, everything())

# Gather date columns as 'categories' with score as the new column value
newdf <- df %>% gather(Date, Score, -actual_exam_date, -StudentID) %>% arrange(StudentID)

# Convert dates coded as factor variables into actual dates so we can do days to exam computation
newdf$actual_exam_date <- as.Date(as.character(newdf$actual_exam_date))
newdf$Date <- as.Date(as.character(newdf$Date))

# Create a new column of days before exam per student ID (group) and filter
# out dates with missing scores for each student
newdf <- newdf %>% group_by(StudentID) %>% mutate(daysBeforeExam = as.integer(difftime(actual_exam_date, Date, units = 'days'))) %>% filter(!is.na(Score))

# Plot the trends using ggplot
ggplot(newdf, aes(x = daysBeforeExam, y = Score, col = StudentID, group = StudentID)) + geom_line(size = 1) + geom_point(size = 2)
1
votes

This is another good use case for reshape2, because you want to go to long form for plotting. For example:

# you are going to need the student id as a field
df$student_id <- row.names(df)

library('reshape2')

df2 <- melt(df, id.vars = c('student_id','actual_exam_date'),
                variable.name = 'pretest_date',
                value.name = 'pretest_score')

# drop empty observations
df2 <- df2[!is.na(df2$pretest_score),]

# these need to be dates
df2$actual_exam_date <- as.Date(df2$actual_exam_date)
df2$pretest_date <- as.Date(df2$pretest_date)

# date difference
df2$days_before_exam <- as.integer(df2$actual_exam_date - df2$pretest_date)

# scores need to be numeric
df2$pretest_score <- as.numeric(df2$pretest_score)

# now you can make some plots
library('ggplot2')

ggplot(df2, aes(x = days_before_exam, y = pretest_score, col=student_id) ) + 
  geom_line(lwd=1) + scale_x_reverse() + 
  geom_vline(xintercept = 0, linetype = 'dashed', lwd = 1) +
  ggtitle('Pretest Performance') + xlab('Days Before Exam') + ylab('Pretest Score') 

enter image description here