2
votes

Summarize the question:

I am trying to organize a complex data string that is delivered from a sensor system. Unfortunately, because the order of the variables can change due to sensor errors, I cannot use a simple delimiter-based parsing scheme, as data can end up in the wrong column. My new concept is to pre-determine the columns to sort data into, read repeating segments of the string, and use those repeating variable names to parse the values into the correct column. Does anyone have an idea of a function/functions that will perform that task?

Why the simple solution doesn't work:

The sensor system is a wireless collection of soil moisture, temperature, pressure, and rain sensors that collect readings every five minutes and automatically upload the data to my server.

Unfortunately, I didn't get to decide how data is delivered. Each entry of this string includes 15 readings from ~10 different sensors. The variable names and values are listed together in the string with a ~ delimiter, like so:

vbat~3.63~A0~2821~A1~2931~A3~2917~A4~2963~A5~2984~T11/dP~2950.32~T11/t~12.40~p5/ms5803/p~942.31~p5/ms5803/t~13.43~p6/ms5803/p~942.91~p6/ms5803/t~8.64~p7/t~13.72~p7/h~59.06~Tip_Ct~0

The text characters ("vbat", "A0", "A1", ... "T11/dP", "T11/t", "p5/ms5803/p", ... "Tip_Ct") are each variable names of the sensor/data type, and the numeric values (3.63, 2821, 2931, ... 2950.32, 12.40, 942.31,...0) are the sensor readings.

Unfortunately, I realized that this won't consistently sort my variables, as the order of readings can change if the microprocessor malfunctions or one of the sensors fails. For example, when sensor "p6" failed, the values for "p6/ms5803/p" and "p6/ms5803/t" are no longer included in the string, like so:

vbat~3.89~A0~2815~A1~2723~A3~2737~A4~2660~A5~2620~T11/dP~3024.08~T11/t~10.20~p5/ms5803/p~935.46~p5/ms5803/t~10.37~p7/t~10.28~p7/h~89.54~Tip_Ct~0

That means that a simple delimiter-based sorting mechanism puts sensor readings at the end of the string into the wrong column.

I am looking for a set of functions that will read my string, identify specific variable names (e.g. "A0"), and then sort the next tilde-delimited value into a column with the same name. If anyone has an idea for a simpler workflow, I would be happy to hear that too!

What I intended:

My original plan was simple: parse by delimeter, remove repetitive columns, and re-name data columns according to variable name. My spreadsheet is available at https://drive.google.com/open?id=1USD_zTSj7rDL-P3yiRt6Aa5Mk00Li3Si

*#LOAD LIBRARIES AND DATA*

>library(splitstackshape)

>Harbor = read.csv(file="filepath/HARBOR2_190828.csv",header=T)

*#PARSE DATA*

>Harbor = concat.split(Harbor, split.col="full_data",sep="~",drop=T) 

>Harbor = Harbor[-seq(from=4,to=32,by=2)]

>names(Harbor) = c("Date","Time","DeviceID","vbat","A0","A1","A3","A4","A5","T11/dP","T11/t","p5/ms5803/p","p5/ms5803/t","p6/ms5803/p","p6/ms5803/t","p7/t","p7/h","Tip_Ct")

I realized that my data string length was inconsistent when I received the following error after the "concat.split" function:

>Error in `[.data.table`(indt, , splitCols, with = FALSE) : 
  Column 2 ['full_data'] is length 4487 but column 1 is length 11; malformed data.table.

Side note: Row 1877 is the last row with readings from sensor p6/ms5803

1
Please use the editing capability and tools within to tidy up your question. It would also be a hoot if your question was self-contained.Roman Luštrik
@RomanLuštrik, thank you for the advice. Could you clarify what you mean by a self-contained question?AP Geoscience
@AP_Geoscience I think it means don't give us the link to your file, give us a sample of itPavoDive

1 Answers

0
votes

I think this should do the trick. This loops through each line in your Harbor dataset, parses it by the ~, separates out the variables and their corresponding values, creates a new dataframe, and finally combines dataframes while leaving NA's for missing columns.

hseq = seq(1,lengths(Harbor),1)
mydf <- data.frame(matrix(ncol=15))
colnames(mydf) <- c("vbat", "A0", "A1","A3","A4","A5","T11/dP","T11/t","p5/ms5803/p","p5/ms5803/t","p6/ms5803/p","p6/ms5803/t","p7/t","p7/h","Tip_Ct")
for (x in hseq){
  delimitedline <- strsplit(toString(Harbor[x]),split="~")
  index <- seq(1,lengths(delimitedline)-1,by=2)
  vars <- delimitedline[[1]][index]
  val <- delimitedline[[1]][index+1]
  tempdf <- data.frame(t(val))
  colnames(tempdf) <- vars
  mydf <- bind_rows(mydf,tempdf)
}
cleaneddf = mydf[-1,]

And by the way, @RomanLuštrik, it would be a "hoot" if you were kinder to new users.