Load frequent subsequences from TXT

Question

Is it possible to load a list of frequent subsequences from a .txt file, and make TraMineR recognize it as a sequence object?

Unfortunately I don't have the raw data, so I cannot recreate the analysis. The only file that I have is a .txt file containing the frequent subsequences. I assume it was created with the seqefsub() function from the TraMineR package, with maxGap=2, because the data looks like the output of that function.

read.table() reads it as a data frame, but as far as I understand, TraMineR handles event sequences as lists with many additional attributes that are not contained in this file. Or I don't know how to extract them...

This is what a couple of lines from the .txt file look like:

                                             Subsequence    Support  Count
16                                           (WT4)-(WT3) 0.76666667    805
17                                                 (WL2) 0.76380952    802
18                                                  (S1) 0.76000000    798
19                                             (FRF,WL2) 0.74380952    781
20                                           (WT2)-(WT1) 0.70571429    741

Why do you want to save the print of the outcome of seqefsub as text and then read it back as sequence object? The seqefsub function already returns an event sequence object. Do you want to transform the event sequence object into a state sequence object? (if yes look at [stackoverflow.com/a/28968342/1586731] ). Please clarify your question. — Gilbert
@Gilbert – I've edited my question, I hope it is more clear now. — Balazs Dukai

Gilbert · Accepted Answer · 2015-04-21T20:19:21

To create an event sequence object from the (text) subsequences, you have to transform them into the vertical time-stamped event (TSE) format. The function below does the job for your data:

## Function subseq.to.TSE
##  puts the subsequences into TSE format using
##  the position within the subsequence as timestamp
##  sdf: a data frame with columns Id, Subsequence, Support and Count.

subseq.to.TSE <- function(sdf){
  ## event must be a factor for the level handling below
  ## (since R 4.0, data.frame() no longer converts strings to factors)
  tse <- data.frame(id=0, event="", time=0, stringsAsFactors=TRUE)
  k <- 0
  for (i in seq_len(nrow(sdf))){
    id <- sdf[i,"Id"]
    s <- sdf[i,"Subsequence"]
    ## remove the parentheses
    ss <- gsub("\\(","",s)
    ss <- gsub("\\)","",ss)
    ## split transitions
    st <- strsplit(ss, split="-")[[1]]
    for (j in seq_along(st)){
      ## split simultaneous events
      stt <- strsplit(st[j], split=",")[[1]]
      for (jj in seq_along(stt)){
        k <- k+1
        tse[k,1] <- id
        ## add the event to the factor levels if it is new
        if (!(stt[jj] %in% levels(tse[,2])))
          {levels(tse[,2]) <- c(levels(tse[,2]),stt[jj])}
        tse[k,2] <- stt[jj]
        tse[k,3] <- j
      }
    }
  }

  return(tse)
}
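To make the parsing step concrete, here is a minimal base-R sketch of what happens to a single subsequence string. The helper name parse.one is hypothetical (it is not part of TraMineR), and the sketch assumes that event names contain no "-" or "," characters.

```r
## parse.one: hypothetical helper, for illustration only.
## Turns one subsequence string into (event, time) pairs,
## where time is the position of the transition in the string.
parse.one <- function(s) {
  ## drop the parentheses, then split on "-" to get the transitions
  transitions <- strsplit(gsub("[()]", "", s), "-", fixed = TRUE)[[1]]
  ## split each transition on "," to get its simultaneous events
  events <- strsplit(transitions, ",", fixed = TRUE)
  data.frame(event = unlist(events),
             time  = rep(seq_along(events), lengths(events)),
             stringsAsFactors = FALSE)
}

## FRF and WL2 are simultaneous (time 1); WT3 follows (time 2)
parse.one("(FRF,WL2)-(WT3)")
```

Events inside the same pair of parentheses share a timestamp, and each "-" moves to the next position.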

Here is how you would use it on the example data.

We first create the data frame, which we name s.df:

s.df <- data.frame(scan(what=list(Id=0, Subsequence="", Support=double(), Count=0)))
16 (WT4)-(WT3) 0.76666667    805
17 (WL2) 0.76380952    802
18 (S1) 0.76000000    798
19 (FRF,WL2) 0.74380952    781
20 (WT2)-(WT1) 0.70571429    741

# leave a blank line to end the scan
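If the subsequences are in a file rather than typed at the console, something along the following lines may work. This is a sketch under assumptions: the file is laid out like the excerpt in the question (a header with three names, data rows that start with a row number) and the subsequence strings contain no spaces. The temporary file only stands in for the real .txt file, whose path is not given.

```r
## Recreate the excerpt from the question in a temporary file
## (a stand-in for the real .txt file).
txt <- c(
  "    Subsequence    Support  Count",
  "16  (WT4)-(WT3) 0.76666667    805",
  "17        (WL2) 0.76380952    802",
  "18         (S1) 0.76000000    798",
  "19    (FRF,WL2) 0.74380952    781",
  "20  (WT2)-(WT1) 0.70571429    741"
)
path <- tempfile(fileext = ".txt")
writeLines(txt, path)

## The header has one field fewer than the data rows, so
## read.table() uses the first column as row names.
s.df <- read.table(path, header = TRUE, stringsAsFactors = FALSE)

## subseq.to.TSE() expects an Id column; take it from the row names.
s.df$Id <- as.numeric(rownames(s.df))
```

Because subseq.to.TSE() accesses the columns by name, it does not matter that Id ends up as the last column.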

Then we extract the TSE data from s.df and create from it the event sequence object using seqecreate. Finally, we assign the counts as sequence weights.

s.tse <- subseq.to.TSE(s.df)
seqe <- seqecreate(s.tse)
seqeweight(seqe) <- s.df[,"Count"]

Now you can for instance plot the event sequences with

seqpcplot(seqe)
