
I load a text file (tree.txt) into R; its content, shown below, was copied and pasted from the output of the JWEKA J48 command. I use the following command to load it:

data3 <- read.table(file.choose(), header = FALSE, sep = ",")

I would like to put each column into a separate variable named COL1, COL2 ... COL8 (there are 8 columns in this example). If you load the file into Excel as delimited text, each field of a row lands in its own column, which is the result I need. Each COLn should contain the relevant characters of the tree. How can I split the text file into these columns automatically while ignoring the header and footer of the file?

Here is the text file content:

J48 pruned  tree                                                        

MSTV    <=  0.4                                                     
|   MLTV    <=  4.1:    3   -2                                          
|   MLTV    >   4.1                                                 
|   |   ASTV    <=  79                                              
|   |   |   b   <=  1383:00:00  2   -18                                 
|   |   |   b   >   1383                                            
|   |   |   |   UC  <=  05:00   1   -2                              
|   |   |   |   UC  >   05:00   2   -2                              
|   |   ASTV    >   79:00:00    3   -2                                      
MSTV    >   0.4                                                     
|   DP  <=  0                                                   
|   |   ALTV    <=  09:00   1   (170.0/2.0)                                     
|   |   ALTV    >   9                                               
|   |   |   FM  <=  7                                           
|   |   |   |   LBE <=  142:00:00   1   (27.0/1.0)                              
|   |   |   |   LBE >   142                                     
|   |   |   |   |   AC  <=  2                                   
|   |   |   |   |   |   e   <=  1058:00:00  1   -5                      
|   |   |   |   |   |   e   >   1058                                
|   |   |   |   |   |   |   DL  <=  04:00   2   (9.0/1.0)                   
|   |   |   |   |   |   |   DL  >   04:00   1   -2                  
|   |   |   |   |   AC  >   02:00   1   -3                          
|   |   |   FM  >   07:00   2   -2                                  
|   DP  >   0                                                   
|   |   DP  <=  1                                               
|   |   |   UC  <=  03:00   2   (4.0/1.0)                                   
|   |   |   UC  >   3                                           
|   |   |   |   MLTV    <=  0.4:    3   -2                              
|   |   |   |   MLTV    >   0.4:    1   -8                              
|   |   DP  >   01:00   3   -8                                      

Number  of  Leaves  :   16                                              

Size    of  the tree    :   31

An example of the COL1 content will be: MSTV | | | | | | | | MSTV | | | | | | | | | | | | | | | | | | | |

COL2 content will be: MLTV MLTV | | | | | | > DP | | | | | | | | | | | | DP | | | | | |

Please show how you ultimately want the input represented in a data frame/table. - Tim Biegeleisen
Will you please paste the first 20 or so lines of tree.txt into a separate block, for test data? And is your first data row "MSTV","","<=","0.4", or is there truly an MLTV in the [0,1] element of the data frame? - Shawn Mehan
And just looking at it, every row really only has 4 elements, and possibly an index that shows what its indent is in the tree. Is there any reason that your internal data structure needs to be 8 wide? Even then, you have difficulties, as some of your rows are longer than 8, e.g., DL <= 04:00 2 (9.0/1.0), which begins in COL8 in your proposed structure. - Shawn Mehan
As @TimBiegeleisen suggests, your desired output is unclear. - MichaelChirico
The attached example is the output of J48 in JWEKA, in 'as-is' format. In the end, all I need is to extract the words (variables) in each column and, in the next step, their associated values (e.g. MSTV 0.4). At this first stage, all I need is to separate the text into columns as Excel does (just load it into Excel and you can see the needed result). - Avi

1 Answer


Try this:

# Read the file, then drop the 4 header lines and the 4 footer lines
cleaned.txt <- capture.output(cat(paste0(tail(head(readLines("FILE_LOCATION"), -4), -4), collapse = '\n'), sep = '\n'))
# Parse as fixed-width text, 4 characters per field
cleaned.df <- read.fwf(file = textConnection(cleaned.txt),
                   header = FALSE,
                   widths = rep.int(4, max(nchar(cleaned.txt)/4)),
                   strip.white = TRUE)
# Drop columns that are entirely NA
cleaned.df <- cleaned.df[, colSums(is.na(cleaned.df)) < nrow(cleaned.df)]

For the cleaning step, I use a combination of head and tail to remove the 4 lines at the top and the 4 lines at the bottom of the file. There's probably a more efficient way to do this outside of R, but this isn't so bad. Basically, I'm just making the file readable to R.
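If the number of header and footer lines is not always four, one alternative is to locate the tree body by its content rather than by position. The following is only a sketch: it assumes the footer starts at a line beginning with "Number of Leaves" and that the title line starts with "J48". The helper name extract_tree_lines and the sample lines are made up for illustration.

```r
# Hypothetical helper: keep only the tree body from raw J48 output lines.
extract_tree_lines <- function(lines) {
  # Assumption: the footer starts at the first "Number of Leaves" line.
  footer <- grep("^Number[[:space:]]+of[[:space:]]+Leaves", lines)
  if (length(footer) > 0) lines <- lines[seq_len(footer[1] - 1)]
  # Drop the title line and any blank lines.
  keep <- !grepl("^J48", lines) & grepl("[^[:space:]]", lines)
  lines[keep]
}

# Made-up sample in the same layout as the question's file.
sample_lines <- c(
  "J48 pruned  tree",
  "",
  "MSTV    <=  0.4",
  "|   MLTV    <=  4.1:    3   -2",
  "MSTV    >   0.4",
  "",
  "Number  of  Leaves  :   16",
  "",
  "Size    of  the tree    :   31"
)
tree_lines <- extract_tree_lines(sample_lines)
print(tree_lines)
```

The result can be passed to textConnection() in the same way as cleaned.txt.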

Your file looks like a fixed-width file so I use read.fwf, and use textConnection() to point the function to the cleaned output.

Finally, I'm not sure how your data is actually structured, but when I copied it from Stack Overflow, it pasted with a bunch of whitespace at the end of each line. I'm using some tricks to guess how many 4-character fields the longest line contains, and then removing the extraneous (entirely empty) columns, here:

widths = rep.int(4, max(nchar(cleaned.txt)/4))
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]
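One thing to watch in the widths expression: when the longest line is not a multiple of 4 characters, the division gives a fractional count, and the last partial field may be cut off. A minimal sketch of a variant that rounds up with ceiling() follows; the two sample lines and the names n_fields and widths are only for illustration.

```r
# Two made-up lines; the second is 30 characters long.
cleaned.txt <- c("MSTV    <=  0.4", "|   MLTV    <=  4.1:    3   -2")

# Round up so a trailing partial field still gets its own column.
n_fields <- ceiling(max(nchar(cleaned.txt)) / 4)
widths <- rep.int(4, n_fields)
print(widths)
```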

Next, I'm creating the data in the way you would like it structured.

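for (i in colnames(cleaned.df)) {
  # Store the column as its own one-column data frame (V1, V2, ...)
  assign(i, subset(cleaned.df, select=i))
  # Replace it with one space-separated string of its non-empty values
  assign(i, capture.output(cat(paste0(unlist(get(i)[get(i)!=""])),sep = ' ', fill = FALSE)))
}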


This loops over each column name in your data frame.

Inside the loop, it first uses assign() to put the data of each column into its own data frame. In your case, they are named V1 through V15.

Next, it uses a combination of cat() and paste0() with unlist() and capture.output() to concatenate the values of each of those data frames into a single string, so they are now character vectors instead of data frames.

Keep in mind that because you wanted a space between the values, I'm using a space as the separator. But because this is a fixed-width file, some cells in each column are completely blank, which I'm removing using

get(i)[get(i)!=""]

(Your question said you wanted COL2 to be: MLTV MLTV | | | | | | > DP | | | | | | | | | | | | DP | | | | | |).

If we just use get(i), there will be leading whitespace in the output.
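If you would rather not create one global variable per column, a loop-free variant is to collect the strings in a named list. This is only a sketch, not a drop-in replacement for the loop: the small data frame below is a made-up stand-in for cleaned.df, and the name cols is an assumption for illustration.

```r
# Small made-up data frame standing in for cleaned.df.
cleaned.df <- data.frame(
  V1 = c("MSTV", "|", "|", "MSTV"),
  V2 = c("", "MLTV", "MLTV", ""),
  V3 = c("<=", "", "", ">"),
  stringsAsFactors = FALSE
)

# Collapse the non-empty cells of each column into one space-separated string.
cols <- lapply(cleaned.df, function(x) {
  x <- as.character(x)
  paste(x[!is.na(x) & x != ""], collapse = " ")
})

# Rename the elements to COL1, COL2, ... as in the question.
names(cols) <- paste0("COL", seq_along(cols))

print(cols$COL1)
print(cols$COL2)
```

Each element is then available as cols$COL1, cols$COL2, and so on, and blank cells are dropped before pasting, so no leading whitespace appears.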