Reading fixed width format file in R

Question

I'm attempting to read this fixed width file into R using read.fwf:

http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for

When I perform this function I'm getting some weird errors that I cannot sort out unless I read it a very specific way:

> fwf <- read.fwf("getdata_wksst8110.for", 1:9, skip = 4)
> head(fwf)
  V1 V2  V3   V4 V5     V6  V7       V8   V9
1 NA  3 JAN 1990 NA 23.4-0 0.4 25.1-0.3 26.6
2 NA 10 JAN 1990 NA 23.4-0 0.8 25.2-0.3 26.6
3 NA 17 JAN 1990 NA 24.2-0 0.3 25.3-0.3 26.5
4 NA 24 JAN 1990 NA 24.4-0 0.5 25.5-0.4 26.5
5 NA 31 JAN 1990 NA 25.1-0 0.2 25.8-0.2 26.7
6 NA  7 FEB 1990 NA 25.8 0 0.2 26.1-0.1 26.8

However, you clearly see that by comparing the output to the original file it's not right. There should indeed be 9 columns, but it's cutting up my date columns and the other columns.

If I use a sep = " " argument it just throws an error:

> fwf <- read.fwf("getdata_wksst8110.for", 1:9, skip = 4, sep = " ")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 6 did not have 25 elements

Could someone, please, help me figure out why this isn't reading in the way I would expect?

This is a helpful link I found related to using this function but it's more of a performance related question. The author never defined his widths = col arguments.

Thank you for your consideration of this puny question.

So I re-ran the operation using the vector of widths as recommended by @MrFlick and the data is looking a lot better. However, what I am seeing is that the "sep" argument is clearly wreaking havoc. If I use sep = " " it throws a strange error. But if I don't use sep then it jerks up my column results.


Non-jerked results using widths = c(10, 4, 4, 4, 4, 4, 4, 4, 4)
    > head(fwf)
              V1 V2 V3   V4 V5 V6   V7  V8 V9
    1  03JAN1990 NA 23 4-0.  4 25 .1-0 0.3  2
    2  10JAN1990 NA 23 4-0.  8 25 .2-0 0.3  2
    3  17JAN1990 NA 24 2-0.  3 25 .3-0 0.3  2
    4  24JAN1990 NA 24 4-0.  5 25 .5-0 0.4  2
    5  31JAN1990 NA 25 1-0.  2 25 .8-0 0.2  2
    6  07FEB1990 NA 25 8 0.  2 26 .1-0 0.1  2

Jerked results using:

> fwf <- read.fwf("getdata_wksst8110.for", widths = c(10, 4, 4, 4, 4, 4, 4, 4, 4), skip = 4, sep = " ")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 6 did not have 25 elements

Am I missing something with sep?


A modification of the awesome @MrFlick's script appears to have fit the bill (more or less)! That first row remained troublesome and made it impossible for me to summarize/sum on hd[4]. Removing the first row with hd[-1,] didn't seem to help at all, oddly enough. Oh well.

hd<-read.fwf("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for", 
             widths=c(10,rep(c(9,4),4)), skip=3)

trim <- function(x) gsub("^\\s+|\\s+$","",x)
main <- paste0(trim(hd[1,seq(2, ncol(hd), by=2)]), trim(hd[1,seq(3, ncol(hd), by=2)]))
sub <- trim(as.vector(hd[2,]))

names(hd) <- make.names(c(sub[1],paste(rep(main, each=2), sub[-1])))

What do you think 1:9 is doing? That parameter should be specifying the width of each column (in terms of number of characters). It doesn't seem as though you've correctly specified the column widths at all. Also, you may want to look at the read_fwf function from the readr package because the base read.fwf function is pretty inefficient (should that be a concern). — MrFlick
I read the docs and read.table as well. The width of the columns is variable across all of the columns. The date, for instance, is 9L, and the other 8 columns vary generally between 3 and 4L. -0.50 = 4, 25.5 = 3, 0.03 = 3, etc. — Zach
That's why you supply a vector of widths. So if the first is 8 char and the second is 4, then you start with c(8, 4, ...). You specify a width for each of the 9 columns. — MrFlick
<face palm> I keep forgetting you can supply a vector of data points to be used. — Zach
Setting widths = 4 means you have just one column with width 4. If you have 9 columns of width 4, you would do widths=c(4,4,4,4,4,4,4,4,4) or, more succinctly, widths=rep(4,9). That's the thing with fixed-width files: you need to specify the widths of all the columns; that's the only way to know how to parse the file. — MrFlick

MrFlick · Accepted Answer · 2015-05-06T20:47:11

Here's a command that should read in the data:

dd<-read.fwf("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
    widths=c(10, rep(c(9,4),4)), skip=4)

Note that the widths need to account for all characters in each line, so even if there are blank spaces between columns, you need to assign those to one of the columns.
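
As a rough illustration of that point, here is a small sketch that builds two sample rows in what I assume is the same layout as the file (a 10-character date field followed by four 9+4 pairs) and reads them back. The sample rows, the sprintf format and the temporary file are assumptions for the example; only the widths vector comes from the command above.

```r
# Build two sample rows with sprintf so the field widths are exact:
# a 10-character date field followed by four (9, 4) SST/SSTA pairs.
# The leading values echo the question; the trailing values are invented.
fmt <- "%10s%9s%4s%9s%4s%9s%4s%9s%4s"
rows <- c(
  sprintf(fmt, "03JAN1990", "23.4", "-0.4", "25.1", "-0.3",
          "26.6", "0.0", "28.6", "0.3"),
  sprintf(fmt, "07FEB1990", "25.8", "0.2", "26.1", "-0.1",
          "26.8", "0.1", "28.6", "0.3")
)

widths <- c(10, rep(c(9, 4), 4))

# Every character of a line must belong to some field,
# so the widths should add up to the line length.
stopifnot(all(nchar(rows) == sum(widths)))

# Write the sample rows to a temporary file and read them back.
tmp <- tempfile(fileext = ".txt")
writeLines(rows, tmp)
dd <- read.fwf(tmp, widths = widths)
unlink(tmp)

str(dd)
```

The blank padding between columns is folded into the 9-wide fields, and the numeric columns still parse as numbers because surrounding whitespace is stripped from numeric fields.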

I can't think of a super-clean way to get the headers. This works, but it's ugly and makes assumptions:

hd<-read.fwf("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for", 
    widths=c(10,rep(c(9,4),4)), skip=2, nrow=2, as.is=TRUE)

trim <- function(x) gsub("^\\s+|\\s+$","",x)
main <- paste0(trim(hd[1,seq(2, ncol(hd), by=2)]), trim(hd[1,seq(3, ncol(hd), by=2)]))
sub <- trim(as.vector(hd[2,]))

names(dd) <- make.names(c(sub[1],paste(rep(main, each=2), sub[-1])))
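
To see what the naming step produces without downloading anything, here is a sketch with the header pieces typed in by hand. The literal values of main and sub are my assumption about what the two header rows contain; the last line is the same make.names() call as above.

```r
# Assumed contents of the two header rows after trimming
# (hypothetical literals standing in for the values read from the file).
main <- c("Nino1+2", "Nino3", "Nino34", "Nino4")
sub  <- c("Week", rep(c("SST", "SSTA"), 4))

# Same construction as above: pair each region with its two sub-headers,
# then let make.names() turn the results into syntactically valid names.
nms <- make.names(c(sub[1], paste(rep(main, each = 2), sub[-1])))
print(nms)
```

make.names() replaces characters that are not allowed in R names, such as "+" and the space, with ".", so a label like "Nino1+2 SST" becomes Nino1.2.SST.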

and finally, you can make a proper date value with

dd$Week <- as.Date(as.character(dd$Week), "%d%b%Y")
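
A small standalone check of that format string, using a few hand-typed week labels (assumed to look like the first column of the file). Because %b depends on the locale, this sketch temporarily switches LC_TIME to "C" so the English month abbreviations match; that step is my addition and may be unnecessary on your machine.

```r
# Hand-typed sample values, including the leading space from the 10-wide field.
week <- c(" 03JAN1990", " 10JAN1990", " 07FEB1990")

# %b matches month abbreviations in the current locale, so use the C locale
# for the conversion and restore the previous setting afterwards.
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
parsed <- as.Date(trimws(week), format = "%d%b%Y")
Sys.setlocale("LC_TIME", old_locale)

print(parsed)        # Date values
print(diff(parsed))  # gaps between consecutive weeks, in days
```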

You shouldn't be using the sep= parameter at all. What read.fwf actually does is rewrite the fixed-width file as a delimited file, using sep as the delimiter, and then read that delimited file with the more standard read.table(). The default value of sep="\t" is usually safe, since you generally do not have tabs in your actual data.
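
Here is a sketch that imitates that rewrite step by hand, to show why a space is a bad choice of delimiter for this file. It is not read.fwf's actual source: the sample line, the substring() slicing and the strsplit() counting are stand-ins I chose for illustration.

```r
# One sample line in the assumed layout (10, then four (9, 4) pairs).
line <- sprintf("%10s%9s%4s%9s%4s%9s%4s%9s%4s",
                "03JAN1990", "23.4", "-0.4", "25.1", "-0.3",
                "26.6", "0.0", "28.6", "0.3")
widths <- c(10, rep(c(9, 4), 4))

# Slice the line into fixed-width fields.
ends   <- cumsum(widths)
starts <- ends - widths + 1
fields <- substring(line, starts, ends)

# Re-join the fields with each candidate delimiter.
tab_joined   <- paste(fields, collapse = "\t")
space_joined <- paste(fields, collapse = " ")

# Count how many pieces a delimited reader would see.
n_tab   <- length(strsplit(tab_joined, "\t", fixed = TRUE)[[1]])
n_space <- length(strsplit(space_joined, " ", fixed = TRUE)[[1]])

cat("fields with tab delimiter:  ", n_tab, "\n")
cat("fields with space delimiter:", n_space, "\n")
```

With a tab, the line splits back into the 9 intended fields. With a space, the padding inside each field is also treated as a delimiter, so the count is much larger and changes from line to line depending on whether a value has a minus sign, which is consistent with the "did not have 25 elements" error in the question.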
