4
votes

I'm writing a small web app in Go that parses a file uploaded by the user. I'd like to auto-detect whether the file is gzipped and create the readers / scanners accordingly. One twist is that I can't read the whole file into memory; I can only operate on the stream. Here's what I've got:

func scannerFromFile(reader io.Reader) (*bufio.Scanner, error) {
    var scanner *bufio.Scanner

    // Create a bufio.Reader so we can 'peek' at the first few bytes.
    bReader := bufio.NewReader(reader)

    // Read a few bytes without consuming them.
    testBytes, err := bReader.Peek(64)
    if err != nil {
        return nil, err
    }

    // Detect if the content is gzipped.
    contentType := http.DetectContentType(testBytes)

    if strings.Contains(contentType, "x-gzip") {
        // Gzip detected: make a gzip reader, then wrap it in a scanner.
        gzipReader, err := gzip.NewReader(bReader)
        if err != nil {
            return nil, err
        }
        scanner = bufio.NewScanner(gzipReader)
    } else {
        // Not gzipped: just make a scanner based on the reader.
        scanner = bufio.NewScanner(bReader)
    }

    return scanner, nil
}

This works fine for plain text, but gzipped data inflates incorrectly: after a few KB I inevitably get garbled text. Is there a simpler method out there? Any ideas why it decompresses incorrectly after a few thousand lines?

Makes me wonder if something outside this code is faulty--garbled text from a gzip reader is definitely something I don't expect. (Edit: whoops, that "don't" is important. :) ) - twotwotwo
The code looks correct to me. I recommend contentType == "application/x-gzip" instead of strings.Contains. - Cerise Limón
If the compressed stream itself were corrupted I'd expect you to get a CRC error; could be something before compression or after decompression--anyway, I fear there may not be enough information here to solve the problem. - twotwotwo
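
For reference, the exact-match comparison suggested in the comment above could look roughly like the sketch below. The helper name looksGzipped is made up for illustration. As far as I know, the sniffer in net/http matches gzip on the three bytes 0x1f 0x8b 0x08 (magic number plus the deflate method byte), so the slice passed in should be at least that long.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"net/http"
)

// looksGzipped (hypothetical helper) reports whether the leading bytes of a
// stream are sniffed as gzip, using an exact match on the returned MIME type
// rather than a substring search.
func looksGzipped(head []byte) bool {
	return http.DetectContentType(head) == "application/x-gzip"
}

func main() {
	// Build a small gzip stream in memory to test against.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte("hello"))
	zw.Close()

	fmt.Println(looksGzipped(buf.Bytes()))     // prints true
	fmt.Println(looksGzipped([]byte("hello"))) // prints false
}
```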

2 Answers

5
votes

You can detect that a file is gzipped by checking whether its first 2 bytes are 0x1f 0x8b (I found that information here).

In comments someone mentioned that you should check these bytes separately, so the first one is 0x1f and the second is 0x8b.

testBytes, err := bReader.Peek(2) // read 2 bytes without consuming them
....
if testBytes[0] == 0x1f && testBytes[1] == 0x8b { // 31 and 139 in decimal
    // gzip
} else {
    ...
}

Hope that helps.

0
votes

Thanks everyone - turns out that twotwotwo and thundercat were correct, and the stream was getting corrupted in a spot unrelated to the code I posted. Weirdly, it seems to be related to writing to the http response while still reading from the request stream. I'm still investigating it, but it seems the original question was misguided.