0 votes

The package pdf-text-extract requires an absolute filepath. I need to use this package on files in a Google Cloud Storage bucket.

Is there a way to pass a file from a GCS bucket as an absolute URL?

Here is my code (Node.js):

var storage = require('@google-cloud/storage')();
const extract = require('pdf-text-extract');

const bucketFile = 'http://bucketname.storage.googleapis.com/fileName.pdf';

extract(bucketFile, 
    (err, pages) => {
        if (err) {
            console.error(err)
            return
        }

        console.log(pages);
    }
);

This returns an error:

Error: pdf-text-extract command failed: I/O Error: Couldn't open file 'D:\Libraries\Documents\pdf-text-extract-bucket\http:\bucketname.storage.googleapis.com\fileName.pdf'

I've also tried passing this to the extract function:

file = storage.bucket('bucketname').file('fileName.pdf');

For reference, this is how it works with local files (instead of GCS buckets):

const filePath = path.join(__dirname, tempFileName);  
extract(filePath, callback);

2 Answers

1 vote

Checking the source code for pdf-text-extract, it's only designed to work with local files (it calls path.resolve() on the file path you pass in, which is why your URL got resolved against your working directory). Unless you want to change the way the module works, you can simply download the file to the local file system first:

const Storage = require('@google-cloud/storage');
const storage = new Storage();

const options = { destination: destFilename };
storage.bucket(bucketName).file(srcFilename).download(options);

download() is asynchronous and returns a promise, so wait for it to finish and then run the extraction on the downloaded file:

const localFile = destFilename;  // download() wrote the file to this exact path
extract(localFile, 
    (err, pages) => {
        if (err) {
            console.error(err)
            return
        }

        console.log(pages);
    }
);
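
For completeness, here is a minimal sketch of the two steps combined, reusing the storage client from above and downloading to the OS temp directory ('bucketname' and 'fileName.pdf' are the placeholders from the question). Since download() returns a promise, the extraction runs once it resolves:

const os = require('os');
const path = require('path');

const tempFilePath = path.join(os.tmpdir(), 'fileName.pdf');

storage.bucket('bucketname').file('fileName.pdf')
    .download({ destination: tempFilePath })
    .then(() => {
        // The PDF now exists locally, so pdf-text-extract can resolve its path.
        extract(tempFilePath, (err, pages) => {
            if (err) {
                console.error(err);
                return;
            }
            console.log(pages);
        });
    })
    .catch((err) => console.error('Download failed:', err));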
0 votes

I needed the file path of a file in Google Cloud Storage to open an SQLite database. This is the code I found and used, and it worked like a charm:

// Dependencies assumed by this snippet: Node core modules plus the
// promisified spawn from child-process-promise and the GCS client.
const os = require('os');
const path = require('path');
const fs = require('fs');
const spawn = require('child-process-promise').spawn;
const gcs = require('@google-cloud/storage')();

// fileBucket, filePath, fileName and contentType come from the
// Cloud Storage trigger event that invokes this function.

// Download file from bucket.
const bucket = gcs.bucket(fileBucket);
const tempFilePath = path.join(os.tmpdir(), fileName);
const metadata = {
  contentType: contentType,
};
return bucket.file(filePath).download({
  destination: tempFilePath,
}).then(() => {
  console.log('Image downloaded locally to', tempFilePath);
  // Generate a thumbnail using ImageMagick.
  return spawn('convert', [tempFilePath, '-thumbnail', '200x200>', tempFilePath]);
}).then(() => {
  console.log('Thumbnail created at', tempFilePath);
  // We add a 'thumb_' prefix to thumbnails file name. That's where we'll upload the thumbnail.
  const thumbFileName = `thumb_${fileName}`;
  const thumbFilePath = path.join(path.dirname(filePath), thumbFileName);
  // Uploading the thumbnail.
  return bucket.upload(tempFilePath, {
    destination: thumbFilePath,
    metadata: metadata,
  });
  // Once the thumbnail has been uploaded delete the local file to free up disk space.
}).then(() => fs.unlinkSync(tempFilePath));
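
To adapt this to the original question, only the download step matters: once the object has been downloaded to tempFilePath in os.tmpdir(), that is an ordinary absolute local path that pdf-text-extract (or sqlite) can open, and fs.unlinkSync(tempFilePath) cleans it up afterwards. The ImageMagick/thumbnail steps can be dropped.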