0 votes

The package pdf-text-extract requires an absolute filepath. I need to use this package on files in a Google Cloud Storage bucket.

Is there a way to pass a file from a GCS bucket as an absolute URL?

Here is my code (Node.js):

var storage = require('@google-cloud/storage')();
const extract = require('pdf-text-extract');

const bucketFile = 'http://bucketname.storage.googleapis.com/fileName.pdf';

extract(bucketFile, 
    (err, pages) => {
        if (err) {
            console.error(err)
            return
        }

        console.log(pages);
    }
);

This returns an error:

Error: pdf-text-extract command failed: I/O Error: Couldn't open file 'D:\Libraries\Documents\pdf-text-extract-bucket\http:\bucketname.storage.googleapis.com\fileName.pdf'

I've also tried passing this to the extract function:

file = storage.bucket('bucketname').file('fileName.pdf');

For reference, this is how it works with local files (instead of GCS buckets):

const filePath = path.join(__dirname, tempFileName);  
extract(filePath, callback);

2 Answers

1 vote

Checking the source code for pdf-text-extract, it's only designed to work with local files (it calls path.resolve() on the file path you pass in, which is why your URL got resolved against your working directory). Unless you want to change the way the module works, you can simply download the file to the local file system first:

const Storage = require('@google-cloud/storage');
const storage = new Storage();

const options = { destination: destFilename };
storage.bucket(bucketName).file(srcFilename).download(options);

download() is asynchronous and returns a promise, so wait for it to finish and then run the extraction on the downloaded file:

const localFile = destFilename;  // download() wrote the file to this exact path
extract(localFile, 
    (err, pages) => {
        if (err) {
            console.error(err)
            return
        }

        console.log(pages);
    }
);
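
For completeness, here is a minimal sketch of the two steps combined, reusing the storage client from above and downloading to the OS temp directory ('bucketname' and 'fileName.pdf' are the placeholders from the question). Since download() returns a promise, the extraction runs once it resolves:

const os = require('os');
const path = require('path');

const tempFilePath = path.join(os.tmpdir(), 'fileName.pdf');

storage.bucket('bucketname').file('fileName.pdf')
    .download({ destination: tempFilePath })
    .then(() => {
        // The PDF now exists locally, so pdf-text-extract can resolve its path.
        extract(tempFilePath, (err, pages) => {
            if (err) {
                console.error(err);
                return;
            }
            console.log(pages);
        });
    })
    .catch((err) => console.error('Download failed:', err));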
0 votes

I needed the file path of a file in Google Cloud Storage to open an SQLite database. This is the code I found and used, and it worked like a charm:

// Dependencies assumed by this snippet: Node core modules plus the
// promisified spawn from child-process-promise and the GCS client.
const os = require('os');
const path = require('path');
const fs = require('fs');
const spawn = require('child-process-promise').spawn;
const gcs = require('@google-cloud/storage')();

// fileBucket, filePath, fileName and contentType come from the
// Cloud Storage trigger event that invokes this function.

// Download file from bucket.
const bucket = gcs.bucket(fileBucket);
const tempFilePath = path.join(os.tmpdir(), fileName);
const metadata = {
  contentType: contentType,
};
return bucket.file(filePath).download({
  destination: tempFilePath,
}).then(() => {
  console.log('Image downloaded locally to', tempFilePath);
  // Generate a thumbnail using ImageMagick.
  return spawn('convert', [tempFilePath, '-thumbnail', '200x200>', tempFilePath]);
}).then(() => {
  console.log('Thumbnail created at', tempFilePath);
  // We add a 'thumb_' prefix to thumbnails file name. That's where we'll upload the thumbnail.
  const thumbFileName = `thumb_${fileName}`;
  const thumbFilePath = path.join(path.dirname(filePath), thumbFileName);
  // Uploading the thumbnail.
  return bucket.upload(tempFilePath, {
    destination: thumbFilePath,
    metadata: metadata,
  });
  // Once the thumbnail has been uploaded delete the local file to free up disk space.
}).then(() => fs.unlinkSync(tempFilePath));
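
To adapt this to the original question, only the download step matters: once the object has been downloaded to tempFilePath in os.tmpdir(), that is an ordinary absolute local path that pdf-text-extract (or sqlite) can open, and fs.unlinkSync(tempFilePath) cleans it up afterwards. The ImageMagick/thumbnail steps can be dropped.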