
I am trying to run a Dataflow pipeline that uses a Python file which loads a pickle file, as shown below:

dataflow.py

    import apache_beam as beam
    from apache_beam.io import ReadFromText
    from stopwords import StopWords

    stopwords = StopWords()
    ...
    data = (pipeline
            | 'read' >> ReadFromText('gs://some/inputData.txt')
            | 'stopwords' >> beam.Map(
                lambda x: {'id': x['id'], 'text': stopwords.validate(x['text'])}))

stopwords.py

    import os
    import pickle

    class StopWords:
        def __init__(self):
            module_dir = os.path.dirname(__file__)
            self.words = pickle.load(
                open(os.path.join(module_dir, 'model/sw.p'), 'rb'))

However, I get the following error:

    IOError: [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/dataflow/model/sw.p'

When I debug `self.words` locally, it runs smoothly. However, it hits this problem when I run it as a Google Cloud Dataflow job.

Can anyone help?


1 Answer


Your StopWords class is attempting to load the model (sw.p) from the same directory as the stopwords.py file, but it seems the model hasn't been deployed along with your code.

Perhaps try putting the sw.p file you have locally into a Google Cloud Storage bucket and loading it from there instead?
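For example, here's a minimal sketch of that approach. It assumes the Beam Python SDK's GCS helper (`apache_beam.io.gcp.gcsio`), and `gs://your-bucket/model/sw.p` is a placeholder path, not something from your job:

    import pickle

    from apache_beam.io.gcp import gcsio

    class StopWords:
        def __init__(self, path='gs://your-bucket/model/sw.p'):
            # Placeholder path: point this at wherever you upload sw.p.
            # GcsIO opens GCS objects as file-like handles, and Dataflow
            # workers already carry credentials for your project's buckets.
            with gcsio.GcsIO().open(path, 'rb') as f:
                self.words = pickle.load(f)

Uploading the file is a one-time step, e.g. `gsutil cp model/sw.p gs://your-bucket/model/sw.p`. That way the model no longer needs to exist on the worker's local filesystem.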