
I am crawling data from a URL with Beautiful Soup, and I want to store the crawled data in Azure Blob Storage as a blob. Below is the code I use to save the data locally; I want to do the same thing as a direct upload to Azure.

soup = BeautifulSoup(urlopen('www.abc.html')) 
outfile = open('C:\\Users\\ADMIN\\filename.txt','w') 
data = soup.encode("ascii","ignore") 
outfile.write(data) 
outfile.close()

This code successfully saves the website's data to a local folder. Please help me save the same data directly to Azure Blob Storage. I have the account name and key for the storage account.

soup=BeautifulSoup(urlopen('www.abc.html'))
data = soup.encode("ascii","ignore")        

block_blob_service.create_blob_from_text('containername', 'filename.txt', data)

I tried the code above, but it does not work.

Please edit your question and include more details about what's not working. Are you getting any errors? - Gaurav Mantri
@GauravMantri In my blob code, the parameter 'data' is not a text type, and create_blob_from_text() expects a text argument. I have not been able to find another approach. Casting 'data' to text gives an error because it is a str type. - user5372164
Can you convert this data to a byte array? - Gaurav Mantri

1 Answer


The question does not say which version of BeautifulSoup you are using, or whether urlopen comes from urllib, urllib2, or urllib3 in Python 2. Judging from your code, I assume you are using BeautifulSoup4 with urllib2. I tried to reproduce your issue with the type of data, but could not: the code below works for me.

Here is my sample code.

from bs4 import BeautifulSoup
import urllib2

# Fetch the page and serialize the parsed document to an ASCII byte string
soup = BeautifulSoup(urllib2.urlopen("http://bing.com"))
data = soup.encode("ascii", "ignore")
print type(data)  # It's <type 'str'> here

from azure.storage.blob.blockblobservice import BlockBlobService

block_blob_service = BlockBlobService(account_name='<your-account-name>', account_key='<your-account-key>')
# Create the container, then upload into that same container
block_blob_service.create_container('mycontainer')
block_blob_service.create_blob_from_text('mycontainer', 'filename.txt', data)

Even after I replaced urllib2 with urllib, the type of data was still str. So you could try StringIO with block_blob_service.create_blob_from_stream instead, as below.

from StringIO import StringIO
block_blob_service.create_blob_from_stream('mycontainer', 'filename2.txt', StringIO(data))

It also works for me.
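
If the type of data differs between environments (for example, text in one and bytes in another), one option is to normalize it to bytes before uploading, along the lines of the byte array suggestion in the comments. Below is a minimal sketch using only the standard library; the helper names to_bytes and to_stream are hypothetical, not part of the Azure SDK, and the sample HTML string stands in for the output of soup.encode(...).

```python
import io


def to_bytes(data, encoding="utf-8"):
    """Return data as a byte string, encoding it first if it is text."""
    if isinstance(data, bytes):
        return data
    return data.encode(encoding)


def to_stream(data, encoding="utf-8"):
    """Wrap data in an in-memory binary stream positioned at the start."""
    return io.BytesIO(to_bytes(data, encoding))


# Stand-in for the crawled page; in the answer's code this would be `data`.
html = u"<html><body>caf\u00e9</body></html>"
payload = to_bytes(html)
stream = to_stream(html)
```

The resulting stream can then be passed in place of StringIO(data) in the create_blob_from_stream call above. I have not tested this variant against Azure myself, so treat it as a starting point.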

Hope it helps.