4 votes

Summary:

I have an issue where sometimes the google-drive-sdk for Python does not detect the end of the document being exported. It seems to think that the Google document is of infinite size.

Background, source code and tutorials I followed:

I am working on my own Python-based Google Drive backup script (one with a nice CLI interface for browsing around): git link for source code

It's still in the making and currently only finds new files and downloads them (with the 'pull' command).

To implement the most important Google Drive commands, I followed the official Google Drive API tutorial for downloading media: here

What works:

When a file is a non-Google-Docs document, it is downloaded properly. However, when I try to "export" a file, I see that I need to use a different mimeType; I have a dictionary for this.

For example: I map application/vnd.google-apps.document to application/vnd.openxmlformats-officedocument.wordprocessingml.document when exporting a document.
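For illustration, a minimal version of that dictionary might look like the sketch below (the EXPORT_MIMETYPES name is just an example; the MIME type strings are the standard Google Drive export targets for Office formats):

EXPORT_MIMETYPES = {
    # Google Docs -> Word (.docx)
    'application/vnd.google-apps.document':
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    # Google Sheets -> Excel (.xlsx)
    'application/vnd.google-apps.spreadsheet':
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    # Google Slides -> PowerPoint (.pptx)
    'application/vnd.google-apps.presentation':
        'application/vnd.openxmlformats-officedocument.presentationml.presentation',
}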

When downloading Google documents from Google Drive, this seems to work fine. By this I mean: my while loop with the code status, done = downloader.next_chunk() will eventually set done to True and the download completes.
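For reference, the loop in question looks roughly like this (a sketch assuming an authorized Drive v3 service object and the example mapping above; export_media and MediaIoBaseDownload are from the official client):

import io

from googleapiclient.http import MediaIoBaseDownload

# `service` is assumed to be an authorized Drive v3 service object and
# `file_id` the ID of a Google-native document.
request = service.files().export_media(
    fileId=file_id,
    mimeType=EXPORT_MIMETYPES['application/vnd.google-apps.document'])
fh = io.BytesIO()
downloader = MediaIoBaseDownload(fh, request)
done = False
while not done:
    # `done` should become True once the final chunk has been received.
    status, done = downloader.next_chunk()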

What does not work:

However, on some files, the done flag never becomes True and the script downloads forever. This eventually amounts to several GB. Perhaps I am looking at the wrong flag to tell that the file is complete when doing an export. I am surprised that Google Drive never throws an error. Does anybody know what could cause this?

Current status:

For now I have exporting of google documents disabled in my code.

Scripts like "drive" by rakyll (at least the version I have) just put a link to the online copy instead. I would really like to do a proper export so that my offline system can maintain a complete backup of everything on Drive.

P.S. It's fine to post "you should use this service instead of the API" for the sake of others finding this page. I know that there are other services out there for this, but I'm really looking to explore the drive-api functions for integration with my other systems.

Comments:

From this documentation, make sure that the requests are authorized by an authenticated user through the OAuth 2.0 protocol. In addition to other scopes an application might need (such as https://www.googleapis.com/auth/drive), all applications attempting to import or export Google Apps Script projects must request the special scope https://www.googleapis.com/auth/drive.scripts. – abielita

I have it set to full scope: googleapis.com/auth/drive. Also, if the scope were wrong, I would not be seeing that it does in fact work for some of the exports just fine. I think it has something to do with pulling multiple chunks. – SpiRail

Same here! Did you find the problem? Thanks! – Diego Jancic

I found the problem. I'm using the v3 API and copied the Python code from Google's site. The API call never completes because (as it took me a while to discover) the HTTP call doesn't return the Content-Length. I haven't found a solution yet. – Diego Jancic

2 Answers

4 votes

OK, I found a pseudo-solution here.

The problem is that the Google API never returns the Content-Length header and the response is delivered in chunks. However, either the chunk information returned is wrong, or the Python API is not able to process it correctly.

What I did was grab the code for MediaIoBaseDownload from here.

I left everything the same, but changed this part:

# Determine the total size from the response headers.
if 'content-range' in resp:
    content_range = resp['content-range']
    length = content_range.rsplit('/', 1)[1]
    self._total_size = int(length)
elif 'content-length' in resp:
    self._total_size = int(resp['content-length'])
else:
    # PSEUDO BUG FIX: neither Content-Range nor Content-Length is present,
    # so there is no way to know the total size; cut the response here.
    self._total_size = self._progress

The else branch at the end is what I've added. I've also changed the default chunk size by setting DEFAULT_CHUNK_SIZE = 2*1024*1024. You will also have to copy a few imports from that file, including this one:

from googleapiclient.http import _retry_request, _should_retry_response
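For example, if the modified copy is saved as patched_download.py (a hypothetical module name), the rest of the code only needs to switch its import:

# Hypothetical: import the patched class from the local copy instead of
# from googleapiclient.http.
from patched_download import MediaIoBaseDownload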

Of course this is not a real solution; it just says "if I don't understand the response, stop it here". This will probably make some exports not work, but at least it doesn't kill the server. This is only until we can find a good solution.

UPDATE:

Bug is already reported here: https://github.com/google/google-api-python-client/issues/15

and as of January 2017, the only workaround is to not use MediaIoBaseDownload and do this instead (not suitable for large files):

# Export the file and read the whole body in one call (no chunking),
# so the missing Content-Length header does not matter.
req = service.files().export(fileId=file_id, mimeType=mimeType)
resp = req.execute(http=http)
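The resp returned by execute() is the raw exported content as bytes, so it can be written straight to disk. A quick sketch (the filename is just an example):

# Persist the exported bytes to a local file.
with open('export.docx', 'wb') as f:
    f.write(resp)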
0 votes

I'm using this and it works with the following libraries:

google-auth-oauthlib==0.4.1
google-api-python-client
google-auth-httplib2

This is the snippet I'm using:

import io

from googleapiclient.discovery import build  # used elsewhere to create self.service
from googleapiclient.http import MediaIoBaseDownload

def download_google_document_from_drive(self, file_id):
    # Assumes this is a method on a class that holds an authorized
    # Drive service object as self.service.
    try:
        # Stream the file contents into an in-memory buffer, chunk by chunk.
        request = self.service.files().get_media(fileId=file_id)
        fh = io.BytesIO()
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            status, done = downloader.next_chunk()
            print('Download %d%%.' % int(status.progress() * 100))
        return fh
    except Exception as e:
        print('Error downloading file from Google Drive: %s' % e)

You can then consume the returned stream directly; for example, open it as a spreadsheet with xlrd:

import xlrd

# fh.getvalue() returns the downloaded bytes, so no temp file is needed.
workbook = xlrd.open_workbook(file_contents=fh.getvalue())
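Or, if you want the download on disk instead, write the buffer out (the filename is just an example):

# Save the in-memory buffer to a local file.
with open('download.xlsx', 'wb') as f:
    f.write(fh.getvalue())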