0
votes

I have been tasked to develop an interface layer that will be used internally by other developers as a means to access Google Cloud Storage (GCS). For me the process began by reading the online documentation. I am now at the point of making a decision as to which API we'll use. There are a couple of outstanding questions and that documentation mentions posting questions to SO, so here I am. A quick tidbit of background seems in order. We are predominately a C house, though we do have means to build and invoke C++ methods from C. The internal users will write C code that invokes my C code, which in turn must provide access to GCS. So that's the high-level call stack. The question then becomes one of how do I provide that access in the best manner, with performance being the top criteria?

To answer that, I began by reading the online documentation regarding the different GCS options. There is an XML and a JSON REST API to be had. We have an internal HTTP mechanism that would enable the C code to invoke those JSON/XML API methods directly. The documentation states the following:

"...JSON API is RESTful...and is specifically intended to be used with the Google Cloud Client Libraries."

What are Google Cloud libraries I wondered. Reading the documentation gleaned that they are libraries that are written in a particular language and can be utilized to access GCS. The documentation seems to steer you in the direction of using one of these client libraries versus the JSON/XML ones.

So the first question I had was "what does the 'specifically intended" business mean exactly?" After more reading, I arrived at the notion that these client libraries are an abstraction of the RESTful interfaces. These libraries invoke the JSON API methods, just as you could do yourself through the JSON API directly. Is that correct? If so they seem like a means of convenience to interact with GCS. The documentation even states that the cloud libraries "...provide better performance and usability" versus the JSON HTTP interface.

In the end, I see two paths forward:

  1. Invoke the JSON API directly from C using our HTTP mechanism
  2. Use the bridge to invoke C++ methods from C. Those C++ methods then invoke methods in the client libraries, which ultimately (if the above is correct) invoke the JSON API.

Note that I've already written some C code in-house that uses the aforementioned internal HTTP mechanism to interact with the Apache WebHDFS API, which is also RESTful and for which there are no client libraries. Thus I could leverage a fair portion of that code to re-use in this new development. For me it boils down to a question of performance. The second option above seems rather circuitous in comparison to the first. Thus the first would seem to yield some performance improvement over the second. However Google mentions that the client libraries provide better performance than the RESTful APIs. How is that? The documentation states that the client libraries handle all of the low level communication with the server, including authentication. Is this part of the reason?

And so I am posing this question to those who are more experienced with GCS (and perhaps GCP for that matter): which route would provide better performance in your opinion? Invoking the JSON API directly or using the client libraries (C++ in our case)? Thanks!

1

1 Answers

0
votes

Disclaimer: I work on the C++ client libraries for Google Cloud, specifically the C++ GCS library.

However Google mentions that the client libraries provide better performance than the RESTful APIs. How is that?

The libraries are tuned to use the RPCs "well". They have good defaults for retry and backoff strategies, they pick the right RPC if two alternatives are available. They can recover efficiently from an interrupted download or upload. They may implement higher-level abstractions to improve performance.

If you are asking "how can running their code to call an RPC be more efficient than me calling the same RPC", then the answer is "probably it cannot". I do not think that is the interesting problem. For example, if you want to upload a large file (say multiple GiB) the most efficient way in GCS is to split the file in chunks, use several parallel uploads to different GCS objects, then compose the objects, and delete the intermediate objects. That would be several times faster than uploading the file sequentially. Could you do the same? Sure! Do you want to? Maybe not.

We also worry about resuming an interrupted download correctly. And verifying the integrity of your data as you upload and download it (you can turn off those features, of course).

Finally, we can change the protocol under the hood if there is a more efficient way to do things. For example, we already switch between XML and JSON when we both are equivalent and one seems to perform better than the other. We are also working on supporting gRPC instead of REST, which is showing good improvements over the REST baseline (unfortunately it is not GA yet, and I do not have an estimated date).

HTH