How is it that json serialization is so much faster than yaml serialization in Python?

votes

I have code that relies heavily on yaml for cross-language serialization and while working on speeding some stuff up I noticed that yaml was insanely slow compared to other serialization methods (e.g., pickle, json).

So what really blows my mind is that json is so much faster than yaml when the output is nearly identical.

>>> import yaml, cjson; d={'foo': {'bar': 1}}
>>> yaml.dump(d, Dumper=yaml.SafeDumper)
'foo: {bar: 1}\n'
>>> cjson.encode(d)
'{"foo": {"bar": 1}}'
>>> from timeit import timeit
>>> timeit("yaml.dump(d, Dumper=yaml.SafeDumper)", setup="import yaml; d={'foo': {'bar': 1}}", number=10000)
44.506911039352417
>>> timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup="import yaml; d={'foo': {'bar': 1}}", number=10000)
16.852826118469238
>>> timeit("cjson.encode(d)", setup="import cjson; d={'foo': {'bar': 1}}", number=10000)
0.073784112930297852

PyYaml's CSafeDumper and cjson are both written in C so it's not like this is a C vs Python speed issue. I've even added some random data to it to see if cjson is doing any caching, but it's still way faster than PyYaml. I realize that yaml is a superset of json, but how could the yaml serializer be 2 orders of magnitude slower with such simple input?

Tags: python, json, serialization, yaml

5 Answers

votes

In general, it's not the complexity of the output that determines the speed of parsing, but the complexity of the accepted input. The JSON grammar is very concise. The YAML parsers are comparatively complex, leading to increased overheads.

JSON’s foremost design goal is simplicity and universality. Thus, JSON is trivial to generate and parse, at the cost of reduced human readability. It also uses a lowest common denominator information model, ensuring any JSON data can be easily processed by every modern programming environment.

In contrast, YAML’s foremost design goals are human readability and support for serializing arbitrary native data structures. Thus, YAML allows for extremely readable files, but is more complex to generate and parse. In addition, YAML ventures beyond the lowest common denominator data types, requiring more complex processing when crossing between different programming environments.

I'm not a YAML parser implementor, so I can't speak specifically to the orders of magnitude without some profiling data and a big corpus of examples. In any case, be sure to test over a large body of inputs before feeling confident in benchmark numbers.
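
As a rough illustration of testing over more than one input, here is a minimal sketch of a benchmark harness. It uses only the standard-library json and pickle modules (not cjson or PyYAML), and the sample shapes and the names make_samples, bench and run_all are made up for this example. Taking the minimum of several repeats follows the usual timeit advice.

```python
import json
import pickle
from timeit import repeat


def make_samples():
    """Build a few differently shaped inputs (all hypothetical test data)."""
    deep = 1
    for _ in range(50):
        deep = {"child": deep}
    return {
        "tiny": {"foo": {"bar": 1}},
        "wide": {"key%d" % i: i for i in range(1000)},
        "deep": deep,
        "strings": ["value %d" % i for i in range(1000)],
    }


# Swap in other dump functions here to compare more serializers.
SERIALIZERS = {"json": json.dumps, "pickle": pickle.dumps}


def bench(func, data, number=100, rounds=3):
    """Best-of-`rounds` wall time for `number` calls of func(data)."""
    return min(repeat(lambda: func(data), number=number, repeat=rounds))


def run_all():
    """Time every serializer against every sample shape."""
    samples = make_samples()
    return {
        (shape, name): bench(func, data)
        for shape, data in samples.items()
        for name, func in SERIALIZERS.items()
    }


if __name__ == "__main__":
    for (shape, name), seconds in sorted(run_all().items()):
        print("%-8s %-7s %.4f s" % (shape, name, seconds))
```

A serializer that looks fast on the tiny input may rank differently on the wide or deep one, which is the reason for varying the shapes.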

Update: Whoops, misread the question. :-( Serialization can still be blazingly fast despite the large input grammar; however, browsing the source, it looks like PyYAML's Python-level serialization constructs a representation graph, whereas simplejson encodes built-in Python datatypes directly into text chunks.

votes

In applications I've worked on, type inference from strings to numbers (float/int) is where the largest overhead in parsing YAML lies, because strings can be written without quotes. Because all strings in JSON are quoted, there is no backtracking when parsing strings. A good example of where this slows things down is the value 0000000000000000000s: you cannot tell that this value is a string until you've read to the end of it.

The other answers are correct but this is a specific detail that I've discovered in practice.
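
To make the point concrete, here is a small sketch of the classification problem. The regular expressions are simplified assumptions loosely modeled on YAML 1.1 implicit typing, not the patterns any real parser uses, and classify_plain_scalar and classify_json_token are hypothetical helper names.

```python
import re

# Simplified, illustrative patterns (assumptions, not a real YAML resolver).
INT_RE = re.compile(r"[-+]?[0-9]+")
FLOAT_RE = re.compile(r"[-+]?(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?")


def classify_plain_scalar(text):
    """Return the type an unquoted scalar would resolve to in this sketch."""
    # fullmatch has to consume the whole token: a single trailing
    # character can turn an apparent number into a string.
    if INT_RE.fullmatch(text):
        return "int"
    if FLOAT_RE.fullmatch(text):
        return "float"
    return "str"


def classify_json_token(text):
    """In JSON the first character already settles string vs. non-string."""
    return "str" if text.startswith('"') else "non-string"


if __name__ == "__main__":
    for token in ("0000000000000000000", "0000000000000000000s", "3.14"):
        print(token, "->", classify_plain_scalar(token))
```

The unquoted token has to be scanned to its last character before its type is known, while the JSON token is decided by its opening quote.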

votes

Speaking about efficiency: I used YAML for a time and was attracted by the simplicity that some name/value assignments take on in this language. However, in the process I tripped every so often over one of YAML’s finesses, subtle variations in the grammar that allow you to write special cases in a more concise style and the like. In the end, although YAML’s grammar is almost certainly formally consistent, it has left me with a certain feeling of ‘vagueness’. I then restricted myself to not touching existing, working YAML code and writing everything new in a more roundabout, fail-safe syntax, which eventually made me abandon YAML altogether. The upshot is that YAML tries to look like a W3C standard, and produces a small library of hard-to-read literature concerning its concepts and rules.

This, I feel, is far more intellectual overhead than needed. Look at SGML/XML: developed by IBM in the roaring 60s, standardized by the ISO, known (in a dumbed-down and modified form) as HTML to uncounted millions of people, documented and documented and documented again the world over. Along comes little JSON and slays that dragon. How could JSON become so widely used in so short a time, with just one meager website (and a JavaScript luminary to back it)? It lies in its simplicity, the sheer absence of doubt in its grammar, the ease of learning and using it.

XML and YAML are hard for humans, and they are hard for computers. JSON is quite friendly and easy for both humans and computers.

votes

A cursory look at python-yaml suggests its design is much more complex than cjson's:

>>> dir(cjson)
['DecodeError', 'EncodeError', 'Error', '__doc__', '__file__', '__name__', '__package__', 
'__version__', 'decode', 'encode']

>>> dir(yaml)
['AliasEvent', 'AliasToken', 'AnchorToken', 'BaseDumper', 'BaseLoader', 'BlockEndToken',
 'BlockEntryToken', 'BlockMappingStartToken', 'BlockSequenceStartToken', 'CBaseDumper',
'CBaseLoader', 'CDumper', 'CLoader', 'CSafeDumper', 'CSafeLoader', 'CollectionEndEvent', 
'CollectionNode', 'CollectionStartEvent', 'DirectiveToken', 'DocumentEndEvent', 'DocumentEndToken', 
'DocumentStartEvent', 'DocumentStartToken', 'Dumper', 'Event', 'FlowEntryToken', 
'FlowMappingEndToken', 'FlowMappingStartToken', 'FlowSequenceEndToken', 'FlowSequenceStartToken', 
'KeyToken', 'Loader', 'MappingEndEvent', 'MappingNode', 'MappingStartEvent', 'Mark', 
'MarkedYAMLError', 'Node', 'NodeEvent', 'SafeDumper', 'SafeLoader', 'ScalarEvent', 
'ScalarNode', 'ScalarToken', 'SequenceEndEvent', 'SequenceNode', 'SequenceStartEvent', 
'StreamEndEvent', 'StreamEndToken', 'StreamStartEvent', 'StreamStartToken', 'TagToken', 
'Token', 'ValueToken', 'YAMLError', 'YAMLObject', 'YAMLObjectMetaclass', '__builtins__', 
'__doc__', '__file__', '__name__', '__package__', '__path__', '__version__', '__with_libyaml__', 
'add_constructor', 'add_implicit_resolver', 'add_multi_constructor', 'add_multi_representer', 
'add_path_resolver', 'add_representer', 'compose', 'compose_all', 'composer', 'constructor', 
'cyaml', 'dump', 'dump_all', 'dumper', 'emit', 'emitter', 'error', 'events', 'load', 
'load_all', 'loader', 'nodes', 'parse', 'parser', 'reader', 'representer', 'resolver', 
'safe_dump', 'safe_dump_all', 'safe_load', 'safe_load_all', 'scan', 'scanner', 'serialize', 
'serialize_all', 'serializer', 'tokens']

More complex designs almost invariably mean slower designs, and this is far more complex than most people will ever need.

votes

Although you have an accepted answer, unfortunately that only does some handwaving in the direction of the PyYAML documentation and quotes a statement in that documentation that is not correct: PyYAML does not make a representation graph during dumping; it creates a linear stream (and, just like json, keeps a bucket of IDs to see if there are recursions).

First of all you have to realize that while the cjson dumper is handcrafted C-code only, YAML's CSafeDumper shares two of the four dump stages (Representer and Resolver) with the normal pure Python SafeDumper and that the other two stages (the Serializer and Emitter) are not written completely handcrafted in C, but consist of a Cython module which calls the C library libyaml for emitting.

Apart from that significant part, the simple answer to your question why it takes longer is that dumping YAML does more. This is not so much because YAML is harder, as @flow claims, but because the extras that YAML offers make it so much more powerful than JSON, and also more user-friendly if you need to process the result with an editor. That means more time is spent in the YAML library even when applying these extra features, and in many cases also just checking if something applies.

Here is an example: even if you have never gone through the PyYAML code, you'll have noticed that the dumper doesn't quote foo and bar. That is not because these strings are keys, as YAML doesn't have the restriction that JSON has, that a key for a mapping needs to be a string. E.g. a Python string that is a value in a mapping can also be unquoted (i.e. plain).

The emphasis is on can, because it is not always so. Take for instance a string that consists of numeral characters only: 12345678. This needs to be written out with quotes as otherwise this would look exactly like a number (and read back in as such when parsing).

How does PyYAML know when to quote a string and when not? On dumping it actually first dumps the string, then parses the result to make sure that, when it reads that result back, it gets the original value. If that proves not to be the case, it applies quotes.

Let me repeat the important part of the previous sentence again, so you don't have to re-read it:

it dumps the string, then parses the result

This means it applies all of the regex matching it does when loading to see if the resulting scalar would load as an integer, float, boolean, datetime, etc., to determine whether quotes need to be applied or not.¹
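
The idea can be sketched in a few lines. This is not PyYAML's code: the resolver table below is a heavily reduced assumption, the names IMPLICIT_RESOLVERS, resolves_as and emit_string are hypothetical, and real YAML emitters have further reasons to quote (indicator characters such as ':' or '#', leading or trailing spaces, and so on).

```python
import re

# Hypothetical, heavily reduced table of implicit types a loader might
# recognise in a plain (unquoted) scalar.
IMPLICIT_RESOLVERS = [
    ("bool", re.compile(r"(?:yes|no|true|false|on|off)", re.IGNORECASE)),
    ("null", re.compile(r"(?:~|null|)", re.IGNORECASE)),
    ("int", re.compile(r"[-+]?[0-9]+")),
    ("float", re.compile(
        r"[-+]?(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?")),
    ("timestamp", re.compile(
        r"[0-9]{4}-[0-9]{2}-[0-9]{2}(?:[ T][0-9:.]+)?")),
]


def resolves_as(text):
    """Return the type a loader would assign to `text` if left unquoted."""
    for tag, pattern in IMPLICIT_RESOLVERS:
        if pattern.fullmatch(text):
            return tag
    return "str"


def emit_string(value):
    """Emit a Python string plain when that is safe, single-quoted otherwise."""
    if resolves_as(value) != "str":
        # Reading it back would not give a string, so quote it.
        return "'" + value.replace("'", "''") + "'"
    return value


if __name__ == "__main__":
    for s in ("foo", "12345678", "true", "1991-09-12 08:45:00"):
        print(emit_string(s))
```

Every string that is dumped pays for the full list of pattern checks, even the ones that end up plain, which is the kind of per-value work a JSON encoder never has to do.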

In any real application with complex data, a JSON-based dumper/loader is too simple to use directly, and a lot more intelligence has to be in your program compared to dumping the same complex data directly to YAML. A simplified example is working with date-time stamps: if you are using JSON, you have to convert a string back and forth to datetime.datetime yourself. During loading you have to do that either based on the fact that this is a value associated with some (hopefully recognisable) key:

{ "datetime": "2018-09-03 12:34:56" }

or with a position in a list:

["FirstName", "Lastname", "1991-09-12 08:45:00"]

or based on the format of the string (e.g. using regex).

In all of these cases much more work needs to be done in your program. The same holds for dumping and that does not only mean extra development time.
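
As a minimal sketch of that extra work, here is the key-based variant done with the standard-library json module (an assumption made so the example runs without cjson). The names DATETIME_KEYS, encode_extra and revive_datetimes are hypothetical, and the set of timestamp keys is something your program has to know in advance.

```python
import json
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
# Assumption for this sketch: only values under these keys are timestamps.
DATETIME_KEYS = {"datetime"}


def encode_extra(value):
    """default= hook: json.dumps calls it for objects it cannot serialize."""
    if isinstance(value, datetime):
        return value.strftime(STAMP_FORMAT)
    raise TypeError("cannot serialize %s" % type(value).__name__)


def revive_datetimes(obj):
    """object_hook: json.loads calls it for every decoded JSON object."""
    for key in DATETIME_KEYS.intersection(obj):
        obj[key] = datetime.strptime(obj[key], STAMP_FORMAT)
    return obj


original = {"datetime": datetime(2018, 9, 3, 12, 34, 56)}
text = json.dumps(original, default=encode_extra)
restored = json.loads(text, object_hook=revive_datetimes)

if __name__ == "__main__":
    print(text)
    print(restored == original)
```

Both hooks are ordinary Python functions called per object, so their cost is added on top of whatever the JSON encoder and decoder themselves take.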

Let's regenerate your timings with what I get on my machine so we can compare them with other measurements. I rewrote your code somewhat, because it was incomplete (timeit?) and imported other things twice. It was also impossible to just cut and paste because of the >>> prompts.

from __future__ import print_function

import sys
import yaml
import cjson
from timeit import timeit

NR=10000
ds = "; d={'foo': {'bar': 1}}"
d = {'foo': {'bar': 1}}

print('yaml.SafeDumper:', end=' ')
yaml.dump(d, sys.stdout, Dumper=yaml.SafeDumper)
print('cjson.encode:   ', cjson.encode(d))
print()


res = timeit("yaml.dump(d, Dumper=yaml.SafeDumper)", setup="import yaml"+ds, number=NR)
print('yaml.SafeDumper ', res)
res = timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup="import yaml"+ds, number=NR)
print('yaml.CSafeDumper', res)
res = timeit("cjson.encode(d)", setup="import cjson"+ds, number=NR)
print('cjson.encode    ', res)

and this outputs:

yaml.SafeDumper: foo: {bar: 1}
cjson.encode:    {"foo": {"bar": 1}}

yaml.SafeDumper  3.06794905663
yaml.CSafeDumper 0.781533956528
cjson.encode     0.0133550167084

Now let's dump a simple data structure that includes a datetime:

import datetime
from collections import Mapping, Sequence  # python 2.7 has no .abc

d = {'foo': {'bar': datetime.datetime(1991, 9, 12, 8, 45, 0)}}

def stringify(x, key=None):
    # Recursively convert x into something cjson can encode.
    # key=True marks x as a mapping key: a sequence used as a key is
    # collapsed to its repr(), so that the key ends up as a string.
    if isinstance(x, str):
        return x
    if isinstance(x, Mapping):
        res = {}
        for k, v in x.items():
            res[stringify(k, key=True)] = stringify(v)
        return res
    if isinstance(x, Sequence):
        res = [stringify(k) for k in x]
        if key:
            res = repr(res)
        return res
    if isinstance(x, datetime.datetime):
        return x.isoformat(sep=' ')
    return repr(x)

print('yaml.CSafeDumper:', end=' ')
yaml.dump(d, sys.stdout, Dumper=yaml.CSafeDumper)
print('cjson.encode:    ', cjson.encode(stringify(d)))
print()

This gives:

yaml.CSafeDumper: foo: {bar: '1991-09-12 08:45:00'}
cjson.encode:     {"foo": {"bar": "1991-09-12 08:45:00"}}

For the timing of the above I created a module myjson that wraps cjson.encode and has the above stringify defined. If you use that:

d = {'foo': {'bar': datetime.datetime(1991, 9, 12, 8, 45, 0)}}
ds = 'import datetime, myjson, yaml; d=' + repr(d)
res = timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup=ds, number=NR)
print('yaml.CSafeDumper', res)
res = timeit("myjson.encode(d)", setup=ds, number=NR)
print('cjson.encode    ', res)

giving:

yaml.CSafeDumper 0.813436031342
cjson.encode     0.151570081711

Even with that still rather simple output, the speed difference already shrinks from two orders of magnitude to less than one.

YAML's plain scalars and block-style formatting make for more readable data. That you can have a trailing comma in a sequence (or mapping) makes for fewer failures when manually editing YAML data than with the same data in JSON.

YAML tags allow for in-data indication of your (complex) types. When using JSON you have to take care, in your code, of anything more complex than mappings, sequences, integers, floats, booleans and strings. Such code requires development time, and is unlikely to be as fast as python-cjson (you are of course free to write your code in C as well).

Dumping some data, like recursive data structures (e.g. topological data) or complex keys, is predefined in the PyYAML library. There the JSON library just errors out, and implementing a workaround for that is non-trivial and most likely slows things down so much that the speed differences become less relevant.
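
The "errors out" part is easy to reproduce. The sketch below uses the standard-library json module as a stand-in (an assumption; cjson may report these cases differently), and try_dump is a hypothetical helper that returns the name of the exception instead of raising it.

```python
import json

# A structure that refers to itself.
recursive = {"name": "node"}
recursive["self"] = recursive

# A mapping with a non-scalar (tuple) key.
complex_key = {(1, 2): "point"}


def try_dump(obj):
    """Return the JSON text, or the name of the exception json.dumps raised."""
    try:
        return json.dumps(obj)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


if __name__ == "__main__":
    print(try_dump({"a": 1}))
    print(try_dump(recursive))
    print(try_dump(complex_key))
```

Any workaround (replacing cycles by references, encoding keys as strings) has to be written in your own code and applied on both the dumping and the loading side.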

Such power and flexibility come at the price of lower speed. When dumping many simple things JSON is the better choice; you are unlikely to edit the result by hand anyway. For anything that involves editing or complex objects or both, you should still consider using YAML.

¹ It is possible to force dumping of all Python strings as YAML scalars with (double) quotes, but setting the style is not enough to prevent all readback.