Although you have an accepted answer, it unfortunately only does
some handwaving in the direction of the PyYAML documentation and
quotes a statement from that documentation that is not correct: PyYAML
does not make a representation graph during dumping, it creates a
linear stream (and, just like json, keeps a bucket of IDs to see if
there are recursions).
First of all you have to realize that while the cjson dumper is
handcrafted C-code only, YAML's CSafeDumper shares two of the four
dump stages (Representer and Resolver) with the normal pure Python
SafeDumper, and that the other two stages (the Serializer and Emitter)
are not written completely handcrafted in C, but consist of a Cython
module which calls the C library libyaml for emitting.
Apart from that significant part, the simple answer to your question
why it takes longer is that dumping YAML does more. This is not so
much because YAML is harder, as @flow claims, but because the extra
things YAML can do make it so much more powerful than JSON and also
more user friendly if you need to process the result with an editor.
That means more time is spent in the YAML library when applying these
extra features, and in many cases also when just checking whether
something applies.
Here is an example: even if you have never gone through the PyYAML
code, you'll have noticed that the dumper doesn't quote foo and bar.
That is not because these strings are keys, as YAML doesn't have the
restriction that JSON has, that a key for a mapping needs to be a
string. E.g. a Python string that is a value in a mapping can also be
unquoted (i.e. plain).
The emphasis is on can, because it is not always so. Take for
instance a string that consists of numeral characters only: 12345678.
This needs to be written out with quotes, as otherwise it would look
exactly like a number (and be read back in as such when parsing).
How does PyYAML know when to quote a string and when not? On dumping
it actually first dumps the string, then parses the result to make
sure that, when it reads that result back, it gets the original value.
If that proves not to be the case, it applies quotes.
Let me repeat the important part of the previous sentence again, so
you don't have to re-read it:
it dumps the string, then parses the result
This means it applies all of the regex matching it does when
loading to see if the resulting scalar would load as an integer,
float, boolean, datetime, etc., to determine whether quotes need to be
applied or not.¹
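To make that check concrete, here is a minimal sketch of the idea. It is not PyYAML's code: the patterns below are simplified stand-ins I made up for illustration (PyYAML's own regexes cover many more forms, such as timestamps, hex and octal integers, and exponents), and the names IMPLICIT_PATTERNS, resolves_as and dump_scalar are hypothetical. It also ignores the other reasons a scalar may need quotes, such as special leading characters.

```python
import re

# Simplified, illustrative patterns only; these are NOT PyYAML's regexes.
IMPLICIT_PATTERNS = {
    "int": re.compile(r"^[-+]?[0-9]+$"),
    "float": re.compile(r"^[-+]?[0-9]*\.[0-9]+$"),
    "bool": re.compile(r"^(?:true|false|yes|no|on|off)$", re.IGNORECASE),
    "null": re.compile(r"^(?:~|null|)$", re.IGNORECASE),
}


def resolves_as(text):
    """Return the type a loader would pick for an unquoted scalar."""
    for name, pattern in IMPLICIT_PATTERNS.items():
        if pattern.match(text):
            return name
    return "str"


def dump_scalar(text):
    """Emit a Python string plain if that is safe, single-quoted otherwise."""
    if resolves_as(text) == "str":
        return text  # would load back as the same string, no quotes needed
    # Would load back as int/float/bool/null, so protect it with quotes.
    return "'" + text.replace("'", "''") + "'"


for value in ("foo", "bar", "12345678", "yes", "1.5"):
    print(value, "->", dump_scalar(value))
```

Every string that gets dumped has to run through such a series of matches, which is work a JSON encoder never has to do because it quotes every string unconditionally.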
In any real application with complex data, a JSON based
dumper/loader is too simple to use directly and a lot more
intelligence has to be in your program compared to dumping the same
complex data directly to YAML. A simplified example is when you want
to work with date-time stamps: in that case you have to convert a
string back and forth to datetime.datetime yourself if you are using
JSON. During loading you have to do that either based on the fact that
this is a value associated with some (hopefully recognisable) key:
{ "datetime": "2018-09-03 12:34:56" }
or with a position in a list:
["FirstName", "Lastname", "1991-09-12 08:45:00"]
or based on the format of the string (e.g. using regex).
In all of these cases much more work needs to be done in your program. The same
holds for dumping and that does not only mean extra development time.
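As a rough sketch of the third option (recognising the value by its format), the loading side could look like the following. I use the standard library json module here instead of cjson, target Python 3, and the names STAMP, FMT and revive are hypothetical helpers of my own, so take it as an illustration of the extra work and not as a recommended implementation.

```python
import datetime
import json
import re

# Assumed stamp format, matching the examples above: YYYY-MM-DD HH:MM:SS
STAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")
FMT = "%Y-%m-%d %H:%M:%S"


def revive(obj):
    """Walk loaded JSON data and turn stamp-like strings into datetimes."""
    if isinstance(obj, dict):
        return {key: revive(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [revive(value) for value in obj]
    if isinstance(obj, str) and STAMP.match(obj):
        return datetime.datetime.strptime(obj, FMT)
    return obj


by_key = revive(json.loads('{"datetime": "2018-09-03 12:34:56"}'))
by_position = revive(json.loads('["FirstName", "Lastname", "1991-09-12 08:45:00"]'))
print(repr(by_key["datetime"]))
print(repr(by_position[2]))
```

Note that this walks the complete data structure a second time, in Python, after the fast C loader is done.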
Let's regenerate your timings with what I get on my machine, so we
can compare them with other measurements. I rewrote your code
somewhat, because it was incomplete (timeit?) and imported other
things twice. It was also impossible to just cut and paste because of
the >>> prompts.
from __future__ import print_function
import sys
import yaml
import cjson
from timeit import timeit
NR=10000
ds = "; d={'foo': {'bar': 1}}"
d = {'foo': {'bar': 1}}
print('yaml.SafeDumper:', end=' ')
yaml.dump(d, sys.stdout, Dumper=yaml.SafeDumper)
print('cjson.encode: ', cjson.encode(d))
print()
res = timeit("yaml.dump(d, Dumper=yaml.SafeDumper)", setup="import yaml"+ds, number=NR)
print('yaml.SafeDumper ', res)
res = timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup="import yaml"+ds, number=NR)
print('yaml.CSafeDumper', res)
res = timeit("cjson.encode(d)", setup="import cjson"+ds, number=NR)
print('cjson.encode ', res)
and this outputs:
yaml.SafeDumper: foo: {bar: 1}
cjson.encode: {"foo": {"bar": 1}}
yaml.SafeDumper 3.06794905663
yaml.CSafeDumper 0.781533956528
cjson.encode 0.0133550167084
Now let's dump a simple data structure that includes a datetime:
import datetime
from collections import Mapping, Sequence  # python 2.7 has no .abc

d = {'foo': {'bar': datetime.datetime(1991, 9, 12, 8, 45, 0)}}

def stringify(x, key=None):
    # key=True marks x as a mapping key: a sequence is then replaced by
    # its repr() string, as a list cannot be used as a dict key
    if isinstance(x, str):
        return x
    if isinstance(x, Mapping):
        res = {}
        for k, v in x.items():
            res[stringify(k, key=True)] = stringify(v)
        return res
    if isinstance(x, Sequence):
        res = [stringify(k) for k in x]
        if key:
            res = repr(res)
        return res
    if isinstance(x, datetime.datetime):
        return x.isoformat(sep=' ')
    return repr(x)

print('yaml.CSafeDumper:', end=' ')
yaml.dump(d, sys.stdout, Dumper=yaml.CSafeDumper)
print('cjson.encode: ', cjson.encode(stringify(d)))
print()
This gives:
yaml.CSafeDumper: foo: {bar: '1991-09-12 08:45:00'}
cjson.encode: {"foo": {"bar": "1991-09-12 08:45:00"}}
For the timing of the above I created a module myjson that wraps
cjson.encode and has the above stringify defined. If you use that:
d = {'foo': {'bar': datetime.datetime(1991, 9, 12, 8, 45, 0)}}
ds = 'import datetime, myjson, yaml; d=' + repr(d)
res = timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup=ds, number=NR)
print('yaml.CSafeDumper', res)
res = timeit("myjson.encode(d)", setup=ds, number=NR)
print('cjson.encode ', res)
giving:
yaml.CSafeDumper 0.813436031342
cjson.encode 0.151570081711
Even that still rather simple output already brings you back from two
orders of magnitude difference in speed to less than one order of
magnitude.
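The arithmetic behind that statement, using only the timings reported above (the variable names are mine):

```python
# Timings copied from the two runs above (seconds for 10000 dumps).
yaml_py_simple = 3.06794905663    # yaml.SafeDumper, simple data
yaml_c_simple = 0.781533956528    # yaml.CSafeDumper, simple data
cjson_simple = 0.0133550167084    # cjson.encode, simple data
yaml_c_datetime = 0.813436031342  # yaml.CSafeDumper, data with datetime
cjson_datetime = 0.151570081711   # cjson.encode plus the stringify() wrapper

ratio_py_simple = yaml_py_simple / cjson_simple
ratio_c_simple = yaml_c_simple / cjson_simple
ratio_c_datetime = yaml_c_datetime / cjson_datetime

print("simple data, SafeDumper vs cjson:  %.1fx" % ratio_py_simple)
print("simple data, CSafeDumper vs cjson: %.1fx" % ratio_c_simple)
print("with datetime, CSafeDumper vs cjson + stringify: %.1fx" % ratio_c_datetime)
```

The YAML timing hardly changes between the two runs; almost all of the shift comes from the extra Python code that the JSON side now needs.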
YAML's plain scalars and block style formatting make for more readable
data. That you can have a trailing comma in a sequence (or mapping)
makes for fewer failures when manually editing YAML data than with the
same data in JSON.
YAML tags allow for in-data indication of your (complex) types. When
using JSON you have to take care, in your code, of anything more
complex than mappings, sequences, integers, floats, booleans and
strings. Such code requires development time, and is unlikely to be
as fast as python-cjson (you are of course free to write your code
in C as well).
Dumping some data, like recursive data-structures (e.g. topological
data) or complex keys, is pre-defined in the PyYAML library. There the
JSON library just errors out, and implementing a workaround for that is
non-trivial and most likely slows things down so much that the speed
differences become less relevant.
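A small demonstration of those two failure modes, using the standard library json module as a stand-in for cjson (an assumption on my part: cjson's exact exceptions and messages may differ) and Python 3:

```python
import json

# A recursive data structure: a list that contains itself.
recursive = []
recursive.append(recursive)

try:
    json.dumps(recursive)
    recursive_error = None
except ValueError as exc:  # json detects the circular reference
    recursive_error = exc

# A complex key: a tuple used as a mapping key.
complex_key = {(1, 2): "point"}

try:
    json.dumps(complex_key)
    key_error = None
except TypeError as exc:  # JSON object keys have to be simple scalars
    key_error = exc

print("recursive data:", type(recursive_error).__name__)
print("complex key:   ", type(key_error).__name__)
```

Working around either of these means inventing your own conventions for references or for encoding keys, and then undoing them again on loading.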
Such power and flexibility comes at the price of lower speed. When
dumping many simple things JSON is the better choice, as you are
unlikely to edit the result by hand anyway. For anything that involves
editing or complex objects or both, you should still consider using
YAML.
¹ It is possible to force dumping of all Python strings as YAML
scalars with (double) quotes, but setting the style is not enough to
prevent all readback.