Last week, while assessing different Tcl-only implementations for JSON parsing on the Tcler's Wiki, I was struck by a crazy idea: parsing JSON strings from end to start should be faster than the other way 'round, because there are fewer options on which to decide.
Well, I could have looked more closely to verify the soundness of the idea, but I got so obsessed with it that I just started coding, and guess what? I got the fastest Tcl-only implementation for parsing JSON strings out there!
That was astonishing, because I did not even parse directly into a Tcl data structure, but into an intermediate format, which has to be converted into the desired form in a second step.
The bad news? The worst thing that can happen to an engineer happened: it works, and he does not know why!
After calming down a little from the unexpectedly good outcome, I rewrote the JSON parser to parse from left to right. This should be faster than the right to left parser, because it does not have to jump through the hoops required by the unusual approach of the latter. But: the forward parser is slower than the reverse parser…
In this article I will show you what this reverse parsing is like, which benchmarks I used and what numbers they yielded; I will give clues about the possible causes for the performance differences among the available Tcl JSON parsers and make guesses about further potential improvements.
Reverse JSON parsing
When parsing JSON from left to right, you do more or less the following:

- When you see a `{`, assemble the contents as an object, whose elements are given as a comma-separated list of `"key": value` pairs, terminated by the closing `}`.
- When you see a `[`, assemble the contained comma-separated value list until the closing `]` into an ordered array.
- When you see a `"`, what follows is a string value until the closing `"`. A quote character inside a string is escaped by a backslash: `\"`.
- `true`, `false` or `null` are just their value.
- When you see a number … it's a number. It can be either an integer or a double float. Floats look like `-1.602e-19`.
Now: let’s do the same thing from right to left:
- When you see a `}`, assemble an object until the opening `{`. Remember the `"key": value` notation: you will parse any value up to the next `:` and then parse the key string. After that, a `,` tells you there is more, a `{` tells you you are done. Now reverse the list.
- With a `]`, assemble the array by gathering the individual values. A `,` tells you there is another value, a `[` that you are done. Now reverse the list.
- With a `"`, assemble the string until the opening `"` (see the sketch after this list).
- When you see an `l`, it's `null`; when you see an `e`, check the previous letter: if it is a `u`, it's `true`, with an `s` it's `false`, everything else is an error. Go back three or four characters respectively.
- When you see `0` through `9` or a lone `.`, it's a number: scan back to the next `,`, `:` or `[` and parse the number from there.
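To make this concrete, here is a minimal sketch in Tcl of reverse-scanning a string value - my illustration of the idea, not the actual ton code:

```tcl
# Given the index of a closing quote, scan back to the matching opening
# quote, skipping backslash-escaped quotes. Returns the string contents
# and the index of the opening quote.
proc rscan-string {json end} {
    set i [expr {$end - 1}]
    while {$i >= 0} {
        set i [string last \" $json $i]
        if {$i < 0} {error "unterminated string"}
        if {[string index $json [expr {$i - 1}]] ne "\\"} {
            return [list [string range $json [expr {$i + 1}] [expr {$end - 1}]] $i]
        }
        incr i -1
    }
    error "unterminated string"
}

puts [rscan-string {{"a":"b"}} 7]   ;# -> b 5
```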
TON - The Intermediate Tcl Object Notation Format
We use the active file design pattern, which is nicely explained by Koen Van Damme.
An object is represented by a proc - Tcl parlance for a function - which evaluates its arguments into a Tcl dictionary - Tcl parlance for a key/value list. The array proc turns its arguments into a list, the string function into a string, and so on. Here is the list of functions:
- `o` - object
- `a` - array
- `s` - string
- `l` - literal, for representing `true`, `false` or `null`
- `i` and `d` - integer and double number functions. We go this extra mile to differentiate between the two, so the user of the format can take advantage of the additional information if she likes. We have also used `n` - number - instead of `i` and `d`; more about that later.
Let’s see an example.
{
"Me": {
"Name": "Georg Lehner",
"Height": 178,
"Pets": ["Kaiser", "Biba", "Fortunatus"],
"Male": true
}
}
Converting this JSON string to TON looks like this (line breaks added by me):
o Me [o Name [s {Georg Lehner}] \
Height [i 178] \
Pets [a [s Kaiser] [s Biba] [s Fortunatus]] \
Male [l true]]
In Tcl everything is a string, so we don't need quotes in most cases. Everything inside `[]` is evaluated as a script. If we run this object representation as a Tcl script in the right environment, the first `o`bject proc waits for all its arguments to be evaluated and then returns them as a data structure or as a string in a different object presentation format. Let's take a stab at a TON to JSON converter, following the above example in order of appearance:
- The `o`bject proc takes all key/value pairs in its arguments, wraps each key in `"`, followed by a `:` and the value, and glues the results together as a comma-separated list, wrapped into a pair of curly braces. The first `o` has only one key/value argument, whose value is another `o` call.
- The `s` proc takes one argument, encloses it in double quotes and returns it.
- The `i` and `d` procs take their arguments and return the respective integer or double number string representation.
- The `a`rray proc takes its arguments and turns them into a comma-separated list, enclosed in square brackets.
- The `l`iteral proc takes its argument and returns it as a string.
Easy? … Easy!
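Here is a minimal sketch of such a TON to JSON decoder - my illustration, not the actual ton code:

```tcl
namespace eval ton::2json {
    proc s {str}  {return "\"$str\""}
    proc i {num}  {return $num}
    proc d {num}  {return $num}
    proc l {lit}  {return $lit}
    proc a {args} {return "\[[join $args ", "]\]"}
    proc o {args} {
        set members {}
        foreach {key val} $args {
            lappend members "\"$key\": $val"
        }
        return "{[join $members ", "]}"
    }
}

puts [namespace eval ton::2json {o Me [o Name [s {Georg Lehner}] Height [i 178]]}]
# -> {"Me": {"Name": "Georg Lehner", "Height": 178}}
```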
The complete code for a validating JSON to TON parser, a TON to JSON decoder, two types of TON to Tcl dictionary decoders and a TON to Tcl list decoder fits into 172 lines of compact Tcl code.
Other Tcl JSON parsers
The Tcler's Wiki's JSON page lists a wealth of parsers for the JSON format. We have to distinguish between parsers written in the C language, with a wrapper to make them accessible from within the Tcl interpreter, and parsers written in pure Tcl.
The C parsers are fast, but in order for them to work, you need to compile the C parser and the Tcl glue code for your target platform, something which might not be readily available in a given situation. We include the SQLite3 JSON1 extension in our benchmark so that you can see that it is orders of magnitude faster than any Tcl JSON parser.
Here is the list of all Tcl-only parsers, in order of appearance:
The jimhttp web framework contains a JSON parser.
tcl-json is a JSON parser created from a grammar by yeti, a parser framework for Tcl à la yacc. On our small test data it truncates arrays at the first element; it fails to parse bigger test data and does not terminate on the biggest JSON file we feed it, so tcl-json does not qualify for the competition.
The Tcl "standard library" - Tcllib - provides the so far fastest Tcl-only parser. Tcllib is capable of using a C parser if one is available. We have to force it into Tcl-only mode for the benchmark.
Alternative JSON implements a jsonDecode proc which also uses an intermediate format, a Tcl dictionary where all data is tagged with its type.
There is also a json2dict one-liner on the JSON page. It just replaces `[]`s with `{}`s and eliminates commas and colons. The resulting string is legal Tcl dict/list syntax. This is a superfast conversion of JSON to Tcl and scales well; however, it destroys data containing any of the JSON delimiter characters. Most annoying might be that URLs are corrupted, but simple English text is mutilated too. For the fun of it, we hack the benchmark program and trick it into believing that the mangled output is correct, allowing json2dict to compete.
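The idea can be reconstructed roughly like this (the actual wiki one-liner differs in details):

```tcl
# Map JSON brackets to braces, turn commas and colons into spaces,
# then strip the outer braces. Fast, but corrupts any data that itself
# contains these characters.
proc json2dict {json} {
    string range [string trim [string map {\[ \{ \] \} , { } : { }} $json]] 1 end-1
}

puts [dict get [json2dict {{"Name": "Georg", "Height": 178}}] Name]   ;# -> Georg
```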
Interpreting the parsed results
The output of the parsers comes in different forms. One of the typical representations is used by Tcllib: JSON objects are converted into Tcl dicts, JSON arrays into Tcl lists. This requires some effort to extract data from the resulting representation, since nested calls to `dict get` and `lindex` are required to get to a specific item. An example from the benchmark is:
dict get [lindex $parsed 43] gender
Where `parsed` is a variable holding the parsed Tcl representation of the JSON input. What we achieve here is extracting the 43rd element of a JSON array, which is an object, from which we extract the 'gender' field.
The second typical representation, coined 'dictionary format', converts JSON arrays into Tcl dicts, using the decimal integer string representation of an element's array index as the dictionary key. Extraction of an element is a straightforward `dict get` with all the keys and array indexes listed in order. The example from above:
dict get $parsed 43 gender
Some of the foreign language parsers use a special path syntax:

- SQLite-JSON1: `$[43].gender`
- jq: `[43].gender`
Finally, the jsonDecode parser allows each path element to be checked precisely by type, a good thing for a validating extractor function. Written by hand, however, it gets cumbersome:
dict get [lindex [dict get $parsed array] 43] \
object gender string
We first extract the (only) array from `parsed`, slice out the 43rd element from it, whose (only) object has a gender element, which in turn is extracted as a string.
The TON to JSON variants
With TON, the JSON parsing process has two steps: first JSON is converted to TON, then TON is executed into the desired data format.
It is trivial to write the conversion to the Tcllib and the dictionary formats; we call them `2dict` and `a2dict` respectively, `a2dict` as in array 2 dictionary.
A third implementation, `2list`, is remotely similar to Alternative JSON's tagged format, but more compact. Arrays are lists prefixed with an `a`rray tag; objects are key/value lists prefixed with an `o`bject tag. We provide a `get` function which iterates through a dictionary-format-like list of keys. Array indexes can be specified in any number base Tcl supports.
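Such a getter can look roughly like this - a sketch in which the exact tag layout is my assumption for illustration:

```tcl
# Walk a tagged list: 'a'-tagged lists are indexed numerically,
# 'o'-tagged lists are treated as key/value dictionaries.
proc get {data args} {
    foreach key $args {
        switch -- [lindex $data 0] {
            a       {set data [lindex $data [expr {$key + 1}]]}
            o       {set data [dict get [lrange $data 1 end] $key]}
            default {error "cannot descend into '$data'"}
        }
    }
    return $data
}

puts [get {a {o gender {s female}}} 0 gender]   ;# -> s female
```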
Remember that TON must be executed in an environment where the Tcl procs `o`, `a`, `s` et al. are defined. In order to isolate different TON to X decoders, we put them in different namespaces.
To get a TON string in the variable `parsed` converted to the 'dictionary format' we do the following:
namespace eval ton::a2dict $parsed
If we want the ‘Tcllib format’ we do:
namespace eval ton::2dict $parsed
And for getting the gender of the 43rd element of the test data, as shown above, with the `2list` decoder we do:
set l [namespace eval ton::2list $parsed]
ton::2list::get $l 43 gender
Let's see how the different TON decoders differ in performance depending on the input data:
The data generation step for `2dict` and `2list` is almost identical, but `a2dict` will need more time because it has to generate and insert the array indexes.
On the other hand, data extraction for `a2dict` only needs Tcl's built-in `dict` mechanism, while `2dict` requires costly nested calls, just like the other Tcllib-like representations. Finally, `2list` needs to iterate through the keys with the `get`ter function.
We will revisit these differences later when discussing the benchmark results.
The Benchmark
D. Bohdan has led the way for JSON parser benchmarking with the JSON value extraction benchmark page of the Tcler's Wiki.
He uses all available card data from ‘Magic: The Gathering’ converted into JSON format and available at https://mtgjson.com/.
This file grows with time, currently offering 29 MB worth of JSON data, with almost a million lines and 2.8 million words.
This is in no way the use case I had in mind when starting to select a JSON parser. I want to handle a typical web API, where at most some tens to hundreds of relatively shallow items come back.
So I modified and streamlined Bohdan's benchmark program so that I can specify any number of different tests, each comprising a JSON test data set, a path to a specific datum for the respective test, and the expected value.
The benchmark program is configurable, so that one can select which JSON parser implementations are to be run in a test batch, and which of the specified tests are to be run.
In order to get an idea of how the parsers do with growing data sizes and complexities, I collected the following JSON test sets, listed in order of size:
The object example from the JSON specification in RFC 8259.
A file with 100 data sets from Mockaroo, an array with fake personal data.
A second Mockaroo file with 1000 data sets.
A file from the JSON generator with 3895 data sets of fake personal data.
The already mentioned ‘Magic: The Gathering’ data set.
The following table shows some data about the data sets, extracted with `wc`:

| lines | words | bytes | sets | depth | filename |
|---|---|---|---|---|---|
| 14 | 31 | 364 | 1 | 3 | test.json |
| 99 | 1201 | 13276 | 100 | 2 | mock_100.json |
| 999 | 12021 | 135133 | 1000 | 2 | mock_1000.json |
| 175276 | 550972 | 5381126 | 3895 | 3 | generated.json |
| 973974 | 2794332 | 29425030 | 221x355 (78455) | 4 | AllSets.json |
The 'depth' column corresponds to the depth of the object tree at which the benchmark extracts data, and presumably also the deepest level available in the respective data set.
This selection of data sets is in no way fit for a systematic exploration of the parsers' time behaviour. For this we would have to vary at least the following parameters independently of each other:
- Amount of data
- Depth of data sets
- Amount of different data types in the sets: numbers cost more to parse than literals, …
However, you will see that we get some interesting insights, although essentially only the first parameter is varied.
Benchmark Results
The complete result data can be found in the file benchdata.log. In order to keep the discussion focused, I will only present the respective fastest ton implementation result (one of three) and also suppress results from other benchmarks where they do not contribute much to the discussion. (Side note: we call the format TON, and ton our Tcl implementations that produce and decode it.)
The ton results are always very close to each other. We tested three implementations:
- A standard left to right parser.
- A right to left parser which does not validate numbers.
- A ‘validating’ right to left parser.
The non-validating parser is slightly faster on small data sets, but does not hold this advantage on bigger ones. We consider security more critical than speed in the use case of small data sets: they are to be expected on the Internet and in web APIs, where any vulnerability is easily exploited. Since TON is executed as Tcl code, it is vulnerable to code injection. With a non-validating ton parser, arbitrary Tcl code can be inserted into a "number" field and executed, as long as the field ends with a digit. We have demonstrated this with an example data set.
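To illustrate the attack with a hypothetical payload of my own making: a "number" that merely ends in a digit can carry arbitrary Tcl code.

```tcl
# Hypothetical payload: the attacker-controlled "number" ends with a
# digit, so a non-validating parser copies it into TON verbatim.
set ton {o Price [i [file delete /tmp/victim]42]}
# Evaluating it runs the smuggled command as a side effect:
# namespace eval ton::2dict $ton   ;# deletes /tmp/victim
```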
The left to right parser was surprisingly slower than the right to left parsers; we leave the discussion of this discovery for later.
The benchmark was run on an Intel Duo E6850 CPU at a 3 GHz clock frequency with 3 GB RAM, under a 64-bit Debian GNU/Linux 9.4 operating system.
ton is designed to take advantage of performance improvements added in Tcl 8.5 and 8.6. The benchmark is run with Tcl 8.6; ton will most likely perform much worse with older Tcl versions.
Software Package Versions
Package versions:
Tcl -- 8.6.6
Tcllib json -- 1.3.3
jimhttp json -- 2.1.0
sqlite3 -- 3.16.2
jsonDecode -- Tcler's wiki
ton -- 0.1
RFC 8259 example object test
SQLite JSON1 7.61 μs/run
json2dict 30.80 μs/run
tona2dict 375.52 μs/run
Tcllib JSON 418.01 μs/run
jsonDecode 864.98 μs/run
jimhttp JSON 2074.32 μs/run
The C parser provided by SQLite3 is about fifty times faster than the fastest Tcl implementation; with larger data sets this ratio goes up to 200 and peaks at 400 in our test run. With this small data set it is four times faster than the infamous json2dict; with larger data sets the ratio levels off around 10.
json2dict, not a parser but just a string-mangling Tcl one-liner, is twelve times faster here than the fastest Tcl-only parser. It peaks at 30 times with the large array data sets but falls back to 20 times with the more complex and bigger data sets.
The jimhttp JSON parser is out of its league, taking 5 times longer than our tona2dict parser. Interestingly enough, it seems to be very good with the Mockaroo and the JSON Generator tests. With the two Mockaroo tests, jimhttp is faster than its closest competitor, jsonDecode; however, it is still two to four times slower than ton.
jsonDecode is consistently about two times slower than ton or Tcllib. This could indicate a similar design, but with some throttling features or missed code bottlenecks.
Mockaroo test with 100 data sets
json2dict 0.58 ms/run
ton2dict 16.37 ms/run
Tcllib JSON 23.37 ms/run
jimhttp JSON 38.40 ms/run
Mockaroo test with 1000 data sets
json2dict 5.82 ms/run
ton2dict 167.31 ms/run
Tcllib JSON 241.14 ms/run
jsonDecode 463.61 ms/run
jimhttp JSON 389.61 ms/run
JSON Generator test
ton2list 4.14 s/run
Tcllib JSON 4.99 s/run
jimhttp JSON 17.47 s/run
MTG data set test
SQLite JSON1 0.13 s/run
json2dict 1.20 s/run
ton2dict 24.48 s/run
Tcllib JSON 27.69 s/run
The Tcllib JSON parser is 10-20 % slower than ton. This might come from the penalties of the former's more robust implementation. Nonetheless, Tcllib makes heavy use of Tcl's excellent regular expression support by splitting the JSON string into a list of tokens. The regular expression processing might be a big upfront cost both for small JSON strings, where the cost/benefit ratio works against it, and for big JSON strings, where the regular expression takes proportionally more time, as does the allocation of the correspondingly big token list. The final stage, processing the token list, also consumes increasing time with increasing input size.
ton scans the JSON string character by character, trying to make the smallest number of choices possible. There is still a proportional cost, but it is much smaller per unit.
The different ton decoders are very close to each other. Most of the time, the `2dict` decoder is slightly faster than its cousins. With the big array of the JSON Generator test, `2list` is fastest, maybe because of the benefit of the one-to-one mapping of arrays to Tcl lists.
The `a2dict` decoder is only faster with the smallest data set: the RFC 8259 example object. This is probably caused by the simpler `dict get .. keys ..` item extraction process and could be an indicator to use this decoder when processing great amounts of small JSON data sets.
In short:

- With small data sets: use `a2dict`.
- With huge arrays: use `2list`.
- With any other, or mixed, workloads: use `2dict`.
In order to prove these hypotheses, though, specific benchmarks would have to be designed and run.
Comparison of the different parser implementations
The Tcllib JSON Parser maps the whole JSON input string with a carefully crafted regular expression directly to a list of tokens, which is then parsed by a conventional recursive parser. This is a fast and powerful implementation.
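For flavor, here is a hedged sketch of this tokenization style; the regular expression is my simplified stand-in, not Tcllib's actual one:

```tcl
# Split a whole JSON string into a token list in one regexp pass.
set json {{"a": [1, true, "x"]}}
set re {[][{}:,]|"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|true|false|null}
puts [llength [regexp -all -inline $re $json]]   ;# -> 11 tokens
```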
The Alternative JSON jsonDecode parser is a single proc which fits into 87 lines. It scans the JSON string from left to right. Its relatively slow speed might stem from the fact that it makes heavy use of regular expressions for advancing through the JSON text.
The very slow jimhttp JSON decoder scans the JSON string into a token list, just like the Tcllib parser, and then decodes the token list. The scan process goes left to right with an index pointer and dissects the string directly by the respective character at each point. The value parsing, however, is done with regular expressions. The decoder "executes" the token list, just as in TON. jimhttp outperforms jsonDecode with the string-only Mockaroo data sets. The reason might be that strings have a short code path. jimhttp is faster than ton when run under the Jim interpreter; see also the Errata section at the end.
The ton JSON parser does not use regular expressions at all. It scans the JSON input string by one of the following means:
- Direct character to character comparison.
- Pattern matching, if more than one character is looked for.
- Tcl's built-in character class detection (`string is space`).
When detecting a token, ton scans for the next syntactic separator and afterwards trims whitespace. These strategies keep the cost of string comparison low, and there are no regexp setup penalties for small JSON strings.
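A sketch of this strategy - illustrative, not ton's exact code:

```tcl
# Scan forward to the next syntactic separator, then trim the
# whitespace around the token found.
proc next-sep {json i} {
    set n [string length $json]
    while {$i < $n && [string first [string index $json $i] "\[\]\{\},:\""] < 0} {
        incr i
    }
    return $i
}

set json {  178, true}
set end [next-sep $json 0]
puts [string trim [string range $json 0 [expr {$end - 1}]]]   ;# -> 178
```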
ton Speed Tuning Prospects
The TON parser could generate the desired form of Tcl data representation of the JSON input directly. This would take away the burden of the required decode step and probably speed up TON. A small experiment can give us a clue how much we could gain from this:
I have timed TON 10000 times converting the RFC 8259 example object into the TON intermediate format; it takes about 317 µs per run. The `a2dict` decoder for the resulting TON string needs only 11.8 µs per iteration, also timed 10000 times.
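Such numbers can be obtained with Tcl's built-in `time` command; a sketch, assuming `test.json` holds the example object:

```tcl
set f [open test.json]
set json [read $f]
close $f
# time returns "N microseconds per iteration"
puts [time {set parsed [json2ton $json]} 10000]
puts [time {namespace eval ton::a2dict $parsed} 10000]
```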
This is only a 3% speed gain! Why is it so small?
When repeating the experiment with only one iteration we get a clue: it now takes 1385 µs versus 219 µs. The `json2ton` proc and the `a`, `o`, `s`, etc. procs in the `ton::a2dict` namespace get byte-compiled during the first run, which speeds things up in later runs. The gain of leaving out the decode process is still only 15%. The gain of byte-compiling the `json2ton` parser is only about 5:1, while the gain for the decoder procs is almost 19:1! This indicates that TON by itself is quite a fast format to store data in.
All parsers reviewed in this article are recursive, yet none of them tries to take advantage of tail recursion. This could bring significant speed gains for large data sets. Typical JSON texts are composed of a lot of small data items, which suggests a great deal of time is spent on call overhead. The Mockaroo data sets have six key/value words per line, which adds up to 12 string or number parsing calls per record; the average line length is 134 characters, so we have a function call every ten characters or so, one for the key and one for the value.
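Tcl 8.6 offers the `tailcall` command for this. A hedged sketch of the pattern - not ton's actual code, and `parse-value` is a hypothetical helper returning the parsed value and the next index:

```tcl
# tailcall replaces the caller's stack frame instead of nesting a new
# one, so long value lists do not grow the call stack.
proc parse-array {json i acc} {
    if {[string index $json $i] eq "\]"} {
        return [list $acc [incr i]]
    }
    lassign [parse-value $json $i] val i   ;# hypothetical helper
    tailcall parse-array $json $i [lappend acc $val]   ;# comma handling elided
}
```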
A similar option could be to rewrite the parser to be non-recursive. I have done this, but its speed is just half that of the reverse JSON parser. Did I do something very wrong in this code, or is reverse parsing simply better?
Why is reverse JSON parsing fast?
First, the question is not stated correctly. It has not been proven that parsing JSON from back to front is faster than the other way round.
The finding can still be an artefact of the way the Tcl interpreter works, or a happy coincidence of optimized codepaths in my implementation.
The left to right parser I wrote, though it does not need to reverse the parsed lists of tokens, is slower, although it uses the same programming techniques as the reverse parser.
At this moment I have only two guesses.
Reverse scanning reduces the number of separator characters in the input stream - if only by two: `-` does not "start" a number, and two of the three literals both end in `e`. If character comparison is a costly operation, this could give us the respective advantage.
Second, maybe the very structure of the JSON data leads to shorter code paths when processed from end to start. Maybe there is some gain in collecting the "leaf" data first; I cannot see how, though.
Oh wait! Looking more closely at the JSON data sets, one notices that the majority of the data is of type string, namely in all but the RFC 8259 data set. Parsing strings is fast with ton, because we only scan for a `"` character, while other implementations use regular expressions. This might also explain why ton is 40 percent faster than Tcllib on the Mockaroo data sets but only 11-20 percent faster on the other ones. The Mockaroo tests are string-only, apart from the array/object grouping of the data records.
This still does not satisfactorily explain why the ton left to right implementation is slower than the reversed one.
Open ends
Our ton parser does not care about escaped characters other than `\"`. We just validate the numbers according to Tcl; they could be in illegal hex or octal notation. Trailing text after the initial JSON data is happily parsed, and the initial JSON data is discarded.
It would be interesting to categorize the different implementations according to their functionality with these and other considerations in mind.
ton is not just the parser; it is a manipulation library for TON, an alternative data notation format. As such it can be used to serialize and consume data in applications. Especially in Tcl it will be very fast to consume.
A typical use case would be a configuration file or a file for storing user preferences. This type of data is traditionally held in Tcl arrays (hashes). Serialization of an array is, however, already built in (`array get`), and the resulting format is easier to write by hand than the ton intermediate format. Tcl does not know the difference between "array" and "object", so the mapping of typical Tcl data structures to JSON or ton is not really natural. ton is also sensitive to line breaks, since the Tcl interpreter which executes it interprets a newline inside `[]` as the start of a new command.
Decoding the ton format should be done inside a safe interpreter, to limit potential damage by malformed or maliciously crafted ton strings.
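A minimal sketch of this, with stand-in decoder procs instead of ton's real ones:

```tcl
# A safe interpreter hides exec, open, the file system and so on, so
# malicious TON can at worst produce wrong data, not side effects.
set safe [interp create -safe]
interp eval $safe {
    proc s {str}  {return $str}
    proc i {num}  {return $num}
    proc d {num}  {return $num}
    proc l {lit}  {return $lit}
    proc a {args} {return $args}
    proc o {args} {return $args}
}
puts [interp eval $safe {o Name [s Georg] Height [i 178]}]   ;# -> Name Georg Height 178
interp delete $safe
```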
Show me the code
All test data, the benchmark program, the ton, jimhttp, and tcl-json implementations can be downloaded from this page as a compressed archive: ton.tar.gz.
We will continue to develop ton and have set up a repository for it.
Errata
In the first version of this article I made a lengthy statement about the possible causes of the bad performance of the jimhttp JSON interpreter: costly string comparison of the tokenized JSON input and the use of `apply` during the decoding stage. D. Bohdan kindly pointed out that these are severe misconceptions with modern Tcl interpreters:
- Constant strings in Tcl are interned; comparison for equality is done by pointer comparison and does not depend on string length.
- In contrast to `eval`, `apply` is byte-compiled and does not affect speed.
Another interesting detail is that jimhttp, executed with the Jim interpreter, is faster than our ton JSON parser. It would be interesting to explore these differences between the two Tcl interpreters further.
After introducing the test suite from json.org, a list of bugs, mostly concerning corner cases, could be identified in ton. After fixing them, the speed difference to the left to right parsers decreases, but ton is still the fastest pure-Tcl JSON parser in our benchmark. Please do not use the ton.tcl file linked to this article; go to my software repository to get the up-to-date ton.