Last week, while assessing different Tcl-only implementations for JSON parsing on the Tcler's Wiki, I was struck by a crazy idea: parsing JSON strings from end to start should be faster than the other way round, because there are fewer options to decide between.

Well, I could have looked more closely to verify the soundness of the idea, but I got so obsessed with it that I just started coding, and guess what? I got the fastest Tcl-only implementation for parsing JSON strings out there!

That was astonishing, because I did not even parse directly into a Tcl data structure, but into an intermediate format, which has to be converted into the desired form in a second step.

The bad news? The worst thing that can happen to an engineer happened: it works, and he does not know why!

After calming down a little from the unexpectedly good outcome, I rewrote the JSON parser to parse from left to right. This should be faster than the right to left parser, because it does not have to jump through some hoops required by the unusual approach of the latter. But the forward parser is slower than the reverse parser...

In this article I will show you what reverse parsing looks like, which benchmarks I used and what numbers they yielded. I will give clues about the possible causes for the performance differences between the available Tcl JSON parsers and make guesses about further potential improvements.

Reverse JSON parsing

When parsing JSON from left to right, you do more or less the following:

  • When you see a {, assemble the contents as an object, whose elements are given as a comma separated list of "key": value elements, terminated by a closing }.

  • When you see a [, assemble the contained comma separated value list into an ordered array, up to the closing ].

  • When you see a ", then what follows is a string value up to the closing ". A quote character inside a string is escaped with a backslash: \".

  • true, false or null are just their value.

  • When you see a number ... it's a number. It can be either an integer or a floating point number. Floats look like: -1.602e-19.
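The steps above amount to a small recursive-descent parser. Here is a minimal sketch of them, in Python rather than Tcl, purely for illustration; the names parse and skip_ws are mine, and, like ton, it only handles the \" escape:

```python
import re

def skip_ws(s, i):
    """Advance i past any whitespace."""
    while i < len(s) and s[i] in ' \t\n\r':
        i += 1
    return i

def parse(s, i=0):
    """Parse one JSON value starting at index i; return (value, next index)."""
    i = skip_ws(s, i)
    c = s[i]
    if c == '{':                                   # object: "key": value pairs up to }
        obj, i = {}, skip_ws(s, i + 1)
        while s[i] != '}':
            key, i = parse(s, i)
            i = skip_ws(s, i)                      # s[i] is now the separating :
            val, i = parse(s, i + 1)
            obj[key] = val
            i = skip_ws(s, i)
            if s[i] == ',':                        # a , means there is more
                i = skip_ws(s, i + 1)
        return obj, i + 1
    if c == '[':                                   # array: values up to ]
        arr, i = [], skip_ws(s, i + 1)
        while s[i] != ']':
            val, i = parse(s, i)
            arr.append(val)
            i = skip_ws(s, i)
            if s[i] == ',':
                i = skip_ws(s, i + 1)
        return arr, i + 1
    if c == '"':                                   # string up to the closing "
        j, out = i + 1, []
        while s[j] != '"':
            if s[j] == '\\':                       # only the \" escape is treated
                j += 1
            out.append(s[j])
            j += 1
        return ''.join(out), j + 1
    for lit, val in (('true', True), ('false', False), ('null', None)):
        if s.startswith(lit, i):                   # literals are just their value
            return val, i + len(lit)
    m = re.match(r'-?\d+(\.\d+)?([eE][+-]?\d+)?', s[i:])   # integer or float
    num = m.group()
    return (float(num) if any(ch in num for ch in '.eE') else int(num)), i + len(num)
```

This is a sketch, not a validating parser; it assumes well-formed input, just like the descriptions above.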

Now: let's do the same thing from right to left:

  • When you see a }, assemble an object until the opening {. Remember the "key": value notation: you will parse a value up to the next : and then parse the key string. After that, a , tells you there is more, a { tells you you are done. Now reverse the list.

  • With a ], assemble the array by gathering the individual values. A , tells you there is another value, a [ that you are done. Now reverse the list.

  • With a " assemble the string until the opening ".

  • When you see an l, it's null. When you see an e, check the previous letter: if it is a u, it's true; if it is an s, it's false; everything else is an error. Go back three or four characters respectively.

  • When you see 0 through 9 or a lone . it's a number: scan back to the next ,, : or [ and parse the number from there.
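The same rules, again as a hedged Python sketch (names rskip and rparse are mine, only the \" escape is treated, and validation is minimal); note the list reversals for objects and arrays, and the lookback for the literals:

```python
def rskip(s, i):
    """Move i back past any whitespace."""
    while i >= 0 and s[i] in ' \t\n\r':
        i -= 1
    return i

def rparse(s, i=None):
    """Parse one JSON value ending at index i; return (value, index before it)."""
    i = rskip(s, len(s) - 1 if i is None else i)
    c = s[i]
    if c == '}':                                   # object: value/key pairs back to {
        pairs, i = [], rskip(s, i - 1)
        while s[i] != '{':
            val, i = rparse(s, i)                  # the value comes first...
            i = rskip(s, i)                        # ...then the separating :
            key, i = rparse(s, rskip(s, i - 1))    # ...then the key string
            pairs.append((key, val))
            i = rskip(s, i)
            if s[i] == ',':                        # a , means there is more
                i = rskip(s, i - 1)
        return dict(reversed(pairs)), i - 1        # now reverse the list
    if c == ']':                                   # array: values back to [
        arr, i = [], rskip(s, i - 1)
        while s[i] != '[':
            val, i = rparse(s, i)
            arr.append(val)
            i = rskip(s, i)
            if s[i] == ',':
                i = rskip(s, i - 1)
        arr.reverse()                              # now reverse the list
        return arr, i - 1
    if c == '"':                                   # string back to the opening "
        j, out = i - 1, []
        while not (s[j] == '"' and s[j - 1] != '\\'):
            out.append(s[j])
            j -= 2 if s[j - 1] == '\\' else 1      # swallow the \ of an escape
        return ''.join(reversed(out)), j - 1
    if c == 'l':                                   # ...ll: null
        return None, i - 4
    if c == 'e':                                   # ...ue: true, ...se: false
        return (True, i - 4) if s[i - 1] == 'u' else (False, i - 5)
    j = i                                          # number: scan back to a separator
    while j >= 0 and s[j] not in ',:[{':
        j -= 1
    num = s[j + 1:i + 1].strip()
    return (float(num) if any(ch in num for ch in '.eE') else int(num)), j
```

The returned second value is the index just before the parsed value, which is exactly what the enclosing object or array loop needs to continue scanning backwards.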

TON - The Intermediate Tcl Object Notation Format

We use the active file design pattern, which is nicely explained by Koen Van Damme.

An object is represented by a proc - Tcl parlance for a function - which evaluates its arguments into a Tcl dictionary - Tcl parlance for a key/value list. The array proc turns its arguments into a list, the string function into a string, and so on. Here is the list of functions:

  • o - object
  • a - array
  • s - string
  • l - literal, for representing true, false or null
  • i and d - integer and double number functions. We take the extra tour to differentiate between these two, so the user of the format can take advantage of the additional information if she likes.

We have also used n - number - instead of i and d; more about that later.

Let's see an example.

{
    "Me": {
        "Name": "Georg Lehner",
        "Height": 178,
        "Pets": ["Kaiser", "Biba", "Fortunatus"],
        "Male": true
    }
}

Converting this JSON string to TON looks like this (line breaks mine):

o Me [o Name [s {Georg Lehner}] \
        Height [i 178] \
        Pets [a [s Kaiser] [s Biba] [s Fortunatus]] \
        Male [l true]]
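The active-data idea can be mimicked outside Tcl with ordinary function calls. In the Python sketch below (illustrative only; the single-letter names mirror the TON procs, everything else is my analogy), evaluating the nested calls yields a native data structure directly:

```python
def o(*args):                    # object: alternating key/value arguments -> dict
    return dict(zip(args[0::2], args[1::2]))

def a(*items):                   # array -> list
    return list(items)

def s(x):                        # string
    return str(x)

def l(x):                        # literal: true, false or null
    return {'true': True, 'false': False, 'null': None}[x]

def i(x):                        # integer
    return int(x)

def d(x):                        # double
    return float(x)

# the example object from above, written as nested calls instead of Tcl brackets
data = o('Me', o('Name', s('Georg Lehner'),
                 'Height', i(178),
                 'Pets', a(s('Kaiser'), s('Biba'), s('Fortunatus')),
                 'Male', l('true')))
```

Just as in Tcl, the inner calls are evaluated first, so the outer o receives fully decoded values.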

In Tcl everything is a string, so we don't need quotes in most cases. Everything inside [] is evaluated as a script. If we run this object representation as a Tcl script in the right environment, the first object proc waits for all its arguments to be evaluated and then returns them as a data structure, or as a string in a different object representation format. Let's try a TON to JSON converter, following the order of appearance in the above example:

  • The object proc takes all key/value pairs in its arguments, wraps each key inside ", followed by a : and the value, and glues the results together as a comma separated list, wrapped into a pair of curly braces. The first o has only one key/value argument, whose value is another o call.

  • The s proc takes one argument, encloses it into double quotes and returns it.

  • The i and d procs take their argument and return the respective integer or double number string representation.

  • The array proc takes its arguments and turns them into a comma separated list, enclosed in square brackets.

  • The literal proc takes its argument and returns it as a string.

Easy? ... Easy!
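The five bullet points above can be sketched as follows (Python for illustration only; the real converter is Tcl, and the names again mirror the TON procs):

```python
def o(*args):                          # object: "key": value pairs in curly braces
    pairs = zip(args[0::2], args[1::2])
    return '{' + ', '.join('"%s": %s' % (k, v) for k, v in pairs) + '}'

def a(*items):                         # array: comma separated list in brackets
    return '[' + ', '.join(items) + ']'

def s(x):                              # string: enclosed in double quotes
    return '"' + x + '"'

def i(x):                              # integer string representation
    return str(int(x))

def d(x):                              # double string representation
    return repr(float(x))

def l(x):                              # literal: returned as-is
    return x

# evaluating the example TON now emits JSON text
json_text = o('Me', o('Name', s('Georg Lehner'),
                      'Height', i(178),
                      'Pets', a(s('Kaiser'), s('Biba'), s('Fortunatus')),
                      'Male', l('true')))
```

The same TON "program", executed with a different set of definitions, produces a different output format; that is the whole trick of the active file pattern.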

The complete code for a validating JSON to TON parser, a TON to JSON decoder, two types of TON to Tcl dictionary decoders and a TON to Tcl list decoder fit into 172 lines of compact Tcl code.

Other Tcl JSON parsers

The Tcler's Wiki's JSON page lists a wealth of parsers for the JSON format. We have to distinguish between parsers written in the C language, with a wrapper to make them accessible from within the Tcl interpreter, and parsers written in pure Tcl.

The C parsers are fast, but in order for them to work you need to compile the C parser and the Tcl glue code for your target platform, something which might not be readily available in a given situation. We include the SQLite3 JSON1 extension in our benchmark, so that you can see that it is orders of magnitude faster than any Tcl JSON parser.

Here is the list of all Tcl-only parsers, in order of appearance:

The jimhttp web framework contains a JSON parser.

tcl-json is a JSON parser created from a grammar by yeti, a yacc-like parser framework for Tcl. On our small test data it truncates arrays at the first element, it fails to parse bigger test data and it does not terminate on the biggest JSON file we feed to it, so tcl-json does not qualify for the competition.

The Tcl "standard library" - Tcllib - provides the so far fastest Tcl-only parser. Tcllib is capable of using a C parser if one is available; we have to force it into Tcl-only mode for the benchmark.

Alternative JSON implements a jsonDecode proc which also uses an intermediate format: a Tcl dictionary in which all data is tagged with its type.

There is also a json2dict one-liner on the JSON page. It just replaces []s with {}s and eliminates commas and colons. The resulting string is legal Tcl dict/list syntax. This is a superfast conversion of JSON to Tcl and it scales well, but it destroys data containing any of the JSON delimiter characters. Most annoying might be that URLs are corrupted, but simple English text is mutilated too. For the fun of it, we hack the benchmark program and trick it into believing that the mangled output is correct, allowing json2dict to compete.

Interpreting the parsed results

The output of the parsers comes in different forms. One of the typical representations is used by Tcllib: JSON objects are converted into Tcl dicts, JSON arrays into Tcl lists. This requires some effort to extract data from the resulting representation, since nested calls to dict get and lindex are needed to get to a specific item. An example from the benchmark is:

dict get [lindex $parsed 43] gender

Here parsed is a variable holding the parsed Tcl representation of the JSON input. What we achieve here is extracting element 43 of a JSON array, which is an object, from which we extract the 'gender' field.

The second typical representation, coined 'dictionary format', converts JSON arrays into Tcl dicts, using the decimal string representation of an element's array index as its dictionary key. Extraction of an element is a straightforward dict get with all the keys and array indexes listed in order. The example from above becomes:

dict get $parsed 43 gender

Some of the foreign language parsers use a special path syntax:

SQLite-JSON1:
$[43].gender
jq:
[43].gender

Finally, the jsonDecode parser allows each path element to be checked precisely by type, a good thing for a validating extractor function. Written by hand, however, it gets cumbersome:

dict get [lindex [dict get $parsed array] 43] \
    object gender string

We first extract the (only) array from parsed, slice out element 43 from it, whose (only) object has a gender element which in turn is extracted as a string.
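The difference between the two typical representations can be mirrored in Python (an analogy with invented sample data, not the benchmark's actual content): nested containers need mixed indexing, while the flat dictionary format allows one uniform key walk.

```python
# Tcllib-style: arrays stay lists, objects become dicts -> mixed access,
# like dict get [lindex $parsed 43] gender in Tcl
tcllib_style = [{'gender': 'F'}] * 44
g1 = tcllib_style[43]['gender']

# dictionary format: arrays become dicts keyed by the decimal index string,
# like dict get $parsed 43 gender in Tcl
dict_style = {str(n): {'gender': 'F'} for n in range(44)}

def get(d, *path):
    """Walk a uniform key path; array indexes are just stringified keys."""
    for key in path:
        d = d[str(key)]
    return d

g2 = get(dict_style, 43, 'gender')
```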

The TON to JSON variants

With TON, the JSON parsing process has two steps. First JSON is converted to TON, then TON is executed into the desired data format.

It is trivial to write the conversion to the Tcllib and the dictionary formats; we call them 2dict and a2dict respectively, a2dict as in 'array to dictionary'.

A third implementation, 2list, is remotely similar to Alternative JSON's tagged format, but more compact. Arrays are lists prefixed with an array tag, objects are key/value lists prefixed with an object tag. We provide a get function which iterates through a dictionary-format-like list of keys. Array indexes can be specified in any number base Tcl supports.

Remember that TON must be executed in an environment where the Tcl procs o, a, s et al. are defined. In order to isolate different TON to X decoders, we put them in different namespaces. To get a TON string in the variable parsed converted to the 'dictionary format' we do the following:

namespace eval ton::a2dict $parsed

If we want the 'Tcllib format' we do:

namespace eval ton::2dict $parsed

And for getting the gender of the 43rd element of the test data as shown above with the 2list decoder we do:

set l [namespace eval ton::2list $parsed]
ton::2list::get $l 43 gender

Let's see how the different TON decoders behave in terms of performance, depending on the input data:

The data generation step for 2dict and 2list is almost identical, but a2dict will need more time because it has to generate and insert the array indexes.

On the other hand, data extraction for a2dict only needs Tcl's built-in dict mechanism, while 2dict requires costly nested calls, just like the other Tcllib-like representations. Finally, 2list needs to iterate through the keys with the getter function.

We will revisit these differences later when discussing the benchmark results.

The Benchmark

D. Bohdan has led the way in JSON parser benchmarking with the JSON value extraction benchmark page of the Tcler's Wiki.

He uses all available card data from 'Magic: The Gathering' converted into JSON format and available at https://mtgjson.com/.

This file grows with time, currently offering 29 MB worth of JSON data, with almost a million lines and 2.8 million words.

This is by no means the use case I had in mind when I started to select a JSON parser. I want to handle a typical web API, where at most some tens to hundreds of relatively shallow items come back.

So I modified and streamlined Bohdan's benchmark program so that I can specify any number of different tests, each comprising a JSON test data set, a path to a specific datum for the respective test, and the expected value.

The benchmark program is configurable, so that one can select, which JSON parser implementations are to be run in a test batch, and which of the specified tests are to be run.

In order to get an idea how the parsers do with growing data sizes and complexities I collected the following JSON test sets, listed in order of size:

  • The object example from the JSON specification in RFC 8259.

  • A file with 100 data sets from Mockaroo, an array with fake personal data.

  • A second Mockaroo file with 1000 data sets.

  • A file from the JSON generator with 3895 data sets of fake personal data.

  • The already mentioned 'Magic: The Gathering' data set.

The following table shows some data about the data sets, extracted with wc:

  lines    words     bytes  sets             depth  filename
     14       31       364  1                3      test.json
     99     1201     13276  100              2      mock_100.json
    999    12021    135133  1000             2      mock_1000.json
 175276   550972   5381126  3895             3      generated.json
 973974  2794332  29425030  221x355 (78455)  4      AllSets.json

The 'depth' column corresponds to the depth of the object tree at which the benchmark extracts data, and presumably also the greatest depth available in the respective data set.

This selection of data sets is by no means fit for a systematic exploration of the parsers' timing behaviour. For that we would have to vary at least the following parameters independently of one another:

  • Amount of data
  • Depth of data sets
  • Amount of different data types in the sets: numbers cost more to parse than literals, ...

However, as you will see, we get some interesting insights, although essentially only the first parameter is varied.

Benchmark Results

The complete result data can be found in the file benchdata.log. In order to keep the discussion focused I will only present the fastest of the three ton implementation results, and I also suppress results from other benchmarks where they do not contribute much to the discussion. (Side note: we call the format TON, and our Tcl implementations that produce and decode it ton.)

The ton results are always very close to each other. We tested three implementations:

  • A standard left to right parser.
  • A right to left parser which does not validate numbers.
  • A 'validating' right to left parser.

The non-validating parser is slightly faster on small data sets, but does not hold this advantage on bigger ones. We consider security more critical than speed in the use case of small data sets: they are expected on the Internet and in web APIs, where any vulnerability is easily exploited. Since TON is executed as Tcl code, it is vulnerable to code injection. With a non-validating ton parser, arbitrary Tcl code can be inserted into a "number" field and executed, as long as the field ends with a digit. We have demonstrated this with an example data set.
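The injection risk is the generic one of executing data as code. The following Python analogy (not the article's actual Tcl demonstration; the name n and the payload are invented) shows how a non-validating "number" function lets a field that merely ends in a digit run arbitrary code:

```python
def n(x):
    # non-validating "number" decoder: no check at all, the argument
    # expression has already been evaluated by the time we see it
    return x

# attacker-controlled field ending in a digit, smuggled into the data;
# evaluating it runs the side effect, yet the result still looks numeric
payload = "n(print('pwned') or 42)"
value = eval(payload)
```

A validating decoder would reject the field before evaluation; executing inside a restricted (in Tcl: safe) interpreter limits the damage further.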

Surprisingly, the left to right parser was slower than the right to left parsers; we leave the discussion of this discovery for later.

The benchmark was run on an Intel Duo E6850 CPU at 3 GHz with 3 GB RAM, under a 64 bit Debian GNU/Linux operating system, version 9.4.

ton is designed to take advantage of performance improvements added in Tcl 8.5 and Tcl 8.6. The benchmark is run with Tcl 8.6; ton will most likely perform much worse with older Tcl versions.

Software Package Versions

Package versions:
Tcl           --  8.6.6
Tcllib json   --  1.3.3
jimhttp json  --  2.1.0
sqlite3       --  3.16.2
jsonDecode    --  Tcler's wiki
ton           --  0.1

RFC 8259 example object test

SQLite JSON1            7.61 μs/run
json2dict              30.80 μs/run
tona2dict             375.52 μs/run
Tcllib JSON           418.01 μs/run
jsonDecode            864.98 μs/run
jimhttp JSON         2074.32 μs/run

The C parser provided by SQLite3 is about fifty times faster than the fastest Tcl implementation; with larger data sets this ratio goes up to 200 and peaks at 400 in our test run. With this small data set it is four times faster than the infamous json2dict, with larger data sets the ratio levels off around 10.

json2dict, not a parser, just a string-mangling Tcl one-liner, is twelve times faster here than the fastest Tcl-only parser. It peaks at 30 times with the large array data sets but falls back to 20 times with the more complex and bigger data sets.

The jimhttp JSON parser is out of its league, taking 5 times longer than our ton a2dict parser. Interestingly enough, it seems to be very good with the Mockaroo and the JSON Generator tests. With the two Mockaroo tests, jimhttp is faster than its closest competitor, jsonDecode; however, it is still two to four times slower than ton.

jsonDecode is consistently about two times slower than ton or Tcllib. This could indicate a similar design, but with some costly features or unaddressed code bottlenecks.

Mockaroo test with 100 data sets

json2dict               0.58 ms/run
ton2dict               16.37 ms/run
Tcllib JSON            23.37 ms/run
jimhttp JSON           38.40 ms/run

Mockaroo test with 1000 data sets

json2dict               5.82 ms/run
ton2dict              167.31 ms/run
Tcllib JSON           241.14 ms/run
jimhttp JSON          389.61 ms/run
jsonDecode            463.61 ms/run

JSON Generator test

ton2list                4.14 s/run
Tcllib JSON             4.99 s/run
jimhttp JSON           17.47 s/run

MTG data set test

SQLite JSON1            0.13 s/run
json2dict               1.20 s/run
ton2dict               24.48 s/run
Tcllib JSON            27.69 s/run

The Tcllib JSON parser is 10-20 % slower than ton. This might come from the penalties of the former's more robust implementation. Nonetheless, Tcllib makes heavy use of Tcl's excellent regular expression support by splitting the JSON string into a list of tokens. The regular expression processing might be a big upfront cost both for small JSON strings, where the cost/benefit ratio speaks against it, and for big JSON strings, where the regular expression takes proportionally more time, as does the allocation of the correspondingly big token list. The final stage, processing the token list, also consumes increasing time with increasing input size.

ton scans the JSON string character by character, trying to make the smallest number of choices possible. There is still a proportional cost, but it is much smaller per unit.

The different ton decoders are very close to each other. Most of the time, the 2dict decoder is slightly faster than its cousins. With the big array of the JSON Generator test, 2list is fastest, maybe because of the benefit of the one to one mapping of arrays to Tcl lists.

The a2dict decoder is only faster with the smallest data set, the RFC 8259 example object. This is probably caused by the simpler dict get .. keys .. item extraction process and could be an indication that this decoder should be used when processing great numbers of small JSON data sets.

In short:

  • With small data sets: use a2dict.
  • With huge arrays: use 2list.
  • With any other, or mixed workloads: use 2dict.

In order to prove these hypotheses, specific benchmarks would have to be designed and run, though.

Comparison of the different parser implementations

The Tcllib JSON Parser maps the whole JSON input string with a carefully crafted regular expression directly to a list of tokens, which is then parsed by a conventional recursive parser. This is a fast and powerful implementation.

The Alternative JSON jsonDecode parser is a single proc which fits into 87 lines. It scans the JSON string from left to right. Its relatively slow speed might stem from the fact that it makes heavy use of regular expressions for advancing through the JSON text.

The very slow jimhttp JSON decoder scans the JSON string into a token list, just like the Tcllib parser, and then decodes the token list. The scan process goes left to right with an index pointer and dissects the string directly by the respective character at that point. The value parsing, however, is done with regular expressions. The decoder "executes" the token list, just as in TON. jimhttp outperforms jsonDecode with the string-only Mockaroo data sets. The reason might be that strings have a short code path. jimhttp is faster than ton when run under the Jim interpreter; see also the Errata section at the end.

The ton JSON parser does not use regular expressions at all. It scans the JSON input string by one of the following means:

  • Direct character to character comparison.
  • Pattern matching, if more than one character is looked for.
  • Tcl's built-in character class detection (string is space).

When detecting a token, ton scans for the next syntactic separator and afterwards trims whitespace. These strategies keep the cost of string comparison low, and there are no regexp setup penalties for small JSON strings.
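The scan-to-separator strategy can be sketched as follows (Python for illustration; the name next_token is mine, and ton's real scan is Tcl and runs right to left; strings are scanned separately for their closing quote, so separators inside strings do not disturb it):

```python
SEPARATORS = ',:]}'

def next_token(s, i):
    """Scan from i to the next syntactic separator, then trim whitespace."""
    j = i
    while j < len(s) and s[j] not in SEPARATORS:
        j += 1
    return s[i:j].strip(), j

# one character comparison per input character, no regexp engine involved
token, j = next_token(' 178 , "Pets"', 0)
```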

ton Speed Tuning Prospects

The TON parser could generate the desired form of the Tcl data representation directly from the JSON input. This would take away the burden of the required decode step and probably speed TON up. A small experiment gives us a clue how much we could gain from this:

I timed TON converting the RFC 8259 example object into the TON intermediate format 10000 times; it takes about 317 µs per run. The a2dict decoder for the resulting TON string needs only 11.8 µs per iteration, also timed 10000 times.

This is only a 3% speed gain! Why is it so small?

When repeating the experiment with only one iteration we get a clue: now the parse takes 1385 µs and the decode 219 µs. The json2ton proc and the a, o, s, etc. procs in the ton::a2dict namespace get byte-compiled during the first run, which speeds things up in later runs. The gain of leaving out the decode process is still only about 15%. The speedup from byte-compiling the json2ton parser is only about 5:1, while the speedup for the decoder procs is almost 19:1! This indicates that TON by itself is quite a fast format to store data in.

All parsers reviewed in this article are recursive, yet none of them tries to take advantage of tail recursion. This could bring significant speed gains for large data sets. Typical JSON texts are composed of a lot of small data items, which leads us to expect a great deal of time spent on call overhead. The Mockaroo data sets have six key/value pairs per line, which adds up to 12 string or number parsing calls per record; the average line length is 134 characters, so we have a function call roughly every ten characters, one for the key and one for the value.

A similar option would be to rewrite the parser to be non-recursive. I have done this, but its speed is just half that of the reverse JSON parser. Did I do something very wrong in this code, or is reverse parsing simply better?

Why is reverse JSON parsing fast?

First, the question is not stated correctly. It has not been proven that parsing JSON from back to front is faster than the other way round.

The finding can still be an artefact of the way the Tcl interpreter works, or a happy coincidence of optimized code paths in my implementation.

The left to right parser I wrote is slower, although it does not need to reverse the parsed lists of tokens and uses the same programming techniques as the reverse parser.

At this moment I have only two guesses.

First, reverse scanning reduces the number of separator characters in the input stream, if only by two: a - does not "start" a number when read backwards, and two of the three literals both end in e. If character comparison is a costly operation, this could give us the respective advantage.

Second, maybe the very structure of the JSON data leads to shorter code paths when processed from end to start. Maybe there is some gain in collecting the "leaf" data first; I cannot see how, though.

Oh wait! Looking more closely at the JSON data sets, one notices that the majority of the data is of type string, namely in all but the RFC 8259 data set. Parsing strings is fast with ton, because we only scan for a " character, while other implementations use regular expressions. This might also explain why ton is 40 percent faster than Tcllib on the Mockaroo data sets but only 11-20 percent faster on the other ones. The Mockaroo tests are string-only, apart from the array/object grouping of the data records.

This still does not explain satisfactorily why the ton left to right implementation is slower than the reversed one.

Open ends

Our ton parser does not care about escaped characters other than \". We just validate the numbers according to Tcl; they could be in illegal hex or octal notation. Trailing text after the initial JSON data is happily parsed, and the initial JSON data is discarded.

It would be interesting to categorize the different implementations according to their functionality with these and other considerations in mind.

ton is not just the parser, it is a manipulation library for an alternative data notation format, TON. As such it can be used to serialize and consume data in applications. Especially in Tcl it will be very fast to consume.

A typical use case would be a configuration file or a file for storing user preferences. This type of data is traditionally held in Tcl arrays (hashes). Serialization of an array is, however, already built in (array get), and the resulting format is easier to write by hand than the ton intermediate format. Tcl does not know the difference between "array" and "object", so the mapping of typical Tcl data structures to JSON or ton is not really natural. ton is also susceptible to line breaks, since the Tcl interpreter, which executes it, interprets a newline inside [] as the start of a new command.

Decoding the ton format should be done inside a safe interpreter, to limit potential damage by malformed or maliciously crafted ton strings.

Show me the code

All test data, the benchmark program, the ton, jimhttp, and tcl-json implementations can be downloaded from this page as a compressed archive: ton.tar.gz.

We will continue to develop ton and have set up a repository for it.

Errata

In the first version of this article I made a lengthy statement about the possible causes of the bad performance of the jimhttp JSON interpreter: costly string comparison of the tokenized JSON input and use of apply during the decoding stage. D. Bohdan kindly pointed out that these are severe misconceptions with modern Tcl interpreters:

  • Constant strings in Tcl are interned; comparison for equality is done by pointer comparison and does not depend on string length.

  • In contrast to eval, apply is byte compiled and does not affect speed.

Another interesting detail is that jimhttp executed with the Jim interpreter is faster than our ton JSON parser. It would be interesting to explore these differences between the two Tcl interpreters further.

After introducing the test suite from json.org, a list of bugs, mostly concerning corner cases, could be identified in ton. After fixing them, the speed difference to the left to right parsers decreased, but ton is still the fastest pure-Tcl JSON parser in our benchmark. Please do not use the ton.tcl file linked to this article; go to my software repository to get the up to date ton.