Welcome to the MagmaSoft Tech Blog.

We publish short technical stories as we explore, expand and fix our technical environment.

Reverse JSON parsing with Tcl

Last week, while assessing different Tcl-only implementations for JSON parsing on the Tcler’s Wiki, I was struck by a crazy idea: parsing JSON strings from end to start should be faster than the other way ’round, because there are fewer options to decide between.

Well, I could have looked closer to verify the soundness of the idea, but I got so obsessed with it that I just started coding, and guess what? I got the fastest Tcl-only implementation for parsing JSON strings out there!

That was astonishing, because I did not even parse directly into a Tcl data structure, but into an intermediate format, which has to be converted into the desired form in a second step.

The bad news? The worst thing that can happen to an engineer happened: it works and he does not know why!

After calming down a little from the unexpectedly good outcome I rewrote the JSON parser to parse from left to right. This should be faster than the right-to-left parser, because it does not have to jump through the hoops required by the unusual approach of the latter. But no: the forward parser is slower than the reverse parser…

In this article I will show you what this reverse parsing is like, which benchmarks I used and what numbers they yielded. I will give clues about the possible causes for the performance differences of the available Tcl JSON parsers and make guesses about further potential improvements.

Reverse JSON parsing

When parsing JSON from left to right, you do more or less the following:

  • When you see a {, assemble the contents as an object: a list of comma separated “key”: value elements, terminated by the closing }.

  • When you see a [, assemble the contained comma separated value list up to the closing ] into an ordered array.

  • When you see a ", what follows is a string value up to the closing ". A quote character inside a string is escaped by a backslash: \".

  • true, false or null are just their own value.

  • When you see a number … it’s a number. It can be either an integer or a double float. Floats look like: -1.602e-19.

Now let’s do the same thing from right to left (a small Tcl sketch follows the list):

  • When you see a }, assemble an object until the opening {. Remember the “key”: value notation: you will parse a value up to the next : and then parse the key string. After that, a , tells you there is more, a { tells you you are done. Now reverse the list.

  • With a ], assemble the array by gathering the individual values. A , tells you there is another value, a [ that you are done. Now reverse the list.

  • With a ", assemble the string until the opening ".

  • When you see an l, it’s null. When you see an e, check the preceding letter: if it is a u, it’s true; if it is an s, it’s false; everything else is an error. Go back three or four characters respectively.

  • When you see 0 through 9 or a lone ., it’s a number: scan back to the next ,, : or [ and parse the number from there.
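
To make this concrete, here is a small illustrative Tcl sketch - not the actual ton code - of how a backwards scanner might pick a string token apart; the proc name and the return convention are invented for the example:

# Illustrative only: scan a JSON string token backwards, starting at the
# index of its closing quote. Returns the extracted string and the index
# just before the opening quote.
proc rscan-string {json end} {
    set i [expr {$end - 1}]
    while {$i >= 0} {
        set i [string last \" $json $i]
        if {$i > 0 && [string index $json [expr {$i - 1}]] eq "\\"} {
            incr i -1            ;# escaped quote, keep looking further left
            continue
        }
        break
    }
    list [string range $json [expr {$i + 1}] [expr {$end - 1}]] [expr {$i - 1}]
}

# The string token "Biba" has its closing quote at index 16 here:
puts [rscan-string {["Kaiser", "Biba"]} 16]    ;# prints: Biba 10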

TON - The Intermediate Tcl Object Notation Format

We use the active file design pattern, which is nicely explained by Koen Van Damme.

An object is represented by a proc - Tcl parlance for a function - which evaluates its arguments into a Tcl dictionary - Tcl parlance for a key/value list. The array proc turns its arguments into a list, the string function into a string, and so on. Here is the list of functions:

  • o: object
  • a: array
  • s: string
  • l: literal, for representing true, false or null
  • i and d: integer and double. We take the extra tour to differentiate between these two, so the user of the format can take advantage of the additional information if she likes.

We have also used n - number - instead of i and d; more about that later.

Let’s see an example.

{
    "Me": {
        "Name": "Georg Lehner",
        "Height": 178,
        "Pets": ["Kaiser", "Biba", "Fortunatus"],
        "Male": true
    }
}

Converting this JSON string to TON looks like this (line breaks added by me):

o Me [o Name [s {Georg Lehner}] \
        Height [i 178] \
        Pets [a [s Kaiser] [s Biba] [s Fortunatus]] \
        Male [l true]]

In Tcl everything is a string, so we don’t need quotes in most cases. Everything inside [] is evaluated as a script. If we run this object representation as a Tcl script in the right environment, the first object proc waits for all its arguments to be evaluated and then returns them as a data structure or as a string in a different object presentation format. Let’s walk through a TON to JSON converter in the order of appearance in the above example (a minimal sketch follows the list):

  • The object proc takes all key/value pairs in its arguments, wraps each key inside ", followed by a : and the value, and glues the results together as a comma separated list, wrapped into a pair of curly braces. The first o has only one key/value argument, of which the value is an o call again.

  • The s proc takes one argument, encloses it in double quotes and returns it.

  • The i and d procs take their argument and return the respective integer or double number string representation.

  • The array proc takes its arguments and turns them into a comma separated list, enclosed in square brackets.

  • The literal proc takes its argument and returns it as a string.

Easy? … Easy!
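
To make the walkthrough concrete, here is a minimal sketch of such emitter procs; the real ton package organizes them in namespaces and handles more cases (escaping, validation), so take this only as an illustration of the active file idea:

# A minimal TON-to-JSON emitter (sketch). The proc names follow the
# format description above; the real decoders live in namespaces.
proc o {args} {
    set pairs {}
    foreach {key value} $args {
        lappend pairs "\"$key\": $value"
    }
    return "{[join $pairs {, }]}"
}
proc a {args} { return "\[[join $args {, }]\]" }
proc s {text} { return "\"$text\"" }
proc i {n}    { return $n }
proc d {x}    { return $x }
proc l {lit}  { return $lit }

# Evaluating the TON text from the example yields JSON again:
puts [o Me [o Name [s {Georg Lehner}] Height [i 178] \
        Pets [a [s Kaiser] [s Biba] [s Fortunatus]] Male [l true]]]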

The complete code for a validating JSON to TON parser, a TON to JSON decoder, two types of TON to Tcl dictionary decoders and a TON to Tcl list decoder fits into 172 lines of compact Tcl code.

Other Tcl JSON parsers

The Tcler’s Wiki’s JSON page lists a wealth of parsers for the JSON format. We have to distinguish between parsers written in the C language, with a wrapper to make them accessible from within the Tcl interpreter, and parsers written in pure Tcl.

The C parsers are fast, but in order for them to work you need to compile the C parser and the Tcl glue code for your target platform, something which might not be readily available in a given situation. We include the SQLite3 JSON1 extension in our benchmark so that you can see that it is orders of magnitude faster than any Tcl JSON parser.

Here is the list of all Tcl-only parsers, in order of appearance:

The jimhttp web framework contains a JSON parser.

tcl-json is a JSON parser generated from a grammar with yeti, a parser framework for Tcl à la yacc. On our small test data it truncates arrays at the first element, it fails to parse bigger test data and it does not terminate on the biggest JSON file we feed it, so tcl-json does not qualify for the competition.

The Tcl “standard library” - Tcllib - provides the so far fastest Tcl-only parser. Tcllib is capable of using a C parser if one is available; we have to force it into Tcl-only mode for the benchmark.

Alternative JSON implements a jsonDecode proc which also uses an intermediate format, a Tcl dictionary where all data is tagged with its type.

There is also a json2dict one-liner on the JSON page. It just replaces []s with {}s and eliminates commas and colons. The resulting string is legal Tcl dict/list syntax. This is a superfast conversion of JSON to Tcl and it scales well; however, it destroys data containing any of the JSON delimiter characters. Most annoying might be that URLs are corrupted, but simple English text is mutilated too. For the fun of it, we hack the benchmark program and trick it into believing that the mangled output is correct, allowing json2dict to compete.
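
The actual one-liner lives on the wiki; the following is merely my guess at the gist of the trick, not the original code:

# Roughly the kind of transformation json2dict performs (a sketch, not
# the wiki original): turn JSON punctuation into Tcl list punctuation.
proc json2dict_sketch {json} {
    # [ ] become { }, commas and colons become whitespace; the same
    # characters inside string data get mangled as well, which is
    # exactly the weakness described above.
    string map { \[ \{ \] \} , { } : { } } $json
}

Data that looks like the example object above survives this treatment; anything containing JSON delimiters inside strings does not.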

Interpreting the parsed results

The output of the parsers comes in different forms. One of the typical representations is used by Tcllib: JSON objects are converted into Tcl dicts, JSON arrays into Tcl lists. This requires some effort to extract data from the resulting representation, since nested calls to dict get and lindex are required to get to a specific item. An example from the benchmark is:

dict get [lindex $parsed 43] gender

Where parsed is a variable holding the parsed Tcl representation of the JSON input. What we achieve here is extracting the 43rd element of a JSON array, which is an object, from which we extract the ‘gender’ field.

The second typical representation, coined ‘dictionary format’, converts JSON arrays into Tcl dicts, using the decimal integer string representation of an element’s array index as the dictionary key. Extraction of an element is a straightforward dict get with all the keys and array indexes listed in order. The example from above:

dict get $parsed 43 gender

Some of the foreign language parsers use a special path syntax:

SQLite-JSON1: $[43].gender
jq: [43].gender

Finally, the jsonDecode parser allows each step of the path to be checked precisely by type, a good thing for a validating extractor function. Written by hand, however, this gets cumbersome:

dict get [lindex [dict get $parsed array] 43] \
    object gender string

We first extract the (only) array from parsed and slice out the 43rd element from it, whose (only) object has a gender element which in turn is extracted as a string.

The TON to Tcl variants

With TON, the JSON parsing process has two steps: first JSON is converted to TON, then the TON is executed to produce the desired data format.

It is trivial to write the conversion to the Tcllib and to the dictionary format; we call these decoders 2dict and a2dict respectively - a2dict as in array 2 dictionary.

A third implementation, 2list, is remotely similar to Alternative JSON’s tagged format, but more compact. Arrays are lists prefixed with an array tag; objects are key/value lists prefixed with an object tag. We provide a get function which iterates through a dictionary-format-like list of keys. Array indexes can be specified in any number base Tcl supports.

Remember that TON must be executed in an environment where the Tcl procs o, a, s et al. are defined. In order to isolate different TON-to-X decoders, we put them into different namespaces. To get a TON string in the variable parsed converted to the ‘dictionary format’ we do the following:

namespace eval ton::a2dict $parsed

If we want the ‘Tcllib format’ we do:

namespace eval ton::2dict $parsed

And for getting the gender of the 43rd element of the test data as shown above with the 2list decoder we do:

set l [namespace eval ton::2list $parsed]
ton::2list::get $l 43 gender

Let’s see how the different TON decoders behave in terms of performance, depending on the input data:

The data generation step for 2dict and 2list is almost identical, but a2dict will need more time because it has to generate and insert the array indexes.

On the other hand, data extraction for a2dict only needs Tcl’s built-in dict mechanism, while 2dict requires costly nested calls, just like the other Tcllib-like representations. Finally, 2list needs to iterate through the keys with the getter function.

We will revisit these differences later when discussing the benchmark results.

The Benchmark

D. Bohdan has led the way for JSON parser benchmarking with the JSON value extraction benchmark page of the Tcler’s Wiki.

He uses all available card data from ‘Magic: The Gathering’ converted into JSON format and available at https://mtgjson.com/.

This file grows with time, currently offering 29 MB worth of JSON data, with almost a million lines and 2.8 million words.

This is by no means the use case I had in mind when starting to select a JSON parser. I want to handle a typical web API, where at most some tens to hundreds of relatively shallow items come back.

So I modified and streamlined Bohdan’s benchmark program so that I can specify any number of different tests, each comprising a JSON test data set, a path to a specific datum for the respective test and the expected value.

The benchmark program is configurable, so one can select which JSON parser implementations are run in a test batch and which of the specified tests are run.

In order to get an idea of how the parsers do with growing data sizes and complexities I collected the following JSON test sets, listed in order of size:

  • The object example from the JSON specification in RFC 8259.

  • A file with 100 data sets from Mockaroo, an array with fake personal data.

  • A second Mockaroo file with 1000 data sets.

  • A file from the JSON generator with 3895 data sets of fake personal data.

  • The already mentioned ‘Magic: The Gathering’ data set.

The following table shows some data about the data sets, extracted with wc:

lines    words     bytes      sets             depth  filename
14       31        364        1                3      test.json
99       1201      13276      100              2      mock_100.json
999      12021     135133     1000             2      mock_1000.json
175276   550972    5381126    3895             3      generated.json
973974   2794332   29425030   221x355 (78455)  4      AllSets.json

The ‘depth’ column corresponds to the depth of the object tree at which the benchmark extracts data, and presumably also to the deepest level available in the respective data set.

This selection of data sets is by no means fit for a systematic exploration of the parsers’ time behaviour. For that we would have to vary at least the following parameters independently of each other:

  • Amount of data
  • Depth of data sets
  • Amount of different data types in the sets: numbers cost more to parse than literals, …

However, you will see that we get some interesting insights, although essentially only the first parameter is varied.

Benchmark Results

The complete result data can be found in the file benchdata.log. In order to keep the discussion focused I will only present the fastest respective ton implementation result (one of three) and also suppress results from other benchmarks where they do not contribute much to the discussion. (Side note: we call the format TON and our Tcl implementations to produce and decode it ton.)

The ton results are always very close to each other. We tested three implementations:

  • A standard left to right parser.
  • A right to left parser which does not validate numbers.
  • A ‘validating’ right to left parser.

The non-validating parser is slightly faster on small data sets, but does not hold this advantage on bigger ones. We consider security more critical than speed in the use case of small data sets: they are what we expect on the Internet and in web APIs, where any vulnerability is easily exploited. Since TON is executed as Tcl code, it is vulnerable to code injection. With a non-validating ton parser, arbitrary Tcl code can be injected into a “number” field and executed, as long as the field ends with a digit. We have demonstrated this with an example data set.
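
As a contrived illustration - not the data set used in the benchmark - consider what a non-validating scanner would accept for a “number” that merely ends in a digit:

# Hypothetical malicious input: the value ends in a digit, so a scanner
# that does not validate numbers accepts everything back to the colon.
set json {{"Height": [exec touch /tmp/pwned]178}}

# The generated TON would then contain something like
#   o Height [i [exec touch /tmp/pwned]178]
# and evaluating that TON with namespace eval would run the embedded
# [exec ...] before i ever sees its argument.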

The left-to-right parser surprisingly was slower than the right-to-left parsers; we leave the discussion of this discovery for later.

The benchmark was run on an Intel Duo E6850 CPU at a 3 GHz clock frequency with 3 GB RAM, on a 64-bit Debian GNU/Linux 9.4 operating system.

ton is designed to take advantage of performance improvements added in Tcl 8.5 and Tcl 8.6. The benchmark is run with Tcl 8.6; ton will most likely perform much worse with older Tcl versions.

Software Package Versions

Package versions:
Tcl           --  8.6.6
Tcllib json   --  1.3.3
jimhttp json  --  2.1.0
sqlite3       --  3.16.2
jsonDecode    --  Tcler's wiki
ton           --  0.1

RFC 8259 example object test

SQLite JSON1            7.61 μs/run
json2dict              30.80 μs/run
tona2dict             375.52 μs/run
Tcllib JSON           418.01 μs/run
jsonDecode            864.98 μs/run
jimhttp JSON         2074.32 μs/run

The C parser provided by SQLite3 is about fifty times faster than the fastest Tcl implementation; with larger data sets this ratio goes up to 200 and peaks at 400 in our test run. With this small data set it is four times faster than the infamous json2dict; with larger data sets the ratio levels out around 10.

json2dict, not a parser but just a string-mangling Tcl one-liner, is twelve times faster here than the fastest Tcl-only parser. It peaks at 30 times with the large array data sets but falls back to 20 times with the more complex and bigger data sets.

The jimhttp JSON parser is out of its league, taking 5 times longer than our tona2dict parser. Interestingly enough, it seems to be quite good with the Mockaroo and the JSON Generator tests. With the two Mockaroo tests, jimhttp is faster than its closest competitor, jsonDecode, however it is still two to four times slower than ton.

jsonDecode is consistently about two times slower than ton or Tcllib. This could indicate a similar design, but with some throttling features or unnoticed code bottlenecks.

Mockaroo test with 100 data sets

json2dict               0.58 ms/run
ton2dict               16.37 ms/run
Tcllib JSON            23.37 ms/run
jimhttp JSON           38.40 ms/run

Mockaroo test with 1000 data sets

json2dict               5.82 ms/run
ton2dict              167.31 ms/run
Tcllib JSON           241.14 ms/run
jsonDecode            463.61 ms/run
jimhttp JSON          389.61 ms/run

JSON Generator test

ton2list                4.14 s/run
Tcllib JSON             4.99 s/run
jimhttp JSON           17.47 s/run

MTG data set test

SQLite JSON1            0.13 s/run
json2dict               1.20 s/run
ton2dict               24.48 s/run
Tcllib JSON            27.69 s/run

The Tcllib JSON parser is 10-20 % slower than ton. This might come from the penalties of the former’s more robust implementation. Nonetheless, Tcllib makes heavy use of Tcl’s excellent regular expression support by splitting the JSON string into a list of tokens. The regular expression processing might be a big upfront cost both for small JSON strings, where the cost/benefit ratio works against it, and for big JSON strings, where the regular expression takes proportionally more time, as does the allocation of the correspondingly big token list. The final stage, processing the token list, also consumes more time with increasing input size.

ton scans the JSON string character by character, trying to make the smallest number of choices possible. There is still a proportional cost, but it is much smaller per unit.

The different ton decoders are very close to each other. Most of the time the 2dict decoder is slightly faster than its cousins. With the big array of the JSON Generator test, 2list is fastest, maybe because of the benefit of the one-to-one mapping of arrays to Tcl lists.

The a2dict decoder is only faster with the smallest data set, the RFC 8259 example object. This is probably caused by the simpler dict get .. keys .. item extraction process and could be an indicator to use this decoder when processing great amounts of small JSON data sets.

In short:

  • With small data sets: use a2dict.
  • With huge arrays: use 2list.
  • With any other, or mixed workloads: use 2dict.

In order to prove these hypotheses, specific benchmarks would have to be designed and run, though.

Comparison of the different parser implementations

The Tcllib JSON Parser maps the whole JSON input string with a carefully crafted regular expression directly to a list of tokens, which is then parsed by a conventional recursive parser. This is a fast and powerful implementation.

The Alternative JSON jsonDecode parser is a single proc which fits into 87 lines. It scans the JSON string from left to right. Its relatively slow speed might stem from the fact that it makes heavy use of regular expressions for advancing through the JSON text.

The very slow jimhttp JSON decoder scans the JSON string into a token list, just like the Tcllib parser, and then decodes the token list. The scan process goes left to right with an index pointer and dissects the string directly by the respective character at that point. The value parsing, however, is done with regular expressions. The decoder “executes” the token list, just as in TON. jimhttp outperforms jsonDecode with the string-only Mockaroo data sets; the reason might be that strings have a short code path. jimhttp is faster than ton when run under the Jim interpreter, see also the Errata section at the end.

The ton JSON parser does not use regular expressions at all. It scans the JSON input string by one of the following means:

  • Direct character-to-character comparison.
  • Pattern matching, if more than one character is looked for.
  • Tcl’s built-in character class detection (string is space).

When detecting a token, ton scans for the next syntactic separator and afterwards trims whitespace. These strategies keep the cost of string comparison low, and there are no regexp setup penalties for small JSON strings.

ton Speed Tuning Prospects

The TON parser could generate the desired form of Tcl data representation of the JSON input directly. This would take away the burden of the required decode step and probably speed up TON. A small experiment can give us a clue how much we could gain from this:

I have timed TON 10000 times converting the RFC 8259 example object into the TON intermediate format; it takes about 317 µs per run. The a2dict decoder for the resulting TON string needs only 11.8 µs per iteration, also timed 10000 times.
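
The measurements were taken with Tcl’s time command, roughly like in the following sketch; the file name is a placeholder, and the entry point json2ton is the proc named in the text (adjust the call if your copy namespaces it differently):

# Sketch of the measurement; rfc8259-example.json is a placeholder name.
source ton.tcl
set f    [open rfc8259-example.json]
set json [read $f]
close $f

# about 317 µs per iteration for JSON -> TON on my machine:
puts [time {set parsed [json2ton $json]} 10000]
# about 11.8 µs per iteration for decoding the TON to the a2dict format:
puts [time {namespace eval ton::a2dict $parsed} 10000]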

This is only a 3% speed gain! Why is it so small?

When repeating the experiment with only one iteration we get a clue: it now takes 1385 µs versus 219 µs. The json2ton proc and the a, o, s, etc. procs in the ton::a2dict namespace get byte-compiled during the first run, which speeds things up in later runs. The gain of leaving out the decode process is still only 15%. The gain from byte-compiling the json2ton parser is only about 5:1, while the gain for the decoder procs is almost 19:1! This indicates that TON by itself is quite a fast format to store data in.

All parsers reviewed in this article are recursive, yet none of them tries to take advantage of tail recursion. This could bring significant speed gains for large data sets. Typical JSON texts are composed of a lot of small data items, which suggests a great deal of time is spent on call overhead. The Mockaroo data sets have six key/value words per line, which amounts to 12 string or number parsing calls per record; the average line length is 134 characters, so we have a function call roughly every ten characters, one for the key and one for the value.
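
Tcl 8.6 does provide a building block for this: the tailcall command, which replaces the caller’s stack frame instead of nesting a new one. A hypothetical scanner loop using it could look like the fragment below; none of the parsers discussed here is actually written this way:

# Hypothetical fragment: walk the input one character at a time and
# re-enter the scanner via tailcall, so recursion does not deepen the
# Tcl call stack.
proc scan-loop {json i tokens} {
    if {$i >= [string length $json]} {
        return $tokens
    }
    # a real scanner would consume a whole token here, not one character
    lappend tokens [string index $json $i]
    tailcall scan-loop $json [incr i] $tokens
}

puts [scan-loop {[1, 2]} 0 {}]    ;# one token per character in this toy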

A similar option could be to rewrite the parser to be non-recursive. I have done this, but its speed is just half that of the reverse JSON parser. Did I do something very wrong in this code, or is reverse parsing simply better?

Why is reverse JSON parsing fast?

First, the question is not stated correctly. It has not been proven that parsing JSON from back to front is faster than the other way round.

The finding can still be an artefact of the way the Tcl interpreter works, or a happy coincidence of optimized codepaths in my implementation.

The left-to-right parser I wrote, though it does not need to reverse the parsed lists of tokens, is slower, although it uses the same programming techniques as the reverse parser.

At this moment I have only two guesses.

First, reverse scanning reduces the number of separator characters to decide on in the input stream - if only by two: a - does not “start” a number, and two of the three literals both end in e. If character comparison is a costly operation, this could give the reverse parser its advantage.

Second, maybe the very structure of the JSON data leads to shorter code paths when processed from end to start. Maybe there is some gain in collecting the “leaf” data first; I cannot see how, though.

Oh wait! Looking closer at the JSON data sets, one notices that the majority of the data is of type string, namely in all but the RFC 8259 data set. Parsing strings is fast with ton, because we only scan for a " character, while other implementations use regular expressions. This might also explain why ton is 40 percent faster than Tcllib on the Mockaroo data sets but only 11-20 percent faster on the other ones. The Mockaroo tests are string-only, apart from the array/object grouping of the data records.

This still does not explain satisfactorily why the ton left-to-right implementation is slower than the reversed one.

Open ends

Our ton parser does not care about escaped characters other than \". We just validate the numbers according to Tcl; they could be in illegal hex or octal notation. Trailing text after the initial JSON data is happily parsed, and the initial JSON data is discarded.

It would be interesting to categorize the different implementations according to their functionality with these and other considerations in mind.

ton is not just the parser; it is a manipulation library for an alternative data notation format, TON. As such it can be used to serialize and consume data in applications. Especially in Tcl it will be very fast to consume.

A typical use case would be a configuration file or a file for storing user preferences. This type of data is traditionally held in Tcl arrays (hashes). Serialization of an array is, however, already built in (array get), and the resulting format is easier to write by hand than the ton intermediate format. Tcl does not know the difference between “array” and “object”, so the mapping of typical Tcl data structures to JSON or ton is not really natural. ton is also sensitive to line breaks, since the Tcl interpreter which executes it interprets a newline inside [] as the start of a new command.
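
For comparison, a quick sketch of what the built-in serialization gives you next to the equivalent ton text; prefs is just a made-up example variable:

# Built-in serialization of a Tcl array holding preferences:
array set prefs {theme dark editor vi}
puts [array get prefs]    ;# e.g.: theme dark editor vi (order may vary)

# The same data written as ton by hand would be something like
#   o theme [s dark] editor [s vi]
# which is harder to type and, being a script, sensitive to stray
# newlines inside the brackets.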

Decoding the ton format should be done inside a safe interpreter, to limit potential damage by malformed or maliciously crafted ton strings.

Show me the code

All test data, the benchmark program, the ton, jimhttp, and tcl-json implementations can be downloaded from this page as a compressed archive: ton.tar.gz.

We will continue to develop ton and have set up a repository for it.

Errata

In the first version of this article I made a lengthy statement about the possible causes of the bad performance of the jimhttp JSON interpreter: costly string comparison of the tokenized JSON input and the use of apply during the decoding stage. D. Bohdan kindly pointed out that these are severe misconceptions with modern Tcl interpreters:

  • Constant strings in Tcl are interned; comparison for equality is done by pointer comparison and does not depend on string length.

  • In contrast to eval, apply is byte-compiled and does not affect speed.

Another interesting detail is that jimhttp executed with the Jim interpreter is faster than our ton JSON parser. It would be interesting to explore these differences between the two Tcl interpreters further.

After introducing the test suite from json.org, a list of bugs, mostly concerned with corner cases, could be identified in ton. After fixing them, the speed difference to the left-to-right parsers decreases, but ton is still the fastest pure Tcl JSON parser in our benchmark. Please do not use the ton.tcl file linked to this article; go to my software repository to get the up-to-date ton.

SSL and dietlibc March 2018

Building small SSL libraries is a challenge. This series of articles follows how to build SSL libraries with diet libc at different points in time.

We use this to compile gatling with https support. The respective recipe is also included in this post.

Preconditions

Building mbedtls with dietlibc

cd ~/progs
curl -O https://tls.mbed.org/download/start/mbedtls-2.7.0-apache.tgz
tar xzf mbedtls-2.7.0-apache.tgz
cd mbedtls-2.7.0
make CC="diet -Os gcc -pipe -nostdinc" DESTDIR=/opt/diet install

Building openssl with dietlibc

cd ~/progs
curl -O https://www.openssl.org/source/openssl-1.1.0g.tar.gz
tar xzf openssl-1.1.0g.tar.gz
cd openssl-1.1.0g
./config no-dso no-shared no-engine -L/opt/diet/lib-i386 --prefix=/opt/diet -lpthread
make CC="diet -Os gcc -pipe -nostdinc" install_sw 

Building gatling with https support

gatling with mbedtls 2.7.0

You might want to make clean before.

make LDFLAGS="-L../libowfat -L/opt/diet/lib" LDLIBS="-lmbedx509 -lmbedcrypto -lpthread -lowfat -lz `cat libsocket libiconv libcrypt`" ptlsgatling
make install

gatling with openssl-1.1.0g

Apply the following patch to gatling's ssl.c file.

Index: ssl.c
===================================================================
RCS file: /cvs/gatling/ssl.c,v
retrieving revision 1.26
diff -u -r1.26 ssl.c
--- ssl.c  1 Feb 2018 02:06:18 -0000   1.26
+++ ssl.c  28 Mar 2018 16:28:18 -0000
@@ -8,6 +8,7 @@
 #include <fcntl.h>
 #include <openssl/ssl.h>
 #include <openssl/engine.h>
+#include <openssl/err.h>
 #include <ctype.h>
 #include <sys/time.h>
 #include <sys/stat.h>
@@ -325,7 +326,7 @@
   if (!library_inited) {
     library_inited=1;
     SSL_library_init();
-    ENGINE_load_builtin_engines();
+//    ENGINE_load_builtin_engines();
     SSL_load_error_strings();
   }
   if (!(ctx=SSL_CTX_new(SSLv23_client_method()))) return -1;
@@ -354,7 +355,7 @@

 void free_tls_memory() {
   SSL_CTX_free(ctx);
-  ENGINE_cleanup();
+//  ENGINE_cleanup();
   CRYPTO_cleanup_all_ex_data();
   ASN1_STRING_TABLE_cleanup();
   ERR_free_strings();

You might want to make clean before.

make CFLAGS=-I/opt/diet/include gatling
make CFLAGS=-I/opt/diet/include LDFLAGS="-L/opt/diet/lib-x86_64 -L/opt/diet/lib" LDLIBS="-lpthread -lowfat" tlsgatling
make install

Results

$ ls -1hs /opt/diet/bin/*gatling
204K /opt/diet/bin/gatling
704K /opt/diet/bin/ptlsgatling
2.4M /opt/diet/bin/tlsgatling
Georg Lehner
The Only Way For SSH Forced Commands

Secure Shell is a wonderful tool for automated or interactive password-less remote access, but it is not easy to get security right with this setup.

Out of the box you have two options: either you allow complete shell access to the remote system, or you restrict access to just one (1) specific command line.

Several proposals can be found on the Internet which show how to solve various different use cases of remote access where something in between these two extremes is required.

One popular use case is automated backups, mirror scripts - often involving rsync - or access to revision control systems and the like. Joey Hess’ article gives an overview and points out potentially insecure solutions.

Another use case is to allow (real) remote users to run one out of several possible commands; check out this bin blog post for a simple example. An interesting way to generalize this by use of a dedicated tools directory is described in this StackExchange article, along with a security-related discussion.

Let’s cover both classes of use cases with just one script which can serve any user account on the server.

In comes ‘only’.

The following sections will talk you through the conception and raising of ‘only’. If you don’t want to listen to the whole biography of ‘only’, just skip down to The Grown Up ‘only’.

The one and ‘only’

‘only’ is a shell script which only allows some commands to be run. Let’s see the embryonic ‘only’ script:

#!/bin/sh
allowed=$1
set -- $SSH_ORIGINAL_COMMAND
if [ "$1" = "$allowed" ]; then
exec "$@"
fi
echo you may only $allowed, denied: $@ >&2
exit 1`

We copy ‘only’ into the PATH of a remote host and put it in front of an ssh key in the authorized_keys file like this:

command="only ls" ssh-rsa AAAA...

Let’s see what this does when we access the remote host with, e.g.

ssh user@remote.host ls /bin
  1. ‘only’ is run with the following command line: ls
  2. We save $1, the only allowed command, in the environment variable allowed
  3. We set the command line to the originally sent command line making use of ssh’s feature of setting it up in the environment variable SSH_ORIGINAL_COMMAND, it will be: ls /bin in this case.
  4. Now $1 holds the first token of the original command line: ls.
  5. $allowed = $1 = ls, they are the same, so we
  6. execute the entire command line "$@", ls /bin in our case, which prints all files inside /bin and exit.

You might want to review the sh(1) man page if these dollars, brackets and “$@” warts look strange to you. The “ssh” and “remote” talk is explained in the sshd(8) manual page.

If we use any other command, like rm -rf /, we get:

  1. only is run with: ls
  2. $allowed = ls, $1 = rm, they are different so we
  3. print a diagnostic message on stderr(3) and exit with a failure code of 1 so that the sending end gets notified.

Bingo: we only allow the one explicitly allowed command to run, but we can now give it any command line we want. This helps us with the second use case; however, it is rather insecure. Consider running rm -rf / on a command="only rm" ... setup in the root account…

But let’s look into this later (security comes, ahem … late, right?)

Not just ‘only’ one

With some care, ‘only’ grows a little bit and is now able to handle more than one command:

#!/bin/sh
cmds="$@"
set -- $SSH_ORIGINAL_COMMAND
for allowed in $cmds; do
    if [ "$allowed" = "$1" ]; then
        exec "$@"
    fi
done    
echo you may only $cmds, denied: $@ >&2
exit 1

I allow myself here to mimic the example of the bin blog article. The line in the authorized_keys file now might look like this:

command="only ps vmstat cups"

Please still don’t try this at home; rather follow me as I mentally run

ssh user@remote.host cups stop
  1. ‘only’ receives the following command line ps vmstat cups which we save in cmds. After the set --, the command line becomes cups stop.
  2. The for loop picks ps as the first value for allowed from cmds.
  3. $allowed = ps, $1 is cups, we fall through the if and enter the next iteration of for.
  4. $allowed becomes vmstat in the second, and then cups in the last iteration of for.
  5. cups is equal to $1, so we execute cups stop.

If we had sent the rm -rf / command, we would have fallen through the loop and printed denied to the user.

Picky ‘only’

Youngsters are complicated. So is ‘only’. He grows up a little bit and starts to refuse some of the command line arguments thrown at him; he just got picky.

If you are not into adolescent psychology I offer you once again to skip to The Grown Up ‘only’ section. Go download the adult only script and abuse his worker’s rights by making him work for you all day long without paying him a dime. But maybe you are curious about his teenage years. If so, stay with me. I’ll put the young only on the couch and analyze him as quickly as possible.

First a small preamble. Unless we want to write a whole string matching utility library in shell, using variable expansion with its ugly ${17##*/} and other evil constructs, we can just get a little help from a good old friend. I refer to the humble and ubiquitous sed(1) command.

Note that this means introducing (shock) regular expressions (huhh!). Yes, I too would prefer some simple glob(3)-like pattern matching as done by sudo(1) in these cases, but this would involve some exotic character like Tcl(n) or friends, and those guys and girls are not always handily available. So let’s make these regex(7)es as painless as possible.

We just want to know whether some line of text matches a specific pattern or not. sed(1) is quite reserved if we tell him so via the -n flag: he will just swallow all input he can get, just like PAC-MAN, and only talk back if we tell him to with a print command. To match, for example, the line ps we’d use the following rule: \:^ps$:p. The \: .. : construct delimits the matching pattern, ^ is the start, $ the end of the line, and in between is the string we are looking for: ps. If sed(1) -n receives any line of text he will remain quiet, unless the line is exactly ps, in which case he mutters back ps, via the trailing p command. So matching results in an echo of the input; not matching, in remaining silent.

We’ll put the rules to filter the allowed command line in a file in the remote user’s home directory and call it ~/.onlyrules. This way you can set up ‘only’ for any number of users and adapt allowed commands and rules individually; no need to change ‘only’ at all. Nice, isn’t it?

To continue with our example we stuff the following lines into ~/.onlyrules:

\:^ps$:p
\:^vmstat$:p
\:^cups \(stop\|start\)$:p

The last line is somewhat sophisticated: it allows either stop or start as argument, the alternatives separated by |. The alternatives are grouped by the () parentheses; our ornamentalophile sed(1) wants them all to be prefixed with a \ backslash.

And here is our ‘only’ youngster.

#!/bin/sh
cmds="$@"
set -- $SSH_ORIGINAL_COMMAND
for allowed in $cmds; do
    if [ "$allowed" = "$1" ]; then
    if [ -z "$(echo $@ | sed -nf ~/.onlyrules)" ]; then
        break
    fi
        exec "$@"
    fi
done    
echo you may only $cmds, denied: $@ >&2
exit 1

You already see the pimples, the attitude?!

When ‘only’ sees an allowed command, let’s say cups, it shouts the whole command line over to sed(1). sed(1) goes through ~/.onlyrules line by line, compares each rule with the input and shouts the input back on stdout(3) if it matches. Consider the last run of the for loop (with cups as the $allowed command). Suppose we sent cups stop to the remote host.

  1. cups is allowed, so we hand the command line over to sed(1).
  2. cups stop matches the last line in ~/.onlyrules, so the output of the $() command substitution is cups stop, which is not a zero length string (-z). Thus we skip the break and
  3. we exec the command line. Done!

Now we test the other way round. Let’s run cups status:

  1. cups is allowed.
  2. cups status does not match any line in ~/.onlyrules, sed(1) does not shout anything back to us and the command substitution is the zero length string "".
  3. break breaks out of the for loop and
  4. we deny the command.

So, ‘only’ can now do everything that was promised in the introduction. Everything?

  • We can lock down a remote account to one command, but with (controlled) arbitrary arguments.
  • We can enable a set of allowed commands for a remote account.
  • We can adapt the behavior for any number of remote accounts, without changing the ‘only’ script.

‘only’, however, is still a very fragile teenager. Don’t entrust your servers to him yet; better wait for him to mature.

Why not this cute simple ‘only’?!

Verbosity

Although it is perfectly understandable that any unacquainted user wants to know why - oh dear - their command was not accepted by the remote host, we wouldn’t want to give too many hints to the unacquainted attacker who is just trying to get into this remote server by means of a stolen ssh key and is eager for any useful information. Especially in non-interactive scenarios we’d just leave him wondering why.

The Grown Up ‘only’ allows you to tune him from complete silence up to idiotic verbosity towards the invoking user.

Accountability

Until now nobody notices when, what and for whom ‘only’ is working. You’ll just get a short log message from ssh itself, informing that somebody connected via ssh(1) to a specific account on your remote host.

The Grown Up ‘only’ uses logger(1) to tell us what command line has been run by which user at the auth.info level, and what command line has been denied for which user at the auth.warn level, so we can sort things out while struggling to forge these rules and after that, in production use, find abusers.

Absolute command paths

If a user, human or automated, e.g. sends /usr/bin/vmstat instead of vmstat to the remote host, we still would like the command to be executed even if we only allowed vmstat. Our stubborn teenager ‘only’ would reject this command because of his simplistic equality match.

The Grown Up ‘only’ has gained some tolerance with his peers already. He patiently looks up the allowed command in the user’s PATH and compares the result with the given command in case the latter comes in with an absolute path. Thus only commands inside the user’s PATH are allowed. If you want to lock down commands to a specific directory, put it as the only directory into the PATH environment variable for this user (sshd(8) can help you with this). Then set LOGGER, WHICH and SED at the top of the ‘only’ script to the respective programs’ full path specifications on your remote host.

You can also allow commands outside of PATH, by specifying them with an absolute path in the authorized_keys file. In this case, The Grown Up ‘only’ requires an exact match with the sent command, but does not enforce it to be in the PATH.

Smarter command line matching

Since commands can come in with either absolute or relative paths, the rules for the command line filter would have to take this into account and would become complicated, difficult to read and therefore error-prone.

To make it easier to write command line filters, the lines sent to sed(1) are instead composed of the $allowed command in question followed by the command line arguments (stripping off the notation in which the command was actually sent). Returning to the previous example, the line sent to sed(1) in the second iteration of the for loop would be vmstat, not /usr/bin/vmstat, and thus it will match with The Grown Up ‘only’ but not with Picky ‘only’.

Quoting hell

When starting to grow ‘only’ I used the youngster to restrict access to a darcs repository, only to find out that darcs sends the repository directory path single-quoted ('') when creating the repository, but without quotes when getting it. With exec "$@" the repository directory repo gets created as 'repo' on the remote host, and thus becomes inaccessible to the other darcs commands, which naturally expect it to be repo.

The Grown Up ‘only’ therefore does eval "$@", which parses away the quotes.

Note, however, that I fear other command line constructs might now fail horribly, or that disaster might be injected into your server by evil forces who find out how to take advantage of quoting hell.

The Grown Up ‘only’

Installation and basic configuration

Please download the only script and the example rules file. Both start with an explanation of how to use and configure them; please read these comments in place of a manual. An example ~/.onlyrc is available too.

You can also get ‘only’ from my public darcs repository. You don’t need darcs(1) for this, just wget(1), curl(1) or your browser.

1. Put the ‘only’ script into a location accessible to all users on the remote host, e.g. into /usr/local/bin.

2. Create an ssh key pair. As a starter, the following command line will create the files only_key and only_key.pub without a pass phrase for you.

    ssh-keygen -P "" -f only_key
  1. Copy only_key.pub to authorized_keys, and prefix:

    command="only ls grep who",no-agent-forwarding,no-port-forwarding,no-pty,no-user-rc,no-X11-forwarding
    

    to the first and only line, leaving a space before the ssh-rsa AAAA... part. Of course, instead of ls grep who, you’ll put in the command(s) you want to allow on the remote host.

  2. Install this authorized_keys file in the .ssh subdirectory of the home directory of the user account on the remote host which should run the respective commands. You might want to deny the user account write permissions on the file.

  3. Copy .onlyrules into the home directory of the same user account on the remote host and adapt to your needs. See Writing ‘only’ rules for some tips.

  4. You are done with this user account. Repeat, starting from “2. Create a ssh key pair.” as often as needed for this remote host.

  5. You are done with this remote host. Repeat from “1. Put the ‘only’ script” for any further remote host you want to access.

Writing ‘only’ rules

Always consider locking down the command line to match only precisely the wanted alternatives. While options will likely come in a fixed set and variation, arguments like file paths or user names might vary considerably and unforeseeably. For paths you might consider requiring a given prefix and disallowing dot-dot .., so attackers or scripts gone mad can’t break out of their allowed realm.

You surely have rules for how a user name may be constructed on your remote host: minimum/maximum length and a set of allowed characters come to mind. Create the respective sed(1) rules for these and check that they don’t allow white space or comment and escape characters in between.

Always filter on the whole command line, that is, make the filter have a ^ at the start and a $ at the end; otherwise anybody can prefix or append arbitrary strings and thus circumvent your allowed command list.

All that said, you might not know all possible variants of the invocation of a command in advance and/or are too lazy to figure it out beforehand. Shame on you, but anyway… lock down the command (let’s name it new_kid for just another example), then start with a completely open filter like this:

\:^new_kid:{p;q}

You noted the missing $ at the end of the command string, didn’t you?

Now run all variants of new_kid’s invocations you can imagine, or gather them after a day or so of running and get the results out of syslog(3).

Let’s say that new_kid gives us allowed command lines like the following:

new_kid --server -P ./
new_kid --cleanup -P ./
new_kid --stats -P ./ /var/log/new_kid.log
new_kid --discard /var/log/new_kid.log.10
new_kid --rotate /var/log/new_kid.log.9
...
new_kid --rotate --/var/log/new_kid.log

Then we could consider a filter like:

\:new_kid --\(server\|cleanup\|stats\) -P \(/var/log/new_kid.log\)\?\./$:{p;q}
\:new_kid --\(rotate\|discard\) /var/log/new_kid\.log\(\.[1-9]0?\)\?$:{p;q}

or just stop that regex(7) head pain and pack all found lines literally between \:^ and $:{p;q} and you are done.

If you did not get all possible invocations in the first run, you will get the others as denied in your logs. Watch out for lost ones after a month or so and then after a year. (Just kidding, you know your monthly and yearly scripts well, don’t you?).

Finally note that shell magic is helping you when writing filters, since white space between arguments is reduced to exactly one space.

Substitution rules

Attentive readers have already noticed that our Picky ‘only’ is not an equivalent replacement for the example in the bin blog post. If, for example, we send cups start to the remote server, the command /etc/init.d/cupsys start should be executed instead.

Well, while I find this startling from a security point of view, I needed to support this capability to keep my word on the claim in the introduction.

Create a file ~/.onlyrc in the home directory of the user running ‘only’ and write enable_command_line_substitution on a line by itself. From that moment on, ‘only’ will replace the original command it was sent with the string printed out by sed(1). In the rules file, substitute \:^cups \(stop\|start\)$:p with the following monster:

\:^cups \(stop\|start\)$:{
    s:^cups \(.*\):/etc/init.d/cupsys \1:p
    q
}   

sed(1)’s substitute command will replace cups with /etc/init.d/cupsys and the \1 placeholder with whatever command line option (stop or start) it encounters within the \(.*\) parentheses.

This is a contrived example. The bin blog example replaces ps with ps -ef. We don’t need a substitution here; instead we write:

\:^ps$:{
    c\
ps -ef
    q
}

Well, this looks ugly. But hey! The c\ command puts out the subsequent line, which is ps -ef, and omits the input completely. We must put the q command on its own line so it does not get appended too. c\ allows us to write long, complicated command lines easily.

Look! You can go wild on sed(1) and invent your super-hyper-uber substitution sed(1) programs if you want to; people even do math with it! But once again: don’t do command line substitutions, for your own mental health and your server’s integrity’s sake.

Verbose feedback

By default ‘only’ just exits with a failure code if a command is not allowed to run.

The ~/.onlyrc file can be used to make ‘only’ chatty about denied commands. You can:

  • Show an enigmatic denied to the user with show_terse_denied.
  • Show the allowed command to the user with show_allowed.
  • Show the exact denied command line with show_denied.
  • Print out a complete manual by appending text in the ~/.onlyrc file after help_text.

The provided example ~/.onlyrc file illustrates and documents all of these options. If you want to mimic the bin blog example your ~/.onlyrc would look like:

show_allowed
help_text
Sorry. Only these commands are available to you.

Security considerations

As I told you before, security comes late, right? Now, a serious review of security-related topics is way out of the scope of this article. I just want to throw in two thoughts, or three, or four…

Any command which just reads from the remote system (files, process listings, kernel or interface stats, etc.) can be abused to gain insight into the system (for later hacking) or to obtain information which might be private (user data, like credit card numbers or passwords, or habits like login statistics, emails, …). One of the objectives of running commands via ssh with public/private key authentication is to restrict them to user accounts which don’t have excessive rights for obtaining information. ‘only’ can help you lock this down further, but do your homework on securing the remote host first.

Commands which can write to the remote system or modify any other of its resources (processes, kernel variables, interface settings, etc.) are even more sensitive. Let’s start with the possibility of overwriting the settings for ‘only’ itself, which can be used to gain unrestricted access to the system. But the same principle as before applies: if the user account is already ‘secure’, an attacker cannot go much further.

Additionally, consider resource depletion. Although you can craft a denial of service with read-only access, with “write” access additional risks come into play. Don’t allow the user to allocate a lot of processes or disk space, as this can be abused for writing oodles of senseless data to fill up your disk just to annoy you or, better yet, for storing images and videos with disputable content for gratuitous distribution from your server. This is where quotas and limits come in; start with quota(1) and prlimit(1) if you are on Linux and want to go into further detail.

Note that a lock-down script like ‘only’ is just one approach to ssh security. You might get yourself a restricted shell and cast that upon the remote user, as indicated in this article for rsync. A funny article by Doug Stilwell does not inspire confidence in the security of restricted rbash, though. Similar considerations as in that article might apply to other restricted shells, and of course they do for ‘only’.

Postamble

Why did I conceive ‘only’? I wanted to set up an unprivileged account for managing a private darcs(1) repository for distributing configuration data to my servers. It did not seem right to me to allow complete shell access for this, so I soon stumbled upon the issue of the inflexible forced command in ssh. When I came up with the embryonic ‘only’ approach I started to look around on the Internet and saw that it had not yet been proposed in this form. With a sudo(1) background of pattern matching on allowed command lines I went for the picky ‘only’. Writing this article whetted my appetite, and while I don’t think I will ever (ab)use ‘only’ for interactive session lock-down, I implemented all the related surplus.

Playing around with the differently aged ‘only’s soon brought me to test-driven design. So there are (primitive) test suites available. This is another interesting area, but also quite another story.

Credits for ‘only’ go to all inspiring inputs, some of them referred to by the external links.

Please give me feedback if you encounter any bug or issue with ‘only’. I am especially interested in comments with respect to ‘only’s (lack of) security.

Language Selection in Roundup Templates

What language is this page?

One internal goal of the gia project is to showcase an HTML5-compliant Roundup tracker template.

This raises the need to determine the language used to render the respective template, since we need to fill in the html tag’s lang attribute:

<tal:block metal:define-macro="">
  <!DOCTYPE html>
    <html lang="??">
...

While not documented, it turns out that in the end it is neither hard nor overly involved to find out this information inside a template. You have to do your homework though.

In this article I will first share the recipe, then expand on HTTP language negotiation and Roundup's approach to it. Finally I will elaborate briefly on further aspects of internationalization of Roundup trackers.

The Recipe

Preamble

Amongst others, we use the following configuration settings in our trackers:

[tracker]
...
language = de_AT

[web]
...
use_browser_language = yes
...

This means that the tracker's web interface tries to switch to the language preference indicated by the web browser of the visiting user. If none of the preferred languages matches, the web interface will be presented in (Austrian) German.

If your tracker is monolingual, or if you do not allow the language to be switched by the browser preference, there is no need for the following recipe. If you do switch languages via the web user interface though, just use the request/language path in the templates.

TAL Python Extension

Create the file page_language.py with the following contents in the extensions subdirectory of your tracker:

# return the selected page language
def page_language(i18n):
    if 'language' in i18n.info():
        return i18n.info()['language']
    else:
        return 'en'

def init(instance):
    instance.registerUtil('page_language', page_language)

Prepare .po Files Correctly

By convention all .po files contain a translation for the empty string on top, which is in the form of an RFC 822 header. Make sure that in all .po files this 'info' block looks like the following:

"Project-Id-Version: 1\n"
"POT-Creation-Date: Sat Dec 10 21:53:51 2016\n"
"PO-Revision-Date: 2016-12-20 10:18-0600\n"
"Last-Translator: Georg Lehner <email@suppressed>\n"
"Language-Team: German\n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: talgettext.py 1.5.0\n"

The critical part for our objective is the line:

"Language: de\n"

Note: Content-Type and Content-Transfer-Encoding are the other two required lines for a working translation.

Gratuitous English .po File

You must have an en.po file in the locale subdirectory of your tracker, although it need not contain any translations. Create it by copying the messages.pot file to en.po and treat it as described in Prepare .po Files Correctly.

Use the Page Language In Your Template

At any place in your TAL template file where you need to know the selected page language use the python extension as in the following example:

<tal:block metal:define-macro="">
  <!DOCTYPE html>
    <html lang="python:utils.page_language(i18n)">
...

This is just the way I use it for gia's HTML5-compliant template mentioned at the beginning of this article.

What the Heck?

HTTP Language Negotiation

By the very nature of the Internet, web sites have an international audience. The means provided by the web standards to adapt a web site to the language capabilities of a visiting user is the HTTP header Accept-Language, which is defined in RFC 7231.

The user of the browser (you!) sets up one or more languages in order of preference. The browser then sends the Accept-Language header with the list of languages with every request to the web servers.

Roundup Internationalization

With the configuration line:

use_browser_language = yes

Roundup dissects the list of preferred languages and converts the language tags (RFC 5646) into a list of gettext locale codes ordered by the user's preference, highest priority first. Then this list is expanded by adding main language tags for sub-tagged languages, if the main language does not already exist in the list. Finally the list is iterated through, trying to find a translation file on each turn. The first file found is used to translate the strings in the page. If no matching file is found, the configured tracker locale is used.

The translation file is read in and represented by the template translation object i18n, with the well documented methods and template invocations. This process makes use of the Python gettext module.

Notes:

  • While the algorithm used provides for a correct ordering, it can probably generate incorrect or unusable locale codes, since a possible variant subtag given by the browser will not be represented in the way locale codes are defined. I guess, however, that in practice this has no relevance, unless you go to the lengths of creating different translations for specific dialects or variants of the same main language on your Roundup tracker.

  • Probably Roundup should record the selected locale in a template variable, e.g. page_language, by itself, obsoleting this blog completely.

The Trick to Get The Page Language

The Python gettext module apparently reads the translation of the null string into the _info variable and provides its content via the info() method of the NullTranslations class, in the form of a dictionary.

This dictionary is even available as template path i18n/info. A convenience mapping to the values as in i18n/info/language is however not implemented.

The reason why we cannot use the template expression:

python:i18n.info()['language']

directly is that, if no language is negotiated or the Language: line is not present in the .po file, a KeyError exception is thrown. Our page_language TAL Python extension takes care of this error.
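The underlying behaviour can be reproduced with plain Python; the domain name and locale directory below are assumptions for illustration:

import gettext

# A catalog whose .po header carries a "Language: de" line reports it via info()
t = gettext.translation('messages', localedir='locale', languages=['de'])
print(t.info().get('language'))           # -> 'de'

# Without any catalog, NullTranslations starts with an empty info() dictionary,
# so subscripting it with ['language'] raises the KeyError mentioned above
print(gettext.NullTranslations().info())  # -> {}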

English Is Not English

The next peculiarity found is that even if my browser's language preference is set to ‘English, then German’, I get the German translation. If you re-read the notes about Roundup Internationalization you will see that if there is no en.po file (or more precisely: no en.mo file), the configured tracker language will be used. This happens even if ‘en’ is on the language list and we have no need to translate any string.

Furthermore, if the English language is explicitly requested via a @language=en element in the HTTP query string (more on this later), the i18n translation object has no file from which to read the _info variable, and in consequence the dictionary returned by i18n.info() is empty.

Both issues are fixed by providing the Gratuitous English .po File.

Further Notes on Internationalization

A Roundup tracker typically falls into the class of ‘Multilingual, same content’ websites. Only the user interface is translated into the respective user's language.

The W3C's masterly article about language negotiation highlights several aspects from the usability point of view. Most are already dealt with in Roundup.

The first is language negotiation via user-configured preferences: this is what we have discussed until now. What follows are some thoughts on the other aspects of internationalizing web sites.

User-Agent Header Heuristics

The above mentioned article suggests one could guess the browser user interface language from the User-Agent HTTP header, in case no Accept-Language header is sent.

A quick Internet search seems to indicate that a lot of browsers do not include the language in the User-Agent header, and that programmers seldom (or never?) use this heuristic at all. Roundup currently does not look at this header.

Stickiness In Navigation

The language can be set directly via the query string element @language. It sets the cookie roundup_language to the selected locale code. When the cookie is found, it is used to set the template's language variable and to select the respective translation.
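For example, appending the element to a page URL explicitly requests English (the tracker URL is of course hypothetical):

http://tracker.example.com/issue42?@language=en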

Two issues arise:

  • If @language is set to a language for which you do not provide a translation file, English is chosen. Arguably one would expect the site to be rendered in the tracker language in this situation.

  • Setting a cookie is considered a privacy leak. The W3C site, for example, gently asks you via a pop-up whether you want to set a cookie for persisting your language choice. If you decline, every subsequent page falls back to language negotiation.

Bad or No User Languages

In several scenarios a user might get a page in a language she does not know, even though in some cases a known language would have been available:

  • By failure to configure preferred languages at all.
  • When configuring only languages for which there are no translations available.
  • When using a browser on a foreign device (e.g. in an Internet Cafe).

In these cases she will get either the default tracker language, English (as shown in Stickiness In Navigation), or some available language, any or all of which might be unknown to her.

Language Controls

The common solution is to include ‘language controls’ on some or all pages, where the user can switch to any of the available languages. Most likely you know these web sites with lots of little country flags representing the language to switch to. Ever wondered how people with visual impairment handle this?

The default Roundup tracker templates do not provide such controls and there is one piece of information missing to satisfactorily implement them. At the moment there is no provision to get a list of all available translations inside a tracker template.

You can and should, of course, define a template variable with a hand-crafted list of available languages and use it in a drop-down or a language navigation bar to provide this feature.
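Such a list can be exposed to the templates analogously to page_language. The following extension is only a sketch: the file and utility names are my choice and the list has to be maintained by hand.

# e.g. extensions/languages.py in the tracker home (name assumed)
AVAILABLE_LANGUAGES = ['en', 'de']   # keep in sync with the files in locale/

def available_languages():
    return AVAILABLE_LANGUAGES

def init(instance):
    instance.registerUtil('available_languages', available_languages)

In a template it can then be called as python:utils.available_languages(), for example to build a language navigation bar.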

User Account Language Preference

Besides Stickiness In Navigation, logged-in users might want to override their browser language choice, if any, or get a default other than the tracker language.

This can be done by adding a preferred-language property to the user class. It remains to be researched how to make use of this information to set the user's language upon login.
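A minimal sketch of such a schema change, assuming the classic tracker schema.py; the property name language is my choice and not part of the stock schema:

# in the tracker's schema.py: extend the existing user class definition
user = Class(db, "user",
    username=String(),     # ... keep all stock properties as they are ...
    language=String())     # e.g. 'de' or 'en_GB'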

Other Bits and Pieces

File Upload Buttons

The button used to select files for upload is provided by the browser; it can neither be styled, nor can the text on the button be set. This comes naturally and by HTML spec, as the browser also indicates whether a file has already been selected or not.

In my case I get a completely German user interface in Roundup, with the sole exception of the “Browse…” button, because I usually install my operating system (and thus the browser) with an English user interface language.

A very complete, though involved, solution to style and potentially translate these <input ... type="file"> elements is provided by Osvaldas Valutis in a very well crafted tutorial: Styling & Customizing File Inputs the Smart Way.

Roundup Internal Strings

I have found at least one instance of a string in the Roundup sources which is not subject to gettext processing. It is inside a JavaScript function, so I guess it is challenging to do it there.

There are other user interface elements where no translation appears; however, the Python code reveals that they are in fact gettexted. You can work around this by providing an extra file in the html sub directory which holds translations of strings not extracted (or not extractable) by roundup-gettext.

I have named this file po.html (the name must match *.html!) and, for example, the entry:

<x i18n:translate="">Submit Changes</x>

has given me a translated submit button on the issue item template, which is produced by the opaque construct:

<span tal:replace="structure context/submit">_submit_button_</span>

Summary

We have shown how to add multilingual capabilities for html5 templates to Roundup without patching the software.

Several shortcomings remain, however, which cannot be resolved without changing the source.

It would be helpful for the internationalization of the Roundup tracker to provide the following information to the html templates:

  • page language .. the selected language for rendering a page.
  • translations .. a list of available languages.

The behavior with respect to the fallback language should be improved.

Some translation strings of code-generated elements are not automatically included in the messages.pot file generated by roundup-gettext; other such elements still lack gettext processing altogether.

User Account Language Preference is still to be researched.

Nice to haves:

Microformats for IkiWiki

Background

Microformats are one of several approaches to add computer-readable meaning to the visible text of a web page. The goal of this is to create a Semantic Web.

The IndieWeb community proposes the use of microformats to build a decentralized social network: the IndieWeb.

The blog you are reading right now is based on IkiWiki; adding microformats to IkiWiki is a first step towards adding it to the IndieWeb.

Microformats

Microformats 2 is the most recent development of the microformats markup. It is advised to use it, but to still mark web pages up with classic microformats as well. For easier authoring I have put together a cheat sheet for translating h-entry to hAtom attributes.

For marking up a blog, a minimum of three microformats classes is needed:

h-entry:
For giving details about a post (or article, respectively).
h-card:
For marking up the author of the article inside the h-entry.
h-feed:
For marking up aggregations of posts.

The following is a stripped-down HTML5 source of the present article. Different class attributes hold the microformats markup. Look out for the following data (classic microformats in parentheses):

  • h-entry (hentry)
  • p-name (entry-title)
  • e-content (entry-content)
  • p-category (rel=”tag”)
  • p-author (author), with h-card (vcard)
  • dt-published (published)
  • dt-updated (updated)
<article class="page h-entry hentry">
    ...
    <span class="title p-name entry-title">
        Simple Responsive Design for IkiWiki
    </span>
    ...
    <section id="content" role="main" class="e-content entry-content">
        ... article text comes here ...
    </section>
    <nav>
        Tags:
        <a href="../../../tags/webdesign/" rel="tag"  class="p-category">webdesign</a>
    </nav>
    <span class="vcard">
        <a class="p-author author h-card" href="http://jorge.at.anteris.net">Georg Lehner</a>,
    </span>
    <span class="dt-published published">Posted <time datetime="2016-06-18T14:08:19Z" pubdate="pubdate" class="relativedate" title="Sat, 18 Jun 2016 16:08:19 +0200">at teatime on Saturday, June 18th, 2016</time></span>
    <span class="dt-updated updated">Last edited <time datetime="2016-07-23T13:48:28Z" class="relativedate" title="Sat, 23 Jul 2016 15:48:28 +0200">Saturday afternoon, July 23rd, 2016</time></span>
</article>

Now let's look at a feed; in this case just a time-ordered list of two posts. Follow the structure as you did above:

  • h-feed (hfeed)
  • p-name (entry-title): title of the feed.
  • list of posts, each:
    • h-entry (entry)
    • u-url (bookmark)
    • p-name (entry-title): title of the post
    • dt-published (published)
    • p-author (author), with h-card (vcard)
<div class="h-feed hfeed">
    <span class="p-name entry-title"><span class="value-title" title="MagmaSoft Tech Blog: all posts list"> </span></span>
        <div class="archivepage h-entry entry">
            <a href="./Microformats_for_IkiWiki/" class="u-url bookmark p-name entry-title">Microformats for IkiWiki</a><br />
            <span class="archivepagedate dt-published published">
                Posted <time datetime="2016-07-27T15:19:32Z" pubdate="pubdate" class="relativedate" title="Wed, 27 Jul 2016 17:19:32 +0200">late Wednesday afternoon, July 27th, 2016</time>
            </span>
        </div>
        <div class="archivepage h-entry entry">
            <a href="./Simple_Responsive_Design_for_IkiWiki/" class="u-url bookmark p-name entry-title">Simple Responsive Design for IkiWiki</a><br />
            <span class="archivepagedate dt-published published">
                Posted <time datetime="2016-06-18T14:07:50Z" pubdate="pubdate" class="relativedate" title="Sat, 18 Jun 2016 16:07:50 +0200">at teatime on Saturday, June 18th, 2016</time>
            </span>
            <span class="vcard">
                by <a class="p-author author h-card url fn" href="http://jorge.at.anteris.net">Georg Lehner</a>
            </span>
        </div>
</div>

If you use Firefox and install the Operator Add-on you can see respective ‘Contacts’ and ‘Tagspaces’ entries.

This should do for learning by example; if you need more, go to the http://microformats.org website.

Note: IMHO the markup looks overly complicated, due to doubling up microformats v1 and v2 markup. Microformats v2 simplifies things a lot, but Operator has no support for it (yet), and who else out there will have?!

IkiWiki

Single posts

Pages are rendered by IkiWiki via HTML::Template using a fixed template: page.tmpl. So are blog posts, as these are simply standard wiki pages. The templates can contain variables, such as the author's name or the creation date of the page, which are inserted accordingly into the HTML code.

Instead of rewriting the default page.tmpl, I copied it over to mf2-article.tmpl and use the IkiWiki directive pagetemplate on top of all blog posts, as sketched below.
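A minimal sketch of such a directive line, assuming the pagetemplate plugin is enabled:

[[!pagetemplate template="mf2-article.tmpl"]]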

Note: One would like to avoid this extra typing. There are approaches which automate template selection, e.g. as discussed at the IkiWiki website; however, they are not yet available in the default IkiWiki code base.

Aggregates

We already explored the different options of aggregating several pages with IkiWiki on this website.

Two templates come into play for formatting the various posts involved:

archivepage.tmpl:
For simple lists of posts, like the example for a feed used above.
inlinepage.tmpl:
When concatenating several posts, e.g. the five most recent ones, with all their contents.

Note: there is also a microblog.tmpl template, which I have not used until now. Of course it will need a microformats upgrade too.

Accordingly I provide two modified template files, mf2-archivepage.tmpl and mf2-inlinepage.tmpl, which include the necessary microformats markup.

These are used in the respective inline directive in the template argument. Two live examples from the posts and the blog pages:

[[!inline  pages="page(blog/posts/*) and !*/Discussion"
show="5" feeds="no"
template="mf2-inlinepage" pageinfo]]
[[!inline  pages="page(./posts/*) and !*/Discussion"
archive=yes quick=yes trail=yes show=0
template="mf2-archivepage" pageinfo]]

But this does not give us feed markup!

Ideally the inline directive should create the microformats markup for h-feed by itself. This would need a major change in IkiWiki's code base and of course has to be discussed with the IkiWiki authors and community. In the meantime I wrote a wrapper template, h-feed, which can be used to enclose an inline directive and wraps the rendered post list into an h-feed tagged <div>.

Note: it is not trivial to mark up the h-feed automatically. Feeds have required and optional elements which might be made visible on the page or not. The question is: how would the inline directive know which information to show, whether to put it above or below the list of posts, and which text to wrap it in - think multiple languages. A possible solution would be a feedtemplate parameter with which you can select a template wrapping the inlined pages. Then you can adapt the template to your taste. A default template provided by IkiWiki could be similar to the h-feed template.

Finally for the lazy readers: here comes the complete live example for h-feed. We’ll show the archivepage example (simple list of posts):

[[!template  id=h-feed.mdwn
name="MagmaSoft Tech Blog: all posts list"
feed="""
[[!inline pages="page(./posts/*) and !*/Discussion"
archive=yes quick=yes trail=yes show=0
template="mf2-archivepage" pageinfo]]
"""]]

Itches

Pageinfo: more data for templates

IkiWiki runs several passes to compile the given wiki source tree into html pages. During these passes a lot of meta data is gathered for each wiki page. However, as explained in a forum post, IkiWiki does not supply much of this information in the template or inline directive.

To harvest the meta data already available I prototyped a new IkiWiki feature, which I call pageinfo. It adds a new valueless parameter to the inline and the template directive, as can be seen in the above examples. If it is present, information in the global hash %pagestate is made available to the template as <TMPL_VAR variable>.

Plugins can be written, which add information to %pagestate in early passes of the IkiWiki compiler.

The precise location for a given variable of a certain page in %pagestate is: $pagestate{$page}{pageinfo}{variable}

Note: This approach should be expanded somehow to take the information from the meta plugin into account.

Sample plugin: get_authors

The canonical way to declare authorship of an IkiWiki page is by using the meta directive with the author parameter. This information is not available to templates or the inline directive. Additionally, you must set it manually for each page yourself.

Since I am using a revision control system (rcs), the authorship information is already present in the wiki. So why repeat myself?

In the pageinfo prototype of IkiWiki two new rcs hooks are added, which gather the creator and the last committer of a file. The sample get_authors plugin uses this information, together with a map in the setup file, to convert the rcs author information into the author's name and URL and provide them as author_name, author_url, modifier_name and modifier_url. These variables are present in the templates described above.

Note: I am not happy with this first attempt. It is rather heuristic and already needs two hooks to be implemented for each type of rcs. In a real world wiki a page can have a lot of contributors, not just the first and the last one. Should we care?

Show me the source

At http://at.magma-soft.at/gitweb you can find the pageinfo branch of the ikiwiki.git repository. In the same location you will find the ikiwiki-plugins repository with the get_author.pm plugin.