Best way to write terms for interoperability

This is half a question and half me sharing what I already found out. In short: when exchanging Prolog terms (in text form) with a different Prolog system or other software that reads Prolog terms, how do I make SWI-Prolog write terms so that the other side has the best chance of understanding them?

The obvious answer is write_canonical/1, which quotes atoms where necessary and uses plain f(A, B, C) syntax for all compound terms… well, almost all. SWI-Prolog’s write_canonical/1 actually still uses special syntax for lists ([x] instead of '[|]'(x,[]) or '.'(x,[])) and braces ({x} instead of {}(x)). This is fine when exchanging terms with SWI-Prolog and most other Prolog systems, but can be a problem with other parsers. (Guess who’s working with a Prolog term parser that doesn’t even understand list syntax…)
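To make the problem concrete, here is a minimal sketch of a reader for fully canonical terms on the non-Prolog side (hypothetical illustration, not the parser mentioned above): it understands quoted/plain atoms, integers, `[]` and `f(Arg,...)` notation only, i.e. exactly the conservative subset discussed below, with no list or brace sugar and no operators.

```python
import re

# Token syntax for fully canonical terms: integers, plain atoms,
# quoted atoms, [] and the punctuation ( ) , -- no operators,
# no [a,b] or {a} sugar. Hypothetical sketch, not production code.
_TOKEN = re.compile(r"""\s*(?:
      (?P<int>-?\d+)
    | (?P<plain>[a-z][a-zA-Z0-9_]*)
    | '(?P<quoted>(?:[^'\\]|\\.)*)'
    | (?P<nil>\[\])
    | (?P<punct>[(),])
    )""", re.VERBOSE)

def _unescape(s: str) -> str:
    # Resolve \uXXXX and single-character escapes like \' and \\ .
    return re.sub(r"\\u([0-9a-fA-F]{4})|\\(.)",
                  lambda m: chr(int(m.group(1), 16)) if m.group(1) else m.group(2),
                  s)

def parse_canonical(text: str):
    """Parse e.g. pair(a,'B q') into Python values: compounds become
    (functor, arg, ...) tuples, atoms become str, integers become int."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if not m:
            raise SyntaxError("bad input at %r" % text[pos:pos + 10])
        pos = m.end()
        tokens.append(m)

    def parse(i):
        m = tokens[i]
        if m.group("int") is not None:
            return int(m.group("int")), i + 1
        if m.group("plain") is not None:
            name = m.group("plain")
        elif m.group("quoted") is not None:
            name = _unescape(m.group("quoted"))
        elif m.group("nil") is not None:
            name = "[]"
        else:
            raise SyntaxError("unexpected %r" % m.group("punct"))
        i += 1
        if i < len(tokens) and tokens[i].group("punct") == "(":
            args, i = [], i + 1
            while True:
                arg, i = parse(i)
                args.append(arg)
                p = tokens[i].group("punct")
                i += 1
                if p == ")":
                    return (name, *args), i
                if p != ",":
                    raise SyntaxError("expected , or )")
        return name, i

    value, end = parse(0)
    if end != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return value
```

A writer that stays within this subset can be read by essentially any term parser, which is exactly the goal of the options explored next.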

To disable the special syntax for lists and braces as well, we have to use write_term/2 and set the appropriate options. First, we need quoted(true), ignore_ops(true), and character_escapes_unicode(false) to get the behavior of write_canonical/1. Then we can use dotlists(true) to write lists as .(H,T) and brace_terms(true) to write braces as {}(X).

Another problem still remains though, namely Unicode characters in atoms. SWI-Prolog has full Unicode support in source code, so Unicode letters and symbols like α can be used in unquoted atoms the same way as ASCII letters/symbols. Not all Prolog systems support Unicode this well though - for example, SICStus only supports Latin-1 characters unquoted, so ä is a valid atom in SICStus, but α is not and has to be written as 'α'. This causes problems when sending Unicode atoms from SWI-Prolog to SICStus: all SWI term writing predicates write an atom like α unquoted, which SICStus isn’t able to read.

Is there any way to change this behavior and force Unicode characters in atoms to be quoted or escaped? I found the option character_escapes, but that apparently only affects characters that SWI wouldn’t accept in unquoted atoms anyway (e.g. \uFEFF). Characters like α are still not quoted even with character_escapes(true) (which seems to be the default setting anyway).

TLDR: The best I can do is:

write_term(Term, [
    quoted(true),
    ignore_ops(true),
    character_escapes_unicode(false),
    dotlists(true),
    brace_terms(true)
]).
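To illustrate what the quoting problem looks like from the receiving side, here is a rough Python sketch (hypothetical, not SWI-Prolog's actual algorithm) of the conservative rule one would want a writer to apply: quote everything that is not a plain lowercase-ASCII atom, and escape non-ASCII characters with SWI's \uXXXX notation.

```python
import re

# Atoms that virtually any Prolog reader accepts unquoted: a lowercase
# ASCII letter followed by ASCII letters, digits and underscores.
# (Real systems also accept solo and symbol atoms; this is conservative.)
_PLAIN_ATOM = re.compile(r"[a-z][a-zA-Z0-9_]*")

def quote_atom(atom: str) -> str:
    """Write an atom so that an ASCII-only Prolog reader can parse it."""
    if _PLAIN_ATOM.fullmatch(atom):
        return atom
    out = []
    for ch in atom:
        if ch == "\\":
            out.append("\\\\")
        elif ch == "'":
            out.append("\\'")
        elif 32 <= ord(ch) < 127:
            out.append(ch)
        else:
            # SWI-style \uXXXX escape; strictly ISO readers need \x...\
            out.append("\\u%04X" % ord(ch))
    return "'" + "".join(out) + "'"
```

With this rule, α would be written as '\u03B1', which a conservative reader with SWI-compatible escapes can parse - exactly the behavior that is missing from the option list above.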

Somewhat related:

It seems you figured out most of it. SWI-Prolog writes lists as [a,b,c] because of its different internal representation. There is another big advantage though: with .(a,.(b,.(c,[]))), the reader cannot reduce any of these terms before it sees the first ), because they may turn out not to have arity 2. As a result, the [a,b,c] notation can handle much longer lists. I doubt {a} was intentionally written that way. Might be a mistake. I’d have to look through the history.

Biggest problem is indeed non-ASCII text. I thought this could be resolved using a stream that does not support anything but ASCII, but that gives us this (note that switching the encoding of user_output typically only works on POSIX systems that run directly in a terminal emulator):

?- set_stream(user_output, encoding(ascii)).
true.
?- write_canonical(α).
\u03B1

This looks like a bug to me, as the output should have been quoted. It should also use the Prolog \Code\ notation rather than \u03B1. That is normally controlled by the Prolog flag character_escapes_unicode. That too doesn’t work.

These problems arise because the escaping is not done by write_term/3, but by the stream’s error handling (see stream_property/2, property representation_errors).

What seems to be needed is that write_term/3 either gets a flag that indicates the unquoted character set or uses the encoding of the stream for that. I think SICStus’ choice to allow ä as an atom but not α is a bit dubious. The Prolog standard only talks about the ASCII code points. Above that we must make a choice. Including only the letters from ISO Latin-1 is a rather arbitrary one. SWI-Prolog uses all Unicode characters that have certain Unicode properties (ID_Start/ID_Continue, see the SWI-Prolog manual).

We are still faced with two problems: when to quote and when to use \Code\. I’m tempted to connect these and have the encoding force \Code\ (or, at the user’s choice, \uXXXX/\UXXXXXXXX), which will in turn force quotes. That makes machine-to-machine transfer of terms portable. The drawback is that non-ASCII strings get pretty unreadable. We could have a flag that forces quoting of atoms when they contain non-ASCII characters.

Ideas welcome …

P.S. Checking out unquoted_atomW(), which decides whether an atom can be written without quotes, suggests that the stream encoding is intended to work.

The {...} notation seems to be forgotten in almost every Prolog system’s documentation – I remember being surprised by it in Quintus Prolog decades ago when I inadvertently used {...} outside of a DCG, and later in SWI-Prolog when I wrote a dict without the tag.

?- {a,b} =.. [{}, Arg], Arg =.. ArgX.
Arg =  (a, b),
ArgX = [',', a, b].

write_canonical/1 ought to output {a,b} as '{}'(','(a,b)).

Speaking of dicts, what should write_canonical/1 do? If the intent is interoperability, it might be nice to have an iso_only option, to prevent surprises if the code happens to output SWI-Prolog syntactic extensions.

Another possibility for interoperability is to implement library(fastrw) at the receiving end, which has the additional advantage of a speed-up for large terms.


One possibly confusing thing is that {...} takes a single term, and commas inside it are operators. So novices might think that {a,b} is the same as '{}'(a,b), but it’s actually '{}'((a,b)), i.e. '{}'(','(a,b)).

I’ve seen novices write things like (a,b,c), not realizing that this is equivalent to ','(a,','(b,c)). They also often don’t realize that the commas in the right-hand side of a goal are operators: ((a:-b,c)) == ':-'(a,','(b,c)).

At least with write_term/2, this seems to be intentional, according to this footnote from the docs:

In traditional systems this flag also stops the syntactic sugar notation for lists and brace terms. In SWI-Prolog, these are controlled by the separate options dotlists and brace_terms

But in write_canonical/1 it doesn’t really make sense - unlike with lists, the brace syntax doesn’t hide a non-traditional term format that would cause problems with other systems.

Interesting idea to use a different stream encoding - I didn’t know that SWI’s streams could automatically replace unsupported characters with Prolog escape sequences like that. But as you say, this only works correctly for characters that are already inside quotes.

This at least seems to be documented in stream_property/2:

The behaviour is one of error (throw an I/O error exception), prolog (write \x<hex>\ ), unicode (write \uXXXX or \UXXXXXXXX escape sequences) or xml (write &#...; XML character entity). The initial mode is unicode for the user streams and error for all other streams.

That seems like a reasonable solution. Any Prolog system with basic Unicode support can probably handle non-ASCII characters inside quotes, so quoting non-ASCII characters but not escaping them should be relatively compatible. Based on that, it should also be possible to implement a pure ASCII term output if necessary, using an ASCII stream with representation_errors(prolog).
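The three documented replacement modes are easy to mimic outside Prolog. A small Python sketch (illustrative only, mirroring the modes quoted from the docs above) of what each mode produces for characters the target encoding cannot represent:

```python
def ascii_escape(text: str, mode: str = "prolog") -> str:
    """Escape non-ASCII characters the way the documented
    representation_errors modes do (sketch based on the docs above)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif mode == "prolog":          # \x<hex>\ (ISO character escape)
            out.append("\\x%X\\" % cp)
        elif mode == "unicode":         # \uXXXX or \UXXXXXXXX
            out.append("\\u%04X" % cp if cp <= 0xFFFF else "\\U%08X" % cp)
        elif mode == "xml":             # &#...; XML character entity
            out.append("&#%d;" % cp)
        else:
            raise ValueError("unknown mode: %s" % mode)
    return "".join(out)
```

So for a pure ASCII transport, the prolog mode would turn α into \x3B1\, which any ISO-conforming reader can decode.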

Our code actually has some fastrw support already, precisely because of the improved performance. But it uses the SICStus version of fastrw, which is completely undocumented and AFAIK uses a different data format than SWI’s implementation. Not that great for interoperability :slight_smile:

The big advantage of plain Prolog terms is that they are natively compatible with any Prolog system, as long as you restrict the syntax sufficiently. A fastrw-like format loses that advantage, even though it has other benefits like allowing faster parsing and more compact data.

Right. Some digging in the history shows the introduction of the brace_terms option without adjusting write_canonical/1. Fixed and added a test case.

That was a bug. write/1 was supposed to check that the stream encoding can handle all characters of an atom and switch to quoted writing otherwise. It should also do the escaping itself rather than leaving it to the stream. This was also buggy. Both now work. The intent of this I/O stream feature is to do something more sensible with Unicode atoms that the user streams cannot represent than stopping the process.

I’ve added an option quote_non_ascii(Bool) to write_term/2,3 that will switch to quoted representation when an atom contains at least one non-ASCII character. The default is false, but true for write_canonical/1. Together with how writing handles encoding limitations that should properly take care of atoms holding non-ASCII characters. See also the test cases in the recent commits. Please add more, especially if you need to rely on some behavior :slight_smile:


Speaking of dicts, what should write_canonical/1 do? If the intent is interoperability, it might be nice to have an iso_only option, to prevent surprises if the code happens to output SWI-Prolog syntactic extensions.

I don’t know. Interoperability is a nightmare. I guess atoms can only be safely exchanged if they contain characters 1…127, while some systems have severe limits on the length of atoms (127 chars?). For systems using bounded integers there is no minimum requirement. Some older systems only supported 32-bit machine words excluding the tags, so you’d end up with e.g. 29-bit signed integers. Some systems also drop some bits from floating point numbers :slight_smile: Next we have all the extensions in data types. You need to consider the target Prolog systems and the application requirements. A lot actually works in concrete cases.

:frowning: fastrw is supported in quite a few systems. It would be great if there was a standard representation that could be extended to support non-standard types such as strings, rational numbers, etc. I doubt there is much hope we get this done.

You can always use protobufs. :grimacing:
That even allows exchanging data with C++, Python, Java, etc. although I expect performance would be worse than write_canonical/1 and read/1.

General interoperability is, as you say, a nightmare. You’d probably need to add some kind of “typing”, for example to verify that integers are in the appropriate range. Nobody seems to have got very far with a comprehensive type system for Prolog terms, so this seems like a good Masters thesis. :smile_cat:


Jan W. has several patches related to this topic.

Note: The list is based on dates and not the exact commits.

I wouldn’t say that. Most of the type systems did come up with ways of typing terms AFAIK. Some probably have numeric ranges. I doubt any has dealt with atom content such as allowed characters or length limits. Probably the work on Ciao is the most comprehensive here. You want something like JSON or XML Schema?


Is An overview of Ciao and its design philosophy the best description of “types” that Ciao has developed?

I’ve been able to avoid those for most of my life … I attended the first XML conference and when I saw early versions of XML Schema and XQuery, I decided to work on things that had nothing to do with XML (despite knowing and respecting people like Tim Bray (who wrote the Annotated XML Specification)). :wink:

My needs for interoperability are rather modest – e.g., I was able to use JSON for passing data between Python and Prolog but it was a bit slow, so I wrote some simple Python code to output canonical Prolog terms. Elsewhere, protobufs suffice.
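As an illustration of that approach (a hypothetical sketch, not the actual code referred to here), such a Python-side writer only needs to handle numbers, atoms, lists and compound terms:

```python
import re

_PLAIN_ATOM = re.compile(r"[a-z][a-zA-Z0-9_]*")

def _atom(a: str) -> str:
    # Quote unless it is a plain lowercase-ASCII atom.
    if _PLAIN_ATOM.fullmatch(a):
        return a
    return "'" + a.replace("\\", "\\\\").replace("'", "\\'") + "'"

def canonical(term) -> str:
    """Render a Python value as a canonical Prolog term. Encoding:
    str = atom, list = list, tuple = compound (functor, args...)."""
    if isinstance(term, bool):
        raise TypeError("booleans have no portable Prolog syntax")
    if isinstance(term, (int, float)):
        return repr(term)
    if isinstance(term, str):
        return _atom(term)
    if isinstance(term, list):
        return "[" + ",".join(map(canonical, term)) + "]"
    if isinstance(term, tuple) and term and isinstance(term[0], str):
        functor, *args = term
        return _atom(functor) + "(" + ",".join(map(canonical, args)) + ")"
    raise TypeError("cannot encode %r" % (term,))
```

For example, canonical([1, ('pair', 'a', 'B')]) gives [1,pair(a,'B')], which read/1 on the Prolog side parses directly.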

I was a little surprised, but this seems indeed to be the case. Creating a list with the integers 1..1,000,000 and reading it back got me these timings:

How                      Write time (sec)   Read time (sec)
JSON (SWI-Prolog)        -                  0.689
JSON (node.js v10.19)    -                  0.040
SWI-Prolog               0.200              0.126
Ciao 1.20                0.395              0.759
XSB 3.8.0                1.040              1.986
YAP 6.3                  0.267              0.276
SICStus 4.6              0.245              0.280
ECLiPSe 6.1              0.094              0.062

Given its simplicity and importance, a complete C implementation for JSON might be worth considering.

The recently discussed library(fastrw) gives the results below. As Prolog text the size is 6,888,900 bytes (one byte more than the JSON text :slight_smile: )

How           Size        Write time (sec)   Read time (sec)
SWI-Prolog    5,967,116   0.068              0.056
SICStus 4.6   8,887,733   0.124              0.084
Ciao 1.2.0    8,887,733   0.056              SEGV

The Ciao version works with a list of 1,000 elements, so the code seems fine. The file size is exactly the same as for SICStus. Trying to load the Ciao version using SICStus gives a version mismatch error.

All tests on an AMD 3950X with 64GB RAM (so file reading is from cache), Ubuntu 20.04.

If anyone wants the programs, I’m glad to share them. They are a bit of a mess though as it is not so easy to write these tests such that they work everywhere :frowning:

Gets my vote. :+1:

Having spent a good portion of my time learning to create SWI-Prolog web servers, HTML pages and Cytoscape.js pages, JSON is at the heart of getting options and data to Cytoscape.js. Also in heavy use are reply_json/1 and dict_create/3.


Cytoscape.js (ref)

  • Fully serialisable and deserialisable via JSON
Personal Notes:

Parsing JSON is a Minefield
Tips on Adding JSON Output to Your CLI App
I disagree with the tip "Do Flatten the Structure" and think there should be a separate tool (think Linux command-line piping) that flattens the structure. That way a user can get a nested or flattened version as desired or needed.

It would be interesting to compare performance of JSON with protobufs - the former is a fairly verbose wire format and the latter tries to be compact.

FWIW, the Python community seems to be OK with a pure Python implementation of JSON. This might be because the pure Python version is easily extensible, whereas pure C is more restrictive.

If you want to see the performance possibilities, here are some benchmarks: GitHub - ultrajson/ultrajson: Ultra fast JSON decoder and encoder written in C with Python bindings


Perfect, thank you! With quote_non_ascii added, our custom Prolog term parser is happy now :slight_smile: I need to look into extending the parser to support list syntax - then we should be able to use plain write_canonical/1 again, with the latest changes in SWI.

I think this depends on what exactly the user is trying to achieve with write_canonical/1. My impression is that there are different use cases which were previously all served well by write_canonical/1, but with SWI’s extensions it’s difficult to accommodate all of them with a single predicate. Specifically, I think these are the main use cases:

  • Exchanging terms between different instances of SWI-Prolog. Syntax extensions like dicts are not a problem here. It’s only important that operator declarations are ignored, as these can vary between SWI instances.
  • Exchanging terms between different Prolog systems. Here it’s important to be conservative with quoting and avoid syntax extensions. Dicts would need to be represented as some other term that’s unlikely to conflict with anything, e.g. a compound term with functor '$DICT'. In that case, it would be good to also have a read_term/2 option in SWI that maps '$DICT' terms back to real dicts.
  • Displaying a term’s exact internal representation for debugging/teaching. Here we want to avoid all syntactic sugar like lists and braces. Since dicts are internally just compound terms with a funny functor, they should probably be written like compound terms as well.

I don’t have a strong opinion on which of these behaviors is best for write_canonical/1. It doesn’t affect me right now thankfully, because our code comes from SICStus and doesn’t use dicts :slight_smile:

JSON, protobufs, etc. avoid most of the interoperability problems of course, because they are much more well-specified and consistently implemented. The disadvantage is that they are not Prolog-native data formats, so you generally have to convert your Prolog data structures into an intermediate format for output.

In our specific case, the Prolog process communicates with a Java library, which is quite tightly coupled to the Prolog part. The interface between the two gets extended/modified relatively often, and using Prolog terms as the interchange format simplifies the development a bit, because we can skip the data conversion/mapping work on one of the two sides at least.

I’m not sure about dicts. I’d like to keep the internals hidden. A possible portable exchange is to use e.g. ‘$DICT’(Tag, {key:value, …}). When transferring between Prolog systems you have to deal with limitations anyway and thus you must make sure the exchanged terms are recognized by both systems. If I have a rational number, I can transfer it to ECLiPSe, and few (no?) others. Same applies to IEEE floats such as NaN and Inf. Or an atom that may be limited by length, whether or not it can hold a 0-code and the range of character codes it supports.

If you want to stay portable you probably need to stick with the traditional Prolog types, limit atoms to hold only non-null bytes, use only ASCII, and be at most 128 chars long, keep compounds below arity 32 (I think I’ve seen that limit) and integers below signed 28 bits. Even that might be too much for some systems :frowning:
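Those limits can be checked mechanically before sending. A hedged sketch (the concrete bounds are the guesses from above; adjust them for the actual target systems), using a simple Python encoding of terms (str = atom, list = list, tuple = compound):

```python
def check_portable(term, max_atom=128, int_bits=28, max_arity=32):
    """Conservative portability check following the limits discussed
    above: ASCII-only non-null atoms up to 128 chars, compound arity
    below 32, integers fitting in signed 28 bits. Anything else
    (floats, booleans, unknown types) is flagged as non-portable."""
    if isinstance(term, bool):
        return False
    if isinstance(term, int):
        return -(1 << (int_bits - 1)) <= term < (1 << (int_bits - 1))
    if isinstance(term, str):
        return term.isascii() and "\x00" not in term and len(term) <= max_atom
    if isinstance(term, list):
        return all(check_portable(t, max_atom, int_bits, max_arity)
                   for t in term)
    if isinstance(term, tuple) and term:
        functor, *args = term
        return (check_portable(functor, max_atom, int_bits, max_arity)
                and len(args) < max_arity
                and all(check_portable(a, max_atom, int_bits, max_arity)
                        for a in args))
    return False
```

Rejecting floats outright is deliberately paranoid; for a concrete pair of systems one would whitelist whatever both sides actually support.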

Let us keep things practical …