Sunday, September 26, 2010

ABCL 0.22.0 released

On behalf of the developers of ABCL (Armed Bear Common Lisp) I'm glad to be able to announce the 0.22.0 release.

ABCL is a Common Lisp implementation written in Java and running on the JVM, featuring both an interpreter and a compiler. The compiler targets the JVM directly, meaning that its output is runnable JVM bytecode. The fact that ABCL is written in Java allows for relatively easy embedding in larger applications. For integration with existing applications, ABCL implements Java Specification Request (JSR) 223, the Java scripting API.

This is a small maintenance release: most effort was focussed on work happening on branches, some of which has already been merged to trunk for 0.23.



Latest and older binary and source distributions can be downloaded from http://common-lisp.net/project/armedbear/releases/

Sunday, September 19, 2010

Parsing latin1 data with a utf-8 stream

We recently noticed that ABCL isn't capable of reading files containing certain latin1 characters, like the author comments in albert. ABCL's DecodingReader uses java.nio.charset.CharsetDecoder, which by default reports an error, surfacing as an exception, when it finds an unmappable character or input it otherwise considers malformed. Therefore, reading albert files, which contain Norwegian/Danish o-slashes, resulted in an exception being thrown. The same happened with files that contain o-umlauts.
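The default behaviour is easy to see outside ABCL. The following is a minimal sketch using only the standard java.nio.charset classes; the class name StrictDecodeDemo and the sample word are made up for illustration and are not part of ABCL. It feeds latin1-encoded bytes to a utf-8 decoder left at its default settings and reports whether the decoder rejected them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not ABCL code.
class StrictDecodeDemo {
    // Returns the decoded text, or null if the decoder rejected the input.
    static String decodeStrict(byte[] bytes) {
        // A fresh decoder uses CodingErrorAction.REPORT for malformed
        // and unmappable input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String word = "K\u00f8benhavn"; // contains an o-slash
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1); // o-slash is the single byte f8
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeStrict(latin1) == null); // true: f8 is rejected
        System.out.println(word.equals(decodeStrict(utf8))); // true: valid utf-8 decodes fine
    }
}
```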

The remedy that seemed obvious was changing the CharsetDecoder's error handling strategy to replace malformed input and unmappable characters. Alas, to our surprise, this didn't work: it caused the DecodingReader to loop endlessly. Despite this document, which seems to suggest that a utf-8 byte sequence can only start with f0-f4, it seemed that o-umlaut (f6) and o-slash (f8) threw the decoder off-balance.
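For reference, the strategy change itself is a small one. The sketch below shows it on a standalone decoder; ReplaceDecodeDemo is a hypothetical name, and this is not the DecodingReader code. It does not reproduce the endless loop described here, and on a current JDK we would expect the stray byte simply to come out as the replacement character U+FFFD.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not ABCL code.
class ReplaceDecodeDemo {
    static String decodeLenient(byte[] bytes) {
        // Ask the decoder to substitute its replacement string instead of
        // reporting malformed or unmappable input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected once both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "K\u00f8benhavn".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeLenient(latin1));
    }
}
```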

After debugging the problem, we noticed that in the case of these characters, the DecodingReader reported an overflow but didn't advance the buffers it uses. So, after some consideration, we introduced a workaround: when an overflow occurs without the buffers advancing at all, we advance them manually by one byte and insert a ? character into the resulting stream. This reeks of a hack, but so far we haven't been able to find a more elegant solution. It does seem to enable us to properly parse files with latin1 characters such as o-umlaut and o-slash. The fix was committed to ABCL trunk as r12902 and will be part of ABCL 0.22.
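The idea behind the workaround can be sketched as a plain decode loop. This is an illustration only, not the code from r12902: the class name StallSafeDecode, the 64-char output buffer, and the loop structure are all assumptions made for the example. The loop decodes in chunks, and if the decoder reports an overflow without consuming any input or producing any output, it skips one byte and emits a ? in its place.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the workaround, not ABCL's DecodingReader.
class StallSafeDecode {
    static String decode(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(64); // arbitrary chunk size
        StringBuilder text = new StringBuilder();

        while (true) {
            int inBefore = in.position();
            CoderResult result = decoder.decode(in, out, true);
            int produced = out.position();

            // Drain whatever was decoded in this round.
            out.flip();
            text.append(out);
            out.clear();

            if (result.isUnderflow()) {
                break; // all input consumed
            }
            if (result.isError()) {
                // Should not happen with REPLACE actions.
                throw new IllegalStateException(result.toString());
            }
            // Overflow into an empty buffer with no input consumed means
            // the decoder is stuck: skip one byte and mark the spot.
            boolean stalled = in.position() == inBefore && produced == 0;
            if (stalled && in.hasRemaining()) {
                in.get();
                text.append('?');
            }
        }

        decoder.flush(out);
        out.flip();
        text.append(out);
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] latin1 = "K\u00f8benhavn".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(latin1));
    }
}
```

On a decoder that handles these bytes itself, the stall branch may never fire and the stray byte shows up as U+FFFD rather than ?. The branch only matters when the decoder makes no progress, which is the situation described above.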