Sunday, September 19, 2010

Parsing latin1 data with an utf-8 stream

We recently noticed that ABCL isn't capable of reading files containing certain latin1 characters, like the author comments in albert. ABCL's DecodingReader uses java.nio.charset.CharsetDecoder, which defaults to throwing exceptions when it finds an unmappable character or input it otherwise considers malformed. Therefore, reading albert files, which contain Norwegian/Danish o-slashes, resulted in an exception being thrown. The same happened with files that contain o-umlauts.
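As a minimal sketch of that default behavior (this is not ABCL's DecodingReader; the class and method names are made up for illustration), a freshly created UTF-8 CharsetDecoder rejects a lone latin1 o-slash byte, while the UTF-8 encoding of the same text decodes fine:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ABCL.
class StrictDecodeDemo {
    // Returns true if the bytes decode as UTF-8 with the decoder's
    // default error actions, which report (throw) on bad input.
    static boolean decodesAsUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // In latin1, o-slash is the single byte 0xF8.
        byte[] latin1 = "sm\u00f8r".getBytes(StandardCharsets.ISO_8859_1);
        // In UTF-8, the same character is the two bytes 0xC3 0xB8.
        byte[] utf8 = "sm\u00f8r".getBytes(StandardCharsets.UTF_8);
        System.out.println("latin1 bytes decode as UTF-8: " + decodesAsUtf8(latin1));
        System.out.println("UTF-8 bytes decode as UTF-8: " + decodesAsUtf8(utf8));
    }
}
```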

The remedy that seemed obvious was changing the CharsetDecoder's error handling strategy to replace malformed input and unmappable characters. Alas, much to our surprise, this didn't work: it caused the DecodingReader to loop endlessly. Despite this document, which seems to suggest that an utf-8 byte sequence can only start with f0-f4, it seemed that o-umlaut (f6) and o-slash (f8) threw the decoder off-balance.
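For reference, the replace strategy is configured roughly as follows. This sketch uses the decoder's one-shot decode(ByteBuffer) convenience method, which manages its own buffers, so it does not reproduce the endless loop we saw in DecodingReader's own buffer handling. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ABCL.
class ReplaceDecodeDemo {
    // Decodes bytes as UTF-8, substituting the decoder's replacement
    // string (U+FFFD by default) for malformed or unmappable input.
    static String decodeReplacing(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // 0xF6 is o-umlaut in latin1 and is not valid anywhere in UTF-8.
        byte[] latin1 = {'s', (byte) 0xF6, 'k'};
        String decoded = decodeReplacing(latin1);
        System.out.println(decoded.equals("s\uFFFDk"));
    }
}
```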

After debugging the problem, we noticed that for these characters the DecodingReader reported an overflow but didn't advance the buffers it uses. So, after some consideration, we introduced a hack: when an overflow doesn't advance the buffers at all, we advance them manually by one byte and insert a ? character into the resulting stream. This reeks of a hack, but so far we haven't been able to find a more elegant solution. It does seem to let us parse files with latin1 chars such as o-umlaut and o-slash properly. The fix was committed to ABCL trunk as r12902 and will be part of ABCL 0.22.
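The idea behind the hack can be sketched like this. It is a simplified illustration, not the code committed in r12902: it decodes a whole byte array at once and triggers on the decoder reporting an error, whereas the real DecodingReader works on streaming buffers and triggers on an overflow that makes no progress. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ABCL.
class SkipOneByteDemo {
    static String decodeLenient(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never yields more chars than input bytes, and each
        // skipped byte yields exactly one '?', so this is large enough.
        CharBuffer out = CharBuffer.allocate(bytes.length);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (!result.isError()) {
                break; // all input consumed
            }
            // The decoder stopped at a byte it cannot handle:
            // skip exactly one byte and emit '?' in its place.
            in.position(in.position() + 1);
            out.put('?');
        }
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // 0xF8 is o-slash in latin1 and is not valid anywhere in UTF-8.
        byte[] latin1 = {'s', 'm', (byte) 0xF8, 'r'};
        System.out.println(decodeLenient(latin1));
    }
}
```

Skipping a single byte is the simplest resynchronization choice; it keeps every following valid character but can turn one broken multi-byte sequence into several ? characters.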

11 comments:

  1. So invalid utf-8 characters get replaced by #\? ? Why not signal a condition with a restart providing the option of using a different character in that place? (this is what flexi-streams does, if I remember correctly).

    Also, I don't think advancing the stream only one byte is the correct thing to do for invalid multi-byte characters. On most implementations FILE-LENGTH and FILE-POSITION work in terms of byte counts since that is really the only reasonable thing to do for multi-byte encodings. I don't know what ABCL does for FILE-LENGTH/POSITION, but if it's something similar, I think that FILE-POSITION will be off after reading an invalid multi-byte character.

  2. First of all, thanks for the suggestions.

    We'll certainly consider providing a condition and a restart; thus far it was deemed more important to get the solution working than to make it perfect. :)

    I'm not sure about the invalid multi-byte issue - I'll need to investigate it. The offending characters that we have tested the solution with work fine when skipping a single byte, as they are not invalid multi-byte sequences - as per the quoted RFC, they are not multi-byte sequences at all. Latin-1 characters f0-f4 will obviously be rather more problematic, because they _are_ valid multi-byte introducers in utf-8.

    We'll keep improving the solution as we gain further insights into how to do it properly. I am quite disappointed that CharsetDecoder fails to do it for me. As always, patches/contributions are heartily welcomed. ;)

  3. this is the approach taken by plan 9 (the first system to implement utf-8 throughout, i believe). with the exception that the character returned is unicode 0xFFFD (the unicode replacement character), which is somewhat less common than a question mark.

    (returning a non-ASCII character also allows an easy check that the character decoded is a genuine error rather than a legitimate 0xFFFD - if the code point is greater than 0x7f and the number of bytes consumed is 1, then it must be an error. this is probably not relevant in your case)

  4. > Despite this document, which seems to suggest that an utf-8 byte sequence can only start with f0-f4
    RFC 3629 describes UTF-8 encoding, right.

    > it seemed that o-umlaut (f6) and o-slash (f8) threw the decoder off-balance.

    These are codes of Latin-1/Unicode representation. Reading Latin-1 string with UTF-8 decoder is as stupid as reading BMP file with PNG decoder and complaining about generated error.

    JFYI, "ö" is represented as #xC3 #xB6 in UTF-8.

  5. @lispnik, although that may be very true - and it is - not everybody produces strict ascii or strict utf8 sources.

    I think it's good that ABCL tries its hardest to consume those sources out of the box anyway.

  6. @lispnik, if I tell the decoder to replace or ignore malformed input or unmappable characters, is it me being stupid if the decoder fails to do so for input that is not a utf-8 multibyte introducer? I'd imagine f6 or f8 being either unmappable or malformed, silly me. :)

    I am not complaining about a generated error. The decoder fails to generate one. Instead, I have to resort to a hack. That hack so far seems to work.

  7. Sorry, guys.

    As for the weird behavior of DecodingReader, it seems sensible to me. It simply cannot know the proper synchronization strategy: for a one-byte encoding, skipping one symbol is a good solution, but for multibyte encodings you *may* implement something more complex, like looking for the possible beginning of the next sequence...

  8. This comment has been removed by the author.

  9. I've made a couple attempts to reply here, but have met with technical openid failures each time, so I gave up and posted my response on my own blog here.

  10. Also, I'd like to report this business of substituting a '?' rather than throwing an exception as a data-corruption bug in the trac system, but need a login to do so.

    (See my aforementioned blog post for more details as to why this is a bug, not a feature).

    You can contact me at rwk at acm dot org to arrange a login, or add the trac item yourself if you prefer.

    Thanks.

  11. You can still read binary data without the replacing taking place if you want. I don't see how throwing an exception really helps anybody, as we found existing codebases (albert being one of them) that have latin-1 files that would then throw exceptions while being parsed.

    Also, I'd be interested to know how we're supposed to be able to restart from the beginning - there are cases like pipe streams that aren't seekable, so how exactly is the reader supposed to restart the parsing from the beginning using a different encoding? How's it supposed to know which encoding to use? Having a restart is nice for people using a REPL, but it becomes less useful for people embedding ABCL.
