We recently noticed that ABCL couldn't read files containing certain latin1 characters, such as the author comments in albert. ABCL's DecodingReader uses java.nio.charset.CharsetDecoder, which defaults to throwing exceptions when it finds an unmappable character or input it otherwise considers malformed. As a result, reading albert files, which contain Norwegian/Danish o-slashes, threw an exception. The same happened with files that contain o-umlauts.
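To illustrate the failure mode, here is a minimal sketch using only the JDK, not ABCL's DecodingReader. It assumes the file was being decoded as UTF-8, which this post implies but does not state outright; the class and method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of ABCL.
final class StrictDecodeDemo {
    /** Returns true if the bytes decode as UTF-8 under the decoder's default (reporting) error actions. */
    static boolean decodesAsUtf8(byte[] bytes) {
        // A fresh decoder reports malformed and unmappable input instead of replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The convenience decode method turns a reported error into an exception.
            return false;
        }
    }

    public static void main(String[] args) {
        // "f", o-slash, "r" encoded as latin1: the o-slash becomes the single byte 0xF8.
        byte[] latin1 = "f\u00F8r".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodesAsUtf8(latin1));
        System.out.println(decodesAsUtf8("for".getBytes(StandardCharsets.UTF_8)));
    }
}
```

On a current JDK the first call fails to decode and the plain ASCII input decodes fine.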
The obvious remedy seemed to be changing the CharsetDecoder's error handling strategy to replace malformed input and unmappable characters. Alas, much to our surprise, this didn't work: it caused the DecodingReader to loop endlessly. Despite this document, which seems to suggest that a UTF-8 byte sequence with a lead byte in the f range can only start with f0-f4, which would make the latin1 bytes for o-umlaut (f6) and o-slash (f8) plainly malformed input, those bytes threw the decoder off-balance.
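For reference, switching the error handling strategy looks roughly like the sketch below. It uses only the JDK's convenience decode method and made-up names, and again assumes UTF-8 decoding. It does not reproduce the endless loop, which showed up in DecodingReader's own buffer handling rather than in this one-shot call.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of ABCL.
final class ReplaceDecodeDemo {
    /** Decodes bytes as UTF-8, substituting the decoder's replacement string for bad input. */
    static String decodeUtf8Replacing(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected once both actions are REPLACE; kept because the method declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "f\u00F8r".getBytes(StandardCharsets.ISO_8859_1);
        // On a current JDK the 0xF8 byte comes back as U+FFFD, the default replacement.
        System.out.println(decodeUtf8Replacing(latin1).equals("f\uFFFDr"));
    }
}
```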
After debugging the problem, we noticed that for these characters the DecodingReader reported an overflow but didn't advance the buffers it uses. So, after some consideration, we introduced a workaround: when an overflow doesn't advance the buffers at all, we advance them manually by one byte and insert a ? character into the resulting stream. This reeks of a hack, but so far we haven't been able to find a more elegant solution. It does seem to let us properly parse files containing latin1 characters such as o-umlaut and o-slash. The fix was committed into ABCL trunk as commit r12902 and will be part of ABCL 0.22.
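The idea can be sketched in plain Java as follows. This is not the code from r12902: the class name, method names and buffer size are invented for illustration, and the loop is simplified to decode a whole byte array at once. On a current JDK the stall branch is not expected to trigger, since the decoder replaces the offending byte on its own, so treat it as a picture of the approach rather than a reproduction of the bug.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the workaround; not ABCL's DecodingReader.
final class StallSkippingDecoder {
    static String decode(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(16); // emptied before every decode call
        StringBuilder sb = new StringBuilder();
        CoderResult result;
        do {
            int inBefore = in.position();
            result = decoder.decode(in, out, true);
            // A stall: overflow reported although no byte was read and no char was written.
            boolean stalled = result.isOverflow()
                    && in.position() == inBefore
                    && out.position() == 0;
            drain(out, sb);
            if (stalled) {
                if (!in.hasRemaining()) {
                    break; // nothing left to skip
                }
                in.get();       // advance the input by one byte by hand
                sb.append('?'); // and mark the spot in the output
            }
        } while (result.isOverflow());
        decoder.flush(out);
        drain(out, sb);
        return sb.toString();
    }

    /** Moves everything written to the buffer so far into the builder and empties the buffer. */
    private static void drain(CharBuffer out, StringBuilder sb) {
        out.flip();
        sb.append(out);
        out.clear();
    }

    public static void main(String[] args) {
        byte[] latin1 = "f\u00F8r".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(latin1, StandardCharsets.UTF_8));
    }
}
```

Because the output buffer is drained before each call, an overflow that moves neither buffer cannot be explained by a full output buffer, which is what makes skipping a byte a defensible reaction here.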