Byte Order Marks (BOM)

The so-called Byte Order Mark is a special unicode character that has no visual representation. The point of it, is to start your text data with this pseudocharacter; it serves as a way to identify “Endianness” – that the text is encoded with UTF-16 (Little Endian), or UTF-16 (Big Endian), or UTF-8.
Java handles it kinda weirdly; this post describes how it works.
The BOM is the bytes: 0xFE 0xFF. That means:

Encoding First bytes in the stream
UTF-16, Little Endian FF FE
UTF-16, Big Endian FE FF
UTF-8 EF BB BF

You can use these to identify streams.
In Java, the BOM is left in the stream data. So, if you for example have a UTF-8 text file that starts with a BOM, and you read it into a String, your string starts with the BOM character. It doesn’t show up when you print it, but it still ‘counts’, in the sense that the .length() call on your string counts the BOM as 1 character, and a string that starts with a BOM is not equal to one that doesn’t start with it, even if they are otherwise the same. You probably want to filter it out!!
The only exception is the special encoding UTF-16. This encoding will, if it’s there, consume the BOM and use it to configure itself as Little Endian or Big Endian. If there is no BOM, it defaults to big endian. Note that this ‘consume the BOM’ behaviour does not apply to the encodings UTF-16LE and UTF-16BE. They read the BOM as normal.
NB: Esoteric note: The unicode character 0xFF 0xFE is intentionally defined as not valid, so that the BOM can be used unambiguously as indicating the endianness of a UTF-16 stream. However, in java, reading this special invalid character does not throw an exception. You can therefore read the byte data: FF FE 41 00, which is the string "A" encoded with a BOM as UTF-16 Little Endian using the encoding UTF-16BE. This produces garbage, but does not throw an exception.

Leave a Reply