I thought I had a handle on character sets and encodings, and then someone sent me a CSV file that they had tried to convert to XLS format using a tool I had written, and it didn't work: some of the international characters were mangled. To make a long story short, the converter itself was written correctly, but it used Apache POI 3.2 to create the Excel file. Apparently, even though the HSSF module knows that it should encode text as UTF-16 in Excel files, it (sometimes??) uses the platform's default encoding for its output. It boiled down to this:
public class HelloWorld {
    public static void main(String[] args) {
        String greeting = "H\u00eb\u0142\u0142\u00f4 World!";
        System.out.println(java.nio.charset.Charset.defaultCharset().name() + ": " + greeting);
    }
}
javac HelloWorld.java
java HelloWorld
java -Dfile.encoding=utf-8 HelloWorld
java -Dfile.encoding=Cp1252 HelloWorld
On my Mac, it shows the following output:
MacRoman: H???? World!
UTF-8: Hëłłô World!
windows-1252: H???? World!
Same output in image format in case the text in this post gets messed up:
Notice that the Unicode characters were safely embedded in the source code using their code point escapes, so the source file itself can be encoded in US-ASCII. Java represents all Strings internally as Unicode (UTF-16), so in this example the string itself is intact and the problem happens in System.out.println(): System.out is a PrintStream that encodes characters with the platform's default charset, which is what -Dfile.encoding overrides here. I was able to show that it wasn't my converter, because when it was executed on the other machine with -Dfile.encoding=utf-8 it worked just fine. My converter also wasn't writing the output directly; it was using Apache POI to create the output file. And Apache POI accepts a plain OutputStream for its binary output, not an OutputStreamWriter that can be told which character set to use.
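For plain console or file output (this does not fix POI's binary output), the dependence on the default charset can be avoided by naming the charset explicitly when wrapping the stream. The following is only a minimal sketch of that idea: the class name ExplicitCharsetDemo and the helpers withCharset and printTo are made up for illustration, and it assumes the terminal reading the output actually expects UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from the original converter.
public class ExplicitCharsetDemo {
    static final String GREETING = "H\u00eb\u0142\u0142\u00f4 World!";

    // Wraps a stream in a PrintStream that encodes with the named charset
    // instead of the platform default.
    static PrintStream withCharset(OutputStream out, String charsetName) {
        try {
            return new PrintStream(out, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Prints text through such a stream and returns the raw bytes produced.
    static byte[] printTo(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = withCharset(buffer, charsetName);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // US-ASCII cannot represent the four accented letters, so the
        // encoder substitutes '?' for each of them.
        byte[] ascii = printTo(GREETING, "US-ASCII");
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));

        // UTF-8 can represent every character; the accented letters
        // take two bytes each.
        byte[] utf8 = printTo(GREETING, "UTF-8");
        System.out.println(utf8.length + " bytes in UTF-8");

        // Same greeting, but encoded as UTF-8 regardless of -Dfile.encoding.
        PrintStream stdout = withCharset(System.out, "UTF-8");
        stdout.println(GREETING);
    }
}
```

The first line printed is "H???? World!", the same kind of substitution seen in the output above. The last line is encoded as UTF-8 whatever the default charset is, but it only displays correctly if the terminal decodes UTF-8.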