The Evils of java.lang.String.toUpperCase
String.toUpperCase is not as innocent as it looks. If you carefully read the docs you’ll see that the no-argument version silently uses the default java.util.Locale, and that there is an overload that takes a Locale explicitly. The reasoning behind this is that there are language-specific rules for converting lowercase letters to uppercase. German, for example, has the letter “ß”, which gets converted to “SS”, so “straße” becomes “STRASSE”. See the problem? The String length changed! This can trip you up if you stored the length somewhere before you called toUpperCase. I’m sure there are similar examples in other languages, so watch out and never assume that a String length you stored before a case conversion is still valid afterwards.
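Here is a minimal sketch of the effect. The class name UpperCaseLength and the helper upper are made up for illustration; the only library calls are String.toUpperCase(Locale) and Locale.forLanguageTag. The Turkish case is not from the example above; it is the one the String Javadoc itself uses to show that the result depends on the locale. The non-ASCII letters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.Locale;

public class UpperCaseLength {
    // Hypothetical helper: uppercases with an explicit locale so the
    // result does not depend on the JVM's default locale.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        String street = "stra\u00DFe"; // "straße"
        String shouted = upper(street, Locale.GERMAN);

        System.out.println(shouted);                                   // STRASSE
        System.out.println(street.length() + " -> " + shouted.length()); // 6 -> 7

        // Same input, different locale, different result:
        // in Turkish, 'i' uppercases to the dotted capital I (U+0130).
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println(upper("title", Locale.ENGLISH));                    // TITLE
        System.out.println(upper("title", turkish).equals("T\u0130TLE"));      // true
    }
}
```

If the text is not meant for human eyes (keys, protocol tokens, identifiers), passing Locale.ROOT explicitly is a common way to keep the result independent of the machine's default locale. It does not prevent the “ß” expansion, though, so the length can still change.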
1 Comment »
The whole thing is even worse. length() is not the number of characters, but the number of 16-bit char values, i.e. UTF-16 code units. In most cases this is the same, but characters outside the Basic Multilingual Plane, including some rare Chinese characters, require two UTF-16 code units (a surrogate pair). The same holds for charAt etc.: you always get UTF-16 code units.
A while back there was a long discussion on how to extend Java to support characters above 64k, that is, code points beyond U+FFFF. Seven approaches were dropped for various reasons before they finally chose this one:
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
If you want the correct length of a string, use methods like codePointCount(), e.g. s.codePointCount(0, s.length()).
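To make the difference concrete, here is a small sketch. The class name CodePointLength and the two helpers are made up for illustration; the library calls are String.length, String.codePointCount, String.codePointAt, Character.toChars and Character.isHighSurrogate. The sample character U+20000 is my own pick: a CJK ideograph that lies outside the Basic Multilingual Plane.

```java
public class CodePointLength {
    // Number of UTF-16 code units, which is what length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+20000, which needs a surrogate pair in UTF-16.
        String s = "a" + new String(Character.toChars(0x20000));

        System.out.println(codeUnits(s));   // 3
        System.out.println(codePoints(s));  // 2

        // charAt(1) returns only the first half of the pair ...
        System.out.println(Character.isHighSurrogate(s.charAt(1)));  // true
        // ... while codePointAt(1) reassembles the full code point.
        System.out.println(Integer.toHexString(s.codePointAt(1)));   // 20000
    }
}
```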