[notes] unicode

Excerpt from https://www.cnblogs.com/kingcat/archive/2012/10/16/2726334.html:

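So we can also understand it this way: Unicode uses the numbers from 0 to 65535 to represent all characters. The 128 numbers from 0 to 127 still represent exactly the same characters as ASCII. 65536 is 2 to the 16th power. This is the first step. The second step is how to convert the numbers from 0 to 65535 into strings of 0s and 1s that can be stored in the computer. There are naturally different ways to store them, which is why the UTFs (Unicode Transformation Formats) appeared, among them UTF-8 and UTF-16.

(The range 0 to 65535 covers only the Basic Multilingual Plane. Current Unicode defines code points up to U+10FFFF.)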
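
The two steps in the excerpt (character to number, then number to stored bits) can be sketched in Python 3. This is only an illustration: the helper name describe is hypothetical, and the third sample character was picked because its code point lies above 65535.

```python
def describe(ch):
    """Return the code point of ch and its UTF-8 / UTF-16 (big-endian) bytes as hex."""
    code_point = ord(ch)                  # step 1: character -> number
    utf8 = ch.encode("utf-8").hex()       # step 2: number -> bytes under UTF-8
    utf16 = ch.encode("utf-16-be").hex()  # step 2: number -> bytes under UTF-16
    return code_point, utf8, utf16

# "A" is ASCII, "\u4f60" is a CJK character, "\U0001F600" is outside the 16-bit range
for ch in ["A", "\u4f60", "\U0001F600"]:
    print(describe(ch))
```

The same number ends up as different byte sequences depending on the transformation format, which is why the encoding has to be known whenever text is read or written.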

Many problems can be caused by encoding, so it is essential to know how different languages handle encoding issues:

Java

Java uses UTF-16 internally for String and char, but the encoding still has to be named explicitly whenever strings are converted to or from bytes.


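import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // Read the JVM's default encoding. Calling System.setProperty("file.encoding", ...)
        // at runtime is not reliable; pass -Dfile.encoding=UTF-8 at JVM startup instead.
        String a = System.getProperty("file.encoding");
        System.out.println(a);

        // Conversion
        // Encode a String (UTF-16 internally) to UTF-8 bytes
        String string = "abc\u5639\u563b";
        byte[] utf8 = string.getBytes(StandardCharsets.UTF_8);
        // Decode the UTF-8 bytes back to a String
        string = new String(utf8, StandardCharsets.UTF_8);
    }
}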

Python

  • In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.
  • In Python 2, a string may be of type str (a byte string) or of type unicode. You can tell which with isinstance(s, str) or isinstance(s, unicode).
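
A minimal Python 3 sketch of the str/bytes distinction described above. The helper kind is hypothetical and exists only for this illustration.

```python
text = "\u4f60\u597d"            # str: a sequence of Unicode characters
raw = text.encode("utf-8")       # bytes: the raw encoded form

def kind(value):
    """Hypothetical helper: label a value as text or raw bytes."""
    if isinstance(value, str):
        return "text"
    if isinstance(value, (bytes, bytearray)):
        return "bytes"
    return "other"

print(kind(text), len(text))     # len() counts characters for str
print(kind(raw), len(raw))       # len() counts bytes for bytes
```

The two lengths differ because each of these characters takes three bytes in UTF-8.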
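# -*- coding: utf-8 -*-
# The source-encoding declaration above must be on the first or second line
# of the file. Python 3 assumes UTF-8 when it is absent.

# Check the default encoding
import sys
print(sys.getdefaultencoding())

# Conversion ("utf-8", "utf-16", "unicode-escape")
s = '你好'
ec = s.encode("utf-8")   # str -> bytes
dc = ec.decode("utf-8")  # bytes -> str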
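
The other codecs named in the comment above can be tried the same way. This is a short Python 3 sketch, and the variable names are chosen only for illustration.

```python
s = "\u4f60\u597d"

utf16 = s.encode("utf-16")            # generic "utf-16" prepends a byte order mark
escaped = s.encode("unicode-escape")  # ASCII-only bytes using backslash-u escapes

# Both encodings round-trip back to the original string
assert utf16.decode("utf-16") == s
assert escaped.decode("unicode-escape") == s

print(len(s.encode("utf-8")), len(utf16), escaped)
```

Decoding with a codec other than the one used for encoding either raises UnicodeDecodeError or silently yields the wrong characters, so the codec name should travel with the data.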