Tuesday, August 25, 2015

[RR Python] Unicode in Python

Get the interpreter's default encoding (Python 2):
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
Play around with multibyte characters:
>>> msg = '今天天氣真好12345'
>>> msg
'\xe4\xbb\x8a\xe5\xa4\xa9\xe5\xa4\xa9\xe6\xb0\xa3\xe7\x9c\x9f\xe5\xa5\xbd12345'

>>> msgu = u'今天天氣真好12345'
>>> msgu
u'\u4eca\u5929\u5929\u6c23\u771f\u597d12345'
>>> print msg, msgu
今天天氣真好12345 今天天氣真好12345
Check their types:
>>> print type(msg), type(msgu)
<type 'str'> <type 'unicode'>
 
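The lengths of msg and msgu differ: len(msg) counts bytes (6 Chinese characters x 3 UTF-8 bytes each + 5 ASCII digits = 23), while len(msgu) counts characters (6 + 5 = 11).
>>> print len(msg), len(msgu)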
23 11
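
The two lengths above can be reproduced from a single Unicode string. This is only a minimal sketch, not part of the original session: the names text, data, char_count, byte_count and per_char are made up for illustration, and the code is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; all variable names here are hypothetical.

text = u'今天天氣真好12345'      # Unicode string: a sequence of code points
data = text.encode('utf-8')      # byte string: the UTF-8 encoding of text

char_count = len(text)           # number of code points
byte_count = len(data)           # number of bytes

# UTF-8 size of each character: 3 bytes per Chinese character here, 1 per digit
per_char = [len(ch.encode('utf-8')) for ch in text]

print('%d bytes, %d code points' % (byte_count, char_count))
```

Summing per_char gives the byte length, which is why the byte string is longer than the Unicode string even though both print the same text.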
msg is encoded in UTF-8. To verify this, decode it and compare the result with msgu: the two are identical.
>>> msg.decode('utf-8')
u'\u4eca\u5929\u5929\u6c23\u771f\u597d12345'
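
The same check can be written as an explicit round trip. This is a hedged sketch rather than the original session: the byte string is built with encode() instead of being typed at a UTF-8 terminal, and the names decoded, same and error are assumptions introduced for the example. It also shows what happens when the wrong codec is used.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; variable names are hypothetical.

msgu = u'今天天氣真好12345'
msg = msgu.encode('utf-8')       # stands in for the byte string typed at the prompt

decoded = msg.decode('utf-8')    # bytes -> Unicode, using the matching codec
same = (decoded == msgu)         # the round trip gives back the original text
print(same)

# Decoding with a codec that cannot represent these bytes fails loudly.
error = None
try:
    msg.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc                  # keep a reference for inspection
print(type(error).__name__)
```

The ASCII attempt fails because the first byte, 0xe4, is outside the 0-127 range that ASCII covers.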


References:
瞭解Unicode (Understanding Unicode)
Python Tutorial 第一堂(4)Unicode 支援、基本 I/O (Python Tutorial, Lesson 1 (4): Unicode Support and Basic I/O)