幺贰和叁: Unicode In Ruby 1.9

Thursday, April 9, 2009

Unicode In Ruby 1.9

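Ruby 1.9 has finally ditched the ugly jcode and offers a degree of Unicode support. Still, something feels off to me: the approach Ruby currently takes may well introduce new problems of its own.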

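String now has an encoding attribute, and some methods now work in units of characters rather than bytes. The default encoding for source files is US-ASCII, so if the code contains non-ASCII text such as Chinese, you must declare the encoding with a magic comment. If you would rather not write one, a BOM also works.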

# encoding: utf-8

s = '幺贰和叁'
puts s.encoding     # => UTF-8
puts s.length     # => 4
puts s.bytesize     # => 12
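
To make the character/byte split a bit more concrete, here is a small sketch that looks at the same string through a few String methods. The helper name describe is made up for illustration; the String methods it calls are standard ones.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the original post): collects character-level
# and byte-level views of one string.
def describe(str)
  {
    :encoding   => str.encoding.name,
    :chars      => str.chars.to_a,   # one entry per character
    :first_char => str[0],           # indexing is character-based
    :reversed   => str.reverse,      # reverses characters, not bytes
    :first_byte => str.getbyte(0),   # byte-level access is still available
    :bytesize   => str.bytesize
  }
end

p describe('幺贰和叁')
```

Methods such as length, [] and reverse follow the characters, while bytesize and getbyte still expose the underlying bytes.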

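Strings with different encodings can be compared directly, which means that from now on every string comparison has to take encoding into account. I suspect plenty of bugs will come out of this.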

# encoding: utf-8

s = '位'.encode('gbk')
t = 'λ'
puts s.bytes.to_a == t.bytes.to_a # => true
puts s == t # => false
puts s == t.force_encoding('gbk') # => true
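
One way to sidestep this, sketched here under the assumption that both strings can be transcoded to UTF-8, is to bring them to a common encoding before comparing. The helper name same_text? is made up for illustration.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the original post): compares two strings by
# content after transcoding both to UTF-8, instead of comparing raw bytes.
def same_text?(a, b)
  a.encode('utf-8') == b.encode('utf-8')
rescue EncodingError
  # Transcoding failed, so treat the strings as different.
  false
end

gbk_wei  = '位'.encode('gbk')
utf8_wei = '位'
lambda_  = 'λ'

p gbk_wei == utf8_wei             # => false, the encodings differ
p same_text?(gbk_wei, utf8_wei)   # => true, same character
p same_text?(gbk_wei, lambda_)    # => false, same bytes but different text
```

Unlike force_encoding, which only relabels the bytes, encode converts them, so the comparison is about the text rather than its byte representation.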

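Regular expressions are messier still. In my tests, with mismatched encodings "=~" always raised an error, while the match method only sometimes did.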

# encoding: utf-8

s = 'abc'.encode('gbk')
p s.match(/b/) # => #<MatchData "b">

s =~ /b/ # => incompatible encoding regexp match (UTF-8 regexp with GBK string) (Encoding::CompatibilityError)

s = '位'.encode('gbk')
p s.match('λ') # => incompatible encoding regexp match (UTF-8 regexp with GBK string) (Encoding::CompatibilityError)
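
Whether a match raises appears to depend on whether the pattern is tied to a particular encoding, which Regexp#fixed_encoding? reports. The following is only a sketch: the helper name safe_match is made up, and it assumes the string can be transcoded to the pattern's encoding.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the original post): transcodes the string to
# the pattern's encoding when the pattern is tied to one, then matches.
def safe_match(str, pattern)
  re = pattern.is_a?(Regexp) ? pattern : Regexp.new(pattern)
  str = str.encode(re.encoding) if re.fixed_encoding?
  re.match(str)
end

p(/b/.fixed_encoding?)   # => false, ASCII-only pattern
p(/λ/.fixed_encoding?)   # => true, pattern contains non-ASCII text

s = '位'.encode('gbk')
p safe_match(s, 'λ')     # => nil, the text really is different
p safe_match(s, '位')    # => #<MatchData "位">
```

Transcoding first turns the byte-level coincidence between GBK '位' and UTF-8 'λ' into an ordinary non-match instead of an exception.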

IO has gained encoding support as well, but it does not handle the BOM yet, so reading a file that starts with a BOM takes an extra step.

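File.open('utf8_with_bom.txt', 'r:utf-8') do |f|
  # Read the first character and push it back unless it is the BOM.
  # (c must be assigned before it is used, or Ruby parses it as a method call.)
  c = f.getc
  f.ungetc(c) unless c.nil? || c == "\uFEFF"
  # ...
end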
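
Later Ruby releases accept a "BOM|UTF-8" external encoding in the open mode, which skips a leading BOM when one is present. A minimal sketch, assuming a Ruby version that supports that mode; the helper name read_utf8 and the temporary file are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper (not from the original post): reads a UTF-8 file and
# lets Ruby drop a leading BOM, if any, via the "BOM|UTF-8" mode.
def read_utf8(path)
  File.open(path, 'r:BOM|UTF-8') { |f| f.read }
end

# Write a small file that starts with a BOM, then read it back.
Tempfile.create('utf8_with_bom') do |tmp|
  tmp.binmode
  tmp.write("\xEF\xBB\xBFhello")
  tmp.flush
  p read_utf8(tmp.path)   # => "hello"
end
```

With this mode the manual getc/ungetc step is no longer needed, and files without a BOM are read unchanged.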

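One of the big changes in Python 3.0 is the split between String and Byte: every String is Unicode. Maybe one day Ruby will go down the same old road as Python; I hope that is just me being a jinx.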

1 comment:

hannyu said...: Going down the old road is a good thing; May 6, 2009 16:00
