Working with Unicode in Ruby
I hate it when useful stuff on the internet disappears. Fortunately I was able to retrieve this from the Google Cache. Originally posted at www.leftbee.net/articles/2006/08/03/working-with-unicode-in-ruby/.
Note: This guide is UTF-8 oriented and does not cover any other Unicode encodings.
I’m presenting this very brief guide to help Ruby users deal with Unicode, more specifically with UTF-8.
First of all you need to set up Ruby to work with UTF-8 strings:
$KCODE = 'u'
require 'jcode'
The first line sets the encoding the Ruby interpreter will use; in our case we use ‘u’, which stands for UTF-8.
The second line requires the Ruby library that deals with multibyte strings such as UTF-8 ones.
This library (jcode) extends String with some new methods such as:
- jcount
- jlength
- jsize (alias for jlength)
Let’s try counting the occurrences of “ü” in Ouvertüre using count and jcount:
irb(main):005:0> "Ouvertüre".count("ü")
=> 2
irb(main):006:0> "Ouvertüre".jcount("ü")
=> 1
As you can see, jcount gives the right answer (1), whereas count returns 2 because it counts the two bytes that encode ü in UTF-8.
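To see where that 2 comes from, here is a rough sketch that compares a byte-wise view of the string with a character-wise one. The helper names (byte_length, char_length, bytewise_count, charwise_count) are made up for this illustration and are not part of jcode; the sketch only uses String#unpack, so it does not depend on $KCODE or jcode at all.

```ruby
# Hypothetical helpers for illustration only; none of these come from jcode.

# Number of raw bytes in the string ("C*" unpacks every byte).
def byte_length(str)
  str.unpack("C*").length
end

# Number of UTF-8 characters ("U*" decodes UTF-8 into code points).
def char_length(str)
  str.unpack("U*").length
end

# Byte-wise count: how many bytes of str match any byte of chars.
def bytewise_count(str, chars)
  wanted = chars.unpack("C*")
  str.unpack("C*").select { |byte| wanted.include?(byte) }.length
end

# Character-wise count: how many code points of str match any code point of chars.
def charwise_count(str, chars)
  wanted = chars.unpack("U*")
  str.unpack("U*").select { |code| wanted.include?(code) }.length
end

word = "Ouvertüre"
puts byte_length(word)          # 10
puts char_length(word)          # 9
puts bytewise_count(word, "ü")  # 2
puts charwise_count(word, "ü")  # 1
```

The ü takes two bytes in UTF-8 (0xC3 0xBC), so anything that walks the string byte by byte sees it twice.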
This library also overrides some String methods:
- each_char
- chop, chop!
- delete, delete!
- squeeze, squeeze!
- succ, succ!
- tr, tr!
- tr_s, tr_s!
And adds a new one:
- mbchar?
Wondering what this one does?
Let’s find out:
irb(main):007:0> "Ouvertüre".mbchar?
=> 6
irb(main):008:0> "Schön".mbchar?
=> 3
Uh huh! It seems to indicate where the first multibyte char is: 7th place for the ü in Ouvertüre and 4th place for the ö in Schön, which matches the returned 6 and 3 if you consider that the index is zero-based.
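If you want to convince yourself of that reading, here is a small sketch that reproduces the same numbers by hand. The helper first_multibyte_offset is hypothetical and is not how jcode implements mbchar?; it simply looks for the first byte outside the ASCII range, on the assumption that the string is UTF-8.

```ruby
# Hypothetical helper, not jcode's implementation: returns the zero-based
# byte offset of the first byte above 0x7F (the start of a multibyte UTF-8
# character), or nil when the string is pure ASCII.
def first_multibyte_offset(str)
  str.unpack("C*").index { |byte| byte > 0x7F }
end

puts first_multibyte_offset("Ouvertüre")  # 6
puts first_multibyte_offset("Schön")      # 3
p first_multibyte_offset("Ruby")          # nil
```

Since every character before the ü and the ö is plain ASCII (one byte each), the byte offset and the character position are the same in these two words.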
What about using upcase and downcase with UTF-8 strings?
Let’s try:
irb(main):009:0> "Ouvertüre".upcase
=> "OUVERTüRE"
Ooops, doesn’t look right!
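One plausible reading of that result, offered as an assumption rather than a statement about Ruby's internals: upcase here only maps the ASCII letters a-z and leaves the bytes of ü alone. The sketch below reproduces the same output with plain String#tr; ascii_upcase is a made-up name for this illustration.

```ruby
# Hypothetical helper: upcases only the ASCII letters a-z, which is the
# behavior the irb output above suggests. Any other character is left as-is.
def ascii_upcase(str)
  str.tr("a-z", "A-Z")
end

puts ascii_upcase("Ouvertüre")  # OUVERTüRE
```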
Time to present a new friend:
It’s a Ruby gem and it’s called… tzashaaam: unicode. Easy eh?
Ok, so let’s require it and see what happens.
(Of course you will need to install the gem first, using gem install unicode.)
irb(main):010:0> require 'rubygems'
=> true
irb(main):011:0> require 'unicode'
=> true
(Make sure jcode has been loaded beforehand, otherwise the unicode gem will refuse to load.)
Now we have a few more methods to use:
- Unicode::downcase
- Unicode::upcase
- Unicode::normalize
Let’s see them working:
irb(main):012:0> Unicode.upcase "Ouvertüre"
=> "OUVERTÜRE"
Mmm, that’s better!
irb(main):013:0> Unicode.downcase "OUVERTÜRE"
=> "ouvertüre"
Great!
But… what’s the normalize method for?
Let’s say Unicode normalization is something out of the scope of this brief guide.
You can read all about it at unicode.org (you are asking for a severe headache though!)
Well, that’s all for now!
Posted by Ruben on Thursday, August 03, 2006
Copyright © 2006-2007 Ruben Nine. All Rights Reserved.
Thank you! This was a big help for our English/Spanish project.