Working with Unicode in Ruby


I hate it when useful stuff on the internet disappears. Fortunately I was able to retrieve this from the Google Cache. Originally posted at www.leftbee.net/articles/2006/08/03/working-with-unicode-in-ruby/.

Note: This guide is UTF-8 oriented and does not cover any other Unicode encodings.

I’m presenting this very brief guide to help Ruby users deal with Unicode, more specifically with UTF-8.

First of all, you need to set up Ruby to work with UTF-8 strings:

$KCODE = 'u'
require 'jcode'

The first line tells the Ruby interpreter which character encoding to assume; in our case we use ‘u’, which stands for UTF-8.

The second line loads jcode, the Ruby library for dealing with multibyte strings such as UTF-8 ones.
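
If the same script might also run on a newer interpreter, the setup can be guarded. This is a minimal sketch, assuming Ruby 1.9 or later, where jcode is no longer part of the standard library and $KCODE no longer has any effect. legacy_ruby? is a hypothetical helper name, not something Ruby provides.

```ruby
# Hypothetical helper: true for interpreters older than Ruby 1.9.
# Versions are compared numerically, part by part ("1.8.7" -> [1, 8, 7]).
def legacy_ruby?(version = RUBY_VERSION)
  parts = version.split(".").map { |part| part.to_i }
  (parts <=> [1, 9]) < 0
end

# Only Ruby 1.8 needs (and ships) this setup.
if legacy_ruby?
  $KCODE = 'u'
  require 'jcode'
end
```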

This library (jcode) extends String with some new methods such as:

  • jcount
  • jlength
  • jsize (alias for jlength)
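
These j-prefixed methods exist because Ruby 1.8 strings are byte-oriented, so a character count and a byte count drift apart as soon as a multibyte character shows up. Here is a minimal sketch of that difference using only String#unpack, so it does not depend on jcode. byte_length and char_length are hypothetical helper names.

```ruby
# "C*" unpacks the raw bytes; "U*" decodes UTF-8 into code points.
def byte_length(text)
  text.unpack("C*").length
end

def char_length(text)
  text.unpack("U*").length
end

byte_length("Ouvertüre")  # 10, since ü takes two bytes in UTF-8
char_length("Ouvertüre")  # 9, the figure jlength is meant to report
```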

Let’s try counting the occurrences of “ü” in “Ouvertüre” using count and jcount:

irb(main):005:0> "Ouvertüre".count("ü")

=> 2
irb(main):006:0> "Ouvertüre".jcount("ü")

=> 1

As you can see, jcount gives the correct result (1), whereas count returns 2 because it counts bytes rather than characters.
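
Both results can be reproduced by hand: in UTF-8, ü is stored as the two bytes 0xC3 and 0xBC, and each of them is counted separately. The following is only a sketch of the idea, not how jcode implements jcount. count_bytewise and count_charwise are hypothetical names.

```ruby
# Count every byte of text that also appears among the bytes of target.
def count_bytewise(text, target)
  target_bytes = target.unpack("C*")
  text.unpack("C*").select { |byte| target_bytes.include?(byte) }.length
end

# Count every code point of text that also appears in target.
def count_charwise(text, target)
  target_points = target.unpack("U*")
  text.unpack("U*").select { |point| target_points.include?(point) }.length
end

count_bytewise("Ouvertüre", "ü")  # 2, one hit per byte of ü
count_charwise("Ouvertüre", "ü")  # 1
```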

This library also overrides some String methods:

  • each_char
  • chop, chop!
  • delete, delete!
  • squeeze, squeeze!
  • succ, succ!
  • tr, tr!
  • tr_s, tr_s!

And adds a new one:

  • mbchar?

Wondering what this one does?

Let’s find out:

irb(main):007:0> "Ouvertüre".mbchar?

=> 6
irb(main):008:0> "Schön".mbchar?
=> 3

Uh huh! It seems to indicate where the first multibyte char is: the ü in Ouvertüre sits in 7th place and the ö in Schön in 4th, which matches the results 6 and 3 once you consider that the index is zero-based.
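
The same lookup can be sketched without jcode by decoding the string into code points and searching for the first one outside the ASCII range. This assumes Ruby 1.8.7 or later (Array#index with a block). first_multibyte_index is a hypothetical name, and the sketch only mirrors the observed results rather than jcode's implementation.

```ruby
# Hypothetical helper: zero-based index of the first non-ASCII code point,
# or nil when every character is plain ASCII.
def first_multibyte_index(text)
  text.unpack("U*").index { |code_point| code_point > 0x7F }
end

first_multibyte_index("Ouvertüre")  # 6
first_multibyte_index("Schön")      # 3
first_multibyte_index("Ruby")       # nil
```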

What about using upcase and downcase with UTF-8 strings?
Let’s try:

irb(main):009:0> "Ouvertüre".upcase

=> "OUVERTüRE"

Ooops, doesn’t look right!
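
What happened here is that upcase in Ruby 1.8 effectively only knows the ASCII letters a to z and passes everything else through untouched. A rough equivalent can be sketched with String#tr. ascii_only_upcase is a hypothetical name.

```ruby
# Map only the ASCII range a-z to A-Z; any other character is left alone.
def ascii_only_upcase(text)
  text.tr('a-z', 'A-Z')
end

ascii_only_upcase("Ouvertüre")  # "OUVERTüRE", the same result as above
```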

Time to present a new friend:

It’s a Ruby gem and it’s called… tzashaaam: unicode. Easy eh?
Ok, so let’s require it and see what happens.

(Of course, you will need to install the gem first: gem install unicode)

irb(main):010:0> require 'rubygems'

=> true
irb(main):011:0> require 'unicode'

=> true

(Make sure jcode has been loaded beforehand; otherwise the unicode gem will refuse to load.)

Now we have a few more methods to use:

  • Unicode::downcase
  • Unicode::upcase
  • Unicode::normalize_C, Unicode::normalize_D, Unicode::normalize_KC, Unicode::normalize_KD

Let’s see them working:

irb(main):012:0> Unicode.upcase "Ouvertüre"
=> "OUVERTÜRE"

Mmm, that’s better!

irb(main):013:0> Unicode.downcase "OUVERTÜRE"

=> "ouvertüre"

Great!
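
For code that should keep working when the gem is not installed, the calls can be wrapped. This is a minimal sketch, assuming that falling back to the built-in String#upcase and String#downcase is acceptable (those only handle non-ASCII letters on Ruby 2.4 and later). unicode_gem_loaded?, upcase_utf8 and downcase_utf8 are hypothetical names.

```ruby
begin
  require 'unicode'
rescue LoadError
  # Gem not installed: the helpers below fall back to String's own methods.
end

# True only when the unicode gem was actually loaded.
def unicode_gem_loaded?
  !!(defined?(Unicode) && Unicode.respond_to?(:upcase))
end

def upcase_utf8(text)
  unicode_gem_loaded? ? Unicode.upcase(text) : text.upcase
end

def downcase_utf8(text)
  unicode_gem_loaded? ? Unicode.downcase(text) : text.downcase
end
```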

But… what’s the normalize method for?

Let’s just say Unicode normalization is outside the scope of this brief guide.
You can read all about it at unicode.org (you are asking for a severe headache though!)
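
Without going into the details, the problem normalization addresses fits in a few lines: the same visible character can be stored as different code point sequences, and such strings do not compare equal. Here is a minimal sketch using only Array#pack and String#unpack. The helper names are hypothetical.

```ruby
# ü as a single precomposed code point (U+00FC).
def composed_u_umlaut
  [0x00FC].pack("U*")
end

# ü as a plain u (U+0075) followed by a combining diaeresis (U+0308).
def decomposed_u_umlaut
  [0x0075, 0x0308].pack("U*")
end

composed_u_umlaut == decomposed_u_umlaut  # false, although both display as ü
composed_u_umlaut.unpack("U*").length     # 1
decomposed_u_umlaut.unpack("U*").length   # 2
```

Normalization converts both spellings to one agreed form so that they compare equal.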

Well, that’s all for now!

Posted by Ruben on Thursday, August 03, 2006

Copyright © 2006-2007 Ruben Nine. All Rights Reserved.

2 Comments

  1. Comment by Charles Forcey on 2010-03-04 7:21 pm

    Thank you! This was a big help for our English/Spanish project.

  2. Comment by Arnaud Meuret on 2010-11-17 9:24 am

    Very useful indeed. Thanks for sharing. Tipped this on TipTheWeb !

    Additionally, I confirm that the Unicode gem works great with Ruby 1.9.2
