Gravitonic
Andrei Zmievski

31-July-2006
My name is not really Andrei

Ryan Kennedy commented on the presentation I gave at OSCON; specifically, about the transliteration support in PHP 6. I wanted to follow up and explain exactly what it is and, unfortunately, what it is not.

Ryan was excited about the possibilities presented by transliteration, especially as it applies to representing foreign names in the reader's native script (think mail readers). This works really well for Japanese names:

echo str_transliterate("やまもと, のぼる", "Any", "Latin");

And the result is:

yamamoto, noboru

This is sweet, right? We get an approximate spelling of the foreign name and one could even attempt to pronounce it. But does it work for all script pairs?

echo str_transliterate("Tom Cruise", "Latin", "Cyrillic");

What do we get for this paragon of fame?

Том Цруисе

Hmm. If I had to reconstruct for you English speakers what that sounds like in Russian, it would be something close to TOM TSRU-EE-SEH. Probably not how he'd like to be known in Russia. What is the problem here?

The problem is that English orthography is defective. There is a disconnect between the orthography (spelling) of English and the phonemes (sounds) of the language. We've all seen this: each English letter may represent more than one sound (c can be [s] or [k]), and each English sound may be written by more than one letter ([f] can be f, or ph, or gh).

This plays havoc with transliteration, which is a literal mapping from one system of writing to another. Transliteration is supposed to be lossless and thus reversible. In order to achieve this, the mapping rules must represent each letter (glyph) of the source script as a separate glyph or a unique combination of glyphs in the target script. Transliteration knows nothing about the underlying sounds of the language and works only with the written forms. You can see how this is problematic when you come to a language like English.
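
To make the "literal mapping" point concrete, here is a minimal sketch in plain PHP (7.0 or later assumed). The letter table and the toy_transliterate() helper are hypothetical and cover only the letters in this example; they are not the rules str_transliterate() uses. The sketch shows how a letter-for-letter table reproduces the result above, and that flipping the table restores the original spelling.

```php
<?php
// Hypothetical letter table, just enough for this example; not ICU's rule set.
const LATIN_TO_CYRILLIC = [
    'T' => 'Т', 'C' => 'Ц',
    'o' => 'о', 'm' => 'м', 'r' => 'р', 'u' => 'у',
    'i' => 'и', 's' => 'с', 'e' => 'е',
];

// Replace each letter found in the table; anything else passes through unchanged.
function toy_transliterate(string $text, array $table): string
{
    return strtr($text, $table);
}

$cyr  = toy_transliterate("Tom Cruise", LATIN_TO_CYRILLIC);
// Every source letter maps to a distinct target letter, so the table can be flipped.
$back = toy_transliterate($cyr, array_flip(LATIN_TO_CYRILLIC));

echo $cyr, "\n";  // Том Цруисе
echo $back, "\n"; // Tom Cruise
```

The round trip works only because the table is one-to-one, and that same constraint is why the mapping cannot take pronunciation into account.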

What you really want is transcription, which maps the sounds of the source language to the script of the target language. This is fairly easy for an efficient language, one where the sounds have a one-to-one mapping to glyphs, and becomes progressively more difficult for less efficient languages. With English, the transcription process has to rely on a dictionary providing exact phonetic transcription of pretty much every word.
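
A dictionary-driven approach might look roughly like the sketch below (plain PHP 7.0 or later assumed). The two-entry dictionary and the toy_transcribe() helper are hypothetical, made up for illustration; a usable dictionary would need an entry for pretty much every word.

```php
<?php
// Hypothetical pronunciation dictionary: whole English words mapped to a
// Cyrillic spelling of how they sound.
$dictionary = [
    'tom'    => 'Том',
    'cruise' => 'Круз',
];

// Look up each space-separated word; words missing from the dictionary are
// left untouched in this sketch.
function toy_transcribe(string $text, array $dictionary): string
{
    $words = explode(' ', $text);
    foreach ($words as $i => $word) {
        $key = strtolower($word);
        $words[$i] = $dictionary[$key] ?? $word;
    }
    return implode(' ', $words);
}

echo toy_transcribe("Tom Cruise", $dictionary), "\n"; // Том Круз
```

Unlike a letter table, this mapping is not reversible in general, since different spellings can share a pronunciation.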

Still, transliteration works fairly well for a good number of script pairs since it attempts to map the letters of the source script to similar-sounding letters in the target one. The results mostly depend on how efficient the source language is (as with Russian->English). Transliteration rules can be customized, and if you're willing to live without the reversibility requirement, you can get fairly accurate representations. The str_transliterate() function in PHP 6 uses default built-in transliteration rules, but there will be a way to provide your own rules towards achieving this goal.
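
If you want to try the built-in rules on a PHP build that lacks str_transliterate(), the intl extension's Transliterator class wraps the same kind of ICU transforms. The sketch below assumes intl is loaded and that the "Any-Latin" transform ID is available; the to_latin() wrapper is a hypothetical helper, and it returns null instead of failing when intl is missing. No output is shown because it depends on the ICU version installed.

```php
<?php
// Hypothetical helper: transliterate any script to Latin via the intl extension.
// Returns null when intl is not loaded or the transform cannot be created.
function to_latin(string $text): ?string
{
    if (!class_exists('Transliterator')) {
        return null;
    }
    $tr = Transliterator::create('Any-Latin');
    if ($tr === null) {
        return null;
    }
    $out = $tr->transliterate($text);
    return $out === false ? null : $out;
}

var_dump(to_latin("やまもと, のぼる"));
```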

Hope this helps explain some of the issues concerning mapping one writing system to another. Until next time.

Posted at 11:14 | Permalink | PHP | Comments (8)
29-July-2006
Back from OSCON 2006

Just got back from OSCON, which was again in Portland this year. The conference was excellent, as always, and so were the events and extracurricular activities. The sheer variety of talks at OSCON is exciting and frustrating at the same time: exciting because I attended several talks that I would not get to hear at a more focused conference, and frustrating because of the time conflicts between these talks.

The slides from my own session on PHP 6 and Unicode are online now.

By the way, if you like books just a tiny little bit and happen to be in Portland, do yourself a favor: set aside a full day and visit Powell's. It is the world's largest independent used and new bookstore (covering, oh, a couple of city blocks) and has an amazing collection of books (including some very rare ones). You could literally lose your friends and family there and wander among the stacks for hours whilst salivating giddily over the titles on whatever topic your mind can imagine. And don't worry, there is always the coffee shop to come back to and get provisions to sustain yourself.

Posted at 21:24 | Permalink | PHP | Comments (3)
22-July-2006
Photos from Moscow

The photos from the Moscow part of my trip to Russia are now up on the site. It's taken longer than I expected to process them. This is partly because it was the first time I shot everything in RAW format in order to be able to adjust exposure and white balance more easily. An unintended consequence of this decision was that I spent way more time on each photo than I usually do, tweaking things to be just perfect, and doing this for a good portion of 600 photos just takes a while, even with the help of a WhiBal. I'll have to figure out a more streamlined approach to processing.

The photos from St. Petersburg will be coming up a bit later.

Posted at 10:50 | Permalink | Travel | Comments (2)
15-July-2006
PHP-GTK 2 Zeta (Yes, Zeta)

The PHP-GTK 2 Zeta release, the first one of the new architecture, is finally out. No, zeta is not a typo. What's a zeta? Well, it's a letter of the Greek alphabet. Why zeta? Because, a) we've gone through several "iterations" of this release without actually releasing anything, but more importantly, b) alpha and beta get all the glory in the software world, leaving the other Greek letters longing for a spot in the sunshine. So, zeta it is.

While preparing this release I looked at the very end of the NEWS file and realized that I have been working on this project for over 5 years now. That is a sobering thought, which reminds me that I need to go and set up my place for the party tonight in preparation for imbibing copious amounts of mood-altering substances better known as beers. Cheers.

Posted at 11:50 | Permalink | PHP | Comments (4)
Hot and .. Not So

Thought I'd throw a couple of fun links your way. The first one is a project that almost won the $5,000 prize at the last Mashup Camp. It presents, shall we say, an innovative approach to user validation combining so-called business with so-called pleasure. HotCaptcha gets a thumbs-up from me any day of the week.

The other is a supremely strong candidate for the title of the Worst Music Video Ever. It is an inspired effort that immediately induces a cringing expression on your face and fails to release you from its grip until (in my case) 3 hours later. Have fun.

Posted at 10:33 | Permalink | Humor | Comments (4)
13-July-2006
All the Little Pieces, or TextIterator in PHP 6

I have been working on the Unicode support in PHP for quite a while now and I figure that it is time to start talking about Unicode and I18N in general and specifically about some of the new features that PHP 6 will be bringing to the table.

First up is the new Swiss-army knife-like TextIterator class. The purpose of this class is to provide access to various text units in a generic fashion. Actually, I lied. TextIterator implements ICU's full boundary analysis API, so what it really gives you are the boundaries between the text units. A slight distinction, but well worth remembering. And what are these units, might you ask?

  • codepoints
  • combining sequences
  • characters (slightly different than combining sequences)
  • words
  • line breaks
  • sentences

As its name tells you, TextIterator also implements PHP's Iterator interface and thus can be used in such constructs as foreach(). As a quick example, let's go through a string and extract all words contained in it (skipping empty pieces). Using foreach() it is as simple as:

$str = "The quick brown fox jumped over the lazy dog.";
foreach (new TextIterator($str, TextIterator::WORD) as $num => $word) {
    if ($word[0] != " ") {
        printf("%d. %s\n", $num, $word);
    }
}

The result is:

0. The
2. quick
4. brown
6. fox
8. jumped
10. over
12. the
14. lazy
16. dog
17. .

Doing the same thing without foreach() is a bit more involved, but also more flexible. We'll print out the words along with their boundaries' offsets.

$it = new TextIterator($str, TextIterator::WORD);
$start = $it->first();
for ($end = $it->next(); $end != TextIterator::DONE; $start = $end, $end = $it->next()) {
    if ($str[$start] != " ") {
        printf("[%2d..%2d]  %s\n", $start, $end, substr($str, $start, $end - $start));
    }
}

And the result here:

[ 0.. 3]  The
[ 4.. 9]  quick
[10..15]  brown
[16..19]  fox
[20..26]  jumped
[27..31]  over
[32..35]  the
[36..40]  lazy
[41..44]  dog
[44..45]  .
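
For experimenting with the boundary idea without TextIterator, here is a rough stand-in in plain PHP (7.1 or later assumed). It uses the regex \b assertion, which only approximates ICU's word boundary analysis and will behave differently on non-ASCII text; the word_pieces() helper is hypothetical.

```php
<?php
// Split text at regex word boundaries and record each piece's start and end
// byte offsets. This approximates, but is not, ICU boundary analysis.
function word_pieces(string $text): array
{
    $parts = preg_split('/\b/', $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);
    $pieces = [];
    foreach ($parts as [$piece, $start]) {
        $pieces[] = [
            'text'  => $piece,
            'start' => $start,
            'end'   => $start + strlen($piece),
        ];
    }
    return $pieces;
}

$str = "The quick brown fox jumped over the lazy dog.";
foreach (word_pieces($str) as $p) {
    // Skip the whitespace pieces, as in the examples above.
    if ($p['text'][0] != " ") {
        printf("[%2d..%2d]  %s\n", $p['start'], $p['end'], $p['text']);
    }
}
```

For this ASCII sentence the printed offsets match the listing above; that agreement should not be expected in general.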

One thing worth mentioning is that, at least for now, accessing random offsets in the Unicode strings is somewhat slower than in the binary strings. So the foreach() approach ends up being faster and is the recommended way of accessing text units in a linear fashion.

What else can we do with boundary analysis? At any point we can retrieve the text element at the current boundary with the current() method. Continuing the example:

$it->first();
$word = $it->current();

will give you "The". We can move backward with the previous() method:

$it->last();     // positions iterator beyond the last character
$it->previous(); // moves back to the boundary before the current one
$word = $it->current();

gives you "." which is the last word in the text. If you want to move through multiple boundaries in the same call, just pass that number to next() and previous():

$it->first();
$it->next(4); // skip the first 4 boundaries and stop
$word = $it->current();

gives you "brown". You can check whether a certain offset is a boundary or not with isBoundary():

var_dump($it->isBoundary(10)); // true since 'brown' is at offset 10 and it's a boundary

Two more methods, following() and preceding(), allow you to locate a boundary immediately following or preceding the specified offset. This might be useful for doing ellipsis on a piece of text:

$limit = 25; // cut off at 25 chars or before
$offset = $it->preceding($limit);
echo substr($str, 0, $offset), "...\n";

gives "The quick brown fox ...". One more thing to note is that isBoundary(), following() and preceding() actually reposition the iterator to the located boundary.
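
The same ellipsis logic can be sketched in plain PHP (7.0 or later assumed) by collecting regex \b positions and keeping the last one that falls before the limit. Again, \b is only an approximation of ICU's boundaries, and the ellipsize() helper is hypothetical.

```php
<?php
// Cut $text at the last regex word boundary strictly before $limit bytes and
// append an ellipsis. Text that already fits is returned unchanged.
function ellipsize(string $text, int $limit): string
{
    if (strlen($text) <= $limit) {
        return $text;
    }
    preg_match_all('/\b/', $text, $matches, PREG_OFFSET_CAPTURE);
    $cut = 0;
    foreach ($matches[0] as $match) {
        $offset = $match[1]; // byte offset of this boundary
        if ($offset < $limit) {
            $cut = $offset;
        }
    }
    return substr($text, 0, $cut) . "...";
}

echo ellipsize("The quick brown fox jumped over the lazy dog.", 25), "\n";
// The quick brown fox ...
```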

TextIterator has a counterpart that does everything (well, almost) in reverse. It's called, wait for it.. ReverseTextIterator. It has the exact same API and can be used transparently where needed:

foreach (new ReverseTextIterator($str, TextIterator::WORD) as $num => $word) {
    if ($word[0] != " ") {
        printf("%d. %s\n", $num, $word);
    }
}

The result here is:

0. .
1. dog
3. lazy
5. the
7. over
9. jumped
11. fox
13. brown
15. quick
17. The

Last but not least, if you are really lazy and just want to get all the text pieces defined by the boundaries, TextIterator provides a convenient getAll() method:

$it = new TextIterator($str, TextIterator::WORD);
print_r($it->getAll());

With the expected result of:

Array
(
    [0] => The
    [1] =>  
    [2] => quick
    [3] =>  
    [4] => brown
    [5] =>  
    [6] => fox
    [7] =>  
    [8] => jumped
    [9] =>  
    [10] => over
    [11] =>  
    [12] => the
    [13] =>  
    [14] => lazy
    [15] =>  
    [16] => dog
    [17] => .
)

Performance has been an important consideration when designing TextIterator. It does a few optimization tricks internally that allow it to be much faster than using the offset operator, substr() or even word boundaries in regular expressions.

Hopefully, this has been a useful preview of an important new piece of functionality in PHP 6. Stay tuned for more to come.

Posted at 11:06 | Permalink | PHP | Comments (11)
11-July-2006
Yahoo! Looking for PHP Talent

Yahoo Site Operations Group needs a top-class PHP developer. Here is the official job description:

Yahoo's Site Operations Group is looking for a software engineer to build and support existing tools for the Operations groups. You will be responsible for the design, implementation and ongoing maintenance and operation of internal web-based applications targeted at an operations/service-engineering audience. If you thrive on fast-paced development, challenging projects, and being part of a fun and highly-skilled team, read on.

This position will require development of scalable database-driven web applications able to communicate with various backend systems via web services, direct database connections, etc. You should be comfortable with all aspects of the development cycle from requirements-gathering and specification design to implementation and ongoing development.

The ideal candidate will be comfortable working in a cross-functional environment as part of a team composed of project managers, senior engineers, database architects, tech writers, etc. You should be able to identify and fill gaps, find innovative solutions for design problems, work well independently with an eye for high-level objectives, and consistently deliver quality products in a timely fashion.

Required Qualifications
- 5+ years of industry experience building top quality web apps
- Proven expertise with PHP and MySQL
- Expert level UI skills (HTML, CSS, JavaScript)
- Knowledge of C/C++ is a big plus
- Experience with object oriented design and development (PHP/C++)
- Experience with data modeling and advanced query design, optimization, and benchmarking
- Web Services (REST, SOAP, XML_RPC, etc) experience
- Expertise in at least one Unix shell scripting language (bash, PHP CLI, Perl, etc)
- Understand software engineering life cycle, client/server application and Internet application architectures (TCP/IP, HTTP, etc)
- Significant experience with requirements analysis, design, coding, testing, documentation, and application maintenance
- Experience with Linux/Unix/BSD system administration and shell scripting, Apache, and basic database administration
- BS or MS in Computer Science
- Excellent self-motivating, multi-tasking, communication, and interpersonal skills

If that sounds like you, send me a resume and I'll forward it to the hiring manager.

Posted at 12:58 | Permalink | PHP | Comments (0)
10-July-2006
Sara joins Yahoo!

This may be known to some of you already, but Sara Golemon, author of runkit, classkit, ssh2, and other PECL packages as well as a regular contributor to PHP core, has started at Yahoo! today. She'll be working in the Search & Marketplace Group and will be a valuable addition to the team. Welcome, Sara!

Posted at 22:08 | Permalink | PHP | Comments (0)