Andrei Zmievski
13-July-2006
All the Little Pieces, or TextIterator in PHP 6
I have been working on the Unicode support in PHP for quite a while now and I figure that it is time to start talking about Unicode and I18N in general and specifically about some of the new features that PHP 6 will be bringing to the table. First up is the new Swiss-army knife-like TextIterator class. The purpose of this class is to provide access to various text units in a generic fashion. Actually, I lied. TextIterator implements ICU's full boundary analysis API, so what it really gives you are the boundaries between the text units. A slight distinction, but well worth remembering. And what are these units, might you ask?
As its name tells you, TextIterator also implements PHP's Iterator interface and thus can be used in such constructs as foreach(). As a quick example, let's go through a string and extract all words contained in it (skipping empty pieces). Using foreach() it is as simple as:
$str = "The quick brown fox jumped over the lazy dog.";
The result is:
0. The
2. quick
4. brown
6. fox
8. jumped
10. over
12. the
14. lazy
16. dog
17. .
Doing the same thing without foreach() is a bit more involved, but also more flexible. We'll print out the words along with their boundaries' offsets.
$it = new TextIterator($str, TextIterator::WORD);
And the result here:
[ 0.. 3] The
[ 4.. 9] quick
[10..15] brown
[16..19] fox
[20..26] jumped
[27..31] over
[32..35] the
[36..40] lazy
[41..44] dog
[44..45] .
One thing worth mentioning is that, at least for now, accessing random offsets in the Unicode strings is somewhat slower than in the binary strings. So the foreach() approach ends up being faster and is the recommended way of accessing text units in a linear fashion.
What else can we do with boundary analysis? At any point we can retrieve the text element at the current boundary with the current() method. Continuing the example:
$it->first();
will give you "The". We can move backward with the previous() method:
$it->last(); // positions iterator beyond the last character
gives you "." which is the last word in the text. If you want to move through multiple boundaries in the same call, just pass that number to next() and previous():
$it->first();
gives you "brown". You can check whether a certain offset is a boundary or not with isBoundary():
var_dump($it->isBoundary(10)); // true since 'brown' is at offset 10 and it's a boundary
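For readers who want to experiment with the idea behind isBoundary() without TextIterator, here is a minimal sketch; it is not the class's implementation. It assumes that PCRE's \b is a close enough stand-in for ICU word boundaries on plain ASCII text, and word_boundaries() and is_boundary() are hypothetical helper names introduced only for illustration.

```php
<?php
// Approximation only: PCRE's \b is used here in place of ICU word-boundary
// analysis, so results can differ for non-ASCII text.

// Hypothetical helper: returns the byte offsets of every boundary, in order.
function word_boundaries(string $text): array
{
    $pieces = preg_split('/\b/', $text, -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);
    $offsets = [];
    foreach ($pieces as [$piece, $offset]) {
        $offsets[] = $offset;      // the start of each piece is a boundary
    }
    $offsets[] = strlen($text);    // the end of the text is the last boundary
    return $offsets;
}

// Hypothetical helper mirroring the idea behind isBoundary().
function is_boundary(string $text, int $offset): bool
{
    return in_array($offset, word_boundaries($text), true);
}

$str = "The quick brown fox jumped over the lazy dog.";
var_dump(is_boundary($str, 10)); // bool(true): 'brown' starts at offset 10
var_dump(is_boundary($str, 12)); // bool(false): offset 12 is inside 'brown'
```

Unlike the real method, this sketch recomputes the boundaries on every call and keeps no iterator position, which is fine for experimenting but not for performance.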
Two more methods, following() and preceding(), allow you to locate the boundary immediately following or preceding the specified offset. This might be useful for truncating a piece of text with an ellipsis:
$limit = 25; // cut off at 25 chars or before
gives "The quick brown fox ...". One more thing to note is that isBoundary(), following() and preceding() actually reposition the iterator to the located boundary.
TextIterator has a counterpart that does everything (well, almost) in reverse. It's called, wait for it... ReverseTextIterator. It has the exact same API and can be used transparently where needed:
foreach (new ReverseTextIterator($str, TextIterator::WORD) as $num => $word) {
The result here is:
0. .
1. dog
3. lazy
5. the
7. over
9. jumped
11. fox
13. brown
15. quick
17. The
Last but not least, if you are really lazy and just want to get all the text pieces defined by the boundaries, TextIterator provides a convenient getAll() method:
$it = new TextIterator($str, TextIterator::WORD);
With the expected result of:
Array
(
[0] => The
[1] =>
[2] => quick
[3] =>
[4] => brown
[5] =>
[6] => fox
[7] =>
[8] => jumped
[9] =>
[10] => over
[11] =>
[12] => the
[13] =>
[14] => lazy
[15] =>
[16] => dog
[17] => .
)
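The same list of pieces can be approximated in plain PHP without TextIterator. This is a rough sketch and not how getAll() works internally: it assumes PCRE's \b is an acceptable substitute for ICU word boundaries on ASCII text. It also shows why the counter in the first foreach() example jumps by two: the keys count pieces, not character offsets.

```php
<?php
// Approximation only: PCRE's \b stands in for ICU word-boundary analysis.
$str = "The quick brown fox jumped over the lazy dog.";

// Roughly what getAll() returns: every piece between adjacent boundaries.
$pieces = preg_split('/\b/', $str, -1, PREG_SPLIT_NO_EMPTY);

// Keys count pieces, not character offsets, so skipping the whitespace
// pieces makes the printed numbers jump by two until "dog" meets ".".
$lines = [];
foreach ($pieces as $num => $word) {
    if (trim($word) === '') {
        continue; // skip whitespace-only pieces
    }
    $lines[] = "$num. $word";
}
echo implode("\n", $lines), "\n";
```

There is no whitespace piece between "dog" and ".", so the final two entries are numbered 16 and 17.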
Performance has been an important consideration when designing TextIterator. It does a few optimization tricks internally that allow it to be much faster than using the offset operator, substr(), or even word boundaries in regular expressions.
Hopefully, this has been a useful preview of an important new piece of functionality in PHP 6. Stay tuned for more to come.
Posted at 11:06
Comments
PHP 6 is looking to be really awesome. There will probably be books with TextIterator as a chapter by itself.
Posted by SantosJ on July 13, 2006 06:19 PM
I remember reading about the performance hit that PHP 6 takes as a result of the implementation of Unicode strings and being worried that it would be a big stumbling block to wide-scale adoption. I think tools like TextIterator are really going to help offset these issues, given the internal optimizations you've mentioned, along with the new uses and techniques it will allow. On a slightly unrelated note, has anything of significance with regard to PHP 6's behavior changed since that internal document was released outlining the implementation of Unicode, amongst other things? I couldn't find the URL, but I remember it was hosted in a user directory on the PHP.net site.
Posted by Abu Hurayrah on July 13, 2006 10:09 PM
TextIterator is definitely going to help with certain usage patterns, but we will also do our best to make Unicode strings as fast as possible. The document you are referring to was probably the Unicode support design document. It now lives in the PHP tree: http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE?view=co
Posted by Andrei on July 13, 2006 10:15 PM
Actually, the document I was thinking about was in the format of meeting minutes. However, the document you've linked us to is far more relevant. :-D In general, with advanced features come initial performance hits. I highly doubt the performance issues that Unicode brings will outweigh its importance. With time, I'm sure it will hardly be worth mentioning, as we'll soon wonder how we lived without it! (P.S. You've got mail!)
Posted by Abu Hurayrah on July 13, 2006 10:48 PM
Hi Andrei, nice new functionality. Great work! The first example you give ends with "16. dog" while I would expect "17. .".
Thanks for catching the error, Tobias, I fixed the example.
Posted by Andrei on July 14, 2006 12:09 AM
Andrei, l0t3k
Posted by l0t3k on July 14, 2006 10:01 AM
l0t3k, I imagine that with all the flexibility that TextIterator brings with it, namely the code boundaries, it would be rather trivial to implement some of what you're referring to on top of the native PHP features, given that it seems almost any combination can be created from that list Andrei posted at the beginning.
Posted by Abu Hurayrah on July 14, 2006 12:08 PM
Andrei, an excellent introduction! It's good to see PHP is finally going to get multi-byte character support internally, something it's needed for a long time. I also like how you're abstracting the features out. Not forcing an end user to actually understand how an I18N string works will help push the adoption of the features. The less new work an end user has to do to utilize a feature, the more likely they will be to work with it, especially on large projects. There seems to be one case of inconsistent behavior in your posting that I hope you can help clarify. In every example shown, each word position is incremented by a $current + 2 algorithm. While being a bit perplexing to the Unicode uninitiated, it can easily be accepted without any questions. The algorithm for discovering a word position can be used easily anywhere within a code base. Reaching the '.' in the example sentence (and, I'm sure, other special characters) causes a break in the word position algorithm and introduces a new behavior that will force a deeper understanding of the string behaviors. But more importantly, it will make it more difficult to achieve random access into the strings. The reverse iterator has the same issue, only the impact becomes a case of throwing off the counting from even positions to odd positions. I fear both of these issues will result in a sluggish adoption of the I18N functionalities you're working so diligently on. Is it possible to force the offending characters into a continued algorithm pattern? For example, with the TextIterator, forcing the '.' to be at $num = 18 instead of 17, or in the ReverseIterator forcing the word 'dog' (and all subsequent bits) to be at $num = 2. What happens in the case of a sentence like "The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog." Is the last word at position 34, 35, or 36? Is the final period at 35, 36, or 38?
Posted by Dan Kalowsky on July 15, 2006 12:37 PM
Dan, I think, if I understood correctly, the $current + 2 algorithm is really $current + :space:. That is, the jumping over every other element is due to the fact that the words are separated by spaces. The reason there is no jump between "dog" and "full stop" is that, in that case, there is no space between "dog" and "full stop". This goes along with Andrei's explanation at the beginning where he mentions that TextIterator gives boundaries between text units.
Posted by Abu Hurayrah on July 16, 2006 02:50 PM
Dan, I am sorry for not being more clear about this.
foreach (new TextIterator($str, TextIterator::WORD) as $num => $word) {
In this piece of code, $num is the count of how many boundaries we've gone through rather than the offset of the word boundary. Since we're skipping empty words, the printed out count is increased by 2, until we get to the period. There is a well-defined boundary between the last letter of "dog" and the period, so we don't skip anything there and the count increases by one. Hope this helps.
Posted by Andrei on July 16, 2006 07:37 PM