2012/02/28

the trouble with babylon

developing multi-language apps -- even webapps -- can be a pain in the rectum, to say the least.

when working with PHP, you have a multitude of choices:

0. language-dependent conditionals
pros: none. none. okay, maybe the fact that they are integrated within the code, but that's it.
cons: about everything. for one, bloated and clunky code.
usage scenario: DO NOT EVER DO THIS.

1. array-based dictionaries
pros: native solution, fast for smaller wordsets, easy to modify
cons: must keep track of item IDs, consumes a lot of processing resources above a certain wordset size, clumsy to use, has to have a proprietary implementation, editing tools limited to source code editing, dictionary must be present initially, hard to expand to a new language, new items must be added manually.
usage scenario: PHP apps that work anywhere, quick'n'dirty development.
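
to make option 1 concrete, here's a minimal sketch of an array-based dictionary. the layout (language code => item ID => string), the t() helper and the sample strings are all made up for illustration; every real project rolls its own variant, which is exactly the "proprietary implementation" con.

```php
<?php
// hypothetical dictionary layout: language code => (item ID => translated string)
$dictionary = [
    'en' => ['greeting' => 'hello, %s!', 'farewell' => 'goodbye'],
    'hu' => ['greeting' => 'szia, %s!'], // 'farewell' is not translated yet
];

// look up an item ID in the given language, falling back to a default
// language and finally to the raw ID, so a missing entry never breaks a page
function t(array $dictionary, string $lang, string $id, string $fallback = 'en'): string
{
    return $dictionary[$lang][$id] ?? $dictionary[$fallback][$id] ?? $id;
}

echo sprintf(t($dictionary, 'hu', 'greeting'), 'world'), "\n"; // translated
echo t($dictionary, 'hu', 'farewell'), "\n";                   // falls back to 'en'
echo t($dictionary, 'hu', 'no_such_item'), "\n";               // falls back to the ID
```

note how the code only ever sees item IDs: that's the "must keep track of item IDs" con in action.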

2/a. database-backend dictionaries (major SQL servers)
pros: easy to manage (you can write any frontend of your liking to edit them), easy to expand to a new language, easy to query.
cons: needs a database backend, needs to have a database and a table set, dictionary must be present initially, has to have a proprietary implementation, db queries can be a resource issue, new items must be added manually.
usage scenario: PHP apps that work in most environments and where you have a database server and where performance is not a big issue; also, apps where new expressions may be added dynamically to the dictionary.

2/b. SQLite-backend dictionary
pros: all of 2/a plus the SQLite db is just a file, so it's easy to move and set up.
cons: needs a PHP installation with SQLite capability, and while not as big a resource hog as a major DB server, with large wordsets it can slow things down; also, most of 2/a.
usage scenario: same as 2/a.

3. gettext native PHP support
pros: the most widely accepted and universal i18n (internationalization)/l10n (localization) toolkit; lets you write mostly native code with an initial language of your choice, and worry about translations later on; you can work with partial translations; supports singular and plural forms; a plethora of ready-made software to let you prep and edit translations, and even lets you outsource translation; very easy to add new expressions.
cons: requires a bit of ahead-planning (i.e. if you decide to use gettext, you must write code that utilizes the gettext functions), rigid resource file location scheme, a hassle to set up properly (environment variables, charsets, etc); if you use caching, it's lightning-fast, but any change to the dictionaries requires a soft-restart of the webserver hosting the PHP interpreter; if you eschew caching, it's slow and a resource hog. requires you to use poEdit, xgettext or an IDE that supports gettext strings extraction; context-based gettexting is not natively supported (seriously, why, PHP, why?!?!)
usage scenario: apps that have a LOT of strings to localize, apps that have outsourced translators, apps where the dictionary expands a lot by programming.
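
about that missing context support: a common workaround relies on the fact that compiled gettext catalogs key a contextual entry as context + EOT (\x04) + msgid. the sketch below emulates a pgettext-style call on top of any plain lookup. pgettext_via(), the injected $lookup callable and the tiny stand-in catalog are assumptions for illustration only; the lookup is injected so the sketch runs even without the gettext extension.

```php
<?php
// hypothetical helper: emulate pgettext() on top of any msgid lookup.
// $lookup stands in for PHP's native gettext().
function pgettext_via(callable $lookup, string $context, string $msgid): string
{
    // compiled catalogs key a contextual entry as "context" . EOT . "msgid"
    $key = $context . "\x04" . $msgid;
    $translated = $lookup($key);
    // gettext-style lookups return the key unchanged on a miss; in that case
    // fall back to the bare msgid, not the composite key
    return $translated === $key ? $msgid : $translated;
}

// stand-in catalog and lookup, for demonstration only
$catalog = ["menu\x04Open" => 'Megnyitás', "state\x04Open" => 'Nyitva'];
$lookup = function (string $key) use ($catalog): string {
    return $catalog[$key] ?? $key;
};

echo pgettext_via($lookup, 'menu', 'Open'), "\n";  // the menu command
echo pgettext_via($lookup, 'state', 'Open'), "\n"; // the state of a door
echo pgettext_via($lookup, 'other', 'Open'), "\n"; // no such context
```

with the gettext extension loaded, passing 'gettext' as the callable should give the same behaviour against a real catalog, but you still have to get the contextual strings extracted somehow, which is part of the hassle.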

4. php-gettext software library
pros: almost all the pros of the native PHP gettext implementation, and doesn't need soft restarts of the web server when the dictionary changes. also, less hassle with specifying languages and charsets; supports context-based gettexting.
cons: besides the caching and context problem, almost all of native gettext's hassles. performance can be an issue as it tries to wrangle the binary-form .mo files, and does not cache them. a lot of classes and files to include.
usage scenario: like gettext, but where context-gettexting or the caching-restart hassle is an issue.

needless to say, i wasn't too happy when i started developing my new framework (called wg5, there will be a lot of articles about it later) -- almost all solutions have a lot of cons that outweigh the pros. but never one to accept defeat and go for an uneasy compromise, i decided it was time for a third alternative to using gettext in php. ladies and gentlemen, please welcome the PDXMLang classes!

it's pure PHP5 OOP, so you can easily extend or override parts of it -- especially those which concern dictionary loading and caching. it's a single library file, and does not depend on any external resources -- except of course the dictionary .po files. also, it's only two classes, and one gets initialized only when there is no cache file/data available. it is a complete implementation of gettext features except for catalog-specific calls -- they are provided for compatibility, but behave exactly like the non-catalog-specific counterparts. namely, it provides the following methods: textdomain(), gettext(), _(), ngettext(), dgettext(), dngettext(), pgettext(), npgettext(), dpgettext(), dnpgettext().

it has its own system of path names, but being an OOP construct, you can easily override that for your purposes.

the main idea is that it interprets the .po files directly (therefore, no hassle with having to interpret binary .mo files) and creates a special hash-array representation of all the items. it then caches the result as either serialize()d or json_encode()d text -- to a file, but it's your choice if you want to cache it into memory with memcache or APC -- and in subsequent runs uses the cached data to initialize the dictionary.
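
to illustrate the idea (this is NOT the library's actual code), here's a minimal sketch of the parse-then-cache flow. parse_po(), load_dictionary(), the mtime-based cache check and the sample .po content are all assumptions made for the example; the parser only handles singular msgid/msgstr pairs, msgctxt and continuation lines.

```php
<?php
// hypothetical minimal .po parser: plural entries, comments and obsolete
// entries are ignored; contextual entries are keyed as context . EOT . msgid
function parse_po(string $po): array
{
    $dict = [];
    $blank = ['msgctxt' => null, 'msgid' => null, 'msgstr' => null];
    $entry = $blank;
    $field = null; // field that a continuation line ("...") appends to

    $store = function () use (&$dict, &$entry, $blank) {
        // skip the header (empty msgid) and entries without a singular msgstr
        if ($entry['msgid'] !== null && $entry['msgid'] !== ''
            && $entry['msgstr'] !== null && $entry['msgstr'] !== '') {
            $key = $entry['msgctxt'] === null
                ? $entry['msgid']
                : $entry['msgctxt'] . "\x04" . $entry['msgid'];
            $dict[$key] = $entry['msgstr'];
        }
        $entry = $blank;
    };

    foreach (preg_split('/\R/', $po) as $line) {
        $line = trim($line);
        if (preg_match('/^(msgctxt|msgid|msgstr)\s+"(.*)"$/', $line, $m)) {
            // a msgctxt, or a msgid following a complete entry, starts a new one
            if ($m[1] === 'msgctxt' || ($m[1] === 'msgid' && $entry['msgid'] !== null)) {
                $store();
            }
            $entry[$m[1]] = stripcslashes($m[2]);
            $field = $m[1];
        } elseif ($field !== null && preg_match('/^"(.*)"$/', $line, $m)) {
            $entry[$field] .= stripcslashes($m[1]);
        } else {
            $field = null; // comment, blank line, or unsupported keyword
        }
    }
    $store();
    return $dict;
}

// load from the cache if it is at least as new as the .po file, else re-parse
function load_dictionary(string $poFile, string $cacheFile): array
{
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($poFile)) {
        $cached = unserialize(file_get_contents($cacheFile), ['allowed_classes' => false]);
        if (is_array($cached)) {
            return $cached;
        }
    }
    $dict = parse_po(file_get_contents($poFile));
    file_put_contents($cacheFile, serialize($dict), LOCK_EX);
    return $dict;
}

// made-up sample catalog, written to a temp file for the demonstration
$po = <<<'PO'
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"

#: index.php:12
msgid "hello"
msgstr "szia"

msgctxt "menu"
msgid "Open"
msgstr "Megnyitás"
PO;

$poFile = tempnam(sys_get_temp_dir(), 'po');
$cacheFile = $poFile . '.cache';
file_put_contents($poFile, $po);

$first = load_dictionary($poFile, $cacheFile);  // parses the .po, writes the cache
$second = load_dictionary($poFile, $cacheFile); // served from the cache file
echo $first['hello'], "\n";

unlink($poFile);
unlink($cacheFile);
```

swapping serialize()/unserialize() for json_encode()/json_decode(), or the file for memcache or APC, only touches load_dictionary(), which is the point of keeping loading and caching overridable.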

it supports plural forms in almost all officially-supported (by gettext) languages, using native PHP code dependent on the language specification. also, it supports minor tweaks of the official implementation.
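
here's a rough sketch of what language-based plural selection in native PHP can look like. plural_index() and choose_form() are hypothetical names, not the library's API; the rules themselves are the usual gettext plural formulas for these languages, written out as plain PHP instead of being parsed from the .po header.

```php
<?php
// hypothetical helper: map a language code and a count to a plural form index
function plural_index(string $lang, int $n): int
{
    switch ($lang) {
        case 'ja': // one form only
            return 0;
        case 'fr': // 0 and 1 both take the singular
            return $n > 1 ? 1 : 0;
        case 'ru': // three forms
            if ($n % 10 === 1 && $n % 100 !== 11) {
                return 0;
            }
            if ($n % 10 >= 2 && $n % 10 <= 4 && ($n % 100 < 10 || $n % 100 >= 20)) {
                return 1;
            }
            return 2;
        default: // english-style: singular only for exactly 1
            return $n !== 1 ? 1 : 0;
    }
}

// pick the translated form for $n; $forms is ordered by plural index,
// and the last form is used if the translation has too few forms
function choose_form(array $forms, string $lang, int $n): string
{
    return $forms[plural_index($lang, $n)] ?? (string) end($forms);
}

$ru = ['%d файл', '%d файла', '%d файлов'];
foreach ([1, 2, 5, 11, 21, 22] as $n) {
    echo sprintf(choose_form($ru, 'ru', $n), $n), "\n";
}
```

hard-coding the rules per language is fast, at the price of ignoring whatever Plural-Forms line the .po file itself declares.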

the library also offers on-the-fly charset conversions using either mbstring or iconv. however, the whole thing is aimed at UTF-8, being the de-facto standard for charsets.
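
a minimal sketch of such a conversion step, assuming you want UTF-8 out and will take whichever of the two extensions happens to be loaded. to_utf8() is a hypothetical helper, not the library's method; mb_convert_encoding() and iconv() are the standard PHP functions.

```php
<?php
// hypothetical helper: convert $text from the $from charset to UTF-8 using
// mbstring if available, otherwise iconv; returns the input untouched if
// neither extension is loaded or the iconv conversion fails
function to_utf8(string $text, string $from): string
{
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($text, 'UTF-8', $from);
    }
    if (function_exists('iconv')) {
        $converted = iconv($from, 'UTF-8', $text);
        if ($converted !== false) {
            return $converted;
        }
    }
    return $text;
}

// "caf\xE9" is "café" encoded as ISO-8859-1
echo to_utf8("caf\xE9", 'ISO-8859-1'), "\n";
```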

all documentation is provided within the source file, phpdoc-style.

i suggest using poEdit for both strings extraction and translation work: it's free, it's multi-platform, and it kicks ass in many ways.

and, to be fair to my library, here's:

5. the subpar daemon's PDXMLang gettext replacement
pros: full gettext implementation without the hassles of the original regarding caching, non-caching, and path layout structure and environment settings; best behaves in UTF-8, which is standard; compact and well-documented code library; very fast after first caching; offers tweaks of behaviour; very extensible due to pure-OOP programming style; charset conversion on-the-fly; fast plural form implementation.
cons: slow initial read-in time for .po files; consumes a bit more memory than other gettext implementations (but not much); plural-form detection is based on language, not .po file spec.
usage scenario: as with other gettext usages, except it's much easier. :)

have fun using it, and drop me a line here if you like it, use it, or have any issues with it.
