Provided by: po4a_0.69-1_all

NAME

       po4a-gettextize - convert an original file (and its translation) to a PO file

SYNOPSIS

       po4a-gettextize -f fmt -m master.doc -l XX.doc -p XX.po

       (XX.po is the output, all others are inputs)

DESCRIPTION

       po4a (PO for anything) eases the maintenance of documentation translation using the
       classical gettext tools. The main feature of po4a is that it decouples the translation of
       content from its document structure.  Please refer to the page po4a(7) for a gentle
       introduction to this project.

       The po4a-gettextize script helps you convert your previously existing translations into
       a po4a-based workflow. This is only to be done once to salvage an existing translation
       while converting to po4a, not on a regular basis after the conversion of your project.
       This tedious process is explained in detail in Section 'Converting a manual translation
       to po4a' below.

       You must provide both a master file (e.g., the source in English) and an existing
       translated file (e.g., a previous translation attempt without po4a). If you provide more
       than one master or translation file, they will be used in sequence, but it may be easier
       to gettextize each page or chapter separately and then use msgmerge to merge all produced
       PO files. As you wish.
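
       The per-chapter approach can be sketched in Python as follows. This is only an
       illustration: it builds the po4a-gettextize command lines from the options documented
       on this page (-f, -m, -l, -p) without executing anything. The chapter names, the file
       names and the choice of the pod format are hypothetical.

```python
import shlex


def gettextize_command(fmt, master, localized, po):
    """Build the argument list of one po4a-gettextize call (nothing is run)."""
    return ["po4a-gettextize", "-f", fmt, "-m", master, "-l", localized, "-p", po]


# Hypothetical layout: one master/translation pair per chapter.
chapters = ["intro", "usage"]
commands = [
    gettextize_command("pod", f"{name}.pod", f"{name}.fr.pod", f"{name}.fr.po")
    for name in chapters
]

# Print the commands so that they can be reviewed before being run by hand.
for argv in commands:
    print(shlex.join(argv))
```

       Handling one chapter per command keeps any desynchronization confined to a small pair
       of files, which makes it easier to locate.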

       If the master document has non-ASCII characters, the new generated PO file will be in
       UTF-8. If the master document is completely in ASCII, the generated PO will use the
       encoding of the translated input document.
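
       The encoding rule above can be restated as a small Python function. This is a sketch
       of the rule as described here, not the code used by po4a; the function and parameter
       names are hypothetical.

```python
def output_charset(master_text, localized_charset):
    """Return the charset of the generated PO file, following the rule above."""
    if master_text.isascii():
        # Pure ASCII master: keep the encoding of the translated document.
        return localized_charset
    # Any non-ASCII character in the master forces UTF-8.
    return "UTF-8"
```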

OPTIONS

       -f, --format
           Format of the documentation you want to handle. Use the --help-format option to see
           the list of available formats.

       -m, --master
           File containing the master document to translate. You can use this option multiple
           times if you want to gettextize multiple documents.

       -M, --master-charset
           Charset of the file containing the document to translate.

       -l, --localized
           File containing the localized (translated) document. If you provided multiple master
           files, you may wish to provide multiple localized files by using this option more than
           once.

       -L, --localized-charset
           Charset of the file containing the localized document.

       -p, --po
           File where the message catalog should be written. If not given, the message catalog
           will be written to the standard output.

       -o, --option
           Extra option(s) to pass to the format plugin. See the documentation of each plugin for
           more information about the valid options and their meanings. For example, you could
           pass '-o tablecells' to the AsciiDoc parser, while the text parser would accept '-o
           tabs=split'.

       -h, --help
           Show a short help message.

       --help-format
           List the documentation formats understood by po4a.

       -k, --keep-temps
           Keep the temporary master and localized POT files built before merging.  This can be
           useful to understand why these files get desynchronized, leading to gettextization
           problems.

       -V, --version
           Display the version of the script and exit.

       -v, --verbose
           Increase the verbosity of the program.

       -d, --debug
           Output some debugging information.

       --msgid-bugs-address email@address
           Set the report address for msgid bugs. By default, the created POT files have no
           Report-Msgid-Bugs-To fields.

       --copyright-holder string
           Set the copyright holder in the POT header. The default value is "Free Software
           Foundation, Inc."

       --package-name string
           Set the package name for the POT header. The default is "PACKAGE".

       --package-version string
           Set the package version for the POT header. The default is "VERSION".

   Converting a manual translation to po4a
       po4a-gettextize synchronizes the master and localized files to extract their content into
       a PO file. The content of the master file gives the msgid while the content of the
       localized file gives the msgstr. This process is somewhat fragile: the Nth string of the
       translated file is supposed to be the translation of the Nth string in the original.
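
       The positional pairing can be pictured with the following Python sketch. It is a
       simplified model, not the actual po4a implementation: the function name and the
       dictionary layout are hypothetical, and each document is reduced to a flat list of
       strings.

```python
def pair_strings(master_strings, localized_strings):
    """Pair the Nth original string with the Nth translated string."""
    if len(master_strings) != len(localized_strings):
        raise ValueError("the two documents do not contain the same number of strings")
    # Every entry is marked fuzzy, since the pairing has not been reviewed yet.
    return [
        {"msgid": original, "msgstr": translation, "fuzzy": True}
        for original, translation in zip(master_strings, localized_strings)
    ]
```

       Because the pairing relies on position only, one extra or missing string shifts every
       pair that follows it.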

       Gettextization works best if you manage to retrieve the exact version of the original
       document that was used for translation. Even so, you may need to fiddle with both master
       and localized files to align their structure if it was changed by the original translator,
       so working on files' copies is advised.

       Internally, each po4a parser reports the syntactical type of each extracted string. This
       is how desynchronizations are detected during the gettextization.  In the example depicted
       below, it is very unlikely that the 4th string in translation (of type 'chapter') is the
       translation of the 4th string in original (of type 'paragraph'). It is more likely that a
       new paragraph was added to the original, or that two original paragraphs were merged
       together in the translation.

           Original         Translation

         chapter            chapter
           paragraph          paragraph
           paragraph          paragraph
           paragraph        chapter
         chapter              paragraph
           paragraph          paragraph
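
       The type comparison applied to the example above can be sketched in Python as follows.
       This is a simplified model assuming that each document is reduced to the list of its
       string types; the function name is hypothetical and the real detection in po4a may
       differ.

```python
def first_desync(master_types, localized_types):
    """Return the 1-based position of the first type mismatch, or None."""
    pairs = zip(master_types, localized_types)
    for position, (master_type, localized_type) in enumerate(pairs, start=1):
        if master_type != localized_type:
            return position
    if len(master_types) != len(localized_types):
        # One document simply has more strings than the other.
        return min(len(master_types), len(localized_types)) + 1
    return None


# Types of the strings in the example above.
original = ["chapter", "paragraph", "paragraph", "paragraph", "chapter", "paragraph"]
translation = ["chapter", "paragraph", "paragraph", "chapter", "paragraph", "paragraph"]

print(first_desync(original, translation))  # prints 4
```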

       po4a-gettextize will verbosely diagnose any structure desynchronization. When this
       happens, you should manually edit the files to add fake paragraphs or remove some content
       here and there until the structure of both files actually matches. Some tricks are given
       below to salvage the most of the existing translation while doing so.

       If you are lucky enough to have a perfect match in the file structures out of the box,
       building a correct PO file is a matter of seconds. Otherwise, you will soon understand why
       this process has such an ugly name :) Even so, gettextization often remains faster than
       translating everything again. I gettextized the French translation of the whole Perl
       documentation in one day despite the many synchronization issues. Given the amount of text
       (2 MB of original text), restarting the translation without first salvaging the old
       translations would have required several months of work. In addition, this grunt work is
       the price to pay to get the comfort of po4a. Once converted, the synchronization between
       master documents and translations will always be fully automatic.

       After a successful gettextization, the produced documents should be manually checked for
       undetected disparities and silent errors, as explained below.

       Hints and tricks for the gettextization process

       The gettextization stops as soon as a desynchronization is detected. When this happens,
       you need to edit the files as much as needed to re-align the files' structures.
       po4a-gettextize is rather verbose when things go wrong. It reports the strings that don't
       match, their positions in the text, and the type of each of them. Moreover, the PO file
       generated so far is dumped as gettextization.failed.po for further inspection.

       Here are some tricks to help you in this tedious process and ensure that you salvage the
       most of the previous translation:

       •   Remove all extra content of the translations, such as the section giving credits to
           the translators. They should be added separately to po4a as addenda (see po4a(7)).

       •   When editing the files to align their structures, prefer editing the translation if
           possible. Indeed, if the changes to the original are too intrusive, the old and new
           versions will not be matched during the first po4a run after gettextization (see
           below). Any unmatched translation will be dumped anyway.  That being said, you still
           want to edit the original document if it's too hard to get the gettextization to
           proceed otherwise, even if it means that one paragraph of the translation is dumped.
           The important thing is to get a first PO file to start with.

       •   Do not hesitate to kill any original content that would not exist in the translated
           version. This content will be automatically reintroduced afterward, when synchronizing
           the PO file with the document.

       •   You should probably inform the original author of any structural change in the
           translation that seems justified. Issues in the original document should be reported to
           the author. Fixing them in your translation only fixes them for a part of the
           community. Plus, it is impossible to do so when using po4a ;) But you probably want to
           wait until the end of the conversion to po4a before changing the original files.

       •   Sometimes, the paragraph contents do match, but not their types. Fixing it is rather
           format-dependent. In POD and man, it often comes from the fact that one of them
           contains a line beginning with a white space while the other does not.  In those
           formats, such paragraphs cannot be wrapped and thus become a different type. Just
           remove the space and you are fine. It may also be a typo in the tag name in XML.

           Likewise, two paragraphs may get merged together in POD when the separating line
           contains some spaces, or when there is no empty line between the =item line and the
           content of the item.

       •   Sometimes, the desynchronization message seems odd because the translation is attached
           to the wrong original paragraph. It is the sign of an undetected issue earlier in the
           process. Search for the actual desynchronization point by inspecting the file
           gettextization.failed.po that was produced, and fix the problem where it really is.

       •   Other issues may come from duplicated strings in either the original or translation.
           Duplicated strings are merged in PO files, with two references.  This constitutes a
           difficulty for the gettextization algorithm, which is a simple one-to-one pairing
           between the msgids of both the master and the localized files. It is however believed
           that recent versions of po4a deal properly with duplicated strings, so you should
           report any remaining issue that you may encounter.
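
       The effect of duplicated strings mentioned in the last item can be illustrated with a
       short Python sketch. It is a model of how a PO file stores one entry per distinct
       msgid, not po4a code; the references and the function name are hypothetical.

```python
def merge_duplicates(strings):
    """Group (reference, text) pairs by text, as a PO file does for msgids."""
    merged = {}
    for reference, text in strings:
        merged.setdefault(text, []).append(reference)
    return merged


# Three extracted strings, two of which are identical.
master = [("doc.pod:3", "NAME"), ("doc.pod:10", "Usage"), ("doc.pod:42", "NAME")]
entries = merge_duplicates(master)

for text, references in entries.items():
    print(text, references)
```

       Three strings yield only two entries here. If the translation does not repeat its
       strings in exactly the same places, both sides end up with a different number of
       entries and a naive one-to-one pairing no longer lines up.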

   Reviewing files produced by po4a-gettextize
       Any file produced by po4a-gettextize should be manually reviewed, even when the script
       terminates successfully. You should skim over the PO file, ensuring that the msgid and
       msgstr actually match. It is not necessary to ensure that the translation is perfectly
       correct yet, as all entries are marked as fuzzy translations anyway. You only need to
       check for obvious matching issues because badly matched translations will be dumped in
       subsequent steps while you want to salvage them.

       Fortunately, this step does not require mastering the target languages as you only want to
       recognize similar elements in each msgid and its corresponding msgstr. As a speaker of
       French, English, and some German myself, I can do this for all European languages at
       least, even if I cannot say one word of most of these languages. I sometimes manage to
       detect matching issues in non-Latin languages by looking at string length, phrase
       structures (does the number of question marks match?) and other clues, but I prefer
       when someone else can review those languages.
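
       Such clues can be turned into a rough screening helper, sketched below in Python. This
       is an illustration rather than a feature of po4a: the function name and the length
       ratio of 3 are arbitrary assumptions, and only the ASCII question mark is counted, so
       scripts that use another character for questions will produce false alarms.

```python
def suspicious(msgid, msgstr, max_ratio=3.0):
    """Flag a pair whose question-mark counts or lengths differ a lot."""
    if msgid.count("?") != msgstr.count("?"):
        return True
    shorter, longer = sorted((len(msgid), len(msgstr)))
    if shorter == 0:
        # An empty string paired with a non-empty one is always suspicious.
        return longer > 0
    return longer / shorter > max_ratio
```

       A flagged pair is only a candidate for a closer look. A human still has to decide
       whether the msgid and the msgstr really belong together.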

       If you detect a mismatch, edit the original and translation files as if po4a-gettextize
       reported an error, and try again. Once you have a decent PO file for your previous
       translation, back it up until you get po4a working correctly.

   Running po4a for the first time
       The easiest way to set up po4a is to write a po4a.conf configuration file, and use the
       integrated po4a program (po4a-updatepo and po4a-translate are deprecated). Please check
       the "CONFIGURATION FILE" Section in po4a(1) documentation for more details.

       When po4a runs for the first time, the current version of the master documents will be
       used to update the PO files containing the old translations that you salvaged through
       gettextization. This can take quite a long time, because many of the msgids from the
       gettextization do not exactly match the elements of the POT file built from the recent
       master files. This forces gettext to search for the closest one using a costly string
       proximity algorithm.  For example, the first run over the Perl documentation's French
       translation (5.5 MB PO file) took about 48 hours (yes, two days) while the subsequent ones
       only take seconds.
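
       The cost of this search can be pictured with the following Python sketch. It uses
       difflib from the Python standard library as a stand-in and is not the algorithm
       implemented by gettext; the function name and the 0.6 cutoff are assumptions made for
       the illustration.

```python
import difflib


def closest_msgid(new_msgid, old_msgids, cutoff=0.6):
    """Return the old msgid most similar to new_msgid, or None if none is close."""
    matches = difflib.get_close_matches(new_msgid, old_msgids, n=1, cutoff=cutoff)
    return matches[0] if matches else None


old_msgids = ["Install the package with apt.", "Read the manual first."]
print(closest_msgid("Install the package with apt-get.", old_msgids))
```

       Each msgid without an exact match has to be compared with every old msgid, so the work
       grows with the product of both counts. Once the PO file is in sync with the master
       documents, almost all msgids match exactly and this search is rarely needed.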

   Moving your translations to production
       After this first run, the PO files are ready to be reviewed by translators. All entries
       were marked as fuzzy in the PO file by po4a-gettextize, forcing their careful review
       before use. Translators should take each entry to verify that the salvaged translation
       actually matches the current original text, update the translation as needed, and remove the
       fuzzy markers.

       Once enough fuzzy markers are removed, po4a will start generating the translation files on
       disk, and you're ready to move your translation workflow to production. Some projects find
       it useful to rely on weblate to coordinate between translators and maintainers, but that's
       beyond po4a's scope.
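
       Progress during this review can be tracked with a sketch like the following, written
       in Python. It models each PO entry as a dictionary with a fuzzy flag, which is a
       hypothetical layout. The threshold of 80 is an assumption made for the illustration;
       po4a(1) documents the value that po4a actually applies.

```python
def translated_percent(entries):
    """Percentage of entries whose fuzzy marker has been removed."""
    if not entries:
        return 0.0
    done = sum(1 for entry in entries if not entry["fuzzy"])
    return 100.0 * done / len(entries)


def ready_for_output(entries, threshold=80.0):
    """True once enough entries have been reviewed (threshold is an assumption)."""
    return translated_percent(entries) >= threshold
```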

SEE ALSO

       po4a(1), po4a-normalize(1), po4a-translate(1), po4a-updatepo(1), po4a(7).

AUTHORS

        Denis Barbier <barbier@linuxfr.org>
        Nicolas François <nicolas.francois@centraliens.net>
        Martin Quinson (mquinson#debian.org)

COPYRIGHT AND LICENSE

       Copyright 2002-2022 by SPI, inc.

       This program is free software; you may redistribute it and/or modify it under the terms of
       GPL (see the COPYING file).