Provided by: po4a_0.67-2_all bug

NAME

       po4a-gettextize - convert an original file (and its translation) to a PO file

SYNOPSIS

       po4a-gettextize -f fmt -m master.doc [-l XX.doc] -p XX.po

       (XX.po is the output, all others are inputs)

DESCRIPTION

       po4a (PO for anything) eases the maintenance of documentation translation using the
       classical gettext tools. The main feature of po4a is that it decouples the translation of
       content from its document structure.  Please refer to the page po4a(7) for a gentle
       introduction to this project.

       The po4a-gettextize script is in charge of converting documentation files into PO files.
       You only need it to setup your translation project with po4a, never afterward.

       If you start from scratch, po4a-gettextize will extract the translatable strings from the
       documentation and write a POT file. If you provide a previously existing translated file
       with the -l flag, po4a-gettextize will try to use the translations that it contains in the
       produced PO file. This process remains tedious and manual, as explained in Section
       'Converting a manual translation to po4a' below.

       If the master document has non-ASCII characters, the new generated PO file will be in
       UTF-8. Else (if the master document is completely in ASCII), the generated PO will use the
       encoding of the translated input document, or UTF-8 if no translated document is provided.

OPTIONS

       -f, --format
           Format of the documentation you want to handle. Use the --help-format option to see
           the list of available formats.

       -m, --master
           File containing the master document to translate. You can use this option multiple
           times if you want to gettextize multiple documents.

       -M, --master-charset
           Charset of the file containing the document to translate.

       -l, --localized
           File containing the localized (translated) document. If you provided multiple master
           files, you may wish to provide multiple localized file by using this option more than
           once.

       -L, --localized-charset
           Charset of the file containing the localized document.

       -p, --po
           File where the message catalog should be written. If not given, the message catalog
           will be written to the standard output.

       -o, --option
           Extra option(s) to pass to the format plugin. See the documentation of each plugin for
           more information about the valid options and their meanings. For example, you could
           pass '-o tablecells' to the AsciiDoc parser, while the text parser would accept '-o
           tabs=split'.

       -h, --help
           Show a short help message.

       --help-format
           List the documentation formats understood by po4a.

           = item -k --keep-temps

           Keep the temporary master and localized POT files built before merging.  This can be
           useful to understand why these files get desynchronized, leading to gettextization
           problems

       -V, --version
           Display the version of the script and exit.

       -v, --verbose
           Increase the verbosity of the program.

       -d, --debug
           Output some debugging information.

       --msgid-bugs-address email@address
           Set the report address for msgid bugs. By default, the created POT files have no
           Report-Msgid-Bugs-To fields.

       --copyright-holder string
           Set the copyright holder in the POT header. The default value is "Free Software
           Foundation, Inc."

       --package-name string
           Set the package name for the POT header. The default is "PACKAGE".

       --package-version string
           Set the package version for the POT header. The default is "VERSION".

   Converting a manual translation to po4a
       po4a-gettextize will try to extract the content of any provided translation file, and use
       this content as msgstr in the produced PO file. Be warned that this process is very
       fragile: the Nth string of the translated file is supposed to be the translation of the
       Nth string in the original. This will naturally not work unless both files share exactly
       the same structure.

       Internally, each po4a parser reports the syntactical type of each extracted strings. This
       is how desynchronization are detected during the gettextization.  For example, if the
       files have the following structure, it is very unlikely that the 4th string in translation
       (of type 'chapter') is the translation of the 4th string in original (of type
       'paragraph'). It is more likely that a new paragraph was added to the original, or that
       two original paragraphs were merged together in the translation.

           Original         Translation

         chapter            chapter
           paragraph          paragraph
           paragraph          paragraph
           paragraph        chapter
         chapter              paragraph
           paragraph          paragraph

       po4a-gettextize will verbosely diagnose any detected structure desynchronization. When
       this happens, you should manually edit the files (this probably requires that you have
       some notions of the target language). You must add fake paragraphs or remove some content
       in one of the documents (or both) to fix the reported disparities, until the structure of
       both documents perfectly match. Some tricks are given in the next section.

       Even when the document is successfully processed, undetected disparities and silent errors
       are still possible. That is why any translation associated automatically by
       po4a-gettextize is marked as fuzzy to require an manual inspection by humans. One has to
       check that each retrieved msgstr is actually the translation of the associated msgid, and
       not the string before or after.

       As you can see, the key here is to have the exact same structure in the translated
       document and in the original one. The best is to do the gettextization on the exact
       version of master.doc that was used for the translation, and only update the PO file
       against the latest master file once the gettextization was successful.

       If you are lucky enough to have a a perfect match in the file structures, building a
       correct PO file is a matter of seconds. Otherwise, you will soon understand why this
       process has such an ugly name :) But remember that this grunt work is the price to pay to
       get the comfort of po4a afterward. Once converted, the synchronization between master
       documents and translations will always be fully automatic.

       Even when things go wrong, gettextization often remains faster than translating everything
       again. I was able to gettextize the existing French translation of the whole Perl
       documentation in one day, even though the structure of many documents were desynchronized.
       That was more than two megabytes of original text (2 millions of characters): restarting
       the translation from scratch would have required several months of work.

   Hints and tricks for the gettextization process
       The gettextization stops as soon as a desynchronization is detected. In theory, it should
       probably be possible resynchronize the gettextization later in the documents using e.g.
       the same algorithm than the diff(1) utility. But a manual intervention would still be
       mandatory to manually match the elements that couldn't be automatically matched,
       explaining why automatic resynchronization is not implemented (yet?).

       When this happens, the whole game comes down to the alignment of these damn files'
       structures again through manual edits. po4a-gettextize is rather verbose about what went
       wrong when it happens. It reports the strings that don't match, their positions in the
       text, and the type of each of them. Moreover, the PO file generated so far is dumped as
       gettextization.failed.po for further inspection.

       Here are some other tricks to help you in this tedious process:

       •   Remove all extra content of the translations, such as the section giving credits to
           the translators. You can add them back in po4a afterward, using an addenda (see
           po4a(7)).

       •   If you need to edit the files to align their structures, you should prefer editing the
           translation if possible. Indeed, if the changes to the original are too intrusive, the
           old and new versions will not be matched during the PO update, and the corresponding
           translation will be dumped anyway. But do not hesitate to also edit the original
           document if required: the important thing is to get a first PO file to start with.

       •   Do not hesitate to kill any original content that would not exist in the translated
           version. This content will be automatically reintroduced afterward, when synchronizing
           the PO file with the document.

       •   You should probably inform the original author of any structural change in the
           translation that seems justified. Issues in the original document should reported to
           the author. Fixing them in your translation only fixes them for a part of the
           community. Plus, it is impossible to do so when using po4a ;)

       •   Sometimes, the paragraph content does match, but not their types. Fixing it is rather
           format-dependent. In POD and man, it often comes from the fact that one of them
           contains a line beginning with a white space while the other does not.  In those
           formats, such paragraph cannot be wrapped and thus become a different type. Just
           remove the space and you are fine. It may also be a typo in the tag name in XML.

           Likewise, two paragraphs may get merged together in POD when the separating line
           contains some spaces, or when there is no empty line between the =item line and the
           content of the item.

       •   Sometimes, the desynchronization message seems odd because the translation is attached
           to the wrong original paragraph. It is the sign of an undetected issue earlier in the
           process. Search for the actual desynchronization point by inspecting
           gettextization.failed.po, and fix the problem where it really is.

       •   In some case, po4a adds a space at the end of either the original or the translated
           strings. This is because every string must be deduplicated during the gettextize
           process. Imagine that a string appearing several times unmodified in the original, but
           is translated in differing way, or that different paragraphs are translated in the
           exact same way.

           Without deduplication, such case would break the gettexization algorithm, as it is a
           simple one to one pairing between the msgids of both the master and the localized
           files. Since one of the PO files would miss an entry (that would be reported as
           duplicate, with two references), the pairing would fail.

           Since po4a uses the entry type ("title" or "plain paragraph", etc) to detect whether
           the parsing streams got desynchronized, similar issues could occur if two identical
           entries (same content but differing type) of the master file are translated in the
           exact same way in the localized file. po4a would detect a fake desyncronization in
           such case.

           In most cases, the extra space added by po4a to deduplicate the strings has no impact
           on the formatting. Strings are fuzzied anyway, and msgmerge will probably match the
           strings accordingly afterward.

       •   As a final note, do not be too surprised if the first synchronization of your PO file
           takes a long time. This is because most of the msgid of the PO file resulting from the
           gettextization don't match exactly any element of the POT file built from the recent
           master files. This forces gettext to search for the closest one using a costly string
           proximity algorithm.

           For example, the first po4a-updatepo of the Perl documentation's French translation
           (5.5 MB PO file) took about 48 hours (yes, two days) while the subsequent ones only
           take a dozen of seconds.

SEE ALSO

       po4a(1), po4a-normalize(1), po4a-translate(1), po4a-updatepo(1), po4a(7).

AUTHORS

        Denis Barbier <barbier@linuxfr.org>
        Nicolas Francois <nicolas.francois@centraliens.net>
        Martin Quinson (mquinson#debian.org)

COPYRIGHT AND LICENSE

       Copyright 2002-2020 by SPI, inc.

       This program is free software; you may redistribute it and/or modify it under the terms of
       GPL (see the COPYING file).