Provided by: libmsoffice-word-surgeon-perl_2.10-1_all 

NAME
MsOffice::Word::Surgeon - tamper with the guts of Microsoft docx documents, with regexes
SYNOPSIS
my $surgeon = MsOffice::Word::Surgeon->new(docx => $filename);
# extract plain text
my $main_text = $surgeon->document->plain_text;
my @header_texts = map {$surgeon->part($_)->plain_text} $surgeon->headers;
# unlink fields
$surgeon->document->unlink_fields;
# reveal bookmarks
$surgeon->document->reveal_bookmarks(color => 'cyan');
# anonymize
my %alias = ('Claudio MONTEVERDI' => 'A_____', 'Heinrich SCHÜTZ' => 'B_____');
my $pattern = join "|", keys %alias;
my $replacement_callback = sub {
    my %args = @_;
    my $replacement = $surgeon->new_revision(
        to_delete  => $args{matched},
        to_insert  => $alias{$args{matched}},
        run        => $args{run},
        xml_before => $args{xml_before},
    );
    return $replacement;
};
$surgeon->all_parts_do(replace => qr[$pattern], $replacement_callback);
# save the result
$surgeon->overwrite; # or ->save_as($new_filename);
DESCRIPTION
Purpose
This module supports a few operations for inspecting or modifying contents in Microsoft Word documents in
'.docx' format -- hence the name 'surgeon'. Since a surgeon does not give life, there is no support
for creating fresh documents; if you have such needs, use one of the other packages listed in the "SEE
ALSO" section -- or use the companion module MsOffice::Word::Template.
Some applications for this module are:
• content extraction in plain text format;
• unlinking fields (equivalent to pressing Ctrl-Shift-F9 on the whole document);
• adding markers at bookmark start and end positions;
• regex replacements within text, for example for:
    • anonymization, i.e. replacement of names or addresses by aliases;
    • templating, i.e. replacement of special markup by contents coming from a data tree (see also
      MsOffice::Word::Template);
• insertion of generated images (for example barcodes) -- see "images" in
  MsOffice::Word::Surgeon::PackagePart;
• pretty-printing the internal XML structure.
The ".docx" format
The format of Microsoft ".docx" documents is described in
<http://www.ecma-international.org/publications/standards/Ecma-376.htm> and <http://officeopenxml.com/>.
An excellent introduction can be found at <https://www.toptal.com/xml/an-informal-introduction-to-docx>.
Another valuable source of documentation is <http://officeopenxml.com/WPcontentOverview.php>.
Internally, a document is a zipped archive, where the member named "word/document.xml" stores the main
document contents, in XML format.
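Because a ".docx" file is an ordinary ZIP archive, the main document XML can be inspected even without this module. The following is a minimal sketch using Archive::Zip directly; the helper name "docx_main_xml" is hypothetical and not part of any library.

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# Hypothetical helper (not part of MsOffice::Word::Surgeon):
# return the raw XML stored in the main document member of a .docx file.
sub docx_main_xml {
    my ($path) = @_;

    my $zip = Archive::Zip->new;
    $zip->read($path) == AZ_OK
        or die "cannot read '$path' as a ZIP archive";

    # contents() returns undef when the member does not exist
    my $xml = $zip->contents('word/document.xml');
    defined $xml
        or die "no word/document.xml member in '$path'";

    return $xml;    # raw bytes, not yet decoded from UTF-8
}
```

Note that the returned string is still UTF-8 encoded; the "xml_member" method described below takes care of decoding and encoding.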
Operating mode
The present module does not parse all details of the whole XML structure because it only focuses on text
nodes (those that contain literal text) and run nodes (those that contain text formatting properties).
All remaining XML information, for example for representing sections, paragraphs, tables, etc., is stored
as opaque XML fragments; these fragments are re-inserted at proper places when reassembling the whole
document after having modified some text nodes.
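The idea of touching only text nodes while carrying everything else along unchanged can be illustrated in a few lines of core Perl. This is only a simplified sketch of the principle, not the module's actual implementation (which also tracks runs and their formatting); the function name "map_text_nodes" is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical illustration: apply $callback to the literal contents of each
# <w:t> node and leave all surrounding XML untouched.
sub map_text_nodes {
    my ($xml, $callback) = @_;

    # <w:t> may carry attributes (e.g. xml:space="preserve"); the pattern
    # requires whitespace or '>' after "w:t", so <w:tab/> or <w:tbl> do not match.
    $xml =~ s{(<w:t(?:\s[^>]*)?>)(.*?)(</w:t>)}
             {$1 . $callback->($2) . $3}ges;

    return $xml;
}

my $xml = '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">Hello </w:t></w:r>'
        . '<w:r><w:tab/><w:t>world</w:t></w:r></w:p>';

# Only the text inside <w:t> nodes is uppercased; tags stay as they were
print map_text_nodes($xml, sub { uc shift }), "\n";
```

A real document often splits one visible word across several runs, which is why a plain regex over the raw XML is not sufficient in practice and why the module reassembles adjacent text before matching.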
METHODS
Constructor
new
my $surgeon = MsOffice::Word::Surgeon->new(docx => $filename_or_filehandle);
# or simply : ->new($filename);
Builds a new surgeon instance, initialized with the contents of the given filename or filehandle.
Accessors
docx
Path to the ".docx" file.
zip
Instance of Archive::Zip associated with this file.
parts
Hashref to MsOffice::Word::Surgeon::PackagePart objects, keyed by their part name in the ZIP file. There
is always a 'document' part. Other parts may be headers, footers, footnotes or endnotes.
document
Shortcut to "$surgeon->part('document')" -- the MsOffice::Word::Surgeon::PackagePart object corresponding
to the main document. See the "PackagePart" documentation for operations on part objects. Besides, the
following operations are supported directly as methods to the $surgeon object and are automatically
delegated to the "document" part : "contents", "original_contents", "indented_contents", "plain_text",
"replace".
headers
my @header_parts = $surgeon->headers;
Returns the ordered list of names of header members stored in the ZIP file.
footers
my @footer_parts = $surgeon->footers;
Returns the ordered list of names of footer members stored in the ZIP file.
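Since "headers" and "footers" return part names rather than part objects, each name must be passed to "part" before calling a part method. A short sketch, using only methods documented on this page; the helper name "header_footer_texts" is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: given a surgeon object, return a hashref mapping each
# header and footer part name to its plain text.
sub header_footer_texts {
    my ($surgeon) = @_;

    my %text_for;
    for my $name ($surgeon->headers, $surgeon->footers) {
        $text_for{$name} = $surgeon->part($name)->plain_text;
    }
    return \%text_for;
}

# Possible usage, assuming a file named 'report.docx' exists:
#   use MsOffice::Word::Surgeon;
#   my $surgeon = MsOffice::Word::Surgeon->new(docx => 'report.docx');
#   my $texts   = header_footer_texts($surgeon);
#   print "$_: $texts->{$_}\n" for sort keys %$texts;
```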
Other methods
part
my $part = $surgeon->part($part_name);
Returns the MsOffice::Word::Surgeon::PackagePart object corresponding to the given part name.
all_parts_do
my $result = $surgeon->all_parts_do($method_name => %args);
Calls the given method on all part objects. Results are accumulated in a hash, with part names as keys to
the results. This is mostly used to invoke the "replace" method of MsOffice::Word::Surgeon::PackagePart,
for example:
$surgeon->all_parts_do(replace => qr[$pattern], $replacement_callback, %replacement_args);
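The same mechanism works for any part method, not only "replace". The sketch below calls "plain_text" on every part and reports the length of each result. It assumes that the returned value is a hash reference keyed by part name, as the description above suggests; the helper name "text_lengths_by_part" is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: number of characters of plain text in each part.
sub text_lengths_by_part {
    my ($surgeon) = @_;

    # assumption: all_parts_do returns a hashref { part_name => result }
    my $result = $surgeon->all_parts_do('plain_text');

    return { map { $_ => length $result->{$_} } keys %$result };
}

# Possible usage, assuming a file named 'report.docx' exists:
#   use MsOffice::Word::Surgeon;
#   my $surgeon = MsOffice::Word::Surgeon->new(docx => 'report.docx');
#   my $lengths = text_lengths_by_part($surgeon);
#   printf "%-12s %d\n", $_, $lengths->{$_} for sort keys %$lengths;
```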
xml_member
my $xml = $surgeon->xml_member($member_name); # reading
# or
$surgeon->xml_member($member_name, $new_xml); # writing
Reads or writes the given member name in the ZIP file, with utf8 decoding or encoding.
save_as
$surgeon->save_as($docx_file_or_filehandle);
Writes the ZIP archive into the given file or filehandle.
overwrite
$surgeon->overwrite;
Writes the updated ZIP archive into the initial file. If the initial "docx" was given as a filehandle,
use the "save_as" method instead.
new_revision
my $xml = $surgeon->new_revision(
to_delete => $text_to_delete,
to_insert => $text_to_insert,
author => $author_string,
date => $date_string,
run => $run_object,
xml_before => $xml_string,
);
This method is syntactic sugar for instantiating the MsOffice::Word::Surgeon::Revision class and
returning XML markup for MsWord revisions (a.k.a. "tracked changes") generated by that class. Users can
then manually review those revisions within MsWord and accept or reject them. This is best used in
combination with the "replace" method: the replacement callback can call "$self->new_revision(...)" to
generate revision marks in the document.
Either "to_delete" or "to_insert" (or both) must be present. Other parameters are optional. The
parameters are :
to_delete
The string of text to delete (usually this will be the "matched" argument passed to the replacement
callback).
to_insert
The string of new text to insert.
author
A short string that will be displayed by MsWord as the "author" of this revision.
date
A date (and optional time) in ISO format that will be displayed by MsWord as the date of this
revision. The current date and time will be used by default.
run
A reference to the MsOffice::Word::Surgeon::Run object surrounding this revision. The formatting
properties of that run will be copied into the "<w:r>" nodes of the deleted and inserted text
fragments.
xml_before
An optional XML fragment to be inserted before the "<w:t>" node of the inserted text.
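As an illustration of the optional "author" and "date" parameters, the sketch below wraps a replacement in a tracked change with an explicit author and a UTC timestamp in ISO format. The helper name "replace_as_revision" and its argument order are hypothetical; only the calls to "new_revision" and "all_parts_do" come from this documentation.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: replace every match of $pattern by $new_text in all
# parts, recording each replacement as a tracked change attributed to $author.
sub replace_as_revision {
    my ($surgeon, $pattern, $new_text, $author) = @_;

    # ISO date and time, computed once so that all revisions share it
    my $date = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime);

    my $callback = sub {
        my %args = @_;
        return $surgeon->new_revision(
            to_delete  => $args{matched},
            to_insert  => $new_text,
            author     => $author,
            date       => $date,
            run        => $args{run},
            xml_before => $args{xml_before},
        );
    };

    return $surgeon->all_parts_do(replace => $pattern, $callback);
}

# Possible usage, assuming a file named 'contract.docx' exists:
#   use MsOffice::Word::Surgeon;
#   my $surgeon = MsOffice::Word::Surgeon->new(docx => 'contract.docx');
#   replace_as_revision($surgeon, qr/\bACME\b/, 'Initech', 'Legal dept');
#   $surgeon->save_as('contract_reviewed.docx');
```

Passing "run" and "xml_before" through unchanged, as in the SYNOPSIS, lets the inserted text keep the formatting of the run in which the match was found.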
Operations on parts
See the MsOffice::Word::Surgeon::PackagePart documentation for other operations on package parts,
including operations on fields, bookmarks or images.
SEE ALSO
The <https://metacpan.org/pod/Document::OOXML> distribution on CPAN also manipulates "docx" documents,
but with another approach : internally it uses XML::LibXML and XPath expressions for manipulating XML
nodes. The API has some intersections with the present module, but there are also some differences :
"Document::OOXML" has more support for styling, while "MsOffice::Word::Surgeon" has more flexible
mechanisms for replacing text fragments.
Other programming languages also have packages for dealing with "docx" documents; here are some
references :
<https://docs.microsoft.com/en-us/office/open-xml/word-processing>
The C# Open XML SDK from Microsoft
<http://www.ericwhite.com/blog/open-xml-powertools-developer-center/>
Additional functionalities built on top of the XML SDK.
<https://poi.apache.org>
An open source Java library from the Apache foundation.
<https://www.docx4java.org/trac/docx4j>
Another open source Java library, competitor to Apache POI.
<https://phpword.readthedocs.io/en/latest/>
A PHP library dealing not only with Microsoft OOXML documents but also with OASIS and RTF formats.
<https://pypi.org/project/python-docx/>
A Python library, documented at <https://python-docx.readthedocs.io/en/latest/>.
As far as I can tell, most of these libraries provide objects and methods that closely reflect the
complete XML structure : for example they have classes for paragraphs, styles, fonts, inline shapes, etc.
The present module is much simpler but also much more limited : it was optimised for dealing with the
text contents and offers no support for presentation or paging features. However, it has the rare
advantage of providing an API for regex substitutions within Word documents.
The MsOffice::Word::Template module relies on the present module, together with the Perl Template
Toolkit, to implement a templating system for Word documents.
AUTHOR
Laurent Dami, <dami AT cpan DOT org>
COPYRIGHT AND LICENSE
Copyright 2019-2024 by Laurent Dami.
This program is free software, you can redistribute it and/or modify it under the terms of the Artistic
License version 2.0.
perl v5.40.1 2025-05-16 MsOffice::Word::Surgeon(3pm)