Provided by: libmediawiki-dumpfile-perl_0.2.2-1_all
NAME
Parse::MediaWikiDump::Revisions - Object capable of processing dump files with multiple revisions per article
ABOUT
This object is used to access the metadata associated with a MediaWiki instance and provides an iterative interface for extracting the individual article revisions from a dump file. To guarantee that there is only a single revision per article, use the Parse::MediaWikiDump::Pages object instead.
SYNOPSIS
    use MediaWiki::DumpFile::Compat;

    $pmwd = Parse::MediaWikiDump->new;

    $revisions = $pmwd->revisions('pages-articles.xml');
    $revisions = $pmwd->revisions(\*FILEHANDLE);

    #print the title and id of each article inside the dump file
    while(defined($page = $revisions->next)) {
        print "title '", $page->title, "' id ", $page->id, "\n";
    }
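Because the loop in the synopsis yields one object per revision, the same title comes back once for every revision of that article. The following is a minimal sketch of counting revisions per title. The helper count_revisions and the way the script takes its argument are assumptions made for illustration and are not part of the module; only revisions, next and title come from this documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): call the given code
# reference until it returns undef and count how often each title occurs.
sub count_revisions {
    my ($next) = @_;
    my %count;

    while (defined(my $page = $next->())) {
        $count{ $page->title }++;
    }

    return \%count;
}

sub main {
    my ($file) = @_;

    # loaded at run time so that count_revisions can be reused without it
    require MediaWiki::DumpFile::Compat;

    my $revisions = Parse::MediaWikiDump->new->revisions($file);
    my $count     = count_revisions(sub { $revisions->next });

    binmode(STDOUT, ':utf8');

    for my $title (sort keys %$count) {
        print "$count->{$title}\t$title\n";
    }
}

# only do any work when a dump file was named on the command line
main(@ARGV) if @ARGV;
```

The hash holds one entry per distinct title, so memory use grows with the number of articles rather than with the number of revisions.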
METHODS
$revisions->new
    Open the specified MediaWiki dump file. If the single argument to this method is a string, it is used as the path of the file to open. If the argument is a reference to a filehandle, the contents are read from that filehandle.

$revisions->next
    Returns an instance of the next available Parse::MediaWikiDump::page object, or undef if there are no more articles left.

$revisions->version
    Returns the version number of the dump file format as a plain text string.

$revisions->sitename
    Returns a plain text string that is the name of the MediaWiki instance.

$revisions->base
    Returns the URL of the instance's main article as a string.

$revisions->generator
    Returns a string containing 'MediaWiki' and the version number of the instance that dumped this file. Example: 'MediaWiki 1.14alpha'

$revisions->case
    Returns a string describing the case sensitivity configured in the instance.

$revisions->namespaces
    Returns a reference to an array of array references. In each inner array the first element is the unique identifier of the namespace and the second element is a string holding the name of the namespace.

$revisions->namespaces_names
    Returns a reference to an array of strings, one element per namespace name.

$revisions->current_byte
    Returns the number of bytes that have been processed so far.

$revisions->size
    Returns the total size of the dump file in bytes.
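As an illustration of how the metadata and progress methods fit together, the sketch below prints the site information and namespaces, then reports progress while iterating. The helpers percent_done and namespace_map, the reporting interval of 1000 and the command line handling are assumptions made for this example and are not provided by the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): percentage processed,
# computed from the values of current_byte and size.
sub percent_done {
    my ($current, $size) = @_;
    return 0 unless $size;    # avoid dividing by zero
    return 100 * $current / $size;
}

# Hypothetical helper: turn the [id, name] pairs returned by
# namespaces into a hash reference mapping id to name.
sub namespace_map {
    my ($pairs) = @_;
    my %name_for;
    $name_for{ $_->[0] } = $_->[1] for @$pairs;
    return \%name_for;
}

sub main {
    my ($file) = @_;

    # loaded at run time so that the helpers above can be reused without it
    require MediaWiki::DumpFile::Compat;

    my $revisions = Parse::MediaWikiDump->new->revisions($file);

    binmode(STDERR, ':utf8');

    print STDERR "site:      ", $revisions->sitename,  "\n";
    print STDERR "base:      ", $revisions->base,      "\n";
    print STDERR "generator: ", $revisions->generator, "\n";
    print STDERR "format:    ", $revisions->version,   "\n";
    print STDERR "case:      ", $revisions->case,      "\n";

    my $names = namespace_map($revisions->namespaces);
    for my $id (sort { $a <=> $b } keys %$names) {
        print STDERR "namespace $id: '$names->{$id}'\n";
    }

    my $seen = 0;
    while (defined(my $page = $revisions->next)) {
        $seen++;
        next if $seen % 1000;    # report every 1000 revisions (arbitrary)
        printf STDERR "%d revisions read, %.1f%% of the file\n",
            $seen, percent_done($revisions->current_byte, $revisions->size);
    }
}

# only do any work when a dump file was named on the command line
main(@ARGV) if @ARGV;
```

Whether size reports a meaningful value when the dump is read from a filehandle rather than a path is not stated here, so percent_done returns 0 for a false size instead of dividing by it.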
EXAMPLE
Extract the article text of each revision of an article with a given title.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use MediaWiki::DumpFile::Compat;

    my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages";
    my $title = shift(@ARGV) or die "must specify an article title";
    my $pmwd = Parse::MediaWikiDump->new;
    my $dump = $pmwd->revisions($file);
    my $found = 0;

    binmode(STDOUT, ':utf8');
    binmode(STDERR, ':utf8');

    #this is the only currently known value but there could be more in the future
    if ($dump->case ne 'first-letter') {
        die "unable to handle any case setting besides 'first-letter'";
    }

    $title = case_fixer($title);

    while(my $revision = $dump->next) {
        if ($revision->title eq $title) {
            print STDERR "Located text for $title revision ", $revision->revision_id, "\n";

            #text returns a reference to the article text
            my $text = $revision->text;
            print $$text;

            $found = 1;
        }
    }

    #report failure through the exit status only when nothing was found
    unless ($found) {
        print STDERR "Unable to find article text for $title\n";
        exit 1;
    }

    exit 0;

    #removes any case sensitivity from the very first letter of the title
    #but not from the optional namespace name
    sub case_fixer {
        my $title = shift;

        #check for namespace
        if ($title =~ /^(.+?):(.+)/) {
            $title = $1 . ':' . ucfirst($2);
        } else {
            $title = ucfirst($title);
        }

        return $title;
    }
LIMITATIONS
Version 0.4
    This class was updated to support version 0.4 dump files from a MediaWiki instance, but it does not currently support any of the new information available in those files.