Provided by: libmce-perl_1.833-1_all
NAME
MCE::Grep - Parallel grep model similar to the native grep function
VERSION
This document describes MCE::Grep version 1.833
SYNOPSIS
    ## Exports mce_grep, mce_grep_f, and mce_grep_s
    use MCE::Grep;

    ## Array or array_ref
    my @a = mce_grep { $_ % 5 == 0 } 1..10000;
    my @b = mce_grep { $_ % 5 == 0 } [ 1..10000 ];

    ## File_path, glob_ref, or scalar_ref
    my @c = mce_grep_f { /pattern/ } "/path/to/file";
    my @d = mce_grep_f { /pattern/ } $file_handle;
    my @e = mce_grep_f { /pattern/ } \$scalar;

    ## Sequence of numbers (begin, end [, step, format])
    my @f = mce_grep_s { $_ % 3 == 0 } 1, 10000, 5;
    my @g = mce_grep_s { $_ % 3 == 0 } [ 1, 10000, 5 ];

    my @h = mce_grep_s { $_ % 3 == 0 } {
        begin => 1, end => 10000, step => 5, format => undef
    };
DESCRIPTION
This module provides a parallel grep implementation via Many-Core Engine. MCE incurs a small overhead due to passing of data. A fast code block will run faster natively. However, the overhead will likely diminish as the complexity of the code increases.

    my @m1 =     grep { $_ % 5 == 0 } 1..1000000;          ## 0.065 secs
    my @m2 = mce_grep { $_ % 5 == 0 } 1..1000000;          ## 0.194 secs

Chunking, enabled by default, greatly reduces the overhead behind the scenes. The time for mce_grep below also includes the time for data exchanges between the manager and worker processes. More parallelization will be seen when the code incurs additional CPU time.

    my @m1 =     grep { /[2357][1468][9]/ } 1..1000000;    ## 0.353 secs
    my @m2 = mce_grep { /[2357][1468][9]/ } 1..1000000;    ## 0.218 secs

Even faster is mce_grep_s, which is useful when the input data is a range of numbers. Workers generate sequences mathematically among themselves without any interaction from the manager process. Two arguments are required for mce_grep_s (begin, end). Step defaults to 1 if begin is smaller than end, otherwise -1.

    my @m3 = mce_grep_s { /[2357][1468][9]/ } 1, 1000000;  ## 0.165 secs

Although this document is about MCE::Grep, the MCE::Stream module can write results immediately without waiting for all chunks to complete. This is made possible by passing a reference to an array (in this case @m4 and @m5).

    use MCE::Stream default_mode => 'grep';

    my @m4; mce_stream \@m4, sub { /[2357][1468][9]/ }, 1..1000000;

    ## Completed in 0.203 secs. This is amazing considering the
    ## overhead for passing data between the manager and workers.

    my @m5; mce_stream_s \@m5, sub { /[2357][1468][9]/ }, 1, 1000000;

    ## Completed in 0.120 secs. Like with mce_grep_s, specifying a
    ## sequence specification turns out to be faster due to less
    ## overhead for the manager process.

A common scenario is grepping for pattern(s) inside a massive log file. Notice how parallelism increases as the complexity of the pattern increases. Testing was done against a 300 MB file containing 250k lines.

    use MCE::Grep;

    my @m; open my $LOG, "<", "/path/to/log/file" or die "$!\n";

    ## Rewind with seek($LOG, 0, 0) before re-reading the same handle.

    @m = grep { /pattern/ } <$LOG>;                      ##  0.756 secs
    @m = grep { /foobar|[2357][1468][9]/ } <$LOG>;       ## 24.681 secs

    ## Parallelism with mce_grep. This involves the manager process
    ## due to processing a file handle.

    @m = mce_grep { /pattern/ } <$LOG>;                  ##  0.997 secs
    @m = mce_grep { /foobar|[2357][1468][9]/ } <$LOG>;   ##  7.439 secs

    ## Even faster with mce_grep_f. Workers access the file directly
    ## with zero interaction from the manager process.

    my $log_path = "/path/to/file";

    @m = mce_grep_f { /pattern/ } $log_path;                 ##  0.112 secs
    @m = mce_grep_f { /foobar|[2357][1468][9]/ } $log_path;  ##  6.840 secs
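The default-step rule for mce_grep_s described above can be tried without MCE. The following is a minimal plain-Perl sketch, not MCE's implementation: expand_seq is a hypothetical helper that expands (begin, end [, step, format]) in a single process. It assumes the step defaults to 1 when begin is smaller than end and to -1 otherwise, and that the format, when given, is passed to sprintf.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MCE): expand a sequence specification.
sub expand_seq {
    my ($beg, $end, $step, $fmt) = @_;

    # Default step: 1 when counting up, otherwise -1.
    $step = ($beg < $end) ? 1 : -1 unless defined $step;
    die "step must not be zero\n" if $step == 0;

    # A negative ratio means the step points away from the end value.
    my $ratio = ($end - $beg) / $step;
    return () if $ratio < 0;

    # The small epsilon guards against floating-point rounding.
    my $steps = int($ratio + 1e-9);

    # Compute each item from its index to avoid accumulated error.
    return map {
        my $n = $beg + $_ * $step;
        defined $fmt ? sprintf($fmt, $n) : $n;
    } 0 .. $steps;
}

my @up   = expand_seq(1, 5);
my @down = expand_seq(5, 1);
my @fmt  = grep { /[1234]\.[5678]/ } expand_seq(10, 20, 0.1, "%4.1f");

print "@up\n";              # 1 2 3 4 5
print "@down\n";            # 5 4 3 2 1
print scalar(@fmt), "\n";   # 16
```

With mce_grep_s, each worker would derive its own portion of the sequence rather than building the whole list in one process as this sketch does.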
PARSING HUGE FILES
The MCE::Grep module cannot quickly determine whether a chunk contains a match, because it does not know the pattern inside the code block. Use the following snippet as a template to achieve better performance. Also, take a look at examples/egrep.pl, included with the distribution.

    use MCE::Loop;

    MCE::Loop::init {
        max_workers => 8, use_slurpio => 1
    };

    my $pattern  = 'karl';
    my $hugefile = 'very_huge.file';

    my @result = mce_loop_f {
        my ($mce, $slurp_ref, $chunk_id) = @_;

        ## Quickly determine if a match is found.
        ## Process slurped chunk only if true.

        if ($$slurp_ref =~ /$pattern/m) {
            my @matches;

            if ($^O ne 'MSWin32') {
                ## The following is fast on Unix. Performance degrades
                ## drastically on Windows beyond 4 workers.

                open my $MEM_FH, '<', $slurp_ref;
                binmode $MEM_FH, ':raw';
                while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); }
                close $MEM_FH;
            }
            else {
                ## Therefore, use the following construct on Windows.

                while ( $$slurp_ref =~ /([^\n]+\n)/mg ) {
                    my $line = $1; # save $1 to not lose the value
                    push @matches, $line if ($line =~ /$pattern/);
                }
            }

            ## Gather matched lines.

            MCE->gather(@matches);
        }

    } $hugefile;

    print join('', @result);
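The two-stage filter in the template above can be tried on its own. The following plain-Perl sketch uses nothing from MCE: the chunk strings and the filter_chunk helper are hypothetical and stand in for the slurped chunk that MCE would pass via $slurp_ref.

```perl
use strict;
use warnings;

my $pattern = 'karl';

# Hypothetical stand-ins for slurped chunks; each is a multi-line string.
my @chunks = (
    "alice\nbob\ncarol\n",
    "dave\nkarl one\nerin\nkarl two\n",
    "frank\ngrace\n",
);

# Return the matching lines of one chunk. The per-line scan is skipped
# entirely when the chunk as a whole cannot match.
sub filter_chunk {
    my ($slurp_ref) = @_;
    return () unless $$slurp_ref =~ /$pattern/m;    # cheap whole-chunk test

    my @matches;
    open my $mem_fh, '<', $slurp_ref or die "open: $!\n";  # in-memory handle
    while (my $line = <$mem_fh>) {
        push @matches, $line if $line =~ /$pattern/;
    }
    close $mem_fh;
    return @matches;
}

my @result = map { filter_chunk(\$_) } @chunks;
print @result;
```

Only the second chunk passes the whole-chunk test, so the first and third chunks are never split into lines. The saving grows with the share of chunks that contain no match at all.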
OVERRIDING DEFAULTS
The following lists the options that may be overridden when loading the module.

    use Sereal qw( encode_sereal decode_sereal );
    use CBOR::XS qw( encode_cbor decode_cbor );
    use JSON::XS qw( encode_json decode_json );

    use MCE::Grep
        max_workers => 4,                # Default 'auto'
        chunk_size => 100,               # Default 'auto'
        tmp_dir => "/path/to/app/tmp",   # $MCE::Signal::tmp_dir
        freeze => \&encode_sereal,       # \&Storable::freeze
        thaw => \&decode_sereal          # \&Storable::thaw
    ;

From MCE 1.8 onwards, Sereal 3.015+ is loaded automatically if available. Specify Sereal => 0 to use Storable instead.

    use MCE::Grep Sereal => 0;
CUSTOMIZING MCE
MCE::Grep->init ( options )
MCE::Grep::init { options }

The init function accepts a hash of MCE options. The gather option, if specified, is ignored because the module uses it internally.

    use MCE::Grep;

    MCE::Grep::init {
        chunk_size => 1, max_workers => 4,

        user_begin => sub {
            print "## ", MCE->wid, " started\n";
        },

        user_end => sub {
            print "## ", MCE->wid, " completed\n";
        }
    };

    my @a = mce_grep { $_ % 5 == 0 } 1..100;

    print "\n", "@a", "\n";

    -- Output

    ## 2 started
    ## 3 started
    ## 1 started
    ## 4 started
    ## 3 completed
    ## 4 completed
    ## 1 completed
    ## 2 completed

    5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
API DOCUMENTATION
MCE::Grep->run ( sub { code }, list )
mce_grep { code } list

Input data may be defined using a list or an array reference. Unlike MCE::Loop, Flow, and Step, specifying a hash reference as input data isn't allowed.

    my @a = mce_grep { /[2357]/ } 1..1000;
    my @b = mce_grep { /[2357]/ } \@list;

    my @z = mce_grep { /[2357]/ } \%hash;   # not supported

MCE::Grep->run_file ( sub { code }, file )
mce_grep_f { code } file

The fastest of these is the /path/to/file. Workers communicate the next offset position among themselves with zero interaction by the manager process.

    my @c = mce_grep_f { /pattern/ } "/path/to/file";  # faster
    my @d = mce_grep_f { /pattern/ } $file_handle;
    my @e = mce_grep_f { /pattern/ } \$scalar;

MCE::Grep->run_seq ( sub { code }, $beg, $end [, $step, $fmt ] )
mce_grep_s { code } $beg, $end [, $step, $fmt ]

Sequence may be defined as a list, an array reference, or a hash reference. The functions require both begin and end values to run. Step and format are optional. The format is passed to sprintf (% may be omitted below).

    my ($beg, $end, $step, $fmt) = (10, 20, 0.1, "%4.1f");

    my @f = mce_grep_s { /[1234]\.[5678]/ } $beg, $end, $step, $fmt;
    my @g = mce_grep_s { /[1234]\.[5678]/ } [ $beg, $end, $step, $fmt ];

    my @h = mce_grep_s { /[1234]\.[5678]/ } {
        begin => $beg, end => $end, step => $step, format => $fmt
    };

MCE::Grep->run ( sub { code }, iterator )
mce_grep { code } iterator

An iterator reference may be specified for input_data. Iterators are described under section "SYNTAX for INPUT_DATA" at MCE::Core.

    my @a = mce_grep { $_ % 3 == 0 } make_iterator(10, 30, 2);
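The make_iterator function used above is not defined in this document. A minimal sketch, assuming the simplest iterator convention (a closure that returns the next item on each call and an empty list once exhausted), might look like the following. The loop drains the iterator in a single process so that the snippet runs without MCE; consult MCE::Core for the exact contract MCE expects from an iterator.

```perl
use strict;
use warnings;

# Assumed shape of make_iterator: yields $first, $first + $step, ...
# and stops once the value would exceed $last.
sub make_iterator {
    my ($first, $last, $step) = @_;
    my $n = $first;

    return sub {
        return if $n > $last;    # exhausted: return nothing
        my $current = $n;
        $n += $step;
        return $current;
    };
}

# Drain the iterator by hand, then filter with the native grep.
my $iter = make_iterator(10, 30, 2);
my @items;
while (defined(my $item = $iter->())) {
    push @items, $item;
}

my @matches = grep { $_ % 3 == 0 } @items;
print "@matches\n";   # 12 18 24 30
```

The closure keeps its position in $n between calls, which is what lets a consumer pull items on demand instead of receiving the whole list up front.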
MANUAL SHUTDOWN
MCE::Grep->finish
MCE::Grep::finish

Workers remain persistent as much as possible after running. Shutdown occurs automatically when the script terminates. Call finish when workers are no longer needed.

    use MCE::Grep;

    MCE::Grep::init {
        chunk_size => 20, max_workers => 'auto'
    };

    my @a = mce_grep { ... } 1..100;

    MCE::Grep::finish;
INDEX
MCE, MCE::Core
AUTHOR
Mario E. Roy, <marioeroy AT gmail DOT com>