Ubuntu Manpage: Catmandu::MARC::Tutorial - A documentation-only module for new users of Catmandu::MARC

name
synopsis
utf-8
reading
create a csv file of all issn numbers found at any marc field
transforming
writing
deduplication

kinetic (3) Catmandu::MARC::Tutorial.3pm.gz

Provided by: libcatmandu-marc-perl_1.271-1_all

NAME

       Catmandu::MARC::Tutorial - A documentation-only module for new users of Catmandu::MARC

SYNOPSIS

         perldoc Catmandu::MARC::Tutorial

UTF-8

   MARC8 and UTF-8
       The current Catmandu MARC tools are targetted for processing UTF-8 encoded files.  When
       you have MARC8 encoded data tools like MarcEdit <https://marcedit.reeset.net/> or
       "yaz-marcdump" <https://software.indexdata.com/yaz/doc/yaz-marcdump.html> can be used to
       create a UTF-8 encoded file:

          $ yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 marc21.raw > marc21.utf8.raw

   Unicode errors
       If you process UTF-8 encoded files which contain faulty characters, you will get a fatal
       error message like:

         utf8 "\xD8" does not map to Unicode at ...

       Use the iconv (libc6-dev Linux package) tool, to preprocess the data and discard faulty
       characters:

         $ iconv -c -f UTF-8 -t UTF-8 marc21.utf8.raw | catmandu convert MARC to JSON

   Convert a decomposed UTF-8 file to a combined UTF-8 file and vice versa
       For example, the character ae can be represented as

       "ae", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as "aX", that
       is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).

       The uconf (libicu-dev Linux package) tool can be used to convert these types of files:

           $ uconv -x any-nfc < decomposed.txt > combined.txt
           $ uconv -x any-nfd < combined.txt > decomposed.txt

READING

   Convert MARC21 records into JSON
       The command below converts file data.mrc into JSON:

          $ catmandu convert MARC to JSON < data.mrc

   Convert MARC21 records into MARC-XML
          $ catmandu convert MARC to MARC --type XML < data.mrc

   Convert UNIMARC records into JSON, XML, ...
       To read UNIMARC records use the RAW parser to get the correct character encoding.

          $ catmandu convert MARC --type RAW to JSON < data.mrc
          $ catmandu convert MARC --type RAW to MARC --type XML < data.mrc

   Create a CSV file containing all the titles
       To extract data from a MARC record on needs a Fix routine. This is a small language to
       manipulate data. In the example below we extract all 245 fields from MARC:

          $ catmandu convert MARC to CSV --fix 'marc_map(245,title); retain(title)' < data.mrc

       The Fix "marc_map" puts the MARC 245 field in the "title" field.  The Fix "retain" makes
       sure only the title field ends up in the CSV file.

   Create a CSV file containing only the 245$a and 245$c subfields
       The "marc_map" Fix can get one or more subfields to extract from MARC:

          $ catmandu convert MARC to CSV --fix 'marc_map(245ac,title); retain(title)' < data.mrc

   Create a CSV file which contains a repeated field
       In the example below the 650a field can be repeated in some marc records.  We will join
       all the repetitions in an comma delimited list for each record.

          $ catmandu convert MARC to CSV --fix 'marc_map(650a,subject,join:","); retain(subject)' < data.mrc

   Create a list of all ISBN numbers in the data
       In the previous example we saw how all subjects can be printed using a few Fix commands.
       When a subject is repeated in a record, it will be written on one line joined by a comma:

           subject1
           subject2, subject3
           subject4

       In the example over record 1 contained 'subject1', record 2 'subject2' and 'subject3' and
       record 3 'subject4'. What should we use when we want a list of all values in a long list?

       In the example below we'll print all ISBN numbers in a batch of MARC records in one long
       list using the Text exporter:

         $ catmandu convert MARC to Text --field_sep "\n" --fix 'marc_map(020a,isbn.\$append); retain(isbn)' < data.mrc

       The first new thing is the $append in the marc_map. This will create in "isbn" a list of
       all ISBN numbers found in the "020a" field. Because "$" signs have a special meaning on
       the command line they need to be escaped with a backslash "\". The "Text" exporter with
       the "field_sep" option will make use all the list in the "isbn" field are written on a new
       line.

   Create a list of all unique ISBN numbers in the data
       Given the result of the previous command, it is now easy to create a unique list of ISBN
       numbers with the UNIX "uniq" command:

        $ catmandu convert MARC to Text --field_sep "\n" --fix 'marc_map(020a,isbn.\$append); retain(isbn)' < data.mrc | uniq

   Create a list of the number of subjects per record
       We will create a list of subjects (650a) and count the number of items in this list for
       each record. The CSV file will contain the "_id" (record identifier) and "subject" the
       number of 650a fields.

       Writing all Fixes on the command line can become tedious. In Catmandu it is possible to
       create a Fix script which contains all the Fix commands.

       Open a text editor and create the "myfix.fix" file with content:

           marc_map(650a,subject.$append)
           count(subject)
           retain(_id, subject)

       And execute the command:

          $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc

   Create a list of all ISBN numbers for records with type 920a == book
       In the example we need an extra condition for match the content of the 920a field against
       the string "book".

       Open a text editor and create the "myfix.fix" file with content:

           marc_map(020a,isbn.$append)
           marc_map(920a,type)

           select all_match(type,"book") # select only the books
           select exists(isbn)           # select only the records with ISBN numbers

           retain(isbn)                  # only keep this field

       All the text after the "#" sign are inline code comments.

       And run the command:

           $ catmandu convert MARC to Text --field_sep "\n" --fix myfix.fix < data.mrc

   Show which MARC record don't contain a 900a field matching some list of values
       First we need to create a list of keys that need to be matched against our MARC records.
       In the example below we create a CSV file with a "key" , "value" header and all the keys
       that are OK:

           $ cat mylist.txt
           key,value
           book,OK
           article,OK
           journal,OK

       Next we create a Fix script that maps the MARC 900a field to a field called "type". This
       "type" field we lookup in the "mylist.txt" file. If a match is found, then the "type"
       field will contain the value in the list (OK). When no match is found then the "type" will
       contain the original value. We reject all records that have OK as "type" and keep only the
       ones that weren't matched in the file.

       Open a text editor and create the "myfix.fix" file with content:

           marc_map(900a,type)

           lookup(type,'/tmp/mylist.txt')

           reject all_match(type,OK)

           retain(_id,type)

       And now run the command:

           $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc

Create a CSV file of all ISSN numbers found at any MARC field

       To process this information we need to create a Fix script like the one below (line
       numbers are added here to explain the working of this script but don't need to be included
       in the script):

           01: marc_map('***',text.$append)
           02:
           03: filter(text,'(\b\d{4}-?\d{3}[\dxX]\b)')
           04: replace_all(text.*,'.*(\b\d{4}-?\d{3}[\dxX]\b).*',$1)
           05:
           06: do list(path:text)
           07:   unless is_valid_issn(.)
           08:     reject()
           09:   end
           10: end
           11:
           12: vacuum()
           13:
           14: select exists(text)
           15:
           16: join_field(text,' ; ')
           17:
           18: retain(_id,text)

       On line 01 all the text in the MARC record is mapped into a "text" array.  On line 03 we
       filter out this array all the lines that contain an ISSN string using a regular
       expression.  On line 04 the "replace_all" is used to delete everything in the "text" array
       that isn't an ISSN number.  On line 06-10 we go over every ISSN string and check if it has
       a valid checksum and erase it when not.  On line 12 we use the "vacuum" function to remove
       any remaining empty fields On line 14 we select only the records that contain a valid ISSN
       number On line 16 the ISSN get joined by a semicolon ';' into a long string On line 18 we
       keep only the record id and the ISSNs in for the report.

       Run this Fix script (without the line number) using this command

           $ catmandu convert MARC to CSV --fix myfix.fix < data.mrc

   Create a MARC validator
       For this example we need a Fix script that contains validation rules we need to check. For
       instance, we require to have a 245 field and at least a 008 control field with a date
       filled in. This can be coded as in:

           # Check if a 245 field is present
           unless marc_has('245')
             log("no 245 field",level:ERROR)
           end

           # Check if there is more than one 245 field
           if marc_has_many('245')
             log("more than one 245 field?",level:ERROR)
           end

           # Check if in 008 position 7 to 10 contains a 4 digit number ('\d' means digit)
           unless marc_match('008/07-10','\d{4}')
             log("no 4-digit year in 008 position 7 -> 10",level:ERROR)
           end

       Put this Fix script in a file "myfix.fix" and execute the Catmandu command with the "-D"
       option for logging and the Null exporter to discard the normal output

           $ catmandu -D convert MARC to Null --fix myfix.fix < data.mrc

TRANSFORMING

   Add a new MARC field
       In the example bellow we add new 856 field to the record with a $u subfield containing the
       Google homepage:

          marc_add(856,u,"http://www.google.com")

       A control field can be added by using the '_' subfield

          marc_add(009,_,0123456789)

       Maybe you want to copy the data from one subfield to another. Use the marc_map to store
       the data first in a temporary field and add it later to the new field:

          # copy a subfield
          marc_map(001,tmp)

          # maybe process the data a bit
          append(tmp,"-mytest")

          # add the contents of the tmp field to the new 009 field
          marc_add(009,_,$.tmp)

   Set a MARC subfield
       Set the $h subfield to a new value (or create it when it doesn't exist yet):

          marc_set(100h, test123)

       Only set the 100 field if the first indicator is 3

          marc_set(100[3]h, test123)

   Remove a MARC (sub)field
       Remove all fields 500 , 501 , 5** :

          marc_remove(5**)

       Remove all 245h fields:

          marc_remove(245h)

   Append text to a MARC field
       Append a period to the 500 field is there isn't already there:

         do marc_each()
           unless marc_match(500, "\.$")    # Only if the current field 500 doesn't end with a period
             marc_append(500,".")           # Add to the current 500 field a period
           end
         end

       Use the Catmandu::Fix::Bind::marc_each Bind to loop over all MARC fields. In the context
       of the "do -- end" only one MARC field at a time is visible for the "marc_*" fixes.

   The marc_each binder
       All "marc_*" fixes will operate on all MARC fields matching a MARC path. For example,

          marc_remove(856)

       will remove all 856 MARC fields. In some cases you may want to change only some of the
       fields in a record. You could write:

         if marc_match(856u,"google")
            marc_remove(856)
         end

       in the hope it would remove the 856 fields that contain the text "google" in the $u
       subfield.  Alas, this is not what will happen. The "if" condition will match when the
       record contains one or more 856u fields containing "google". The "marc_remove" Fix will
       delete all 856 fields. To correctly remove only the 856 fields in the context of the "if"
       statement the "marc_each" binder is required:

         do marc_each()
           if marc_match(856u,"google")
              marc_remove(856)
           end
         end

       The "marc_each" will loop over all MARC fields one at a time. The if statement will only
       match when the current MARC field is 856 and the $u field contains "google". The
       "marc_remove(856)" will only delete the current 856 field.

       In "marc_each" binder, it seems for all Fixes as if there is only one field at a time
       visible in the record.  This Fix will not work:

         do marc_each()
           if marc_match(856u,"google")
              marc_remove(900)           # <-- there is only a 856 field in the current context
           end
         end

   marc_copy, marc_cut and marc_paste
       The Catmandu::Fix::marc_copy, Catmandu::Fix::marc_cut, Catmandu::Fix::marc_paste Fixes are
       needed when complicated edits are needed in MARC record.

       The "marc_copy" fill copy parts of a MARC record matching a MARC_PATH to a temporary
       variable.  This tempoarary variable will contain an ARRAY of HASHes containing the content
       of the MARC field.

       For instance,

         marc_copy(650, tmp)

       The "tmp" will contain something like:

         tmp:[
             {
                 "subfields" : [
                     {
                         "a" : "Perl (Computer program language)"
                     }
                 ],
                 "ind1" : " ",
                 "ind2" : "0",
                 "tag" : "650"
           },
           {
                 "ind1" : " ",
                 "subfields" : [
                     {
                         "a" : "Web servers."
                     }
                 ],
                 "tag" : "650",
                 "ind2" : "0"
           }
         ]

       This structure can be edited with all the Catmandu fixes. For instance you can set the
       first indicator to '1':

         set_field(tmp.*.ind1 , 1)

       The JSON path "tmp.*.ind1" will match all the first indicators. The JSON path "tmp.*.tag"
       will match all the MARC tags. The JSON path "tmp.*.subfields.*.a" will match all the $a
       subfields. For instance, to change all 'Perl' into 'Python' in the $a subfield use this
       Fix:

         replace_all(tmp.*.subfields.*.a,"Perl","Python")

       When the fields need to be places back into the record the "marc_paste" command can be
       used:

          marc_paste(subjects)

       This will add all 650 fields in the "tmp" temporary variable at the end of the record. You
       can change the MARC fields in place using the "march_each" binder:

         do marc_each()
            # Select only the 650 fields
            if marc_has(650)
               # Create a working copy
               marc_copy(650,tmp)

               # Change some fields
               set_field(tmp.*.ind1 , 1)

               # Paste the result back
               marc_paste(tmp)
            end
         end

       The "marc_cut" Fix works like "marc_copy" but will delete the matching MARC field from the
       record.

   Rename MARC subfields
       In the example below we rename each $1 subfield in the MARC record to $0 using the
       Catmandu::Fix::marc_cut, Catmandu::Fix::marc_paste and Catmandu::Fix::rename fixes:

           # For each marc field...
           do marc_each()
              # Cut the field into tmp..
              marc_cut(***,tmp)

              # Rename every 1 subfield to 0
              rename(tmp.*.subfields.*,1,0)

              # And paste it back
              marc_paste(tmp)
           end

       The "marc_each" bind will loop over all the MARC fields. With "marc_cut" we store any
       field ("***" matches every field) into a "tmp" field. The "marc_cut" creates an array
       structure in "tmp" which is easy to process using the Fix language. Using the "rename"
       function we search for all the subfields, and replace the field matching the regular
       expression 1 with 0. At the end, we paste back the "tmp" field into the record.

   Setting and remove MARC indicators
       In the example below we set every indicator1 of the 500 field to the value "0".  We will
       use the Catmandu::Fix::Bind::marc_each bind with a loop variable:

           # For each marc field...
           do marc_each(var:this)
              # If the marc field is a 500 field
              if marc_has(500)
                 # Set the indicator1 to value "0"
                 set_field(this.ind1,0)
                 # Store the result back into the MARC record
                 marc_remove(500)
                 marc_paste(this)
              end
            end

       Using the same method indicators can also be deleted by setting their value to a space "
       ".

   Adding a new MARC subfield
       In the example below we append a new MARC subfield $z to the 500 field with value test.
       We will use the Catmandu::Fix::Bind::marc_each bind with a loop variable:

           # For each marc field...
           do marc_each(var:this)
              # If the marc field is a 500 field
              if marc_has(500)
                 # add a new subfield z
                 add_field(this.subfields.$append.z,Test)
                 # Store the result back into the MARC record
                 marc_remove(500)
                 marc_paste(this)
              end
           end

   Remove all non-numeric fields from the MARC record
           # For each marc field...
           do marc_each(var:this)
              # If we have a non-numeric fields
              unless all_match(this.tag,"\d{3}")
                 # Remove this tag
                 marc_remove(***)
              end
           end

WRITING

   Convert a MARC record into a MARC record (do nothing)
           $ catmandu convert MARC to MARC < data.mrc > output.mrc

   Add a 920a field with value 'checked' to all records
           $ catmandu convert MARC to MARC --fix 'marc_add("900",a,"checked")' < data.mrc > output.mrc

   Delete the 024 fields from all MARC records
           $ catmandu convert MARC to MARC --fix 'marc_remove("024")' < data.mrc > output.mrc

   Set the 650p field to 'test' for all records
           $ catmandu convert MARC to MARC --fix 'marc_add("650p","test")' < data.mrc > output.mrc

   Select only the records with 900a == book
           $ catmandu convert MARC to MARC --fix 'marc_map(900a,type); select all_match(type,book)' < data.mrc > output.mrc

       The "all_match" also allows a regular expressions:

           $ catmandu convert MARC to MARC --fix 'marc_map(900a,type); select all_match(type,"[Bb]ook")' < data.mrc > output.mrc

   Select only the records with 900a values in a given CSV file
       Create a CSV file with name,value pairs (need two columns):

           $ cat values.csv
           name,values
           book,1
           journal,1
           movie,1

           $ catmandu convert MARC to MARC --fix myfixes.txt < data.mrc > output.mrc

           with myfixes.txt like:

           do marc_each()
              marc_map(900a,test)
              lookup(test,values.csv,default:0)
              select all_match(test,1)
              remove_field(test)
           end

       We use a "do marc_each() ... end" loop because 900a fields can be repeated. If a MARC tag
       isn't repeatable this loop not isn't needed. With marc_map we copy first the value of a
       marc subfield to a 'test' field. This test we lookup against the CSV file. Then, we select
       only the records that are found in the CSV file (and return the correct value).

DEDUPLICATION

   Check for duplicate ISBN numbers in a MARC file
       In this example we extract from a MARC file all the ISBN numbers from the 020 and do a
       little bit of data cleaning using the Catmandu::Identifier project. To install this
       package, we run this command:

           $ cpanm Catmandu::Identifier

       To extract all the ISBN numbers we use this Fix script 'dedup.fix':

           marc_map(020a, identifier.$append)
           replace_all(identifier.*,"\s+.*","")
           do list(path:identifier)
             isbn13(.)
           end
           do hashmap(exporter:YAML)
             copy_field(identifier,key)
             copy_field(_id,value)
           end

       The first "marc_map" fix maps every 020 field to an identifier array.  The "replace_all"
       cleans the data a bit and deletes some unwanted text.  The "do list" will transform all
       the ISBN numbers to ISBN13.  The "do hashmap" will create an internal mapping table of
       identifier,_id key value pairs. For very identifier, one or more _id can be stored. At the
       end of all MARC processing this mapping table is dumped from memory as a YAML document.

       Run this fix as:

           $ catmandu convert MARC to Null --fix dedup.fix < marc.mrc > output.yml

       The output YAML file will contain the ISBN to document ID mapping. We only need the ISBN
       numbers with more than one hit. We need a little bit of cleanup on this YAML file to reach
       our final result. Use the following 'cleanup.fix' script:

           select exists(value.1)
           join_field(value,",")

       This first "select" fix selects only the records with more than one hit.  The "join_field"
       will turn the array of results into a string. Execute this Fix like:

           $ catmandu convert YAML to TSV --fix cleanup.fix < output.yml > result.csv

       This will provide a tab delimited file of double isbn numbers in the MARC input file.