marc21-xml

1.2.1 • Public • Published

Parse and query MARC21-XML files, no matter how big

Abstract

This package provides querying tools for MARC21-XML files. It was built to handle very large (>10 GB), gzipped XML files, but it works just as well with small ones.

Note: this document is a work in progress.

Installation

$ npm install marc21-xml

Features

  • Query marc21-xml files with an intuitive and relatable syntax (no xpath!)
  • Work with a record as soon as it is fetched or save all results to file (JSON, XML)
  • Convert MARC21-XML to JSON
  • Split large files into smaller files with two lines of code

Instances

To use marc21-xml, require the Parser class and, optionally, the fn object, which contains helper functions for queries.

const marc21 = require('marc21-xml');
const Parser = marc21.Parser;
const fn = marc21.fn;

var parser = new Parser({ in : 'file.xml.gz', memory : true })

parser.find()
.then(records => {
    console.log('Done, records parsed: ' + records.length);
})

Note: the previous example sets the memory option explicitly, so the records are kept in memory as they are parsed. Avoid this with very large files.

options

  • options.in {array | string} : List of files to parse
  • options.gzip {boolean} : Whether the file is gzipped. Default: inferred from file extension.
  • options.search {number} : Stop after N records have been analyzed, regardless of how many matched. Default : (no limit)
  • options.skip {number} : Skip the first N records without evaluating them. Default : 0
  • options.log {number | boolean} : If number, the frequency of logging (e.g. log every 2000 records). If false, do not log. Default: 10
  • options.out {string} : Path to file where to save results, without file extension (e.g. 'out' will save to out.json, out.xml, out.txt). Default: no output file
  • options.outFormat {array | string} : Format(s) of output file, available: json, xml, human (txt). Default: json (if options.out is set)
  • options.rotate {number | boolean} : If number, how many records per output file. If false, do not rotate. Default: 500
  • options.memory {boolean} : Whether to keep the results in memory. If true, the array of results is returned on completion. If false, only the number of results is returned. Default : true if no output file is set, false otherwise.
  • options.rawRecords {boolean} : Whether the records passed to the record event and kept in memory should be raw JSON objects or internal Record objects. Default: false

Note: search and skip behave basically like limit and offset, but they count the records parsed from the file rather than the records that match the query.
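
The difference between search/skip and limit/offset can be sketched with a small simulation. This is not the library's code: the simulate function and its parameters are hypothetical, plain numbers stand in for records, and it assumes that skipped records do not count towards search.

```javascript
// Simulation only: plain numbers stand in for parsed records.
// Assumption: "skip" and "search" count parsed records,
// while "offset" and "limit" count records that match the query.
function simulate(records, matches, { skip = 0, search = Infinity, offset = 0, limit = Infinity }) {
    const results = [];
    let analyzed = 0; // records evaluated against the query
    let matched = 0;  // records that satisfied the query
    for (let i = skip; i < records.length; i++) {
        if (analyzed >= search || results.length >= limit) break;
        analyzed++;
        if (!matches(records[i])) continue;
        matched++;
        if (matched > offset) results.push(records[i]);
    }
    return results;
}

const records = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
const isEven = n => n % 2 === 0;

// Skip 2 records, then analyze at most 4 (records 3 to 6): the matches are 4 and 6.
console.log(simulate(records, isEven, { skip: 2, search: 4 }));  // [ 4, 6 ]
// Discard the first 2 matches, then keep at most 2: the results are 6 and 8.
console.log(simulate(records, isEven, { offset: 2, limit: 2 })); // [ 6, 8 ]
```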

Example: split a large file into smaller files

const marc21 = require('marc21-xml');
const Parser = marc21.Parser;

const INPUT_FILE = 'huge-file.xml.gz'; // contains 4M records

var parser = new Parser({ in : INPUT_FILE,
                          out : '/path/to/smaller-file',
                          outFormat : ['json', 'xml'],
                          rotate : 4000 });

parser.find().then(() => {
    console.log('Done');
})

When the then callback is called, 2000 files (1000 per format) have been saved in '/path/to/' (the directory must already exist):

  • xml : smaller-file.xml, smaller-file.1.xml, ..., smaller-file.999.xml
  • json : smaller-file.json, smaller-file.1.json, ..., smaller-file.999.json

Each file has 4000 records (set via the rotate option passed to the constructor).
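
The file count follows directly from the numbers above. A minimal sketch of the arithmetic and naming scheme, assuming a partially filled last file is still written (hence the rounding up); rotatedFileNames is a hypothetical helper, not part of the package:

```javascript
// Hypothetical helper: lists the output file names for one format,
// following the naming shown above (base.ext, base.1.ext, base.2.ext, ...).
function rotatedFileNames(base, extension, totalRecords, recordsPerFile) {
    // Assumption: a last, partially filled file still counts as a file.
    const fileCount = Math.ceil(totalRecords / recordsPerFile);
    const names = [];
    for (let i = 0; i < fileCount; i++) {
        names.push(i === 0 ? `${base}.${extension}` : `${base}.${i}.${extension}`);
    }
    return names;
}

const xmlFiles = rotatedFileNames('smaller-file', 'xml', 4000000, 4000);
console.log(xmlFiles.length);               // 1000
console.log(xmlFiles[0]);                   // smaller-file.xml
console.log(xmlFiles[xmlFiles.length - 1]); // smaller-file.999.xml
```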

Basic search

The main method to query a file is Parser.find(). It takes a single parameter options which defines all search parameters. Let's proceed top-down and see a full example first:

parser.find({
    where : {
        type : 'Bibliographic',
        controlfield : { tag : '001', value : '1234567890' },
        datafield : { tag : '100', code : 'a', value : 'Cormac Mc Carthy' }
    },
    select : {
        controlfields: true,
        datafields : { tag: '100', code : 'a', full : false }
    },
    limit : 20,
    offset : 0
})
.each(record => {
    console.log('Found a record that matches the query')
})
.then(records => {
    console.log('Done. Records found : ' + records.length);
})

The previous query selects the first 20 records that are of type 'Bibliographic' and have either a controlfield '001' of value '1234567890' or at least one datafield '100' with at least a subfield of code 'a' whose value is 'Cormac Mc Carthy'. The selectors are contained in the where section.

Additionally, when returning the selected records, it filters their content so that each one has only:

  • all controlfields
  • only datafields '100'
  • within each datafield '100', only the subfields 'a' (because full is set to false)

The filtering of the results' content is performed via the select section of the query.
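
The effect of the select section can be sketched on a plain object. The record shape used here (controlfields and datafields arrays, each datafield carrying a subfields array) and the applySelect function are assumptions made for illustration; the package's raw JSON layout and internals may differ.

```javascript
// Sketch only: the record shape and applySelect are hypothetical.
function applySelect(record, select) {
    const out = { type: record.type };
    // controlfields : true keeps every controlfield
    if (select.controlfields) out.controlfields = record.controlfields;
    const wanted = select.datafields;
    if (wanted) {
        out.datafields = record.datafields
            .filter(field => field.tag === wanted.tag)
            // Assumption: full : false keeps only the subfields with the selected code
            .map(field => (wanted.full === false && wanted.code)
                ? { ...field, subfields: field.subfields.filter(sub => sub.code === wanted.code) }
                : field);
    }
    return out;
}

const sampleRecord = {
    type: 'Bibliographic',
    controlfields: [{ tag: '001', value: '1234567890' }],
    datafields: [
        { tag: '100', ind1: '1', ind2: ' ', subfields: [
            { code: 'a', value: 'Cormac Mc Carthy' },
            { code: 'd', value: 'sample value' }
        ] },
        { tag: '245', ind1: '1', ind2: '0', subfields: [{ code: 'a', value: 'sample title' }] }
    ]
};

const filtered = applySelect(sampleRecord, {
    controlfields: true,
    datafields: { tag: '100', code: 'a', full: false }
});
console.log(JSON.stringify(filtered, null, 2));
```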

The options object has the following properties:

  • options.where : defines the search parameters (selectors)
  • options.select : defines the fields to fetch
  • options.limit : {number} How many results to fetch. Default: Infinity.
  • options.offset : {number} results to discard, starting from 0. Default: 0.

Let's start with the most important part, the selectors.

Selectors: options.where

Three main selectors are available:

options.where.type

{ type : 'Bibliographic' }

Selects all records of a specific type (e.g. 'Bibliographic', 'Authority')

options.where.controlfield

{ controlfield : { tag : '001', value : '1234567890' } }

Selects all records that contain at least one controlfield whose tag is '001' and whose value is '1234567890'. Neither tag nor value is mandatory. The two following selectors search for records that a) contain at least one controlfield '002' or b) contain at least one controlfield of any tag with value '1234567890':

{ controlfield : { tag : '002' } }            [a]
{ controlfield : { value : '1234567890' } }   [b]

options.where.datafield

{ datafield : { tag : '100', ind1 : '1', ind2 : ' '} }

Selects all records that contain at least one datafield '100' with the given indicators ind1 and ind2. A more useful selector also takes into account the subfields, which contain the actual datafield information. To make querying clearer, subfield selectors are combined with the datafield:

{ datafield : { tag : '100', ind1 : '1', ind2 : ' ', code : 'a', value : 'Mozart'} }

The previous selector matches all records that contain at least one datafield '100' (with the given indicators) which contains at least one subfield of code 'a' and value 'Mozart'. As with the controlfield selector, none of the parameters is mandatory. The following examples show possible selectors to search for:

[a] records that have at least one datafield '100':

{ datafield : { tag : '100'} }

[b] records that have at least one datafield '100' with subfield 'a' of any value:

{ datafield : { tag : '100', code : 'a' } }

[c] records that have at least one datafield '100' with a subfield of value 'Mozart' no matter the code

{ datafield : { tag : '100', value : 'Mozart' } }

[d] records that have at least one datafield with a subfield of any code that has value 'Mozart'

{ datafield : { value : 'Mozart' } }

Note: the value, as specified above, must match completely (also case-sensitive). Later in this document you will see how to search with keywords with fn.CONTAINS().
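
The matching rules described above can be sketched in plain JavaScript. This is a sketch of the semantics, not the package's implementation: matchesDatafield and the record shape (a datafields array whose items carry tag, ind1, ind2 and a subfields array) are assumptions.

```javascript
// Sketch only: hypothetical record shape and matcher.
// Every property of the selector is optional; omitted properties match anything.
function matchesDatafield(record, selector) {
    return record.datafields.some(field => {
        if (selector.tag !== undefined && field.tag !== selector.tag) return false;
        if (selector.ind1 !== undefined && field.ind1 !== selector.ind1) return false;
        if (selector.ind2 !== undefined && field.ind2 !== selector.ind2) return false;
        // No subfield constraint: the datafield itself is enough.
        if (selector.code === undefined && selector.value === undefined) return true;
        // code and value must be satisfied by the same subfield (exact, case-sensitive).
        return field.subfields.some(sub =>
            (selector.code === undefined || sub.code === selector.code) &&
            (selector.value === undefined || sub.value === selector.value));
    });
}

const mozartRecord = {
    datafields: [
        { tag: '100', ind1: '1', ind2: ' ', subfields: [{ code: 'a', value: 'Mozart' }] }
    ]
};

console.log(matchesDatafield(mozartRecord, { tag: '100' }));                             // true  [a]
console.log(matchesDatafield(mozartRecord, { tag: '100', code: 'a' }));                  // true  [b]
console.log(matchesDatafield(mozartRecord, { tag: '100', value: 'Mozart' }));            // true  [c]
console.log(matchesDatafield(mozartRecord, { value: 'Mozart' }));                        // true  [d]
console.log(matchesDatafield(mozartRecord, { tag: '100', code: 'a', value: 'mozart' })); // false (case-sensitive)
```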

Examples

1 : find by controlfield value

Search for the record(s) that have a controlfield '001' whose value (text of the tag) is '1234567890' (case-sensitive)

parser.find({ where : { controlfield : { tag : '001', value : '1234567890' } } });

2 : find by controlfield value, multiple

Search for all records that contain a controlfield '001' whose value is either '1234567890' OR '0987654321'.

parser.find({ where : { controlfield : { tag : '001', value : fn.EQUALS('1234567890', '0987654321') } } });

Note: fn.EQUALS is case-sensitive

3 : find by datafield value

Search for records that contain at least one datafield '079' with a subfield 't' whose value is 'Foo Bar'.

parser.find({ where : { datafield : { tag : '079', code : 't', value : 'Foo Bar' } } });

4 : find by datafield value with search terms (OR)

Search for records that contain at least one datafield '100' with a subfield 'a' whose value contains either 'mozart' OR 'chopin' OR 'verdi'.

parser.find({ where : { datafield : { tag : '100', code : 'a', value : fn.CONTAINS('mozart', 'chopin', 'verdi') } } });

Note: fn.CONTAINS is case-insensitive

5 : find by datafield value with search terms (AND)

Search for records that contain at least one datafield '100' with a subfield 'a' whose value contains 'mozart' AND 'amadeus'.

parser.find({ where : { datafield : { tag : '100', code : 'a', value : fn.CONTAINS_ALL('mozart', 'amadeus') } } });

Note: fn.CONTAINS_ALL is case-insensitive

6 : find by datafield value excluding terms

Search for records that contain at least one datafield '100' with a subfield 'a' whose value contains neither 'mozart' nor 'chopin'.

parser.find({ where : { datafield : { tag : '100', code : 'a', value : fn.NCONTAINS('mozart', 'chopin') } } });

Note: fn.NCONTAINS is case-insensitive
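
The behaviour of the four helpers used in these examples can be summarised with plain predicates. The functions below (equalsAny, containsAny, containsAll, containsNone) are hypothetical stand-ins written for this sketch; they mirror the semantics described above and are not the package's fn implementation.

```javascript
// Sketch only: stand-ins for fn.EQUALS, fn.CONTAINS, fn.CONTAINS_ALL and fn.NCONTAINS.
// Each one takes the search terms and returns a predicate over a field value.
const includesTerm = (value, term) => value.toLowerCase().includes(term.toLowerCase());

const equalsAny    = (...terms) => value => terms.includes(value);                      // exact, case-sensitive, OR
const containsAny  = (...terms) => value => terms.some(t => includesTerm(value, t));    // case-insensitive, OR
const containsAll  = (...terms) => value => terms.every(t => includesTerm(value, t));   // case-insensitive, AND
const containsNone = (...terms) => value => !terms.some(t => includesTerm(value, t));   // case-insensitive, NOR

const composer = 'Mozart, Wolfgang Amadeus';

console.log(equalsAny('1234567890', '0987654321')('0987654321')); // true
console.log(equalsAny('mozart')('Mozart'));                       // false (case-sensitive)
console.log(containsAny('mozart', 'chopin', 'verdi')(composer));  // true
console.log(containsAll('mozart', 'amadeus')(composer));          // true
console.log(containsAll('mozart', 'chopin')(composer));           // false
console.log(containsNone('mozart', 'chopin')(composer));          // false
console.log(containsNone('verdi', 'chopin')(composer));           // true
```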

The find() method

This is the main function to query the data. It returns an extended Promise that has the following methods:

  • then() : the standard then of the Promise. The argument is the results (if memory is set to true in the main parser options)
  • catch() : standard catch of the Promise, catches errors
  • each() : sets a callback that is called every time a record that matches the query is found. The argument passed to the callback is either a single record or an array of records, depending on the value of the callbackFrequency option passed to the main parser object (default is 1)
  • transform() : sets a callback that is called every time a record is parsed from the file. It is meant to let you transform the record before it is tested against the query. The argument passed to the callback is always a single record
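
The idea of a promise extended with each() can be sketched as follows. This is only one possible way to build such an object, not the package's implementation: findSketch is hypothetical, it works on an in-memory array instead of a file, and it leaves out transform() and callbackFrequency.

```javascript
// Sketch only: a promise with an extra each() method.
function findSketch(records, predicate) {
    const callbacks = [];
    // The scan starts in a microtask, so callbacks registered with each()
    // in the same synchronous chain are in place before it runs.
    const promise = Promise.resolve().then(() => {
        const results = [];
        for (const record of records) {
            if (!predicate(record)) continue;
            callbacks.forEach(callback => callback(record)); // fire once per matching record
            results.push(record);
        }
        return results;
    });
    // each() returns the same promise so that then() and catch() can be chained after it.
    promise.each = callback => {
        callbacks.push(callback);
        return promise;
    };
    return promise;
}

findSketch([1, 2, 3, 4], n => n % 2 === 0)
    .each(record => console.log('Matched', record))
    .then(records => console.log('Done. Records found : ' + records.length));
```

In this sketch each() has to be called before then(), because then() returns a plain promise without the extra method. The examples in this document follow the same order.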

Record API

Each record is an object of class Record that wraps the raw JSON object.

Record.matches

Returns the list of conditions that were met. Example: a record contains three datafields with tag "400", each with a subfield of code "a", with the following values:

tag "400" ind1 "1"  ind2 " "  code "a" : Berbiguier, Antoine T.
tag "400" ind1 "1"  ind2 " "  code "d" : 1782-1838

tag "400" ind1 "1"  ind2 " "  code "a" : Berbiguier, Benoit Tranquille
tag "400" ind1 "1"  ind2 " "  code "d" : 1782-1838

tag "400" ind1 "1"  ind2 " "  code "a" : Berbiguier, Tranquille
tag "400" ind1 "1"  ind2 " "  code "d" : 1782-1838

The following query would find it:

parser.find({
    where : {
        datafield : fn.OR(
            { tag : "400", code : "a", value : "Benoit" },
            { tag : "400", code : "a", value : fn.CONTAINS("Antoine", "T.") }
        )
    }
})
.each(record => {
    console.log(record.matches);
})
.then(...)

The output would be:

[{ tag : "400", code : "a", value : "Antoine" }] // only first match

Note: the query can be read as "find all records that contain either a datafield 400 with a subfield 'a' whose value is exactly 'Benoit', or a datafield 400 with a subfield 'a' whose value contains 'antoine' and/or 't.'".

Record.datafields(datafieldSelector)

Returns an array containing the datafields of the record. By default it returns all datafields, but it is possible to filter the desired datafields by passing a datafield selector. Example :

parser.find().each(record => {
    var all = record.datafields();
    var selected400 = record.datafields({ tag : '400' });
    var selected100 = record.datafields({ tag : '100', code : 'a', value : 'Some String' });

    console.log('This record contains: ');
    console.log('Total datafields : ' + all.length);
    console.log('Tag "400" datafields : ' + selected400.length);
    console.log('Tag "100" datafields which contain a subfield of code "a" with value "Some String" : ' + selected100.length);
})

Record.hasDatafield(datafieldSelector)

Returns false if the record does not contain any datafield that matches the passed selector, otherwise returns a truthy value:

  • true if the selector did not contain any constraint on value
  • A single value that was matched if the selector contained a single value constraint (e.g. value : "Some String") or a list with OR policy (e.g. fn.CONTAINS("some", "string") or fn.EQUALS("some", "string")). In the list case, the first matched value is returned
  • An array of all matched values if the selector contained a list constraint with and policy (e.g. value : fn.CONTAINS_ALL("some", "string")).
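
The three return shapes can be illustrated with a small stand-alone function. Everything here is an assumption made for the sketch: hasValueSketch is hypothetical, it receives the subfield values directly instead of a record, the constraint object with policy and terms is an invented representation of the fn helpers, and for list constraints it returns the matched search terms.

```javascript
// Sketch only: returns false, true, a single value or an array, as described above.
// constraint is undefined, a string, or { policy : 'or' | 'and', terms : [...] } (hypothetical shape).
function hasValueSketch(subfieldValues, constraint) {
    const contains = (value, term) => value.toLowerCase().includes(term.toLowerCase());
    // No constraint on value: a plain boolean.
    if (constraint === undefined) return subfieldValues.length > 0;
    // Single value constraint: exact match, returns the matched value.
    if (typeof constraint === 'string') {
        return subfieldValues.includes(constraint) ? constraint : false;
    }
    if (constraint.policy === 'or') {
        // OR policy: the first term found in any value.
        const first = constraint.terms.find(term => subfieldValues.some(v => contains(v, term)));
        return first === undefined ? false : first;
    }
    // AND policy: one value must contain every term; all terms are returned.
    const allFound = subfieldValues.some(v => constraint.terms.every(term => contains(v, term)));
    return allFound ? [...constraint.terms] : false;
}

const values = ['Some String'];
console.log(hasValueSketch(values));                                              // true
console.log(hasValueSketch(values, 'Some String'));                               // Some String
console.log(hasValueSketch(values, { policy: 'or', terms: ['other', 'string'] })); // string
console.log(hasValueSketch(values, { policy: 'and', terms: ['some', 'string'] })); // [ 'some', 'string' ]
console.log(hasValueSketch(values, 'Other'));                                     // false
```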

Record.JSON()

Returns a stringified version of the raw JSON record

Record.XML()

Returns a string containing the XML representation of the record. Only available if xml was passed as an outFormat in the main Parser object's options.

Record.print()

Returns a string containing a human-readable version of the record.

License

ISC

Collaborators

  • s.bider