surgeon

    3.16.4 • Public • Published

    Surgeon

    Declarative DOM extraction expression evaluator.

    Powerful, succinct, composable, extendable, declarative API.

    articles:
    - select article {0,}
    - body:
      - select .body
      - read property innerHTML
      imageUrl:
      - select img
      - read attribute src
      summary:
      - select ".body p:first-child"
      - read property innerHTML
      - format text
      title:
      - select .title
      - read property textContent
    pageName:
    - select .body
    - read property innerHTML
     

    Not succinct enough for you? Use aliases and the pipe operator (|) to shorten and concatenate the commands:

    articles:
    - sm article
    - body: s .body | rp innerHTML
      imageUrl: s img | ra src
      summary: s .body p:first-child | rp innerHTML | f text
      title: s .title | rp textContent
    pageName: s .body | rp innerHTML
    
    

    Have you got suggestions for improvement? I am all ears.


    Configuration

    | Name | Type | Description | Default value |
    |---|---|---|---|
    | evaluator | EvaluatorType | HTML parser and selector engine. See evaluators. | browser evaluator if window and document variables are present, cheerio otherwise. |
    | subroutines | $PropertyType<UserConfigurationType, 'subroutines'> | User defined subroutines. See subroutines. | N/A |

    Evaluators

    Subroutines use an evaluator to parse input (i.e. convert a string into an object) and to select nodes in the resulting document.

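    The default evaluator is configured based on the user environment: the browser evaluator is used if the window and document variables are present, and the cheerio evaluator is used otherwise.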

    Have a use case for another evaluator? Raise an issue.

    For an example implementation of an evaluator, refer to the source of the built-in browser and cheerio evaluators.

    browser evaluator

    Uses native browser methods to parse the document and to evaluate CSS selector queries.

    Use browser evaluator if you are running Surgeon in a browser or a headless browser (e.g. PhantomJS).

    import {
      browserEvaluator
    } from './evaluators';
     
    surgeon({
      evaluator: browserEvaluator()
    });
     

    cheerio evaluator

    Uses cheerio to parse the document and to evaluate CSS selector queries.

    Use cheerio evaluator if you are running Surgeon in Node.js.

    import {
      cheerioEvaluator
    } from './evaluators';
     
    surgeon({
      evaluator: cheerioEvaluator()
    });
     

    Subroutines

    A subroutine is a function used to advance the DOM extraction expression evaluator, e.g.

    x('foo | bar baz', 'qux');
     

    In the above example, the Surgeon expression uses two subroutines: foo and bar.

    The foo subroutine is invoked without additional values. The bar subroutine is executed with 1 value ("baz").

    Subroutines are executed in the order in which they are defined: the result of each subroutine is passed on to the next one. The first subroutine receives the document input (in this case: "qux" string).

    Multiple subroutines can be written as an array. The following example is equivalent to the earlier example.

    x([
      'foo',
      'bar baz'
    ], 'qux');
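
    The execution order described above can be pictured as a left fold over the list of subroutines. The following is a minimal sketch of that idea, not Surgeon's implementation: runPipeline, upper and suffix are hypothetical names, and every subroutine is assumed to take the current subject and an array of parameter values.

```javascript
// Hypothetical stand-ins for subroutines. Each one receives the current
// subject and the parameter values written after the subroutine name.
const subroutines = {
  suffix: (subject, [tail]) => subject + tail,
  upper: (subject) => subject.toUpperCase()
};

// Threads the input through the steps from left to right:
// the result of each step becomes the subject of the next one.
const runPipeline = (steps, input) => {
  return steps.reduce((subject, step) => {
    const [name, ...values] = step.split(' ');

    if (!Object.prototype.hasOwnProperty.call(subroutines, name)) {
      throw new Error('Unknown subroutine: ' + name);
    }

    return subroutines[name](subject, values);
  }, input);
};

console.log(runPipeline(['upper', 'suffix !'], 'qux')); // 'QUX!'
```

    Swapping the two steps changes the result, which is why the order of the subroutines in an expression matters.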
     

    There are two types of subroutines: built-in subroutines and user-defined subroutines.

    Note:

    These functions are called subroutines to emphasise the cross-platform nature of the declarative API.

    Built-in subroutines

    The following subroutines are available out of the box.

    append subroutine

    append appends a string to the input string.

    | Parameter name | Description | Default |
    |---|---|---|
    | tail | Appends a string to the end of the input string. | N/A |

    Examples:

    // Assuming an element <a href='http://foo' />,
    // then the result is 'http://foo/bar'.
    x(`select a | read attribute href | append '/bar'`);
     

    closest subroutine

    closest subroutine iterates through all the preceding nodes (including parent nodes) searching for either a preceding node matching the selector expression or a descendant of the preceding node matching the selector.

    Note: This is different from the jQuery .closest() in that the latter method does not search for parent descendants matching the selector.

    | Parameter name | Description | Default |
    |---|---|---|
    | CSS selector | CSS selector used to select an element. | N/A |

    constant subroutine

    constant returns the parameter value regardless of the input.

    | Parameter name | Description | Default |
    |---|---|---|
    | constant | Constant value that will be returned as the result. | N/A |

    format subroutine

    format is used to format input using sprintf.

    | Parameter name | Description | Default |
    |---|---|---|
    | format | sprintf format used to format the input string. The subroutine input is the first argument, i.e. %1$s. | %1$s |

    Examples:

    // Prefixes the href attribute value with 'http://foo.com'.
    x(`select a | read attribute href | format 'http://foo.com%1$s'`);
     

    match subroutine

    match is used to extract matching capturing groups from the subject input.

    | Parameter name | Description | Default |
    |---|---|---|
    | Regular expression | Regular expression used to match capturing groups in the string. | N/A |
    | Sprintf format | sprintf format used to construct a string using the matching capturing groups. | %s |

    Examples:

    // Extracts 1 matching capturing group from the input string.
    // Throws `InvalidDataError` if the input does not match the regular expression.
    // The backslash is doubled because the expression is written inside a JavaScript string literal.
    x('select .foo | read property textContent | match "/input: (\\d+)/"');
     
    // Extracts 2 matching capturing groups from the input string and formats the output using sprintf.
    // Throws `InvalidDataError` if the input does not match the regular expression.
    x('select .foo | read property textContent | match "/input: (\\d+)-(\\d+)/" %2$s-%1$s');
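
    To make the interplay between capturing groups and the sprintf format concrete, here is a minimal sketch in plain JavaScript. It is not how Surgeon implements match: matchAndFormat and formatPositional are hypothetical helpers, only positional placeholders such as %1$s are supported, and a plain Error stands in for InvalidDataError.

```javascript
// Replaces positional placeholders such as %1$s and %2$s with the
// corresponding (1-based) entry of `values`.
const formatPositional = (format, values) => {
  return format.replace(/%(\d+)\$s/g, (placeholder, position) => {
    return String(values[Number(position) - 1]);
  });
};

// Returns the formatted capturing groups, or throws when the input
// does not match the regular expression.
const matchAndFormat = (input, rule, format) => {
  const groups = rule.exec(input);

  if (groups === null) {
    throw new Error('Input does not match ' + rule);
  }

  // groups[0] is the whole match; the capturing groups start at index 1.
  return formatPositional(format, groups.slice(1));
};

console.log(matchAndFormat('input: 10-20', /input: (\d+)-(\d+)/, '%2$s-%1$s')); // '20-10'
```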
     

    nextUntil subroutine

    nextUntil subroutine is used to select all following siblings of each element up to but not including the element matched by the selector.

    | Parameter name | Description | Default |
    |---|---|---|
    | selector expression | A string containing a selector expression to indicate where to stop matching following sibling elements. | N/A |
    | filter expression | A string containing a selector expression to match elements against. | |
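
    The behaviour described above can be sketched without a DOM. In this illustration (an assumption-laden sketch, not Surgeon's code) siblings are plain objects, and predicate functions stand in for the selector expression and the filter expression; nextUntil here is a hypothetical helper of the same name.

```javascript
// Collects the siblings that follow `startIndex`, stopping before the first
// sibling accepted by `isStop`. `isIncluded` plays the role of the optional
// filter expression and accepts every sibling by default.
const nextUntil = (siblings, startIndex, isStop, isIncluded = () => true) => {
  const result = [];

  for (const sibling of siblings.slice(startIndex + 1)) {
    if (isStop(sibling)) {
      break;
    }

    if (isIncluded(sibling)) {
      result.push(sibling);
    }
  }

  return result;
};

const siblings = [
  {tag: 'h2', text: 'A'},
  {tag: 'p', text: 'a1'},
  {tag: 'ul', text: 'a2'},
  {tag: 'h2', text: 'B'},
  {tag: 'p', text: 'b1'}
];

const isHeading = (node) => node.tag === 'h2';
const isParagraph = (node) => node.tag === 'p';

console.log(nextUntil(siblings, 0, isHeading).map((node) => node.text)); // [ 'a1', 'a2' ]
console.log(nextUntil(siblings, 0, isHeading, isParagraph).map((node) => node.text)); // [ 'a1' ]
```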

    prepend subroutine

    prepend prepends a string to the input string.

    | Parameter name | Description | Default |
    |---|---|---|
    | head | Prepends a string to the start of the input string. | N/A |

    Examples:

    // Assuming an element <a href='//foo' />,
    // then the result is 'http://foo'.
    x(`select a | read attribute href | prepend 'http:'`);
     

    previous subroutine

    previous subroutine selects the preceding sibling.

    | Parameter name | Description | Default |
    |---|---|---|
    | CSS selector | CSS selector used to select an element. | N/A |

    Example:

    <ul>
      <li>foo</li>
      <li class='bar'></li>
    </ul>
    x('select .bar | previous | read property textContent');
    // 'foo'
     

    read subroutine

    read is used to extract value from the matching element using an evaluator.

    | Parameter name | Description | Default |
    |---|---|---|
    | Target type | Possible values: "attribute" or "property" | N/A |
    | Target name | Depending on the target type, name of an attribute or a property. | N/A |

    Examples:

    // Returns .foo element "href" attribute value.
    // Throws error if attribute does not exist.
    x('select .foo | read attribute href');
     
    // Returns an array of "href" attribute values of the matching elements.
    // Throws error if attribute does not exist on any of the matching elements.
    x('select .foo {0,} | read attribute href');
     
    // Returns .foo element "textContent" property value.
    // Throws error if property does not exist.
    x('select .foo | read property textContent');
     

    remove subroutine

    remove subroutine is used to remove elements from the document using an evaluator.

    remove subroutine accepts the same parameters as the select subroutine.

    The result of remove subroutine is the input of the subroutine, i.e. previous select subroutine result.

    | Parameter name | Description | Default |
    |---|---|---|
    | CSS selector | CSS selector used to select an element. | N/A |
    | Quantifier expression | A quantifier expression is used to control the expected result length. | See quantifier expression. |

    Examples:

    // Returns 'bar'.
    x('select .foo | remove span | read property textContent', `<div class='foo'>bar<span>baz</span></div>`);
     

    select subroutine

    select subroutine is used to select the elements in the document using an evaluator.

    | Parameter name | Description | Default |
    |---|---|---|
    | CSS selector | CSS selector used to select an element. | N/A |
    | Quantifier expression | A quantifier expression is used to control the shape of the results (direct result or array of results) and the expected result length. | See quantifier expression. |

    Quantifier expression

    A quantifier expression is used to assert that the query matches a set number of nodes. A quantifier expression is a modifier of the select subroutine.

    A quantifier expression is defined using the following syntax.

    | Name | Syntax |
    |---|---|
    | Fixed quantifier | {n} where n is an integer >= 1 |
    | Greedy quantifier | {n,m} where n >= 0 and m >= n |
    | Greedy quantifier | {n,} where n >= 0 |
    | Greedy quantifier | {,m} where m >= 1 |

    A node selector [i] can be appended to a quantifier expression, e.g. {0,}[0], which returns the first node from the result set.

    If this looks familiar, it's because I have adopted the syntax from the regular expression language. However, unlike in a regular expression, a quantifier in the context of a Surgeon selector will produce an error (SelectSubroutineUnexpectedResultCountError) if the selector result length is out of the quantifier range.

    Examples:

    // Selects 0 or more nodes.
    // Result is an array.
    x('select .foo {0,}');
     
    // Selects 1 or more nodes.
    // Throws an error if 0 matches found.
    // Result is an array.
    x('select .foo {1,}');
     
    // Selects between 0 and 5 nodes.
    // Throws an error if more than 5 matches found.
    // Result is an array.
    x('select .foo {0,5}');
     
    // Selects 0 or more nodes.
    // Result is the first match in the result set (or `null`).
    x('select .foo {0,}[0]');
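
    The quantifier syntax above is small enough to sketch a parser for it. The following is an illustration only, assuming the four forms from the syntax table plus an optional [i] suffix: parseQuantifier and applyQuantifier are hypothetical helpers, and a RangeError stands in for SelectSubroutineUnexpectedResultCountError.

```javascript
// Parses '{n}', '{n,m}', '{n,}' or '{,m}', optionally followed by '[i]'.
const parseQuantifier = (expression) => {
  const match = /^\{(\d*)(,?)(\d*)\}(?:\[(\d+)\])?$/.exec(expression);

  if (match === null || (match[1] === '' && match[3] === '')) {
    throw new Error('Invalid quantifier expression: ' + expression);
  }

  const [, min, comma, max, index] = match;

  return {
    index: index === undefined ? null : Number(index),
    // Without a comma the quantifier is fixed, i.e. max equals min.
    max: comma === '' ? Number(min) : max === '' ? Infinity : Number(max),
    min: min === '' ? 0 : Number(min)
  };
};

// Asserts the result count and shapes the result:
// an array without [i], a single node (or null) with [i].
const applyQuantifier = (nodes, expression) => {
  const {index, max, min} = parseQuantifier(expression);

  if (nodes.length < min || nodes.length > max) {
    throw new RangeError('Matched ' + nodes.length + ' nodes; expected ' + expression);
  }

  if (index === null) {
    return nodes;
  }

  return index < nodes.length ? nodes[index] : null;
};

console.log(applyQuantifier(['a', 'b'], '{0,}')); // [ 'a', 'b' ]
console.log(applyQuantifier(['a', 'b'], '{0,}[0]')); // 'a'
console.log(applyQuantifier([], '{0,}[0]')); // null
```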
     

    test subroutine

    test is used to validate the current value using a regular expression.

    | Parameter name | Description | Default |
    |---|---|---|
    | Regular expression | Regular expression used to test the value. | N/A |

    Examples:

    // Validates that .foo element textContent property value matches /bar/ regular expression.
    // Throws `InvalidDataError` if the value does not pass the test.
    x('select .foo | read property textContent | test /bar/');
     

    See error handling for more information and usage examples of the test subroutine.

    User-defined subroutines

    Custom subroutines can be defined using subroutines configuration.

    A subroutine is a function. A subroutine function is invoked with the following parameters:

    | Parameter name |
    |---|
    | An instance of Evaluator. |
    | Current value, i.e. value used to query Surgeon or value returned from the previous (or ancestor) subroutine. |
    | An array of values used when referencing the subroutine in an expression. |

    Example:

    const x = surgeon({
      subroutines: {
        mySubroutine: (currentValue, [firstParameterValue, secondParameterValue]) => {
          console.log(currentValue, firstParameterValue, secondParameterValue);
     
          return parseInt(currentValue, 10) + 1;
        }
      }
    });
     
    x('mySubroutine foo bar | mySubroutine baz qux', 0);
     

    The above example prints:

    0 "foo" "bar"
    1 "baz" "qux"
    
    

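    For more examples of defining subroutines, refer to the cookbook sections "Validate the results using a user-defined test function" and "Declare subroutine aliases".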

    Inline subroutines

    Custom subroutines can be inlined into pianola instructions, e.g.

    x(
      [
        'foo',
        (subject) => {
          // `subject` is the return value of `foo` subroutine.
     
          return 'bar';
        },
        'baz',
      ],
      'qux'
    );
     

    Built-in subroutine aliases

    Surgeon exports an alias preset that can be used to reduce the verbosity of the queries.

    | Name | Description |
    |---|---|
    | ra ... | Reads Element attribute value. Equivalent to read attribute ... |
    | rdtc ... | Removes any descending elements and reads the resulting textContent property of an element. Equivalent to remove * {0,} \| read property ... textContent |
    | rih ... | Reads innerHTML property of an element. Equivalent to read property ... innerHTML |
    | roh ... | Reads outerHTML property of an element. Equivalent to read property ... outerHTML |
    | rp ... | Reads Element property value. Equivalent to read property ... |
    | rtc ... | Reads textContent property of an element. Equivalent to read property ... textContent |
    | sa ... | Select any (sa). Selects multiple elements (0 or more). Returns array. Equivalent to select "..." {0,} |
    | saf ... | Select any first (saf). Selects multiple elements (0 or more). Returns single result or null. Equivalent to select "..." {0,}[0] |
    | sm ... | Select many (sm). Selects multiple elements (1 or more). Returns array. Equivalent to select "..." {1,} |
    | smo ... | Select maybe one (smo). Selects one element. Returns single result or null. Equivalent to select "..." {0,1}[0] |
    | so ... | Select one (so). Selects a single element. Returns single result. Equivalent to select "..." {1}[0]. |
    | t {name} | Tests value. Equivalent to test ... |

    Note regarding s ... alias. The CSS selector value is quoted. Therefore, you can write a CSS selector that includes spaces without putting the value in the quotes, e.g. s .foo .bar is equivalent to select ".foo .bar" {1}.

    Other alias values are not quoted. Therefore, if value includes a space it must be quoted, e.g. t "/foo bar/".

    Usage:

    import surgeon, {
      subroutineAliasPreset
    } from 'surgeon';
     
    const x = surgeon({
      subroutines: {
        ...subroutineAliasPreset
      }
    });
     
    x('s .foo .bar | t "/foo bar/"');
     

    In addition to the built-in aliases, users can declare their own subroutine aliases (see "Declare subroutine aliases").

    Expression reference

    Surgeon subroutines are referenced using expressions.

    An expression is defined using the following pseudo-grammar:

    subroutines ->
        subroutines _ "|" _ subroutine
      | subroutine
    
    subroutine ->
        subroutineName " " parameters
      | subroutineName
    
    subroutineName ->
      [a-zA-Z0-9\-_]:+
    
    parameters ->
        parameters " " parameter
      | parameter
    
    

    Example:

    x('foo bar baz', 'qux');
     

    In this example, the Surgeon query executor (x) is invoked with the expression foo bar baz and the starting value qux. The expression tells the query executor to run the foo subroutine with parameter values "bar" and "baz" and subject value "qux".
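
    As a rough illustration of the pseudo-grammar, the sketch below turns an expression string into a list of subroutine names and parameter values. It is a simplified stand-in rather than Surgeon's parser: parseExpression, splitOutsideQuotes and stripQuotes are hypothetical helpers, and the only quoting rule assumed is that pipes and spaces inside single or double quotes do not separate anything.

```javascript
// Splits `text` on `separator`, ignoring separators inside quotes.
const splitOutsideQuotes = (text, separator) => {
  const parts = [];
  let current = '';
  let quote = null;

  for (const character of text) {
    if (quote === null && (character === '"' || character === '\'')) {
      quote = character;
    } else if (character === quote) {
      quote = null;
    }

    if (quote === null && character === separator) {
      parts.push(current);
      current = '';
    } else {
      current += character;
    }
  }

  parts.push(current);

  return parts;
};

// Removes one pair of matching quotes wrapped around a token.
const stripQuotes = (token) => token.replace(/^(["'])(.*)\1$/, '$2');

// 'foo bar | baz' => [{name: 'foo', values: ['bar']}, {name: 'baz', values: []}]
const parseExpression = (expression) => {
  return splitOutsideQuotes(expression, '|').map((subroutine) => {
    const [name, ...values] = splitOutsideQuotes(subroutine.trim(), ' ')
      .filter((token) => token !== '')
      .map(stripQuotes);

    return {name, values};
  });
};

console.log(JSON.stringify(parseExpression('select ".foo .bar" | read attribute href')));
// [{"name":"select","values":[".foo .bar"]},{"name":"read","values":["attribute","href"]}]
```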

    Multiple subroutines can be combined using an array:

    x([
      'foo bar baz',
      'corge grault garply'
    ], 'qux');
     

    In this example, the Surgeon query executor (x) is invoked with two expressions (foo bar baz and corge grault garply). The first subroutine is executed with the subject value "qux". The second subroutine is executed with the result of the first subroutine as its subject value.

    The result of the query is the result of the last subroutine.

    Read user-defined subroutines documentation for broader explanation of the role of the parameter values and the subject value.

    The pipe operator (|)

    Multiple subroutines can be combined using the pipe operator.

    The following examples are equivalent:

    x([
      'foo bar baz',
      'qux quux quuz'
    ]);
     
    x([
      'foo bar baz | qux quux quuz'
    ]);
     
    x('foo bar baz | qux quux quuz');
     

    Cookbook

    Unless redefined, all examples assume the following initialisation:

    import surgeon from 'surgeon';
     
    /**
     * @param configuration {@see https://github.com/gajus/surgeon#configuration}
     */
    const x = surgeon();
     

    Extract a single node

    Use select subroutine and read subroutine to extract a single value.

    const subject = `
      <div class="title">foo</div>
    `;
     
    x('select .title | read property textContent', subject);
     
    // 'foo'
     

    Extract multiple nodes

    Specify select subroutine quantifier to match multiple results.

    const subject = `
      <div class="foo">bar</div>
      <div class="foo">baz</div>
      <div class="foo">qux</div>
    `;
     
    x('select .foo {0,} | read property textContent', subject);
     
    // [
    //   'bar',
    //   'baz',
    //   'qux'
    // ]
     

    Name results

    Use a QueryChildrenType object to name the results of the descending expressions.

    const subject = `
      <article>
        <div class='title'>foo title</div>
        <div class='body'>foo body</div>
      </article>
      <article>
        <div class='title'>bar title</div>
        <div class='body'>bar body</div>
      </article>
    `;
     
    x([
      'select article {0,}',
      {
        body: 'select .body | read property textContent',
        title: 'select .title | read property textContent'
      }
    ], subject);
     
    // [
    //   {
    //     body: 'foo body',
    //     title: 'foo title'
    //   },
    //   {
    //     body: 'bar body',
    //     title: 'bar title'
    //   }
    // ]
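
    One way to picture how an object step names its results is the sketch below. It is a simplification under stated assumptions, not Surgeon's evaluator: evaluate is a hypothetical function, steps are plain functions instead of expressions, and plain objects take the place of DOM nodes.

```javascript
// Runs the steps in order. A function step transforms the current value;
// an object step evaluates each child query against the current value and
// stores the result under the child's name. Array values are mapped over.
const evaluate = (steps, subject) => {
  return steps.reduce((current, step) => {
    const apply = (value) => {
      if (typeof step === 'function') {
        return step(value);
      }

      return Object.fromEntries(
        Object.entries(step).map(([name, childSteps]) => {
          return [name, evaluate(childSteps, value)];
        })
      );
    };

    return Array.isArray(current) ? current.map(apply) : apply(current);
  }, subject);
};

const page = {
  articles: [
    {body: 'foo body', title: 'foo title'},
    {body: 'bar body', title: 'bar title'}
  ]
};

const result = evaluate([
  (input) => input.articles,
  {
    body: [(article) => article.body],
    title: [(article) => article.title]
  }
], page);

console.log(JSON.stringify(result));
// [{"body":"foo body","title":"foo title"},{"body":"bar body","title":"bar title"}]
```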
     

    Validate the results using RegExp

    Use test subroutine to validate the results.

    const subject = `
      <div class="foo">bar</div>
      <div class="foo">baz</div>
      <div class="foo">qux</div>
    `;
     
    x('select .foo {0,} | read property textContent | test /^[a-z]{3}$/', subject);
     

    See error handling for information on how to handle test subroutine errors.

    Validate the results using a user-defined test function

    Define a custom subroutine to validate results using arbitrary logic.

    Use InvalidValueSentinel to leverage standardised Surgeon error handler (see error handling). Otherwise, simply throw an error.

    import surgeon, {
      InvalidValueSentinel
    } from 'surgeon';
     
    const x = surgeon({
      subroutines: {
        isRed: (value) => {
          if (value === 'red') {
            return value;
          }
     
          return new InvalidValueSentinel('Unexpected color.');
        }
      }
    });
     

    Declare subroutine aliases

    As you become familiar with the query execution mechanism, typing long expressions (such as select, read attribute and read property) becomes a mundane task.

    Remember that subroutines are regular functions: you can partially apply and use the partially applied functions to create new subroutines.

    Example:

    import surgeon, {
      readSubroutine,
      selectSubroutine,
      testSubroutine
    } from 'surgeon';
     
    const x = surgeon({
      subroutines: {
        ra: (subject, values, bindle) => {
          return readSubroutine(subject, ['attribute'].concat(values), bindle);
        },
        rp: (subject, values, bindle) => {
          return readSubroutine(subject, ['property'].concat(values), bindle);
        },
        s: (subject, values, bindle) => {
          return selectSubroutine(subject, [values.join(' '), '{1}'], bindle);
        },
        sm: (subject, values, bindle) => {
          return selectSubroutine(subject, [values.join(' '), '{1,}'], bindle);
        },
        t: testSubroutine
      }
    });
     

    Now, instead of writing:

    articles:
    - select article
    - body:
      - select .body
      - read property innerHTML
     

    You can write:

    articles:
    - sm article
    - body:
      - s .body
      - rp innerHTML

    The aliases used in this example are available in the aliases preset (read built-in subroutine aliases).

    Error handling

    Surgeon throws the following errors to indicate a predictable error state. All Surgeon errors can be imported. Use instanceof operator to determine the error type.

    Note:

    Surgeon errors are non-recoverable, i.e. a selector cannot proceed if it encounters an error. This design ensures that your selectors are capturing the expected data.

    | Name | Description |
    |---|---|
    | ReadSubroutineNotFoundError | Thrown when an attempt is made to retrieve a non-existent attribute or property. |
    | SelectSubroutineUnexpectedResultCountError | Thrown when a select subroutine result length does not match the quantifier expression. |
    | InvalidDataError | Thrown when a subroutine returns an instance of InvalidValueSentinel. |
    | SurgeonError | A generic error. All other Surgeon errors extend from SurgeonError. |

    Example:

    import {
      InvalidDataError
    } from 'surgeon';
     
    const subject = `
      <div class="foo">bar</div>
    `;
     
    try {
      x('select .foo | test /bar/', subject);
    } catch (error) {
      if (error instanceof InvalidDataError) {
        // Handle data validation error.
      } else {
        throw error;
      }
    }
     

    Return an InvalidValueSentinel from a subroutine to make Surgeon throw an InvalidDataError.
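
    The sentinel mechanism can be sketched in a few lines. The classes below are local stand-ins that only mirror the names used in this document (the real ones are exported by Surgeon), and runSubroutine is a hypothetical wrapper that shows where the conversion from sentinel to error could happen.

```javascript
// Local stand-ins mirroring the names used in this document.
class SurgeonError extends Error {}
class InvalidDataError extends SurgeonError {}

class InvalidValueSentinel {
  constructor (message) {
    this.message = message;
  }
}

// Runs a subroutine and converts a returned sentinel into a thrown error.
const runSubroutine = (subroutine, subject) => {
  const result = subroutine(subject);

  if (result instanceof InvalidValueSentinel) {
    throw new InvalidDataError(result.message);
  }

  return result;
};

const isRed = (value) => {
  return value === 'red' ? value : new InvalidValueSentinel('Unexpected color.');
};

console.log(runSubroutine(isRed, 'red')); // 'red'

try {
  runSubroutine(isRed, 'blue');
} catch (error) {
  // Because of the class hierarchy, both instanceof checks hold.
  console.log(error instanceof InvalidDataError, error instanceof SurgeonError); // true true
}
```

    Returning a sentinel instead of throwing keeps the subroutine itself free of error classes, while callers can still rely on instanceof to tell validation failures apart from other errors.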

    Debugging

    Surgeon uses roarr to log debugging information.

    Export the ROARR_LOG=TRUE environment variable to enable the Surgeon debug log.

    Install

    npm i surgeon

    License

    BSD-3-Clause

    Unpacked Size

    156 kB

    Total Files

    127

    Collaborators

    • gajus