atlas-query-stream

A query engine for atlas-html-stream which captures information from an html stream without using a DOM.

install

npm install --save atlas-query-stream

why

I found cheerio's API to be large and counter-intuitive, forcing me to look up syntax for a plethora of functions and types. Using a DOM-based query engine is a no-go for large documents and html streaming where constructing a DOM is not possible.

I want to be able to query html files intuitively without needing the entire file in memory. This package lets you define recursive crawlers that fetch information from a continuous stream of html nodes. Once you find the information you are looking for, you can end the html stream immediately without reading the rest of the file. The query stream will automatically unpipe itself from the readable's destination once all remaining queries have been popped off the stack. You can use atlas-html-stream to obtain a continuous stream of html nodes from a file.

introduction

transient dom

The queries you define are walkers that crawl on a transient DOM, or TDOM. The TDOM is transient in the sense that it doesn't exist all at once. When writing queries, you can pretend it exists and can be crawled.

queries

The queries specified here are basic functions which take an open or text node and return either a truthy result, falsy, or another query. Queries may be wrapped in an array to indicate that they are recursive. A basic query will only be executed until it finds the first result, whereas a recursive query is executed indefinitely, allowing many results to be found. You may abort queries prematurely by returning null. Nested queries are queries which return other queries, which is useful for walking deep into the TDOM.

input

This query runner takes "html nodes" of the following form:

Open tags: {name: String, data: Object}
Text node: {text: String}
Close tag: {name: String}

All text nodes are childless, therefore do not require a closing tag. Self-closing tags are split into an open tag and a close tag. If you use atlas-html-stream as your html parser, there's nothing to worry about. If you use a custom html parser, you can normalize the node data format manually before piping the output to the query stream.

examples

For these examples, we'll be using the atlas-html-stream html parser and we'll assume we have a file called index.html. The source html doesn't need to be a local file -- perhaps we're piping website data from a socket to our parser.

basic queries

Sometimes we want to look up a single html tag that satisfies some conditions, then output the information we found. Let's look up the DOCTYPE of our html file:

const QueryStream = require("atlas-query-stream");
const file = require("fs").createReadStream("./index.html", {highWaterMark: 5});
const HtmlParser = require("atlas-html-stream");
// define our query
const doctypeQuery = node => {
  if (node.name === "!DOCTYPE"){
    // return a result if we're at the correct node
    return {isHtml: "html" in node.data}
  }
}
// read the file, pipe the parsed nodes to our query stream
file
  .pipe(new HtmlParser())
  .pipe(new QueryStream(doctypeQuery))
  .on("data", data => console.log("html?", data.isHtml))
  .on("end", () => console.log("done reading, parsing and querying the file"))

Every 5 bytes we read are sent to the parser, which determines whether or not there's a complete node ready to send to the query stream. If there is, it sends the node to the query stream and our doctypeQuery is executed against the node.

Once a basic query returns a result, it is popped off the query stack and never runs again. Once there are no queries left to run, the query stream automatically unpipes itself from the parser. In this case, our query would find its result in the first 15 bytes of the file, meaning this code only ever reads, parses and queries 15 bytes of the file, regardless of how big the file is.

recursive queries

Other times, we want to fetch data from multiple tags in an html document, like tabular data or list element data. Suppose we have the following document:

<div id="atlassubbed-packages">
  <a href="https://github.com/atlassubbed/atlas-npm-init">Npm Starter</a>
  <a href="https://github.com/atlassubbed/atlas-webpack-init">Webpack Starter</a>
  <a href="https://github.com/atlassubbed/atlas-html-stream">Html Parser</a>
  <a href="https://github.com/atlassubbed/atlas-quintic-smoothing">x<sup>5</sup> Smoothing</a>
  <a href="https://github.com/atlassubbed/atlas-vector-noise">Perlin Noise</a>
</div>

Let's write a recursive query which scrapes all of the urls for us:

...
const urlQuery = [({name, data}) => {
  if (name === "a"){
    return data.href
  }
}]
 
file
  .pipe(new HtmlParser())
  .pipe(new QueryStream(urlQuery))
  .on("data", url => console.log(`scraped url: ${url}`))
  .on("end", () => console.log("done reading file"))

One thing to note here is that the query is wrapped in an Array literal, []. This tells the query stream to keep running the query even after it finds the first result.

aborting queries

Sometimes you want to write a recursive query, but only need it to return a limited number of results. In this case, we can abort the query by returning null:

let numResults = 0;
const firstThreeUrlsQuery = [({name, data}) => {
  if (numResults === 3) return null;
  if (name === "a"){
    numResults++;
    return data.href;
  }
}]
...

If you don't like keeping query-specific data outside of the query as we did in this example, you should use nested queries and keep information in a top-level query closure.

nested queries

Basic and recursive queries are the building blocks for all other queries. A nested query is a query which returns another query, which we'll call a subquery. Let's continue with the previous example, except this time, assume that our html snippet is part of a much larger document which contains thousands of anchor tags. In this case, we only the first three atlassubbed package urls! The query in the example above would return the first three urls for any anchor tag. Let's fix it using a nested query:

...
// use a nested query to limit the range of a recursive subquery
const onlyAtlassubbedUrlQuery = ({data}) => {
  // since html tag ids are unique, we don't need to check the tag name
  if (data.id === "atlassubbed-packages"){
    // start counting our anchor results
    let numResults = 0;
    return [({name, data}) => {
      if (numResults === 3) return null;
      if (name === "a"){
        numResults++;
        return data.href;
      }
    }]
  }
}
...

There is much more possible with nested queries: you can nest basic subqueries inside of recursive queries, you can nest many levels of queries -- do whatever you need to do for your use case.

advanced examples

If you've made it to this section, then you already know everything you need to know to start using this library. The following examples and discussion will go over certain cases you may run into.

querying fractions of a file

To better understand how the query stream works, let's try and query ./test/assets/app.html. We only care about the ul at the beginning of the file, which starts at index 113 and ends at index 236:

<!-- ./test/assets/app.html -->
<ul>
  <li style="color: blue;">Home</li>
  <li><a href="./about.html">About</a></li>
  <li>Posts</li>
</ul>

We want to write two queries:

Get the names of each page in the ul "navbar".
Get the very first link we find.

Let's use two QueryStreams for illustrative purposes -- although you can do this with one of them. We'll also be writing a bunch of output so you can see exactly what is happening under the hood:

const QueryStream = require("atlas-query-stream")
const HtmlParser = require("atlas-html-stream");
const { createReadStream } = require("fs");
const str = d => JSON.stringify(d);
 
// create our streams
const opts = {highWaterMark: 12, start: 113, end: 236};
const file = createReadStream("./test/assets/app.html", opts);
const parser = new HtmlParser();
const getPageNames = new QueryStream([({name}) => {
  if (name === "li") {
    return ({text}) => text;
  }
}])
const getFirstLink = new QueryStream(({name, data}) => {
  if (name === "a"){
    return data.href
  }
})
 
// begin the job
console.log("start: reading file")
file
  .on("data", d => console.log(`  read: ${d.length} bytes`))
  .on("end", () => console.log(`end: read file`))
console.log("start: parsing file")
const nodeStream = file
  .pipe(parser)
  .on("data", d => console.log(`  parsed: ${str(d)}`))
  .on("end", () => console.log(`end: parsed file`))
console.log("start: running basic query")
nodeStream
  .pipe(getFirstLink)
  .on("data", d => console.log(`    getFirstLink result: ${str(d)}`))
  .on("end", () => console.log(`end: getFirstLink query`))
console.log("start: running recursive query")
nodeStream
  .pipe(getPageNames)
  .on("data", d => console.log(`    getPageNames result: ${str(d)}`))
  .on("end", () => console.log(`end: getPageNames query`))

Since the getFirstLink query will complete roughly halfway into the file, we can expect the stream to end earlier than the getPageNames query. The getPageNames query is recursive, so it will run until the end of the file. Let's take a look at the output:

start: reading file
start: parsing file
start: running basic query
start: running recursive query
  read: 12 bytes
  parsed: {"name":"ul","data":{}}
  read: 12 bytes
  read: 12 bytes
  parsed: {"name":"li","data":{"style":"color: blue;"}}
  read: 12 bytes
  parsed: {"text":"Home"}
    getPageNames result: "Home"
  parsed: {"name":"li"}
  read: 12 bytes
  parsed: {"name":"li","data":{}}
  read: 12 bytes
  read: 12 bytes
  parsed: {"name":"a","data":{"href":"./about.html"}}
    getFirstLink result: "./about.html"
end: getFirstLink query
  read: 12 bytes
  parsed: {"text":"About"}
    getPageNames result: "About"
  parsed: {"name":"a"}
  parsed: {"name":"li"}
  read: 12 bytes
  parsed: {"name":"li","data":{}}
  read: 12 bytes
  parsed: {"text":"Posts"}
    getPageNames result: "Posts"
  parsed: {"name":"li"}
  read: 4 bytes
  parsed: {"name":"ul"}
end: read file
end: parsed file
end: getPageNames query

As you can see, the getFirstLink query stream immediately ends after finding its result, because it was a non-recursive query. The only reason it reads the entire section of the file is so that it can continue getting potential results for the recursive query. Let's see what happens if we run the exact same code, but comment out our recursive query:

...
// begin the job
console.log("start: reading file")
file
  // .on("data", d => console.log(`  read: ${d.length} bytes`))
  .on("end", () => console.log(`end: read file`))
console.log("start: parsing file")
const nodeStream = file
  .pipe(parser)
  .on("data", d => console.log(`  parsed: ${str(d)}`))
  .on("end", () => console.log(`end: parsed file`))
console.log("start: running basic query")
nodeStream
  .pipe(getFirstLink)
  .on("data", d => console.log(`    getFirstLink result: ${str(d)}`))
  .on("end", () => console.log(`end: getFirstLink query`))
// console.log("start: running recursive query")
// nodeStream
//   .pipe(getPageNames)
//   .on("data", d => console.log(`    getPageNames result: ${str(d)}`))
//   .on("end", () => console.log(`end: getPageNames query`))

We've commented out the entire recursive query, as well as our event listener which logs the bytes read from the file. Remember, node.js will continue to pipe something as long as you are listening for the data. Let's look at the output:

start: reading file
start: parsing file
start: running basic query
  parsed: {"name":"ul","data":{}}
  parsed: {"name":"li","data":{"style":"color: blue;"}}
  parsed: {"text":"Home"}
  parsed: {"name":"li"}
  parsed: {"name":"li","data":{}}
  parsed: {"name":"a","data":{"href":"./about.html"}}
    getFirstLink result: "./about.html"
end: getFirstLink query
end: read file

This time, the file stops streaming data to the parser as soon as the query returns its result. There's no need to continue reading and parsing the file if the query stream no longer has any queries on its stack.

writing smarter queries

Basic queries are preferred over recursive queries, since they are not executed after finding a result. If you must use a recursive query, ask yourself whether or not you can nest it inside a basic query to narrow down the html subtree in which it runs. When writing a query, it's best to return a non-null falsy as soon as you know the query will fail, so you can avoid doing unecessary processing.

For example, suppose your document has 9,020 li tags, but you only need to query the 20 li tags inside <ul id="1">. The following query could work:

const liQuery = [node => {
  if (node.name === "li"){
    // do some expensive processing and return a result
  }
}];

But, it will be executed on your entire node set, and it will do expensive processing on over 9,000 nodes. Instead, nest your recursive liQuery query inside of a basic query:

...
const betterQuery = ({data}) => {
  if (data && data.id === "1"){
    return liQuery;
  }
}

This query will limit the scope of your liQuery to the subtree of <ul id="1">. In other words, the liQuery is only executed on the 20 li tags inside of the target ul.

multiple queries

All the queries you write can be executed on the same query stream:

...
const queries = require("./queries");
// all of your queries will run
const queryStream = new QueryStream(...queries)

queries as modular plugins

It would be trivial to export a query factory as an npm package, which can then be imported and used like any other query. Let's assume there's a third party Reddit comment-scraping query which emits {author, text, url} for each comment on a Reddit page. To use it, all we need to do is import it and pass it into our QueryStream:

const QueryStream = require("atlas-query-stream");
// third party query factory, call the module to get an instance of the query
const pluginQuery = require("some-reddit-comments-query-package")()
// my query which outputs upvote counts for each comment
const upvotesQuery = require("./reddit-upvotes-query")
const engine = new QueryStream(upvotesQuery, pluginQuery)
...

In this example, the query stream will output data for the upvotes for each comment (from our query), as well as {author, text, url} objects thanks to the plugin.

reusing queries

Since subqueries are tracked within the scope of their parent node, make sure you use separate instances of a subquery for different scopes. For example, this is fine:

// works for a basic query
const { subquery } = require("./subquery");
const query = ({data}) => {
  if (name === "div"){
    // even if there are multiple divs in the tree,
    // this block runs once, since it's a basic query
    // thus the subquery will only run in one subtree
    return subquery
  }
}

However, the following will not work as expected, since we are using the same subquery for potentially multiple different subtrees:

// doesn't work for a recursive query
const { subquery } = require("./subquery")
const query = [({name}) => {
  if (name === "div"){
    // this block may run several times
    return subquery
  }
}]

The fix is pretty easy with a factory, which returns a new subquery every time it is called:

// works
const { makeSubquery } = require("./subquery");
const query = [({name}) => {
  if (name === "div"){
    return makeSubquery()
  }
}]

In the above case, each time the if block is run, it returns a unique subquery. If you are writing queries, always wrap them in an arrow function so that the caller can use multiple instances if they need to.

caveats

returning data

If a query returns an Array or a Function, the query stream will assume you are returning another query. Results should be any truthy value except an Array or Function. If your query needs to return an array of data, return an object with an array field instead: {results: yourArray}. Returning any falsy value will tell the query stream that the current query did not find any result. In the case of a return value other than null, the query will be re-run in the (sub)tree in which it was started. If null is returned, the query will be aborted.

query order

If you're running multiple queries (e.g. new QueryStream(...queries)), make sure your queries are pure and do not depend on each other. This means that code inside query A should not talk to code inside query B where stream = new QueryStream(A, B). Of course, it's perfectly fine for A and B to store information across their own closures (e.g. for nested queries).

malformatted html

Non-nested queries will work fine if there are missing closing tags. Nested queries won't work as expected if there are missing closing tags in the scope of the subquery, since the query stream uses the nesting level to decide whether or not it should keep running a subquery.

todo

subtrees and substrings

Theoretically, you should be able to write a self-recurring query function which outputs DOM subtrees or html substrings in very few lines of code, although doing so may require us to execute queries on closing nodes when their subtree expires.

atlas-query-stream

atlas-query-stream

install

why

introduction

transient dom

queries

input

examples

basic queries

recursive queries

aborting queries

nested queries

advanced examples

querying fractions of a file

writing smarter queries

multiple queries

queries as modular plugins

reusing queries

caveats

returning data

query order

malformatted html

todo

subtrees and substrings

Versions

Current Tags

Version History

Package Sidebar

Install

Repository

Homepage

Weekly Downloads

Version

License

Unpacked Size

Total Files

Last publish

Collaborators

atlas-query-stream

atlas-query-stream

install

why

introduction

transient dom

queries

input

examples

basic queries

recursive queries

aborting queries

nested queries

advanced examples

querying fractions of a file

writing smarter queries

multiple queries

queries as modular plugins

reusing queries

caveats

returning data

query order

malformatted html

todo

subtrees and substrings

Versions

Current Tags

Version History

Package Sidebar

Install

Repository

Homepage

DownloadsWeekly Downloads

Version

License

Unpacked Size

Total Files

Last publish

Collaborators

Weekly Downloads