crawlandparse

Crawl and Parse Sites and Shops

  • Continuous parsing, price monitoring.
  • High throughput, autoscaling, proxies, throttling.
  • API designed for simplicity and robustness.
  • Alerts and Analytics (soon...).

Theory

Crawling works as a sequence of small steps. On each step the parser does the following:

  • Take a url
  • Download it
  • Parse the downloaded HTML
  • Produce new urls to visit in the next steps
  • Extract and store the data we are interested in

On the next step the parser takes one of the newly produced urls and processes it in the same way, and so on.

The good news is that you don't have to worry about these complexities; you only need to write a function that takes HTML as input and produces a list of new urls and extracted data as output. New urls are called entries, and the extracted data is called objects.

HTML => your-function => { entries: [...], objects: [...] }
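
Schematically, such a function is registered as a handler on the parser. The sketch below only previews the shape of the API used in the example that follows (parser.on, get, parse); the page type name is a placeholder.

const { Parser } = require('crawlandparse/parser')
const { get } = require('crawlandparse/http')
const { parse } = require('crawlandparse/parse')

const parser = new Parser('example')

// A schematic handler: download the page, parse the HTML, and return
// new entries (urls to visit later) and objects (extracted data).
parser.on('some-page-type', async ({ entry: { url }, proxy }) => {
  const html = parse(await get({ url, proxy }))
  return { entries: [], objects: [] }
})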

Example

You need a password to continue; contact us to get one.

Check your Node.js version with node -v; it should be at least v7. You can use nvm or another method to install it:

nvm install v7.2.1
nvm alias default v7.2.1
node -v

Install crawlandparse

npm install crawlandparse

Let's create the skeleton of our parser. Create a file blog.js:

const { Parser } = require('crawlandparse/parser')
const { get } = require('crawlandparse/http')
const { parse } = require('crawlandparse/parse')
const { upload } = require('crawlandparse/files')

const parser = new Parser('blog')

Take a look at the example blog we'll be parsing: http://crawlandparse.com/examples/blog . Note that there are two types of pages: the index, a paginated list of posts, and the page for each post itself.

We need to provide a list of seed urls; these urls will be the entry point to the site. Every url must have an ID, and it's better to use a meaningful ID to make debugging easier.

const homePage = {
  type : 'index',
  id   : 'seed',
  url  : `http://crawlandparse.com/examples/blog`
}
parser.seeds([homePage])

Now we need to write the parsing function that takes HTML as input, produces new urls, and extracts useful data.

We need to parse two types of pages: index and post. Instead of writing one complex function that handles both, we'll write a separate function for each page type.

Let's write a function that parses an index page and extracts the list of links to posts:

parser.on('index', async ({ entry: { url }, proxy }) => {
  // Downloading and parsing HTML
  const html = parse(await get({ url, proxy }))

  // Extracting links to posts and storing them as entries; they will
  // be processed in the next steps.
  const entries = html.all('.post').toArray().map((post) => ({
    type: 'post',
    // It's better to use meaningful IDs. If that's not possible, `hash(url)`
    // could be used as the ID (see the sketch after this code).
    id:   post.attr('a', 'data-id'),
    url:  `http://crawlandparse.com${post.attr('a', 'href')}`
  }))

  return { entries }
})
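
The comment above mentions hash(url). If a hash helper isn't at hand, a short, stable ID can be derived from the url with Node's built-in crypto module. A minimal sketch (the helper name is made up for illustration):

const crypto = require('crypto')

// Derive a short, stable ID from a url when no meaningful ID is available.
function hashUrl(url) {
  return crypto.createHash('sha1').update(url).digest('hex').slice(0, 16)
}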

It's time to test it. Add the following temporary code; it will run the parser just once and parse the homePage link.

async function test() {
  const result = await parser.process(homePage)
  console.log(result)
}
test().catch((e) => console.error(e))

Execute the following command to run it:

user=<your-name> token=<your-token> node blog.js

You should see something like the output below. Success: we extracted two new links pointing to post pages.

{
  entries: [
    {
      type: 'post',
      id:   'post-1',
      url:  'http://crawlandparse.com/examples/blog/post-1.html'
    },
    {
      type: 'post',
      id:   'post-2',
      url:  'http://crawlandparse.com/examples/blog/post-2.html'
    }
  ],
  objects: []
}

Now add the function to parse the actual posts:

parser.on('post', async ({ entry: { id, url }, proxy }) => {
  const html = parse(await get({ url, proxy }))

  // Downloading images and uploading them to file storage.
  // `host` is defined here because the image urls in the page are relative,
  // e.g. `/examples/blog/images/1.jpg`.
  const host = 'http://crawlandparse.com'
  const imageUrls = html.attrs('img', 'src')
  const images = await upload({
    parser: parser.id,
    host,
    files:  imageUrls,
    proxy
  })

  return {
    objects: [{
      id,
      url,
      title: html.text('.title'),
      text:  html.text('.body'),
      images
    }]
  }
})

And change the temporary code to the following, so that it now parses a single post:

async function test() {
  let result = await parser.process({
      type : 'post',
      id   : '/examples/blog/post-1.html',
      url  : 'http://crawlandparse.com/examples/blog/post-1.html'
    })
  console.log(result)
}
test().catch((e) => console.error(e))

You should see something like the data below. Success: we parsed the post and extracted its title and text.

{
  entries: [],
  objects: [
    {
      id:    'http://crawlandparse.com/examples/blog/post-1.html',
      url:   'http://crawlandparse.com/examples/blog/post-1.html',
      title: 'Post 1 title',
      text:  'Post 1 body',
      images: [
        {
          originalUrl: '/examples/blog/images/1.jpg',
          url:         '/files/alex/2/53/48/5348783a978d88de',
          hash:        '5348783a978d88de',
          size:        18732
        }
      ]
    }
  ]
}

So far we have written two functions and successfully tested them in isolation. Let's run the actual parser now. Remove the temporary code completely, put this single line at the end of the file, and run it:

parser.run()

You should see the parser print some messages to the console about processing posts, and after a couple of seconds it stops. Open this link to see what has been downloaded:

http://crawlandparse.com/users/<your-username>/parser/blog/latest-objects?token=<your-password>

You should see something like

[
  {
    id:     "post-2",
    url:    "http://crawlandparse.com/examples/blog/post-2.html",
    title:  "Post 2 title",
    text:   "Post 2 body",
    images: [
      {
        hash: "ad802943b26f6e63",
        size: 45531,
        url: "/files/alex/2/ad/80/ad802943b26f6e63",
        originalUrl: "/examples/blog/images/2.jpg"
      }
    ]
  },
  {
    id: "post-1",
    url: "http://crawlandparse.com/examples/blog/post-1.html",
    title: "Post 1 title",
    text: "Post 1 body",
    images: [
      {
        hash: "5348783a978d88de",
        size: 18732,
        url: "/files/alex/2/53/48/5348783a978d88de",
        originalUrl: "/examples/blog/images/1.jpg"
      }
    ]
  }
]

If you need to inspect a single entry or object, use the following URLs:

http://crawlandparse.com/users/<your-username>/parser/blog/entries/<entry-id>?token=<your-password>
http://crawlandparse.com/users/<your-username>/parser/blog/objects/<object-id>?token=<your-password>
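
For example, to fetch one of the entries produced above from the command line (keep the placeholders, they are your own credentials):

curl "http://crawlandparse.com/users/<your-username>/parser/blog/entries/post-1?token=<your-password>"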

We successfully parsed the blog and extracted useful data. In the next sections we'll see how to handle pagination and tell the parser to upload extracted data to your app.
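
As a quick preview of the pagination part: it usually just means producing another index entry from within the index handler. A minimal sketch, assuming the index page exposes the next page through an a.next link and that html.attr returns a falsy value when nothing matches (both are assumptions, not taken from the example blog):

parser.on('index', async ({ entry: { url }, proxy }) => {
  const html = parse(await get({ url, proxy }))

  const entries = html.all('.post').toArray().map((post) => ({
    type: 'post',
    id:   post.attr('a', 'data-id'),
    url:  `http://crawlandparse.com${post.attr('a', 'href')}`
  }))

  // Assumed selector: follow the pagination link, if any, by producing another index entry.
  const nextHref = html.attr('a.next', 'href')
  if (nextHref) {
    entries.push({
      type: 'index',
      id:   nextHref,
      url:  `http://crawlandparse.com${nextHref}`
    })
  }

  return { entries }
})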

As you noticed, after processing the blog the parser didn't exit; it just stopped and did nothing. And if you run it again, nothing happens.

This is intended behavior: by default every page has an expiration time of 2 days, so the parser actually never stops. It processes all the pages, waits until pages become obsolete, and then updates them again. In our case the parser would reprocess all the pages after two days. Take a look at the API for more config options.

But when debugging a parser it's inconvenient to wait for 2 days, so there's an option to reset it. Replace parser.run() with the following code to reset the parser:

parser.clean()

After that you can run the parser again. But constantly dropping the database and re-running the parser is still inconvenient. A much faster way is to use parser.process, as we saw earlier.

If you have any questions, refer to the API Docs and to Parser, Entry, ParsedObject, parse, and helpers.

The complete source code of this example.

Additional API

There are ignore and indexHit methods in the parser/manager-client file; see their docs.
