limador

Powerful scraping and crawling library with anti-scraping measures, scalability, storage, static/dynamic content, a monitoring UI and more. Ready to deploy on cloud instances or serverless.

NOTE: THIS IS WORK IN PROGRESS. YOU CANNOT USE IT YET.

Motivation

There are plenty of scraping frameworks, but the ones we found were:

  • too simple (thin wrappers around Cheerio, Puppeteer, Playwright and the like)
  • too complex (over-engineered)
  • too closed (requiring a subscription to unleash all potential)
  • unscalable
  • difficult to deploy to the cloud
  • difficult to operate once deployed

Why Limador?

  • simple for basic scraping => You can read the docs, try it and see results in less than an hour.
  • powerful for complex projects => Including those that need scraping triggers, cron jobs, conditional/recursive crawling, multiple scrapers for different kinds of content, etc.
  • ready to be deployed to the cloud => either on a single cloud instance or serverless.
  • horizontally scalable => with serverless cloud functions or autoscaling pools of cloud instances.
  • able to deal with anti-scraping => multiple rotating proxies, sticky sessions, human-like HTTP requests (headers, cookies...), etc. (see the sketch after this list)
  • monitoring UI => track job progress, logs, results, etc.
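
Proxies, rate limits and sticky sessions are meant to be handled by Limador's configuration, but since scrapers receive the raw Puppeteer page you can also script human-like behaviour yourself. A minimal sketch (the random delay and mouse movement are plain Puppeteer calls, not a Limador API):

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

const scrapers = {
    'humanLikeScraper': async ({ page }) => {
        // pause for a random 0.5-2s to mimic human "think time"
        await sleep(500 + Math.random() * 1500)
        // a small mouse movement, as a real visitor would produce
        await page.mouse.move(100, 200)
        console.log(await page.title())
    }
}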

Getting Started

Creating a basic scraping project takes just a few commands:

cd repo-folder
npm init
npm install --save limador puppeteer

Let's extract the title of the Google home page using Limador. Create a file called index.js and add these contents:

const { Limador } = require('limador')

async function main() {
    let limador = await Limador.init({
        queue: { type: 'memory' },
        database: { type: 'sqlite-memory' },
        batches: {
            'scrape-google-title': {
                title: 'Scrape Google Title',
                // a batch declares a function that returns its jobs
                jobs: (params) => [{
                    url: 'https://google.com',
                    tool: 'puppeteer',
                    call: 'pageScraper',
                }]
            }
        },
        scrapers: {
            // the scraper receives the Puppeteer page for the job's url
            'pageScraper': async ({ page }) => {
                console.log(await page.title())
            }
        }
    })

    let batch = await limador.start('scrape-google-title')
    await batch.done()   // resolves once every job in the batch has finished
    await limador.stop()
}

main()

Run it with:

> node index.js

Limador is running...
See progress, logs and scraped data at http://localhost:4300
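
Because jobs is a function that returns an array, a single batch can fan out over several pages with the same scraper. A minimal variant of the batches config above (the second URL is just an illustration):

batches: {
    'scrape-titles': {
        title: 'Scrape Titles',
        jobs: (params) => [
            { url: 'https://google.com', tool: 'puppeteer', call: 'pageScraper' },
            { url: 'https://example.com', tool: 'puppeteer', call: 'pageScraper' }
        ]
    }
}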

Full example

The following is a full scraper with all features:

  • anti-scraping with proxies, rate limits and human-like browsing
  • development and production environments
  • horizontal scaling
  • progress UI
  • storage of results in a database
  • storage of screenshots in a bucket
  • recursive crawling

index.js:

const { Limador, DataTypes } = require('limador')

const scrapers = {
    'pageScraper': async ({ job, db, page, batch }) => {
        // save a screenshot named after the page title
        const title = await page.title()
        await page.screenshot({ path: title + '.png' })

        // store the needed data in the database (rows match the 'Foo'
        // model defined in master() below)
        const data = await page.evaluate(() => {
            return Array.from(document.querySelectorAll('#text-field'))
                .map((el) => ({ name: el.textContent }))
        })
        for (const elem of data)
            await db.insert('Foo', elem)

        // collect the links that need to be crawled next
        const links = await page.evaluate(() => {
            return Array.from(document.querySelectorAll('a')).map((a) => a.href)
        })
        const cookies = await page.cookies()
        const jobs = links.map((link) => ({
            ...job.config,   // inherit this job's configuration
            url: link,
            cookies,         // keep the session across crawled pages
            cron: null       // crawled links are one-off jobs, not cron jobs
        }))

        // enqueue the follow-up jobs: this is what makes the crawl recursive
        await batch.queue.addJobs(jobs)
    }
}

const queue = process.env.NODE_ENV === 'production' ? {
    type: 'sqs',
    accessKeyId: 'your AWS access key id',
    secretAccessKey: 'your AWS secret access key',
    name: 'queue name'
} : { type: 'memory' }
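
// The hardcoded credentials above are placeholders. In practice you would
// read them from the environment; a hedged variant (these variable names
// are a common convention, not something Limador requires):
//
// const queue = process.env.NODE_ENV === 'production' ? {
//     type: 'sqs',
//     accessKeyId: process.env.AWS_ACCESS_KEY_ID,
//     secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
//     name: process.env.QUEUE_NAME
// } : { type: 'memory' }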

async function slave() {
    let limador = new Limador({
        slave: true,
        queue,
        scrapers,
        database: { type: 'sqlite-memory' }
    })
    await limador.init()
}

async function master() {
    let limador = await Limador.init({
        maxcpu: 50,     // resource thresholds that trigger onLimits
        maxmem: 50,
        onLimits: () => { console.log('Limits reached') },
        api: { port: 5000 },
        queue,
        scrapers,
        database: { type: 'sqlite-memory' },
        batches: {
            'batch-name': {
                title: 'batch title',
                params: [
                    { name: 'cron', title: 'Periodicity', type: 'CRON', default: '0 3 * * *' },
                    { name: 'url', title: 'URL', type: 'STRING', default: 'https://google.es' }
                ],
                jobs: (params) => [{
                    url: params.url.value,
                    tool: 'puppeteer',
                    call: 'pageScraper',
                }]
            }
        }
    })

    await limador.db.defineModel('Foo', {
        name: DataTypes.TEXT
    })
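    // ('Foo' is the model that pageScraper's db.insert('Foo', ...) writes
    //  into: each inserted row only needs a `name` text field to match it)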

    let batch = await limador.start('batch-name')  
    await batch.done()
    await limador.stop()
}

if (process.env.MASTER || process.env.NODE_ENV !== 'production')
    master()
else
    slave()
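
Given the branch above, a production deployment starts one master and any number of slaves from the same file; the environment variables are exactly the ones the script reads:

> MASTER=1 NODE_ENV=production node index.js
> NODE_ENV=production node index.js

The first command starts the master (it defines the batches and exposes the API); the second, run on as many machines as needed, starts slaves that consume jobs from the shared SQS queue.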
