@junaid1460/crawler

1.0.1 • Public • Published

Simple crawler

Crawls hyperlinks starting from a provided base URL.

Example:

import fetch from 'node-fetch';
import { Crawler, ThrottledAsyncCalls } from '@junaid1460/crawler';

const mediumHostName = "medium.com";

const mediumCrawler = new Crawler(
    // Throttled fetch: at most 5 requests in flight at once
    ThrottledAsyncCalls.wrap({
        concurrency: 5,
        func: fetch
    }).func,
    {
        baseUrl: `https://${mediumHostName}`,
        hostName: mediumHostName,
        startUrl: `https://${mediumHostName}`,
        depth: 3,       // follow links up to 3 levels deep
        verbose: true
    }
);

mediumCrawler.start().then((result) => {
    // Now process the crawled data
});

The package also includes a simple wrapper that limits the number of concurrent calls to an async function.

Example:

import { ThrottledAsyncCalls } from '@junaid1460/crawler';

async function test(x: number) {
    return x + 1;
}

// Simple and powerful
const { func, object: boundObject } = ThrottledAsyncCalls.wrap({
    concurrency: 4,  // max concurrent calls
    func: test       // function to wrap
});


// Call it like below
function start(index: number) {
    return Promise.all([
        func(0).then(e => console.log(index)),
        func(0),
        func(0),
        func(0),
        func(0),
        func(0),
        func(0),
        func(0),
        func(0).then(e => console.log(index))
    ]);
}

start(1).then(e => {
    console.log("After all tasks done executing", e);
});
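
For intuition, a limiter like this can be built from an in-flight counter plus a queue of parked callers. The sketch below is purely illustrative (a hypothetical implementation, not the package's actual code):

// Hypothetical sketch of a concurrency limiter; not this package's real code.
function limitConcurrency<A extends unknown[], R>(
    concurrency: number,
    func: (...args: A) => Promise<R>
): (...args: A) => Promise<R> {
    let active = 0;                        // calls currently in flight
    const waiters: Array<() => void> = []; // callers parked, waiting for a slot

    return async (...args: A): Promise<R> => {
        // Park until a slot frees up; re-check the count after every wake-up
        while (active >= concurrency) {
            await new Promise<void>(resolve => waiters.push(() => resolve()));
        }
        active++;
        try {
            return await func(...args);
        } finally {
            active--;
            waiters.shift()?.();           // wake one parked caller, if any
        }
    };
}

// Behaves like ThrottledAsyncCalls.wrap({ concurrency: 4, func: test }).func
const limitedTest = limitConcurrency(4, test);

Because each woken caller re-checks the counter before proceeding, the limit holds even when new calls race with parked ones.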

License

MIT

Install

npm i @junaid1460/crawler
