
    0.4.1 • Public • Published

    Turbo Crawl

    The simple and fast crawling framework.


    Turbo Crawl is designed with the following principles:

    • Works out of the box - The default settings will produce a useful output.
    • Streams - Every interface is implemented using Node.js Streams for high performance.
    • Easy Plugins - TypeScript interfaces can be implemented to customize functionality.

    Quick Start

    Open a terminal in an empty directory.

    $ npm i tcrawl

    $ alias tcrawl="node_modules/tcrawl/build/cli.js"

    $ tcrawl

    You now have access to the CLI and can run Turbo Crawl with default settings. Refer to the CLI documentation.


    import { Server } from "turbocrawl"
    /* const { Server } = require("turbocrawl") */
    const server = new Server()
    server.listen(() => {
      console.log("Turbo Crawl server is listening on port 8088")
    })

    Now that the server is running, you can interact with it using the Client.

    import { Client } from "turbocrawl"
    /* const { Client } = require("turbocrawl") */
    const client = new Client(8088, "localhost")
    client.crawl(["www.cnn.com", "www.nytimes.com", "www.newyorker.com"], (statusCode, response) => {
      console.log(statusCode, response)
    })
    client.listCrawlers((crawlers) => {
      console.log(crawlers)
    })


    The Server constructor takes 3 optional arguments:

    constructor(port: number = 8088, 
                host: string = "localhost", 
                crawlerFactory: ICrawlerFactory = new DomainCrawlerFactory())

    Crawlers and the Crawler Factory

    The entry point to customization is crawlerFactory which is responsible for creating a crawler object for each domain that is crawled.

    interface ICrawlerFactory {
      create(domain: URL): ICrawler
    }

    Note that a crawler takes a domain name URL as input, and is meant for crawling a single website.
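    To make the factory contract concrete, here is a minimal, self-contained sketch. The `LoggingCrawlerFactory` name and the stripped-down local stand-ins for ICrawler and ICrawlerFactory are illustrative only, not part of the package:

    ```typescript
    // Minimal stand-ins mirroring the shapes shown above (sketch only).
    interface ICrawler {
      domain: URL
    }

    interface ICrawlerFactory {
      create(domain: URL): ICrawler
    }

    // A hypothetical factory that logs each domain it builds a crawler for.
    class LoggingCrawlerFactory implements ICrawlerFactory {
      public create(domain: URL): ICrawler {
        console.log(`creating crawler for ${domain.hostname}`)
        return { domain }
      }
    }

    const factory = new LoggingCrawlerFactory()
    const crawler = factory.create(new URL("https://www.cnn.com"))
    ```

    The server calls your factory once per domain, so any per-domain setup (rate limits, output paths) belongs in `create`.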

    There are 4 components to a Crawler. The interface is built with streams, so the arrows signify the flow of data from left to right: from "www.domain.com" to a Stream.Writable.

    Turbo Crawl Pipeline

    Each component has a default implementation that works out of the box.

    Default Crawler Pipeline

    You most likely want to use the default URL Handler and customize the other 3 components as you see fit. The easiest way to do this is to extend the default DomainCrawler class and replace the detector, scraper, and consumer with your own classes in the constructor.

    Domain Crawler

    class DomainCrawler extends EventEmitter implements ICrawler {
      constructor(domain: URL,
                  consumer: ICrawlConsumer,
                  scraper?: IScraper,
                  detector?: ILinkDetector)
    }

    class MyCustomCrawler extends DomainCrawler {
      constructor(domain: URL) {
        super(domain,
              new MyCustomConsumer(),
              new MyCustomScraper(),
              new MyCustomLinkDetector())
      }
    }

    You can now create the server with your custom crawler factory.

    class MyCustomCrawlerFactory implements ICrawlerFactory {
      public create(domain: URL): DomainCrawler {
        return new MyCustomCrawler(domain)
      }
    }

    import { Server } from "turbocrawl"
    /* const { Server } = require("turbocrawl") */
    const server = new Server(8088, "localhost", new MyCustomCrawlerFactory())
    server.listen(() => {
      console.log("Turbo Crawl server is listening on port 8088 with custom crawler")
    })

    Of course you can choose to customize all or none of these components to suit your needs.

    URL Detector

    interface ILinkDetector extends Readable {
      domain: URL
      options?: any
      getLinkCount(): number
    }

    The URL Detector is given just a domain name as input and is responsible for finding URLs to scrape on that domain. There is only one URL Detector object per crawler. It is implemented as a Readable Stream so whenever your class has discovered a URL, it should write it to the stream as a string: this.push(url.href). When the stream ends, the getLinkCount() method should return the number of URLs that were detected to ensure that resources are cleaned up properly.

    The default Detector will find a website's Sitemap, usually in its robots.txt, and then extract webpages that have been modified in the past 48 hours. It is useful for news websites which often have up-to-date and valid sitemaps. The implementation is not tolerant of invalid sitemap entries; it will only find webpages since a specified date, and ignores sitemap entries without a valid date.
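    As a sketch of what a custom detector can look like, here is a self-contained Readable that "discovers" a hard-coded list of URLs. The class name and fixed-list behavior are invented for illustration; only the ILinkDetector shape (domain, options, getLinkCount) comes from the interface above:

    ```typescript
    import { Readable } from "stream"

    // Illustrative detector: pushes a fixed list of URLs as strings,
    // then ends the stream, tracking how many links it emitted.
    class FixedListDetector extends Readable {
      public domain: URL
      public options?: any
      private urls: URL[]
      private pushed = 0

      constructor(domain: URL, urls: URL[], options?: any) {
        super()
        this.domain = domain
        this.urls = urls
        this.options = options
      }

      public _read(): void {
        if (this.pushed < this.urls.length) {
          // per the docs above, each discovered URL is pushed as a string
          this.push(this.urls[this.pushed++].href)
        } else {
          this.push(null) // signal end of stream
        }
      }

      public getLinkCount(): number {
        return this.pushed
      }
    }
    ```

    A real detector would discover URLs dynamically (e.g. by parsing pages or sitemaps), but the stream mechanics stay the same: push each href, then push null.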


    Scraper

    export interface IScraperFactory {
      create(options?: any): Transform
    }

    The Scraper Factory returns a Transform stream that takes an HTML stream as input, and outputs a stream of scraped data. A new scraper object is created for each webpage visited by a crawler.

    The default Scraper returns a JSON object of all of the <meta> tags on a webpage which can be useful for extracting structured data such as Open Graph or Schema.
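    The following is a rough, self-contained sketch of the kind of Transform a scraper factory can return: it buffers incoming HTML and, at end of input, emits a JSON array of the <meta> tags it finds with a simple regex. This is not the package's implementation, whose parsing is more robust:

    ```typescript
    import { Transform, TransformCallback } from "stream"

    // Illustrative scraper: HTML in, JSON array of <meta> tag strings out.
    class MetaTagScraper extends Transform {
      private html = ""

      public _transform(chunk: Buffer, _enc: string, cb: TransformCallback): void {
        this.html += chunk.toString() // accumulate the page
        cb()
      }

      public _flush(cb: TransformCallback): void {
        // naive extraction: every <meta ...> opening tag in the document
        const tags = this.html.match(/<meta\b[^>]*>/gi) || []
        this.push(JSON.stringify(tags))
        cb()
      }
    }
    ```

    Because a new scraper object is created per webpage, buffering the whole page in one instance is safe; a streaming HTML parser would avoid the buffering entirely.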


    Consumer

    export interface ICrawlConsumer extends Writable {
      domain: URL
      options?: any
    }

    The Consumer is responsible for writing out the scraped data, usually to a file. There is only one Consumer object per crawler.

    The default Consumer will write out all Scraper output to a file in ./.turbocrawl/crawled/.
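    A custom consumer only has to be a Writable with a domain. Here is a hedged sketch that appends each chunk of scraped output as one line to a file; the class name and path argument are illustrative, not the package's defaults:

    ```typescript
    import { Writable } from "stream"
    import * as fs from "fs"

    // Illustrative ICrawlConsumer-style Writable: one scraped record per line.
    class FileConsumer extends Writable {
      public domain: URL
      public options?: any
      private out: fs.WriteStream

      constructor(domain: URL, path: string, options?: any) {
        super()
        this.domain = domain
        this.options = options
        this.out = fs.createWriteStream(path, { flags: "a" })
      }

      public _write(chunk: Buffer, _enc: string, cb: (err?: Error) => void): void {
        // append the scraped record followed by a newline
        this.out.write(chunk.toString() + "\n", cb)
      }

      public _final(cb: (err?: Error) => void): void {
        this.out.end(cb) // flush and close the file when the crawl ends
      }
    }
    ```

    Since there is one consumer per crawler, the domain is a natural key for choosing the output file.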

    URL Handler

    export interface IURLHandler {
      stream(url: URL, callback: (url: URL, htmlstream?: Readable, err?: Error) => void): void
    }

    The URL handler fetches the HTML for each URL discovered by the URL Detector. There is only one URL Handler object per crawler.

    This is the most difficult component to customize, and the default Handler has important features such as per-domain throttling and caching, so take care when implementing your own.
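    To see the callback shape in action without the hard parts, here is an invented handler that serves HTML from an in-memory map instead of the network. It follows the IURLHandler shape above and can be useful for testing a crawler pipeline; a production handler would fetch over HTTP and add the throttling and caching mentioned above:

    ```typescript
    import { Readable } from "stream"

    // Illustrative handler: looks up HTML fixtures by href instead of fetching.
    class FixtureURLHandler {
      constructor(private fixtures: Map<string, string>) {}

      public stream(url: URL, callback: (url: URL, htmlstream?: Readable, err?: Error) => void): void {
        const html = this.fixtures.get(url.href)
        if (html === undefined) {
          // no fixture: report an error instead of a stream
          callback(url, undefined, new Error(`no fixture for ${url.href}`))
          return
        }
        // hand back the page body as a Readable, as the interface expects
        callback(url, Readable.from([html]))
      }
    }
    ```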

    Advanced Customization

    The Crawler class has so far only taken a domain name as input e.g. "www.cnn.com". However, using a pass through link detector, one could easily create a crawler that takes a webpage as input and simply crawls that one page. Here's a sample implementation:

    class PassThroughLinkDetector extends PassThrough implements ILinkDetector {
      public domain: URL
      public options?: any
      constructor(webpage: URL, options?: any) {
        super()
        this.options = options
        this.domain = webpage
        this.on("pipe", () => {
          // emit the single webpage URL, then end the stream
          this.push(webpage.href)
          this.push(null)
        })
      }
      public getLinkCount(): number {
        return 1
      }
    }

    A custom URL Handler could, instead of fetching the HTML directly, use a Puppeteer instance to visit each page, allowing it to run JavaScript and scrape dynamic websites.

    A custom Scraper could use Mercury Parser to find the body of a news article. You can also try this npm package by Andreas Madson, which attempts to scrape a news article from a stream of HTML; it fits more easily into the architecture of Turbo Crawl since it is based on Streams. Here's some sample code to get you started:

    import { Duplex } from "stream"
    const article = require("article")

    class ArticleScrapeStream extends Duplex {
      private parser: any
      constructor(url: URL, options?: any) {
        super(options)
        this.parser = article(url.href, (err, result) => {
          // handle the parsed article here
        })
      }
      pipe(destination: any) {
        // forward parser output to the destination
      }
    }

    class ArticleScraperFactory implements IScraperFactory {
      public create(options: any): ReadableStream {
        return new ArticleScrapeStream(options.url)
      }
    }

    A custom consumer class could take scraped news articles and index them straight into an Elasticsearch database for large scale full text search of historic and breaking news.

    A consumer could write out a large corpus of news articles to the filesystem and import it into an NLP library like Python's NLTK.


    npm i turbocrawl

    Unpacked Size

    78.7 kB

    Collaborators

    • sunnypurewal