Lightsabing Node/PhantomJS crawler. Crawl almost everything, including AJAX content.


JEDI CRAWLER is a Node/PhantomJS crawler made to scrape pretty much anything from Node, with a really simple syntax. Work in progress, ladies.

npm install jedi-crawler

Register padawans with the jedi crawler. Each padawan has a pattern to match a URL and a set of jQuery-style selectors. You can also post-process the data if you need to do some treatment (number conversion, etc.).


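module.exports = function(jedi) {
    // `register`, `selectors` and `firstParagraph` are assumed names;
    // check the package source for the real ones.
    jedi.register({
        // Pattern to match the URL
        pattern: /\/wiki\//,
        // Selectors to be executed
        selectors: {
            title: { sel: "#firstHeading span", type: "text" },
            firstParagraph: { sel: "#toc ~ p:first", type: "text" }
        },
        // You can choose to process the data AFTER it has been crawled.
        postProcessing: function(data) {
            // Do your custom processing on the crawled data
            data.title = data.title.toUpperCase();
            return data;
        }
    });
};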
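
The number conversion mentioned earlier could look something like the sketch below. The `price` and `rating` fields are made up for illustration; the only thing taken from the example above is the shape of the function (it receives the crawled data and returns it).

```javascript
// Hypothetical post-processing step: `price` and `rating` are made-up fields.
function postProcessing(data) {
  // Scraped values arrive as strings; keep only digits, dots and minus signs.
  data.price = parseFloat(String(data.price).replace(/[^0-9.\-]/g, ''));
  // Always pass the radix so the string is read as base 10.
  data.rating = parseInt(data.rating, 10);
  return data;
}

var cleaned = postProcessing({ price: '$1,299.50', rating: '4' });
console.log(cleaned.price, cleaned.rating);
```

Returning the same object keeps the sketch in line with the `return data;` convention used by the padawan above.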

For now, only two types of selectors are supported: "text" and "src".
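
As a rough guess at what the two types mean in jQuery terms, "text" presumably reads the element's text and "src" its `src` attribute. The helper `readValue` and the stand-in element below are hypothetical and not part of jedi-crawler; they only make the idea runnable under plain Node.

```javascript
// Hypothetical helper: maps a selector type to a jQuery-style read.
// `$el` is any object exposing text() and attr(), like a jQuery selection.
function readValue($el, type) {
  switch (type) {
    case 'text':
      return $el.text();
    case 'src':
      return $el.attr('src');
    default:
      throw new Error('Unsupported selector type: ' + type);
  }
}

// Stand-in for a jQuery-wrapped <img>, so the sketch runs without a browser.
var fakeImg = {
  text: function () { return ''; },
  attr: function (name) { return name === 'src' ? '/logo.png' : undefined; }
};

console.log(readValue(fakeImg, 'src'));
```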

I find having one file per padawan (crawler) pretty cool for code clarity. Also, padawans need to learn by themselves and be alone.

npm install jedi-crawler

You can then give your padawans to the Jedi by doing

var jedi = require('jedi-crawler');

And then you can do

jedi.crawl('', function(err, result) {
    // `result` holds the scraped data; `err` is set if the crawl failed
});

The jedi figures out which padawan to use from the URL and the patterns you set.
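
The matching itself can be pictured as a simple lookup. This is a minimal sketch, not the crawler's actual code: the `padawans` array, the `name` field, the `findPadawan` function and the example.com URLs are all made up, and "first match wins" is an assumption.

```javascript
// Hypothetical registry: each padawan carries a `pattern` RegExp.
var padawans = [
  { name: 'wiki', pattern: /\/wiki\// },
  { name: 'blog', pattern: /\/blog\// }
];

// Return the first padawan whose pattern matches the URL, or null if none does.
function findPadawan(url) {
  for (var i = 0; i < padawans.length; i++) {
    if (padawans[i].pattern.test(url)) {
      return padawans[i];
    }
  }
  return null;
}

var match = findPadawan('https://example.com/wiki/Jedi');
console.log(match ? match.name : 'no padawan for this URL');
```

Keep the patterns free of the `g` flag in a setup like this, since a global RegExp remembers its last match position between `test()` calls.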

Crawlers only start scraping the page once $(document).ready has fired. Our own version of jQuery is injected into the page, but we then give $ back to its owner in case the page runs 3rd-party libraries that modify the DOM.

If a selector matches several DOM elements, an array of every value is returned.
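
In other words, the shape of each result depends on how many elements matched. A minimal sketch of that rule, assuming a single match comes back as a bare value; `collapse` is a hypothetical name, and what the crawler returns for zero matches is not documented here, so the empty array below is a guess.

```javascript
// Hypothetical helper: one match gives the bare value, several give an array.
function collapse(values) {
  return values.length === 1 ? values[0] : values;
}

console.log(collapse(['Luke Skywalker']));
console.log(collapse(['Luke', 'Leia', 'Han']));
```

If your post-processing needs a uniform shape, wrapping the value with `[].concat(value)` turns both cases into an array.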

Right now, PhantomJS is instantiated with the "--load-images=no" option so the page loads faster.
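
For reference, here is one way that option could be passed when launching PhantomJS from Node. This is only a sketch: whether jedi-crawler spawns the binary directly or goes through a bridge module is not stated here, and `phantomArgs`, `launch` and the script name `crawl.js` are made up.

```javascript
var spawn = require('child_process').spawn;

// Build the PhantomJS command line. Options go before the script path;
// anything after the script is handed to the script itself.
function phantomArgs(script, url) {
  return ['--load-images=no', script, url];
}

// Hypothetical launcher; needs the `phantomjs` binary on the PATH.
function launch(script, url) {
  return spawn('phantomjs', phantomArgs(script, url), { stdio: 'inherit' });
}

console.log(phantomArgs('crawl.js', 'https://example.com/').join(' '));
```

`launch` is defined but not called, so the sketch runs even on a machine without PhantomJS installed.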

1. Pull that bad boy
2. Make sure you have PhantomJS installed
3. Run node main.js