Web Scraper
- CSS selectors
- exporting function
- pre-configured to insert results into an SQLite database and generate a CSV
- stop conditions:
  - time
  - number of results
  - number of websites
- filter function to check results
- pre- and post-processing functions
- init with options or set them later with `spider.setVal1(v).setVal2(v2)`
- builder (call chaining) design pattern
- extensible
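The call-chaining (builder) pattern above can be sketched in isolation. The class and setter names below are hypothetical, used only to show the mechanism: each setter stores its value and returns `this`, so calls chain.

```javascript
// Illustrative sketch of the builder (call chaining) pattern --
// `SpiderBuilder` and these setter names are hypothetical, not the library's API.
class SpiderBuilder {
  constructor(startURL) {
    this.opts = { startURL, timeLimit: 60, threadCount: 4 };
  }
  setTimeLimit(sec) {
    this.opts.timeLimit = sec;
    return this; // returning `this` is what enables chaining
  }
  setThreadCount(n) {
    this.opts.threadCount = n;
    return this;
  }
}

const b = new SpiderBuilder('https://example.com')
  .setTimeLimit(30)
  .setThreadCount(2);
console.log(b.opts.timeLimit, b.opts.threadCount); // 30 2
```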
API
Docs are on gh-pages.
```js
const startURL = "https://stackoverflow.com/questions/...";

const crawler = new Spider(startURL);

crawler
  .setTimeLimit(60)   // sec
  .setThreadCount(4)  // #workers
  .setSiteCount(10)   // distinct URLs
  // run returns void, you might want to provide an export function for each
  // result (see below) -- by default goes to sqlite ./db and prints to console
  .run();
```
OR use an init object in the constructor:
```js
// DEFAULT init options
const spiderOpts = {
  exportFunct: exporting.default(), // Function<String, String, String, Promise>
  filterFunct: (txt) => true,       // predicate i.e. Function<String, Boolean>
  followSelectors: [],              // Array<String>
  logInfoFile: undefined,           // String; logging goes to console
  redirFollowCount: 3,              // Integer
  respSecW8: 10,                    // Integer
  selectors: [],                    // Array<String>
  resultCount: 100,                 // Integer
  siteCount: 10,                    // Integer, #sites
  threadCount: 4,                   // Integer, #workers
  timeLimit: 60,                    // Integer, sec
};

const startURL = "https://stackoverflow.com/questions/...";
const crawler = new Spider(startURL, spiderOpts);
crawler.run();
```
See export functions below to save results.
Export Function
Must be of type `(uri: string, selector: string, text: string) => Promise<*>`.
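A minimal function matching that signature can be written directly; the one below just collects results in memory (the names are illustrative, not part of the library):

```javascript
// Minimal export function: (uri, selector, text) => Promise<*>
const results = [];
const collectExport = (uri, selector, text) => {
  results.push({ uri, selector, text }); // store the scraped result
  return Promise.resolve(results.length);
};

collectExport('https://example.com', 'h1', 'Hello World');
console.log(results.length); // 1
```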
There are a few configurable export functions that you can use.
Import the exporting module:
```js
const { exporting, Spider } = require('...');
```
Declare a spider:
```js
const spider = new Spider(uri /* , opts */);
```
- `sqlite`

  Generates a `Result` table with `id INT`, `text TEXT`, `selector TEXT`, `uri TEXT` columns.

  ```js
  spider.setExportFunct(exporting.sqlite()); // generate output db name
  spider.run();
  ```

- `console`

  ```js
  spider.setExportFunct(exporting.console());          // default formatter
  spider.setExportFunct(exporting.console(formatter)); // string formatter for (uri, selector, text)
  spider.run();
  ```

- `file`

  ```js
  spider.setExportFunct(exporting.file());                     // default file name, default formatter
  spider.setExportFunct(exporting.file(fileName));             // custom file name, default csv formatter
  spider.setExportFunct(exporting.file(fileName, formatter));  // custom file name, string formatter
  spider.run();
  ```

- `combine` (used to broadcast results to many exports)

  ```js
  spider.setExportFunct(exporting.combine(exporting.console(), exporting.sqlite()));
  spider.run();
  ```

- `db`

  ```js
  spider.setExportFunct(exporting.db(opts)); // look at sequelize docs
  ```

- `default` (enabled by default, sends results to the console, a CSV file and an sqlite database)
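One way a `combine`-style broadcaster could be built (a sketch, not the library's actual implementation): wrap several export functions in a single one that forwards every result to all of them.

```javascript
// Hypothetical fan-out: returns one export function that calls every
// given export function and resolves when all of them have resolved.
const combine = (...exportFuncts) =>
  (uri, selector, text) =>
    Promise.all(exportFuncts.map((f) => f(uri, selector, text)));

// usage: broadcast one result to two in-memory sinks
const seen = [];
const sinkA = async (uri, selector, text) => seen.push(`A:${text}`);
const sinkB = async (uri, selector, text) => seen.push(`B:${text}`);

const broadcast = combine(sinkA, sinkB);
broadcast('https://example.com', 'h1', 'hello');
console.log(seen); // [ 'A:hello', 'B:hello' ]
```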
It's very easy to define your own export function. E.g. imagine wanting to POST each result to some 3rd party API.
```js
const myExportFunction = async (uri, selector, text) => {
  const res = await http.post(apiURL, { uri, selector, text }); // any HTTP client
  return res;
};

spider.setExportFunct(myExportFunction);
```
Example
More examples in ./examples.
```js
const { Spider, exporting } = require('...');

(async () => {
  const s = 'https://www.jobsite.co.uk/jobs/javascript';
  const sqliteExport = await exporting.sqlite();

  new Spider(s)
    .setExportFunct(sqliteExport)
    // don't look for jobs in London, make sure they are graduate!
    .setFilterFunct((txt) => !txt.includes('London') && txt.includes('graduate'))
    .setFollowSelectors(['a.next']) // next page
    .setSiteCount(3)                // stop after 3 websites (urls)
    .setTimeLimit(30)               // run for 30 sec
    .run();
})();
```