Rattler is a web scraper designed to rattle around the web and extract the info that you need quickly, efficiently, and very politely. In other words, it gets the job done!
You just have to provide a configuration object to the Rattler constructor. The config must have a baseURL (the domain from which you want to extract) and a list of scrape definitions, which Rattler uses to extract the info that you requested.
Rattler uses Axios to make HTTP requests and Cheerio to load the page and extract the content you want, so you can use CSS selectors (essentially the path to the element inside the DOM) to define where in the page the info should be extracted from.
You can extract from one page or multiple pages, depending on the configuration. You can also instruct Rattler to keep extracting the info while navigating the pagination links, as long as there is a next link. See the usage section.
|Parameter||Type||Required||Description|
|config||object||yes||The config object describing what you want to extract and where to find it.|
|config.baseURL||string||||The base URL from which pages are loaded for all requests.|
|config.scrapeList||array||yes||The list of items that you want to extract. Must contain at least one element.|
|config.scrapeList[n].label||string||||Each scrape request in the scrape list produces a result in JSON format; this field is the name of the key that holds the result of this scrape request.|
|config.scrapeList[n].searchURL||string||no||The search URL used to load the page. Required when no baseURL is given.|
|config.scrapeList[n].cssSelector||string||||The CSS selector of the element you want to extract for this particular entry in the scrapeList.|
|config.scrapeList[n].followNext||object||no||An object containing the rules used to find the next link and follow it, applying the scrape definition to each page.|
|config.scrapeList[n].followNext.cssSelector||string||||The CSS selector of the next link.|
|config.scrapeList[n].followNext.maxDepth||number||||The maximum number of next links that will be followed if found. Min 1, max 20.|
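The constraints listed in the table can be checked before any request is made. The following is a minimal sketch only: `validateConfig` is a hypothetical helper, not part of Rattler's API, and it assumes that `maxDepth` is only range-checked when it is present.

```javascript
// Hypothetical helper (not part of Rattler): checks a config object against
// the constraints described in the table and returns a list of problems.
function validateConfig(config) {
  if (config === null || typeof config !== 'object') {
    return ['config must be an object'];
  }
  const errors = [];
  const { baseURL, scrapeList } = config;
  if (!Array.isArray(scrapeList) || scrapeList.length < 1) {
    errors.push('config.scrapeList must contain at least one element');
    return errors;
  }
  scrapeList.forEach((item, n) => {
    const path = `config.scrapeList[${n}]`;
    if (item === null || typeof item !== 'object') {
      errors.push(`${path} must be an object`);
      return;
    }
    // searchURL becomes mandatory when no baseURL is given.
    if (!baseURL && !item.searchURL) {
      errors.push(`${path}.searchURL is required when baseURL is absent`);
    }
    // Assumption: maxDepth is only validated when it is actually set.
    const maxDepth = item.followNext ? item.followNext.maxDepth : undefined;
    if (maxDepth !== undefined &&
        (!Number.isInteger(maxDepth) || maxDepth < 1 || maxDepth > 20)) {
      errors.push(`${path}.followNext.maxDepth must be an integer between 1 and 20`);
    }
  });
  return errors;
}
```

An empty array means that none of these checks failed; it does not guarantee that Rattler itself will accept the config.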

    {
      baseURL: '',
      scrapeList: [
        {
          label: 'resultStats',
          searchURL: '/search?q=let+me+google+that+for+you',
          cssSelector: '#main-content.section.div.ul.li.next.a'
        }
      ]
    }

Multiple elements extracted from the same page:

    {
      baseURL: '',
      scrapeList: [
        {
          label: 'resultStats',
          searchURL: '/search?q=let+me+google+that+for+you',
          cssSelector: 'div.resultStats'
        },
        {
          label: 'languagesInfo',
          searchURL: '/search?q=let+me+google+that+for+you',
          cssSelector: '#SIvCob'
        }
      ]
    }

Multiple elements extracted from different pages under the same baseURL:

    {
      baseURL: '',
      scrapeList: [
        {
          label: 'resultStats',
          searchURL: '/search?q=let+me+google+that+for+you',
          cssSelector: 'div.resultStats'
        },
        {
          label: 'languagesInfo',
          searchURL: '/search?q=hi+there',
          cssSelector: '#SIvCob'
        }
      ]
    }

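To illustrate how a relative searchURL and a baseURL can combine into the page address, here is a small sketch based on the standard `URL` class. `resolvePageURL` is a hypothetical helper and `https://example.com` is a placeholder domain; how Rattler joins the two values internally is not documented here, so treat this as one plausible approach.

```javascript
// Hypothetical helper: resolves a scrapeList searchURL against the baseURL
// using the WHATWG URL class that ships with Node.js and browsers.
function resolvePageURL(baseURL, searchURL) {
  // Without a baseURL, the searchURL has to be absolute already.
  if (!baseURL) return new URL(searchURL).href;
  // Without a searchURL, the base itself is the page to load.
  if (!searchURL) return new URL(baseURL).href;
  return new URL(searchURL, baseURL).href;
}

const base = 'https://example.com'; // placeholder domain, an assumption
console.log(resolvePageURL(base, '/search?q=let+me+google+that+for+you'));
// https://example.com/search?q=let+me+google+that+for+you
console.log(resolvePageURL(base, '/search?q=hi+there'));
// https://example.com/search?q=hi+there
```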
Scrape, then follow the next link until there are no more pages or the maxDepth limit is reached:

    {
      baseURL: '',
      scrapeList: [
        {
          label: 'pricesForNewYork',
          searchURL: '/manhattan',
          cssSelector: 'span.item-price',
          followNext: {
            cssSelector: 'div.pagination.ul.li.next',
            maxDepth: 20
          }
        }
      ]
    }

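The follow-next behaviour can be pictured as a bounded loop. The sketch below is not Rattler's implementation: pages are simulated in memory instead of being fetched, the loop is synchronous for readability, and `pages`, `getPage`, and `scrapeAndFollow` are hypothetical names. It assumes maxDepth counts the next links followed after the first page.

```javascript
// In-memory stand-in for a paginated site: each "page" lists the texts matched
// by the scrape selector and the href of its next link (null on the last page).
const pages = {
  '/manhattan': { matches: ['$100', '$120'], next: '/manhattan?page=2' },
  '/manhattan?page=2': { matches: ['$90'], next: '/manhattan?page=3' },
  '/manhattan?page=3': { matches: ['$75'], next: null },
};

// Hypothetical helper: scrapes the first page, then follows at most maxDepth
// next links, stopping early when a page has no next link.
function scrapeAndFollow(startURL, maxDepth, getPage) {
  const results = [];
  let url = startURL;
  let followed = 0;
  while (url) {
    const page = getPage(url);
    if (!page) break; // the page could not be loaded
    results.push(...page.matches);
    if (!page.next || followed >= maxDepth) break;
    url = page.next;
    followed += 1;
  }
  return results;
}

const getPage = (url) => pages[url];
console.log(scrapeAndFollow('/manhattan', 20, getPage)); // [ '$100', '$120', '$90', '$75' ]
console.log(scrapeAndFollow('/manhattan', 1, getPage));  // [ '$100', '$120', '$90' ]
```

Capping the number of followed links keeps the loop finite even if a site's next links form a cycle.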
This method extracts the information specified in the config for each scrapeList element and presents the results in JSON format. For each scrapeList element, an HTTP request is performed (or the page is loaded from cache if the request has already been executed), and the combined text contents of each element in the set of matched elements, including their descendants, are returned.

    const Rattler = ;
    const config = {
      baseURL: '',
      scrapeList: [
        {
          label: 'resultStats',
          searchURL: '/search?q=let+me+google+that+for+you',
          cssSelector: 'div.resultStats'
        }
      ]
    };
    const rt = new Rattler(config);
    rt;
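
The per-request cache mentioned in the method description can be sketched as a loader that remembers pages by URL. This is an illustration under assumptions, not Rattler's actual code: `createCachedLoader` and `fakeFetch` are hypothetical names, and the fetcher is a stand-in that counts calls instead of touching the network.

```javascript
// Hypothetical helper: wraps any page-fetching function with a cache keyed by
// URL, so two scrapeList entries that share a URL trigger a single request.
function createCachedLoader(fetchPage) {
  const cache = new Map();
  return function load(url) {
    if (!cache.has(url)) {
      // Store the promise itself so concurrent callers share one request.
      cache.set(url, Promise.resolve(fetchPage(url)));
    }
    return cache.get(url);
  };
}

// Stand-in fetcher that counts calls instead of making HTTP requests.
let requestCount = 0;
const fakeFetch = async (url) => {
  requestCount += 1;
  return `<html><!-- body of ${url} --></html>`;
};

const load = createCachedLoader(fakeFetch);
const first = load('/search?q=let+me+google+that+for+you');
const second = load('/search?q=let+me+google+that+for+you');
load('/search?q=hi+there');

console.log(requestCount);     // 2
console.log(first === second); // true
```

Note that this simple version also keeps failed requests in the cache; a real loader would probably evict rejected promises so the page can be retried.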