# lb-scrapy
A simple and lightweight Scrapy-like crawler for Node.js.
## Install

### NPM

```bash
npm install lb-scrapy
```
## Our first Spider
```js
const fs = require('fs');
const { Request, Spider } = require('lb-scrapy');

class QuotesSpider extends Spider {
    // yield the initial requests for this spider
    * start_requests() {
        let urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'
        ];
        for (let url of urls) {
            yield new Request({
                url: url,
                callback: this.parse
            });
        }
    }

    // save each downloaded page to a local HTML file
    parse(response) {
        let page = response.url.split('/').slice(-2)[0];
        let filename = `quotes-${page}.html`;
        fs.writeFileSync(filename, response.text);
        console.log(`Saved file ${filename}`);
    }
}

let spider = new QuotesSpider();
spider.start();
```
## A shortcut to the start_requests method
Instead of implementing a `start_requests()` method that generates `Request` objects from URLs, you can just define a `start_urls` class attribute with a list of URLs. This list will then be used by the default implementation of `start_requests()` to create the initial requests for your spider.
```js
const fs = require('fs');
const { Request, Spider } = require('lb-scrapy');

class QuotesSpider extends Spider {
    constructor() {
        super(...arguments);
        this.start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'
        ];
    }

    // save each downloaded page to a local HTML file
    parse(response) {
        let page = response.url.split('/').slice(-2)[0];
        let filename = `quotes-${page}.html`;
        fs.writeFileSync(filename, response.text);
        console.log(`Saved file ${filename}`);
    }
}

let spider = new QuotesSpider();
spider.start();
```
## Extracting data in our spider
An lb-scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the `yield` keyword in the callback, as you can see below:
```js
const { Request, Spider } = require('lb-scrapy');

class QuotesSpider extends Spider {
    constructor() {
        super(...arguments);
        this.start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'
        ];
    }

    // yield one item per quote found on the page
    // (CSS selectors match the quotes.toscrape.com markup)
    * parse(response) {
        let $ = response.selector;
        let items = $('.quote').map((i, el) => ({
            'text': $(el).find('.text').text(),
            'author': $(el).find('.author').text(),
            'tags': $(el).find('.tags .tag').map((j, tag) => $(tag).text()).get()
        })).get();
        for (let item of items) {
            yield item;
        }
    }
}

let spider = new QuotesSpider();
spider.start();
```
If you run this spider, it will output the extracted data in the log:

```
2019-04-05 19:39:30 [INFO]: Scrapyed from <200 http://quotes.toscrape.com/page/1/>
{"text":"“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”","author":"Albert Einstein","tags":["change","deep-thoughts","thinking","world"]}
2019-04-05 19:39:30 [INFO]: Scrapyed from <200 http://quotes.toscrape.com/page/1/>
{"text":"“It is our choices, Harry, that show what we truly are, far more than our abilities.”","author":"J.K. Rowling","tags":["abilities","choices"]}
```
We use cheerio to parse the HTML in a response; you can access it via `response.selector`.
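For example, inside a callback you can query the page with the usual cheerio API. Here is a minimal sketch (the spider name `TitleSpider` is only for illustration, and the CSS selectors target the markup of quotes.toscrape.com):

```js
const { Spider } = require('lb-scrapy');

class TitleSpider extends Spider {
    constructor() {
        super(...arguments);
        this.start_urls = ['http://quotes.toscrape.com/page/1/'];
    }

    parse(response) {
        // response.selector is a cheerio instance loaded with the response body
        let $ = response.selector;
        console.log($('title').text());                 // the page <title>
        console.log($('.quote .text').first().text());  // the first quote on the page
    }
}

new TitleSpider().start();
```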
## Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Each item pipeline component is a JavaScript class that implements a simple method. It receives an item and performs an action on it, also deciding whether the item should continue through the pipeline or be dropped and no longer processed.
### Writing your own item pipeline

Each item pipeline component is a JavaScript class that must implement the following method:

#### process_item(item, response, spider)
This method is called for every item pipeline component. `process_item()` must return a dict with data. Dropped items are no longer processed by further pipeline components.
**Parameters:**

- `item`: the item scraped
- `response`: the response which generated the item
- `spider`: the spider which scraped the item
### Item Pipeline Example

The following pipeline stores all scraped items into a single `result.txt` file, containing one item per line, serialized in JSON format:
```js
const fs = require('fs');
const { ItemPipeline } = require('lb-scrapy');

class FileItemPipeline extends ItemPipeline {
    constructor() {
        super(...arguments);
        // start with an empty result file
        fs.writeFileSync('result.txt', '');
    }

    process_item(item, response, spider) {
        // append one JSON-serialized item per line
        fs.appendFileSync('result.txt', JSON.stringify(item) + '\n');
        return item;
    }
}
```
### Activating an Item Pipeline component

To activate an Item Pipeline component, you must add its class to the `ITEM_PIPELINES` setting, like in the following example:
```js
const fs = require('fs');
const { Request, Spider, ItemPipeline } = require('lb-scrapy');

class FileItemPipeline extends ItemPipeline {
    constructor() {
        super(...arguments);
        fs.writeFileSync('result.txt', '');
    }

    process_item(item, response, spider) {
        fs.appendFileSync('result.txt', JSON.stringify(item) + '\n');
        return item;
    }
}

class QuotesSpider extends Spider {
    constructor() {
        super(...arguments);
        // register the pipeline component for this spider
        this.settings = {
            'ITEM_PIPELINES': [{
                class: FileItemPipeline,
                priority: 200
            }]
        };
    }

    // ... more code
}

let spider = new QuotesSpider();
spider.start();
```
## Spider

Spiders are classes which define how a certain site will be scraped, including how to perform the crawl and how to extract structured data from the pages. In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site.
### class lb-scrapy.Spider

This is the simplest spider, and the one from which every other spider must inherit. It doesn’t provide any special functionality. It just provides a default `start_requests()` implementation which sends requests from the `start_urls` spider attribute and calls the spider’s `parse()` method for each of the resulting responses.
#### start_urls

A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the first pages downloaded will be those listed here. Subsequent `Request`s will be generated successively from data contained in the start URLs.
#### settings

A dictionary of settings that will be overridden from the global configuration when running this spider. It must be defined as a class attribute since the settings are updated before instantiation.
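As a sketch, a per-spider override could look like this, mirroring the constructor pattern used in the Item Pipeline example above (`LoggingPipeline` is a hypothetical component; `ITEM_PIPELINES` is the only setting shown in this document):

```js
const { Spider, ItemPipeline } = require('lb-scrapy');

// hypothetical pipeline used only to illustrate the settings override
class LoggingPipeline extends ItemPipeline {
    process_item(item, response, spider) {
        console.log(JSON.stringify(item));
        return item;
    }
}

class MySpider extends Spider {
    constructor() {
        super(...arguments);
        // settings defined on the spider override the global configuration
        this.settings = {
            'ITEM_PIPELINES': [{
                class: LoggingPipeline,
                priority: 100
            }]
        };
    }
}
```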
#### start_requests()

This method must return an iterable with the first Requests to crawl for this spider. It is called by lb-scrapy when the spider is opened for scraping, and it is called only once, so it is safe to implement `start_requests()` as a generator.
The default implementation generates a `new Request({url})` for each URL in `start_urls`.
If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you could do:
```js
const { Request, Spider } = require('lb-scrapy');

class MySpider extends Spider {
    * start_requests() {
        yield new Request({
            url: 'http://httpbin.org/post',
            method: 'POST',
            data: {
                'user': 'john',
                'pass': 'secret'
            },
            callback: this.logged_in
        });
    }

    logged_in(response) {
        // here you would extract links to follow and return Requests
        // for each of them, with another callback
    }
}

let spider = new MySpider();
spider.start();
```
#### parse(response)

This is the default callback used by lb-scrapy to process downloaded responses, when their requests don’t specify a callback.

The `parse` method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Request callbacks have the same requirements as the `Spider` class.

This method, as well as any other Request callback, must return an iterable of `Request` objects and/or dicts.
**Parameters:**

- `response`: the response to parse
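For example, a `parse()` callback that yields both scraped items and a follow-up request could look like the following sketch (it reuses the quotes.toscrape.com markup from the earlier examples; the `li.next a` selector is an assumption about that site's pagination link):

```js
const { Request, Spider } = require('lb-scrapy');

class QuotesSpider extends Spider {
    constructor() {
        super(...arguments);
        this.start_urls = ['http://quotes.toscrape.com/page/1/'];
    }

    * parse(response) {
        let $ = response.selector;

        // yield one dict per quote on the page
        for (let el of $('.quote').get()) {
            yield {
                'text': $(el).find('.text').text(),
                'author': $(el).find('.author').text()
            };
        }

        // follow the pagination link, if any, with the same callback
        let next = $('li.next a').attr('href');
        if (next) {
            yield new Request({
                url: response.urljoin(next),
                callback: this.parse
            });
        }
    }
}

new QuotesSpider().start();
```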
#### start()
Call this method to start the spider.
## Request and Response
lb-scrapy uses `Request` and `Response` objects for crawling web sites.

Typically, `Request` objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a `Response` object that travels back to the spider that issued the request.
### class lb-scrapy.Request(options)
A `Request` object represents an HTTP request, which is usually generated in the Spider and executed by the Downloader, thus generating a `Response`.
#### options

- `url` (string): the URL of the request.
- `callback` (function): the function that will be called with the response of the request as its first parameter. If a Request doesn't specify a callback, the spider's `parse()` method will be used.
- `method` (string): the HTTP method of the request. Defaults to `'GET'`.
- `headers` (object): the headers of the request.
- `dont_filter` (boolean): indicates that this request should not be filtered by the scheduler. Defaults to `false`.
- `data` (object): body for POST and PUT requests. If given, this adds a `Content-Type: application/x-www-form-urlencoded` header. Defaults to `null`.
- `json` (object): JSON representation of the body. If given, this adds a `Content-Type: application/json` header. Defaults to `null`.
- `cookies` (object): the request cookies. For example:

  ```js
  let request = new Request({
      url: 'http://httpbin.org/cookies',
      cookies: {
          cookies_are: 'working'
      }
  });
  ```

- `meta` (object): an object that contains arbitrary metadata for this request. It can also be accessed from the `response.meta` attribute.
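As a sketch, a request combining several of these options might look like this (`ExampleSpider`, the URL, and the header and meta values are placeholders):

```js
const { Request, Spider } = require('lb-scrapy');

class ExampleSpider extends Spider {
    * start_requests() {
        yield new Request({
            url: 'http://httpbin.org/get',
            headers: { 'User-Agent': 'lb-scrapy-example' },
            dont_filter: true,                 // never filtered by the scheduler
            meta: { 'page_type': 'listing' },  // arbitrary metadata for this request
            callback: this.parse
        });
    }

    parse(response) {
        // metadata attached to the request is available on the response
        console.log(response.meta['page_type']);
    }
}

new ExampleSpider().start();
```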
### class lb-scrapy.Response
A `Response` object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.
#### url

The URL of the response.

#### status

The HTTP status of the response.

#### headers

The headers of the response.

#### text

The response body.

#### cookies

The cookies contained in the response.

#### meta

An object that is passed from the request's `meta` attribute.

#### selector

The `cheerio` instance loaded with the response body.

#### urljoin(url)

Constructs an absolute URL by combining the Response’s `url` with a possible relative URL.
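For instance, assuming standard URL resolution (as with Node's `new URL(url, base)`) and a response whose `url` is `http://quotes.toscrape.com/page/1/`:

```js
response.urljoin('/page/2/');             // 'http://quotes.toscrape.com/page/2/'
response.urljoin('page/2/');              // 'http://quotes.toscrape.com/page/1/page/2/'
response.urljoin('http://example.com/');  // 'http://example.com/'
```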
## License
MIT