Web page Content Extractor (wce)
Extract the content of any web page by using various content extractor libraries.
Currently the following ones are implemented:
This is the base module of the Webpage Content Extractor API module.
Usage example
    var winston = require('winston');
    var util = require('util');
    var wce = require('wce');

    var logger = new (winston.Logger)({});
    logger...;

    var extractors = ['read-art', 'node-readability'];
    var options = {};
    var WCE = wce(extractors, options);

    try {
        WCE...;
    } catch (error) {
        logger...;
    }
WCE-Proxy
WCE-Proxy is a built-in wrapper for content proxies. It can be used to retrieve the previously extracted content of URLs from a cache through a REST API. This REST API can be built in any language and can store the content of the URLs in any database, but when I wrote the wce-proxy wrapper I had a few expectations of the proxy:
- The content of a URL can be queried with a GET request, with the queried URL sent to the server in the url GET parameter. Example: http://wce-proxy/?url=http://cnn.com
- If the proxy finds content for the URL, it responds with HTTP status code 200 and the response body contains the content of the URL.
- If the content of the URL is not found, the response code is 204 and the body is empty.
- Any other status code is handled as an error. The proxy may send back an error message in the response body.
- The proxy accepts data through POST requests. A request should contain two parameters: url and content.
- When the content of the URL is successfully stored in the proxy's database, the proxy should return HTTP status code 200 with the message 'Success' in the body.
- Any other status code is handled as an error; the response body may contain information about the reason.
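The expectations above could be met by a very small server. The following is only a minimal sketch, not the author's proxy: the names handleProxyRequest and createProxyServer, the in-memory Map used as the store, the form-encoded POST body, and the 400/405 error codes are all assumptions made for illustration. Only the 200/204/'Success' behaviour comes from the list above.

```javascript
// Sketch of a content proxy that follows the expectations listed above.
// Assumptions: in-memory Map as the "database", form-encoded POST body,
// 400 and 405 as the error status codes.
const http = require('http');

// Pure decision logic: maps a request to { status, body }.
function handleProxyRequest(store, method, params) {
  if (method === 'GET') {
    if (!params.url) return { status: 400, body: 'Missing url parameter' };
    // No cached content: 204 with an empty body.
    if (!store.has(params.url)) return { status: 204, body: '' };
    return { status: 200, body: store.get(params.url) };
  }
  if (method === 'POST') {
    if (!params.url || typeof params.content !== 'string') {
      return { status: 400, body: 'Both url and content are required' };
    }
    store.set(params.url, params.content);
    return { status: 200, body: 'Success' };
  }
  return { status: 405, body: 'Method not allowed' };
}

// Thin HTTP wiring around the decision logic.
function createProxyServer(store) {
  return http.createServer(function (req, res) {
    function send(result) {
      res.writeHead(result.status);
      res.end(result.body);
    }
    if (req.method !== 'POST') {
      // e.g. GET /?url=http://cnn.com
      var query = new URL(req.url, 'http://localhost').searchParams;
      send(handleProxyRequest(store, req.method, { url: query.get('url') }));
      return;
    }
    var raw = '';
    req.on('data', function (chunk) { raw += chunk; });
    req.on('end', function () {
      // Parse "url=...&content=..." from the request body.
      var form = new URLSearchParams(raw);
      send(handleProxyRequest(store, 'POST', {
        url: form.get('url'),
        content: form.get('content')
      }));
    });
  });
}
```

To try it locally, something like createProxyServer(new Map()).listen(8080) would start the sketch on a hypothetical port 8080. Keeping the status-code decisions in a separate function makes the contract easy to check without opening a socket.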