REST API over the Webpage Content Extractor (wce) Node.js module.
Currently works with the following extractors:
For detailed information, see the Webpage Content Extractor module's GitHub page.
git clone https://github.com/mxr576/webpage-content-extractor-api.git wce-api
node wce-api/index.js
Build the image on your local machine:
git clone https://github.com/mxr576/webpage-content-extractor-api.git wce-api
cd wce-api/docker
docker build -t mxr576/wce-api .
or pull the pre-built image from Docker Hub:
docker pull mxr576/wce-api
then start a new container:
docker run -id -p 8001:8001 --name wce-api -t mxr576/wce-api
The extractor listens on port 8001 by default. You can test it via http://127.0.0.1:8001/?url=http://cnn.com.
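The test request above can also be made from a script. The following is a minimal sketch, assuming Node.js 18 or newer (for the global `fetch`) and a container listening on the default port. The helper names `buildExtractUrl` and `fetchExtracted` are made up for this example, and the response is read as plain text because its format is not described here.

```javascript
// Build the request URL for the API. The target URL is percent-encoded
// so that its own query string cannot be mistaken for API parameters.
function buildExtractUrl(targetUrl, base = 'http://127.0.0.1:8001/') {
  const requestUrl = new URL(base);
  requestUrl.searchParams.set('url', targetUrl);
  return requestUrl.toString();
}

// Request the extraction result and return the raw response body.
// No assumption is made about the response format.
async function fetchExtracted(targetUrl, base) {
  const response = await fetch(buildExtractUrl(targetUrl, base));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}
```

For example, `fetchExtracted('http://cnn.com').then(console.log)` would print whatever the running container returns for that page.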
The default extractor is read-art. You can change this in the config/default.json file, or override it with environment-specific settings, for example in conf/development.json . You can specify multiple extractors in the config file. Their order matters: the first one is the primary extractor and the second one is its fallback, used when the first cannot extract the content of a URL.
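The primary/fallback ordering can be pictured roughly as follows. This is an illustrative sketch only, not the module's actual implementation: the function `extractWithFallback` and the `{ name, extract }` shape of each extractor are assumptions made for this example.

```javascript
// Hypothetical sketch: try each extractor in the configured order and
// return the result of the first one that yields content.
async function extractWithFallback(url, extractors) {
  const errors = [];
  for (const { name, extract } of extractors) {
    try {
      const content = await extract(url);
      if (content) {
        return { extractor: name, content };
      }
      errors.push(`${name}: empty result`);
    } catch (err) {
      // Remember the failure and move on to the next extractor.
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`All extractors failed for ${url} (${errors.join('; ')})`);
}
```

With a list of two extractors, the second is only called when the first throws or returns nothing, which matches the primary and fallback roles described above.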
If you would like to use readability.com's Parser, you have to set up your access token in the config file beforehand. You can claim your Parser key here.