wce-api

Web page Content Extractor API (wce-api)


REST API over the Web page Content Extractor (wce) node module.

Currently works with the following extractors:

  1. readability.com's Parser API
  2. read-art
  3. node-readability
  4. node-unfluff
  5. wce-proxy

For detailed information, please check the Webpage Content Extractor module's GitHub page.

Usage example

git clone https://github.com/mxr576/webpage-content-extractor-api.git wce-api
node wce-api/index.js

Docker usage example

Build the image on your local machine:

git clone https://github.com/mxr576/webpage-content-extractor-api.git wce-api
cd wce-api/docker
docker build -t mxr576/wce-api .

or pull the pre-built image from Docker Hub:

docker pull mxr576/wce-api

then start a new container:

docker run -id -p 8001:8001 --name wce-api -t mxr576/wce-api

About the settings

The extractor listens on port 8001 by default. You can test it via http://127.0.0.1:8001/?url=http://cnn.com.
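When the page you want to extract has its own query string, passing it unencoded in the `url` parameter can be ambiguous. The following is a minimal sketch, assuming the API accepts a percent-encoded `url` value; the helper name `buildExtractUrl` is hypothetical and not part of wce-api, and only the host, port, and `url` parameter come from the example above.

```javascript
// Hypothetical helper (not part of wce-api): builds the request URL for a
// locally running instance. Host, port and the "url" parameter are taken
// from the README example; everything else is an assumption.
function buildExtractUrl(targetUrl, base = "http://127.0.0.1:8001/") {
  const requestUrl = new URL(base);
  // searchParams.set percent-encodes the value, so "?" and "&" inside
  // targetUrl are not mistaken for parameters of the API request itself.
  requestUrl.searchParams.set("url", targetUrl);
  return requestUrl.toString();
}

console.log(buildExtractUrl("http://cnn.com"));
// prints "http://127.0.0.1:8001/?url=http%3A%2F%2Fcnn.com"
```

The resulting string can be passed to any HTTP client, such as `curl` or the `fetch` function built into recent Node.js versions.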

The default extractor is read-art. You can change this in the config/default.json file, or override it with environment-specific settings, for example in config/development.json. The config file can list multiple extractors, and their order matters: the first one is the primary extractor and the second one is its fallback, used when the first cannot extract the content of a URL.
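The primary/fallback ordering described above can be illustrated with a short sketch. This is not the module's actual implementation: the function `extractWithFallback`, the `{ name, extract }` shape, and the stub extractors are all hypothetical, chosen only to show how an ordered list yields fallback behaviour.

```javascript
// Illustrative only: tries each extractor in list order and returns the
// first non-empty result. The { name, extract } shape is an assumption.
async function extractWithFallback(url, extractors) {
  const errors = [];
  for (const { name, extract } of extractors) {
    try {
      const content = await extract(url);
      if (content) {
        return { extractor: name, content };
      }
      errors.push(`${name}: empty result`);
    } catch (err) {
      // Remember why this extractor failed, then move on to the next one.
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`All extractors failed for ${url}: ${errors.join("; ")}`);
}

// Stub extractors standing in for real ones such as read-art or node-unfluff.
const stubExtractors = [
  { name: "primary", extract: async () => { throw new Error("parse failed"); } },
  { name: "fallback", extract: async (url) => `content of ${url}` },
];

extractWithFallback("http://cnn.com", stubExtractors)
  .then((result) => console.log(result.extractor)); // prints "fallback"
```

Because the list is walked in order, swapping two entries changes which extractor is tried first, which is why the order in the config file matters.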

If you would like to use readability.com's Parser, you have to set your access token in the config file beforehand. You can claim your Parser key here.

Licence

Apache License 2.0