🤖 pdf-bot
Easily create a microservice for generating PDFs using headless Chrome.
pdf-bot is installed on a server and receives URLs to turn into PDFs through its API or CLI. pdf-bot manages a queue of PDF jobs. Once a PDF job has run, it notifies you using a webhook so you can fetch the PDF. pdf-bot supports storing PDFs on S3 out of the box. Failed PDF generations and webhook pings are retried on a configurable decaying schedule.
pdf-bot uses html-pdf-chrome under the hood and supports all the settings that it supports. Major thanks to @westy92 for making this possible.
How does it work?
Imagine you have an app that creates invoices. You want to save those invoices as PDFs. You install pdf-bot on a server as an API. Your app server sends the URL of the invoice to the pdf-bot server. A cronjob on the pdf-bot server keeps checking for new jobs, generates a PDF using headless Chrome, and sends the location back to the application server using a webhook.
Prerequisites
- Node.js v6 or later
Installation
$ npm install -g pdf-bot
$ pdf-bot install
Make sure the node path is in your $PATH
pdf-bot install will prompt for some basic configuration and then create a storage folder where your database and PDF files will be saved.
Configuration
pdf-bot comes packaged with sensible defaults. At the very minimum, you must have a config file with a storagePath in the folder from which you execute pdf-bot. In practice, you probably want to use the pdf-bot install command to generate a configuration file and then define an alias:
alias pdf-bot="pdf-bot -c /home/pdf-bot.config.js"
pdf-bot.config.js
var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'crazy-secret'
  },
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Timer(1000) // 1 sec timeout
  },
  storagePath: 'storage'
}
$ pdf-bot -c ./pdf-bot.config.js push https://esbenp.github.io
See a full list of the available configuration options.
Usage guide
Structure and concept
pdf-bot is meant to be a microservice that runs a server to generate PDFs for you. That usually means your application server sends requests to the PDF server, asking for a URL to be rendered as a PDF. pdf-bot manages a queue and retries failed generations. Once a job is successfully generated, a path to the PDF is sent back to your application server.
Let us check out the flow for an app that generates PDF invoices.
1. (App server): An invoice is created ----> Send URL to invoice to pdf-bot server
2. (pdf-bot server): Put the URL in the queue
3. (pdf-bot server): PDF is generated using headless Chrome
4. (pdf-bot server): (if failed try again using 1 min, 3 min, 10 min, 30 min, 60 min delay)
5. (pdf-bot server): Upload PDF to storage (e.g. Amazon S3)
6. (pdf-bot server): Send S3 location of PDF back to the app server
7. (App server): Receive S3 location of PDF -> Check signature sum matches for security
8. (App server): Handle PDF however you see fit (move it, download it, save it etc.)
You can send metadata to the pdf-bot server that will be sent back to the application. This can help you identify which PDF you are receiving.
Setup
On your pdf-bot server, start by creating a config file pdf-bot.config.js. You can see an example file here
pdf-bot.config.js
module.exports = {
  api: {
    port: 3000,
    token: 'api-token'
  },
  storage: {
    's3': ...
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}
As a minimum, you should configure an access token for your API. This will be used to authenticate jobs sent to your pdf-bot server. You also need to add a webhook configuration to have PDF notifications sent back to your application server. You should add a secret, which is used to generate a signature that lets the receiver check that the request has not been tampered with during transfer.
Start your API using
pdf-bot -c ./pdf-bot.config.js api
This will start an express server that listens for new jobs on port 3000.
Setting up Chrome
pdf-bot uses html-pdf-chrome, which in turn uses chrome-launcher to launch Chrome. Check out those two projects for how to properly set up Chrome. With chrome-launcher, Chrome should be started automatically. Otherwise, html-pdf-chrome has a small guide on how to keep it running as a process using pm2.
You can install Chrome on Ubuntu using
sudo apt-get update && sudo apt-get install chromium-browser
If you are testing things on OSX or similar, chrome-launcher should be able to find and automatically start Chrome for you.
Setting up the receiving API
In the examples folder there is a small example on how the application API could look. Basically, you just have to define an endpoint that will receive the webhook and check that the signature matches.
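As a rough illustration of the signature check, here is a minimal sketch using Node's built-in crypto module. It assumes the signature is a hex-encoded HMAC-SHA1 of the raw request body, computed with the webhook secret. This README only states that an hmac-sha1 signature is generated from the secret and sent in the X-PDF-Signature header, so the signed payload, the encoding, and the helper names computeSignature and isValidSignature are assumptions; the example in the examples folder is the authoritative reference.

```javascript
const crypto = require('crypto');

// ASSUMPTION: the signature is hex(HMAC-SHA1(rawBody, secret)).
// Check the example receiving API for what pdf-bot actually signs.
function computeSignature(rawBody, secret) {
  return crypto.createHmac('sha1', secret).update(rawBody).digest('hex');
}

// Returns true only if the received signature matches the expected one.
function isValidSignature(rawBody, receivedSignature, secret) {
  const expected = Buffer.from(computeSignature(rawBody, secret), 'hex');
  const received = Buffer.from(String(receivedSignature), 'hex');
  // timingSafeEqual throws on buffers of different length, so compare lengths first.
  if (expected.length !== received.length) {
    return false;
  }
  return crypto.timingSafeEqual(expected, received);
}
```

In a webhook endpoint you would pass the raw request body, the value of the signature header, and the secret from your pdf-bot configuration, and reject the request when the function returns false.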
Setup production environment
Follow the guide under production/ to see how to set up pdf-bot using pm2 and nginx.
Setup crontab
We set up our crontab to continuously look for jobs that have not yet been completed.
* * * * * node $(npm bin -g)/pdf-bot -c ./pdf-bot.config.js shift:all >> /var/log/pdfbot.log 2>&1
* * * * * node $(npm bin -g)/pdf-bot -c ./pdf-bot.config.js ping:retry-failed >> /var/log/pdfbot.log 2>&1
Quick example using the CLI
Let us assume I want to generate a PDF for https://esbenp.github.io. I can add the job using the pdf-bot CLI.
$ pdf-bot -c ./pdf-bot.config.js push https://esbenp.github.io --meta '{"id":1}'
Next, if my crontab is not set up to run it automatically, I can run it using the shift:all command
$ pdf-bot -c ./pdf-bot.config.js shift:all
This will look for the oldest uncompleted job and run it.
How can I generate PDFs for sites that use JavaScript?
This is a common issue with PDF generation. Luckily, html-pdf-chrome has a really awesome API for dealing with JavaScript. You can specify a timeout in milliseconds, or wait for elements or custom events. To add a wait, simply configure the generator key in your configuration. Below are a few examples.
Wait for 5 seconds
var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'api-token'
  },
  // html-pdf-chrome options
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Timer(5000) // waits for 5 sec
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}
Wait for event
var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'api-token'
  },
  // html-pdf-chrome options
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Event(
      'myEvent', // name of the event to listen for
      '#myElement', // optional DOM element CSS selector to listen on, defaults to body
      5000 // optional timeout (milliseconds)
    )
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}
In your JavaScript, trigger the event when rendering is complete
document.querySelector('#myElement').dispatchEvent(new CustomEvent('myEvent'));
Wait for variable
var htmlPdf = require('html-pdf-chrome')

module.exports = {
  api: {
    token: 'api-token'
  },
  // html-pdf-chrome options
  generator: {
    completionTrigger: new htmlPdf.CompletionTrigger.Variable(
      'myVarName', // optional, name of the variable to wait for. Defaults to 'htmlPdfDone'
      5000 // optional, timeout (milliseconds)
    )
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}
In your JavaScript, set the variable when the rendering is complete
window.myVarName = true;
You can find more completion triggers in html-pdf-chrome's documentation
API
Below are the endpoints exposed by pdf-bot's REST API
Push URL to queue: POST /
key | type | required | description |
---|---|---|---|
url | string | yes | The URL to generate a PDF from |
meta | object | no | Optional metadata object to send back to the webhook URL |
Example
curl -X POST -H 'Authorization: Bearer api-token' -H 'Content-Type: application/json' http://pdf-bot.com/ -d '
  {
    "url":"https://esbenp.github.io",
    "meta":{
      "type":"invoice",
      "id":1
    }
  }'
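The same request can be prepared from an application server written in JavaScript. The sketch below only builds the request pieces (URL, headers, JSON body) so that it stays independent of any HTTP client. The helper name buildPushRequest and its parameters are hypothetical; they are not part of pdf-bot.

```javascript
// Builds the parts of a "POST /" request for the pdf-bot API.
// pdfBotUrl and apiToken are placeholders for your own server and token.
function buildPushRequest(pdfBotUrl, apiToken, url, meta) {
  const payload = { url: url };
  // meta is optional, so only include it when the caller provides it.
  if (meta !== undefined) {
    payload.meta = meta;
  }
  return {
    url: pdfBotUrl,
    options: {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer ' + apiToken,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(payload)
    }
  };
}
```

On Node 18 or later, where fetch is built in, the result could be sent with fetch(request.url, request.options). On older Node versions you would need an HTTP client of your choice.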
Database
LowDB (file-database) (default)
If you have low concurrency (you run a job every now and then), you can use the default database driver, which uses LowDB.
var LowDB = require(...)

module.exports = {
  api: {
    token: 'api-token'
  },
  db: LowDB(...),
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}
PostgreSQL
var pgsql = require(...)

module.exports = {
  api: {
    token: 'api-token'
  },
  db: pgsql(...),
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}
Optionally, you can specify a database URL by specifying a connectionString.
To install the necessary database tables, run db:migrate. You can also destroy the database by running db:destroy.
Storage
Currently, pdf-bot comes bundled with built-in support for storing PDFs on Amazon S3.
Feel free to contribute a PR if you want to see other storage plugins in pdf-bot!
Amazon S3
To enable S3 storage, add a key to the storage configuration. Note that you can add as many different locations as you want by giving them different keys.
var createS3Config = require(...)

module.exports = {
  api: {
    token: 'api-token'
  },
  storage: {
    'my_s3': createS3Config(...)
  },
  webhook: {
    secret: '1234',
    url: 'http://localhost:3000/webhooks/pdf'
  }
}
Options
var htmlPdf = require('html-pdf-chrome')

var decaySchedule = [
  1000 * 60, // 1 minute
  1000 * 60 * 3, // 3 minutes
  1000 * 60 * 10, // 10 minutes
  1000 * 60 * 30, // 30 minutes
  1000 * 60 * 60 // 1 hour
];

module.exports = {
  // The settings of the API
  api: {
    // The port your express.js instance listens to requests from. (default: 3000)
    port: 3000,
    // Spawn command when a job has been pushed to the API
    postPushCommand: ['/home/user/.npm-global/bin/pdf-bot', ['-c', './pdf-bot.config.js', 'shift:all']],
    // The token used to validate requests to your API. Not required, but 100% recommended.
    token: 'api-token'
  },
  db: ..., // see other drivers under Database
  // html-pdf-chrome
  generator: {
    // Triggers that specify when the PDF should be generated
    completionTrigger: new htmlPdf.CompletionTrigger.Timer(1000), // waits for 1 sec
    // The port to listen for Chrome (default: 9222)
    port: 9222
  },
  queue: {
    // How frequent should pdf-bot retry failed generations?
    // (default: 1 min, 3 min, 10 min, 30 min, 60 min)
    generationRetryStrategy: function(job, retries) {
      return decaySchedule[retries - 1] ? decaySchedule[retries - 1] : 0
    },
    // How many times should pdf-bot try to generate a PDF?
    // (default: 5)
    generationMaxTries: 5,
    // How many generations to run at the same time when using shift:all
    parallelism: 4,
    // How frequent should pdf-bot retry failed webhook pings?
    // (default: 1 min, 3 min, 10 min, 30 min, 60 min)
    webhookRetryStrategy: function(job, retries) {
      return decaySchedule[retries - 1] ? decaySchedule[retries - 1] : 0
    },
    // How many times should pdf-bot try to ping a webhook?
    // (default: 5)
    webhookMaxTries: 5
  },
  storage: {
    's3': ...
  },
  webhook: {
    // The prefix to add to all pdf-bot headers on the webhook response.
    // I.e. X-PDF-Transaction and X-PDF-Signature. (default: X-PDF-)
    headerNamespace: 'X-PDF-',
    // Extra request options to add to the Webhook ping.
    requestOptions: {},
    // The secret used to generate the hmac-sha1 signature hash.
    // !Not required, but should definitely be included!
    secret: '1234',
    // The endpoint to send PDF messages to.
    url: 'http://localhost:3000/webhooks/pdf'
  }
}
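To make the decaying retry schedule concrete, here is a small standalone sketch of the lookup used in the options above: the delay for a given retry is read from the schedule, and 0 is returned once the schedule is exhausted. The function name retryDelay is hypothetical and chosen for illustration; it assumes retries is a 1-based count of attempts made so far.

```javascript
// Delays in milliseconds, matching the documented default schedule.
const decaySchedule = [
  1000 * 60,      // 1 minute
  1000 * 60 * 3,  // 3 minutes
  1000 * 60 * 10, // 10 minutes
  1000 * 60 * 30, // 30 minutes
  1000 * 60 * 60  // 1 hour
];

// Returns the delay before the next attempt, or 0 when no entry exists
// for this retry count (for example, after the fifth retry).
function retryDelay(retries) {
  const delay = decaySchedule[retries - 1];
  return delay !== undefined ? delay : 0;
}
```

Replacing the array with your own values is the simplest way to tune how aggressively failed generations or webhook pings are retried.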
CLI
pdf-bot comes with a full CLI included! Use -c to pass a configuration to pdf-bot. You can also use --help to get a list of all commands. An example is given below.
$ pdf-bot.js --config ./examples/pdf-bot.config.js --help

  Usage: pdf-bot [options] [command]

  Options:

    -V, --version        output the version number
    -c, --config <path>  Path to configuration file
    -h, --help           output usage information

  Commands:

    api                  Start the API
    db:migrate
    db:destroy
    install
    generate [jobID]     Generate PDF
Debug mode
pdf-bot uses debug for debug messages. You can turn on debugging by setting the environment variable DEBUG=pdf:* like so
DEBUG=pdf:* pdf-bot jobs
Tests
$ npm run test
Issues
Please report issues to the issue tracker
License
The MIT License (MIT). Please see License File for more information.