eurlex.js
eurlex.js is a command line utility to retrieve documents (specifically: regulation drafts) in all supported languages from the EUR-Lex website and convert them into JSON. It is made with node.js and can be installed locally via npm.
Install
eurlex.js can be installed using npm:
npm install -g eurlex
Of course you must have node node.js with npm installed.
eurlex.js works fine with Linux, *BSD and Darwin, but never was tested with Win32.
Usage
Once installed you can use eurlex on the command line:
eurlex [options] <EUR-Lex URI>
You get a brief description of all the options with
eurlex --help
If you are curious what it looks like to get and convert something, try:
eurlex -vu -l de,en,fr COM:2012:0011:FIN -o eurlex-com-2012-0011-fin.json
profile.json
Since the HTML otuput of Eurlex is pretty far from being machine readable, eurlex.js applies a lot of magic to read it anyway. The magic can be fine tuned with setting in a file called profile.json
. Here is a stripped and commented version of profile.json
:
"lang": "en""de""..." // array of avalable languages "expressions": // regular expressions "lang": "..." // to match the language of the document "title": "..." // to match the title of the document "delimiters": // delimiters (they are all regex) "en": // for this language "recitals": "...""..." // start and end of recitals "articles": "...""..." // start and end of articles "chapter": "^CHAPTER " // string to match a chapter "section": "^SECTION " // string to match a section "article": "^Article " // string to match an article "fixes": // before a line is parsed "...""..." // .replace(/first/, "second") "...""..." // as many as you need "lv": "recitals": "...""..." "articles": "...""..." "chapter": // if this is an array "^([XVI]+) NODAĻA" // if matches: chapter "^([XVI]+) NODAĻA$" // if matches: text missing "^([XVI]+) NODAĻA (.*)$" // $1 is the literal, $2 is the text "section": // same here... "^([0-9]+)\\. IEDAĻA" "^([0-9]+)\\. IEDAĻA$" "^([0-9]+)\\. IEDAĻA (.*)$" "article": // note! for article[3] "^([0-9]+)\\. pants" // $1 is the literal, __$3__ is the text "^([0-9]+)\\. pants$" "^([0-9]+)(\\.) pants (.*)$" "fixes": // fixes indeed can be empty
Limitations & Known issues
- In Magyar, paragraphs and points partly use the same literal enclosures, which leads to paragraphs will be interpreted as headless points. You should be safe using
--unify
with another language as first parameter. - The translations for Malti are formatted pretty crappy and have redundant fragments. You have to hardly rely on the fixes in your profile.json
License
eurlex.js is licensed under EUPL