mw-ocg-texter

Converts MediaWiki collection bundles (as generated by mw-ocg-bundler) to stripped plaintext.

This is a proof of concept, but it could be used to archive or embed the textual content of Wikipedia in a minimal amount of space.

Node versions 0.8 and 0.10 have been tested and are known to work.

Install the node package dependencies.

npm install

Install other system dependencies.

apt-get install unzip

You may wish to install the mw-ocg-bundler npm package to create bundles from Wikipedia articles. The text below assumes that you have done so; ignore the mw-ocg-bundler references if you have bundles from some other source.

To generate a plaintext file named out.txt from the English Wikipedia (enwiki) article "United States":

mw-ocg-bundler -o us.zip --prefix enwiki "United States"
bin/mw-ocg-texter -o out.txt us.zip

The default format wraps words at 80 columns. If you would like to use "semantic" newlines (that is, newlines end paragraphs and there are no newlines within paragraphs), use the --no-wrap option:

bin/mw-ocg-texter --no-wrap -o out.txt us.zip
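
To make the difference between the two modes concrete, here is a minimal sketch of greedy word wrapping versus "semantic" newlines. It is not the code mw-ocg-texter uses internally; the function names (wrapParagraph, render), the options object, and the choice to separate wrapped paragraphs with a blank line are all assumptions made for illustration.

```javascript
// Illustrative sketch only; mw-ocg-texter's real implementation may differ.

// Greedy word wrap: pack as many words as will fit on each line.
function wrapParagraph(paragraph, width) {
  const words = paragraph.split(/\s+/).filter(Boolean);
  const lines = [];
  let current = '';
  for (const word of words) {
    if (current === '') {
      // A word longer than `width` still gets a line to itself.
      current = word;
    } else if (current.length + 1 + word.length <= width) {
      current += ' ' + word;
    } else {
      lines.push(current);
      current = word;
    }
  }
  if (current !== '') {
    lines.push(current);
  }
  return lines.join('\n');
}

// Render a list of paragraphs in either mode.
// options.wrap === true  : wrap at options.width columns; paragraphs are
//                          separated by a blank line (an assumption here).
// options.wrap === false : one line per paragraph ("semantic" newlines).
function render(paragraphs, options) {
  const width = options.wrap ? options.width : Infinity;
  const separator = options.wrap ? '\n\n' : '\n';
  return paragraphs.map((p) => wrapParagraph(p, width)).join(separator) + '\n';
}

// Hypothetical sample input, not taken from a real bundle.
const sampleParagraphs = [
  'The first paragraph of an article, which is long enough that it has to be broken across more than one line when wrapping is enabled.',
  'A second, shorter paragraph.',
];

console.log(render(sampleParagraphs, { wrap: true, width: 80 }));
console.log(render(sampleParagraphs, { wrap: false }));
```

Output in the second form is convenient for tools that treat each line as a paragraph, such as diffing or reflowing text in an editor.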

For other options, see:

bin/mw-ocg-texter --help

This backend should implement the Unicode Nearly Plain-Text Encoding of Mathematics to render math content.

GPLv2

(c) 2013-2014 by C. Scott Ananian