Converts MediaWiki collection bundles (as generated by mw-ocg-bundler) to stripped plaintext.

This is a proof of concept, but it could be used to archive or embed the textual content of Wikipedia in a minimal amount of space.

Node versions 0.8 and 0.10 have been tested and are known to work.

Install the node package dependencies.

npm install

Install other system dependencies.

apt-get install unzip

You may wish to install the mw-ocg-bundler npm package to create bundles from Wikipedia articles. The text below assumes that you have done so; ignore the mw-ocg-bundler references if you have bundles from some other source.

To generate a plaintext file named out.txt from the English (enwiki) Wikipedia article "United States":

mw-ocg-bundler -o --prefix enwiki "United States"
bin/mw-ocg-texter -o out.txt

The default format wraps text at 80 columns. If you would like to use "semantic" newlines (that is, newlines end paragraphs and there are no newlines within paragraphs), use the --no-wrap option:

bin/mw-ocg-texter --no-wrap -o out.txt
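
The difference between the two output modes can be illustrated with a small sketch. This is not taken from the mw-ocg-texter source: the function names wrapParagraph and formatParagraphs, the options.wrap flag, and the choice of a blank line between wrapped paragraphs are all assumptions made for illustration.

```javascript
// Hypothetical helpers illustrating the two modes; NOT mw-ocg-texter's code.

// Greedy word wrap: pack as many words as fit within `width` columns.
// A single word longer than `width` is left on a line of its own.
function wrapParagraph(text, width) {
  var words = text.split(/\s+/).filter(function (w) { return w.length > 0; });
  var lines = [];
  var current = '';
  words.forEach(function (word) {
    if (current.length === 0) {
      current = word;
    } else if (current.length + 1 + word.length <= width) {
      current += ' ' + word;
    } else {
      lines.push(current);
      current = word;
    }
  });
  if (current.length > 0) { lines.push(current); }
  return lines.join('\n');
}

// options.wrap === false mimics --no-wrap: one line per paragraph, so every
// newline in the output marks the end of a paragraph.
// Otherwise paragraphs are wrapped at 80 columns and separated by a blank
// line (the separator is an assumption of this sketch).
function formatParagraphs(paragraphs, options) {
  var wrap = !(options && options.wrap === false);
  var formatted = paragraphs.map(function (p) {
    var collapsed = p.split(/\s+/).filter(Boolean).join(' ');
    return wrap ? wrapParagraph(collapsed, 80) : collapsed;
  });
  return formatted.join(wrap ? '\n\n' : '\n') + '\n';
}
```

With wrapping disabled, downstream tools can split the output on newlines to recover paragraphs; with wrapping enabled, the output is easier to read in a terminal but paragraph boundaries must be inferred from blank lines.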

For other options, see:

bin/mw-ocg-texter --help

To convert a single article without the bundle creation step, use:

bin/mw-ocg-texter -h -t "United States"

The -h option specifies the hostname of the wiki, and the -t option gives the title to convert. The content is fetched from the Wikimedia REST API and converted, and the result is written to standard output unless the -o option is given.
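
As a rough illustration of how a hostname and title map onto a REST API request, the sketch below builds a page HTML URL. The function name restHtmlUrl is hypothetical, and the use of the /api/rest_v1/page/html/ endpoint is an assumption; this README does not state which endpoint mw-ocg-texter actually requests.

```javascript
// Hypothetical helper; the endpoint choice is an assumption, not confirmed
// by this README.
function restHtmlUrl(hostname, title) {
  // MediaWiki titles use underscores in place of spaces. The title is then
  // percent-encoded so characters such as "/" survive as one path segment.
  var normalized = title.replace(/ /g, '_');
  return 'https://' + hostname + '/api/rest_v1/page/html/' +
    encodeURIComponent(normalized);
}

// Example with an assumed hostname for the English Wikipedia:
console.log(restHtmlUrl('en.wikipedia.org', 'United States'));
```

The sketch only constructs the URL and performs no network request, so it can be run offline.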

This backend should implement the Unicode Nearly Plain-Text Encoding of Mathematics to render math content.


(c) 2013-2014 by C. Scott Ananian