Converts mediawiki collection bundles (as generated by mw-ocg-bundler) to stripped plaintext.
This is a proof-of-concept, but it could be used to archive or embed the textual content of wikipedia in a minimal amount of space.
Node versions 0.8 and 0.10 are tested to work.
Install the node package dependencies.
npm install
Install other system dependencies.
apt-get install unzip
You may wish to install the mw-ocg-bundler npm package to create bundles
from wikipedia articles. The text below assumes that you have done so;
ignore the mw-ocg-bundler references if you have bundles from some other
source.
To generate a plaintext file named out.txt from the English (enwiki)
wikipedia article "United States":
mw-ocg-bundler -o us.zip --prefix enwiki "United States"
bin/mw-ocg-texter -o out.txt us.zip
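To convert several articles, the same two commands can be repeated per title. The sketch below only prints the command pairs so they can be reviewed first. The helper name plan_conversion and the underscore-based file naming are assumptions made for illustration, and only flags shown in this README are used.

```shell
# Print the bundler/texter command pair for each article title given.
# plan_conversion is a hypothetical helper, not part of mw-ocg-texter.
# Titles containing double quotes are not handled by this sketch.
plan_conversion() {
  for title in "$@"; do
    # Assumed naming scheme: spaces in the title become underscores.
    slug=$(printf '%s' "$title" | tr ' ' '_')
    printf 'mw-ocg-bundler -o %s.zip --prefix enwiki "%s"\n' "$slug" "$title"
    printf 'bin/mw-ocg-texter -o %s.txt %s.zip\n' "$slug" "$slug"
  done
}

plan_conversion "United States" "Canada"
```

Once the printed commands look right, the output can be piped to sh to run them.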
The default format does 80-column word wrap. If you would like to
use "semantic" newlines (that is, newlines end paragraphs and there
are no newlines within paragraphs), use the --no-wrap option:
bin/mw-ocg-texter --no-wrap -o out.txt us.zip
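One reason to prefer semantic newlines is that the text can be re-wrapped later at any width. The sketch below illustrates this with the standard fold utility. The rewrap helper and the sample paragraph are hypothetical stand-ins, and fold's wrapping is not claimed to match the tool's own 80-column output exactly.

```shell
# Re-wrap text that has one paragraph per line, as in --no-wrap output.
# rewrap is a hypothetical helper; the width defaults to 80 columns.
rewrap() {
  # fold -s breaks lines at spaces; -w sets the maximum line width.
  fold -s -w "${1:-80}"
}

# Made-up sample paragraph standing in for one line of out.txt.
sample="This is a single paragraph with no newlines inside it, which is what the semantic newline format produces for each paragraph of an article."

printf '%s\n' "$sample" | rewrap 40
```

With a real file this would be used as: rewrap 72 < out.txt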
For other options, see the usage help built into bin/mw-ocg-texter.
To convert a single article without the bundle creation step, use:
bin/mw-ocg-texter -h en.wikipedia.org -t "United States"
The -h option specifies the hostname of the wiki, and the -t option
gives the title to convert. The content will be fetched from the
Wikimedia REST API and converted, with output to standard out (unless
the -o option is given).
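For orientation, the hostname and title map naturally onto a REST API URL. The sketch below builds one, assuming the public /api/rest_v1/page/html/ endpoint. Which endpoint mw-ocg-texter actually requests is not stated here, so treat this as an illustration only. The helper name rest_html_url is hypothetical.

```shell
# Build a Wikimedia REST API URL for an article's HTML.
# rest_html_url is a hypothetical helper; the endpoint path is an assumption
# about what the tool fetches, not something this README confirms.
rest_html_url() {
  host=$1
  # Spaces become underscores; other special characters would need
  # percent-encoding, which this sketch does not attempt.
  title=$(printf '%s' "$2" | tr ' ' '_')
  printf 'https://%s/api/rest_v1/page/html/%s\n' "$host" "$title"
}

rest_html_url en.wikipedia.org "United States"
```

The printed URL can be passed to a client such as curl to inspect the HTML that would be converted.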
This backend should implement the Unicode Nearly Plain-Text Encoding of Mathematics to render math content.
(c) 2013-2014 by C. Scott Ananian