mw-ocg-texter
Converts MediaWiki collection bundles (as generated by mw-ocg-bundler) to stripped plaintext.
This is a proof of concept, but it could be used to archive or embed the textual content of Wikipedia in a minimal amount of space.
Installation
Node versions 0.8 and 0.10 have been tested and are known to work.
Install the node package dependencies.
npm install
Install other system dependencies.
apt-get install unzip
Generating bundles
You may wish to install the mw-ocg-bundler npm package to create bundles
from Wikipedia articles. The text below assumes that you have done
so; ignore the mw-ocg-bundler references if you have bundles from
some other source.
Running
To generate a plaintext file named out.txt from the en.wikipedia.org
article "United States":
$SOMEPATH/bin/mw-ocg-bundler -v -o us.zip -h en.wikipedia.org "United States"
bin/mw-ocg-texter -o out.txt us.zip
In the above command, $SOMEPATH is the place you installed
mw-ocg-bundler; if you've used the directory structure recommended
by mw-ocg-service, this will be ../mw-ocg-bundler.
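The two commands above can be wrapped in a small shell helper so that $SOMEPATH falls back to the mw-ocg-service layout when it is not set. This is only a sketch: bundler_bin and article_to_text are hypothetical names, and keeping the intermediate bundle next to the output file as "<output>.zip" is an assumption, not something the tools require.

```shell
# Print the path to the bundler executable. SOMEPATH is the variable used
# in the command above; the fallback is the directory layout recommended
# by mw-ocg-service.
bundler_bin() {
  printf '%s/bin/mw-ocg-bundler\n' "${SOMEPATH:-../mw-ocg-bundler}"
}

# Hypothetical helper: bundle one article, then convert it to plaintext.
#   $1 = wiki hostname, $2 = article title, $3 = output text file
# The bundle is written to "$3.zip" (an assumed naming choice).
article_to_text() {
  "$(bundler_bin)" -v -o "$3.zip" -h "$1" "$2" &&
    bin/mw-ocg-texter -o "$3" "$3.zip"
}
```

With these defined, article_to_text en.wikipedia.org "United States" out.txt would run the same two steps as above, leaving the bundle in out.txt.zip.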
The default format does 80-column word wrap. If you would like to
use "semantic" newlines (that is, newlines end paragraphs and there
are no newlines within paragraphs), use the --no-wrap option:
bin/mw-ocg-texter --no-wrap -o out.txt us.zip
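If you need a width other than 80 columns, one possible approach is to generate --no-wrap output and rewrap it afterwards with the standard fold utility. The sketch below assumes this workflow; rewrap is a hypothetical helper name, and the sample paragraph is made up for the demonstration rather than taken from real mw-ocg-texter output.

```shell
# Rewrap "semantic newline" text (one paragraph per line) to a chosen width.
# rewrap is a hypothetical helper; fold is the standard POSIX utility.
#   $1 = column width; text is read from stdin and written to stdout
rewrap() {
  # -s breaks lines at blanks where possible instead of splitting words
  fold -s -w "$1"
}

# Demonstration on a made-up single-line paragraph
sample='The United States of America is a country primarily located in North America and consisting of fifty states.'
wrapped=$(printf '%s\n' "$sample" | rewrap 40)
printf '%s\n' "$wrapped"
```

On real output this would be used as rewrap 72 < out.txt > out72.txt. Note that fold -s may leave a trailing blank at the end of each wrapped line, so the result may not match the built-in word wrap exactly.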
For other options, see:
bin/mw-ocg-texter --help
Standalone mode
To convert a single article without the bundle creation step, use:
bin/mw-ocg-texter -h en.wikipedia.org -t "United States"
The -h option specifies the hostname of the wiki, and the -t
option gives the title to convert. The content will be fetched
from the Wikimedia REST API and converted, with output to standard
out (unless the -o option is given).
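Standalone mode lends itself to converting several articles in a loop, using the -h, -t, and -o options described above. The following is a sketch under assumptions: titles.txt (one article title per line) and the title_to_filename helper are hypothetical, and the loop only runs when both the executable and the title list are present.

```shell
# Hypothetical helper: derive an output filename from an article title by
# replacing spaces and slashes with underscores and appending ".txt".
title_to_filename() {
  printf '%s.txt\n' "$(printf '%s' "$1" | tr ' /' '__')"
}

# Convert every title listed in titles.txt (an assumed file name).
if [ -x bin/mw-ocg-texter ] && [ -r titles.txt ]; then
  while IFS= read -r title; do
    [ -n "$title" ] || continue   # skip blank lines
    # stdin is redirected so the tool cannot consume the title list
    bin/mw-ocg-texter -h en.wikipedia.org -t "$title" \
      -o "$(title_to_filename "$title")" < /dev/null
  done < titles.txt
fi
```

For example, the title "United States" would be written to United_States.txt.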
Other ideas
This backend should implement the Unicode Nearly Plain-Text Encoding of Mathematics to render math content.
Related Projects
License
GPLv2
(c) 2013-2014 by C. Scott Ananian