wikipedia-elasticsearch-import

Import Wikipedia dumps into an Elasticsearch server using streams and bulk indexing for speed.

What does this module do?

Wikipedia publishes dumps of its entire database in every language on a regular basis, which you can download and use for free. This module parses the giant Wikipedia XML dump file as a stream and bulk-imports its contents directly into your Elasticsearch server or cluster.
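
The general pattern is sketched below: pipe the dump file through a streaming XML parser, collect each page element, and flush the pages to Elasticsearch in bulk batches. This is a minimal illustration of the approach, not the module's actual source; it assumes the sax parser and the legacy elasticsearch Node client, and the file path, index and type names are placeholders.

```js
// sketch-import.js — minimal sketch of streaming + bulk indexing, not the module's source.
// Assumes the "sax" XML parser and the legacy "elasticsearch" client.
const fs = require('fs')
const sax = require('sax')
const elasticsearch = require('elasticsearch')

const client = new elasticsearch.Client({ host: 'localhost:9200' })
const parser = sax.createStream(true) // strict mode

let current = null   // page currently being parsed
let field = null     // element we are currently inside (title or text)
let batch = []       // bulk actions waiting to be flushed
const BULK_LIMIT = 100

function flush () {
  if (batch.length === 0) return Promise.resolve()
  const body = batch
  batch = []
  return client.bulk({ body })
}

parser.on('opentag', node => {
  if (node.name === 'page') current = { title: '', text: '' }
  else if (current && (node.name === 'title' || node.name === 'text')) field = node.name
})

parser.on('text', text => {
  if (current && field) current[field] += text
})

parser.on('closetag', name => {
  if (name === 'title' || name === 'text') field = null
  if (name === 'page' && current) {
    // Each page contributes an action line and a document line to the bulk body.
    batch.push({ index: { _index: 'wikipedia', _type: 'page' } }, current)
    current = null
    if (batch.length / 2 >= BULK_LIMIT) flush().catch(err => console.error(err.message))
  }
})

parser.on('end', () => flush().then(() => console.log('Import finished')))

fs.createReadStream('./enwiki-20180801-pages-articles-multistream.xml').pipe(parser)
```

A real importer would also apply backpressure (pausing the read stream while a bulk request is in flight); that is omitted here for brevity.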

How to import Wikipedia dump into your own Elasticsearch server

  1. In order to import a Wikipedia dump, you must have an Elasticsearch server running first. Please refer to the Elasticsearch documentation for how to do this.
  2. Download the latest Wikipedia dump for the language you want from the official Wikimedia dumps site.
  3. Decompress the downloaded .xml.bz2 file, e.g. enwiki-20180801-pages-articles-multistream.xml.bz2, to obtain the raw .xml file.
  4. Edit the config.js file to point to the Wikipedia dump .xml file and set the Elasticsearch server connection settings (a hedged example is shown under Settings below).
  5. Run the importer with npm start and watch as your Elasticsearch index is populated with raw Wikipedia documents (a quick way to check progress is shown after this list).
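
To verify that documents are arriving, you can query the document count while the importer runs. The snippet below is a minimal sketch, not part of the module; it assumes the legacy elasticsearch client and that config.js exports host, port and index fields as described under Settings.

```js
// check-count.js — progress check (assumed client and config layout, see note above)
const elasticsearch = require('elasticsearch')
const config = require('./config')

const client = new elasticsearch.Client({ host: `${config.host}:${config.port}` })

client.count({ index: config.index })
  .then(res => console.log(`Documents indexed so far: ${res.count}`))
  .catch(err => console.error('Count failed:', err.message))
```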

Settings

  • You can set a limit on the number of documents per bulk import in config.js; the default is 100.
  • Set index, type, host, port and logFile. If you have enabled the X-Pack plugin for Elasticsearch, you can also set the httpAuth setting; otherwise it is ignored.
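
A hypothetical config.js illustrating these settings is shown below. The exact field names and structure are assumptions based on the options listed above, not the module's verified schema; check the config.js shipped with the package for the authoritative layout.

```js
// config.js — illustrative sketch only; field names are assumptions based on the
// settings described in this README, not the module's verified schema.
module.exports = {
  dumpFile: './enwiki-20180801-pages-articles-multistream.xml', // path to the unzipped dump
  index: 'wikipedia',        // Elasticsearch index to write to
  type: 'page',              // document type
  host: 'localhost',         // Elasticsearch host
  port: 9200,                // Elasticsearch port
  logFile: './import.log',   // where import progress and errors are written
  bulkSize: 100,             // documents per bulk request (100 by default)
  httpAuth: 'user:password'  // only used when the X-Pack plugin is enabled
}
```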

Please contribute

Please visit my GitHub to post your questions, suggestions and pull requests.

Install

npm i wikipedia-elasticsearch-import
