wikitext2plaintext

0.1.0 • Public • Published

Wikitext2Plaintext

Javascript module used to convert mediawiki markdown/wikitext to plain text. This module employs the repetitive application of regular expressions to strip out the markdown and therefore is not intended to be used in high-performance applications.

The module was developed to be used in scenarios where the wikitext only needs to be converted to plaintext on a best effort basis and perfect results are not required. It was designed to be used on wikitext/markdown from mediawiki and specifically Wikipedia database dumps.

Note that this library does NOT convert the HTML version of a wiki page, it only converts the wikitext/markdown version.

Install

npm install wikitext2plaintext

Usage

To remove wikitext from a string using the default options.

const WT2PT = require('wikitext2plaintext');
 
var wt = new WT2PT();
 
wt.parse('## The Title ##\r\n*List item 1\r\n*List item 2\r\n');
 
/*
The Title
- List item 1
- List item 2
*/

If you do not want certain wikitext removed or a specific rule is causing problems with your particular use-case, you can exclude specific rules or specific rule groups. The rules and rule groups are listed at the bottom of this page.

API

Constructor

You must create an instance of the parser prior to using the functions.

const wikitext2plaintext = require('wikitext2plaintext');
 
var wt2pt = new wikitext2plaintext();

wt2pt.parse(wiki_text)

  • wiki_text (string) - Contains the wiki/markdown text to convert to
  • Return value (string) - Contains the plain text version of the wiki text which was passed in

This is the main function used to convert wiki text to plain text.

var wt2pt = new wikitext2plaintext();
var plaintext;
 
plaintext = wt.parse('## The Title ##\r\n*List item 1\r\n*List item 2\r\n');
 
console.log(plaintext);
/*
The Title
- List item 1
- List item 2
*/

wt2pt.exclude_rule_group(rule_group_name)

Causes the specified rule group (see table below) to be excluded. All rules within that rule group will NOT be applied when the parse function is called.

  • rule_group_name (string) - The name of the rule group to exclude during parsing.
  • Return value (n/a) - No value returned from this function

wt2pt.include_rule_group(rule_group_name)

Causes the specified rule group (see table below) to be included. All rules within that rule group will be applied when the parse function is called.

  • rule_group_name (string) - The name of the rule group to include during parsing.
  • Return value (n/a) - No value returned from this function

wt2pt.exclude_rule(rule_name)

Causes the specified rule group (see table below) to be excluded. The specified rule will NOT be applied when the parse function is called.

  • rule_name (string) - The name of the rule to exclude during parsing.
  • Return value (n/a) - No value returned from this function

wt2pt.include_rule(rule_name)

Causes the specified rule group (see table below) to be included. The specified rule will be applied when the parse function is called.

  • rule_name (string) - The name of the rule to include during parsing.
  • Return value (n/a) - No value returned from this function

wt2pt.repeat_rule_group(rule_group_name, repeat_count)

Calling this function causes the rule group (rule_group_name) to be applied multiple times (repeat_count). This is done in order to handle nested markdown.

  • rule_group_name (string) - The name of the rule group to repeat.
  • repeat_count (number) - The number of times the rule group should be repeated (between 1 and 1000)
  • Return value (n/a) - No value returned from this function

Rules & Rule Groups

All rules in bold are run by default.

Rule Name Rule Group Description
BOLD_TAGS N/A Removes any bold tags (leaves text)
HEADER_TAGS N/A Removes any header tags (leaves text)
WIKI_TABLES_REMOVE WIKI_TABLES Removes wiki tables entirely (including removal of text)
FILE_LINKS LINKS Removes media/file references and replaces with the alt description
LOCAL_LINKS_ALT LINKS Replaces local wiki links with their alt link text
LOCAL_LINKS LINK Replaces local links with their name (when no alt text exists)
EXTERNAL_LINKS_ALT LINKS Replaces external links with their alt text
EXTERNAL_LINKS_REMOVE LINKS Removes external links which have no alt text
EXTERNAL_LINKS_KEEP_URL LINKS Replaces external links which have no alt text with the URL
CATEGORIES_FORMAT N/A Replaces a reference to a category with "Category - "
CATEGORIES_REMOVE N/A Remove any category references
LIST_DEPTH_6 LISTS Prefix depth 6 list elements with 6 dashes in place of markdown
LIST_DEPTH_5 LISTS Prefix depth 5 list elements with 5 dashes in place of markdown
LIST_DEPTH_4 LISTS Prefix depth 4 list elements with 4 dashes in place of markdown
LIST_DEPTH_3 LISTS Prefix depth 3 list elements with 3 dashes in place of markdown
LIST_DEPTH_2 LISTS Prefix depth 2 list elements with 2 dashes in place of markdown
LIST_DEPTH_1 LISTS Prefix depth 1 list elements with 1 dashes in place of markdown
HTML_REF_TAGS HTML_TAGS Removes HTML "ref" tags
HTML_COMMENT_TAGS HTML_TAGS Removes HTML "comment" tags
HTML_MATH_TAGS HTML_TAGS Removes HTML "math" tags
HTML_SUB_TAGS HTML_TAGS Removes HTML "sub" tags
HTML_SUP_TAGS HTML_TAGS Removes HTML "sup" tags
HTML_BLOCKQUOTE_TAGS HTML_TAGS Removes HTML "blockquote" tags
CITE_TITLE DBL_CURLY_TAGS Replaces Wikipedia "cite" templates with the title of the cite
CITATION_TITLE_1 DBL_CURLY_TAGS Replaces Wikipedia "citation" templates with the title and publisher
CITATION_TITLE_2 DBL_CURLY_TAGS Removes Wikipedia "citation" templates with the title and publisher (reverse)
ISBN_FORMAT DBL_CURLY_TAGS Replaces ISBN templates with the ISBN number
IMDB_STATIC DBL_CURLY_TAGS Replaces IMDB templates with static text: "IMDB Reference"
DMOZ_FORMAT DBL_CURLY_TAGS Replaces DMOZ templates with the name of the DMOZ reference
OFFICIAL_WEB_STATIC DBL_CURLY_TAGS Replaces official website links with static text: "Official Website"
CITE_REMOVE DBL_CURLY_TAGS Removes all cite templates
CURLY_OTHER DBL_CURLY_TAGS Removes all templates/content enclosed in double curly brackets
REPEATED_BLANK_LINES_REMOVE N/A Removes repeated blank lines which get created when removing markdown

Package Sidebar

Install

npm i wikitext2plaintext

Weekly Downloads

6

Version

0.1.0

License

MIT

Unpacked Size

19.2 kB

Total Files

4

Last publish

Collaborators

  • dankennedy