virastar

    0.21.0 • Public • Published

    Virastar (ویراستار)

    Virastar is a Persian text cleaner.

    A javascript port of aziz/virastar with lots of help from ebraminio/persiantools

    see live demo

    Build Status Dependency Status NPM version GitHub issues GitHub license js-semistandard-style

    Install

    npm install virastar

    Usage

    var Virastar = require('virastar');
    var virastar = new Virastar();
     
    virastar.cleanup("فارسي را كمی درست تر می نويسيم"); // Outputs: "فارسی را کمی درست‌تر می‌نویسیم"

    Browser

    <script src="lib/virastar.js"></script>
    <script>
      var virastar = new Virastar();
      alert(virastar.cleanup("فارسي را كمی درست تر می نويسيم"));
    </script> 

    Virastar([text] [,options])

    text

    Type: string

    string of persian source to be cleaned.

    options

    Type: object

    Virastar("سلام 123" ,{"fix_english_numbers":false}); // Outputs: "سلام 123"

    Options and Specifications

    Virastar comes with a list of options to control its behavior.

    normalize_eol

    default: true

    • replaces windows end of lines with unix eol (\n)

    decode_htmlentities

    default: true

    • converts numeral and selected html character-sets into original characters

    fix_dashes

    default: true

    • replaces triple dash to mdash
    • replaces double dash to ndash

    fix_three_dots

    default: true

    • removes spaces between dots
    • replaces three dots with ellipsis character

    normalize_ellipsis

    default: true

    • replaces more than one ellipsis with one
    • replaces (space|tab|zwnj) after ellipsis with one space

    normalize_dates

    default: true

    • re-orders date parts with slash as delimiter

    fix_english_quotes_pairs

    default: true

    • replaces english quote pairs (“”) with their persian equivalent («»)

    fix_english_quotes

    default: true

    • replaces english quote marks with their persian equivalent

    fix_hamzeh

    default: true

    • replaces ه followed by (space|ZWNJ|lrm) follow by ی with هٔ
    • replaces ه followed by (space|ZWNJ|lrm|nothing) follow by ء with هٔ
    • replaces هٓ or single-character ۀ with the standard هٔ

    fix_hamzeh_arabic

    default: false

    • converts arabic hamzeh ة to هٔ

    cleanup_rlm

    default: true

    • converts Right-to-left marks followed by persian characters to zero-width non-joiners (ZWNJ)

    cleanup_zwnj

    default: true

    • converts all soft hyphens (&shy;) into zwnj
    • removes more than one zwnj
    • cleans zwnj after characters that don't conncet to the next
    • cleans zwnj before and after numbers, english words, spaces and punctuations
    • removes unnecessary zwnj on start/end of each line

    fix_arabic_numbers

    default: true

    • replaces arabic numbers with their persian equivalent

    fix_english_numbers

    default: true

    • replaces english numbers with their persian equivalent

    fix_numeral_symbols

    default: true

    • replaces english percent signs (U+066A)
    • replaces dots between numbers into decimal separator (U+066B)
    • replaces commas between numbers into thousands separator (U+066C)

    fix_misc_non_persian_chars

    default: true

    • replaces arabic normal/swash kaf with its persian equivalent
    • replaces arabic/urdu/pushtu/uyghur yeh with its persian equivalent
    • replaces kurdish he with its persian equivalent

    fix_punctuations

    default: true

    • replaces ,, ; with its persian equivalent

    fix_question_mark

    default: true

    • replaces question marks with its persian equivalent

    fix_perfix_spacing

    default: true

    • puts zwnj between the word and the prefix:
      • mi*, nemi*, bi*

    fix_suffix_spacing

    default: true

    • puts zwnj between the word and the suffix:
      • *ha, *haye
      • *am, *at, *ash, *ei, *eid, *eem, *and, *man, *tan, *shan
      • *tar, *tari, *tarin
      • *hayee, *hayam, *hayat, *hayash, *hayetan, *hayeman, *hayeshan

    fix_suffix_misc

    default: true

    • replaces ه followed by ئ or ی, and then by ی, with ه‌ای

    fix_spacing_for_braces_and_quotes

    default: true

    • removes inside spaces and more than one outside for (), [], {}, “” and «»

    fix_spacing_for_punctuations

    default: true

    • removes space before punctuations
    • removes more than one space after punctuations, except followed by new-lines
    • removes space after colon that separates time parts
    • removes space after dots in numbers
    • removes space before some common domain tlds
    • removes space between question and exclamation marks
    • removes space between same marks

    fix_diacritics

    default: true

    • cleans zwnj before diacritic characters
    • cleans more than one diacritic characters
    • cleans spaces before diacritic characters

    remove_diacritics

    default: false

    • removes all diacritic characters

    fix_persian_glyphs

    default: true

    • converts incorrect persian glyphs to standard characters

    fix_misc_spacing

    default: true

    • removes space before parentheses on misc cases
    • removes space before braces containing numbers

    cleanup_spacing

    default: true

    • replaces more than one space with just a single one
    • cleans whitespace/zwnj between new-lines

    cleanup_line_breaks

    default: true

    • cleans more than two contiguous line breaks

    cleanup_begin_and_end

    default: true

    • removes space/tab/zwnj/nbsp from the beginning of the new-lines
    • removes spaces, tabs, zwnj, direction marks and new lines from the beginning and end of text

    markdown

    markdown_normalize_braces

    default: true

    • removes spaces between [] and () ([text] (link) into [text](link))
    • removes space between ! and opening brace (! [alt](src) into ![alt](src))
    • removes spaces inside double (), [], {} ([[ text ]] into [[text]])
    • removes spaces between double (), [], {} ([[text] ] into [[text]])

    markdown_normalize_lists

    default: true

    • removes extra lines between two items on a markdown list beginning with -, * or #

    skip_markdown_ordered_lists_numbers_conversion

    default: false

    • skips converting english numbers of ordered lists in markdown

    aggressive editing

    cleanup_extra_marks

    default: true

    • replaces more than one exclamation mark with just one
    • replaces more than one english or persian question mark with just one
    • re-orders consecutive marks: ?! into !?

    kashidas_as_parenthetic

    default: true

    • replaces kashidas to ndash in parenthetic

    cleanup_kashidas

    default: true

    • converts kashida between numbers to ndash
    • removes all kashidas between non-whitespace characters

    extras

    preserve_frontmatter

    default: true

    • preserves frontmatter data in the text

    preserve_HTML

    default: true

    • preserves all html tags in the text

    preserve_comments

    default: true

    • preserves all html comments in the text

    preserve_entities

    default: true

    • preserves all html entities in the text

    preserve_URIs

    default: true

    • preserves all uri strings in the text

    preserve_brackets

    default: false

    • preserves strings inside square brackets ([])

    preserve_braces

    default: false

    • preserves strings inside curly braces ({})

    preserve_nbsps

    default: true

    • preserves all no-break space entities in the text

    License

    This software is licensed under the MIT License. View the license.

    Keywords

    Install

    npm i virastar

    DownloadsWeekly Downloads

    13

    Version

    0.21.0

    License

    none

    Unpacked Size

    75.9 kB

    Total Files

    7

    Last publish

    Collaborators

    • juvee