A Node.js port of Budou (https://github.com/google/budou).
English uses spacing and hyphenation as cues that allow for beautiful and legible line breaks. Certain CJK languages have neither, and breaking their lines well is notoriously more difficult. Breaks occur arbitrarily, usually in the middle of a word. This is a long-standing issue in typography on the web, and it degrades readability.
Budou automatically translates CJK sentences into organized HTML code with lexical chunks wrapped in non-breaking markup, so that line breaks can be controlled semantically. Budou uses the Google Cloud Natural Language API (NL API) to analyze the input sentence, and it concatenates words into meaningful chunks using part-of-speech (POS) tagging and syntactic information. Processed chunks are wrapped in <span> tags, so that semantic units are no longer split at the end of a line once their display property is set to inline-block in CSS.
Install budou-node using npm:
npm install budou
or yarn:
yarn add budou
How to use
Get the parser by completing authentication with a credential file for the NL API, which can be downloaded from Google Cloud Platform by navigating through "API Manager" > "Credentials" > "Create credentials" > "Service account key" > "JSON".
The path to the file can be set as an environment variable, GOOGLE_APPLICATION_CREDENTIALS, or passed as an option to the
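const Budou = require('budou');

// Log in to the Cloud Natural Language API with credentials.
// Assumption: an authenticate() call as in the Python library, with
// GOOGLE_APPLICATION_CREDENTIALS set in the environment.
const parser = Budou.authenticate();

// Set options and parse text for the result
const options = { attributes: { class: 'wordwrap' }, language: 'ja' };
const result = await parser.parse('今日も元気です', options);

// The result contains the generated HTML:
// "<span><span class="wordwrap">今日も</span><span class="wordwrap">元気です</span></span>"
// and the individual chunks: "今日も" and "元気です"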
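The two ways of supplying the credential file path described above could be combined in a small lookup helper. The following is only a sketch under stated assumptions: `resolveCredentialsPath` and the `credentialsPath` option are hypothetical names that are not part of budou, and the precedence (explicit option first, then the environment variable) is an assumption rather than documented behavior.

```javascript
// Hypothetical helper, not part of budou: picks the credential file path
// from an explicit option first, then from the environment variable.
function resolveCredentialsPath(options = {}, env = process.env) {
  const path = options.credentialsPath || env.GOOGLE_APPLICATION_CREDENTIALS;
  if (!path) {
    throw new Error(
      'No credential file: pass credentialsPath or set GOOGLE_APPLICATION_CREDENTIALS'
    );
  }
  return path;
}

// Usage with an injected environment object, so the example does not
// depend on the variables set on the current machine.
const credentialsPath = resolveCredentialsPath(
  {},
  { GOOGLE_APPLICATION_CREDENTIALS: '/path/to/credentials.json' }
);
console.log(credentialsPath);
```

Passing the environment as a parameter keeps the helper easy to test without changing `process.env`.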
To make the semantic units in the output HTML wrap correctly at the end of a line, style each <span> tag with display: inline-block in CSS.
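To illustrate the shape of the markup that this styling applies to, here is a minimal sketch that builds the same kind of HTML from a list of chunks. It does not call budou or the NL API: `wrapChunks` and `escapeHtml` are hypothetical helpers written for this example, and the chunk list is supplied by hand.

```javascript
// Hypothetical helpers, not part of budou.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Wraps each chunk in a <span> carrying the given attributes,
// and places all of them inside one outer <span>.
function wrapChunks(chunks, attributes = {}) {
  const attrs = Object.entries(attributes)
    .map(([name, value]) => ` ${name}="${escapeHtml(String(value))}"`)
    .join('');
  const inner = chunks
    .map((chunk) => `<span${attrs}>${escapeHtml(chunk)}</span>`)
    .join('');
  return `<span>${inner}</span>`;
}

// CSS rule matching the class used below; with it, a line break
// generally falls between chunks rather than inside one.
const css = '.wordwrap { display: inline-block; }';

const html = wrapChunks(['今日も', '元気です'], { class: 'wordwrap' });
console.log(html);
console.log(css);
```

The `class` attribute is only one option; any attribute that a CSS selector can target would work the same way.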
See Original Docs for:
The parser.parse(text, options) method accepts the options below in addition to the input text.
| Option | Description |
| --- | --- |
| attributes | A key-value mapping for attributes of the output <span> tags. |
| | Whether to use caching. Helps reduce calls to the NL API for repeated text. |
| language | Language of the text. If |
| | Whether to use Entity mode. |
| | Maximum chunk character length. A chunk longer than this is not wrapped in a <span> tag. |
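The maximum chunk length rule described in the last row can be pictured with a short sketch. This is not budou's implementation: `wrapWithLimit` and its `maxLength` parameter are hypothetical names, and the chunks are assumed to be plain text that needs no HTML escaping.

```javascript
// Hypothetical sketch, not budou's implementation: chunks longer than
// maxLength are emitted as plain text, all others are wrapped in <span>.
function wrapWithLimit(chunks, maxLength) {
  return chunks
    .map((chunk) =>
      // Spread into code points so a surrogate pair counts as one character.
      [...chunk].length > maxLength ? chunk : `<span>${chunk}</span>`
    )
    .join('');
}

const limited = wrapWithLimit(['今日も', '元気です'], 3);
console.log(limited);
```

Leaving long chunks unwrapped lets the browser break them normally, which avoids overflow when a single chunk is wider than its container.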
Budou is backed by the Google Cloud Natural Language API, so using it may incur costs.
In other languages, including Japanese, the default parser uses Syntax Analysis and incurs cost according to monthly usage. If you enable Entity mode by specifying use_entity=True, the parser uses both Syntax Analysis and Entity Analysis, which incurs additional cost.
The Google Cloud Natural Language API offers a free quota, so you can start testing the feature at no cost. Refer to the [Google Cloud Natural Language API Pricing Guide](https://cloud.google.com/natural-language/pricing) for detailed pricing information.
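Because each call to the NL API may count toward usage, caching results for repeated text can help keep costs down. The sketch below shows one way such a cache might work. It is not budou's cache: `withCache` is a hypothetical helper, and `fakeAnalyze` is a stand-in for a billable API call.

```javascript
// Hypothetical sketch, not budou's cache: memoizes an async analysis
// function so a repeated (text, language) pair triggers only one call.
function withCache(analyze) {
  const cache = new Map();
  return (text, language) => {
    const key = JSON.stringify([text, language]);
    if (!cache.has(key)) {
      const pending = Promise.resolve(analyze(text, language)).catch((error) => {
        cache.delete(key); // do not keep failed requests in the cache
        throw error;
      });
      cache.set(key, pending);
    }
    return cache.get(key);
  };
}

// Stand-in for a billable API call; counts how often it runs.
let calls = 0;
const fakeAnalyze = async (text) => {
  calls += 1;
  return [text];
};

const cachedAnalyze = withCache(fakeAnalyze);
const first = cachedAnalyze('今日も元気です', 'ja');
const second = cachedAnalyze('今日も元気です', 'ja');
first.then(() => console.log(`API calls made: ${calls}`));
```

Storing the pending promise rather than the resolved value means that two requests for the same text made at the same time also share a single call.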
This Node.js library is derived from the original Budou Python library (https://github.com/google/budou), which is licensed under Apache-2.0. It is in no way associated with or endorsed by the original project.