VNTK
Vietnamese NLP Toolkit for Node
Installation In A Nutshell
- Install Node.js
- Run:
$ npm install vntk --save
If you are interested in contributing to vntk, or just hacking on it, then fork it away!
Jump to guide: How to build an NLP API Server using Vntk.
Documentation
CLI Utilities
1. Installation
Install the vntk CLI globally with:
npm install -g @vntk/cli
The CLI utilities preprocess text from files, Vietnamese text in particular. The matching command is listed at the end of each API usage section below. If you wish to improve the tool, please fork it and make it better.
2. Usage Example
Once the CLI is installed, open your Terminal (or Command Prompt on Windows) and type the command you need.
For instance, the following command opens a file and uses the Word Tokenizer to tokenize each line in it.
# Process a text file or a folder
$ vntk ws input.txt --output output.txt
# The output file will contain the tokenized lines.
API Usage
1. Tokenizer
A regex tokenizer built on regular expressions.
The tokenizer breaks text into arrays of tokens.
Example:
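var vntk = require('vntk');
var tokenizer = vntk.tokenizer();

console.log(tokenizer.tokenize('Giá khuyến mãi: 140.000đ / kg ==> giảm được 20%'));
// [ 'Giá', 'khuyến', 'mãi', ':', '140.000', 'đ', '/', 'kg', '==>', 'giảm', 'được', '20', '%' ]

console.log(tokenizer.stokenize('Giá khuyến mãi: 140.000đ / kg ==> giảm được 20%'));
// Giá khuyến mãi : 140.000 đ / kg ==> giảm được 20 %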
Command line: vntk tok <file_name.txt>
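The idea behind a regex tokenizer can be sketched in plain JavaScript without vntk. The pattern below is an assumption chosen for illustration, not vntk's actual rule set, and `regexTokenize` is a hypothetical helper name. It happens to reproduce the token list shown above for the sample sentence.

```javascript
// Illustrative pattern (an assumption, not vntk's rule set). Alternatives are tried in order:
//   1. numbers with "." or "," separators, e.g. 140.000
//   2. runs of letters (Unicode-aware, so Vietnamese diacritics stay attached)
//   3. runs of = - < > such as ==>
//   4. any other single non-space character
const TOKEN_PATTERN = /\d+(?:[.,]\d+)*|\p{L}+|[=\-<>]+|\S/gu;

// Hypothetical helper: split text into an array of tokens.
function regexTokenize(text) {
  // Normalize to NFC so accented letters are single code points.
  return text.normalize('NFC').match(TOKEN_PATTERN) || [];
}

const tokens = regexTokenize('Giá khuyến mãi: 140.000đ / kg ==> giảm được 20%');
console.log(tokens);
console.log(tokens.join(' '));
```

Ordering the alternatives from most to least specific keeps `140.000` in one piece instead of splitting it at the dot.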
2. Word Segmentation
Vietnamese Word Segmentation using Conditional Random Fields, called Word Tokenizer.
The Word Tokenizer breaks text into arrays of words.
var vntk = require('vntk');
var tokenizer = vntk.wordTokenizer();

console.log(tokenizer.tag('Chào mừng các bạn trẻ tới thành phố Hà Nội'));
// [ 'Chào mừng', 'các', 'bạn', 'trẻ', 'tới', 'thành phố', 'Hà Nội' ]
Load custom trained model:
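var vntk = require('vntk');
var tokenizer = vntk.wordTokenizer(new_model_path);

console.log(tokenizer.tag('Chào mừng các bạn trẻ tới thành phố Hà Nội', 'text'));
// Chào_mừng các bạn trẻ tới thành_phố Hà_Nội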
Command line: vntk ws <file_name.txt>
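The two output shapes shown above (an array of words, and text with underscores inside multi-syllable words) carry the same information. A minimal sketch of the conversion, using a hypothetical helper `toUnderscoreText` that is not part of vntk:

```javascript
// Hypothetical helper: spaces inside a word become "_",
// and words are separated by single spaces.
function toUnderscoreText(words) {
  return words.map((word) => word.trim().replace(/\s+/g, '_')).join(' ');
}

// Sample data copied from the array output above.
const words = ['Chào mừng', 'các', 'bạn', 'trẻ', 'tới', 'thành phố', 'Hà Nội'];
console.log(toUnderscoreText(words));
// Chào_mừng các bạn trẻ tới thành_phố Hà_Nội
```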
3. POS Tagging
Vietnamese Part of Speech Tagging using Conditional Random Fields, called posTag.
posTag labels each word of a sentence with its part of speech.
var vntk = require('vntk');
var pos_tag = vntk.posTag();

console.log(pos_tag.tag('Chợ thịt chó nổi tiếng ở TP Hồ Chí Minh bị truy quét'));
// [ [ 'Chợ', 'N' ],
//   [ 'thịt', 'N' ],
//   [ 'chó', 'N' ],
//   [ 'nổi tiếng', 'A' ],
//   [ 'ở', 'E' ],
//   [ 'TP', 'N' ],
//   [ 'Hồ', 'Np' ],
//   [ 'Chí', 'Np' ],
//   [ 'Minh', 'Np' ],
//   [ 'bị', 'V' ],
//   [ 'truy quét', 'V' ] ]
Load custom trained model:
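var vntk = require('vntk');
var pos_tag = vntk.posTag(new_model_path);

console.log(pos_tag.tag('Cán bộ xã và những chiêu "xin làm hộ nghèo" cười ra nước mắt', 'text'));
// [N Cán bộ] [N xã] [C và] [L những] [N chiêu] [CH "] [V xin] [V làm] [N hộ] [A nghèo] [CH "] [V cười] [V ra] [N nước mắt]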
Command line: vntk pos <file_name.txt>
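Once a sentence is tagged, the `[word, tag]` pairs are ordinary arrays and can be filtered or counted directly. The sketch below works on data copied from the sample output above. `wordsWithTags` and `tagCounts` are hypothetical helpers, not vntk functions.

```javascript
// Sample [word, tag] pairs copied from the example output above.
const tagged = [
  ['Chợ', 'N'], ['thịt', 'N'], ['chó', 'N'], ['nổi tiếng', 'A'], ['ở', 'E'],
  ['TP', 'N'], ['Hồ', 'Np'], ['Chí', 'Np'], ['Minh', 'Np'], ['bị', 'V'], ['truy quét', 'V'],
];

// Hypothetical helper: keep only the words carrying one of the requested tags.
function wordsWithTags(pairs, wantedTags) {
  const wanted = new Set(wantedTags);
  return pairs.filter(([, tag]) => wanted.has(tag)).map(([word]) => word);
}

// Hypothetical helper: count how often each tag occurs.
function tagCounts(pairs) {
  const counts = {};
  for (const [, tag] of pairs) counts[tag] = (counts[tag] || 0) + 1;
  return counts;
}

console.log(wordsWithTags(tagged, ['Np'])); // [ 'Hồ', 'Chí', 'Minh' ]
console.log(tagCounts(tagged));
```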
4. Chunking
Vietnamese Chunking using Conditional Random Fields.
Chunking labels the part of speech of each word and groups words into short phrases (such as noun phrases).
var vntk = require('vntk');
var chunking = vntk.chunking();

console.log(chunking.tag('Nhật ký SEA Games ngày 21/8: Ánh Viên thắng giòn giã ở vòng loại.'));
// [ [ 'Nhật ký', 'N', 'B-NP' ],
//   [ 'SEA', 'N', 'B-NP' ],
//   [ 'Games', 'Np', 'B-NP' ],
//   [ 'ngày', 'N', 'B-NP' ],
//   [ '21/8', 'M', 'B-NP' ],
//   [ ':', 'CH', 'O' ],
//   [ 'Ánh', 'Np', 'B-NP' ],
//   [ 'Viên', 'Np', 'I-NP' ],
//   [ 'thắng', 'V', 'B-VP' ],
//   [ 'giòn giã', 'N', 'B-NP' ],
//   [ 'ở', 'E', 'B-PP' ],
//   [ 'vòng', 'N', 'B-NP' ],
//   [ 'loại', 'N', 'B-NP' ],
//   [ '.', 'CH', 'O' ] ]
Load custom trained model:
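var vntk = require('vntk');
var chunking = vntk.chunking(new_model_path);

console.log(chunking.tag('Nhật ký SEA Games ngày 21/8: Ánh Viên thắng giòn giã ở vòng loại.', 'text'));
// [NP Nhật ký] [NP SEA] [NP Games] [NP ngày] [NP 21/8] : [NP Ánh Viên] [VP thắng] [NP giòn giã] [PP ở] [NP vòng] [NP loại] .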
Command line: vntk chunk <file_name.txt>
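The chunk labels follow the usual B/I/O convention: `B-XX` opens a chunk of type `XX`, `I-XX` continues it, and `O` marks a word outside any chunk. A minimal sketch of how the array output maps to the bracketed text form, using a hypothetical helper `chunksToText` (not vntk's own formatter):

```javascript
// Hypothetical helper: render [word, posTag, chunkTag] rows as bracketed text.
function chunksToText(rows) {
  const parts = [];
  let current = null; // { type, words } for the chunk being built

  const flush = () => {
    if (current) parts.push(`[${current.type} ${current.words.join(' ')}]`);
    current = null;
  };

  for (const [word, , chunkTag] of rows) {
    const [prefix, type] = chunkTag.split('-');
    if (prefix === 'I' && current && current.type === type) {
      current.words.push(word); // continue the open chunk
    } else if (prefix === 'B' || prefix === 'I') {
      flush(); // close the previous chunk and open a new one
      current = { type, words: [word] };
    } else {
      flush(); // "O": emit the word without brackets
      parts.push(word);
    }
  }
  flush();
  return parts.join(' ');
}

// Rows copied from the sample output above.
const chunkRows = [
  ['Nhật ký', 'N', 'B-NP'], ['SEA', 'N', 'B-NP'], ['Games', 'Np', 'B-NP'],
  ['ngày', 'N', 'B-NP'], ['21/8', 'M', 'B-NP'], [':', 'CH', 'O'],
  ['Ánh', 'Np', 'B-NP'], ['Viên', 'Np', 'I-NP'], ['thắng', 'V', 'B-VP'],
  ['giòn giã', 'N', 'B-NP'], ['ở', 'E', 'B-PP'], ['vòng', 'N', 'B-NP'],
  ['loại', 'N', 'B-NP'], ['.', 'CH', 'O'],
];
console.log(chunksToText(chunkRows));
```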
5. Named Entity Recognition
Vietnamese Named Entity Recognition (NER) using Conditional Random Fields.
In NER, the goal is to find named entities, which tend to be noun phrases (though not always).
var vntk = require('vntk');
var ner = vntk.ner();

console.log(ner.tag('Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump'));
// [ [ 'Chưa', 'R', 'O', 'O' ],
//   [ 'tiết lộ', 'V', 'B-VP', 'O' ],
//   [ 'lịch trình', 'V', 'B-VP', 'O' ],
//   [ 'tới', 'E', 'B-PP', 'O' ],
//   [ 'Việt Nam', 'Np', 'B-NP', 'B-LOC' ],
//   [ 'của', 'E', 'B-PP', 'O' ],
//   [ 'Tổng thống', 'N', 'B-NP', 'O' ],
//   [ 'Mỹ', 'Np', 'B-NP', 'B-LOC' ],
//   [ 'Donald', 'Np', 'B-NP', 'B-PER' ],
//   [ 'Trump', 'Np', 'B-NP', 'I-PER' ] ]
Load custom trained model:
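var vntk = require('vntk');
var ner = vntk.ner(new_model_path);

console.log(ner.tag('Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump', 'text'));
// Chưa tiết lộ lịch trình tới [LOC Việt Nam] của Tổng thống [LOC Mỹ] [PER Donald Trump]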
Command line: vntk ner <file_name.txt>
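Downstream code usually wants the entities themselves rather than per-word labels. The sketch below collects them from the last column of each row. It assumes the entity label is always the last element, as in the sample output above. `extractEntities` is a hypothetical helper, not a vntk function.

```javascript
// Hypothetical helper: collect { type, text } entities from tagged rows.
function extractEntities(rows) {
  const entities = [];
  let current = null; // entity being built: { type, words }

  const close = () => {
    if (current) entities.push({ type: current.type, text: current.words.join(' ') });
    current = null;
  };

  for (const row of rows) {
    const word = row[0];
    const [prefix, type] = row[row.length - 1].split('-');
    if (prefix === 'I' && current && current.type === type) {
      current.words.push(word); // continuation of the open entity
    } else {
      close();
      if (prefix !== 'O') current = { type, words: [word] };
    }
  }
  close();
  return entities;
}

// Rows copied from the sample output above.
const nerRows = [
  ['Chưa', 'R', 'O', 'O'], ['tiết lộ', 'V', 'B-VP', 'O'],
  ['lịch trình', 'V', 'B-VP', 'O'], ['tới', 'E', 'B-PP', 'O'],
  ['Việt Nam', 'Np', 'B-NP', 'B-LOC'], ['của', 'E', 'B-PP', 'O'],
  ['Tổng thống', 'N', 'B-NP', 'O'], ['Mỹ', 'Np', 'B-NP', 'B-LOC'],
  ['Donald', 'Np', 'B-NP', 'B-PER'], ['Trump', 'Np', 'B-NP', 'I-PER'],
];
console.log(extractEntities(nerRows));
```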
6. Utility
Dictionary
- Check whether a word exists in the dictionary
var vntk = require('vntk');
var dictionary = vntk.dictionary();

dictionary.has('chào');
// true
- Look up word definitions
var vntk = require('vntk');
var dictionary = vntk.dictionary();

var senses = dictionary.lookup('chào');
console.log(senses);
// Output:
// [ { example: 'chào thầy giáo ~ con chào mẹ',
//     sub_pos: 'Vt',
//     definition: 'tỏ thái độ kính trọng hoặc quan tâm đối với ai bằng lời nói hay cử chỉ, khi gặp nhau hoặc khi từ biệt',
//     pos: 'V' },
//   { example: 'đứng nghiêm làm lễ chào cờ',
//     sub_pos: 'Vu',
//     definition: 'tỏ thái độ kính cẩn trước cái gì thiêng liêng, cao quý',
//     pos: 'V' },
//   { example: 'chào hàng ~ lời chào cao hơn mâm cỗ (tng)',
//     sub_pos: 'Vu',
//     definition: 'mời ăn uống hoặc mua hàng',
//     pos: 'V' } ]
Clean html
var vntk = require('vntk');
var util = vntk.util();

util.clean_html('<span>Xin chào!!!</span>');
// Xin chào!!!
Command line: vntk clean <file_name1.txt>
7. TF-IDF
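Term Frequency–Inverse Document Frequency (tf-idf) measures how important a word (or group of words) is to a document relative to a corpus. See the following example.
var vntk = require('vntk');
var tfidf = new vntk.TfIdf();

// doc0 to doc3 hold the text of the four documents in the corpus
tfidf.addDocument(doc0);
tfidf.addDocument(doc1);
tfidf.addDocument(doc2);
tfidf.addDocument(doc3);

console.log('Bộ Công an --------------------------------');
tfidf.tfidfs('Bộ Công an', function (i, measure) {
  console.log('document #' + i + ' is ' + measure);
});

console.log('Tổng cục An ninh --------------------------');
tfidf.tfidfs('Tổng cục An ninh', function (i, measure) {
  console.log('document #' + i + ' is ' + measure);
});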
The above output:
Bộ Công an --------------------------------
document #0 is 6.553712897371581
document #1 is 3.7768564486857903
document #2 is 2.7768564486857903
document #3 is 0.7768564486857903
Tổng cục An ninh --------------------------
document #0 is 1.5537128973715806
document #1 is 0.7768564486857903
document #2 is 0.7768564486857903
document #3 is 9.242592351485516
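To make the scores less opaque, here is a minimal tf-idf sketch in plain JavaScript. It assumes the idf form `1 + ln(N / (1 + df))`, which is consistent with the value 0.7768564486857903 in the output above (a term occurring once in each of four documents gives `1 + ln(4/5)`). Whether vntk uses exactly this formula and tokenization is an assumption. The corpus is made up for illustration, and `tfidfScore` is a hypothetical helper.

```javascript
// Whitespace tokenizer, lowercased (an assumption made for this sketch).
function tokenizeWords(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// idf(term) = 1 + ln(N / (1 + df)), where df = number of documents containing the term.
function idf(term, docs) {
  const df = docs.filter((doc) => doc.includes(term)).length;
  return 1 + Math.log(docs.length / (1 + df));
}

// Score of a query against one document: sum of tf * idf over the query terms.
function tfidfScore(query, docIndex, docs) {
  const doc = docs[docIndex];
  return tokenizeWords(query).reduce((score, term) => {
    const tf = doc.filter((token) => token === term).length;
    return score + tf * idf(term, docs);
  }, 0);
}

// Made-up corpus: "bộ" occurs once in every document.
const docs = ['bộ công an', 'bộ y tế', 'bộ giáo dục', 'bộ tài chính'].map(tokenizeWords);

console.log(tfidfScore('bộ', 1, docs));         // ≈ 0.7769, i.e. 1 + ln(4/5)
console.log(tfidfScore('bộ công an', 0, docs)); // higher: "công" and "an" are rare terms
```

A term found in every document contributes little, while a term unique to one document dominates that document's score.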
8. Classifiers
Naive Bayes and fastText are the classifiers currently supported.
Bayes Classifier
The following examples use the BayesClassifier class:
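var vntk = require('vntk');
var classifier = new vntk.BayesClassifier();

// Add labeled training sentences, one call per sentence
classifier.addDocument(sentence, label);

classifier.train();

console.log(classifier.classify(question1));
// output: when
console.log(classifier.classify(question2));
// output: who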
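For readers new to the technique, the following is a minimal multinomial Naive Bayes sketch with add-one smoothing in plain JavaScript. It is an independent illustration, not vntk's `BayesClassifier` implementation. The class name `TinyNaiveBayes` and the training sentences are made up.

```javascript
class TinyNaiveBayes {
  constructor() {
    this.docCounts = new Map();  // label -> number of training documents
    this.wordCounts = new Map(); // label -> Map(word -> count)
    this.vocabulary = new Set();
    this.totalDocs = 0;
  }

  static tokenize(text) {
    // Split on anything that is not a letter or digit (Unicode-aware).
    return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
  }

  addDocument(text, label) {
    if (!this.wordCounts.has(label)) {
      this.wordCounts.set(label, new Map());
      this.docCounts.set(label, 0);
    }
    this.docCounts.set(label, this.docCounts.get(label) + 1);
    this.totalDocs += 1;
    const counts = this.wordCounts.get(label);
    for (const word of TinyNaiveBayes.tokenize(text)) {
      counts.set(word, (counts.get(word) || 0) + 1);
      this.vocabulary.add(word);
    }
  }

  classify(text) {
    const words = TinyNaiveBayes.tokenize(text);
    let best = null;
    let bestScore = -Infinity;
    for (const [label, counts] of this.wordCounts) {
      let total = 0;
      for (const c of counts.values()) total += c;
      // log prior + sum of log likelihoods with add-one (Laplace) smoothing
      let score = Math.log(this.docCounts.get(label) / this.totalDocs);
      for (const word of words) {
        const count = counts.get(word) || 0;
        score += Math.log((count + 1) / (total + this.vocabulary.size));
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}

const nb = new TinyNaiveBayes();
nb.addDocument('when did the battle end', 'when');
nb.addDocument('when was the treaty signed', 'when');
nb.addDocument('who led the army', 'who');
nb.addDocument('who signed the treaty', 'who');

console.log(nb.classify('when did the war start')); // when
console.log(nb.classify('who was the general'));    // who
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing keeps unseen words from zeroing out a class.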
FastText Classifier
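Following the fasttext.cc tutorial, this simple classifier runs a prediction model trained on cooking questions from Stack Exchange:
const path = require('path');
const vntk = require('vntk');

const model = path.resolve(__dirname, '<model_file.bin>');
const classifier = new vntk.FastTextClassifier(model);

classifier.predict('<question text>', 5, (err, res) => {
  if (err) {
    console.error(err);
  } else {
    console.log(res);
  }
});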
9. Language identification
VNTK Langid can identify 176 languages from text samples and return confidence scores for each (see the list of ISO codes below). This model was trained by fastText on data from Wikipedia, Tatoeba and SETimes, used under CC-BY-SA.
API usage example:
- langid.detect([input])
- langid.getLanguages([input, num, callback])
- langid.langids - list of supported languages
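const langid = require('vntk').langid();

// returns the most accurate language detected
langid.detect('Xin chào các bạn').then(console.log);

// returns the list of detectable languages
langid.getLanguages('Xin chào các bạn', 5).then(console.log);

// returns the list of supported languages
console.log(langid.langids);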
Load custom trained model:
var vntk = require('vntk');
var langid = vntk.langid(new_model_path);
List of supported languages
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh
10. CRFSuite
For quick access to the CRFSuite that ships with vntk, use the following API.
var crfsuite = require('vntk').crfsuite()
Then create a Tagger or Trainer:
var crfsuite = require('vntk').crfsuite();
var tagger = new crfsuite.Tagger();
var trainer = new crfsuite.Trainer();
For detailed documentation, see the CRFSuite documentation.
NLP API Server
Follow these steps to quickly serve an NLP API server using vntk:
# Clone the repository
git clone https://github.com/vunb/vntk
# Move to the source code folder
cd vntk
# Install dependencies
npm install
# Run the NLP API server
npm run server
# Copy and paste the following link into your browser to see the result in action
# http://localhost:3000/api/tok/Phó Thủ tướng Vương Đình Huệ yêu cầu điều chỉnh tên gọi “trạm thu giá” BOT
For details, see ./server
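When calling the server from code rather than a browser, the text in the URL path has to be percent-encoded, since it contains spaces and non-ASCII characters. A minimal sketch, taking the host, port and `/api/tok/` route from the example link above. `buildTokenizeUrl` is a hypothetical helper, and the shape of the response is not shown here because it is not documented in this section.

```javascript
// Hypothetical helper: build the request URL for a given text.
// The base URL and /api/tok/ route come from the example link above.
function buildTokenizeUrl(text, baseUrl = 'http://localhost:3000') {
  return `${baseUrl}/api/tok/${encodeURIComponent(text)}`;
}

const url = buildTokenizeUrl('Phó Thủ tướng Vương Đình Huệ');
console.log(url);

// With the server running (Node 18+ provides a global fetch):
// fetch(url).then((res) => res.text()).then(console.log);
```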
Contributing
Pull requests and stars are highly welcome.
For bugs and feature requests, please create an issue.
LICENSE
MIT.