tokenizers
    0.8.0 • Public • Published




    NodeJS implementation of today's most used tokenizers, with a focus on performance and versatility. These are bindings over the Rust implementation; if you are interested in the high-level design, see the documentation of the Rust implementation.

    Main features

    • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
    • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
    • Easy to use, but also extremely versatile.
    • Designed for research and production.
    • Normalization comes with alignment tracking: it is always possible to get the part of the original sentence that corresponds to a given token.
    • Does all the pre-processing: truncates, pads, and adds the special tokens your model needs.
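
To make the alignment-tracking feature above more concrete, here is a minimal, library-free sketch. It is a toy, not the package's implementation: `ToyToken` and `toyTokenize` are hypothetical names, and the "normalization" is just lowercasing. The point is only that each token carries `[start, end)` offsets into the original input, so the original text can be recovered even after normalization.

```typescript
// Toy illustration only: NOT the library's implementation.
// ToyToken and toyTokenize are hypothetical names used for this sketch.
interface ToyToken {
  value: string;             // normalized (lowercased) token text
  offsets: [number, number]; // [start, end) positions in the ORIGINAL input
}

function toyTokenize(input: string): ToyToken[] {
  const tokens: ToyToken[] = [];
  const pattern = /\S+/g; // runs of non-whitespace characters
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(input)) !== null) {
    const start = match.index;
    const end = start + match[0].length;
    // Offsets come from the original text, before normalization is applied.
    tokens.push({ value: match[0].toLowerCase(), offsets: [start, end] });
  }
  return tokens;
}

const toyInput = "Who is  John?";
const toyTokens = toyTokenize(toyInput);

// Map every normalized token back to the exact slice of the original input.
const toyRecovered = toyTokens.map((t) => toyInput.slice(t.offsets[0], t.offsets[1]));

console.log(toyTokens.map((t) => t.value));
console.log(toyRecovered);
```

The real tokenizers expose the same idea through the `offsets` field of an encoding, as printed in the basic example.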

    Installation

    npm install tokenizers@latest

    Basic example

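    // Top-level `await` requires running this file as an ES module.
    import { BertWordPieceTokenizer } from "tokenizers";
    
    // Load a BERT WordPiece tokenizer from an existing vocabulary file.
    const wordPieceTokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "./vocab.txt" });
    // Encode a pair of sequences (for example, a question and its context).
    const wpEncoded = await wordPieceTokenizer.encode("Who is John?", "John is a teacher");
    
    console.log(wpEncoded.length);            // number of tokens
    console.log(wpEncoded.tokens);            // token strings
    console.log(wpEncoded.ids);               // vocabulary ids of the tokens
    console.log(wpEncoded.attentionMask);     // marks real tokens versus padding
    console.log(wpEncoded.offsets);           // position of each token in the original text
    console.log(wpEncoded.overflowing);       // encodings cut off by truncation, if any
    console.log(wpEncoded.specialTokensMask); // marks special tokens
    console.log(wpEncoded.typeIds);           // which sequence of the pair each token belongs to
    console.log(wpEncoded.wordIndexes);       // index of the word each token comes from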
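
Fields such as `attentionMask` and `overflowing` in the example above are easiest to read in terms of truncation and padding. The following library-free sketch illustrates the idea under simplifying assumptions: `ToyEncoding`, `toyRepeat`, and `toyTruncateAndPad` are hypothetical names, the ids are arbitrary numbers, and the overflow is kept as one flat list rather than as separate encodings. It is not how the package implements these steps.

```typescript
// Toy illustration only: names and behaviour are simplified assumptions,
// not the library's implementation.
interface ToyEncoding {
  ids: number[];           // token ids, truncated and padded to maxLength
  attentionMask: number[]; // 1 = real token, 0 = padding
  overflowing: number[];   // ids removed by truncation (flattened for simplicity)
}

// Build an array containing `value` repeated `count` times.
function toyRepeat(value: number, count: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < count; i++) {
    out.push(value);
  }
  return out;
}

// Assumes maxLength is a non-negative integer.
function toyTruncateAndPad(ids: number[], maxLength: number, padId: number = 0): ToyEncoding {
  const kept = ids.slice(0, maxLength);        // truncate to at most maxLength ids
  const overflowing = ids.slice(maxLength);    // whatever did not fit
  const padCount = maxLength - kept.length;    // how many padding slots are needed
  return {
    ids: kept.concat(toyRepeat(padId, padCount)),
    attentionMask: toyRepeat(1, kept.length).concat(toyRepeat(0, padCount)),
    overflowing,
  };
}

// A short input is padded; a long input is truncated.
const toyShort = toyTruncateAndPad([7, 8, 9], 5);
const toyLong = toyTruncateAndPad([7, 8, 9, 10, 11, 12], 4);

console.log(toyShort);
console.log(toyLong);
```

In this sketch a model would ignore every position whose attention mask is 0, and the `overflowing` ids could be encoded separately if the full text is needed.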

    Provided Tokenizers

    • BPETokenizer: The original BPE
    • ByteLevelBPETokenizer: The byte level version of the BPE
    • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
    • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece
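
For readers unfamiliar with WordPiece, the sketch below shows the greedy longest-match-first lookup commonly used to split a single word into sub-word pieces. It is a toy under stated assumptions: `toyWordPiece` and `toyVocab` are hypothetical, the vocabulary is made up, and continuation pieces are marked with the `##` prefix as in BERT vocabularies. It does not reproduce the package's Rust implementation.

```typescript
// Toy illustration only: a greedy longest-match-first WordPiece lookup.
// toyWordPiece and toyVocab are hypothetical names, not part of the library.
function toyWordPiece(word: string, vocab: ReadonlyArray<string>, unk: string = "[UNK]"): string[] {
  const pieces: string[] = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let found: string | null = null;
    // Try the longest remaining substring first, then shrink from the right.
    while (end > start) {
      const prefix = start > 0 ? "##" : ""; // continuation pieces carry "##"
      const candidate = prefix + word.slice(start, end);
      if (vocab.indexOf(candidate) !== -1) {
        found = candidate;
        break;
      }
      end -= 1;
    }
    // If no piece matches at this position, the whole word is unknown.
    if (found === null) {
      return [unk];
    }
    pieces.push(found);
    start = end;
  }
  return pieces;
}

const toyVocab = ["teach", "##er", "##ing", "john"];

const toyKnown = toyWordPiece("teacher", toyVocab);
const toyUnknown = toyWordPiece("teachers", toyVocab);

console.log(toyKnown);
console.log(toyUnknown);
```

With this made-up vocabulary, "teacher" splits into two pieces, while "teachers" falls back to the unknown token because no piece covers the trailing "s".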

    License

    Apache License 2.0

    Collaborators

    • julien-c
    • n1t0
    • pierric