The `tokenizer` function takes a string as input and returns an object with the following properties:
- `count`: the number of tokens in the input string
- `characters`: the number of characters in the input string
- `text`: the original input string
- `tokens`: an array of objects, where each object represents a token and its position in the input string. Each token object has the following properties (a type sketch follows this list):
  - `token`: the token string
  - `start`: the starting index of the token in the input string
  - `end`: the ending index (inclusive) of the token in the input string
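
Sketched as TypeScript interfaces (illustrative only; the module doesn't necessarily export these types), the return shape looks like this:

```ts
// Illustrative types for the tokenizer's return value, inferred from the
// property list above; the module may not export them under these names.
interface TokenSpan {
  token: string; // the token string, e.g. 'Ġis'
  start: number; // index of the token's first character in the input
  end: number;   // index of the token's last character (inclusive)
}

interface TokenizerResult {
  count: number;      // number of tokens
  characters: number; // number of characters in the input string
  text: string;       // the original input string
  tokens: TokenSpan[];
}
```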
The `tokenizer` function uses the `js-tiktoken` library to encode the input string into tokens with the GPT-2 encoding scheme. It then decodes the tokens back into strings, maps each token to its position in the input string using the `mapTokensToChunks` function, and returns the resulting object.
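
For intuition, here is a minimal sketch of how such a function could be assembled. It assumes `js-tiktoken`'s `getEncoding` helper; the `mapTokensToChunks` below is a simplified stand-in for the module's own position-mapping logic, not its actual implementation:

```ts
import { getEncoding } from 'js-tiktoken';

// GPT-2 encoding, loaded once at module scope.
const enc = getEncoding('gpt2');

// Simplified stand-in: decode each token id individually and locate its
// visible text in the input, tracking a cursor so that repeated words
// resolve to the correct occurrence.
function mapTokensToChunks(text: string, tokenIds: number[]) {
  let cursor = 0;
  return tokenIds.map((id) => {
    const piece = enc.decode([id]);          // decoded text, e.g. ' is'
    const visible = piece.trimStart();       // index only the non-space characters
    const start = text.indexOf(visible, cursor);
    const end = start + visible.length - 1;  // inclusive end index
    cursor = end + 1;
    // GPT-2's byte-level alphabet renders a leading space as 'Ġ';
    // replacing spaces approximates that display form.
    return { token: piece.replace(/ /g, 'Ġ'), start, end };
  });
}

export async function tokenizer(text: string) {
  const tokenIds = enc.encode(text);
  return {
    count: tokenIds.length,
    characters: text.length,
    text,
    tokens: mapTokensToChunks(text, tokenIds),
  };
}
```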
To use this module, import the `tokenizer` function and call it with a string argument. Here's an example:
```ts
import { tokenizer } from 'your-module-name';

const input = 'This is a sample input string.';
const result = await tokenizer(input);
console.log(result);
/*
{
  count: 7,
  characters: 30,
  text: 'This is a sample input string.',
  tokens: [
    { token: 'This', start: 0, end: 3 },
    { token: 'Ġis', start: 5, end: 6 },
    { token: 'Ġa', start: 8, end: 8 },
    { token: 'Ġsample', start: 10, end: 15 },
    { token: 'Ġinput', start: 17, end: 21 },
    { token: 'Ġstring', start: 23, end: 28 },
    { token: '.', start: 29, end: 29 }
  ]
}
*/
```
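
Note that tokens prefixed with `Ġ` reflect GPT-2's byte-level encoding, in which `Ġ` stands for a leading space; `start` and `end` point at the token's visible characters, so the leading space itself is not included in the span.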