Interface Encoding
-
Field Summary
Fields
| Modifier and Type | Field | Description |
| static final String | VERY_LARGE_TOKENIZER_BYTE_THRESHOLD_KEY | Name of the environment variable key to control when JTokkit should switch to a different tokenizer. |
Method Summary
| Modifier and Type | Method | Description |
| int | countTokens(String text) | Encodes the given text into a list of token ids and returns the number of tokens. |
| int | countTokensOrdinary(String text) | Encodes the given text into a list of token ids and returns the number of tokens. |
| String | decode(IntArrayList tokens) | Decodes the given list of token ids into a text. |
| byte[] | decodeBytes(IntArrayList tokens) | Decodes the given list of token ids into a byte array. |
| IntArrayList | encode(String text) | Encodes the given text into a list of token ids. |
| EncodingResult | encode(String text, int maxTokens) | Encodes the given text into a list of token ids. |
| IntArrayList | encodeOrdinary(String text) | Encodes the given text into a list of token ids, ignoring special tokens. |
| EncodingResult | encodeOrdinary(String text, int maxTokens) | Encodes the given text into a list of token ids, ignoring special tokens. |
| String | getName() | Returns the name of this encoding. |
-
Field Details
-
VERY_LARGE_TOKENIZER_BYTE_THRESHOLD_KEY
Name of the environment variable key to control when JTokkit should switch to a different tokenizer. For all inputs below the given threshold, JTokkit uses a tokenizer that scales in quadratic time but is faster for small inputs. For larger inputs, a linearly scaling tokenizer is used. By default, when this environment variable is not set, the threshold is set according to our benchmarks to be near-optimal for a wide variety of use cases. If you have a very specialized input format, you may want to experiment and benchmark with different input size thresholds.
-
-
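The threshold behaviour described for VERY_LARGE_TOKENIZER_BYTE_THRESHOLD_KEY can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the class ThresholdSketch, the methods resolveThreshold and chooseStrategy, the default of 500 bytes, the placeholder variable name EXAMPLE_THRESHOLD, and the strategy labels are hypothetical and do not reflect JTokkit's actual implementation or default value.

```java
final class ThresholdSketch {
    // Hypothetical default; JTokkit's real default comes from its own benchmarks.
    static final int ASSUMED_DEFAULT_THRESHOLD = 500;

    // Parses the raw environment value and falls back to the default
    // when the variable is unset, blank, non-numeric, or not positive.
    static int resolveThreshold(String rawValue, int defaultThreshold) {
        if (rawValue == null || rawValue.trim().isEmpty()) {
            return defaultThreshold;
        }
        try {
            int parsed = Integer.parseInt(rawValue.trim());
            return parsed > 0 ? parsed : defaultThreshold;
        } catch (NumberFormatException e) {
            return defaultThreshold;
        }
    }

    // Inputs below the threshold use the strategy that is fast for small inputs;
    // all other inputs use the linearly scaling one.
    static String chooseStrategy(int inputByteLength, int threshold) {
        return inputByteLength < threshold ? "quadratic-small-input" : "linear-large-input";
    }

    public static void main(String[] args) {
        // EXAMPLE_THRESHOLD is a placeholder, not the value of the JTokkit constant.
        int threshold = resolveThreshold(System.getenv("EXAMPLE_THRESHOLD"), ASSUMED_DEFAULT_THRESHOLD);
        System.out.println(chooseStrategy(120, threshold));
    }
}
```

Benchmarking a specialized input format would then amount to running the same workload with several threshold values and comparing the timings.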
Method Details
-
encode
Encodes the given text into a list of token ids.
Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. There is no support for parsing special tokens in a text, so if the text contains special tokens, this method throws an UnsupportedOperationException.
If you want to encode special tokens as ordinary text, use encodeOrdinary(String).
    Encoding encoding = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
    encoding.encode("hello world"); // returns [15339, 1917]
    encoding.encode("hello <|endoftext|> world"); // throws an UnsupportedOperationException
Parameters:
    text - the text to encode
Returns:
    the list of token ids
Throws:
    UnsupportedOperationException - if the text contains special tokens, which are not supported for now
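The contract above (reject text that contains special tokens instead of parsing them) can be sketched with the standard library alone. This is an illustration under stated assumptions, not JTokkit's implementation: the class SpecialTokenGuardSketch, its methods, and the one-element token list are hypothetical, and a real encoding defines its own set of special tokens.

```java
import java.util.Arrays;
import java.util.List;

final class SpecialTokenGuardSketch {
    // Assumed special tokens for this illustration only.
    static final List<String> ASSUMED_SPECIAL_TOKENS = Arrays.asList("<|endoftext|>");

    // Returns true if any assumed special token occurs anywhere in the text.
    static boolean containsSpecialToken(String text) {
        for (String token : ASSUMED_SPECIAL_TOKENS) {
            if (text.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the documented behaviour: fail fast rather than parse special tokens.
    static void requireNoSpecialTokens(String text) {
        if (containsSpecialToken(text)) {
            throw new UnsupportedOperationException(
                    "Encoding special tokens is not supported: " + text);
        }
    }

    public static void main(String[] args) {
        requireNoSpecialTokens("hello world"); // passes silently
        try {
            requireNoSpecialTokens("hello <|endoftext|> world");
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A caller that wants the text encoded regardless can catch the exception and fall back to the ordinary variant, which treats the same characters as plain text.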
-
encode
Encodes the given text into a list of token ids.
Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. There is no support for parsing special tokens in a text, so if the text contains special tokens, this method throws an UnsupportedOperationException.
If you want to encode special tokens as ordinary text, use encodeOrdinary(String, int).
This method truncates the list of token ids if the number of tokens exceeds the given maxTokens parameter. Note that it tries to keep together the tokens of any character that is encoded into multiple tokens. For example, if the text contains a character that is encoded into 3 tokens, and the maxTokens parameter would cut off the last token of that character, the first two tokens of the character are dropped as well. Therefore, the actual number of tokens may be less than the given maxTokens parameter.
    Encoding encoding = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
    encoding.encode("hello world", 100); // returns an EncodingResult containing [15339, 1917]
    encoding.encode("hello <|endoftext|> world", 100); // throws an UnsupportedOperationException
Parameters:
    text - the text to encode
    maxTokens - the maximum number of tokens to encode
Returns:
    the EncodingResult containing a list of token ids and whether the tokens were truncated due to the maxTokens parameter
Throws:
    UnsupportedOperationException - if the text contains special tokens, which are not supported for now
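The truncation rule (never keep only part of a character that spans several tokens) is easier to see on a toy tokenizer. The sketch below assumes a deliberately simplified scheme in which every UTF-8 byte is one token; that scheme, the class TruncationSketch, its nested Result type, and the method encodeWithLimit are hypothetical and only mimic the documented behaviour, not JTokkit's byte pair encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TruncationSketch {
    // Hypothetical stand-in for a result carrying tokens plus a truncation flag.
    static final class Result {
        final List<Integer> tokens;
        final boolean truncated;

        Result(List<Integer> tokens, boolean truncated) {
            this.tokens = tokens;
            this.truncated = truncated;
        }
    }

    // Toy tokenizer: one token per UTF-8 byte. A character is added only if
    // all of its tokens fit within maxTokens; otherwise encoding stops there.
    static Result encodeWithLimit(String text, int maxTokens) {
        List<Integer> tokens = new ArrayList<>();
        int index = 0;
        while (index < text.length()) {
            int codePoint = text.codePointAt(index);
            byte[] bytes = new String(Character.toChars(codePoint))
                    .getBytes(StandardCharsets.UTF_8);
            if (tokens.size() + bytes.length > maxTokens) {
                return new Result(tokens, true); // drop the whole character
            }
            for (byte b : bytes) {
                tokens.add(b & 0xFF);
            }
            index += Character.charCount(codePoint);
        }
        return new Result(tokens, false);
    }

    public static void main(String[] args) {
        // The euro sign (U+20AC) takes 3 bytes, so 3 tokens in this toy scheme.
        Result result = encodeWithLimit("ab\u20AC", 4);
        System.out.println(result.tokens + " truncated=" + result.truncated);
        // prints "[97, 98] truncated=true"
    }
}
```

With a limit of 4, two of the euro sign's three tokens would fit, yet the result holds only 2 tokens. This is the case the description refers to when it says the actual number of tokens may be less than maxTokens.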
-
encodeOrdinary
Encodes the given text into a list of token ids, ignoring special tokens.
This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.
    Encoding encoding = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
    encoding.encodeOrdinary("hello world"); // returns [15339, 1917]
    encoding.encodeOrdinary("hello <|endoftext|> world"); // returns [15339, 83739, 8862, 728, 428, 91, 29, 1917]
Parameters:
    text - the text to encode
Returns:
    the list of token ids
-
encodeOrdinary
Encodes the given text into a list of token ids, ignoring special tokens.
This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.
It truncates the list of token ids if the number of tokens exceeds the given maxTokens parameter. Note that it tries to keep together the tokens of any character that is encoded into multiple tokens. For example, if the text contains a character that is encoded into 3 tokens, and the maxTokens parameter would cut off the last token of that character, the first two tokens of the character are dropped as well. Therefore, the actual number of tokens may be less than the given maxTokens parameter.
    Encoding encoding = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
    encoding.encodeOrdinary("hello world", 100); // returns an EncodingResult containing [15339, 1917]
    encoding.encodeOrdinary("hello <|endoftext|> world", 100); // returns an EncodingResult containing [15339, 83739, 8862, 728, 428, 91, 29, 1917]
Parameters:
    text - the text to encode
    maxTokens - the maximum number of tokens to encode
Returns:
    the EncodingResult containing a list of token ids and whether the tokens were truncated due to the maxTokens parameter
-
countTokens
Encodes the given text into a list of token ids and returns the number of tokens. It is more performant than encode(String).
Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. There is no support for parsing special tokens in a text, so if the text contains special tokens, this method throws an UnsupportedOperationException.
If you want to encode special tokens as ordinary text, use countTokensOrdinary(String).
    Encoding encoding = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
    encoding.countTokens("hello world"); // returns 2
    encoding.countTokens("hello <|endoftext|> world"); // throws an UnsupportedOperationException
Parameters:
    text - the text to count tokens for
Returns:
    the number of tokens
Throws:
    UnsupportedOperationException - if the text contains special tokens, which are not supported for now
-
countTokensOrdinary
Encodes the given text into a list of token ids and returns the number of tokens. It is more performant than encodeOrdinary(String).
This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.
    Encoding encoding = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
    encoding.countTokensOrdinary("hello world"); // returns 2
    encoding.countTokensOrdinary("hello <|endoftext|> world"); // returns 8
Parameters:
    text - the text to count tokens for
Returns:
    the number of tokens
-
decode
Decodes the given list of token ids into a text.
    Encoding encoding = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
    IntArrayList tokens = new IntArrayList();
    tokens.add(15339);
    tokens.add(1917);
    encoding.decode(tokens); // returns "hello world"
    tokens.add(Integer.MAX_VALUE);
    encoding.decode(tokens); // throws an IllegalArgumentException
Parameters:
    tokens - the list of token ids
Returns:
    the decoded text
Throws:
    IllegalArgumentException - if the list contains invalid token ids
-
decodeBytes
Decodes the given list of token ids into a byte array.
    Encoding encoding = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);
    IntArrayList tokens = new IntArrayList();
    tokens.add(15339);
    tokens.add(1917);
    encoding.decodeBytes(tokens); // returns [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
    tokens.add(Integer.MAX_VALUE);
    encoding.decodeBytes(tokens); // throws an IllegalArgumentException
Parameters:
    tokens - the list of token ids
Returns:
    the decoded byte array
Throws:
    IllegalArgumentException - if the list contains invalid token ids
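The byte array shown for decodeBytes relates to the text returned by decode in a simple way, which a standard-library sketch can make concrete. It assumes the decoded bytes are UTF-8 text; the class DecodedBytesSketch and its utf8 helper are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class DecodedBytesSketch {
    // Interprets a decoded byte array as UTF-8 text (an assumption of this sketch).
    static String utf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The byte values from the decodeBytes example.
        byte[] decoded = {104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100};
        System.out.println(utf8(decoded)); // prints "hello world"
    }
}
```

Working with the raw bytes can be useful when a token sequence ends in the middle of a multi-byte character, where converting to a String would otherwise yield replacement characters.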
-
getName
String getName()
Returns the name of this encoding. This is the name which is used to identify the encoding and must be unique for registration in the EncodingRegistry.
Returns:
    the name of this encoding
-