Interface Encoding


public interface Encoding
  • Field Details

    • VERY_LARGE_TOKENIZER_BYTE_THRESHOLD_KEY

      static final String VERY_LARGE_TOKENIZER_BYTE_THRESHOLD_KEY
      Name of the environment variable that controls when JTokkit switches to a different tokenizer. For inputs below the given threshold, JTokkit uses a tokenizer that scales quadratically with input size but is faster for small inputs. For larger inputs, it uses a tokenizer that scales linearly. By default, when this environment variable is not set, the threshold is set to a value that our benchmarks found to be near-optimal for a wide variety of use cases. If you have a very specialized input format, you may want to benchmark different input size thresholds.
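      The threshold switch described above can be pictured with a minimal sketch. It is not JTokkit's implementation: the environment variable name, the default value of 500, the fallback on malformed values, and the choice to measure input size in UTF-8 bytes are all assumptions made for illustration. In real code the variable name would come from Encoding.VERY_LARGE_TOKENIZER_BYTE_THRESHOLD_KEY.

```java
import java.nio.charset.StandardCharsets;

class ThresholdSketch {
    // Hypothetical default; the actual JTokkit default is not stated here.
    static final int HYPOTHETICAL_DEFAULT_THRESHOLD = 500;

    /** Parses the raw environment value; falls back to the default when unset or malformed (sketch choice). */
    static int parseThreshold(String rawValue, int defaultValue) {
        if (rawValue == null || rawValue.trim().isEmpty()) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(rawValue.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    /** Inputs below the threshold use the quadratic strategy, all others the linear one. */
    static String chooseStrategy(String text, int threshold) {
        // Assumption: input size is measured in UTF-8 bytes.
        int byteLength = text.getBytes(StandardCharsets.UTF_8).length;
        return byteLength < threshold ? "quadratic" : "linear";
    }

    public static void main(String[] args) {
        // Hypothetical variable name, used only so this sketch runs on its own.
        String raw = System.getenv("HYPOTHETICAL_THRESHOLD_KEY");
        int threshold = parseThreshold(raw, HYPOTHETICAL_DEFAULT_THRESHOLD);
        System.out.println(chooseStrategy("hello world", threshold));
    }
}
```

      When experimenting, it is probably most useful to benchmark a few thresholds against inputs that are representative of your own data.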
  • Method Details

    • encode

      IntArrayList encode(String text)
      Encodes the given text into a list of token ids.

      Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. There is no support for parsing special tokens in a text, so if the text contains special tokens, this method will throw an UnsupportedOperationException.

      If you want to encode special tokens as ordinary text, use encodeOrdinary(String).

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.encode("hello world");
       // returns [15339, 1917]
      
       encoding.encode("hello <|endoftext|> world");
       // raises an UnsupportedOperationException
       
      Parameters:
      text - the text to encode
      Returns:
      the list of token ids
      Throws:
      UnsupportedOperationException - if the text contains special tokens which are not supported for now
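      One way to combine encode(String) and encodeOrdinary(String) is to try the strict variant first and fall back to the ordinary one when an UnsupportedOperationException is thrown. The sketch below shows only the pattern. The helper encodeWithFallback and the two stand-in encoders are hypothetical and exist so the example runs without JTokkit; the fake encoders emit one made-up id per word and are not real tokenizers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class EncodeFallbackSketch {
    /** Tries the strict encoder first; falls back to the ordinary one if special tokens are rejected. */
    static <T> T encodeWithFallback(String text,
                                    Function<String, T> strict,
                                    Function<String, T> ordinary) {
        try {
            return strict.apply(text);
        } catch (UnsupportedOperationException e) {
            return ordinary.apply(text);
        }
    }

    /** Stand-in for a strict encoder: rejects text that contains the special token. */
    static List<Integer> fakeStrict(String text) {
        if (text.contains("<|endoftext|>")) {
            throw new UnsupportedOperationException("special token in input");
        }
        return fakeOrdinary(text);
    }

    /** Stand-in for an ordinary encoder: one fake id per word (the word length). */
    static List<Integer> fakeOrdinary(String text) {
        List<Integer> ids = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            ids.add(word.length());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(encodeWithFallback("hello <|endoftext|> world",
                EncodeFallbackSketch::fakeStrict, EncodeFallbackSketch::fakeOrdinary));
        // prints [5, 13, 5]
    }
}
```

      With a real Encoding, the same helper should accept method references when the result is assigned to a typed variable, for example IntArrayList ids = encodeWithFallback(text, encoding::encode, encoding::encodeOrdinary).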
    • encode

      EncodingResult encode(String text, int maxTokens)
      Encodes the given text into a list of token ids.

      Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. There is no support for parsing special tokens in a text, so if the text contains special tokens, this method will throw an UnsupportedOperationException.

      If you want to encode special tokens as ordinary text, use encodeOrdinary(String, int).

      This method truncates the list of token ids if the number of tokens exceeds the given maxTokens parameter. It tries to keep the tokens of a single character together: if the text contains a character that is encoded into 3 tokens and the maxTokens limit would cut off the last of them, the first two tokens of that character are dropped as well. The actual number of tokens may therefore be less than the given maxTokens parameter.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.encode("hello world", 100);
       // returns [15339, 1917]
      
       encoding.encode("hello <|endoftext|> world", 100);
       // raises an UnsupportedOperationException
       
      Parameters:
      text - the text to encode
      maxTokens - the maximum number of tokens to encode
      Returns:
      the EncodingResult containing a list of token ids and whether the tokens were truncated due to the maxTokens parameter
      Throws:
      UnsupportedOperationException - if the text contains special tokens which are not supported for now
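      The truncation rule for maxTokens can be illustrated with a toy model. This is not how JTokkit implements it: the sketch assumes the token ids have already been grouped by the character they belong to, and the class TruncationSketch, its Result type, and the ids 101, 102, 103 are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TruncationSketch {
    /** Holds the kept token ids and whether anything was cut off. */
    static final class Result {
        final List<Integer> tokens;
        final boolean truncated;

        Result(List<Integer> tokens, boolean truncated) {
            this.tokens = tokens;
            this.truncated = truncated;
        }
    }

    /** Keeps whole groups only: a group that does not fit entirely is dropped along with everything after it. */
    static Result truncate(List<int[]> tokensPerCharacter, int maxTokens) {
        List<Integer> kept = new ArrayList<>();
        for (int[] group : tokensPerCharacter) {
            if (kept.size() + group.length > maxTokens) {
                return new Result(kept, true);
            }
            for (int token : group) {
                kept.add(token);
            }
        }
        return new Result(kept, false);
    }

    public static void main(String[] args) {
        // The middle group stands for one character encoded into 3 tokens.
        List<int[]> groups = Arrays.asList(
                new int[]{15339}, new int[]{101, 102, 103}, new int[]{1917});
        Result r = truncate(groups, 3);
        System.out.println(r.tokens + " " + r.truncated);
        // prints [15339] true
    }
}
```

      With maxTokens set to 3, only one token is kept, because the three-token character would have been split at the limit.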
    • encodeOrdinary

      IntArrayList encodeOrdinary(String text)
      Encodes the given text into a list of token ids, ignoring special tokens.

      This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.encodeOrdinary("hello world");
       // returns [15339, 1917]
      
       encoding.encodeOrdinary("hello <|endoftext|> world");
       // returns [15339, 83739, 8862, 728, 428, 91, 29, 1917]
       
      Parameters:
      text - the text to encode
      Returns:
      the list of token ids
    • encodeOrdinary

      EncodingResult encodeOrdinary(String text, int maxTokens)
      Encodes the given text into a list of token ids, ignoring special tokens.

      This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.

      It truncates the list of token ids if the number of tokens exceeds the given maxTokens parameter. It tries to keep the tokens of a single character together: if the text contains a character that is encoded into 3 tokens and the maxTokens limit would cut off the last of them, the first two tokens of that character are dropped as well. The actual number of tokens may therefore be less than the given maxTokens parameter.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.encodeOrdinary("hello world", 100);
       // returns [15339, 1917]
      
       encoding.encodeOrdinary("hello <|endoftext|> world", 100);
       // returns [15339, 83739, 8862, 728, 428, 91, 29, 1917]
       
      Parameters:
      text - the text to encode
      maxTokens - the maximum number of tokens to encode
      Returns:
      the EncodingResult containing a list of token ids and whether the tokens were truncated due to the maxTokens parameter
    • countTokens

      int countTokens(String text)
      Encodes the given text into a list of token ids and returns the number of tokens. It is more efficient than encode(String).

      Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. There is no support for parsing special tokens in a text, so if the text contains special tokens, this method will throw an UnsupportedOperationException.

      If you want to encode special tokens as ordinary text, use countTokensOrdinary(String).

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.countTokens("hello world");
       // returns 2
      
       encoding.countTokens("hello <|endoftext|> world");
       // raises an UnsupportedOperationException
       
      Parameters:
      text - the text to count tokens for
      Returns:
      the number of tokens
      Throws:
      UnsupportedOperationException - if the text contains special tokens which are not supported for now
    • countTokensOrdinary

      int countTokensOrdinary(String text)
      Encodes the given text into a list of token ids and returns the number of tokens. It is more efficient than encodeOrdinary(String).

      This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.countTokensOrdinary("hello world");
       // returns 2
      
       encoding.countTokensOrdinary("hello <|endoftext|> world");
       // returns 8
       
      Parameters:
      text - the text to count tokens for
      Returns:
      the number of tokens
    • decode

      String decode(IntArrayList tokens)
      Decodes the given list of token ids into a text.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       IntArrayList tokens = new IntArrayList();
       tokens.add(15339);
       tokens.add(1917);
       encoding.decode(tokens);
       // returns "hello world"
      
       tokens.add(Integer.MAX_VALUE);
       encoding.decode(tokens);
       // raises an IllegalArgumentException
       
      Parameters:
      tokens - the list of token ids
      Returns:
      the decoded text
      Throws:
      IllegalArgumentException - if the list contains invalid token ids
    • decodeBytes

      byte[] decodeBytes(IntArrayList tokens)
      Decodes the given list of token ids into a byte array.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       IntArrayList tokens = new IntArrayList();
       tokens.add(15339);
       tokens.add(1917);
       encoding.decodeBytes(tokens);
       // returns [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
      
       tokens.add(Integer.MAX_VALUE);
       encoding.decodeBytes(tokens);
       // raises an IllegalArgumentException
       
      Parameters:
      tokens - the list of token ids
      Returns:
      the decoded byte array
      Throws:
      IllegalArgumentException - if the list contains invalid token ids
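      The byte array from the example above maps back to text with the standard library alone. The sketch below assumes the decoded bytes are UTF-8, which matches the "hello world" example but is an assumption here; the class DecodeBytesSketch and its helper utf8ToString are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class DecodeBytesSketch {
    /** Interprets the bytes as UTF-8 text (assumption: the encoding's bytes are UTF-8). */
    static String utf8ToString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The byte values shown in the decodeBytes example above.
        byte[] bytes = {104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100};
        System.out.println(utf8ToString(bytes));
        // prints hello world
    }
}
```

      Working with raw bytes may be useful when a token sequence ends in the middle of a multi-byte character, since converting such a fragment to a String would replace the incomplete bytes.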
    • getName

      String getName()
      Returns the name of this encoding. This is the name which is used to identify the encoding and must be unique for registration in the EncodingRegistry.
      Returns:
      the name of this encoding