Using OpenAI's TikToken in Ruby

The encoding used by the tiktoken library (and by the Ruby binding discussed in this post) is a specific way of converting text into a sequence of tokens, each represented by a unique integer ID. The encoding scheme is designed to work with OpenAI models like gpt-3.5-turbo and is based on the model's vocabulary and tokenizer.

There's a simple Ruby binding for tiktoken, the tiktoken_ruby gem made by IAPark, that compiles the underlying Rust library.

First, add it to your Gemfile:

gem "tiktoken_ruby"

Then use it in your code. The service module I wrote today to use it in my Rails app looks like this:

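require 'tiktoken_ruby'

module TikToken
  extend self

  DEFAULT_MODEL = "gpt-3.5-turbo"

  # Counts the entries returned by get_tokens. Because that is a Hash keyed
  # by token ID, a token that repeats in the string is counted once.
  def count(string, model: DEFAULT_MODEL)
    get_tokens(string, model: model).length
  end

  # Returns a Hash mapping each token ID to the text it decodes to.
  def get_tokens(string, model: DEFAULT_MODEL)
    encoding = Tiktoken.encoding_for_model(model)
    tokens = encoding.encode(string)
    tokens.map do |token|
      [token, encoding.decode([token])]
    end.to_h
  end
end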

Here's what it looks like in practice.

irb> TikToken.count("Absence is to love what wind is to fire; it extinguishes the small, it inflames the great.")
=> 19

irb> TikToken.get_tokens("Absence is to love what wind is to fire; it extinguishes the small, it inflames the great.")
 374=>" is",
 311=>" to",
 3021=>" love",
 1148=>" what",
 10160=>" wind",
 4027=>" fire",
 433=>" it",
 56807=>" extingu",
 279=>" the",
 2678=>" small",
 4704=>" infl",
 2294=>" great",
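
The output above is keyed by token ID, so a token that occurs more than once in the sentence (" is" and " to", for example) shows up only once. Here is a minimal sketch of that effect in plain Ruby, using a few of the pairs shown above and no gem. The PAIRS constant is just an illustrative name, not part of the service module.

```ruby
# A few [token_id, text] pairs copied from the output above, in sentence order.
# " is" (374) and " to" (311) each occur twice in this part of the sentence.
PAIRS = [
  [374, " is"], [311, " to"], [3021, " love"], [1148, " what"],
  [10160, " wind"], [374, " is"], [311, " to"], [4027, " fire"]
].freeze

total_tokens = PAIRS.length          # every occurrence is counted
distinct_tokens = PAIRS.to_h.length  # duplicate keys collapse in a Hash

puts "total: #{total_tokens}, distinct: #{distinct_tokens}"
# prints "total: 8, distinct: 6"
```

If what you need is the total number of tokens (for a context limit, say), counting the array that encode returns, as in encoding.encode(string).length, avoids the collapse.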

The encoding is essential for processing text with the OpenAI models: they operate on token IDs rather than raw text, so the encoding is what turns a string into input a model can consume and turns its output back into text. In the context of the tiktoken library, the encoding is particularly helpful for estimating token counts in a text string without making an API call to OpenAI services.
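
One common use for a local count is checking a prompt against a token budget before sending it. What follows is only a sketch under stated assumptions: TokenBudget is a hypothetical helper (not part of tiktoken_ruby), the limit is whatever budget you choose for your model, and the counter is supplied by the caller. A word-splitting stand-in is used here so the example runs without the gem.

```ruby
# Hypothetical helper (not part of tiktoken_ruby): checks a prompt against a
# token budget using whatever counter the caller supplies.
module TokenBudget
  module_function

  # text:    the prompt to check
  # limit:   maximum tokens allowed (an assumed budget; use your model's limit)
  # counter: any callable that returns a token count for a string
  def fits?(text, limit:, counter:)
    counter.call(text) <= limit
  end

  # Tokens left over after the prompt is counted, never below zero.
  def remaining(text, limit:, counter:)
    [limit - counter.call(text), 0].max
  end
end

# Stand-in counter so the sketch runs without the gem: one "token" per
# whitespace-separated word. This is NOT how tiktoken counts.
word_counter = ->(s) { s.split.length }

puts TokenBudget.fits?("Absence is to love", limit: 10, counter: word_counter)
puts TokenBudget.remaining("Absence is to love", limit: 10, counter: word_counter)
```

With the gem installed, the stand-in could be swapped for a real counter such as ->(s) { Tiktoken.encoding_for_model("gpt-3.5-turbo").encode(s).length }.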