Using OpenAI's TikToken in Ruby
The encoding used by the tiktoken library (and the Ruby binding discussed in this post) is a way of converting text into a sequence of tokens, each represented by a unique integer ID. The encoding scheme is designed to match OpenAI models like gpt-3.5-turbo and is based on the model's vocabulary and tokenizer.
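To make that concrete, here's a minimal round trip using the tiktoken_ruby gem introduced below (the exact IDs depend on the model's encoding):

require 'tiktoken_ruby'

enc = Tiktoken.encoding_for_model("gpt-3.5-turbo")
ids = enc.encode("hello world")  # an array of integer token IDs
enc.decode(ids)                  # => "hello world"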
There's a simple Ruby binding for TikToken made by IAPark that compiles the underlying Rust library: https://rubygems.org/gems/tiktoken_ruby
First, add it to your Gemfile:
gem "tiktoken_ruby"
Then use it in your code. The service module I wrote today for my Rails app looks like this:
require 'tiktoken_ruby'

module TikToken
  extend self

  DEFAULT_MODEL = "gpt-3.5-turbo"

  # Size of the hash returned by get_tokens. Because that hash is keyed by
  # token ID, repeated tokens are only counted once.
  def count(string, model: DEFAULT_MODEL)
    get_tokens(string, model: model).length
  end

  # Maps each token ID in the string to the text fragment it decodes to.
  def get_tokens(string, model: DEFAULT_MODEL)
    encoding = Tiktoken.encoding_for_model(model)
    tokens = encoding.encode(string)
    tokens.map do |token|
      [token, encoding.decode([token])]
    end.to_h
  end
end
Here's what it looks like in practice.
irb> TikToken.count("Absence is to love what wind is to fire; it extinguishes the small, it inflames the great.")
=> 19
irb> TikToken.get_tokens("Absence is to love what wind is to fire; it extinguishes the small, it inflames the great.")
=>
{28878=>"Abs",
768=>"ence",
374=>" is",
311=>" to",
3021=>" love",
1148=>" what",
10160=>" wind",
4027=>" fire",
26=>";",
433=>" it",
56807=>" extingu",
21168=>"ishes",
279=>" the",
2678=>" small",
11=>",",
4704=>" infl",
986=>"ames",
2294=>" great",
13=>"."}
The encoding is essential when working with OpenAI models, since the models consume and produce text as sequences of these tokens rather than raw characters. In the context of the tiktoken library, the encoding is particularly useful for estimating the token count of a string locally, without making an API call to OpenAI services.
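One common use is a pre-flight check before sending a prompt to the API. Here's a minimal sketch using the TikToken module above; MAX_PROMPT_TOKENS is a hypothetical budget you'd pick for your model, not something the gem provides:

MAX_PROMPT_TOKENS = 3_000  # hypothetical budget; adjust for your model's context window

def within_budget?(prompt)
  TikToken.count(prompt) <= MAX_PROMPT_TOKENS
end

# Trim or chunk the prompt before calling the API if this returns false.
prompt = "Absence is to love what wind is to fire; it extinguishes the small, it inflames the great."
within_budget?(prompt)  # => true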