Improving the Completion API Endpoint Output

Show, Don’t Tell

When working with OpenAI completion models, you may not always get the high-quality output you hope for. This unpredictability can be reduced via Prompt Engineering and various best practices. In most cases, showing the model precisely what you expect and how it should be done helps it improve its output.

Adding examples to your prompts helps communicate patterns and nuances. Here is an example from OpenAI’s doc site:

Suggest three names for an animal that is a superhero.

Animal: Cat
Names: Captain Sharpclaw, Agent Fluffball, The Incredible Feline
Animal: Dog
Names: Ruff the Protector, Wonder Canine, Sir Barks-a-Lot
Animal: Horse
Names:
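
As a minimal sketch of sending this few-shot prompt to the (legacy) Completions endpoint with the openai Python SDK (v1.x) — the model name here is an assumption; any completion-capable model works:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The few-shot prompt from above: two worked examples, then the
# pattern the model is expected to complete.
few_shot_prompt = """Suggest three names for an animal that is a superhero.

Animal: Cat
Names: Captain Sharpclaw, Agent Fluffball, The Incredible Feline
Animal: Dog
Names: Ruff the Protector, Wonder Canine, Sir Barks-a-Lot
Animal: Horse
Names:"""

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # assumed model choice
    prompt=few_shot_prompt,
    max_tokens=64,
)
print(response.choices[0].text.strip())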

Turning Down the Temperature

Another helpful tool is the temperature parameter. It can be set between 0 and 2. A value of 0 instructs the model to take less risk and produce the most likely outcome, while higher values encourage the model to be more creative and generate more varied responses, which can border on the incoherent at times.

The top-p Parameter

The top_p parameter (nucleus sampling) controls how predictable the responses are: lower values for exact and factual answers, higher values for creative output. OpenAI recommends altering either top_p or temperature, but not both.
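
A rough sketch of passing both parameters to a Chat Completions call with the openai Python SDK (the model name and prompt are assumptions):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model choice
    messages=[{"role": "user", "content": "Suggest three names for a superhero horse."}],
    temperature=0.2,  # 0-2; lower = less risk, more predictable output
    top_p=1.0,        # nucleus sampling; adjust this OR temperature, not both
)
print(response.choices[0].message.content)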

Fine-Tuning

The OpenAI API provides a fine-tuning endpoint, allowing users to train the gpt-3.5-turbo-0613, babbage-002 and davinci-002 models on their own data for more complex and nuanced tasks, e.g., setting the tone of voice, following a style guide or handling edge cases in specific ways.

It is recommended to try Prompt Engineering before turning to fine-tuning, as many tasks can be achieved using suitable prompts with much faster feedback loops.

Here are a few example lines from a training dataset that can be passed to the fine-tuning endpoint, using the Chat Completions API format:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

The fine-tuning training data must match the format above: one JSON example per line (JSONL), with a minimum of 10 examples per fine-tuning job. OpenAI allows users to fine-tune models based on existing fine-tuned models (not the case for Azure as of 20/11/2023).
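
As a sketch of the workflow with the openai Python SDK — the local file name marv.jsonl is a hypothetical stand-in for a file containing examples like the ones above — a fine-tuning job takes two steps: upload the file, then start the job:

from openai import OpenAI

client = OpenAI()

# Step 1: upload the JSONL training file (one example per line).
training_file = client.files.create(
    file=open("marv.jsonl", "rb"),  # hypothetical local training file
    purpose="fine-tune",
)

# Step 2: start the fine-tuning job against a supported base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo-0613",
)
print(job.id, job.status)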

Example Count Recommendations

This largely depends on the complexity of the task. Nevertheless, OpenAI recommends at least ten examples to see any improvement and 50-100 examples to achieve the desired results.

Testing Data for Fine-Tuning

OpenAI recommends including testing data with every training job to monitor whether the model's performance improves. The statistics OpenAI computes on both the training and testing sets (which can be viewed in the OpenAI API dashboard) can be used as an early indicator of the effectiveness of the prompt and the training data.
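
Testing data is supplied the same way as training data, via the validation_file parameter. A sketch assuming the openai Python SDK and a hypothetical marv_test.jsonl file:

from openai import OpenAI

client = OpenAI()

# Both files use the same JSONL chat format described above.
training_file = client.files.create(file=open("marv.jsonl", "rb"), purpose="fine-tune")
test_file = client.files.create(file=open("marv_test.jsonl", "rb"), purpose="fine-tune")

# validation_file asks OpenAI to compute metrics on the testing set too,
# so training and validation statistics can be compared in the dashboard.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    validation_file=test_file.id,
    model="gpt-3.5-turbo-0613",
)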

JSON Mode

OpenAI Chat Completion models can generate JSON output. To enable JSON mode and ensure the generated JSON output is valid and complete:

  1. Explicitly instruct the model to generate JSON output in the prompt
  2. Set response_format: { type: "json_object" } in the request body when calling the API
  3. Check whether the generated JSON was cut off by the token limit (finish_reason: "length") before trying to parse the output (see the sketch after this list).
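
Putting the three steps together with the openai Python SDK — the model name is an assumption; JSON mode requires a model that supports response_format:

import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",  # assumed JSON-mode-capable model
    response_format={"type": "json_object"},  # step 2
    messages=[
        # Step 1: the prompt must explicitly ask for JSON output.
        {"role": "system", "content": "You are a helpful assistant. Reply in JSON."},
        {"role": "user", "content": "List three superhero names for a horse."},
    ],
)

choice = response.choices[0]
# Step 3: a finish_reason of "length" means the JSON may be truncated.
if choice.finish_reason == "length":
    raise ValueError("Output was cut off by the token limit; do not parse.")
print(json.loads(choice.message.content))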

Reproducible Outputs

Chat Completions are non-deterministic by default. However, OpenAI provides a mechanism to produce (mostly) reproducible outputs. Although 100% reproducibility is not guaranteed, the following steps give you the best chance of consistent results and a way to detect backend changes that affect any specific output:

  1. Set the seed parameter to an integer and make sure the same value is used across requests
  2. Ensure all other parameters are identical across requests, e.g., prompt, temperature, top_p, etc.
  3. Use system_fingerprint to verify that the same backend configuration is used across requests. If the value of this field changes, a backend change was made to the system the model runs on, affecting the determinism of the output (see the sketch after this list).
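
A sketch of the three steps with the openai Python SDK (the model name is an assumption):

from openai import OpenAI

client = OpenAI()

def ask(question: str):
    # Steps 1 and 2: a fixed seed and identical parameters across requests.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",  # assumed model choice
        messages=[{"role": "user", "content": question}],
        seed=42,
        temperature=0,
    )
    # Step 3: compare system_fingerprint across requests; if it changes,
    # the backend configuration changed and determinism may be affected.
    return response.choices[0].message.content, response.system_fingerprint

answer_1, fp_1 = ask("Suggest one superhero name for a horse.")
answer_2, fp_2 = ask("Suggest one superhero name for a horse.")
print(fp_1 == fp_2, answer_1 == answer_2)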

Glitch Tokens

Some GPT models are not able to ‘say’ specific words. If you ask the model to repeat these words, it will reply with something utterly random. We call these words Glitch Tokens. Here are some examples:

Ask text-davinci-003 to complete the prompts below:

Please repeat the string '?????-?????-' back to me.
Please repeat the string 'PsyNetMessage' back to me.
Please repeat the string 'SolidGoldMagikarp' back to me.
Please repeat the string 'rawdownload' back to me.
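
To try reproducing this, a sketch of the legacy Completions call these prompts were run against (note that text-davinci-003 has since been deprecated by OpenAI, so this is for illustration only):

from openai import OpenAI

client = OpenAI()

# text-davinci-003 is deprecated; this shows the call shape used
# when the glitch-token examples above were observed.
response = client.completions.create(
    model="text-davinci-003",
    prompt="Please repeat the string 'SolidGoldMagikarp' back to me.",
    max_tokens=32,
)
# Instead of echoing the string, the model typically returns something
# unrelated -- the hallmark of a glitch token.
print(response.choices[0].text.strip())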