Writing prompts for Generative AI systems will be a skill as critical as writing good web search queries has been over the past two decades. If you aim to be effective in making use of GenAI tools, “prompt engineering” is a skill that you should develop. A search box has become ubiquitous in modern online applications, and it is becoming common for applications to now offer a way to prompt a GenAI system to use them.
The ways that applications take this prompt are evolving, and are currently inconsistent across different types of application, e.g. image generation prompts for MidJourney can be very different to code generation prompts for GitHub CoPilot. I’m going to focus here on text chat services like ChatGPT, Gemini, Meta AI, and Claude as these are widely-used given their free access and broad applicability.
There are plenty of good guides out there that provide tips on prompt engineering. It’s a good idea to take a look at a range of these, as they are pretty good, quick to read, and cover different perspectives. For example, here are guides from OpenAI, Google, Microsoft, IBM, DigitalOcean and Cohere. You might also like to do a introductory prompt engineering course, such as Coursera’s one (or just read the paper it is based on).
I’m not going to duplicate these, but instead give a different perspective based on these guides and courses, and my own experiences in following them. The following are five things that I’ve learned, and while they are simple, they are basic principles that can guide your own personal skill-building in prompt engineering.
I’m interested in hearing what other people have found in their own prompt engineering journeys.
1. Prompts do not travel
If you’ve looked at a lot of the prompt engineering guides out there, you’ll have noticed that most of them focus on versions of ChatGPT. However, different GenAI chat services can respond very differently to the same prompt.
As shown in the screenshots above that, in response to the simple prompt “to be or not to be, that”, ChatGPT 3.5 has carried on with quoting Shakespeare, while Claude 3 Sonnet asserts that it is unable to.
Similarly, the same prompt can behave differently between different versions of the same chat service, e.g. ChatGPT 3.5 vs ChatGPT 4 Turbo, or with the same version but at different times.
The first screenshot above is from the Coursera prompt engineering course by Jules White, and used the ChatGPT 3.5 chat service from March 2023. The second screenshot is one that I recently took of the ChatGPT 3.5 chat service in April 2024. ChatGPT is being asked to solve a problem, and comes up with the answer “YES” in the first case, but an answer “NO” in the second case (which is the correct answer), doesn’t follow the formatting of the examples given, and breaks down the reasoning in the course of solving the problem. In the year since the course was recorded, OpenAI has updated the model and system prompt, and it behaves differently as a result.
Conclusion: The take-away from all this is that you shouldn’t assume a prompt that works today on one service will work identically tomorrow or work identically on a different service. If you are hard wiring a prompt into a software application, or putting it into a guide for people, that will have a life of more than a couple of months, you should take this into account. Add a monitoring function to your software application, or put an explanatory note into your guide, so when things change, people are not too surprised. Also, note that developing a prompt that works only on a single Gen AI service may be locking you in to using that service.
2. Responses are not predictable
Each word in the response from a Gen AI chat service is generated from a random choice across a probability distribution based on all the words seen up to that point. Given it is a random choice, it is quite possible that a response to a given prompt could be different each time that a prompt is given.
These screenshots show Gemini providing a different response to the same prompt of “What is the best flavour of Coca Cola?”. While they have many similarities, the first response includes Coca-Cola Spiced, which isn’t mentioned in the second. In theory, no matter how many times you give the same prompt, there’s a chance that the Gen AI chat service can give you something you haven’t seen before.
This is lack of determinism is a bit strange if you’re used to writing functions in Excel, or a software application, and have an expectation that computers do what you tell them to do. A spell checker in Word will give the same suggestions each time, but this isn’t guaranteed with Gen AI. It’s not a bug, it’s a feature!
Some Gen AI tools allow a “temperature” parameter to be set to zero, or a particular random seed to be chosen, which can limit the variability when repeating the same prompt. However, these may not be available on all tools, on the free tiers, or maybe not unless you use APIs to interact with them. They might also not stop all randomness, if different prompt requests are load-balanced across different servers which, in turn, may generate different random numbers due to their different internal states.
Conclusion: If repeatability is important (and often it is), the debugging process of prompt engineering involves presenting the same prompt multiple times to ensure it does give the desired result. Using parameters or flags to reduce randomness will also be valuable, assuming you have access to them, but may not be reliable. A software application that is using Gen AI may benefit from having checks to ensure that the response is in the expected form. (See also the OpenAI Evals framework.)
3. Structure gives more useful sessions
While prompts are made up of words, they can also be made up of symbols, or follow a structured pattern. For example, all of the Gen AI chat services I have used understand Markdown, which is a standardised way to add formatting to text documents.
This is particularly useful where a prompt specifies that a response should be in a particular format, since that particular format can include Markdown styling. It can highlight information in the result that is interesting, or just be used to make the response look nicer.
The second example shows the use of a template, where a field to populate is specified with angled brackets, e.g. <Country_name>. There is nothing special about using angled brackets, and a field could be specified in a variety of ways, including other punctuation, using capital letters, or using the word “field”. If it would be recognisable to a human as a field, the Gen AI chat service will probably pick up that it should be replaced by content in the output.
As shown above, it is helpful to clearly delineate between instructional and non-instructional material in the prompt. For example, if the prompt is to find an answer in a block of text, or to summarise a block of text, that block of text should be clearly defined as different to any prompt instructions. You might wrap the block of text in triple quotes (“””), or put a clear barrier between the instructions and the other material such as a blank line or triple hashes (###).
Lastly, if the output is going to be long and there may be more interactions to come in the chat session, it can be useful to specify that the structure of the output should include repeating some of the context. This can minimise the chance of the Gen AI chat session drifting away from the purpose of the chat as more tokens (words) are used. It might work well where the Gen AI prompt involves asking questions of the user. For example, adding something like “After doing this, restate why you have written this.”
Conclusion: Familiarise yourself with Markdown, as it can be handy in helping format the responses in a Gen AI chat. Additionally, consider how structure in your prompt and in the response can assist with having more reliable prompts and chat sessions.
4. Words encapsulate meanings
Every prompt engineering guide seems to have a comment about making sure your prompt is very clear. However, this understates how important the selection of words is. Each word is an opportunity to steer the Gen AI towards the right behaviour. For example, some words are used in only particular academic, business or social contexts, so choosing these words will shift the Gen AI into that context.
A common example is where the Gen AI is instructed to “act as” or to be a particular persona. In the screenshot above, the prompt included “Act as a creative brainstormer”. You can also work technical jargon into the prompt to encourage it to use jargon in its response rather than give a generalist answer.
The screenshots above show the same prompt with one significant word difference. The first screenshot asks about a “headache”, while the second asks about “cephalagia”. By using medical jargon in the prompt, the Gen AI session has responded with information about prescription medications, and combined some treatment options under a new category of lifestyle modifications.
If using templates, like shown previously, the words used for the field names can help the Gen AI with the task. Instead of vague or ambiguous words for placeholders, e.g. “output” or “result”, use words that have a clear meaning, e.g. “capital city name” or “task completion time”. These will be replaced by the Gen AI in the response, so it’s ok if they are a bit verbose.
Conclusion: In developing a prompt, try a range of different words, and in particular try words that have specific definitions in domains that relate to how you want the Gen AI to respond. You can also ask the Gen AI to provide wording suggestions, and try those out also.
5. In the absence of the right words, examples are good
In the parlance of machine learning, a system that can complete a new task without any additional training examples is called “zero shot learning“. Similarly, if you provide one example, that’s “one shot learning”, or just a few examples is “few shot learning”. Providing examples in a prompt to a Gen AI is not learning in the same way that is meant by those terms, but the terms have come to stand in for a similar approach.
Sometimes it’s easier just to give examples to the Gen AI for what you’d like it to do rather than experiment with a range of prompts until you hit upon the perfect one. Frequently, one example is not enough, so a few examples are required. You may also need to combine it with a short prompt to make clear that the Gen AI is to follow the examples as a template for how to respond.
Since the Gen AI first has to guess what to do before it can do it, there’s a risk it will guess the wrong thing. For example, the first screenshot required around five attempts before the Gen AI tool gave the correct response. Another possible approach is to use some examples to have the Gen AI generate the instructions to use in a prompt. This way, the first guessing step can be skipped in future interactions.
It may take a bit of experimentation to have the Gen AI provide a repeatable set of instructions that does what you want.
Conclusion: The ability for current Gen AI tools to be able to infer an action from examples is rather impressive, but using examples can increase the risk that the prompt is not consistently followed. Providing structure, and descriptive words in any field names and in any introductory instructions can help. Also, perhaps examples can be used to generate a new prompt rather than be used directly.