April 2024 – Andrew E. Scott

There’s currently a spate of negative articles about the Apple Vision Pro. I think these overstate any “failure” of this device, and I’m going to tell you why. Assuming you read on. 😀

Zuckerberg got in early with his own negative take, but recently we’ve had some others. You might have seen articles like the one on the Disconnect blog, or the one from Forbes. It seems to be time for a pile-on.

It’s tempting to think back to the launch of the original iPhone 📱 , which was a 2G feature-phone, without copy-and-paste functionality or an app store. It lacked many features of what were already standard smartphone features at the time. However, it was received generally very positively by a customer base hungry for what even this limited phone delivered.

What is different this time is that the AVP 🕶 delivers very well against a use case that not many people want. Specifically, the use case of watching movies on a big, high-def screen no matter where you are. It has great resolution and sound, and the pass-through capabilities allow you to virtually hang a big screen in your environment wherever you want. If that’s what you’re after, it’s truly amazing.

Unfortunately, that isn’t one of the three areas that I’ve previously written about where VR headsets have a strong advantage. These are in fitness applications, virtual office environments, and immersive training. And, no, I don’t consider general games/entertainment to be an area where VR is particularly strong at the moment.

This initial version of the AVP is a bit heavy to wear on the head, has a UI optimised for consumption (eye-tracking based), lacks controllers, doesn’t support multiple monitor extensions for a laptop/desktop, has detectable motion blur when moving your head, and each AVP needs to be personally tailored for the individual who will wear it making it harder for a single unit to be shared by multiple people. (I’m going to ignore the issues around the high price, external battery pack, and the odd EyeSight feature that makes eyes visible through the headset, as these aren’t relevant to my argument.)

Together, these design decisions make it hard to have a great fitness application or virtual office environment. Immersive training is still possible for the situation where the user is downloading their own training apps, but the custom fit procedure makes it harder for a training organisation to use them.

Given the lack of fit between the product and the strongest VR use cases, I’m not surprised that the first version of the AVP hasn’t been a runaway success. However, I don’t think this means Apple has failed. The AVP is an amazing technical achievement, and the many of the issues can be corrected in a future version.

It appears that Zuckerberg hasn’t written-off Apple either, as he recently announced Meta Horizon OS, where the operating system for the Meta Quest headsets is being licensed to other hardware makers for their own headsets. It’s a clear play to position Meta as the “open” alternative to Apple in the VR headset space, similar to the Android ecosystem versus the iPhone, and makes sense only if Meta feels a competitive threat.

The next version of AVP should ensure it supports some of the three areas of VR strength. For instance,

To better support virtual office environments, provide an alternative to the eye-tracking based UI, fix the motion blur, and allow multiple virtual monitors to be extended into the AVP (there are rumours that this is coming).
To better support fitness applications, make the unit on the head lighter (e.g. move more weight to the battery pack, remove the components for EyeSight), fix the motion blur, and add support for controllers.
To better support immersive training, make the unit on the head lighter, fix motion blur, and add support for controllers, but most of all, remove the need for custom fit components (i.e. different Light Seals, Light Seal Cushions and Optical Inserts).

What is probably the simplest one is listed first, which might be addressed through software updates. The others will require hardware changes. Still, the possibility to support the virtual office environment use case better through software changes alone gives me confidence that Apple will get there.

So, I wouldn’t call the AVP a failure. Perhaps a misfire? The initial version of Apple TV wasn’t a roaring success either. Like Zuckerberg, I believe Apple has proven their technical capabilities, and will evolve their product into something compelling.

Writing prompts for Generative AI systems will be a skill as critical as writing good web search queries has been over the past two decades. If you aim to be effective in making use of GenAI tools, “prompt engineering” is a skill that you should develop. A search box has become ubiquitous in modern online applications, and it is becoming common for applications to now offer a way to prompt a GenAI system to use them.

The ways that applications take this prompt are evolving, and are currently inconsistent across different types of application, e.g. image generation prompts for MidJourney can be very different to code generation prompts for GitHub CoPilot. I’m going to focus here on text chat services like ChatGPT, Gemini, Meta AI, and Claude as these are widely-used given their free access and broad applicability.

There are plenty of good guides out there that provide tips on prompt engineering. It’s a good idea to take a look at a range of these, as they are pretty good, quick to read, and cover different perspectives. For example, here are guides from OpenAI, Google, Microsoft, IBM, DigitalOcean and Cohere. You might also like to do a introductory prompt engineering course, such as Coursera’s one (or just read the paper it is based on).

I’m not going to duplicate these, but instead give a different perspective based on these guides and courses, and my own experiences in following them. The following are five things that I’ve learned, and while they are simple, they are basic principles that can guide your own personal skill-building in prompt engineering.

I’m interested in hearing what other people have found in their own prompt engineering journeys.

1. Prompts do not travel

If you’ve looked at a lot of the prompt engineering guides out there, you’ll have noticed that most of them focus on versions of ChatGPT. However, different GenAI chat services can respond very differently to the same prompt.

As shown in the screenshots above that, in response to the simple prompt “to be or not to be, that”, ChatGPT 3.5 has carried on with quoting Shakespeare, while Claude 3 Sonnet asserts that it is unable to.

Similarly, the same prompt can behave differently between different versions of the same chat service, e.g. ChatGPT 3.5 vs ChatGPT 4 Turbo, or with the same version but at different times.

The first screenshot above is from the Coursera prompt engineering course by Jules White, and used the ChatGPT 3.5 chat service from March 2023. The second screenshot is one that I recently took of the ChatGPT 3.5 chat service in April 2024. ChatGPT is being asked to solve a problem, and comes up with the answer “YES” in the first case, but an answer “NO” in the second case (which is the correct answer), doesn’t follow the formatting of the examples given, and breaks down the reasoning in the course of solving the problem. In the year since the course was recorded, OpenAI has updated the model and system prompt, and it behaves differently as a result.

Conclusion: The take-away from all this is that you shouldn’t assume a prompt that works today on one service will work identically tomorrow or work identically on a different service. If you are hard wiring a prompt into a software application, or putting it into a guide for people, that will have a life of more than a couple of months, you should take this into account. Add a monitoring function to your software application, or put an explanatory note into your guide, so when things change, people are not too surprised. Also, note that developing a prompt that works only on a single Gen AI service may be locking you in to using that service.

2. Responses are not predictable

Each word in the response from a Gen AI chat service is generated from a random choice across a probability distribution based on all the words seen up to that point. Given it is a random choice, it is quite possible that a response to a given prompt could be different each time that a prompt is given.

These screenshots show Gemini providing a different response to the same prompt of “What is the best flavour of Coca Cola?”. While they have many similarities, the first response includes Coca-Cola Spiced, which isn’t mentioned in the second. In theory, no matter how many times you give the same prompt, there’s a chance that the Gen AI chat service can give you something you haven’t seen before.

This is lack of determinism is a bit strange if you’re used to writing functions in Excel, or a software application, and have an expectation that computers do what you tell them to do. A spell checker in Word will give the same suggestions each time, but this isn’t guaranteed with Gen AI. It’s not a bug, it’s a feature!

Some Gen AI tools allow a “temperature” parameter to be set to zero, or a particular random seed to be chosen, which can limit the variability when repeating the same prompt. However, these may not be available on all tools, on the free tiers, or maybe not unless you use APIs to interact with them. They might also not stop all randomness, if different prompt requests are load-balanced across different servers which, in turn, may generate different random numbers due to their different internal states.

Conclusion: If repeatability is important (and often it is), the debugging process of prompt engineering involves presenting the same prompt multiple times to ensure it does give the desired result. Using parameters or flags to reduce randomness will also be valuable, assuming you have access to them, but may not be reliable. A software application that is using Gen AI may benefit from having checks to ensure that the response is in the expected form. (See also the OpenAI Evals framework.)

3. Structure gives more useful sessions

While prompts are made up of words, they can also be made up of symbols, or follow a structured pattern. For example, all of the Gen AI chat services I have used understand Markdown, which is a standardised way to add formatting to text documents.

This is particularly useful where a prompt specifies that a response should be in a particular format, since that particular format can include Markdown styling. It can highlight information in the result that is interesting, or just be used to make the response look nicer.

The second example shows the use of a template, where a field to populate is specified with angled brackets, e.g. <Country_name>. There is nothing special about using angled brackets, and a field could be specified in a variety of ways, including other punctuation, using capital letters, or using the word “field”. If it would be recognisable to a human as a field, the Gen AI chat service will probably pick up that it should be replaced by content in the output.

As shown above, it is helpful to clearly delineate between instructional and non-instructional material in the prompt. For example, if the prompt is to find an answer in a block of text, or to summarise a block of text, that block of text should be clearly defined as different to any prompt instructions. You might wrap the block of text in triple quotes (“””), or put a clear barrier between the instructions and the other material such as a blank line or triple hashes (###).

Lastly, if the output is going to be long and there may be more interactions to come in the chat session, it can be useful to specify that the structure of the output should include repeating some of the context. This can minimise the chance of the Gen AI chat session drifting away from the purpose of the chat as more tokens (words) are used. It might work well where the Gen AI prompt involves asking questions of the user. For example, adding something like “After doing this, restate why you have written this.”

Conclusion: Familiarise yourself with Markdown, as it can be handy in helping format the responses in a Gen AI chat. Additionally, consider how structure in your prompt and in the response can assist with having more reliable prompts and chat sessions.

4. Words encapsulate meanings

Every prompt engineering guide seems to have a comment about making sure your prompt is very clear. However, this understates how important the selection of words is. Each word is an opportunity to steer the Gen AI towards the right behaviour. For example, some words are used in only particular academic, business or social contexts, so choosing these words will shift the Gen AI into that context.

A common example is where the Gen AI is instructed to “act as” or to be a particular persona. In the screenshot above, the prompt included “Act as a creative brainstormer”. You can also work technical jargon into the prompt to encourage it to use jargon in its response rather than give a generalist answer.

The screenshots above show the same prompt with one significant word difference. The first screenshot asks about a “headache”, while the second asks about “cephalagia”. By using medical jargon in the prompt, the Gen AI session has responded with information about prescription medications, and combined some treatment options under a new category of lifestyle modifications.

If using templates, like shown previously, the words used for the field names can help the Gen AI with the task. Instead of vague or ambiguous words for placeholders, e.g. “output” or “result”, use words that have a clear meaning, e.g. “capital city name” or “task completion time”. These will be replaced by the Gen AI in the response, so it’s ok if they are a bit verbose.

Conclusion: In developing a prompt, try a range of different words, and in particular try words that have specific definitions in domains that relate to how you want the Gen AI to respond. You can also ask the Gen AI to provide wording suggestions, and try those out also.

5. In the absence of the right words, examples are good

In the parlance of machine learning, a system that can complete a new task without any additional training examples is called “zero shot learning“. Similarly, if you provide one example, that’s “one shot learning”, or just a few examples is “few shot learning”. Providing examples in a prompt to a Gen AI is not learning in the same way that is meant by those terms, but the terms have come to stand in for a similar approach.

Sometimes it’s easier just to give examples to the Gen AI for what you’d like it to do rather than experiment with a range of prompts until you hit upon the perfect one. Frequently, one example is not enough, so a few examples are required. You may also need to combine it with a short prompt to make clear that the Gen AI is to follow the examples as a template for how to respond.

Since the Gen AI first has to guess what to do before it can do it, there’s a risk it will guess the wrong thing. For example, the first screenshot required around five attempts before the Gen AI tool gave the correct response. Another possible approach is to use some examples to have the Gen AI generate the instructions to use in a prompt. This way, the first guessing step can be skipped in future interactions.

It may take a bit of experimentation to have the Gen AI provide a repeatable set of instructions that does what you want.

Conclusion: The ability for current Gen AI tools to be able to infer an action from examples is rather impressive, but using examples can increase the risk that the prompt is not consistently followed. Providing structure, and descriptive words in any field names and in any introductory instructions can help. Also, perhaps examples can be used to generate a new prompt rather than be used directly.

Month: April 2024

Apple Vision Pro – down but not out

5 Basic Prompt Engineering Learnings

1. Prompts do not travel

2. Responses are not predictable

3. Structure gives more useful sessions

4. Words encapsulate meanings

5. In the absence of the right words, examples are good