Prompt engineering is a natural language processing (NLP) concept that involves discovering inputs that yield desirable or useful results. Prompting is the equivalent of telling the genie in the magic lamp what to do. In this case, the magic lamp is DALL-E, ready to generate any image you wish for.
Understanding prompt engineering
Just like the wishes you express can turn against you, when you prompt the machine, the way you express what it needs to do can dramatically change the output.
And the most interesting part?
Prompting was not a feature deliberately designed by AI experts. It was an emergent one. In short, as these huge machine learning models were developed, prompting simply became the way to get the machine to execute instructions.
No one asked for it; it just happened!
In a 2021 paper, researchers from Stanford highlighted how transformer-based models have become foundation models.
As explained in the same paper:
The story of AI has been one of increasing emergence and homogenization. With the introduction of machine learning, how a task is performed emerges (is inferred automatically) from examples; with deep learning, the high-level features used for prediction emerge; and with foundation models, even advanced functionalities such as in-context learning emerge. At the same time, machine learning homogenizes learning algorithms (e.g., logistic regression), deep learning homogenizes model architectures (e.g., Convolutional Neural Networks), and foundation models homogenizes the model itself (e.g., GPT-3).
Prompt engineering is a process used in AI where one or several tasks are converted to a prompt-based dataset that a language model is then trained to learn.
The motivation behind prompt engineering can be difficult to understand at face value, so let’s describe the idea with an example.
Imagine that you are establishing an online food delivery platform and you possess thousands of images of different vegetables to include on the site.
The only problem is that none of the image metadata describes which vegetables are in which photos.
At this point, you could tediously sort through the images and place potato photos in the potato folder, broccoli photos in the broccoli folder, and so forth.
You could also run all the images through a classifier to sort them more easily but, as you discover, training the classifier model still requires labeled data.
Using prompt engineering, you can write a text-based prompt that you feel will produce the best image classification results.
For example, you could tell the model to show “an image containing potatoes”. The structure of this prompt – or the statement that defines how the model recognizes images – is fundamental to prompt engineering.
Writing the best prompt is often a matter of trial and error. Indeed, the prompt “an image containing potatoes” is quite different from “a photo of potatoes” or “a collection of potatoes”.
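This trial-and-error process can be sketched in a few lines of Python. The example below is a toy illustration, not real CLIP inference: the embedding vectors are hand-made stand-ins for what a text and image encoder such as CLIP would actually produce, chosen only to show how candidate prompts can be scored against an image.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical hand-made embeddings; a real model (e.g. CLIP) would
# compute these vectors from the image and from each prompt.
image_embedding = np.array([0.9, 0.1, 0.3])

prompt_embeddings = {
    "an image containing potatoes": np.array([0.8, 0.2, 0.4]),
    "a photo of potatoes":          np.array([0.7, 0.1, 0.2]),
    "a collection of potatoes":     np.array([0.3, 0.9, 0.5]),
}

# Score each candidate prompt against the image and keep the best match.
scores = {prompt: cosine_similarity(image_embedding, vec)
          for prompt, vec in prompt_embeddings.items()}
best_prompt = max(scores, key=scores.get)
print(best_prompt)  # the phrasing that scores highest for this image
```

In practice you would run each candidate phrasing over a labeled validation set and keep the prompt that classifies it most accurately; the mechanics of "compare scores, keep the winner" are the same.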
Prompt engineering best practices
Like most processes, the quality of the inputs determines the quality of the outputs. Designing effective prompts increases the likelihood that the model will return a response that is both favorable and contextual.
Writing good prompts is a matter of understanding what the model “knows” about the world and then applying that information accordingly.
Some believe it is akin to the game of charades where the actor provides just enough information for their partner to figure out the word or phrase using their intellect.
There is no point in overloading the model with all the information at once; a concise, well-structured prompt leaves room for it to do the inference work itself.
Prompt engineering and the CLIP model
The CLIP (Contrastive Language-Image Pre-training) model was developed by the AI research laboratory OpenAI in 2021.
According to researchers, CLIP is “a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3.”
CLIP was trained on over 400 million image-text pairs, each consisting of an image matched with a caption.
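The contrastive training idea behind CLIP can be sketched in a few lines of NumPy, loosely following the pseudocode published in the CLIP paper: embed each image and each caption, compute a similarity matrix over the batch, and apply a symmetric cross-entropy loss so that matched pairs score higher than mismatched ones. The toy embeddings below are hand-made stand-ins, not outputs of a real encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_emb and row i of text_emb are a matched pair,
    so the diagonal of the similarity matrix holds the positives.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (n, n) similarities
    n = logits.shape[0]
    idx = np.arange(n)                             # pair i matches caption i

    # Cross-entropy in both directions: image->text and text->image.
    loss_i = -np.log(softmax(logits, axis=1)[idx, idx])
    loss_t = -np.log(softmax(logits, axis=0)[idx, idx])
    return float((loss_i + loss_t).mean() / 2)

# Toy batch: two matched (image, caption) embedding pairs whose
# vectors nearly align, so the loss should be close to zero.
images = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
texts = np.array([[0.9, 0.1, 0.0, 0.0],
                  [0.1, 0.9, 0.0, 0.0]])
print(clip_contrastive_loss(images, texts))
```

Training pushes matched image-caption pairs together and mismatched ones apart in a shared embedding space, which is exactly what makes prompt-based retrieval and classification possible afterwards.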
Using this training, one can input an image along with a set of candidate captions, and the model will rank them and return the caption it scores as the most relevant.
The above quote also touches on the zero-shot capabilities of CLIP which makes it somewhat special among machine learning models.
Most classifiers trained to recognize apples and oranges, for example, are expected to perform well on classifying apples and oranges but generally won’t detect bananas.
Some models, including CLIP, GPT-2, and GPT-3, can nevertheless handle such unseen categories. In other words, they can execute tasks that they weren't explicitly trained to perform. This ability is known as zero-shot learning.
- Prompt engineering is a natural language processing (NLP) concept that involves discovering inputs that yield desirable or useful results.
- Like most processes, the quality of the inputs determines the quality of the outputs in prompt engineering. Designing effective prompts increases the likelihood that the model will return a response that is both favorable and contextual.
- Developed by OpenAI, the CLIP (Contrastive Language-Image Pre-training) model is an example of a model that uses natural-language prompts to match images with captions, and it was trained on over 400 million image-caption pairs.