One reason artificial intelligence-based chatbots have taken the world by storm in recent months is because they can generate or finesse text for a variety of purposes, whether it’s to create an ad campaign or write a resume.
These chatbots are powered by large language model (LLM) algorithms, which can mimic human intelligence and create textual content as well as audio, video, images, and computer code. LLMs are a type of artificial intelligence trained on a massive trove of articles, books, or internet-based resources and other input to produce human-like responses to natural language inputs.
A growing number of tech firms have unveiled generative AI tools based on LLMs for business use to automate application tasks. For example, Microsoft last week rolled out to a limited number of users a chatbot based on OpenAI’s ChatGPT; it’s embedded in Microsoft 365 and can automate CRM and ERP application functions.
For example, the new Microsoft 365 Copilot can be used in Word to create a first draft of a document, potentially saving hours of time writing, sourcing, and editing. Salesforce also announced plans to release a GPT-based chatbot for use with its CRM platform.
Most LLMs, such as OpenAI’s GPT-4, are pretrained as next word or content prediction engines — that is how most businesses use them, “out of the box,” as it were. And while LLM-based chatbots have produced their share of errors, pretrained LLMs work relatively well at feeding mostly accurate and compelling content that, at the very least, can be used as a jumping off point.
Many industries, however, require more customized LLM algorithms, those that understand their jargon and produce content specific to their users. LLMs for the healthcare industry, for instance, might need to process and interpret electronic health records (EHRs), suggest treatments, or create a patient healthcare summary based on physician notes or voice recordings. An LLM tuned to the financial services industry can summarize earnings calls, create meeting transcripts, and perform fraud analysis to protect consumers.
Across various industries, ensuring a high degree of response accuracy can be paramount.
Most LLMs can be accessed through an application programming interface (API) that allows the user to create parameters or adjustments to how the LLM responds. A question or request sent to a chatbot is called a prompt, in that the user is prompting a response. Prompts can be natural language questions, code snippets, or commands, but for the LMM to do its job accurately, the prompts have to be on point.
And that necessity has given rise to a new skill: prompt engineering.
Prompt engineering explained
Prompt engineering is the process of crafting and optimizing text prompts for large language models to achieve desired outcomes. “[It] helps LLMs for rapid iteration in product prototyping and exploration, as it tailors the LLM to better align with the task definition quickly and easily,” said Marshall Choy, senior vice president of product at SambaNova Systems, a Silicon Valley startup that makes semiconductors for artificial intelligence (AI).
Perhaps as important for users, prompt engineering is poised to become a vital skill for IT and business professionals, according to Eno Reyes, a machine learning engineer with Hugging Face, a community-driven platform that creates and hosts LLMs.
“Lots of people I know in software, IT, and consulting use prompt engineering all the time for their personal work,” Reyes said in an email reply to Computerworld. “As LLMs become increasingly integrated into various industries, their potential to enhance productivity is immense.”
By effectively employing prompt engineering, business users can optimize LLMs to perform their specific tasks more efficiently and accurately, ranging from customer support to content generation and data analysis, Reyes said.
The best known LLM at the moment — OpenAI’s GPT-3 — is the basis for the wildly popular ChatGPT chatbot. The GPT-3 LLM works on a 175-billion-parameter model that can generate text and computer code with short written prompts. OpenAI’s latest version, GPT-4, is estimated to have up to 280 billion parameters, making it much more likely to produce accurate responses.
Along with OpenAI’s GPT LLM, popular generative AI platforms include open models such as Hugging Face’s BLOOM and XLM-RoBERTa, Nvidia’s NeMO LLM, XLNet, Co:here and GLM-130B.
Because prompt engineering is a nascent and emerging discipline, enterprises are relying on booklets and prompt guides as a way to ensure optimal responses from their AI applications. There are even marketplaces emerging for prompts, such as the 100 best prompts for ChatGPT.
“People are even selling prompt suggestions,” said Arun Chandrasekaran, a distinguished vice president analyst at Gartner Research, adding that the recent spate of attention on generative AI has cast a spotlight on the need for better prompt engineering.
“It is a relatively newer domain,” he said. “Generative AI applications are often relying on self-supervised giant AI models and hence getting optimal responses from them needs more know-how, trials and additional effort. I am sure with growing maturity we might see better guidance and best practices from the AI model creators on effective ways to get the best out of the AI models and applications.”
Good input equals good output
The machine-learning component of LLMs automatically learns from data input. In addition to the data originally used to create a LLM, such as GPT-4, OpenAI created something called Reinforcement Learning Human Feedback, where a human being trains the model on how to give human-like answers.
For example, a user will frame a question to the LLM and then write the ideal answer. Then the user will ask the model the same question again, and the model will offer many other different responses. If it’s a fact-based question, the hope is the answer will remain the same; if it’s an open-ended question, the goal is to produce multiple, human-like creative responses.
For example, if a user asks ChatGPT to generate a poem about a person sitting on a beach in Hawaii, the expectation is it will generate a different poem each time. “So, what human trainers do is rate the answers from best to worst,” Chandrasekaran said. “That’s an input to the model to make sure it’s giving a more human-like or best answer, while trying to minimize the worst answers. But how you frame questions [has] a huge bearing on the output you get from a model.”
Organizations can train a GPT-model by ingesting custom data sets that are internal to that company. For example, they may take enterprise data and label and annotate it to increase its quality and then ingest it into the GPT-4 model. That fine tunes the model so it can answer questions specific to that organization.
Fine tuning cna also be industry specific. There is already a cottage industry emerging of start-ups that take GPT-4 and ingest a lot of information specific to a vertical industries, such as financial services.
“They may ingest Lexus-Nexus and Bloomberg information, they may ingest SEC information like 8K and 10K reports. But the point is that the model is learning a lot of language or information very specific to that domain,” Chandrasekaran said. “So, the fine tuning can happen either at an industry level or organizational level.”
For example, Harvey is a startup that’s partnered with OpenAI to create what it calls a “copilot for lawyers” or a version of ChatGPT for legal professionals. Lawyers can use the customized ChatGPT chatbot to discover any legal precedence for certain judges to prepare for their next case, Chandrasekaran said.
“I see the value of selling prompts not so much for language but for images,” Chandrasekaran said. “There are all kinds of models in generative AI space, including text-to-image models.”
For example, a user can request a generative AI model to produce an image of a guitar player strumming away on the moon. “I think the text-to-image domain has more of an emphasis in prompt marketplaces,” Chandrasekaran said.
Hugging Face as a one-stop LLM hub
While Hugging Face creates some of its own LLMs, including BLOOM, the organization’s primary role is to be a hub for third-party machine learning models, as GitHub does for code; Hugging Face currently hosts more than 100,000 machine-learning models, including a variety of LLMs from startups and big tech.
As new models are open-sourced, they are typically made available on the hub, creating a one-stop destination for emerging open-source LLMs.
To fine-tune a LLM for a specific business or industry using Hugging Face, users can leverage the organization’s “Transformers” APIs and “Datasets” libraries. For example, in financial services, a user could import a pre-trained LLM such as Flan-UL2, load a dataset of financial news articles, and use the “transformers” trainer to fine-tune the model to generate summaries of those articles. Integrations with AWS, DeepSpeed, and Accelerate further streamline and optimize the training.
The whole process can be done in fewer than 100 lines of code, according to Reyes.
Another way to get started with prompt engineering involves Hugging Face’s Inference API; it’s a simple HTTP request endpoint supporting more than 80,000 transformer models, according to Reyes. “This API allows users to send text prompts and receive responses from open-source models on our platform, including LLMs,” Reyes said. “If you want to go even simpler, you can actually send text without code by using the inference widget on the LLM models in the Hugging Face hub.”
Few-shot and zero-shot learning
LLM prompt engineering typically takes one of two forms: few-shot and zero-shot learning or training.
Zero-shot learning involves feeding a simple instruction as a prompt that produces an expected response from the LLM. It’s designed to teach an LLM to perform new tasks without using labeled data for those specific tasks. Think of zero-shot as reinforcement learning.
Conversely, few-shot learning uses a small amount of sample information or data to train the LLM for desired responses. Few-shot learning consists of three main components:
- Task Description: A short description of what the model should do, e.g. “Translate English to French”
- Examples: A few examples showing the model what it is expected to do, for example, “sea otter => loutre de mer”
- Prompt: The beginning of a new example, which the model should complete by generating the missing text, such as “cheese => “
In reality, there are few organizations today with custom training models to suit their needs because most models are still in an early stage of development, according to Gartner’s Chandrasekaran. And while few-shot and zero-shot learning can help, learning prompt engineering as a skill is important, both for IT and business users alike.
“Prompt engineering is an important skill to possess today since foundations models are good at few-shot and zero shot learning, but their performance is in many ways influenced by how we methodically craft prompts,” Chandrasekaran said. “Depending on the use case and domain, these skills will be important for both IT and business users.”
Most APIs let users apply their own prompt-engineering techniques. Whenever a user sends text to an LLM, there is potential for refining prompts to achieve specific outcomes, according to Reyes.
“However, this flexibility also opens the door to malicious use cases, such as prompt injection,” Reyes said. “Instances like [Microsoft’s] Bing’s Sydney demonstrated how people could exploit prompt engineering for unintended purposes. As a growing field of study, addressing prompt injection in both malicious use cases and ‘red-teaming’ for pen-testing will be crucial for the future, ensuring the responsible and secure use of LLMs across various applications.”