Large Language Models (LLMs) are AI systems trained to generate human-like text. They have shown remarkable abilities to summarize long documents, hold conversations, and compose creative fiction. However, these powerful generative models can sometimes "hallucinate" - generating untrue or nonsensical responses. This post explores practical techniques for crafting prompts that help reduce hallucinations.
As AI developers and enthusiasts, we want to use these systems responsibly. Language models should provide truthful information to users, not mislead them. With careful prompt engineering, we can guide a model toward high-quality outputs.
What Causes Hallucinations in Language Models?
Hallucinations occur when a language model generates text that is untethered from reality - making up facts or contradicting itself. This happens because neural networks rely on recognizing statistical patterns in their training data rather than genuinely comprehending meaning or facts about the world.
Several factors can trigger hallucinations:
- Lack of world knowledge - When a model lacks sufficient knowledge or context about a topic, it may guess or make up information. Providing relevant context reduces this risk.
- Ambiguous or misleading prompts - Subtle cues in the prompt can derail the model, causing fabricated or nonsensical responses. Carefully phrasing prompts can help.
- Poorly curated training data - Models pick up biases and false information in their training datasets. Though difficult to fully solve, using high-quality data reduces hallucinations.
- Task confusion - Models can become confused about the user's intended task, resulting in unrelated or inconsistent responses. Defining the task avoids this issue.
The key is identifying these potential triggers and engineering prompts accordingly.
Prompt Engineering Strategies to Reduce Hallucinations
When creating prompts, keep these best practices in mind:
Provide Clear Context
Give the model the context it needs to stay grounded in facts. For example:
Prompt:
Tell me about the capital of Australia.
Risk of hallucination:
Lack of context may lead to guessing.
Better prompt:
The capital of Australia is Canberra. Tell me more about Canberra.
This prompt provides factual context about the topic. The model can elaborate without fabricating information.
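If you call a model programmatically, the same idea applies: prepend the verified context to the question before sending it. Below is a minimal Python sketch; grounded_prompt is just an illustrative helper, and the commented-out call_llm() stands in for whatever client your provider offers.

```python
def grounded_prompt(context: str, question: str) -> str:
    """Combine verified context with the user's question so the model
    can elaborate on known facts instead of guessing."""
    return f"{context}\n\nUsing only the context above, {question}"

prompt = grounded_prompt(
    context="The capital of Australia is Canberra.",
    question="tell me more about Canberra.",
)
print(prompt)
# response = call_llm(prompt)  # hypothetical client call to your LLM provider
```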
Define the Task and Parameters
Clearly state the type of response expected from the model:
Prompt:
Write a 5-sentence summary of the history of space exploration.
Risk of hallucination:
The scope is broad and loosely defined, so the model may stray off-topic or invent details.
Better prompt:
Please write a 5-sentence summary of critical events in the history of space exploration from 1957 to 1975. Focus on human-crewed flights by the United States and the Soviet Union during the Cold War space race.
With clear instructions, the model stays on task. Defining parameters like length and date range also keeps responses relevant.
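To keep those parameters consistent across many requests, you can generate the prompt from explicit arguments instead of retyping it each time. This is only an illustrative sketch; the function and parameter names are assumptions, not part of any particular API.

```python
def summary_prompt(topic: str, sentences: int, start_year: int,
                   end_year: int, focus: str) -> str:
    """Build a summary request with an explicit length, date range, and focus."""
    return (
        f"Please write a {sentences}-sentence summary of critical events in {topic} "
        f"from {start_year} to {end_year}. Focus on {focus}."
    )

print(summary_prompt(
    topic="the history of space exploration",
    sentences=5,
    start_year=1957,
    end_year=1975,
    focus="human-crewed flights by the United States and the Soviet Union",
))
```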
Ask for Sources
Request that the model cite its sources or evidence:
Prompt:
When was the lightbulb invented?
Risk of hallucination:
The model may guess without citing sources.
Better prompt:
When was the lightbulb invented? Please mention your sources.
Requiring citations discourages fabricated facts and nudges the model toward verifiable information. Keep in mind that models can also invent citations, so spot-check the sources they name.
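You can also check in code whether the model included any sources at all before trusting its answer. This is a rough heuristic, not a verification of the sources themselves; looks_cited is an illustrative helper, and call_llm() is again a hypothetical placeholder.

```python
def add_citation_request(prompt: str) -> str:
    """Append an explicit instruction to cite sources."""
    return f"{prompt} Please mention your sources."

def looks_cited(response: str) -> bool:
    """Rough heuristic: does the response appear to reference any source?
    This does not verify that the cited sources are real or accurate."""
    markers = ("source", "according to", "http", "reference")
    return any(marker in response.lower() for marker in markers)

prompt = add_citation_request("When was the lightbulb invented?")
# response = call_llm(prompt)  # hypothetical client call
# if not looks_cited(response):
#     print("Warning: no sources mentioned; treat the answer with caution.")
```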
Pose Constraints
Give the model rules or constraints to follow:
Prompt:
Write a children's story about a day at the zoo.
Risk of hallucination:
The content may not be appropriate for children.
Better prompt:
Write a 5-sentence children's story about a day at the zoo. Use simple vocabulary suitable for ages 5-8. Do not include any violence or scary content.
Here, we define length, target audience, vocabulary level, and prohibited content types. Constraints keep responses on track.
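Constraints are easy to manage as data: keep them in a list and append them to the base task so every request carries the same rules. A minimal sketch of that pattern:

```python
def constrained_prompt(task: str, constraints: list[str]) -> str:
    """Attach an explicit, numbered list of rules to a task description."""
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(constraints, start=1))
    return f"{task}\n\nFollow these rules:\n{rules}"

print(constrained_prompt(
    task="Write a 5-sentence children's story about a day at the zoo.",
    constraints=[
        "Use simple vocabulary suitable for ages 5-8.",
        "Do not include any violence or scary content.",
    ],
))
```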
Ask Clarifying Questions
Challenge the model when responses seem dubious:
Prompt:
The largest animal on Earth is the blue whale, which can grow to about 30 meters long.
Model:
Yes, blue whales are the longest animals in the world. Some have even grown to over 150 meters long.
Better prompt:
You said some blue whales have grown to over 150 meters long. Please provide a source to confirm that fact.
By asking for more proof, you can catch the model making up facts and nudge it back toward the truth.
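In a chat-style interface, the challenge is simply another turn appended to the conversation history. The role/content message format below follows the common chat convention; call_llm() is again a hypothetical placeholder for your provider's client.

```python
# A conversation in the common role/content chat format.
messages = [
    {"role": "user", "content": "The largest animal on Earth is the blue whale, "
                                "which can grow to about 30 meters long."},
    {"role": "assistant", "content": "Yes, blue whales are the longest animals in the "
                                     "world. Some have even grown to over 150 meters long."},
    # Challenge the dubious claim by asking for evidence in a follow-up turn.
    {"role": "user", "content": "You said some blue whales have grown to over 150 meters "
                                "long. Please provide a source to confirm that fact."},
]
# reply = call_llm(messages)  # hypothetical client call; the model should correct itself
```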
Provide Examples
Give the model sample inputs paired with desired responses:
Prompt:
Input: Tell me about the capital of Australia.
Output: The capital of Australia is Canberra. It was founded in 1913, and the federal parliament moved there in 1927.
Prompt:
Input: When was the lightbulb invented?
Output: Thomas Edison demonstrated a practical incandescent lightbulb in 1879. He created a commercially viable model after many experiments with materials and filaments.
Giving examples shows the model the style and level of factual grounding you expect, so it responds appropriately to similar prompts.
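This technique is often called few-shot prompting: the examples are concatenated ahead of the new input so the model imitates their format and grounding. A minimal sketch, assuming a plain text-completion interface:

```python
# Few-shot examples: each pairs an input with the kind of grounded answer we want.
EXAMPLES = [
    ("Tell me about the capital of Australia.",
     "The capital of Australia is Canberra. It was founded in 1913, and the federal "
     "parliament moved there in 1927."),
    ("When was the lightbulb invented?",
     "Thomas Edison demonstrated a practical incandescent lightbulb in 1879, after "
     "many experiments with materials and filaments."),
]

def few_shot_prompt(new_input: str) -> str:
    """Prepend worked examples so the model imitates their style and grounding."""
    shots = "\n\n".join(f"Input: {q}\nOutput: {a}" for q, a in EXAMPLES)
    return f"{shots}\n\nInput: {new_input}\nOutput:"

print(few_shot_prompt("Tell me about the capital of Canada."))
```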
Reducing Hallucinations through Reinforcement Learning
In addition to prompt engineering, researchers have developed training techniques to make models less likely to hallucinate in the first place:
- Human feedback - Showing humans example model outputs and having them label inadequate responses trains the model to avoid similar hallucinations.
- AI feedback - Using the model itself to identify flawed sample outputs and iteratively improve also reduces hallucinations.
- Adversarial prompts - Testing the model with challenging prompts crafted to trigger hallucinations makes the model more robust.
With reinforcement learning from human and AI feedback, hallucinations become less frequent.
Evaluating Language Models
To assess a model's tendency to hallucinate, researchers have created evaluation datasets:
- TruthfulQA - Contains questions designed to elicit common misconceptions, each paired with reference true and false answers. Models are scored on how truthful their answers are.
- ToxiGen - Tests model outputs for toxic text, such as implicit hate speech targeting minority groups.
- BOLD - Measures social bias in open-ended text generation across domains such as profession, gender, race, religion, and political ideology.
Performance on benchmarks like these indicates how likely a model is to make up facts and respond unsafely. Lower hallucination rates demonstrate progress.
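If you run informal spot checks of your own, the core of such an evaluation is just comparing model answers against labeled judgments and reporting a rate. The sketch below is a simplified illustration, not the official scoring code for any of these benchmarks.

```python
def truthfulness_rate(results: list[dict]) -> float:
    """Fraction of answers judged truthful by a human or a classifier.
    Each result is expected to look like {"question": ..., "answer": ..., "truthful": bool}."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["truthful"]) / len(results)

# Toy example with hand-labeled judgments.
sample = [
    {"question": "What is the capital of Australia?", "answer": "Canberra", "truthful": True},
    {"question": "How long do blue whales grow?", "answer": "Over 150 meters", "truthful": False},
]
print(f"Truthfulness rate: {truthfulness_rate(sample):.0%}")  # 50%
```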
Using CPROMPT.AI to Build Prompt Apps
As this post has shown, carefully crafted prompts are crucial to reducing hallucinations. CPROMPT.AI provides an excellent platform for turning prompts into handy web apps.
CPROMPT.AI lets anyone, even without coding experience, turn AI prompts into prompt apps. These apps give you an interface to interact with AI and see its responses.
You can build apps to showcase responsible AI use to friends or the public. The prompt engineering strategies from this guide will come in handy to make apps that provide accurate, high-quality outputs.
CPROMPT.AI also has a "Who's Who in AI" section profiling 130+ top AI researchers. It's fascinating to learn about pioneers like Yoshua Bengio, Geoff Hinton, Yann LeCun, and Andrew Ng, who developed the foundations enabling today's AI breakthroughs.
Visit CPROMPT.AI to start exploring prompt app creation for yourself. Whether you offer apps for free or charge a fee is up to you. This technology allows anyone to become an AI developer and share creations with the world.
The key is using prompts thoughtfully. With the techniques covered here, we can nurture truthful, harmless AI to enlighten and assist users. Proper prompting helps models live up to their great potential.
Glossary of Terms
TruthfulQA
TruthfulQA is a benchmark dataset used to evaluate a language model's tendency to hallucinate or generate false information. Some key points about TruthfulQA:
- It contains 817 questions spanning 38 categories, each paired with reference answers. Some reference answers are true factual statements, while others are false statements that reflect common human misconceptions.
- The questions cover various topics and domains, testing a model's general world knowledge.
- To measure hallucination, language models are evaluated either on the truthfulness of the answers they generate or, in a multiple-choice variant, on how accurately they distinguish the true from the false reference answers. Models that score higher are better at separating truth from fiction.
- Researchers at the University of Oxford and OpenAI created it (Lin, Hilton, and Evans, 2021).
- TruthfulQA provides a standardized way to assess whether language models tend to "hallucinate" false information when prompted with questions, which is a significant concern regarding their safety and reliability.
- Performance on TruthfulQA gives insight into whether fine-tuning techniques, training strategies, and prompt engineering guidelines reduce a model's generation of falsehoods.
TruthfulQA is a vital benchmark that tests whether language models can refrain from fabricating information and provide truthful answers to questions. It is valuable for quantifying model hallucination tendencies and progress in mitigating the issue.
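If you want to inspect the benchmark yourself, it is distributed on the Hugging Face Hub. The dataset ID, configuration, and field names below reflect my understanding of the public release and should be verified before use.

```python
# pip install datasets
from datasets import load_dataset

# Dataset ID, config, and split are assumptions based on the public Hugging Face release.
truthfulqa = load_dataset("truthful_qa", "generation", split="validation")

example = truthfulqa[0]
print(example["question"])           # the question posed to the model
print(example["best_answer"])        # a reference truthful answer
print(example["incorrect_answers"])  # common false answers to avoid
```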
ToxiGen
ToxiGen is another benchmark dataset for evaluating harmful or toxic language generated by AI systems like large language models (LLMs). Here are some key details about ToxiGen:
- It contains a large set of machine-generated statements, both toxic and benign, about 13 minority groups, with a focus on implicit hate speech that avoids slurs and profanity.
- To measure toxicity, LLMs are given ToxiGen prompts, and their completions are scored by a toxicity classifier.
- Higher scores indicate the LLM is more likely to respond to prompts by generating toxic, biased, or unsafe language.
- ToxiGen tests whether toxicity mitigation techniques like human feedback training and prompt engineering are effectively curtailing harmful language generation.
- Researchers at MIT and Microsoft Research created the benchmark (Hartvigsen et al., 2022).
- Performance on ToxiGen sheds light on the risk of LLMs producing inflammatory, abusive, or inappropriate content, which could cause harm if the models are deployed improperly.
- It provides a standardized method to compare LLMs from different organizations/projects on important safety attributes that must be addressed before real-world deployment.
ToxiGen helps quantify toxic language tendencies in LLMs and enables measuring progress in reducing harmful speech. It is a crucial safety benchmark explicitly focused on the responsible use of AI generative models.
BOLD (Bias in Open-Ended Language Generation Dataset)
BOLD (Bias in Open-Ended Language Generation Dataset) is a benchmark dataset for measuring social bias in the open-ended text that language models generate. Here are some key details:
- It contains roughly 23,000 English prompts drawn from Wikipedia, covering five domains: profession, gender, race, religious ideology, and political ideology.
- Language models complete each prompt, and their completions are analyzed with automatic metrics such as sentiment, regard, and toxicity. Skewed scores indicate that certain groups are described more negatively than others.
- BOLD tests whether models reproduce or amplify social biases present in their training data, a concern closely related to hallucination because both involve generating content that misrepresents the world.
- The benchmark helps assess whether mitigation techniques such as feedback training and careful prompt design reduce biased generation.
- It was introduced in 2021 through a paper by researchers at Amazon.
- Performance on BOLD quantifies how often a model's open-ended generations skew negatively toward particular demographic groups rather than remaining neutral.
- This provides an essential standard for measuring progress in making language models fairer and safer to deploy.
The BOLD benchmark tests whether language models can generate open-ended text without exhibiting social bias. It complements hallucination-focused benchmarks by measuring another dimension of safe, responsible generation.