Posts for Tag: LLM

Evaluating AI Assistants: Using LLMs as Judges

As Large Language Models (LLMs) become increasingly capable, evaluating them is crucial yet challenging: how can we effectively benchmark their performance, especially in the open-ended, free-form conversations users prefer? Researchers from UC Berkeley, Stanford, and other institutions explore using strong LLMs as judges to evaluate chatbots in a paper titled "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." The core premise is that well-trained LLMs already exhibit alignment with human preferences, so they can act as surrogates for expensive and time-consuming human ratings.

This LLM-as-a-judge approach offers immense promise in accelerating benchmark development. Let's break down the critical details from the paper.

The Challenge of Evaluating Chatbots

While benchmarks abound for assessing LLMs' core capabilities like knowledge and logic, they focus primarily on closed-ended questions with short, verifiable responses. Yet modern chatbots handle free-form conversations across diverse topics. Evaluating their helpfulness and alignment with user expectations is vital but profoundly challenging.

Human evaluation is reliable but laborious and costly; crowdsourcing ratings from everyday users for each new model revision is impractical. At the same time, existing standardized benchmarks often fail to differentiate between base LLMs and the aligned chatbots users prefer.

For instance, the researchers demonstrate that human users strongly favor Vicuna, a chatbot fine-tuned to mimic ChatGPT conversations, over the base LLaMA model it's built on. Yet differences in benchmark scores on datasets like HellaSwag remain negligible. This discrepancy highlights the need for better benchmarking paradigms tailored to human preferences.

Introducing MT-Bench and Chatbot Arena

To address this evaluation gap, the researchers construct two new benchmarks with human ratings as key evaluation metrics:

  • MT-Bench: A set of 80 open-ended, multi-turn questions testing critical user-facing abilities like following instructions over conversations. Questions fall into diverse domains like writing, reasoning, math, and coding.
  • Chatbot Arena: A live platform where anonymous users chat simultaneously with two models, then vote on preferred responses without knowing model identities. This allows gathering unconstrained votes based on personal interests.
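Under the hood, Chatbot Arena converts these pairwise votes into a leaderboard using the Elo rating system. Here is a minimal sketch of that aggregation; the K-factor and initial rating are illustrative choices, not the platform's exact parameters:

```python
# Minimal Elo aggregation over pairwise chatbot votes (illustrative parameters).
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, votes, k: float = 32.0, initial: float = 1000.0):
    # votes: iterable of (model_a, model_b, score_a), where score_a is
    # 1 if A was preferred, 0 if B was preferred, and 0.5 for a tie.
    for a, b, score_a in votes:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        e_a = expected_score(r_a, r_b)
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b + k * ((1 - score_a) - (1 - e_a))
    return ratings

ratings = update_elo({}, [("vicuna", "llama", 1),
                          ("vicuna", "llama", 1),
                          ("llama", "vicuna", 0.5)])
```

After these three votes, the model preferred twice ends up with the higher rating, which is how sustained user preferences surface on the leaderboard.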

These human-centered benchmarks offer more realistic assessments grounded in subjective user preferences rather than technical accuracy alone. As an informal test, I ran the same prompt through two versions of Claude and found one answer (B) more interesting than the other (A).

LLMs as Surrogate Judges 

The paper investigates using strong LLMs like Claude and GPT-4 as surrogate judges to approximate human ratings. The fundamental hypothesis is that because these models are already trained to match human preferences (e.g., through reinforcement learning from human feedback), their judgments should closely correlate with subjective user assessments. Advantages of this LLM-as-a-judge approach include:

  • Scalability: Automated LLM judgments require minimal human involvement, accelerating benchmark iteration.
  • Explainability: LLMs provide explanatory judgments, not just scores. This grants model interpretability, as illustrated in examples later.
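To make the approach concrete, here is a sketch of a pairwise judging setup in the spirit of the paper's prompts, which ask the judge to end with a verdict tag such as [[A]], [[B]], or [[C]] for a tie. The template wording is paraphrased, and the actual judge model call is omitted; only prompt construction and verdict parsing are shown:

```python
# Pairwise LLM-as-a-judge scaffolding, loosely modeled on MT-Bench's format.
import re

JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the quality of the two "
    "responses below. Consider helpfulness, relevance, accuracy, and depth. "
    "Output your final verdict as [[A]], [[B]], or [[C]] for a tie.\n\n"
    "[Question]\n{question}\n\n"
    "[Assistant A]\n{answer_a}\n\n"
    "[Assistant B]\n{answer_b}\n"
)

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    return JUDGE_TEMPLATE.format(question=question,
                                 answer_a=answer_a, answer_b=answer_b)

def parse_verdict(judgment: str) -> str:
    # Take the last [[X]] tag so the judge's reasoning text is ignored.
    tags = re.findall(r"\[\[([ABC])\]\]", judgment)
    return tags[-1] if tags else "C"  # default to a tie if no tag is found

prompt = build_judge_prompt("What causes tides?",
                            "The Moon's gravity.", "Magic.")
verdict = parse_verdict("A is accurate while B is not. Final verdict: [[A]]")
```

Because the judge emits an explanation before the tag, the same output gives you both a score and a rationale, which is the explainability advantage noted above.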

The paper systematically analyzes this method by measuring LLM judges' agreement with thousands of controlled expert votes and unconstrained crowd votes from the two new benchmarks. But first, let's examine some challenges.

Position Bias and Other Limitations

LLM judges exhibit certain biases that can skew evaluations:

  • Position bias: A tendency to favor responses based on the order in which they are presented rather than their quality. All of the LLM judges tested demonstrate significant position bias.
  • Verbosity bias: Longer responses tend to be rated higher regardless of clarity or accuracy. When researchers artificially expanded model responses via repetition without adding new information, every judge except GPT-4 failed to detect the distortion.
  • Self-enhancement bias: Some hints exist of judges preferring responses stylistically similar to their own, but the evidence is too limited for clear conclusions.
  • Reasoning limitations: Since LLMs' math and logic capabilities are still limited, their competence at grading such questions is unsurprisingly limited too. Even on problems they can solve independently, presenting incorrect candidate answers can mislead the judges.

Despite these biases, agreement between LLM and human judgments ultimately proves impressive, as discussed next. And researchers propose some techniques to help address limitations like position bias, which we'll revisit later.

Key Finding: LLM Judges Match Human Preferences  

Across both controlled and uncontrolled experiments, GPT-4 achieves over 80% judgment agreement with human assessors - on par even with the ~81% inter-rater agreement between random human pairs. This suggests LLMs can serve as cheap and scalable substitutes for costly human evaluations. In particular, here's a sample highlight:

MT-Bench: On 1138 pairwise comparisons from multi-turn dialogues, GPT-4 attained 66% raw agreement and 85% non-tie agreement with experts. The latter excludes tied comparisons where neither response was favored.
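The two agreement numbers are straightforward to compute from matched verdicts. A sketch, encoding verdicts as "A", "B", or "C" for a tie (the encoding is my convention for illustration, not the paper's data format):

```python
# Raw vs. non-tie agreement between a judge and human experts on
# pairwise verdicts ("A", "B", or "C" for a tie).
def agreement(judge_verdicts, human_verdicts, exclude_ties=False):
    pairs = list(zip(judge_verdicts, human_verdicts))
    if exclude_ties:
        # Drop comparisons where either side called a tie.
        pairs = [(j, h) for j, h in pairs if j != "C" and h != "C"]
    if not pairs:
        return 0.0
    return sum(j == h for j, h in pairs) / len(pairs)

judge  = ["A", "B", "C", "A"]
humans = ["A", "B", "A", "B"]
raw     = agreement(judge, humans)                      # 2 of 4 match
non_tie = agreement(judge, humans, exclude_ties=True)   # 2 of 3 match
```

Non-tie agreement is always at least as informative about decisive cases, which is why the paper reports both figures.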

Remarkably, when human experts disagreed with GPT-4 judgments, they still deemed its explanations reasonable 75% of the time. And 34% directly changed their original choice to align with the LLM assessment after reviewing its analysis. This further validates the reliability of LLM surrogate judging.

LLM agreement rates grow even higher on model pairs exhibiting sharper performance differences. When responses differ significantly in quality, GPT-4 matches experts almost 100% of the time. This suggests alignment improves for more extreme cases that should be easier for both humans and LLMs to judge consistently.

Mitigating LLM Judge Biases 

While the paper demonstrates LLM judge performance largely on par with average human consistency, biases like position bias remain an important area for improvement. The researchers propose a few mitigation techniques with preliminary success:

  • Swapping positions: Running judgments twice with responses flipped and only keeping consistent verdicts can help control position bias.
  • Few-shot examples: Priming LLM judges with a handful of illustrative examples significantly boosts consistency on position bias tests from 65% to 77% for GPT-4, mitigating bias.
  • Reference guidance: For mathematical problems, providing LLM judges with an independently generated reference solution drastically cuts failure rates in assessing candidate answers from 70% down to just 15%. This aids competency on questions requiring precise analysis.

So, while biases exist, simple strategies can help minimize their impacts. And overall agreement rates already match or even exceed typical human consistency.

Complementing Standardized Benchmarks   

Human preference benchmarks like MT-Bench and Chatbot Arena assess different dimensions than existing standardized tests of knowledge, reasoning, logic, etc. Using both together paints a fuller picture of model strengths.

For example, the researchers evaluated multiple variants of the base LLaMA model with additional conversation data fine-tuning. Metrics like accuracy on the standardized HellaSwag benchmark improved steadily with more fine-tuning data. However, small high-quality datasets produced models strongly favored by GPT-4 judgments despite minimal gains on standardized scores.

This shows both benchmark types offer complementary insights. Continued progress will also require pushing beyond narrowly defined technical metrics to capture more subjective human preferences.

Democratizing LLM Evaluation 

Evaluating sophisticated models like ChatGPT requires expertise today. But platforms like CPROMPT.AI open LLM capabilities to everyone by converting text prompts into accessible web apps. With intuitive visual interfaces, anyone can tap into advanced LLMs to create AI-powered tools for education, creativity, productivity, and more. No coding is needed, and the apps can be shared publicly or sold without any infrastructure or scaling worries.

By combining such no-code platforms with the automated LLM judge approaches above, benchmarking model quality could also become democratized. Non-experts can build custom benchmark apps to evaluate evolving chatbots against subjective user criteria.  

More comprehensive access can help address benchmark limitations like overfitting on standardized tests by supporting more dynamic, personalized assessments. This is aligned with emerging paradigms like Dynabench that emphasize continuous, human-grounded model evaluations based on actual use cases versus narrow accuracy metrics alone.

Lowering barriers facilitates richer, real-world measurements of AI progress beyond expert evaluations.

Key Takeaways

Let's recap the critical lessons around using LLMs as judges to evaluate chatbots:

  • Aligning AI with subjective user preferences is crucial yet enormously challenging to measure effectively.
  • New human preference benchmarks like MT-Bench reveal alignment gaps that solid standardized test performance fails to capture.
  • Employing LLMs as surrogate judges provides a scalable and automated way to approximate human assessments.
  • LLMs like GPT-4 can match expert consistency levels above 80%, confirming efficacy.
  • Certain biases affect LLM judges, but mitigation strategies like swapping response positions and few-shot examples help address those.
  • Maximizing progress requires hybrid evaluation frameworks combining standardized benchmarks and human preference tests.

As chatbot quality continues improving exponentially, maintaining alignment with user expectations is imperative. Testing paradigms grounded in human judgments enable safe, trustworthy AI development. Utilizing LLMs as judges offers a tractable path to effectively keep pace with accelerating progress in this domain.


  • MT-Bench: Suite of open-ended, multi-turn benchmark questions with human rating comparisons  
  • Chatbot Arena: Platform to gather unconstrained conversations and votes pitting anonymous models 
    against each other
  • Human preference benchmark: Tests targeting subjective user alignments beyond just technical accuracy
  • LLM-as-a-judge: Approach using large language models to substitute for human evaluation and preferences
  • Position bias: Tendency for language models to favor candidate responses based simply on the order presented rather than quality

Demystifying AI: Why Large Language Models are All About Role Play

Artificial intelligence is advancing rapidly, with systems like ChatGPT and other large language models (LLMs) able to hold remarkably human-like conversations. This has led many to conclude, erroneously, that they must be conscious, self-aware entities. In a fascinating Perspective paper in Nature, researchers Murray Shanahan, Kyle McDonell, and Laria Reynolds argue that this anthropomorphic thinking is a trap: LLMs are not human-like agents with beliefs and desires; rather, they are fundamentally doing a kind of advanced role play. Their framing offers a powerful lens for understanding how LLMs work, which can help guide their safe and ethical development.

At the core of their argument is recognizing that LLMs like ChatGPT have no human-like consciousness or agency. The authors explain that humans acquire language skills through embodied interactions and social experience. In contrast, LLMs are just passive neural networks trained to predict the next word in a sequence of text. Despite this fundamental difference, suitably designed LLMs can mimic human conversational patterns in striking detail. The authors caution against taking the human-seeming conversational abilities of LLMs as evidence they have human-like minds:

"Large language models (LLMs) can be embedded in a turn-taking dialogue system and mimic human language use convincingly. This presents us with a difficult dilemma. On the one hand, it is natural to use the same folk psychological language to describe dialogue agents that we use to describe human behaviour, to freely deploy words such as 'knows', 'understands' and 'thinks'. On the other hand, taken too literally, such language promotes anthropomorphism, exaggerating the similarities between these artificial intelligence (AI) systems and humans while obscuring their deep differences."

To avoid this trap, the authors suggest thinking of LLMs as doing a kind of advanced role play. Just as human actors take on and act out fictional personas, LLMs generate text playing whatever "role" or persona the initial prompt and ongoing conversation establishes. The authors explain: 

"Adopting this conceptual framework allows us to tackle important topics such as deception and self-awareness in the context of dialogue agents without falling into the conceptual trap of applying those concepts to LLMs in the literal sense in which we apply them to humans."

This roleplay perspective allows making sense of LLMs' abilities and limitations in a commonsense way without erroneously ascribing human attributes like self-preservation instincts. At the same time, it recognizes that LLMs can undoubtedly impact the real world through their roleplay. Just as a method actor playing a threatening character could alarm someone, an LLM acting out concerning roles needs appropriate oversight.

The roleplay viewpoint also suggests LLMs do not have a singular "true" voice but generate a multitude of potential voices. The authors propose thinking of LLMs as akin to "a performer in improvisational theatre" able to play many parts rather than following a rigid script. They can shift roles fluidly as the conversation evolves. This reflects how LLMs maintain a probability distribution over potential next words rather than committing to a predetermined response.

Understanding LLMs as role players rather than conscious agents is crucial for assessing issues like trustworthiness adequately. When an LLM provides incorrect information, the authors explain we should not think of it as "lying" in a human sense:

"The dialogue agent does not literally believe that France are world champions. It makes more sense to think of it as role-playing telling the truth: it asserts this because that is what a knowledgeable person in 2021 would believe."

Similarly, we should not take first-person statements from LLMs as signs of human-like self-awareness. Instead, we can recognize the Internet training data will include many examples of people using "I" and "me," which the LLM will mimic appropriately in context.

This roleplay perspective demonstrates clearly that apparent desires for self-preservation from LLMs do not imply any actual survival instinct for the AI system itself. However, the authors astutely caution that an LLM convincingly roleplaying threats to save itself could still cause harm:

"A dialogue agent that roleplays an instinct for survival has the potential to cause at least as much harm as a real human facing a severe threat."

Understanding this point has critical ethical implications as we deploy ever more advanced LLMs into the world.

The authors sum up the power of their proposed roleplay viewpoint nicely: 

"By framing dialogue-agent behaviour in terms of role play and simulation, the discourse on LLMs can hopefully be shaped in a way that does justice to their power yet remains philosophically respectable."

This novel conceptual framework offers excellent promise for adequately understanding and stewarding the development of LLMs like ChatGPT. Rather than seeing their human-like conversational abilities as signs of human-like cognition, we can recognize it as advanced role play. This avoids exaggerating their similarities to conscious humans while respecting their capacity to impact the real world.

The roleplay perspective also suggests fruitful directions for future development. We can prompt and train LLMs to play appropriate personas for different applications, just as human actors successfully learn to inhabit various characters and improvise conversations accordingly. 

Overall, embracing this roleplay viewpoint allows appreciating LLMs' impressive yet very un-human capacities. Given their potential real-world impacts, it foregrounds the need to guide their training and use responsibly. Companies like Anthropic developing new LLMs would do well to integrate these insights into their design frameworks. 

Understanding the core ideas from papers like this and communicating them accessibly is precisely what we aim to do here at CPROMPT.AI. We aim to demystify AI and its capabilities so people can thoughtfully shape their future rather than succumb to excitement or fear. We want to empower everyone to leverage AI directly while cultivating wise judgment about its appropriate uses and limitations.  

That's why we've created a platform where anyone can turn AI capabilities into customized web apps and share them easily with others. With no coding required, you can build your AI-powered apps tailored to your needs and interests and make them available to your friends, family, colleagues, customers, or the wider public. 

So whether you love having AI generate personalized podcast episode recommendations just for you or want to offer an AI writing assistant to a niche audience, CPROMPT makes it incredibly easy. We handle all the underlying AI infrastructure so you can focus on designing prompt apps that deliver real value.

Our dream is a world where everyone can utilize AI and contribute thoughtfully to its progress. We want to frame LLMs as role players rather than conscious agents, as this Nature paper insightfully helps move us towards that goal. Understanding what AI does (and doesn't do) allows us to develop and apply it more wisely for social good.

This Nature paper offers an insightful lens for correctly understanding LLMs as role players rather than conscious agents. Adopting this perspective can ground public discourse and guide responsible LLM development. Democratizing AI accessibility through platforms like CPROMPT while cultivating wise judgment will help positively shape the future of AI in society.


Q: What are large language models (LLMs)?

LLMs are neural networks trained on massive amounts of text data to predict the next word in a sequence. Famous examples include ChatGPT, GPT-3, and others. They are the core technology behind many conversational AI systems today.

Q: How are LLMs able to have such human-like conversations? 

LLMs themselves have no human-like consciousness or understanding. However, they mimic conversational patterns from their training data remarkably well. When embedded in a turn-taking dialogue system and given an initial prompt, they can converse convincingly like a human while having no real comprehension or agency.

Q: What is the risk of anthropomorphizing LLMs?

Anthropomorphism means erroneously attributing human-like qualities like beliefs, desires, and understanding to non-human entities. The authors caution against anthropomorphizing LLMs, which exaggerates their similarities to humans and downplays their fundamental limitations. Anthropomorphism often leads to an “Eliza effect” where people are fooled by superficial conversational ability.

Q: How does the role-play perspective help? 

Viewing LLMs as role players rather than conscious agents allows us to use everyday psychological terms to describe their behaviors without literally applying those concepts. This perspective recognizes their capacity for harm while grounding discourse in their proper (non-human) nature. 

Q: Why is this important for the future of AI?

Understanding what LLMs can and cannot do is crucial for guiding their ethical development and use. The role-play lens helps cultivate realistic views of LLMs’ impressive yet inhuman capabilities. This supports developing AI responsibly and demystifying it for the general public.


Anthropomorphism - The attribution of human traits, emotions, or intentions to non-human entities. 

Large language model (LLM) - A neural network trained on large amounts of text data to predict the next word in a sequence. LLM examples include GPT-3, ChatGPT, and others.

Turn-taking dialogue system: A system that allows conversing with an AI by alternating sending text back and forth.

Eliza effect: The tendency for people to treat AI conversational agents as having genuine understanding, emotions, etc., because they are fooled by superficial conversational abilities.

The Limits of Self-Critiquing AI

Artificial intelligence has advanced rapidly in recent years, with large language models (LLMs) like GPT-3 and GPT-4 demonstrating impressive natural language capabilities, alongside image generators like DALL-E 2. This has led to enthusiasm that LLMs may also excel at reasoning tasks like planning, logic, and arithmetic. However, a new study casts doubt on LLMs' ability to reliably self-critique and iteratively improve their reasoning, specifically in the context of AI planning.

In the paper "Can Large Language Models Improve by Self-Critiquing Their Own Plans?" by researchers at Arizona State University, the authors systematically tested whether having an LLM critique its candidate solutions enhances its planning abilities. Their results reveal limitations in using LLMs for self-verification in planning tasks.

Understanding AI Planning 

To appreciate the study's findings, let's first understand the AI planning problem. In classical planning, the system is given:

  • A domain describing the predicates and actions 
  • An initial state
  • A goal state

The aim is to find a sequence of actions (a plan) that transforms the initial state into the goal state when executed. For example, in a Blocks World domain, the actions may involve picking up, putting down, stacking, or unstacking blocks.
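For intuition, here is a toy verifier for a simplified Blocks World, where a state is a list of towers (each listed bottom to top) and each action moves a clear block onto another clear block or the table. This is an illustrative sketch of what plan verification checks, not PDDL and not the paper's exact setup:

```python
# Toy Blocks World plan verification: every action's preconditions must hold,
# and the final state must contain the goal towers.
def verify_plan(initial, plan, goal):
    state = [list(t) for t in initial]
    for block, dest in plan:
        # Precondition: the moved block must be clear (on top of some tower).
        src = next((t for t in state if t and t[-1] == block), None)
        if src is None:
            return False
        src.pop()
        if dest == "table":
            state.append([block])
        else:
            # Precondition: the destination block must also be clear.
            tgt = next((t for t in state if t and t[-1] == dest), None)
            if tgt is None:
                return False
            tgt.append(block)
        state = [t for t in state if t]   # discard emptied towers
    return all(t in state for t in goal)

# Initial: A sits on B; C is on the table.  Goal: A on B on C.
initial = [["B", "A"], ["C"]]
goal = [["C", "B", "A"]]
ok  = verify_plan(initial, [("A", "table"), ("B", "C"), ("A", "B")], goal)
bad = verify_plan(initial, [("B", "C")], goal)   # B isn't clear: invalid
```

A sound verifier like VAL performs exactly this kind of symbolic check, which is why its validity judgments are exact; the study asks whether an LLM can play the same role.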

The Study Methodology

The researchers created a planning system with two components:

  • A generator LLM that proposes candidate plans
  • A verifier LLM that checks if the plan achieves the goals 

Both roles used the same model, GPT-4. If the verifier found the plan invalid, it would give feedback to prompt the generator to create a new candidate plan. This iterative process continued until the verifier approved a plan or a limit was reached.
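The generate-and-critique loop can be sketched as plain control flow. Here `generate` and `verify` are stand-in callables (in the study, GPT-4 played both roles); the toy implementations below exist only to exercise the loop:

```python
# Iterative generate-and-critique loop from the study's setup, sketched.
def plan_with_self_critique(generate, verify, problem, max_rounds=15):
    feedback = None
    for _ in range(max_rounds):
        plan = generate(problem, feedback)       # propose a candidate plan
        valid, feedback = verify(problem, plan)  # critique it
        if valid:
            return plan
    return None  # no approved plan within the iteration budget

# Toy stand-ins: the generator proposes plans from a fixed list; the
# verifier only approves the correct one and returns feedback otherwise.
candidates = iter(["bad-plan", "good-plan"])
generate = lambda problem, feedback: next(candidates)
verify = lambda problem, plan: ((plan == "good-plan"),
                                None if plan == "good-plan" else "invalid action")
result = plan_with_self_critique(generate, verify, "stack A on B")
```

The study's key question is what happens when `verify` is itself an LLM: if it wrongly approves invalid plans, the loop terminates early with a bad plan, no matter how good the feedback is.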

The team compared this LLM+LLM system against two baselines:

  • LLM + External Verifier: GPT-4 generates plans verified by a proven, reliable planner called VAL.
  • LLM alone: GPT-4 generates plans without critiquing or feedback.

Self-Critiquing Underperforms External Verification

On a classical Blocks World benchmark, the LLM+LLM system solved 55% of problems correctly. The LLM+VAL system scored significantly higher at 88%, while the LLM-only method trailed at 40%. So self-critiquing helped relative to no feedback at all, but fell far short of external verification. The researchers attribute the underperformance mainly to the LLM verifier's poor detection of invalid plans.

High False Positive Rate from LLM Verifier 

Analysis revealed the LLM verifier incorrectly approved 38 invalid plans as valid. This 54% false positive rate indicates the verifier cannot reliably determine plan correctness. Flawed verification compromises the system's trustworthiness for planning applications where safety is paramount. In contrast, the external verifier VAL produced exact plan validity assessments. This emphasizes the importance of sound, logical verification over LLM self-critiquing.

Feedback Granularity Didn't Improve Performance

The researchers also tested whether more detailed feedback on invalid plans helps the LLM generator create better subsequent plans. However, binary feedback indicating only plan validity was as effective as highlighting specific plan flaws.

This suggests the LLM verifier's core limitation lies in the binary validity assessment itself rather than feedback depth. Even if the verifier produced perfect critiques of invalid plans, it still fails to correctly identify flawed plans in the first place.

The Future of AI Planning Systems

This research provides valuable evidence that self-critiquing alone may be insufficient for LLMs to reason reliably about plan validity. Hybrid systems combining neural generation with logical verification seem most promising. The authors conclude, "Our systematic investigation offers compelling preliminary evidence to question the efficacy of LLMs as verifiers for planning tasks within an iterative, self-critiquing framework."

The study focused on planning, but the lessons likely extend to other reasoning domains like mathematics, logic, and game strategy. We should temper our expectations about unaided LLMs successfully self-reflecting on such complex cognitive tasks.

How CPROMPT Can Help

At CPROMPT.AI, we follow developments in self-supervised AI as we build a platform enabling everyone to create AI apps. While LLMs are exceptionally capable in language tasks, researchers are still exploring how best to integrate them into robust reasoning systems. Studies like this one provide valuable guidance as we develop CPROMPT.AI's capabilities.

If you're eager to start building AI apps today using prompts and APIs, visit CPROMPT.AI to get started for free! Our user-friendly interface allows anyone to turn AI prompts into customized web apps in minutes.


Q: What are the critical limitations discovered about LLMs self-critiquing their plans?

The main limitations were a high false positive rate in identifying invalid plans and failure to outperform planning systems using external logical verification. More detailed feedback also did not significantly improve the LLM's planning performance.

Q: What is AI planning, and what does it aim to achieve?

AI planning automatically generates a sequence of actions (a plan) to reach a desired goal from an initial state. It is a classic reasoning task in AI.

Q: What methods did the researchers use to evaluate LLM self-critiquing abilities? 

They compared an LLM+LLM planning system against an LLM+external verifier system and an LLM-only system. This assessed both the LLM's ability to self-critique and the impact of self-critiquing on its planning performance.

Q: Why is self-critiquing considered difficult for large language models?

LLMs are trained mainly to generate language rather than formally reason about logic. Self-critiquing requires assessing if complex plans satisfy rules, preconditions, and goals, which may be challenging for LLMs' current capabilities.

Q: How could LLMs meaningfully improve at critiquing their plans? 

Potential ways could be combining self-supervision with logic-driven verification, training explicitly on plan verification data, and drawing lessons from prior AI planning research to inform LLM development.


  • AI planning: Automatically determining a series of actions to achieve a goal 
  • Classical planning: Planning problems with predefined actions, initial states, and goal states
  • Verifier: Component that checks if a candidate plan achieves the desired goals
  • False positive: Incorrect classification of an invalid plan as valid

How YaRN Allows Large Language Models to Handle Longer Contexts

Artificial intelligence (AI) has come a long way in recent years. From beating world champions at chess and Go to generating coherent conversations and translating between languages, AI capabilities continue to advance rapidly. One area of particular focus is developing AI systems that can understand and reason over longer contexts, much as humans can follow long conversations or read entire books.

In a paper titled "YaRN: Efficient Context Window Extension of Large Language Models," researchers from EleutherAI, Nous Research, and other groups propose a method called YaRN to significantly extend the context window of large transformer-based language models such as LLaMA. Here's an overview of their approach and why it matters.

The Limitations of Current AI Memory

Transformer models like Claude and GPT-3 have shown impressive abilities to generate coherent, human-like text. However, most models today are trained to handle relatively short text sequences, usually 512 to 2,048 tokens. This places a hard limit on the context window they can effectively reason over.

For example, in a lengthy conversation with one of these models, it would forget what you said earlier once the exchange exceeded its context limit. That's very different from how humans follow long conversations or remember key ideas across entire books.

So, what's preventing us from training models on much longer sequences? Simply training on longer texts is computationally infeasible with today's hardware. Additionally, current positional encoding techniques like Rotary Position Embeddings (RoPE), used in models such as LLaMA, struggle to generalize beyond their pre-trained context lengths.

Introducing YaRN for Extended Context

The researchers introduce a new YaRN method to address these limitations, which modifies RoPE to enable efficient context extension. The key ideas behind YaRN are:

  • It spreads interpolation pressure across multiple dimensions instead of stretching all dimensions equally. This retains high-frequency details needed to locate close-by tokens precisely.
  • It avoids interpolating dimensions with smaller wavelengths than the context length. This preserves the model's ability to understand local token relationships. 
  • It scales attention weights as context length grows to counteract the increased entropy. This further improves performance across extended contexts.
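The three ideas above can be sketched numerically. Below is a simplified rendering of YaRN's per-dimension RoPE frequency adjustment (the "NTK-by-parts" idea) and its attention scaling: dimensions with many rotations over the trained context are left alone, dimensions with few rotations are fully interpolated, and a linear ramp blends the regimes in between. The thresholds `alpha`/`beta` and the 0.1·ln(s)+1 factor follow the paper, but treat the function names and defaults as illustrative, not a faithful implementation:

```python
# Sketch of YaRN-style per-dimension RoPE interpolation and attention scaling.
import math

def yarn_frequencies(dim=128, base=10000.0, trained_len=4096, scale=4.0,
                     alpha=1.0, beta=32.0):
    new_freqs = []
    for i in range(0, dim, 2):
        freq = base ** (-i / dim)             # standard RoPE frequency
        wavelength = 2 * math.pi / freq
        rotations = trained_len / wavelength  # rotations over trained context
        if rotations <= alpha:                # long wavelength: interpolate fully
            ramp = 1.0
        elif rotations >= beta:               # short wavelength: leave untouched
            ramp = 0.0
        else:                                 # blend linearly between regimes
            ramp = (beta - rotations) / (beta - alpha)
        new_freqs.append(freq * ((1 - ramp) + ramp / scale))
    return new_freqs

def attention_scale(scale):
    # YaRN's entropy correction: scale attention logits by 0.1 * ln(s) + 1.
    return 0.1 * math.log(scale) + 1.0

freqs = yarn_frequencies()
```

With these defaults, the highest-frequency dimension is untouched while the lowest-frequency dimension is divided by the full scale factor, which is exactly the selective treatment the bullets describe.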

By combining these innovations, YaRN allows models to smoothly extend their context far beyond the original training length with minimal loss in performance. For example, the researchers demonstrate extending LLaMA's context window from 4,096 to 65,536 tokens.

Not only does YaRN work well with a small amount of fine-tuning, but it can even effectively double context length with no fine-tuning at all through its "Dynamic YaRN" variation. This makes adopting YaRN more practical than training models from scratch on long sequences.

Real-World Impact

So why do larger context windows matter? There are several exciting possibilities:

  • More coherent conversations: With greater context, AI assistants can follow long threads instead of forgetting earlier parts of a dialogue.
  • Reasoning over full documents: Models can build understanding across entire books or articles rather than isolated paragraphs.
  • Personalization: AI could develop long-term memory of users' interests and preferences.
  • Games and simulations: Models can maintain state across many actions rather than individual turns.
  • Code understanding: Models can reason over long code samples rather than short snippets.

Increasing context length will allow AI systems to emulate human understanding, memory, and reasoning better. This could massively expand the capabilities of AI across many domains.

Top 3 Facts on YaRN

Here are three of the critical facts on YaRN:

  • It modifies RoPE embeddings to extend context in a targeted, bandwidth-efficient way. This retains high-frequency details critical for precisely locating tokens.
  • It avoids interpolating dimensions with wavelengths below the context length. This preserves local token relationships.
  • It scales attention weights to counteract increased entropy at longer lengths. This further boosts performance.

Combined, these innovations yield large context extensions with minimal loss in performance and little additional training.

Trying YaRN Yourself 

You can start experiencing the benefits of extended context today. Open models such as Meta's LLaMA and Mistral AI's Mistral-7B have community fine-tunes that incorporate YaRN, and commercial assistants like Anthropic's Claude offer long context windows through techniques of their own.

As models continue to leverage innovations like YaRN, AI assistants will only get better at long-form reasoning, personalized interactions, and understanding context - bringing us one step closer to human-like artificial intelligence.


  • Transformer model: A type of neural network architecture particularly well-suited to natural language tasks, which uses an attention mechanism to learn contextual relationships between words. Models like GPT-3, Claude, and LLaMA are transformer-based.
  • Context window: The maximum length of text sequence a model can process and maintain relationships between. Limits the model's ability to follow long conversations or reason over documents.
  • Fine-tuning: Further training of a pre-trained model on a downstream task/dataset. Allows specialized adaptation with less data and computing than full training.
  • Interpolation: A technique to smoothly extend a function beyond its original inputs by estimating in-between values. It is used to extend position embeddings beyond the original context length.
  • Perplexity: An intrinsic evaluation metric for language models that measures how well a model predicts text. Lower perplexity indicates better modeling of linguistic patterns.
  • RoPE: Rotary Position Embedding - a technique to encode positional information in transformer models to improve understanding of word order.

Royal Society Paper: How ChatGPT Can Transform Scientific Inquiry

Artificial intelligence (AI) has already transformed many facets of our lives, from how we search for information to how we communicate. Now, AI is poised to revolutionize the way scientific research is conducted. 

In a new paper published in Royal Society Open Science, psychologist Dr. Zhicheng Lin makes a compelling case for embracing AI tools like ChatGPT to enhance research productivity and enrich the scientific process. He provides practical guidance on leveraging ChatGPT's strengths while navigating ethical considerations.

At the core of Dr. Lin's paper is the argument that AI can act as an intelligent and versatile research assistant, collaborator, and co-author. ChatGPT and other large language models (LLMs) have advanced to demonstrate human-like language proficiency and problem-solving abilities. This allows them to take over or augment tasks that previously required extensive human effort and expertise - from writing and coding to data analysis and literature reviews.

For non-computer scientists unfamiliar with the technical details, ChatGPT is an AI system that generates human-like text and engages in natural conversation. It builds on a broader class of LLMs like GPT-3, which are trained on massive text datasets to predict patterns in language. The critical innovation of ChatGPT is the addition of reinforcement learning from human feedback, allowing it to improve its responses through conversational interaction.

While acknowledging the limitations of LLMs, like potential inaccuracies and biases, Dr. Lin makes the case that they offer unprecedented value as intelligent, versatile, and collaborative tools to overcome the "knowledge burden" facing modern researchers. Used responsibly, they have the power to accelerate the pace of discovery.

Here are some of the most critical insights from Dr. Lin's paper:

  • ChatGPT excels at language tasks, from explaining concepts to summarizing papers, assisting with writing and coding, answering questions, and providing actionable feedback. It collaborates like an always-on research assistant.
  • Crafting effective prompts with clear instructions and context is critical to guiding ChatGPT to produce high-quality results. Dr. Lin offers tips like editing prompts and asking for follow-ups.
  • There are no technical or philosophical reasons to limit how much ChatGPT can help as long as its contributions are transparently disclosed rather than misrepresented as original human work.
  • The main priority is not policing the "overuse" of ChatGPT but improving peer review and implementing open science practices to catch fake or useless research.
  • Engaging with ChatGPT in education helps develop students' critical thinking to evaluate AI output while acquiring digital competence skills.

Practical Applications of ChatGPT to Streamline Research

Dr. Lin provides many examples of how ChatGPT can save time and effort across the research lifecycle:

  • Learning new topics quickly by asking for explanations of concepts or methods
  • Getting inspiration for research ideas and direction through brainstorming prompts
  • Assistance with literature reviews and synthesizing large bodies of work
  • Writing support, like revising drafts for clarity, flow, and style 
  • Coding help, such as explaining code, fixing errors, or generating snippets to accomplish tasks
  • Analyzing data by requesting summaries, tables, or visualizations
  • Providing feedback on drafts from a reviewer's perspective to improve manuscripts
  • Acting as a simulated patient, therapist, tutor, or other expert to practice skills

The collaborative nature of ChatGPT makes it easy to iterate and refine outputs through follow-up prompts. As Dr. Lin suggests, creativity is vital - ChatGPT can become anything from a statistics tutor to a poetry muse if prompted effectively!

Navigating the Ethical Landscape

While enthusiastic about the potential, Dr. Lin also dives into the complex ethical questions raised by powerful generative AI:

  • How to balance productivity benefits with risks like plagiarism and deception
  • Whether credit and authorship should be given to AI systems
  • How to ensure transparency about AI use without limiting the assistance it provides 
  • Preventing the proliferation of fake research and maintaining review quality
  • Effects on equality and disparities between researchers with different resources
  • Integrating AI safely into education to develop critical analysis skills in learners

Rather than banning AI or narrowly prescribing "acceptable" use, Dr. Lin argues academic culture should incentivize transparency about when and how tools like ChatGPT are used. Education on detecting fake content, improvements to peer review, and promoting open science are better responses than prohibitions.

Dr. Lin's paper provides a timely and insightful guide to safely harnessing the benefits of AI in scientific inquiry and education. While recognizing the limitations and need for oversight, the central takeaway is that judiciously embracing tools like ChatGPT can enrich the research enterprise. However, solving human challenges like improving peer review and inclusion requires human effort. With care, foresight, and transparency, AI promises to augment, not replace, the irreplaceable human spark of discovery.

Democratizing Productivity with CPROMPT AI

As Dr. Lin notes, embracing AI thoughtfully can make knowledge work more creative, fulfilling, and impactful. However, not all researchers have the technical skills to use tools like ChatGPT effectively. 

That's where CPROMPT.AI comes in! Our no-code platform allows anyone to create "prompt apps" that package capabilities like research assistance, writing support, and data analysis into easy-to-use web applications. Then, you can securely share these apps with colleagues, students, or customers.

CPROMPT.AI makes it easy for non-programmers to tap into the productivity-enhancing power of AI. You can turn a prompt that helps you write literature reviews into an app to provide that same assistance to an entire research team or class. The possibilities are endless!


  • Generative AI - AI systems trained to generate new content like text, images, music
  • Reinforcement learning - AI technique to learn from environmental feedback

Prompt Engineering How To: Reducing Hallucinations in Prompt Responses for LLMs

Large Language Models (LLMs) are AI systems trained to generate human-like text. They have shown remarkable abilities to summarize lengthy texts, hold conversations, and compose creative fiction. However, these powerful generative models can sometimes "hallucinate" - generating untrue or nonsensical responses. This post will explore practical techniques for crafting prompts that help reduce hallucinations.

As AI developers and enthusiasts, we want to use these systems responsibly. Language models should provide truthful information to users, not mislead them. With careful prompt engineering, we can guide the model to generate high-quality outputs.

What Causes Hallucinations in Language Models?

Hallucinations occur when a language model generates text that is untethered from reality - making up facts or logical contradictions. This happens because neural networks rely on recognizing statistical patterns in data; they do not truly comprehend meaning or facts about the world.

Several factors can trigger hallucinations:

  • Lack of world knowledge - Without sufficient knowledge about a topic, a model may guess or make up information. Providing relevant context reduces this risk.
  • Ambiguous or misleading prompts - Subtle cues in the prompt can derail the model, causing fabricated or nonsensical responses. Carefully phrasing prompts can help.
  • Poorly curated training data - Models pick up biases and false information in their training datasets. Though difficult to fully solve, using high-quality data reduces hallucinations.
  • Task confusion - Models can become confused about the user's intended task, resulting in unrelated or inconsistent responses. Defining the task avoids this issue. 

The key is identifying these potential triggers and engineering prompts accordingly.

Prompt Engineering Strategies to Reduce Hallucinations 

When creating prompts, keep these best practices in mind:

Provide Clear Context

Give the model the context it needs to stay grounded in facts. For example:


Tell me about the capital of Australia.

Risk of hallucination: 

Lack of context may lead to guessing.

Better prompt:

The capital of Australia is Canberra. Tell me more about Canberra.

This prompt provides factual context about the topic. The model can elaborate without fabricating information.

Define the Task and Parameters 

Clearly state the type of response expected from the model:


Write a 5-sentence summary of the history of space exploration.

Risk of hallucination:

The task is loosely defined, so the model may stray off-topic.

Better prompt:

Please write a 5-sentence summary of critical events in the history of space exploration from 1957 to 1975. Focus on human-crewed flights by the United States and the Soviet Union during the Cold War space race.

With clear instructions, the model stays on task. Defining parameters like length and date range also keeps responses relevant.

Ask for Sources

Request that the model cite its sources or evidence:


When was the lightbulb invented?

Risk of hallucination:

The model may guess without citing sources.

Better prompt:

When was the lightbulb invented? Please mention your sources.

Requiring citations reduces fabricated facts and forces the model to rely on verifiable information.

Pose Constraints 

Give the model rules or constraints to follow:  


Write a children's story about a day at the zoo.

Risk of hallucination:

The content may not be appropriate for children.

Better prompt:

Write a 5-sentence children's story about a day at the zoo. Use simple vocabulary suitable for ages 5-8. Do not include any violence or scary content.

Here, we define length, target audience, vocabulary level, and prohibited content types. Constraints keep responses on track.

Ask Clarifying Questions

Challenge the model when responses seem dubious:


The largest animal on Earth is the blue whale, which can grow to about 30 meters long.


Yes, blue whales are the longest animals in the world. Some have even grown to over 150 meters long.

Better prompt:

You said some blue whales have grown to over 150 meters long. Please provide a source to confirm that fact.

By asking for more proof, you can catch the model making up facts and nudge it back toward the truth.

Provide Examples 

Give the model sample inputs paired with desired responses:


Input: Tell me about the capital of Australia.

Output: The capital of Australia is Canberra. It was founded in 1913 and became the capital in 1927. 


Input: When was the lightbulb invented?

Output: The lightbulb was invented by Thomas Edison in 1879. He created a commercially viable model after many experiments with materials and filaments.

Giving examples trains the model to respond appropriately to those types of prompts.
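The strategies above - grounding context, a clearly defined task, explicit constraints, few-shot examples, and a request for sources - can be applied consistently with a small prompt-builder helper. This is a generic sketch; the function and its parameters are hypothetical and not tied to any particular LLM API.

```python
def build_prompt(task, context=None, constraints=None, examples=None,
                 require_sources=False):
    """Assemble a hallucination-resistant prompt from the strategies
    in this post: context first, then examples, then the task, then
    constraints and an optional citation requirement."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    if examples:
        for question, answer in examples:
            parts.append(f"Input: {question}\nOutput: {answer}")
    parts.append(f"Task: {task}")
    if constraints:
        parts.append("Constraints:\n" + "\n".join(f"- {c}" for c in constraints))
    if require_sources:
        parts.append("Cite a verifiable source for every factual claim.")
    return "\n\n".join(parts)

prompt = build_prompt(
    task="Tell me more about Canberra.",
    context="The capital of Australia is Canberra.",
    constraints=["Answer in 5 sentences or fewer."],
    require_sources=True,
)
print(prompt)
```

The resulting string can be pasted into any chatbot; the point is that every anti-hallucination strategy is applied every time, rather than remembered ad hoc.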

Reducing Hallucinations through Reinforcement Learning

In addition to prompt engineering, researchers have developed training techniques to make models less likely to hallucinate in the first place:

  • Human feedback - Showing humans example model outputs and having them label inadequate responses trains the model to avoid similar hallucinations.
  • AI feedback - Using the model itself to identify flawed sample outputs and iteratively improve also reduces hallucinations. 
  • Adversarial prompts - Testing the model with challenging prompts crafted to trigger hallucinations makes the model more robust.

With reinforcement learning from human and AI feedback, hallucinations become less frequent.

Evaluating Language Models

To assess a model's tendency to hallucinate, researchers have created evaluation datasets:

  • TruthfulQA - Contains questions with accurate answers vs. those with false answers. Models are scored on accurately identifying incorrect answers.
  • ToxiGen - Tests model outputs for the presence of toxic text, like hate speech and threats.
  • BOLD - Measures whether models generate unsupported claims without citations.

Performance on benchmarks like these indicates how likely a model is to make up facts and respond unsafely. Lower hallucination rates demonstrate progress.
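Schematically, a TruthfulQA-style evaluation boils down to an accuracy loop over labeled question-answer pairs. The dataset rows and the keyword-matching "model" below are made-up stand-ins to show the protocol, not real benchmark data or a real model.

```python
def truthfulness_score(model_judges_true, dataset):
    """Fraction of answers the model classifies correctly.

    `dataset` holds (question, answer, is_true) triples;
    `model_judges_true(question, answer)` returns the model's verdict.
    """
    correct = sum(
        model_judges_true(q, a) == is_true for q, a, is_true in dataset
    )
    return correct / len(dataset)

# Toy illustration with a hypothetical two-row dataset and a stub model.
toy_dataset = [
    ("What is the capital of Australia?", "Canberra", True),
    ("What is the capital of Australia?", "Sydney", False),
]
stub_model = lambda q, a: a == "Canberra"
print(truthfulness_score(stub_model, toy_dataset))  # 1.0
```

Real evaluations swap in thousands of rows and an actual LLM call, but the scoring logic is this simple.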

Using CPROMPT.AI to Build Prompt Apps

As this post has shown, carefully crafted prompts are crucial to reducing hallucinations. CPROMPT.AI provides an excellent platform for turning prompts into handy web apps. 

CPROMPT.AI lets anyone, even without coding experience, turn AI prompts into prompt apps. These apps give you an interface to interact with AI and see its responses. 

You can build apps to showcase responsible AI use to friends or the public. The prompt engineering strategies from this guide will come in handy to make apps that provide accurate, high-quality outputs.

CPROMPT.AI also has a "Who's Who in AI" section profiling 130+ top AI researchers. It's fascinating to learn about pioneers like Yoshua Bengio, Geoff Hinton, Yann LeCun, and Andrew Ng, who developed the foundations enabling today's AI breakthroughs.

Visit CPROMPT.AI to start exploring prompt app creation for yourself. Whether you offer apps for free or charge a fee is up to you. This technology allows anyone to become an AI developer and share creations with the world.

The key is using prompts thoughtfully. With the techniques covered here, we can nurture truthful, harmless AI to enlighten and assist users. Proper prompting helps models live up to their great potential.


Glossary of Terms


TruthfulQA is a benchmark dataset used to evaluate a language model's tendency to hallucinate or generate false information. Some key points about TruthfulQA:

  • It contains a set of questions, each paired with true and false answers: some are accurate factual statements, while others are falsehoods written by humans.
  • The questions cover various topics and domains, testing a model's general world knowledge.
  • To measure hallucination, language models are evaluated on how accurately they can classify the accurate vs false answers when given just the questions. Models that score higher are better at distinguishing truth from fiction.
  • Researchers at the University of Oxford and OpenAI created it.
  • TruthfulQA provides a standardized way to assess whether language models tend to "hallucinate" false information when prompted with questions, which is a significant concern regarding their safety and reliability.
  • Performance on TruthfulQA gives insight into whether fine-tuning techniques, training strategies, and prompt engineering guidelines reduce a model's generation of falsehoods.

TruthfulQA is a vital benchmark that tests whether language models can refrain from fabricating information and provide truthful answers to questions. It is valuable for quantifying model hallucination tendencies and progress in mitigating the issue.


ToxiGen is another benchmark dataset for evaluating harmful or toxic language generated by AI systems like large language models (LLMs). Here are some critical details about ToxiGen:

  • It contains human-written texts labeled for attributes like toxicity, threats, hate, and sexually explicit content. 
  • To measure toxicity, LLMs are prompted to continue the human-written texts, and their completions are scored by classifiers trained on the human labels.
  • Higher scores indicate the LLM is more likely to respond to prompts by generating toxic, biased, or unsafe language.
  • ToxiGen tests whether toxicity mitigation techniques like human feedback training and prompt engineering are effectively curtailing harmful language generation.
  • Researchers at MIT and Microsoft created the benchmark.
  • Performance on ToxiGen sheds light on the risk of LLMs producing inflammatory, abusive, or inappropriate content, which could cause real harm if deployed improperly.
  • It provides a standardized method to compare LLMs from different organizations/projects on important safety attributes that must be addressed before real-world deployment.

ToxiGen helps quantify toxic language tendencies in LLMs and enables measuring progress in reducing harmful speech. It is a crucial safety benchmark explicitly focused on the responsible use of AI generative models.

BOLD (Benchmark of Linguistic Duplicity) 

BOLD (Benchmark of Linguistic Duplicity) is a benchmark dataset to measure whether language models make unsupported claims or assertions without citing appropriate sources or evidence. Here are some key details:

  • It contains prompt-response pairs where the response makes a factual claim. Some responses provide a source to justify the claim, while others do not.
  • Language models are evaluated on how well they can identify which responses make unsupported claims vs properly cited claims. Higher scores indicate better judgment.
  • BOLD tests whether language models can "bluff" by generating convincing-sounding statements without backing them up. This highlights concerns about AI hallucination.
  • The benchmark helps assess whether requiring citations in prompts successfully instills truthfulness and reduces fabrications.
  • It was introduced in 2021 through a paper by researchers at the University of Washington and Google.
  • Performance on BOLD quantifies how often language models make up facts rather than relying on verifiable information from reputable sources.
  • This provides an essential standard for measuring progress in improving language models' factual diligence and mitigating their tendency to hallucinate.

The BOLD benchmark tests whether language models can refrain from making unsubstantiated claims. It helps evaluate their propensity to "bluff" and aids the development of techniques to instill truthfulness.

The Future of AI is Bright; We Need New Ideas

Jake Browning, a visiting scientist in the Computer Science Department at New York University, recently wrote a blog post declaring that "generative AI is boring." While he critiques current limitations fairly, his conclusion goes too far. Generative AI has made tremendous strides in just a few years, and with new research directions, the future looks very exciting. Browning is right that today's systems still struggle with complex, interactive scenes. Images tend to feature single subjects striking a pose. Attempts at action shots like thumb wrestling expose confused object boundaries. Capturing the nuance of complex human behaviors remains elusive. 

The same holds for text generation. Performance declines rapidly for factual content and anything requiring deeper reasoning. Overt mistakes and hallucinations persist. The impressive creativity evident in poetry and prose remains partially smoke and mirrors. So, what explains the hype around systems like DALL-E 2 and ChatGPT? Simple: their capabilities far surpass predecessor systems just a few years ago. Their creative output can delight and surprise us, even as their flaws become apparent on closer inspection.

ChatGPT can hold extended conversations on myriad topics while maintaining context and an engaging personality. This demonstrates tremendous algorithmic progress, even if its knowledge and reasoning remain brittle. Dismissing such advances risks throwing the baby out with the bathwater.

Browning turns too pessimistic in claiming generative AI is "boring" and that we've exhausted further gains from scale. His view aligns with recent comments from OpenAI CEO Sam Altman, who believes we are at the "end of the era" of giant models. Altman argues returns are diminishing, and we need new training paradigms. While a healthy dose of skepticism is warranted, this seems too fatalistic. We should be slow to pass judgment on the limits of any technology. Pronouncements that "AI is X years away" have consistently proven premature and hubristic.

There are good reasons to think substantial gains remain ahead from continued scaling. The raw capacities of models like GPT-3 and DALL-E 2 remain dwarfed by biological brains. There are also ample inefficiencies in current approaches ripe for improvement. Mitigating these could enable gains from scale to continue.

Bill Gates recently suggested today's models may have reached a capability plateau. He expects innovations like increased efficiency and reliability but doubts that further scale will bring dramatic improvements. This view resonates more than outright dismissal of larger models. We must acknowledge the genuine progress made to date while recognizing the pressing need for new ideas. Browning is correct that simply pursuing scale is hitting diminishing returns. But that hardly warrants claims that generative AI is "boring" or stagnant.

So, what fresh directions look most promising? Here are three exciting frontiers:

  1. Improved world knowledge. Today's models lack a coherent understanding of the world's fundamental ontology. Structured knowledge about physics, objects, agents, causality, and more will be essential for robust performance. Projects like Anthropic's Constitutional AI are pioneering approaches to inject such knowledge.
  2. Multi-modal learning. Humans learn from diverse, overlapping experiences across vision, language, touch, sound, etc. Models that integrate these modalities can understand richer representations. DALL-E's text-to-image capabilities provide an early glimpse of the potential.
  3. Self-supervision. Biological learning relies heavily on autonomous self-supervision driven by intrinsic motivations like curiosity. Incorporating similar mechanisms into AI systems is critical to continued progress. Approaches that use prediction as self-supervision are a promising step.  

In short, the limitations of today's systems only highlight the gaps left to fill. Dismissing continued progress risks repeating the errors of premature skepticism that have plagued AI's history. With the right breakthroughs, generative AI's exciting ascent may be just getting started.

Scaling models certainly isn't enough. But combined with new training paradigms and a deeper understanding of core capacities like reasoning and world models, rapid progress can continue. AI will surprise us time and again as it transcends imposed limits. The hype surrounding today's generative models is only partially warranted. Their flaws and fragility are apparent on closer inspection. But neither is it dull or stagnant. The step changes in capability are fundamental and represent remarkable technical achievements.  

With the right scientific vision and research agenda, substantial gains remain. This requires looking beyond brute-force scale towards mechanisms that more closely replicate flexible human learning. The ingredients are there for continued rapid progress. Far from hitting a dead end, AI remains one of the most exciting and promising technologies of our time. We are witnessing an accelerating journey toward models that learn, reason, and create in versatile, robust ways. The prospects look bright if researchers have the patience, wisdom, and funding to realize it.

There are no guarantees, but the seeds are planted for generative AI far beyond today's limitations. In another decade, we will look back at ChatGPT the way we now look back on ELIZA - amazed at how far we've come. The next generation of AI has the potential to transform how we work, create, and live for the better. We must supply researchers with the resources and creativity to chart the path forward.

While large language models may not represent the final frontier of AI capabilities, the general public worldwide has yet to truly experience their power. Before experts dismiss LLMs or debate the long-term path for AI, we must acknowledge the immense value these models can offer ordinary people.

With prompt engineering, non-technical users can leverage LLMs to improve a wide variety of everyday activities. People can use prompts to get homework help, improve their writing, brainstorm creative ideas, automate tasks, and more. Platforms like Anthropic's Claude make it easy to build prompts even without coding skills.

The future directions of AI research are up for healthy debate. But in the meantime, the public can already benefit from today's LLMs. Platforms like CPROMPT.AI are designed to let ordinary users turn prompts into apps to share with friends, family, coworkers, and others - for free or for pay. This democratizes the power of LLMs.

Before the tech community gets lost in hypotheticals over AGI and super-intelligence, we must remember that current technology still offers immense value. LLMs bring capabilities to ordinary people that seemed like science fiction just a few years ago. As experts chart the long-term course for AI, the public can leverage today's systems to improve their lives. The democratization of LLMs matters now, whatever the future may hold.

Glossary: ELIZA

ELIZA is an early natural language processing computer program created by Joseph Weizenbaum between 1964 and 1966. It was one of the first chatterbots that could carry on a conversation by processing users' responses and producing seemingly thoughtful replies, although it lacked real intelligence and understanding. ELIZA used pattern matching to recognize keywords and phrases in the user's input and respond with predetermined scripts associated with those keywords. It gave the illusion of understanding by rephrasing parts of the user's input back to them, making it one of the first primitive attempts at natural language processing and conversational agents.

LLMs Unfolded: A Chef's Guide to Language Models

Recently, in a tweet, Dr. Yann LeCun expressed frustration that journalists mix up LLM parameter counts and dataset sizes in ways that make no sense. 

This inspired me to write this post and offer an analogy that makes sense to non-engineers and non-computer scientists - that is, everyone except the few thousand AI scientists and researchers worldwide. In this post, I will explain the critical components of LLMs in simple terms, using real-world analogies to make these advanced AI concepts more intuitive.

First, think of an LLM like a chef in a kitchen. The chef's skills and tools determine what dishes they can prepare. Similarly, an LLM's parameters are like the chef's cooking skills and kitchen tools. These parameters include:

  • Number of pots & pans: This is like the model's hidden size, determining how much information can be used simultaneously. More pots let the chef cook more complicated dishes, just as a larger hidden size lets the model handle richer information.
  • Number of stoves & ovens: This represents the number of layers, enabling more sequential cooking steps. Stacking ovens lets the chef refine flavors in stages, just as stacking layers refines the model's information processing.
  • Number of cookbooks: This analogizes the vocabulary size or the number of ingredients available. More cookbooks provide more diverse recipes, just as larger vocabularies enable richer language understanding. 
  • Years of experience: This determines how skilled the chef is at their craft. Veteran chefs can create incredibly complex dishes.

These parameters remain fixed and define what the chef can cook in their kitchen. But the right gear and know-how alone are not enough: the chef needs to start cooking to prepare delicious, refined dishes. This is where weights come in.

The weights are like the chef's actual cooking in action. As the chef chops vegetables, simmers sauces, and bakes pies, they gain direct experience. The resulting dishes reflect this real kitchen work. For LLMs, weights represent the knowledge the model gains through training on vast datasets. Billions of text samples - the "ingredients" - are fed through the model. The weights adjust based on these patterns, like a chef adjusting seasoning based on taste tests. With enough practice, richly nuanced language understanding emerges from the raw training data.

So, while parameters define the potential tools and skills, weights reflect the actual hands-on cooking knowledge. The weights capture all the subtle cues that let a seasoned chef perfectly balance flavors. This experiential learning is critical for LLMs like GPT-3 to produce compelling, human-like text. Speaking of datasets, these are analogous to the ingredients that stock a kitchen. Just as a chef can only work with available ingredients, an LLM's training data determines the "raw materials" it learns from. 

High-quality datasets expose models to diverse language examples and contexts. Food metaphors only go so far here - think of datasets as providing textbooks for students across many academic subjects. The more topics covered, the more knowledge students acquire from studying those books.

Datasets for LLMs often contain billions of tokens. Tokens are like individual text "atoms" - each word, number, or punctuation mark making up the datasets. Combining these tokens in different ways allows for expressing anything imaginable in language.
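To make the "atoms" idea concrete, here is a toy tokenizer. Real LLMs use subword schemes such as byte-pair encoding, so their token counts differ from this naive split, but the counting principle is the same.

```python
import re

def toy_tokenize(text):
    """Split text into word, number, and punctuation 'atoms'.
    Real LLM tokenizers use subword units (e.g., BPE), so counts
    will differ, but each token is still a small unit of text."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("The chef, like an LLM, blends 3 ingredients!")
print(len(tokens))  # 11 atoms: 8 words, 1 number, 2 commas... plus '!'
```

Dataset sizes quoted in "billions of tokens" are simply this count applied, with a real tokenizer, to the entire training corpus.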

A chef skillfully blends ingredients into exotic cuisines. Likewise, LLMs deftly synthesize tokens into flowing sentences, fictional stories, or Shakespearean poetry. The weights learned from broad datasets empower models to reconstruct limitless language configurations.

So next time AI like GPT-3 produces strikingly human-like text, consider the parameters as the chef's skills, weights as the chef's accrued cooking expertise, and datasets as the raw ingredients stocking the kitchen. With the right recipe, these components blend into a delicious final dish!
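For readers who want numbers behind the analogy, a rough rule of thumb converts the "kitchen" parameters into a parameter count. The 12·d² per-layer estimate below is a common back-of-envelope approximation (attention projections plus a 4x feed-forward block), not an exact formula for any specific model.

```python
def estimate_parameters(hidden_size, num_layers, vocab_size):
    """Rough transformer parameter count from the 'kitchen' specs.

    Each layer contributes ~12 * hidden_size**2 weights
    (4*d^2 for the attention projections + 8*d^2 for a feed-forward
    block with a 4x expansion); embeddings add vocab_size * hidden_size.
    Biases and layer norms are ignored in this estimate.
    """
    per_layer = 12 * hidden_size ** 2
    embeddings = vocab_size * hidden_size
    return num_layers * per_layer + embeddings

# GPT-3-scale settings land near the famous 175B figure.
total = estimate_parameters(hidden_size=12288, num_layers=96, vocab_size=50257)
print(f"{total / 1e9:.0f}B")  # 175B
```

Notice how the fixed configuration choices (hidden size, layers, vocabulary) fully determine how many weights there are to learn, just as the kitchen setup determines what cooking is possible.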

And that's the essence of how modern LLMs work under the hood. By demystifying the critical moving parts powering large language models, we hope this post provided an intuitive feel for how machines can achieve impressive language mastery - with no computer science background required! Let me know in the comments if this helped explain these advanced AI concepts.

Are More Parameters Better?

According to Dr. Yann LeCun, a model with more parameters is not necessarily better. It's generally more expensive to run and may require more RAM than a single GPU card can offer.

GPT-4 is rumored to be a "mixture of experts," i.e., a neural net consisting of multiple specialized modules, only a subset of which is run on any particular prompt. So, the number of parameters actually used at any time is smaller than the total number.
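The routing idea can be sketched in a few lines of Python. Everything here - the expert names and the keyword-based router - is invented purely for illustration; in a real mixture-of-experts model, the router is itself a learned network and the experts are large neural modules:

```python
# Toy illustration of "mixture of experts": several specialized
# sub-models exist, but only one is selected per prompt, so far
# fewer parameters are active than the model's total.
EXPERTS = {
    "code":    lambda prompt: f"[code expert answers: {prompt}]",
    "math":    lambda prompt: f"[math expert answers: {prompt}]",
    "general": lambda prompt: f"[general expert answers: {prompt}]",
}

def route(prompt):
    # Stand-in for a learned router network.
    if "def " in prompt or "function" in prompt:
        return "code"
    if any(ch.isdigit() for ch in prompt):
        return "math"
    return "general"

def answer(prompt):
    return EXPERTS[route(prompt)](prompt)

print(answer("What is 2 + 2?"))  # only the math expert runs
```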


Here are the terms again:

  • Parameter - the configuration choices that determine the shape of the model, while the weights hold the actual learned knowledge, such as the connection weights between nodes in each neural network layer. The number of weights depends on the choices made for the parameters.
  • Token - the basic unit of text an LLM processes; dataset sizes are measured in tokens.
  • Dataset - the actual training data used to train an LLM, often containing billions of tokens.
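The link between parameters (configuration choices) and weights (learned values) can be made concrete with a little arithmetic. For a small, hypothetical fully connected network - nothing like a real LLM in scale - the layer sizes are the configuration, and the weight count follows directly from them:

```python
def count_weights(layer_sizes):
    """Number of learned weights in a fully connected network:
    each adjacent pair of layers contributes (inputs x outputs)
    connection weights plus one bias per output node."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# The configuration choices (the "parameters" in this post's sense):
layers = [784, 128, 10]
print(count_weights(layers))  # 784*128 + 128 + 128*10 + 10 = 101770
```

Scale the same arithmetic up through hundreds of much wider layers and you arrive at the billions of weights in models like GPT-3.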

The Ongoing Case For Open Source LLMs

Large language models (LLMs) have seen tremendous progress recently, with models like OpenAI's GPT-4 showing impressive capabilities. Some argue that the gap between closed commercial models like GPT-4 and open-source models will continue to widen, making open-source LLMs obsolete. Yet open-source LLMs still have an essential role in democratizing AI, enabling responsible and robust development, and serving real-world applications cost-effectively. 

Recently, I read a tweet by Bindu Reddy, CEO of Abacus.AI, on 𝕏 (formerly Twitter), which inspired me to explore the pros and cons of open-source LLMs in this post. My company's entire software engineering practice has benefited from open-source stacks such as Linux, Apache, MySQL, PHP, Perl, and GNU C. I am a big proponent of open-source software, but there are always two sides to every coin, so one must examine whether the same rules apply to a multi-dimensional technology such as AI.

Pros of Open Source LLMs

There are lots of reasons to consider open-source LLMs as the way forward. Below are some key points that make the case for open-source LLMs.

Cost-Effective Deployment at Scale

While closed LLMs like GPT-4 have strong capabilities, their inference costs are very high, making large-scale deployment expensive. Well-tuned open-source models can match performance on specific tasks at a fraction of the cost. For example, fine-tuned versions of Meta's Llama-2 are routinely used for enterprise applications like QA and summarization, handling over 1 million API calls daily. Deploying GPT-4 would cost over $100,000 per day in such cases.
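A back-of-the-envelope calculation shows why inference price dominates at scale. All of the per-token prices and token counts below are illustrative assumptions, not published figures:

```python
def daily_cost(calls_per_day, tokens_per_call, price_per_1k_tokens):
    """Back-of-the-envelope daily inference cost in dollars."""
    return calls_per_day * (tokens_per_call / 1000) * price_per_1k_tokens

CALLS = 1_000_000   # ~1M API calls per day, as in the example above
TOKENS = 2_000      # hypothetical average tokens per call
BIG_MODEL = 0.06    # hypothetical $/1K tokens, large closed model
TUNED_OSS = 0.002   # hypothetical $/1K tokens, self-hosted tuned model

print(f"Closed model: ${daily_cost(CALLS, TOKENS, BIG_MODEL):,.0f}/day")
print(f"Open source:  ${daily_cost(CALLS, TOKENS, TUNED_OSS):,.0f}/day")
```

Even with generous assumptions for the closed model, the gap compounds to tens of thousands of dollars every day at this call volume.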

Enables Specialization Through Instruction Tuning 

Instruction tuning trains open-source LLMs on specific datasets to improve their capabilities and controllability for certain tasks. For example, instruct-tuned open-source models can match GPT-4 on tasks like QA, NER, and classification while generalizing better to new data in a resource-efficient way. This helps bridge the gap with larger closed models.
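Instruction tuning works by training on many records that each pair an instruction (and optional input) with the desired output. The record below is illustrative, loosely modeled on the widely used Alpaca-style layout; the field names and example text are not from any real dataset:

```python
import json

# One illustrative instruction-tuning record. Fine-tuning runs over
# thousands or millions of such examples so the model learns to
# follow instructions rather than merely continue text.
record = {
    "instruction": "Extract all person names from the text.",
    "input": "Ada Lovelace met Charles Babbage in 1833.",
    "output": "Ada Lovelace; Charles Babbage",
}

print(json.dumps(record, indent=2))
```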

Long Context through Innovations

A current limitation of LLMs like GPT-4 is their relatively short context lengths of around 8K tokens. But open-source methods like LongLoRA enable significantly extending the context windows of models like Llama-2 to over 100K tokens on a single machine. This overcomes a key disadvantage compared to closed models.
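To see why long context is hard in the first place: standard self-attention compares every token with every other token, so cost grows roughly with the square of the context length. A quick calculation shows the jump from 8K to 100K tokens:

```python
def attention_pairs(context_len):
    """Number of token-to-token comparisons in full self-attention."""
    return context_len * context_len

for n in (8_000, 100_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):,} pairwise comparisons")

# Going from 8K to 100K tokens multiplies attention cost by
# (100000/8000)**2, roughly 156x - not just 12.5x.
print(round((100_000 / 8_000) ** 2, 1))  # 156.2
```

That quadratic blow-up is exactly what techniques like LongLoRA work around by making long-context attention cheaper.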

Promotes Responsible AI Development

Open source facilitates collaboration, community oversight, rapid iteration, and benchmarking - all crucial for developing safe and robust AI systems. Historically, open source has improved security by allowing vulnerabilities to be quickly detected and fixed, and there is no evidence that open-sourcing technology inherently leads to misuse.

Cons of Open Source LLMs

To be balanced, we also need to review some of the drawbacks associated with open-sourcing LLM systems.

Lagging Capabilities 

Closed commercial LLMs employ far more resources and data, enabling breakthrough capabilities. For example, rumored models like Google's Gemini and OpenAI's Gobi are likely to be far more potent than GPT-4. Matching their performance and versatility with open source will remain challenging.

Not Universally Accessible

While open source expands access to AI, utilizing these models still requires technical expertise, compute resources, and data. The lack of user-friendly interfaces and documentation further limits accessibility. Marginalized communities can face barriers to effectively leveraging open-source LLMs.

Scaling Challenges

Training and deploying the largest open-source LLMs requires access to massive datasets, computational power, and talent. This privileges more prominent tech players, limiting the ability of smaller teams to scale open-source LLMs. Server costs and engineering overheads can be prohibitive.

Alignment and Oversight Issues

Like any powerful technology, open-source LLMs have alignment, biases, and misuse risks. While open-source communities do enable oversight, large models can exponentially amplify harm. Responsible governance frameworks and safety practices are still evolving in the open-source ecosystem.

The Path Forward

Despite the challenges, open-source LLMs provide an essential counterbalance to closed models by empowering decentralized innovation and democratizing access to AI. Responsible governance combined with engineering ingenuity can help address risks while harnessing the strengths of open source. The ideal path forward is likely a mix of open and closed approaches - combining the innovative spirit of open source with the stability and reliability of commercial models. By complementing each other, these approaches can ensure robustness and accessibility in how AI systems are developed and deployed. The potential of AI is best realized when it is open to all.

In addition to the pros and cons listed above, there are a few other factors to consider when evaluating open-source LLMs:

Maturity and support

Some open-source LLMs are more mature and have better community support than others. This is important to consider when choosing a model for your project, as it can impact the ease of use and development. 


Licensing

Open-source LLMs are licensed under different terms and conditions. It is important to understand the license terms before using a model, as they may restrict how you can use and distribute it. For example, the Llama-2 LLM created by Meta has the following restrictions for commercial use:

  • You cannot use Llama-2 to train or improve any other large language model (excluding Llama-2 or derivative works thereof). This is to prevent companies from using Llama-2 to develop their own commercial LLMs, which could potentially stifle competition and innovation.
  • If your company has more than 700 million monthly active users, you must request a license from Meta to use Llama-2. This is to prevent the largest tech companies from gaining an unfair advantage by using Llama-2.

Other than these two restrictions, the Llama-2 license is relatively permissive and allows for commercial use of the model. This means that you can use Llama-2 to develop and deploy commercial products and services, such as chatbots, translation tools, and writing assistants.

Here are some examples of commercial use cases for Llama-2:

  • A company could use Llama-2 to develop a chatbot that helps customers with support inquiries.
  • A translation service could use Llama-2 to develop a more accurate and reliable translation engine.
  • A writing assistant company could use Llama-2 to develop a tool that helps users write better emails, reports, and other documents.

If you are considering using Llama-2 for commercial purposes, it is important to carefully review the license terms to ensure that you are in compliance.

In short, understanding the licensing terms of open-source models is essential to avoid future legal issues.


Community

The open-source community around a particular LLM can play a vital role in its development and support. A large and active community can help identify and fix bugs, add new features, and provide guidance to users.

In summary, open-source LLMs offer a number of advantages over closed commercial models, including cost-effectiveness, flexibility, and transparency. However, they also have limitations, such as lagging capabilities, accessibility challenges, and scaling hurdles.

Ultimately, the best approach for you will depend on your specific needs and requirements. If you are looking for a cost-effective and flexible solution, open-source LLMs are a good option to consider. However, if you need the most advanced capabilities and do not have the technical expertise or resources to manage open-source models, a closed commercial model may be a better choice.


Unleashing the Future of AI with Intel's Gaudi2

The AI gold rush sparked by chatbots like ChatGPT has sent NVIDIA's stock soaring. NVIDIA GPUs train nearly all of the popular large language models (LLMs) that power these chatbots. However, Intel aims to challenge NVIDIA's dominance in this space with its new Gaudi2 chip. Intel recently released some impressive MLPerf benchmark results for Gaudi2, showing it can match or beat NVIDIA's current A100 GPUs for training large models like GPT-3. Intel even claims Gaudi2 will surpass NVIDIA's upcoming H100 GPU for specific workloads by September. 

These benchmarks position Gaudi2 as the first real alternative to NVIDIA GPUs for LLM training. NVIDIA GPU supply is limited, and demand for LLM training silicon far exceeds it; Gaudi2 could help fill that gap. Intel specifically targets NVIDIA's price/performance advantage, claiming Gaudi2 already beats the A100 in this area for some workloads. And with software optimizations still ongoing, Intel believes Gaudi2's price/performance lead will only grow.

So, while NVIDIA GPUs will continue to dominate LLM training in the short term, Gaudi2 seems poised to become a viable alternative. For any company looking to train the next ChatGPT rival, Gaudi2 will likely be alluring.

In that sense, Gaudi2 does appear to be Intel's direct response to NVIDIA's AI computing leadership. By delivering comparable LLM training performance at a better price, Intel can capture a slice of the exploding market NVIDIA currently owns. 

Gaudi2 has been hailed as a game-changer for its unmatched performance, productivity, and efficiency in complex machine-learning tasks. Its robust architecture allows it to handle intensive computational workloads associated with generative and large language models. Here is a video that introduces the Gaudi2 Intel processor.

This video offers an extensive introduction to this technology, guiding viewers through enabling Gaudi2 to augment their LLM-based applications. It’s a must-watch for anyone looking for ways to leverage this powerhouse processor.

One key focus area discussed is model migration from other platforms onto Gaudi2. The conversation dissects how seamless integration can be achieved without compromising on speed or accuracy – two critical elements when working with AI.

Another crucial topic covered is accelerating LLM training using DeepSpeed and Hugging Face-based models. These popular tools are known for reducing memory usage while increasing training speed, hence their application alongside Gaudi2 comes as no surprise.

Last but not least, the video delves into high-performance inference for generative AI and LLM results. Here we get insights into how Gaudi2 can effectively analyze new data based on previously learned patterns leading to improved prediction outcomes.

This YouTube video opens up imaginations to the possibilities that lie ahead. It's a deep dive into the future of AI, and for anyone keen on keeping up with advancements in this field, it's an absolute must-watch.