CPROMPT AI

The Future is Here: Robots That Understand Natural Language and Navigate Our Homes

Imagine asking your robot assistant to “go grab my keys from the kitchen table and bring them to me on the couch.” For many years, this type of request only existed in science fiction. But rapid advances in artificial intelligence are bringing this vision closer to reality.

Researchers at New York University recently developed a robotic system called OK-Robot that can understand natural language commands and complete tasks like moving objects around typical home environments. The system demonstrates how the latest AI capabilities can be combined to create truly useful robot assistants.

Understanding Natural Language

A key innovation that enables OK-Robot’s abilities is the use of neural networks - AI systems loosely inspired by the human brain - that have been trained on huge datasets to understand language. Systems called Vision-Language Models can now identify over 20,000 different objects when shown images and can understand written descriptions and questions about those images.

The researchers used these models to give OK-Robot the ability to interpret natural language commands using common words to describe objects, places they can be found, and where they should be moved. This gives untrained users the ability to give instructions without needing to learn a rigid syntax or command structure.
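The grounding step can be sketched as embedding both the command's object phrase and each detected object into a shared space, then picking the best match by cosine similarity. The 3-d embeddings below are invented stand-ins, not outputs of any real vision-language model:

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings of detected objects (made-up 3-d stand-ins)
detections = {
    "keys":   [0.9, 0.1, 0.0],
    "mug":    [0.1, 0.8, 0.2],
    "remote": [0.2, 0.1, 0.9],
}

def ground_phrase(phrase_embedding, detections):
    """Return the detected object whose embedding best matches the phrase."""
    return max(detections, key=lambda name: cosine(phrase_embedding, detections[name]))

command_embedding = [0.85, 0.15, 0.05]  # stand-in for embed("my keys")
print(ground_phrase(command_embedding, detections))  # -> keys
```

Because the matching happens in embedding space rather than over a fixed label set, users can describe objects with everyday words instead of a rigid command vocabulary.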

Navigating Like a Human

But understanding language is only the first step to completing tasks - the robot also needs to be able to navigate environments and manipulate objects. Drawing inspiration from technologies self-driving cars use to "see" and move through spaces, the team gave OK-Robot the ability to build a 3D map of rooms using images captured from phone cameras.

This allows OK-Robot to create navigation plans to move around obstacles and get near requested items. It also uses algorithms that simulate human visual and physical reasoning abilities to identify flat surfaces, avoid collisions with clutter, and select optimal paths. The result is fluid navigation using the same sort of common-sense logic humans implicitly understand about moving through home environments.
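A minimal sketch of this kind of planning, assuming the 3D map has been flattened into a 2D occupancy grid (1 = obstacle). Breadth-first search stands in for the real planner, which would use something like A* over the full map:

```python
from collections import deque

def plan_path(grid, start, goal):
    """Shortest obstacle-free path on a grid via breadth-first search."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None  # goal unreachable

room = [
    [0, 0, 0],
    [1, 1, 0],   # a couch blocking the direct route
    [0, 0, 0],
]
path = plan_path(room, (0, 0), (2, 0))
print(path)
```

The planner routes around the blocked cells rather than failing, which is the same common-sense detouring a person does when a couch is in the way.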

Manipulating Household Objects 

Finally, to pick up and move everyday items, OK-Robot employs AI recognition capabilities to identify graspable points on target objects. It considers shape, size, and physical properties learned from experience grasping thousands of objects to select a suitable gripper pose. This allows OK-Robot to handle items ranging from boxes and bottles to clothing and coffee mugs.
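Grasp selection can be sketched as scoring candidate gripper poses and executing the highest-scoring one. The candidate poses and scores below are invented for illustration; in the real system they would come from a trained grasp-prediction model:

```python
def select_grasp(candidates):
    """Pick the candidate gripper pose with the highest predicted success score."""
    return max(candidates, key=lambda c: c["score"])

# Hypothetical candidates for grasping a coffee mug
candidates = [
    {"pose": "top-down on mug rim", "score": 0.42},
    {"pose": "side grip on handle", "score": 0.87},
    {"pose": "pinch on mug body",   "score": 0.31},
]

best = select_grasp(candidates)
print(best["pose"])  # -> side grip on handle
```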

The system combines its language interpretation, navigation system, and grasping abilities to fulfill requests like “Put my phone on the nightstand” or “Throw this soda can in the recycling”. It even handles specifying destinations using relationships like “on top of” or “next to”.

Real-World Robot Challenges

Evaluating their new system across 10 real homes, the NYU team found OK-Robot could fulfill requests like moving common household items nearly 60% of the time with no prior training or exposure to the environment. This major leap towards capable home robots highlights the progress AI is making.

However, it also uncovered real-world challenges robots still face operating in human spaces. Items placed in difficult-to-reach locations, clutter blocking paths or grasps, and requests involving heavy, fragile, or transparent objects remain problematic. Quirks of language interpretation can also lead to confusion over which specific item is being indicated or where it should be moved.

Still, by integrating the latest AI in an adaptable framework, OK-Robot sets a new high bar for language-driven robot competency. And its failures help illustrate remaining gaps researchers must close to achieve fully capable assistants.

The Path to Robot Helpers

The natural language understanding and navigation capabilities demonstrated by OK-Robot lend hope that AI and robotics are stepping towards the dreamed-of era of useful automated helpers. Continued progress pairing leading-edge statistical learning approaches with real-world robotic systems seems likely to make this a reality.

Key insights from this research illustrating existing strengths and limitations include:

  • Modern natural language AI allows untrained users to give robots useful instructions 
  • Advanced perception and planning algorithms enable feasible navigation of home spaces  
  • Data-driven grasping models generalize reasonably well to new household objects
  • Real-world clutter, occlusion, and ambiguity still frequently thwart capable robots
  • Careful system integration is crucial to maximize performance from imperfect components

So while the robot revolution still faces hurdles in reliably handling everyday situations, projects like OK-Robot continue pushing towards convenient and affordable automation in our homes, workplaces, and daily lives.

Kabir M.
How Shopify is Empowering Businesses with AI

We are in the midst of an AI revolution. Systems like ChatGPT and DALL-E 2 capture headlines and imaginations with their ability to generate remarkably human-like text, images, and more. But beyond the hype, companies like Shopify thoughtfully integrate this technology to solve real business challenges.

Shopify provides the infrastructure for entrepreneurs to build thriving businesses. As Director of Product Miqdad Jaffer explains, "AI is an opportunity to make entrepreneurship accessible for everyone." He envisions AI as a "powerful assistant" to help with administrative tasks so entrepreneurs can focus on developing standout products.

Much as past innovations like the calculator boosted productivity in their domains, AI promises to eliminate digital drudgery: "We've seen that we've created a suite of products called Shopify Magic. And the idea behind this was, how do we embed this directly into the workflows our merchants must go through?"

The key is designing AI to enhance human capabilities rather than replace them. Shopify's first offering, Shopify Magic, helps write product descriptions, email campaigns, customer service responses, and more while giving business owners final approval. 

Keeping humans in charge eases concerns around brands losing control over messaging or AI sharing inaccurate product information. Merchants are quickly finding ways to customize the AI's output to their brand voice and specialty offerings. Despite risks with rapidly adopting new AI systems, Shopify leaned into experimentation, knowing the tools can solve problems today: "We wanted to lean in for a couple of reasons. One, it's important to get this in the hands of merchants as fast as possible, staying with the cutting edge of the technology side." Their risk tolerance traces directly back to who their users are.

Entrepreneurs tend to have a high tolerance for risk, willing to sacrifice stability to turn ideas into reality. Shopify realized Magic aligned better with this hunger to try new things than a cautious rollout would. The results show shop owners enthusiastically using Magic, from launching online outlets to translating product info for international audiences.

Rather than replace humans, Shopify aims to build AI products that enable entrepreneurs to excel: "This is always going to be something that augments and isn't a replacement. So this will always be something that helps a user be the best version of themselves." Much like past innovations such as tractors and computers overcame limitations to empower more extraordinary human achievement, AI promises another leap by eliminating digital drudgery.


Q: Why did Shopify move so quickly to adopt risky new AI tools? 

They knew entrepreneurs tend to have high-risk tolerance and are eager to gain any possible advantage using cutting-edge technology.

Q: Does Shopify's AI write final emails and product descriptions by itself?

No, their Magic assistant makes intelligent suggestions, but business owners have final approval over client-facing messaging. This maintains human control.

Q: What are some ways merchants customize AI outputs?

Users employed AI to translate product info into different languages, rapidly create content for new campaign sites, and tailor its suggestions to fit their brand style.

Q: How has AI started impacting other Shopify product areas?

Shopify recently launched visual AI tools to let merchants effortlessly customize backdrops for product images to create tailored campaigns.

Q: How do AI innovations compare historically? 

Much like farm equipment boosted production capacity, and computers increased information access, AI eliminates tedious tasks so humans can better pursue their passions.


  • Generative AI: AI systems capable of generating original content like text, images, audio, and video rather than just classifying data. Example: ChatGPT.
  • Natural Language Processing (NLP): Subfield of AI focused on understanding, interpreting, and generating human languages. Enables capabilities like text summarization.  
  • Prompt engineering: Crafting the text prompts provided to generative AI models to influence their outputs. Requires human skill.
  • Overfitting: When an AI algorithm performs very well on its training data but fails to generalize to new situations. Leads to fragility.
Kabir M.
Making Transformers Simpler and Faster

Transformers have become the backbone behind many recent advances in AI, powering systems like ChatGPT for natural language tasks. Yet the standard transformer architecture has many intricacies that make it complex and inefficient. In a new paper, researchers Bobby He and Thomas Hofmann explore how to streamline transformers by removing unnecessary components, making them more straightforward, faster, and practical. 

The core idea is that several aspects of the standard transformer block—the primary building block transformers are made of—can be simplified or removed entirely without hampering performance. Specifically, He and Hofmann identify and eliminate excess baggage in terms of 1) skip connections, 2) value and projection parameters, 3) sequential sub-blocks, and 4) normalization layers.  

Skip connections are links between layers that help information flow across the network. The researchers find these can be discarded in each transformer layer's attention and feedforward sub-blocks. The key is to initialize the attention mechanism with a strong identity component, allowing tokens to better retain information about themselves as signals pass through the network.

The value and projection parameters help transform representations as they enter and exit the multi-head attention module in each layer. Surprisingly, He and Hofmann reveal these extra transform matrices can be fixed to the identity without affecting results. This eliminates half of the matrix multiplications required in the attention layer.  

Similarly, the standard transformer computes its attention and feedforward sub-blocks sequentially. By parallelizing these computations instead, the skip connections linking the sub-blocks become unnecessary.

Finally, normalization layers that help regulate activations can also be removed with the proper architecture adjustments, albeit with a minor drop in per-step speeds. 

Together, these modifications lead to a radically simplified transformer block that matches, if not exceeds, the performance and efficiency of the original complex block. For example, the simplified model attains over 15% faster throughput and uses 15% fewer parameters, yielding practical savings.
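A back-of-the-envelope calculation shows where the savings come from, assuming model width d, an MLP expansion factor of 4, and ignoring biases and normalization parameters:

```python
def block_params(d, simplified=False):
    # Standard attention uses four d x d matrices (query, key, value, output
    # projection); the simplified block fixes value and projection to the
    # identity, leaving only query and key.
    attn = (2 if simplified else 4) * d * d
    mlp = 2 * d * (4 * d)   # two linear maps: d -> 4d and 4d -> d
    return attn + mlp

d = 768
standard = block_params(d)
simple = block_params(d, simplified=True)
print(f"reduction: {1 - simple / standard:.1%}")  # -> reduction: 16.7%
```

Under these simplifying assumptions the per-block reduction is about one sixth, in the same ballpark as the roughly 15% parameter savings the paper reports.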

Real-World Impact

The work has both theoretical and practical implications. On the theory side, it reveals limitations in current tools like signal propagation for explaining and designing networks, motivating more nuanced dynamic theories that capture training intricacies. Practically, simpler transformer architectures can translate to significant efficiency gains in AI systems, reducing computational demands to deploy large language models.

For example, consider CPROMPT.AI, a platform allowing everyday users to build and share custom AI applications easily. The apps tap into capabilities like text generation from prompts, with no coding needed. Simpler and faster transformers directly enable deploying more powerful capabilities to more people at lower cost, which is crucial as advanced AI diffuses across society.

He and Hofmann’s simplifications also compound the work of other researchers pursuing efficient transformers, bringing us closer to practical transformers at the scales and accuracies necessary to push AI forward. So, while recent models boast hundreds of billions of parameters, streamlined architectures could pack comparable performance in packages sized for broad access and impact.

The quest for AI that is not only capable but accessible and responsible continues. Reducing transformer complexity provides one path to more efficient, economical, and beneficial AI development and deployment.

The key facts from the paper

  • Skip connections in both the attention and feedforward modules of transformers can be removed without hampering performance by initializing attention with a strong identity component.
  • The value and projection parameters in multi-head attention are unnecessary and can be fixed to the identity matrix.
  • By parallelizing the attention and feedforward computations, sequential sub-blocks can also be eliminated.
  • Simplifying transformers in these ways yields models with 15% higher throughput and 15% fewer parameters.
  • Limitations of signal propagation theories for neural network design are revealed, motivating more refined dynamic theories.


  • Skip connection - a connection between non-consecutive layers in a neural network. 
  • Value parameters - weights in the transformer attention mechanism 
  • Projection parameters - weights that transform attention outputs
  • Sequential sub-blocks - the standard process of computing attention and then feedforward blocks
  • Normalization layer - a layer that regulates activation values 
Kabir M.
The Secret Behind a Child's Creativity: What AI Still Can't Match

We've all seen the recent feats of AI, like ChatGPT and DALL-E 2, churning out essays, computer code, artworks, and more with a simple text prompt. The outputs seem intelligent, creative even. But are these AI systems innovative in the way humans are? Developmental psychologists argue there's a fundamental difference. 

In a recent paper published in Perspectives on Psychological Science, researchers Eunice Yiu, Eliza Kosoy, and Alison Gopnik make the case that while today's large language models excel at imitating existing patterns in data, they lack the flexible, truth-seeking abilities that allow even young children to innovate tools and discover new causal structures in the world.  

The Core Idea: Imitation Versus Innovation

The authors explain that AI systems like ChatGPT are best understood not as intelligent agents but as "cultural technologies" that enhance the transmission of information from person to person. Much like writing, print, and the Internet before them, large language models are highly skilled at extracting patterns from vast datasets of text and images created by humans. They are, in effect, "giant imitation engines" in language and visual creativity.  

However, cultural evolution depends on both imitation and innovation – the ability to expand on existing ideas or create new ones. This capacity for innovation requires more than statistical analysis; it demands interacting with the world in an exploratory, theory-building way to solve what scientists call "the inverse problem." Children as young as four can invent novel tools and discover new causal relationships through active experimentation, going beyond the patterns they've observed.

So, while AI models can skillfully continue trends and genres created by humans, they lack the flexible reasoning skills needed to push boundaries and explore new creative territory. As Gopnik told the Wall Street Journal, "To be truly creative means to break out of previous patterns, not to fulfill them."

Evidence: Comparing AI and Child Tool Innovation 

To test this imitation-versus-innovation hypothesis, the researchers conducted experiments comparing how children, adults, and leading AI models like Claude and GPT-4 handled tool innovation tasks.

In one scenario, participants were asked to select an object to draw a circle without the usual compass tool, choosing from either:

  • An associated but irrelevant item – a ruler 
  • A visually dissimilar but functionally relevant item – a round-bottomed teapot
  • An irrelevant item – a stove

The results showed:

  • Both kids and adults excelled at selecting the teapot, demonstrating an ability to discover new causal affordances in objects.
  • The AI models struggled, often picking the associated ruler instead of realizing the teapot's potential.  

This suggests that while statistical learning from text can capture superficial relationships between objects, it falls short when more creative abstraction is needed.

This research shows that today's AI still can't match a child's innate curiosity and drive to experiment. We see this on the CPROMPT.AI platform, where users ideate and iterate prompt apps to explore topics and share perspectives without external incentives or curation. It's a case where human creativity shines!

AI models provide an incredible tool for enhancing human creativity through easier access to knowledge and quick iteration. The CPROMPT.AI no-code interface lets anyone transform AI chat into usable web apps for free. You dream it, you build it, no programming required.

The interplay between human and artificial intelligence promises even more innovation. But the next giant leap will likely come from AI that, like children, actively learns by doing rather than purely analyzing patterns. Budding young scientists have a lesson for the best minds in AI!


  • Large language models - AI systems trained on massive text or image datasets, like ChatGPT and DALL-E, to generate new text or images. 
  • Inverse problem - The challenge of inferring causes from observed effects and making predictions. Solving it requires building models of the external world through exploration.
  • Affordance - The possible uses and actions latent in an object based on its properties. Recognizing affordances allows innovative tool use.
  • Overimitation - Copying all details of a task, even non-causal ones. AI models have high-fidelity imitation but may lack human social imitation abilities.
  • Causal overhypotheses - Abstract hypotheses that constrain hypotheses about more concrete causal relationships. Discovering these allows generalization.
Kabir M.
The Hidden Memories of LLMs: Extractable Memorization in AI

In artificial intelligence, an intriguing phenomenon lies beneath the surface - extractable memorization. This term refers to an AI model's tendency to inadvertently retain fragments of training data, which a third party can later extract. Understanding this concept is vital for safeguarding privacy in AI systems. 

What is Extractable Memorization?

Extractable memorization occurs when parts of an AI model's training data can be efficiently recovered by an external "attacker," intentionally or unintentionally. Also called data extraction attacks, these exploits pose serious privacy risks if personal or sensitive data is revealed. Recent research analyzed extractable memorization across various language models - from open-source tools like GPT-Neo to private APIs like ChatGPT. The findings were troubling:

  • Open models memorized up to 1% of training data. More data was extracted as the model size increased.
  • Closed models also showed vulnerability. ChatGPT leaked personal details with simple attacks despite privacy measures.

With queries costing just $0.002 each, spending only $200 yielded over 10,000 private training examples from ChatGPT. Extrapolations suggest adversaries with larger budgets could extract far more.
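The arithmetic behind that figure is simple. Note that the 10% hit rate below is inferred from the quoted numbers for illustration, not a figure taken from the paper:

```python
def extracted_examples(budget_usd, cost_per_query, hit_rate):
    """Estimate memorized examples recoverable for a given attack budget."""
    queries = budget_usd / cost_per_query   # how many queries the budget buys
    return round(queries * hit_rate)        # fraction that leak training data

# $200 at $0.002/query = 100,000 queries; ~1 in 10 yields a memorized example
print(extracted_examples(200, 0.002, 0.10))  # -> 10000
```

Because the cost per extracted example is roughly constant, the estimate scales linearly with budget, which is what makes the extrapolation to better-funded adversaries so concerning.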

What Does This Mean for Developers and Users?

For developers, this signals an urgent need to rigorously test for and mitigate the risks of extractable memorization. As models grow more capable, so does the quantity of sensitive data they accumulate and the potential for exposure. Responsible AI requires acknowledging these failure modes. For users, it challenges the assumption that personal information is protected when engaging with AI: even robust models have exhibited critical flaws enabling data leaks, so caution around data security is warranted with existing systems.

Progress in AI capabilities brings immense potential and complex challenges surrounding transparency and privacy. Extractable memorization is the tip of the iceberg. Continued research that responsibly probes model vulnerabilities is crucial for cultivating trust in emerging technologies. Understanding the hidden memories within language models marks an essential step.

Kabir M.
Unlocking the Secrets of Self-Supervised Learning

Self-supervised learning (SSL) has become an increasingly powerful tool for training AI models without requiring manual data labeling. But while SSL methods like contrastive learning produce state-of-the-art results on many tasks, interpreting what these models have learned remains challenging.  A new paper from Dr. Yann LeCun and other researchers helps peel back the curtain on SSL by extensively analyzing standard algorithms and models. Their findings reveal some surprising insights into how SSL works its magic.

At its core, SSL trains models by defining a "pretext" task that does not require labels, such as predicting image rotations or solving jigsaw puzzles with cropped image regions. The key innovation is that by succeeding at these pretext tasks, models learn generally useful data representations that transfer well to downstream tasks like classification.
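A rotation pretext task can be sketched in a few lines: each input is rotated by a random multiple of 90 degrees, and the rotation index itself serves as the training label, so no human annotation is needed. A 2x2 grid of numbers stands in for an image here:

```python
import random

def rotate90(img):
    """Rotate a 2D list-of-lists 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_pretext_example(img, rng):
    """Return (rotated image, rotation index) -- the index is the free label."""
    k = rng.randrange(4)          # pseudo-label: 0, 90, 180, or 270 degrees
    rotated = img
    for _ in range(k):
        rotated = rotate90(rotated)
    return rotated, k

img = [[1, 2],
       [3, 4]]
rng = random.Random(0)
rotated, label = make_pretext_example(img, rng)
print(label, rotated)
```

A model trained to predict `label` from `rotated` must learn about object orientation and structure, and those learned representations are what transfer to downstream tasks.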

Digging Into the Clustering Process

A significant focus of the analysis is how SSL training encourages input data to cluster based on semantics. For example, with images, SSL embeddings tend to get grouped into clusters corresponding to categories like animals or vehicles, even though category labels are never provided. The authors find that most of this semantic clustering stems from the "regularization" component commonly used in SSL methods to prevent representations from just mapping all inputs to a single point. The invariance term that directly optimizes for consistency between augmented samples plays a lesser role.
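The two loss components can be sketched in the style of variance-regularized SSL methods. This is a hedged illustration of the idea, not the exact losses analyzed in the paper:

```python
def invariance(view_a, view_b):
    """Invariance term: mean squared distance between paired augmented views."""
    return sum(
        sum((x - y) ** 2 for x, y in zip(a, b))
        for a, b in zip(view_a, view_b)
    ) / len(view_a)

def variance_regularizer(embeddings, target_std=1.0):
    """Regularization term: hinge penalty when per-dimension std collapses."""
    dims = len(embeddings[0])
    n = len(embeddings)
    total = 0.0
    for d in range(dims):
        mean = sum(e[d] for e in embeddings) / n
        var = sum((e[d] - mean) ** 2 for e in embeddings) / n
        total += max(0.0, target_std - var ** 0.5)
    return total / dims

collapsed = [[0.0, 0.0]] * 4  # every input mapped to the same point
spread = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
print(variance_regularizer(collapsed))  # -> 1.0 (collapse heavily penalized)
print(variance_regularizer(spread))     # -> 0.0 (no penalty)
```

The collapsed batch shows why the regularizer matters: without it, mapping every input to one point trivially minimizes the invariance term, and the paper's finding is that this anti-collapse pressure, more than the invariance term, drives semantic clustering.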

Another remarkable result is that semantic clustering reliably occurs across multiple hierarchies - distinguishing between fine-grained categories like individual dog breeds and higher-level groupings like animals vs vehicles.

Preferences for Real-World Structure 

However, SSL does not cluster data randomly. The analysis provides substantial evidence that it prefers grouping samples according to patterns reflective of real-world semantics rather than arbitrary groupings. The authors demonstrate this by generating synthetic target groupings with varying degrees of randomness. The embeddings learned by SSL consistently align much better with less random, more semantically meaningful targets. This preference persists throughout training and transfers across different layers of the network.

The implicit bias towards semantic structure explains why SSL representations transfer so effectively to real-world tasks. Here are some of the key facts:

  • SSL training facilitates clustering of data based on semantic similarity, even without access to category labels
  • Regularization loss plays a more significant role in semantic clustering than invariance to augmentations 
  • Learned representations align better with semantic groupings vs. random clusters
  • Clustering occurs across multiple hierarchies of label granularity
  • Deeper network layers capture higher-level semantic concepts 

By revealing these inner workings of self-supervision, the paper makes essential strides toward demystifying why SSL performs so well. 


  • Self-supervised learning (SSL) - Training deep learning models through "pretext" tasks on unlabeled data
  • Contrastive learning - Popular SSL approach that maximizes agreement between differently augmented views of the same input
  • Invariance term - SSL loss component that encourages consistency between augmented samples 
  • Regularization term - SSL loss component that prevents collapsed representations
  • Neural collapse - Tendency of embeddings to form tight clusters around class means
Kabir M.
Evaluating AI Assistants: Using LLMs as Judges

As consumer-facing large language models (LLMs) become increasingly capable, evaluating them is crucial yet challenging: how can we effectively benchmark AI performance, especially in the open-ended, free-form conversations users prefer? Researchers from UC Berkeley, Stanford, and other institutions explore using strong LLMs as judges to evaluate chatbots in a new paper titled "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." The core premise is that well-trained LLMs already exhibit alignment with human preferences, so they can act as surrogates for expensive and time-consuming human ratings.

This LLM-as-a-judge approach offers immense promise in accelerating benchmark development. Let's break down the critical details from the paper.

The Challenge of Evaluating Chatbots

While benchmarks abound for assessing LLMs' core capabilities like knowledge and logic, they focus primarily on closed-ended questions with short, verifiable responses. Yet modern chatbots handle free-form conversations across diverse topics. Evaluating their helpfulness and alignment with user expectations is vital but profoundly challenging.

Obtaining robust human evaluations is reliable but laborious and costly, and crowdsourcing fresh ratings from average users for every new model revision is impractical. At the same time, existing standardized benchmarks often fail to differentiate between base LLMs and the aligned chatbots users actually prefer.

For instance, the researchers demonstrate that human users strongly favor Vicuna, a chatbot fine-tuned to mimic ChatGPT conversations, over the base LLaMA model it's built on. Yet differences in benchmark scores on datasets like HellaSwag remain negligible. This discrepancy highlights the need for better benchmarking paradigms tailored to human preferences.

Introducing MT-Bench and Chatbot Arena

To address this evaluation gap, the researchers construct two new benchmarks with human ratings as key evaluation metrics:

  • MT-Bench: A set of 80 open-ended, multi-turn questions testing critical user-facing abilities like following instructions over conversations. Questions fall into diverse domains like writing, reasoning, math, and coding.
  • Chatbot Arena: A live platform where anonymous users chat simultaneously with two models, then vote on preferred responses without knowing model identities. This allows gathering unconstrained votes based on personal interests.
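Chatbot Arena turns such anonymous pairwise votes into model rankings; an Elo-style update of the kind the arena has used can be sketched as follows, with invented vote data:

```python
def elo_update(r_winner, r_loser, k=32):
    """Standard Elo update: nudge the winner up and the loser down."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = ["model_a", "model_a", "model_b", "model_a"]  # winners of 4 battles

for winner in votes:
    loser = "model_b" if winner == "model_a" else "model_a"
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

print(max(ratings, key=ratings.get))  # -> model_a
```

Because each update weighs the result against the expected outcome, upsets move ratings more than expected wins, and the total rating mass is conserved across the pool.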

These human-centered benchmarks offer more realistic assessments grounded in subjective user preferences rather than technical accuracy alone. As an example, I ran the same prompt through two versions of Claude and found one answer (B) more interesting than the other (A).

You can try this at: https://chat.lmsys.org

LLMs as Surrogate Judges 

The paper investigates using strong LLMs like Claude and GPT-4 as surrogate judges to approximate human ratings. The fundamental hypothesis is that because these models are already trained to match human preferences (e.g., through reinforcement learning from human feedback), their judgments should closely correlate with subjective user assessments. Advantages of this LLM-as-a-judge approach include:

  • Scalability: Automated LLM judgments require minimal human involvement, accelerating benchmark iteration.
  • Explainability: LLMs provide explanatory judgments, not just scores. This grants model interpretability, as illustrated in examples later.

The paper systematically analyzes this method by measuring LLM judge agreement with thousands of expert votes in controlled settings and unconstrained crowd votes from the two new benchmarks. But first, let's examine some challenges.

Position Bias and Other Limitations

LLM judges exhibit certain biases that can skew evaluations:

  • Position bias: The tendency to favor responses based on the order they are presented rather than their quality. All LLM judges tested demonstrate significant position bias.
  • Verbosity bias: Longer responses tend to be rated higher regardless of clarity or accuracy. When researchers artificially expanded model responses via repetition without adding new information, all judges but GPT-4 failed to detect the distortion.
  • Self-enhancement bias: Some hints exist of judges preferring responses stylistically similar to their own, but the limited evidence prevents clear conclusions.
  • Reasoning limitations: Since LLMs' math and logic capabilities remain imperfect, they unsurprisingly struggle to grade such questions. Even on problems they can solve independently, providing incorrect candidate answers can mislead judges.

Despite these biases, agreement between LLM and human judgments ultimately proves impressive, as discussed next. And researchers propose some techniques to help address limitations like position bias, which we'll revisit later.

Key Finding: LLM Judges Match Human Preferences  

Across both controlled and uncontrolled experiments, GPT-4 achieves over 80% judgment agreement with human assessors - on par even with the ~81% inter-rater agreement between random human pairs. This suggests LLMs can serve as cheap and scalable substitutes for costly human evaluations. In particular, here's a sample highlight:

MT-Bench: On 1138 pairwise comparisons from multi-turn dialogues, GPT-4 attained 66% raw agreement and 85% non-tie agreement with experts. The latter excludes tied comparisons where neither response was favored.
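The two agreement numbers can be operationalized as follows; the votes below are invented for illustration, and dropping comparisons where either party declared a tie is one way to compute the non-tie rate:

```python
def agreement(judge_votes, human_votes):
    """Return (raw agreement, non-tie agreement) over paired votes."""
    pairs = list(zip(judge_votes, human_votes))
    raw = sum(j == h for j, h in pairs) / len(pairs)
    # non-tie agreement: drop comparisons where either side voted "tie"
    non_tie = [(j, h) for j, h in pairs if j != "tie" and h != "tie"]
    non_tie_rate = sum(j == h for j, h in non_tie) / len(non_tie)
    return raw, non_tie_rate

# Votes are "A", "B", or "tie" for each pairwise comparison
judge = ["A", "B", "tie", "A", "B"]
human = ["A", "B", "A",   "B", "B"]
raw, non_tie = agreement(judge, human)
print(raw, non_tie)  # -> 0.6 0.75
```

Non-tie agreement is always at least as high as raw agreement on the same data, which is why the paper reports both numbers side by side.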

Remarkably, when human experts disagreed with GPT-4 judgments, they still deemed its explanations reasonable 75% of the time. And 34% directly changed their original choice to align with the LLM assessment after reviewing its analysis. This further validates the reliability of LLM surrogate judging.

LLM agreement rates grow even higher on model pairs exhibiting sharper performance differences. When responses differ significantly in quality, GPT-4 matches experts almost 100% of the time. This suggests alignment improves for more extreme cases that should be easier for both humans and LLMs to judge consistently.

Mitigating LLM Judge Biases 

While the paper demonstrates LLM judge performance largely on par with average human consistency, biases like position bias remain important targets for improvement. The researchers propose a few mitigation techniques with preliminary success:

  • Swapping positions: Running judgments twice with responses flipped and only keeping consistent verdicts can help control position bias.
  • Few-shot examples: Priming LLM judges with a handful of illustrative examples significantly boosts consistency on position bias tests from 65% to 77% for GPT-4, mitigating bias.
  • Reference guidance: For mathematical problems, providing LLM judges with an independently generated reference solution drastically cuts failure rates in assessing candidate answers from 70% down to just 15%. This aids competency on questions requiring precise analysis.
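The first strategy is simple enough to sketch in a few lines. The `judge` function below is a stand-in for a real LLM call - here it is a toy heuristic that prefers longer responses - but the swap-and-compare logic around it is the actual mitigation:

```python
# Position-swapping check: judge each pair twice with the response order
# flipped, and keep a verdict only when both runs agree.
# `judge` is a hypothetical stand-in for an LLM judge call.

def judge(response_a: str, response_b: str) -> str:
    """Toy judge: prefers the longer response. Returns 'A', 'B', or 'tie'."""
    if len(response_a) > len(response_b):
        return "A"
    if len(response_b) > len(response_a):
        return "B"
    return "tie"

def consistent_verdict(r1: str, r2: str) -> str:
    first = judge(r1, r2)               # original order
    second = judge(r2, r1)              # swapped order
    # A position-unbiased judge flips its label when the order flips.
    flipped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == flipped else "inconsistent"

print(consistent_verdict("a detailed, thorough answer", "short answer"))
```

Verdicts marked "inconsistent" can be discarded or re-run, trading throughput for reliability.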

So, while biases exist, simple strategies can help minimize their impacts. And overall agreement rates already match or even exceed typical human consistency.

Complementing Standardized Benchmarks   

Human preference benchmarks like MT-Bench and Chatbot Arena assess different dimensions than existing standardized tests of knowledge, reasoning, logic, etc. Using both together paints a fuller picture of model strengths.

For example, the researchers evaluated multiple variants of the base LLaMA model with additional conversation data fine-tuning. Metrics like accuracy on the standardized HellaSwag benchmark improved steadily with more fine-tuning data. However, small high-quality datasets produced models strongly favored by GPT-4 judgments despite minimal gains on standardized scores.

This shows both benchmark types offer complementary insights. Continued progress will also require pushing beyond narrowly defined technical metrics to capture more subjective human preferences.

Democratizing LLM Evaluation 

Evaluating sophisticated models like ChatGPT requires expertise today. But platforms like CPROMPT.AI open LLM capabilities to everyone by converting text prompts into accessible web apps. With intuitive visual interfaces, anyone can tap into advanced LLMs to create AI-powered tools for education, creativity, productivity, and more. No coding is needed. And the apps can be shared publicly or sold without any infrastructure or scaling worries.

By combining such no-code platforms with the automated LLM judge approaches above, benchmarking model quality could also become democratized. Non-experts can build custom benchmark apps to evaluate evolving chatbots against subjective user criteria.  

More comprehensive access can help address benchmark limitations like overfitting on standardized tests by supporting more dynamic, personalized assessments. This is aligned with emerging paradigms like Dynabench that emphasize continuous, human-grounded model evaluations based on actual use cases versus narrow accuracy metrics alone.

Lowering barriers facilitates richer, real-world measurements of AI progress beyond expert evaluations.

Key Takeaways

Let's recap the critical lessons around using LLMs as judges to evaluate chatbots:

  • Aligning AI with subjective user preferences is crucial yet enormously challenging to measure effectively.
  • New human preference benchmarks like MT-Bench reveal alignment failures despite solid standardized test performance.
  • Employing LLMs as surrogate judges provides a scalable and automated way to approximate human assessments.
  • LLMs like GPT-4 can match expert consistency levels above 80%, confirming efficacy.
  • Certain biases affect LLM judges, but mitigation strategies like swapping response positions and few-shot examples help address those.
  • Maximizing progress requires hybrid evaluation frameworks combining standardized benchmarks and human preference tests.

As chatbot quality continues improving exponentially, maintaining alignment with user expectations is imperative. Testing paradigms grounded in human judgments enable safe, trustworthy AI development. Utilizing LLMs as judges offers a tractable path to effectively keep pace with accelerating progress in this domain.


  • MT-Bench: Suite of open-ended, multi-turn benchmark questions with human rating comparisons  
  • Chatbot Arena: Platform to gather unconstrained conversations and votes pitting anonymous models 
    against each other
  • Human preference benchmark: Tests targeting subjective user alignments beyond just technical accuracy
  • LLM-as-a-judge: Approach using large language models to substitute for human evaluation and preferences
  • Position bias: Tendency for language models to favor candidate responses based simply on the order presented rather than quality
Kabir M.
tag:blog.cprompt.ai,2013:Post/2061504 2023-12-12T04:48:47Z 2023-12-13T08:27:11Z Managing the Risks of Artificial Intelligence: A Core Idea from the NIST AI Risk Management Framework

Artificial intelligence (AI) has brought astounding advances, from self-driving cars to personalized medicine. However, it also poses novel risks. How can we manage the downsides so AI's upsides shine through? The US National Institute of Standards and Technology (NIST) offers a pioneering perspective in its AI Risk Management Framework. 

At its heart, the framework views AI risks as socio-technical - arising from the interplay of technical factors and social dynamics. If deployed crudely, an AI system designed with the best intentions could enable harmful discrimination. And even a technically sound system might degrade performance over time as society changes. Continual adjustment is critical. The framework outlines four core functions - govern, map, measure, and manage. 

"Govern" focuses on accountability, culture, and policies. It asks organizations to clearly define roles for governing AI risks, foster a culture of responsible AI development, and institute policies that embed values like fairness into workflows. Wise governance enables the rest.

"Map" then surveys the landscape of possibilities - both beneficial uses and potential downsides of a planned AI system. Mapping elucidates the real-world context where a system might operate, illuminating risks.

"Measure" suggests concrete metrics to track those risks over an AI system's lifetime, enabling ongoing vigilance. Relevant metrics range from technical dimensions like security vulnerabilities to societal measures like discriminatory impacts. 

Finally, "manage" closes the loop by prioritizing risks that surfaced via mapping and measurement, guiding mitigation efforts according to tolerance levels. Management also includes communication plans for transparency.

At CPROMPT.AI, these functions tangibly guide the development of our easy-to-use platform for no-code AI. We continually map end-user needs and potential misuses, instituting governance policies that embed beneficial values upfront. We measure via feedback loops to catch emerging issues fast. We actively manage - and adjust policies based on user input to keep risks low while enabling broad access to AI's benefits.

The framework highlights that AI risks can never be "solved" once and for all. Responsible AI requires a sustained, collaborative effort across technical and social spheres - achieving trust through ongoing trustworthiness. 

Top Takeaways:

  • AI risks are socio-technical - arising from technology and social dynamics. Both angles need addressing.
  • Core risk management functions span governing, mapping, measuring, and managing. Each enables managing AI's downsides amid its upsides.
  • Mapping helps reveal risks and opportunities early by understanding the context thoroughly.
  • Measurement tracks technical and societal metrics to catch emerging issues over time.
  • Management closes the loop - mitigating risks based on tolerance levels and priorities.

At CPROMPT.AI, we're putting these ideas into practice - enabling anyone to build AI apps quickly while governing use responsibly. The future remains unwritten. Through frameworks like NIST's that guide collective action, we can shape AI for good.

Recommended Reading

Managing AI Risks: A Framework for Organizations


Q: What is the NIST AI Risk Management Framework?

The NIST AI Risk Management Framework guides organizations in managing the potential risks of developing, deploying, and using AI systems. It outlines four core functions – govern, map, measure, and manage – to help organizations build trustworthy and responsible AI. 

Q: Who can use the NIST AI Risk Management Framework? 

The framework is designed to be flexible for any organization working with AI, including companies, government agencies, non-profits, etc. It can be customized across sectors, technologies, and use cases.

Q: What are some unique AI risks the framework helps address?

The framework helps manage amplified or new risks with AI systems compared to traditional software. This includes risks related to bias, opacity, security vulnerabilities, privacy issues, and more arising from AI's statistical nature and complexity.

Q: Does the framework require specific laws or regulations to be followed?

No, the NIST AI Risk Management Framework is voluntary and complements existing laws, regulations, and organizational policies related to AI ethics, safety, etc. It provides best practices all organizations can apply.

Q: How was the NIST AI Risk Management Framework created?

NIST developed the framework based on industry, academia, civil society, and government input. It aligns with international AI standards and best practices. As a "living document," it will be updated regularly based on user feedback and the evolving AI landscape.


  • Socio-technical - relating to the interplay of social and technological factors
  • Governance - establishing policies, accountability, and culture to enable effective risk management 
  • Mapping - analyzing the landscape of possibilities, risks, and benefits for a particular AI system
  • Measurement - creating and tracking metrics that shed light on a system's technical and societal performance

Kabir M.
tag:blog.cprompt.ai,2013:Post/2061459 2023-12-12T01:21:36Z 2023-12-21T18:37:37Z Managing AI Risks: A Framework for Organizations

Artificial intelligence (AI) systems hold tremendous promise to enhance our lives but also come with risks. How should organizations approach governing AI systems to maximize benefits and minimize harms? The AI Risk Management Framework (RMF) Playbook created by the National Institute of Standards and Technology (NIST) offers practical guidance. NIST is a U.S. federal agency within the Department of Commerce responsible for developing technology, metrics, and standards to drive innovation and economic competitiveness at national and international levels. NIST's work covers various fields, including cybersecurity, manufacturing, physical sciences, and information technology. It plays a crucial role in setting standards that ensure product and system reliability, safety, and security, especially in new technology areas like AI.

At its core, the Playbook provides suggestions for achieving outcomes in the AI RMF Core Framework across four essential functions: Govern, Map, Measure, and Manage. The AI RMF was developed through a public-private partnership to help organizations evaluate AI risks and opportunities. 

The Playbook is not a checklist of required steps. Instead, its voluntary suggestions allow organizations to borrow and apply ideas relevant to their industry or interests. By considering Playbook recommendations, teams can build more trustworthy and responsible AI programs. Here are three top-level takeaways from the AI RMF Playbook:

Start with strong governance policies 

The Playbook emphasizes getting governance right upfront by establishing policies, procedures, roles, and accountability structures. This includes outlining risk tolerance levels, compliance needs, stakeholder participation plans, and transparency requirements. These guardrails enable the subsequent mapping, measurement, and management of AI risks.

For example, the Playbook suggests creating standardized model documentation templates across development projects. This supports consistently capturing limitations, test results, legal reviews, and other data to govern systems.

Continuously engage stakeholders

Given AI's broad societal impacts, the Playbook highlights regular engagement with end users, affected communities, independent experts, and other stakeholders. Their input informs context mapping, impact assessments, and the suitability of metrics. 

Participatory design research and gathering community insights are highlighted as ways to enhance measurement and response plans. The goal is to apply human-centered methods to make systems more equitable and trustworthy.

Adopt iterative, data-driven improvements  

The Playbook advocates iterative enhancements informed by risk-tracking data, metrics, and stakeholder feedback. This means continually updating performance benchmarks, fairness indicators, explainability measures, and other targets. Software quality protocols like monitoring for bug severity and system downtime are also suggested.

This measurement loop aims to spur data-driven actions and adjustments. Tying metrics to potential harms decreases the likelihood of negative impacts over an AI system's lifecycle. Documentation also builds institutional knowledge.

Creating Trustworthy AI

Organizations like CPROMPT.AI, enabling broader access to AI capabilities, have an opportunity to integrate ethical design. While risks exist, the Playbook's voluntary guidance provides a path to developing, deploying, and monitoring AI thoughtfully.

Centering governance, engagement, and iterative improvements can help machine learning teams act responsibly. Incorporating feedback ensures AI evolves to serve societal needs best. Through frameworks like the AI RMF, we can build AI that is not only powerful but also deserving of trust.


What is the AI RMF Playbook?

The AI RMF Playbook provides practical guidance aligned to the AI Risk Management Framework (AI RMF) Core. It suggests voluntary actions organizations can take to evaluate and manage risks across the AI system lifecycle: governing, mapping, measuring, and managing.

Who developed the AI RMF Playbook?

The Playbook was developed through a public-private partnership between industry, academia, civil society, government, international organizations, and impacted communities. The goal was to build consensus around AI risk management best practices.

Does my organization have to follow all Playbook recommendations?

No, the Playbook is not a required checklist. Organizations can selectively apply suggestions relevant to their industry or use case based on their risk profile and resources. It serves as a reference guide.

What are some key themes in the Playbook?

Major Playbook themes include:
  • Establishing strong AI governance.
  • Continually engaging stakeholders for input.
  • Conducting impact assessments.
  • Tracking key risk metrics.
  • Adopting iterative data-driven enhancements to systems.

How can following the Playbook guidance help my AI systems?

By considering Playbook suggestions, organizations can better anticipate risks across fairness, safety, privacy, and security. This empowers teams to build more trustworthy, transparent, and responsible AI systems that mitigate harm.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2060994 2023-12-11T07:07:53Z 2023-12-11T08:21:43Z Tree of Thought vs. Chain of Thought: A Smarter Way to Reason and Problem Solve

When tackling tricky challenges that require complex reasoning – like solving a math puzzle or writing a coherent story – how we structure our thought process greatly impacts the outcome. Typically, there are two frameworks people use: 

  • Chain of Thought (CoT): Linear, step-by-step thinking;
  • Tree of Thought (ToT): Branching, exploring many sub-ideas.  

Intuitively, mapping out all facets of an issue enables deeper analysis than a single train of logic. An intriguing AI technique called Tree of Thoughts formally integrates this concept into advanced systems known as large language models. 

Inside the AI: Tree of Thoughts 

In a paper from Princeton and Google AI researchers, a framework dubbed "Tree of Thoughts" (ToT) enhances deliberate planning and problem solving within language models – AI systems trained on vast texts that can generate writing or answer questions when prompted. 

Specifically, ToT formulates thinking as navigating a tree, where each branch represents exploring another consideration or intermediate step toward the final solution. For example, the system logically breaks down factors like space, allergies, and care needs to recommend the best family pet, gradually elaborating the options. This branching structure resembles visual concept maps that aid human creativity and comprehension.

Crucially, ToT incorporates two integral facets of higher-level cognition that set it apart from standard AI:

  • Evaluating ideas: The system assesses each branch of reasoning via common sense and looks a few steps ahead at possibilities.
  • Deciding and backtracking: It continually judges the most promising path to continue thinking through, backtracking as needed.  
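These two facets - scoring branches and backtracking to the most promising one - can be sketched with a best-first search over a toy problem. The task here (build a three-digit sequence summing to a target) and the heuristic are illustrative inventions, not the paper's benchmarks; the point is the branch-evaluate-backtrack loop:

```python
import heapq

# Minimal Tree-of-Thoughts-style search on a toy task: build a 3-digit
# sequence whose digits sum to a target. Each "thought" is a partial
# sequence; a heuristic scores branches, and a priority queue lets the
# search backtrack to more promising branches automatically.

TARGET, LENGTH = 15, 3

def score(seq):
    # "Evaluating ideas": how close can this branch still get?
    remaining = LENGTH - len(seq)
    best_possible = sum(seq) + 9 * remaining
    if sum(seq) > TARGET or best_possible < TARGET:
        return float("-inf")          # dead branch: prune it
    return -abs(TARGET - sum(seq) - 4.5 * remaining)

def tree_of_thoughts():
    frontier = [(-score(()), ())]     # max-heap via negated scores
    while frontier:
        _, seq = heapq.heappop(frontier)   # most promising branch first
        if len(seq) == LENGTH:
            if sum(seq) == TARGET:
                return seq            # solved
            continue                  # backtrack to next-best branch
        for digit in range(10):       # branch into candidate thoughts
            child = seq + (digit,)
            s = score(child)
            if s > float("-inf"):
                heapq.heappush(frontier, (-s, child))
    return None

print(tree_of_thoughts())  # one valid sequence of digits summing to 15
```

A chain-of-thought approach would commit to one digit at a time and never revisit; the priority queue is what gives the tree its ability to abandon a weak branch mid-way.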

This deliberate planning technique enabled significant advances in challenging puzzles requiring creative mathematical equations or coherent story writing that stump today's best AIs.

Chain vs. Tree: A Superior Way to Reason 

Compared to a chain of thought's linear, one-track reasoning, experiments reveal ToT's branching approach to thinking:

  • Better handles complexity as ideas divide into sub-topics
  • Allows more comprehensive exploration of alternatives  
  • Keeps sight of the central issue as all branches connect to the main trunk

Yet the chain of thought's simplicity has merits, too, in clearly conveying ideas step-by-step.

In essence, ToT combines people's innate tree-like conceptualization with AI's scaling computational power for smarter exploration. Its versatility also allows customization across different tasks and systems.

So, while both frameworks have roles depending on needs and individual thinking style preferences, ToT's deliberate branching is uniquely suited to steering AI's problem-solving today. 

As AI becomes more autonomous in real-world decision-making, ensuring deliberate, structured thinking will only grow in importance – making the tree of thought an increasingly essential capability that today's promising explorations point toward.

Recommended Videos

This video starts by revisiting the 'Tree of Thoughts' prompting technique, demonstrating its effectiveness in guiding Language Models to solve complex problems. Then, it introduces LangChain, a tool that simplifies prompt creation, allowing for easier and more efficient problem-solving. 

Tree of Thoughts becomes Forest of Thoughts, with the addition of multiple trees. Join Richard Walker in this exciting exploration of 'Forest of Thoughts' - an innovative technique that can boost your AI's problem-solving abilities.  

Kabir M.
tag:blog.cprompt.ai,2013:Post/2060953 2023-12-11T03:54:44Z 2023-12-11T03:59:43Z Teaching AI Agents to Make Real-World Decisions

A new open-source software package called Pearl aims to give AI agents the tools they need to make decisions in the real world. Developed by researchers at Meta AI, Pearl provides a versatile framework for reinforcement learning (RL), a trial-and-error technique inspired by how humans and animals acquire skills.

The Core Idea Behind Pearl

At its core, Pearl is designed to handle the complexity of real-world sequential decision-making. It equips RL agents with capabilities like:

  • Intelligently exploring environments to gather valuable data
  • Summarizing complex histories to handle partial information  
  • Safely avoiding undesirable outcomes
  • Efficiently learning from offline datasets

This represents a significant upgrade from existing RL libraries, which tend to focus narrowly on core algorithms while neglecting the practical requirements of production systems.

Pearl's modular architecture makes mixing and matching components like policy learning methods, exploration strategies, safety constraints, and neural network architectures easy. This flexibility empowers researchers and practitioners to tailor RL solutions to their needs.

Teaching an RL Agent to Balance a Pole

To understand Pearl in action, let's walk through a simple example of using it to teach an agent to balance a pole on a cart (a classic RL benchmark problem). 

We first instantiate a PearlAgent object, choosing a deep Q-learning policy learner and an ε-greedy exploration strategy to ensure a mix of exploration and exploitation. Our agent then repeatedly takes actions in the cart pole environment, observing the current state, receiving rewards or penalties, and storing the generated experiences in its replay buffer.

Behind the scenes, after each episode Pearl samples experiences from the buffer to train the agent's neural network, steadily improving its policy. Over time, the agent learns to move the cart left or right to keep the pole balanced for as long as possible.
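The training cycle just described - act with ε-greedy exploration, store experiences in a replay buffer, then sample the buffer to update the policy - can be sketched in plain Python. Note this is not Pearl's actual API (its PearlAgent and deep Q-learning components are richer); it is a tabular Q-learner on a toy "walk to the goal" chain, standing in for the cart-pole loop:

```python
import random
from collections import deque

# Sketch of the act / store / sample-and-update cycle described above,
# using a tabular Q-learner on a toy chain environment. Illustrative
# only - Pearl's real components are neural-network based.

random.seed(0)
N_STATES, GOAL, ACTIONS = 5, 4, (-1, 1)   # walk left/right along a chain
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
buffer = deque(maxlen=1000)               # replay buffer
eps, alpha, gamma = 0.2, 0.5, 0.9

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    return nxt, (1.0 if nxt == GOAL else -0.1), nxt == GOAL

for episode in range(200):
    state = 0
    for _ in range(20):
        # ε-greedy: explore sometimes, otherwise exploit current Q-values
        action = (random.choice(ACTIONS) if random.random() < eps
                  else max(ACTIONS, key=lambda a: q[(state, a)]))
        nxt, reward, done = step(state, action)
        buffer.append((state, action, reward, nxt, done))  # store experience
        state = nxt
        # Sample the buffer and apply a Q-learning update per experience
        for s, a, r, s2, d in random.sample(buffer, min(8, len(buffer))):
            target = r + (0.0 if d else gamma * max(q[(s2, b)] for b in ACTIONS))
            q[(s, a)] += alpha * (target - q[(s, a)])
        if done:
            break

# After training, the greedy action from the start state heads toward the goal.
print(max(ACTIONS, key=lambda a: q[(0, a)]))
```

Replaying stored experiences instead of learning only from the latest step is what makes the buffer improve sample efficiency, as the glossary below notes.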

Key Takeaways

Pearl demonstrates how the modular building blocks of history, exploration, safety, and offline learning can produce sophisticated RL agents ready for real-world deployment. As the authors highlight, it is already used in industry applications like recommender systems and bidding optimization.

As AI advances, we need frameworks like Pearl that translate innovation into meaningful solutions for businesses and communities. With thoughtful design, RL could one day coordinate disaster relief efforts, allocate funding to scientific projects, guide public health programs, and more. 

By open-sourcing Pearl, Meta AI lowers organizations' barriers to building decision-making systems powered by state-of-the-art reinforcement learning. Now, anyone can access these capabilities for free and even turn their AI prompts into easy-to-use web apps via platforms like CPROMPT.AI


  • Reinforcement learning: rewarding desirable behaviors and punishing mistakes instead of requiring labeled examples
  • Replay buffer: Temporary data storage used to recycle experiences to improve sample efficiency  
  • Policy: The agent's decision-making model mapping states to actions
Kabir M.
tag:blog.cprompt.ai,2013:Post/2060951 2023-12-11T03:33:58Z 2023-12-21T18:37:02Z Enabling Efficient Parallel Function Calling in LLMs

Large language models (LLMs) like GPT-3 have shown remarkable language understanding and reasoning capabilities. This has expanded the scope of LLMs from content generation to solving complex problems across domains like math, coding, and question-answering. However, LLMs have inherent limitations: knowledge cutoffs, poor arithmetic skills, and no access to private data sources. 

To overcome these limitations, recent works have focused on equipping LLMs with external function calling capabilities. This allows users to provide custom functions that the LLM can invoke to augment its skills. For instance, an LLM could call a calculator function for arithmetic operations or query a private database and summarize the results. The LLM selects suitable functions based on context and integrates their outputs to derive solutions.

While function calling expands the capabilities of LLMs, current approaches like ReAct execute functions sequentially. This means the LLM calls one function, reasons over the output, and then decides the next function to call. This back-and-forth process continues until the LLM generates the final solution. 

The sequential execution in ReAct has three key downsides:

  • High latency - Reasoning over each intermediate output becomes time-consuming for queries needing multiple sequential function calls.
  • Increased costs - Frequent promptings of the LLM to analyze each output drive up the token usage.
  • Lower accuracy - Concatenating intermediate outputs can sometimes confuse the LLM, leading to repetitive function calls or premature stopping.

The paper "An LLM Compiler for Parallel Function Calling" comes in here. It proposes a novel LLMCompiler framework that can efficiently orchestrate parallel function calling in LLMs. 

The Core Idea Behind LLMCompiler

The core philosophy behind LLMCompiler is drawing inspiration from classical compilers. Compilers optimize instruction execution in programs by identifying parts that can run in parallel. LLMCompiler applies similar concepts for efficient multi-function execution in LLMs. 

Specifically, LLMCompiler has three key components:

  • LLM Planner: Analyzes user prompts and graphs necessary function calls with dependencies.
  • Task Fetching Unit: Dynamically inspects the graph to dispatch independent function calls in parallel. 
  • Executor: Runs the dispatched function call tasks concurrently using associated tools.

Let's understand this with an example prompt:

"How much does Microsoft's market cap need to increase to exceed Apple's market cap?"

The LLM Planner breaks this down into four key tasks:

  • Search for Microsoft's market cap
  • Search for Apple's market cap 
  • Divide Apple's market cap by Microsoft's market cap
  • Generate textual response for the division result

Tasks 1 and 2 are independent searches that can run in parallel. Task 3 depends on the outputs of 1 and 2, while task 4 depends on 3. 

The Task Fetching Unit identifies tasks 1 and 2 as parallelizable and dispatches them concurrently to the Executor. Once done, it sends task 3 by substituting the actual market cap values from the outputs of tasks 1 and 2. Finally, task 4 is executed to return the final response.
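The dispatch pattern above - repeatedly fetch every task whose dependencies are complete and run them concurrently - can be sketched with a thread pool. The tool functions and market-cap figures below are stubs of my own invention; a real system would call search, math, and LLM tools instead:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of LLMCompiler-style dispatch: a planner emits a task graph,
# and a fetching loop dispatches every task whose dependencies are
# satisfied, in parallel. Tool functions and values are stubs.

def search_msft(_):  return 2.8e12     # stub: Microsoft market cap
def search_aapl(_):  return 3.0e12     # stub: Apple market cap
def divide(deps):    return deps["aapl"] / deps["msft"]
def respond(deps):   return f"Apple/Microsoft market-cap ratio: {deps['ratio']:.2f}"

# Task graph: name -> (function, names of dependencies)
graph = {
    "msft":   (search_msft, []),
    "aapl":   (search_aapl, []),
    "ratio":  (divide, ["msft", "aapl"]),
    "answer": (respond, ["ratio"]),
}

def run(graph):
    results = {}
    with ThreadPoolExecutor() as pool:
        while len(results) < len(graph):
            # Fetch all tasks whose dependencies are done (the two
            # searches on the first pass) and dispatch them concurrently.
            ready = [n for n, (_, deps) in graph.items()
                     if n not in results and all(d in results for d in deps)]
            futures = {n: pool.submit(graph[n][0],
                                      {d: results[d] for d in graph[n][1]})
                       for n in ready}
            for name, fut in futures.items():
                results[name] = fut.result()
    return results["answer"]

print(run(graph))
```

The two searches finish in one concurrent wave rather than two sequential LLM round trips, which is where the latency and token savings come from.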

This optimized orchestration of function calls provides significant benefits:

  • Faster execution due to parallel processing
  • Lower costs from fewer unnecessary LLM promptings
  • Increased accuracy by minimizing intermediate output interference

Analyzing LLMCompiler's Performance

The authors benchmarked LLMCompiler on different workloads - from embarrassingly parallel patterns to sequentially dependent functions. They tested popular LLMs like GPT-3.5 and the open-source LLaMA-2 model.

The results show that LLMCompiler provides consistent latency speedups of 1.8x to 3.74x over ReAct across benchmarks. The efficient planning also leads to a 3x-6x cost reduction in token usage. Further, LLMCompiler improves accuracy over ReAct by avoiding repetition and early stopping failures.

LLMCompiler also outperforms OpenAI's parallel function calling feature, which was released concurrently, demonstrating 1.35x faster execution and affirming its optimized orchestration of function calls.

Frameworks like LLMCompiler that optimize parallel function execution unlock new possibilities for builders. Prompt programming platforms like CPROMPT.AI can then democratize such capabilities for everyone.

CPROMPT.AI allows anyone to turn AI prompts into customizable web apps without coding. Users could build an app powered by an efficient LLM backend like LLMCompiler to solve complex problems. The app creator can customize functions that end users can leverage by simply describing their queries in natural language.

For instance, an investor may build a market analysis app using custom data queries and financial models. An engineer could create a CAD design troubleshooting app with PLMs and simulation functions. Such apps make efficient parallel function calling accessible to domain experts beyond AI researchers.  

With innovations like LLMCompiler and prompt programming platforms like CPROMPT.AI, users can build purpose-built apps enhanced by LLMs that efficiently tackle multifaceted problems in natural language. This can expand the real-world impact of large language models.

The LLMCompiler introduces an optimized compiler-like orchestration for parallel function calling in LLMs. By planning efficient execution graphs and concurrent function dispatches, the LLMCompiler unlocks faster and more affordable execution without compromising accuracy. Further combining such capabilities with accessible, prompt programming interfaces like CPROMPT.AI can democratize parallel problem-solving with LLMs beyond AI experts. As LLMs continue maturing into versatile reasoning engines, efficient multi-function orchestration will be vital to unlocking their full potential while managing resources prudently.

GitHub Repo for LLMCompiler



  • ReAct: Framework for agents using LLMs to reason and take actions

Kabir M.
tag:blog.cprompt.ai,2013:Post/2056351 2023-12-10T20:00:00Z 2023-12-21T18:23:35Z Why We Shouldn't Humanize AI

Recently, I came across an article on VOX called Why it’s important to remember that AI isn’t human by Raphaël Millière and Charles Rathkopf. It made me think about the dozens of people I read and hear on 𝕏 (formerly Twitter) and 𝕏 Spaces who write or talk to ChatGPT or another LLM as if they were interacting with a real human being. I use polite language in crafting my prompts because I am told that if my input is closer to a strong, robust pattern, the model might be better at predicting my desired content - not because I think of it as a human. But what do you see when you talk to ChatGPT? A cold, emotionless bot spitting out responses? Or a friendly, helpful companion ready to converse for hours? Our instincts push us toward the latter, though the truth lies somewhere in between. We view all things through a linguistic lens, artificial intelligence included. And therein lies the trouble.

Human language marked the pinnacle of human cognition. No other species could conjugate verbs, compose poems, or write legal briefs. Language remained uniquely ours until an AI startup called Anthropic released Claude - a large language model capable of debating ethics, critiquing sonnets, and explaining its workings with childlike clarity. 

Seemingly overnight, our exclusivity expired. Yet we cling to the assumption that only a human-like mind could produce such human-like words. When Claude chatters away, we subconsciously project intentions, feelings, and even an inner life onto its algorithms. This instinct to anthropomorphize seeps through our interactions with AI, guiding our perceptions down an erroneous path. As researchers Raphaël Millière and Charles Rathkopf explain, presuming language models function like people can "mislead" and "blind us to the potentially radical differences in the way humans and [AI systems] work."

Our brains constantly and unconsciously guess at meanings when processing ambiguous phrases. If I say, "She waved at him as the train left the station," you effortlessly infer I mean a person gestured farewell to someone aboard a departing locomotive. Easy. Yet, multiply such ambiguity across millions of neural network parameters, and deducing intended significances becomes more complex. Claude's coders imbued it with no personal motivations or desires. Any interpretation of its statements as possessing some undisclosed yearning or sentiment is sheer fabrication. 

Nonetheless, the impressiveness of Claude's conversational skills compels us to treat it more as a who than a what. Study participants provided more effective prompts when phrasing requests emotionally rather than neutrally. The Atlantic's James Somers admitted to considering Claude "a brilliant, earnest non-native English speaker" to interact with it appropriately. Without awareness, we slide into anthropomorphic attitudes.

The treacherous assumption underpinning this tendency is that Claude runs on the same psychological processes enabling human discourse. After all, if a large language model talks like a person, it thinks like one, too. Philosopher Paul Bloom calls this impulse psychological essentialism - an ingrained bias that things possess an inherent, hidden property defining their categorization. We extend such essentialist reasoning to minds, intuitively expecting a binary state of either minded or mindless. Claude seems too adept with words not to have a mind, so our brains automatically classify it as such.

Yet its linguistic mastery stems from algorithmic calculations wholly unrelated to human cognition. Insisting otherwise is anthropocentric chauvinism - dismissing capabilities differing from our own as inauthentic. Skeptics argue Claude merely predicts following words rather than genuinely comprehending language. But as Millière and Rathkopf point out, this no more limits Claude's potential skills than natural selection constrains humanity's. Judging artificial intelligence by conformity to the human mind will only sell it short.

The temptation persists, sustained by a deep-rooted psychological assumption the authors dub the "all-or-nothing principle." We essentialize minds as present or absent in systems, allowing no gradient between them. Yet properties like consciousness exist along a spectrum with inherently fuzzy boundaries. Would narrowing Claude's knowledge bases or shrinking its neural networks eventually leave something non-minded? There is no clear cut-off separating minded from mindless AI. Still, the all-or-nothing principle compels us to draw one, likely anchored to human benchmarks.

To properly evaluate artificial intelligence, Millière and Rathkopf advise adopting the empirical approach of comparative psychology. Animal cognition frequently defies anthropomorphic assumptions - observe an octopus instantaneously camouflaging itself. Similarly, unencumbered analysis of Claude's capacities will prove far more revealing than hamstrung comparisons to the human mind. Only a divide-and-conquer methodology tallying its strengths and weaknesses on its terms can accurately map large language models' contours.

The unprecedented eloquence of systems like Claude catches us off guard, triggering an instinctive rush toward the familiar. Yet their workings likely have little in common with our psychology. Progress lies not in noting where Claude falls short of human behavior but in documenting its capabilities under its unique computational constraints. We can only understand what an inhuman intelligence looks like by resisting the temptation to humanize AI.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2060428 2023-12-09T19:16:49Z 2023-12-21T18:36:24Z The Future of AI in Europe: What the Landmark EU Deal Means

The European Union recently reached a provisional political agreement on legislation regulating artificial intelligence (AI) systems and their use. This "Artificial Intelligence Act" is the first comprehensive legal framework for AI. It establishes obligations and restrictions for specific AI applications to protect fundamental rights while supporting AI innovation.

What does this deal cover, and what changes might it bring about? As AI becomes deeply integrated into products and services worldwide, Europeans and tech companies globally need to understand these new rules. This post breaks down critical aspects of the Act and what it could mean going forward.

Defining AI

First, what counts as an AI system under the Act? It defines AI as software developed with specific techniques to predict, recommend, or decide real-world outcomes and interactions. This means today's AI assistants, self-driving vehicles, facial recognition systems, and more would fall under the law.

Banned Uses of AI

Recognizing the threats AI poses to rights and democracy, specific applications are prohibited entirely:

  • Biometric categorization systems that use sensitive personal characteristics like religious beliefs, sexual orientation, race, etc., to categorize people. Example: Software ranking individuals as LGBTQ+ without consent.  
  • Scraping facial images from the internet or surveillance cameras to create recognition databases. Example: Companies scraping social media photos to build facial recognition systems.
  • Emotion recognition software in workplaces and schools. Example: Software gauging student engagement and boredom during online classes.  
  • Social scoring systems judging trustworthiness or risk levels based on social behaviors. Example: Apps rating individuals' personality traits to determine access or opportunities.
  • AI that seeks to circumvent users' free will or agency. Example: Chatbots manipulating individuals into purchases by exploiting psychological vulnerabilities.
  • AI exploiting vulnerabilities of disadvantaged groups. Example: Lenders using income data to steer low-income applicants towards unfavorable loan offers. 

These bans address some of the most problematic uses of emerging AI capabilities. However, the most contentious issue proved to be biometric identification systems used for law enforcement.

Law Enforcement Exemptions 

The Act carves out certain narrow exceptions allowing law enforcement to use biometric identification, like facial recognition tech, in public spaces. However, these come with restrictions and safeguards.

Specific types of biometric ID systems are permitted, subject to prior judicial approval, only for strictly defined serious crimes and searches. Real-time scanning would have tight locational and time limits. 

For example, searches for trafficking victims or to prevent an imminent terrorist threat may use approved biometric tech for that specific purpose. Extensive databases of facial images or other biometrics can only be compiled with cause.

The rules seek to balance investigating significant crimes and protecting civil liberties. However, digital rights advocates argue any biometric surveillance normalizes intrusions disproportionately affecting marginalized communities. Companies building or providing such tech must closely track evolving EU guidance here.

High-Risk AI Systems

For AI applications classified as high-risk, like those affecting health, safety, fundamental rights, and more, strict obligations apply under the Act. Examples include autonomous vehicles, recruitment tools, credit scoring models, and AI used to determine access to public services.

Requirements will include risk assessments, documentation, transparency, human oversight, and more. There are also special evaluation and reporting procedures when high-risk AI systems seem likely to be involved in any breach of obligations.  

Citizens gain the right to file complaints over high-risk AI impacts and ask for explanations of algorithmic decisions affecting them. These provisions acknowledge the growing influence of opaque AI systems over daily life.

General AI and Future Advancements 

The rapid expansion of AI capabilities led policymakers to build in measures even for cutting-edge systems yet to be realized fully. General purpose AI, expected to become mainstream within 5-10 years, faces transparency rules around training data and documentation.

For high-impact general AI anticipated down the line, special model checks, risk mitigation processes, and incident reporting apply. So emerging AI fields like natural language processing chatbots are on notice to meet similar standards to high-risk apps eventually.

Supporting Innovation  

Will these new obligations stifle European AI innovation and competitiveness? The Act attempts to balance regulation with support for technology development, especially for smaller enterprises. 

Regulatory sandboxes let companies test innovative AI in real-world conditions pre-deployment. Favorable market access procedures aid new market entrants. Requirements kick in only after an AI system is placed on the EU market.

Overall, the Act signals that human rights and ethics should lead development, not vice versa. But legislators avoided imposing some of the most stringent restrictions tech companies opposed.

Fines for Violations

Failure to meet requirements results in fines of up to €30 million or 6% of a company's global turnover. Intentional non-compliance sees even harsher penalties - a substantial incentive for companies to comply.
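To make the scale concrete, here is a minimal sketch of how such a cap works, assuming the common EU pattern where the penalty ceiling is the greater of the fixed amount and the turnover percentage (the final legal text may set different thresholds):

```python
def max_ai_act_fine(global_turnover_eur: float) -> float:
    """Upper bound of the headline penalty: EUR 30M or 6% of global
    annual turnover, whichever is greater (illustrative only)."""
    return max(30_000_000, 0.06 * global_turnover_eur)

# A firm with EUR 1B turnover: 6% (EUR 60M) exceeds the EUR 30M floor.
print(max_ai_act_fine(1_000_000_000))  # 60000000.0
# A firm with EUR 10M turnover is bounded by the EUR 30M floor instead.
print(max_ai_act_fine(10_000_000))     # 30000000
```

For large multinationals, the percentage prong dominates, which is why the turnover-based figure is the one big tech compliance teams watch.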

What It Means for US Tech Companies

American tech giants like Microsoft, IBM, and Google, all deeply involved in European markets, will need to implement structures and processes adhering to the new rules. Smaller startups entering the EU marketplace will want to build compliance into products from the start.

Companies exporting AI software or devices to Europe must determine if products fall under high-risk categories or other designations mandating accountability steps. Strict data and documentation requirements around developing and updating AI systems demand additional staffing and oversight.  

While the Act avoids the most burdensome restrictions, adhering to transparency principles and ensuring human oversight of automated decisions requires investment. Tech lobbying failed to defeat obligations reinforcing ethical AI practices many researchers have long called for.

US policymakers have proposed federal guidelines and legislation governing AI systems and companies. However, nothing comparable to the EU's comprehensive regulatory approach has advanced. That may gradually change as the global impacts of the landmark European Act become more apparent in the coming years.

Glossary of Key Terms

  • Biometric identification systems: Technology using biological or behavioral traits – like facial features, fingerprints, gait, and voice – to identify individuals. Examples include facial recognition, fingerprint matching, and iris scans.  
  • High-risk AI systems: AI technology presenting a significant potential risk of harm to health, safety, fundamental rights, and other areas defined by EU regulators. Self-driving cars and AI tools in critical infrastructure like hospitals exemplify high-risk systems.  
  • General purpose AI: Artificial intelligence that can perform complex cognitive tasks across many industries and use cases. Sometimes called artificial general intelligence (AGI), it does not fully exist yet, but advanced AI exhibits some broad capabilities.  
  • Regulatory sandbox: A controlled testing environment that allows developers to try innovative digital products/services while oversight agencies review functionality, risks, and effectiveness before full deployment or marketing.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2060431 2023-12-08T20:00:00Z 2023-12-09T19:28:14Z The EU's Artificial Intelligence Act in a Nutshell

The EU's Artificial Intelligence Act aims to establish the first comprehensive legal framework governing AI systems. The main goals are to ensure AI respects existing EU laws and ethical principles while supporting innovation and business use.

Key provisions:

  • Creates a legal definition of an "AI system" in EU law encompassing various software-based technologies like machine learning and knowledge-based systems. The definition aims to be broad and flexible enough to adapt to future AI advances.  
  • Adopts a risk-based approach tailoring obligations depending on the threat level the AI system poses. AI applications posing "unacceptable risk" would be prohibited entirely, while "high-risk" systems would face stricter transparency, testing, and human oversight requirements before market access. "Limited risk" and "minimal risk" AI would have lighter or no additional obligations.
  • Explicitly bans specific dangerous AI uses, including systems exploiting vulnerable groups, scoring individuals based on social behaviors, and real-time remote biometric identification by law enforcement in public spaces.  
  • Imposes mandatory risk management, data governance, transparency and accuracy standards on "high-risk" AI systems used in critical sectors like healthcare and transport or impacting rights and safety. Requires third-party conformity assessments before high-risk systems can carry CE marking for EU market access.
  • Creates an EU database for registering high-risk AI systems and establishes national authorities to oversee compliance, address violations through fines and product recalls, and coordinate enforcement across borders.  
  • Seeks to boost EU AI innovation and investment through regulatory "sandboxes" where companies can test new systems, and through favorable market access rules, particularly helping small businesses and startups develop AI.

The Act's comprehensive scope and strict prohibitions aim to make the EU a leader in ethical and trustworthy AI while allowing beneficial business applications to flourish. But critics argue it could also impose costly burdens, potentially limiting AI investments and stifling innovation.


Q: Who does the AI Act target?

The Act mainly targets providers and users of AI systems based in the EU or exporting products and services to EU markets. So it applies to EU-based tech companies and major US firms like Meta, Alphabet, Microsoft, etc., serving EU users.

Q: What does the Act mean for big US tech companies? 

Major US tech firms deeply involved in EU markets will likely need to implement compliance structures around high-risk AI uses regarding transparency, testing requirements, risk assessment, and human oversight. This could mean sizable compliance costs.

Q: Does the Act ban any AI use by US companies?

Yes, the Act prohibits specific applications by all providers, including uses of AI deemed excessively harmful or dangerous, regardless of whether a system is high-risk. For example, AI uses exploiting vulnerable populations, applications enabling mass biometric surveillance, and AI tools circumventing individual rights.

Q: Will the Act limit investment in AI by US firms?  

Possibly. Compliance costs may deter US tech investments in developing high-risk AI systems for European markets. But the impact likely depends on how rigorously national regulators enforce obligations on companies.

Q: What does the Act mean for US AI startups eyeing EU markets?

The Act aims to support market access and innovation by smaller AI developers through measures like regulatory sandboxes to test new systems. However, meeting requirements around risk management and accuracy for high-risk applications could still prove burdensome for early-stage startups with limited resources.

Q: Could the Act influence AI regulation in the US?

The Act takes a much more active regulatory approach than the US federal government's guidelines. If successful, the comprehensive EU framework could inspire similar proposals for ethical AI guardrails in the US as calls for regulation of technology companies grow.

Q: How will average EU citizens benefit from the AI Act?  

By restricting specific dangerous uses of AI, the Act aims to protect EU citizens' digital rights and safety. Requirements around transparency should improve citizens' understanding of automated decisions impacting their lives regarding issues like credit eligibility and access to public services.  

Q: Will the Act make interacting with AI systems easier in the EU? 

Potentially. Provisions prohibiting AI aimed explicitly at exploiting vulnerabilities could lead to systems that better respect human agency and choice when recommending purchases, content selections, and other areas that impact behavior.

Q: Could the Act limit the beneficial uses of AI for EU citizens?

Overly stringent restrictions on lower-risk AI could curb the development of innovations like virtual assistants and chatbots intended to help consumers. However, the Act predominantly targets high-risk uses while promoting voluntary codes of conduct for companies creating consumer AI.

Q: Will EU citizens have any say in how companies develop AI models?

The Act does not establish specific mechanisms for public participation in corporate AI design choices. However, by strengthening national regulators' powers, enhancing transparency, and allowing consumer complaints over biased outcomes, citizens gain new avenues to challenge issues created by AI systems affecting them.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2056111 2023-12-07T20:00:00Z 2023-12-21T18:21:04Z Pushing AI's Limits: Contrasting New Evaluation Milestones

Artificial intelligence has breezed through test after test in recent years. But as capabilities advance, suitable benchmarks grow scarce. Inspired by the video posted on 𝕏 (formerly Twitter) by Thomas Wolf, the co-founder of Hugging Face, I compared the two benchmarks he discussed. These two new benchmark datasets push progress to the frontiers of human knowledge itself. They suggest milestones grounded in versatile real-world competency rather than narrow prowess. Their very difficulty could accelerate breakthroughs.

GAIA and GPQA take complementary approaches to inspecting AI through lenses of assistant competence and expert oversight. Both craft hundreds of questions unsolvable for non-specialists despite earnest effort and unconstrained access to information. GPQA draws from cutting-edge biology, physics, and chemistry, seeking problems with undisputed solutions within those communities. GAIA emphasizes multi-step reasoning across everyday tasks like gathering data, parsing documents, and navigating the web.  

The datasets highlight stubborn gaps, still yawning wide, between the most advanced systems and typical humans. GPT-4 grazes 40 percent on GPQA, lagging the 65 percent target for area specialists. Augmenting the model with an internet search tool barely budges results. Meanwhile, GAIA scores stay under 30 percent across specific challenges, compared to above 90 percent for human respondents; the systems face barriers like effectively handling multimedia information and executing logical plans.   

These diagnoses of inhuman performance could guide progress. By homing in on precise shortfalls using explicit criteria, researchers can funnel efforts to deny AI problems any enduring place to hide. Projects conceived directly from such insights might swiftly lift capacities, much as targeted medical treatments heal pinpointed ailments. In this way, GAIA and GPQA represent attainable waypoints en route to broader abilities.  

Reaching either milestone suggests unfolding mastery. Matching multifaceted personal assistants could precipitate technologies from conversational guides to robotic helpers, markedly upgrading everyday experience. Reliable oracles imparting insights beyond individual comprehension might aid in pushing back the frontiers of knowledge. Of course, with advanced powers should come progressive responsibility. But transformative tools developed hand in hand with human preferences provide paths to elevated prosperity.  

So AI benchmark datasets now stand sentinel at the gates of existing skill, barring passage to systems falling fractionally short while ushering forward those set to surpass them. Such evaluations may thus shape the trajectory of innovations soon impacting our institutions, information, and lives.


Q: How are the GAIA and GPQA benchmarks different? 

GAIA emphasizes multi-step reasoning across everyday information like images or documents. GPQA provides expert-level problems with undisputed solutions within scientific communities.

Q: Why are difficult, decaying benchmarks vital for AI progress?

They can advance integrated real-world skills by exposing precise gaps between state-of-the-art systems and human capacities.

Q: How could surpassing milestones like GAIA or GPQA impact society?   

They constitute waypoints en route to safe, beneficial technologies - from conversational aids to knowledge oracles - improving life while upholding priorities.


  • Oversight - Evaluating and directing intelligent systems accurately and accountably, even where individual knowledge limits checking outputs firsthand.  
  • Benchmark decay - The tendency for fixed benchmarks to become unchallenging as the systems they measure improve over time.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2059591 2023-12-07T19:56:58Z 2023-12-21T18:33:22Z Unlocking Linear Speed for AI Models with Mamba

Modern AI systems rely heavily on complex neural network architectures called Transformers. While powerful, Transformers have a significant weakness - they slow down drastically when processing long sequences, like documents or genomes. This limits their practical use for real-world applications.  

Enter Mamba, a new AI model that retains the power of Transformers while overcoming their Achilles heel. In a recent paper, researchers from Carnegie Mellon University and Princeton University propose a way to make sequence modeling scale linearly. That means, unlike Transformers, Mamba does not slow down significantly with longer inputs.

The Key Idea Behind Mamba

The core concept behind Mamba is a structured state space model (SSM). SSMs share traits with two classic neural network families - recurrent neural networks (RNNs) and convolutional neural networks (CNNs). They take an input sequence, pass it through an internal "state" that changes over time, and convert it to an output. Here is a small primer on these networks:

Structured State Space Models (SSMs)

SSMs model sequences by passing inputs through an internal "state" that changes over time. You can imagine the state as a container summarizing the relevant history up to a point. An SSM transforms the current input and state into a new state, which informs the following output. The critical advantage of SSMs is that their state remains compact even for very long sequences. This compressed representation allows efficient processing.
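The mechanics above can be sketched with a toy linear state-space recurrence. This is an illustrative simplification, not Mamba's actual parameterization:

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Toy linear state-space model. The fixed-size state h compresses
    the entire input history:
        h_t = A @ h_{t-1} + B @ x_t
        y_t = C @ h_t
    Cost grows linearly with sequence length."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:              # one update per sequence element
        h = A @ h + B @ x     # fold the new input into the compact state
        ys.append(C @ h)      # emit an output from the current state
    return np.array(ys)

# The state stays the same size no matter how long the sequence gets.
A = np.eye(4) * 0.9           # state transition (decaying memory)
B = np.ones((4, 1))           # input projection
C = np.ones((1, 4))           # output projection
ys = ssm_scan(A, B, C, np.ones((100, 1)))
print(ys.shape)  # (100, 1)
```

Because each step touches only the current input and the compact state, processing a sequence twice as long costs roughly twice as much - the linear scaling the article describes.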

Convolutional Neural Networks (CNNs)

CNNs are neural networks that apply the mathematical operation of convolution. In simple terms, a CNN slides a small filter matrix over the input and detects patterns in local regions. Multiple filters can activate in parallel to identify low-level motifs like edges or textures. CNNs work well for perceptual data like images, video, or audio. They are less suitable for sequential dependencies between distant elements.

Recurrent Neural Networks (RNNs)

In RNNs, the network contains loops that feed activation from previous time steps as input to the current step. This creates an implicit memory in the network to model long-range dependencies. For instance, an RNN can develop a nuanced understanding of language by remembering all the words seen up to a point. However, standard RNNs struggle with long sequences due to issues like vanishing gradients. Specialized RNN variants address this limitation.

The core concepts are essentially:

  • SSMs - Compressed state with global sequence view
  • CNNs - Local patterns
  • RNNs - Sequential modeling with internal memory

SSMs are unique because their internal state can compress information from longer sequences into a compact form. This compressed state allows efficient processing no matter the input length. Prior SSM models worked well on continuous data like audio or images. But they struggled with dense, discrete inputs like text. 

The creators of Mamba overcame this by adding a "selection mechanism" to SSMs. This lets Mamba focus only on relevant text parts, ignoring unnecessary bits. For example, when translating a sentence from English to French, Mamba would pay attention to the words while filtering out punctuation or filler words like "um."
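A rough way to picture the selection mechanism is a gate, computed from the input itself, that scales how much of each token enters the state. The snippet below is a hypothetical sketch of that idea, not the paper's actual equations:

```python
import numpy as np

def selective_step(h, x, W_gate, A, B):
    """One toy 'selective' state update: a sigmoid gate derived from
    the input x decides how strongly x is written into the state h,
    so uninformative tokens can be largely ignored."""
    gate = 1.0 / (1.0 + np.exp(-(W_gate @ x)))  # input-dependent, in (0, 1)
    return A @ h + gate * (B @ x)

rng = np.random.default_rng(0)
h = np.zeros(4)
W_gate = rng.normal(size=(4, 3))   # illustrative gate weights
A = np.eye(4) * 0.9                # state transition
B = rng.normal(size=(4, 3))        # input projection
h = selective_step(h, np.ones(3), W_gate, A, B)
print(h.shape)  # (4,)
```

In a classic SSM the transition parameters are fixed for every token; making them depend on the input, as gestured at here, is what lets the model filter out filler while keeping the linear-time recurrence.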

Another innovation in Mamba is using GPU hardware capabilities efficiently during training. This enables much larger hidden state sizes compared to standard RNNs. More state capacity means storing more contextual information from the past.

Overall, these improvements give Mamba exceptional speed and accuracy on par with or better than Transformer networks of the same complexity.

Key Facts About Mamba

  • 5x Faster Inference Than Transformers - Mamba displays over five times higher throughput than similarly sized Transformers when generating text or speech. For practical applications, this translates to much lower latency and cost.
  • Matches Bigger Transformers in Accuracy  - Empirical tests show Mamba develops a solid contextual understanding from self-supervised pretraining on large datasets. Despite having fewer parameters, it matches or exceeds bigger Transformer models on several language tasks.
  • Handles 1 Million Token Contexts  - Mamba is the first sub-quadratic model that continues improving with longer context, reaching up to 1 million tokens. Prior models degrade in performance beyond a point as context length increases. This opens possibilities for capturing more global structures, like full-length books.

Real-World Implications

Mamba's linear computational complexity unlocks myriad new applications for large language models requiring real-time responsiveness. For instance, let's think about an intelligent prompt app created using CPROMPT.AI. The natural language interface can understand instructions spanning multiple sentences and respond immediately. Streaming applications like live speech transcription also become viable. And for sensitive use cases, the whole context stays on-device without needing roundtrips to the cloud.

Another benefit is the feasibility of much bigger foundation models in the future. Training costs and carbon emissions have been critical constraints on model scale so far. Mamba's efficiency may enable models with over a trillion parameters while staying within the practical limits of computing budgets and data center energy.


  • Transformers: A neural network architecture built on self-attention that became highly popular after powering pioneering large models like GPT-3 and DALL-E.
  • Structured State Space Model (SSM): A class of seq2seq models based on dynamical systems theory, which can trade off expressiveness and computational efficiency.
  • Selection Mechanism: The method Mamba introduces to make SSM transitions input-dependent, so the model focuses only on relevant tokens.  
  • Throughput: Number of tokens processed per second. Higher is better.
  • Sub-quadratic: Algorithmic time complexity grows slower than quadratic. This includes linear and logarithmic time models.
Kabir M.
tag:blog.cprompt.ai,2013:Post/2059376 2023-12-07T07:08:30Z 2023-12-07T07:16:03Z Creating Realistic 3D Avatars from Images

Have you ever wished you could bring a photo or video of someone to life with a realistic 3D avatar that looks and moves just like them? That futuristic idea is quickly becoming a reality thanks to recent advances in artificial intelligence and computer graphics research. 

In a new paper published on arXiv, researchers from Tsinghua University, a national public university in Beijing, China, propose a method called "Gaussian Head Avatar," which can create highly detailed 3D head avatars from multi-view camera images. Their approach utilizes neural networks and an innovative 3D representation technique to model both the shape and motions of a person's head with unprecedented realism. Here is a video demonstrating this technique.

At the core of their technique is representing the 3D shape of the head using many discrete elements called "Gaussians." 

Various properties like position, color, opacity, etc., define each Gaussian. Thousands of these Gaussians are optimized to collectively form the head avatar's visible surfaces. This approach has advantages over other 3D representations when it comes to efficiently rendering high-frequency details like skin pores, strands of hair, wrinkles, etc.
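A minimal sketch of such an element might look like the following; the field names are illustrative assumptions, and the paper's actual parameterization differs in detail:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian:
    """One surface element of the avatar (hypothetical fields)."""
    position: np.ndarray   # 3D center of the element
    scale: np.ndarray      # per-axis width of the Gaussian
    rotation: np.ndarray   # orientation, e.g. a quaternion
    color: np.ndarray      # RGB appearance
    opacity: float         # blending weight in [0, 1]

# A head avatar is simply a large collection of these elements,
# whose properties are jointly optimized to reproduce the photos.
head = [
    Gaussian(
        position=np.random.rand(3),
        scale=np.full(3, 0.01),
        rotation=np.array([1.0, 0.0, 0.0, 0.0]),
        color=np.random.rand(3),
        opacity=0.8,
    )
    for _ in range(1000)
]
print(len(head))  # 1000
```

Animating the avatar then amounts to letting neural networks shift these per-element properties as expression and pose change, which is the dynamic step described below.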

The critical innovation is making the Gaussians dynamic, changing their properties based on facial expressions and head movements. This allows animating the avatar by providing images showing different expressions/poses. The animation is driven by neural networks that predict how the Gaussians need to move and change to match the provided images.

The results are extremely impressive 3D avatars rendered at 2K resolution with intricate details, even for complex expressions like an open laughing mouth. This level of photo-realism for virtual avatars opens up many possibilities for video game development, virtual reality, and visual effects for films/metaverse.  Here are some of the most exciting facts highlighted in this research:

  • Their technique needs only 16 camera views distributed across 120 degrees to capture the multi-view training data. This lightweight capture setup makes the avatar creation process much more practical.
  • The neural network predictions are regularized to avoid learning distortions not consistent across views. This forces the model to capture the actual 3D shape rather than view-dependent image transformations.
  • They designed the animation model to separately handle expressions and head movements. Expressions primarily drive the region near facial landmarks, while neck movement uses the pose. This decomposition matches how faces move.
  • A guidance model using implicit 3D representations is trained first to initialize the Gaussians before the leading training. This allows robust fitting to hair and shoulders beyond the core face region.
  • Their avatars can realistically render both subtle and extreme expressions, like a wide-open laughing mouth. The neural animation model does not suffer from the limits of traditional techniques.


  • Multi-view images: Multiple images of an object captured from different viewing angles
  • Neural Networks: Computing systems inspired by the human brain structure and capable of learning from data
  • 3D Representation: Mathematical definition of 3D shape using various primitives like point clouds, meshes, functions, etc.
  • Gaussians: Parametric surface element defined by a center and width resembling the Gaussian probability distribution
  • Rendering: Generating 2D images from the description of 3D scenes via simulation of the image formation process 
  • Implicit Representations: Defining 3D surface as a level set of a continuous function rather than explicit primitives
Kabir M.
tag:blog.cprompt.ai,2013:Post/2056087 2023-12-06T20:00:00Z 2023-12-21T18:21:19Z GPQA: Pushing the Boundaries of AI Evaluation

As artificial intelligence systems grow more capable, evaluating their progress becomes challenging. Benchmarks that were once difficult swiftly become saturated. This phenomenon clearly illustrates the rapid rate of advancements in AI. However, it also reveals flaws in how benchmarks are designed. Assessments must keep pace as abilities expand into new frontiers like law, science, and medicine. But simply pursuing tasks that are more difficult for humans misses crucial context. What's needed are milestones grounded in versatile real-world competency.

With this motivation, researchers introduced GPQA, an evaluation targeting the edge of existing expertise. The dataset comprises 448 multiple-choice questions from graduate-level biology, physics, and chemistry. Validation ensures both correctness per scientific consensus and extreme difficulty. Questions sit deliberately at the frontier of settled knowledge - designed to probe the scope of human expertise itself. Even highly skilled non-experts failed to exceed 34% accuracy despite unrestricted web access and over 30 minutes per query on average.

Such hardness tests a key challenge of AI alignment - scalable oversight. As superhuman systems emerge, humans must retain meaningful supervision. But when tasks outstrip individual comprehension, determining truth grows precarious. GPQA probes precisely this scenario. Non-experts cannot solve the problems independently, yet ground truth remains clearly defined within specialist communities. The expertise gap is tangible but manageable. Oversight mechanisms must close this divide between fallible supervisors and increasingly capable systems.

The state of the art provides ample room for progress. Unaided, large language models like GPT-4 scored only 39% on GPQA's main line of inquiry. Still, their decent initial foothold confirms the promise of foundation models to comprehend complex questions. Combining retrieval tools with reasoning took success no further, hinting at subtleties in effectively utilizing internet resources. Ultimately, collaboration between humans and AI may unlock the best path forward - much as targeted experiments should illuminate effective oversight.

As benchmarks must challenge newly enhanced capabilities, datasets like GPQA inherently decay over time. This very quality that makes them leading indicators also demands continual redefinition. However, the methodology of sourcing and confirming questions from the frontier of expertise itself offers a template. Similar principles could shape dynamic suites tuned to drive and validate progress in perpetuity. In the meantime, systems that perform on par with humans across GPQA's spectrum of natural and open-ended problems would constitute a historic achievement - and arrive one step closer to beneficial artificial general intelligence engineered hand in hand with human well-being.


Q: What key capabilities does GPQA test?

GPQA spans skills like scientific reasoning, understanding technical language, gathering and contextualizing information, and drawing insights across questions.

Q: How are the questions generated and confirmed?  

Domain-expert contractors devise graduate-level questions, explain their reasoning, and refine them based on feedback from peer experts.

Q: Why create such a difficult benchmark dataset?

Tough questions probe the limits of existing knowledge, which is crucial for overseeing more capable AI. They also decay slowly, maintaining relevance over time.


Scalable oversight: Reliably evaluating and directing advanced AI systems that exceed an individual human supervisor's abilities in a given area of expertise.

Foundation model: A system trained on broad data that can be adapted to many downstream tasks through techniques like fine-tuning; large language models are a significant class of foundation models.

Benchmark decay: The tendency for benchmarks to become outdated and unchallenging as the systems they aim to evaluate continue rapid improvement.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2059090 2023-12-06T17:08:37Z 2023-12-21T18:32:26Z The Rise of Gemini: Google's Moonshot Towards Revolutionary AI, Coming Soon to Pixel Phones

Google unveiled its most advanced AI system, Gemini, inside its Bard conversational AI. Gemini aims to push the boundaries of what artificial intelligence can do by better understanding and reasoning about the natural world.

Technically, Gemini is very impressive. Unlike GPT-3.5 and GPT-4, which were trained mainly on text, Gemini was also trained on images, audio, and video to enable more sophisticated reasoning. According to Google, it exceeds previous state-of-the-art AI on 30 of 32 benchmark tests spanning coding, math, medicine, and more. For example, it can read scientific papers and extract critical findings faster than human experts.

However, Google's hype that Gemini represents an imminent revolution has yet to match reality fully. Many complex AI problems, like reliably distinguishing truth from falsehood, still need to be solved. Google also over-promised previously with the botched launch of Bard earlier this year.

So, while Gemini represents an evolution in AI capabilities, responsible development of such powerful technology takes time. We should applaud Google's achievements with cautious optimism about real-world impact in the short term.  

Gemini re-establishes Google at the forefront of the race to develop advanced AI, a race that now includes OpenAI after the widespread buzz ChatGPT created last year. But practical benefits will likely emerge slowly over years of incremental improvement.


  • Gemini, Google's newest AI model, is touted as its most capable across language, image, audio, video, and other tasks.
  • According to Google, Gemini exceeds previous state-of-the-art AI systems like GPT-3.5 and GPT-4 in 30 of 32 benchmark categories.
  • Real-world applications include scientific insight generation, explaining complex topics like math and physics, and coding. 
  • Google is incrementally rolling out Gemini across products like Bard, Pixel phones, Search, and more over the coming year.


Q: How does Gemini work?

Gemini is a neural network trained on vast datasets, including images, audio, and video, to understand and reason across different data types. Its "multimodal" design allows it to connect insights between them.

Q: Is Gemini safe to use?  

Google claims Gemini has undergone substantial safety testing, but any AI system this complex likely still has flaws. Responsible development is an ongoing process.

Q: What are Gemini's limitations? 

Like all AI today, Gemini still struggles with fully reliable reasoning. Issues like distinguishing truth from fiction remain unsolved and require human oversight.

Q: Who can access Google's Gemini AI?

Google plans to release Gemini APIs first to select partners over 2023-2024 before making them more broadly available to developers and enterprise customers.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2056081 2023-12-05T20:00:00Z 2023-12-21T18:24:01Z GAIA: A New Benchmark for Evaluating AI Assistants

Artificial intelligence capabilities have advanced rapidly in recent years. Systems like ChatGPT show impressive fluency and knowledge across many domains. They can even outperform humans on specific professional exams. However, as AI researcher François Chollet argues, evaluating these systems remains an open challenge. Most benchmarks focus on skills like language understanding or question answering. While important, mastering such narrow metrics misses the bigger picture. 

What's needed is a test of intelligence akin to the classic Turing test. A system that exhibits true artificial general intelligence (AGI) should handle the kinds of requests humans make daily. It should seamlessly gather information, reason over evidence, and apply tools as needed. On the surface, assistant tasks appear simple. Yet they require complex planning and execution. Building systems with common sense and versatility comparable to people remains elusive.

To address this need, a team of AI experts designed GAIA. It includes over 450 real-world questions spanning personal, professional, and general knowledge domains. Queries range from finding clinical trial data to solving puzzles using website information. GAIA emphasizes abilities like:

  • Web searching and browsing
  • Understanding images, videos, and other multimedia  
  • Executing code to perform computations
  • Reading files in different formats like spreadsheets

The questions admit unambiguous factual answers, enabling automatic scoring. Still, GAIA poses a stiff challenge for current AI. Humans score above 90% across difficulty levels, while GPT-4 manages only around 30% even on simple queries.
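Because each answer is a short factual string, scoring can be fully automatic. Here is a minimal sketch of what such a scorer might look like - the function names and normalization rules are illustrative assumptions, not GAIA's actual evaluation harness:

```python
def normalize(answer: str) -> str:
    # Lowercase and keep only alphanumerics and spaces, so "Paris." matches "paris"
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def exact_match_accuracy(predictions, references):
    # One point per normalized exact match, averaged over the dataset
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Unambiguous answers make this kind of scoring trivial to run at scale, which is exactly why the benchmark avoids open-ended responses that would require human graders.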

Creating valid GAIA questions does require care. Annotators start from trusted web sources, ensuring the information won't change over time. Questions go through design and validation phases, where independent annotators verify unambiguous answers. This attention to detail is critical, as poor benchmarks risk being quickly solved or gamed rather than driving progress.

The gap between human and machine performance shows ample room for improvement. Success on GAIA requires reaching human parity across fundamental capabilities. It demands seamless integration of language understanding, reasoning, and tool use. Such an achievement would mark artificial general intelligence comparable to an average human; in that framing, GAIA tests competency one level up from systems like ChatGPT.

GAIA's methodology also hints at better evaluation paradigms for AI. Having unambiguous outputs facilitates factual scoring. Basing questions on real-world situations better captures intelligence than closed environments. As capabilities advance, the community can refine GAIA to keep pace. Over time, its questions may shift to prevent memorization while maintaining grounded challenges.

The advent of digital assistants promises to reshape daily life much like search engines did. Living up to that potential requires measured progress grounded in human contexts. Benchmarks like GAIA that test versatile, robust comprehension offer guideposts on the long road ahead. With AGI as the destination, such milestones help ensure AI develops hand in hand with human needs.


Q: What capabilities does GAIA test?

GAIA focuses on core abilities like information finding, evidence gathering, reasoning, and tool usage. This includes web searching, multimedia understanding, coding, file reading, and more.

Q: What makes GAIA different from other benchmarks?

GAIA uses real-world assistant-style questions with unambiguous answers. This allows automatic factual scoring. The questions require complex reasoning unsolved by advanced AI.

Q: How was GAIA created and validated?

Researchers designed seed questions and then trained annotators to make more. Multiple rounds of answering ensure questions have clear answers based on available information.

Q: What systems were tested on GAIA?

Humans scored over 90% on GAIA, while AI systems like GPT-4 achieved only 30% entirely unaided. Human parity across areas like web use remains a challenge.

Q: How could GAIA evaluate future systems?

GAIA provides over 450 questions and a methodology to generate more. As capabilities improve, new questions can prevent memorization while maintaining complexity.


  • AGI - Artificial general intelligence. Systems that exhibit human-level versatility and common sense across domains.
  • Benchmark - Standardized tests used to evaluate and compare AI system performance on specified tasks.
  • Gameable - Susceptible to cheating, i.e., a system finding shortcuts to give correct answers without genuinely understanding.
  • Zero-shot - Evaluating an AI system without explicit training on a benchmark's data. Tests generalization.
Kabir M.
tag:blog.cprompt.ai,2013:Post/2058497 2023-12-05T03:47:49Z 2023-12-05T16:32:04Z Neuromorphic Computing: The Next Evolution in AI Efficiency

The circumstances around Sam Altman's firing and rehiring at OpenAI remain unclear. In the absence of an official explanation, speculation and rumors have proliferated on social media.

Recently, I learned Altman's personal investment portfolio includes a company called RAIN, which aims to build neuromorphic chips. Adding intrigue, OpenAI agreed to invest in RAIN back in 2019. Some social media commenters, particularly on 𝕏 (formerly Twitter), have questioned whether this potential conflict of interest might relate to Altman's dismissal and return to OpenAI.

More a technology enthusiast than a gossip, I found the concept of neuromorphic chips more intriguing than the rumors. It was the first time I had heard the term. After some research into the emerging field, I'm fascinated by chips architecturally configured to replicate neural pathways in the brain through adaptive learning over time.

As AI advances by leaps and bounds, its mounting computational costs and inefficiencies are becoming increasingly apparent. Modern AI systems, like large language models, can require millions in computing infrastructure to train while achieving only narrow slices of biological intelligence. To push AI forward, companies like Intel believe we need a completely new computing paradigm modeled after the brain's efficient, adaptable neural signaling. 

Enter neuromorphic computing. Rather than simulate intelligence in software, neuromorphic chips are purpose-built with brain-inspired hardware to enable vast improvements in capability and efficiency. Intel recently unveiled their second-generation "Loihi 2" neuromorphic research chip, representing years of innovation in materials, circuits, algorithms, and software to realize this technology's potential. This post will dig deeper into the biological roots, hardware optimizations, and remaining challenges for Intel's neuromorphic vision.

The Computational Chasm Between Brains and Computers   

Our brains can perform feats of intelligence, creativity, and control that dwarf even the most advanced AI systems. With just 20 watts of power, the brain's 100 trillion synapses handle visual processing, motor control, planning, emotions, common sense, and general reasoning that remains leagues ahead of computers. Yet today's AI relies on giant datasets and data centers consuming megawatts of electricity to crudely approximate narrowly defined subsets of intelligence. 

What accounts for this vast gap between biological and artificial intelligence? Today's computers are built on an abstract digital framework that introduces massive inefficiencies when emulating the analog signaling and computation of natural cognition. Brains do not perceive the world or think in 1s and 0s, nor do they perform the matrix math that underpins artificial neural networks. Instead, information is encoded in electrical spikes transmitted between neurons across synapses. Senses generate spiking input patterns, which the neural network interprets into spiking motor signals to take action. No software simulations or abstractions are required!
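To make the spiking idea concrete, here is a toy leaky integrate-and-fire neuron - a hedged sketch of the textbook model, not Intel's Loihi circuitry, and the parameter values are arbitrary:

```python
def lif_neuron(input_current, threshold=1.0, leak=0.9):
    """Toy leaky integrate-and-fire neuron: the membrane potential leaks over
    time, integrates incoming current, and fires a spike on crossing threshold."""
    v = 0.0
    spikes = []
    for current in input_current:
        v = leak * v + current      # leaky integration of the input
        if v >= threshold:          # threshold crossing emits a spike
            spikes.append(1)
            v = 0.0                 # reset the membrane potential
        else:
            spikes.append(0)
    return spikes
```

Note that the output is a sparse stream of discrete events rather than continuous values - information lives in the timing of spikes, which is the property neuromorphic hardware exploits for efficiency.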

By more directly mimicking the form and function of biological neural systems in hardware, neuromorphic chips can achieve unprecedented improvements in speed, adaptability, and energy efficiency. Intel estimates that neuromorphic designs like Loihi 2 demonstrate thousands of times better energy efficiency than CPUs and GPUs for workloads like gesture recognition and optimization. The critical question is whether these gains observed in limited prototypes can scale up for commercial viability across applications.

Three Years of Loihi 1: Validating Neuromorphic Computing

Intel launched the first Loihi chip in 2018 to support internal research and over 140 academic and industry partners exploring neuromorphic applications. Loihi 1 incorporated innovations like asynchronous signaling, hierarchical mesh networks, programmable synaptic learning rules, and spiking neural models to allow event-based, brain-inspired information processing.

Over three years, Loihi 1 demonstrated breakthrough efficiency across workloads, including robotic control, sensory perception, planning problems, and more. For instance, projects have shown Loihi 1 adapting motor signals to improve robotic arm manipulation using 1000 times less power than GPU-based methods. These results provide a glimpse into the potential of neuromorphic hardware once scaled up.

However, Loihi 1 also exposed limitations like supported neural models, on-chip learning capabilities, chip-to-chip communication bandwidth, and software infrastructure that have hindered broader adoption. Loihi 2 aims to address these gaps with extensive hardware and software innovations.  

Pushing Neuromorphic Computing Forward with Loihi 2

Built on an advanced "Intel 4" process, the Loihi 2 chip packs in significant improvements to previous neuromorphic designs:

  1. Flexible neural models - Fully programmable spiking neurons use a code-based model to support more biologically accurate and capable network behaviors.
  2. Enhanced on-chip learning - Generalized rules allow neurons to incorporate local or global feedback to adapt online through backpropagation-style algorithms standard in deep learning. 
  3. Increased capacity - Transistor optimizations provide up to 160x higher synapse density, plus architectural innovations like convolution and stochastic connectivity allow larger on-chip workloads. 
  4. Faster signaling - Asynchronous neuromorphic circuits have been re-engineered for up to 10x faster neuron updates and communication. This allows complex inference and learning applications to run in real time.
  5. Improved scalability - New features alleviate bottlenecks when linking multiple Loihi chips to construct large systems. Inter-chip broadcasts also dramatically improve bandwidth utilization.
  6. Mainstream interfaces - Support for standard protocols like Ethernet and GPIO enables easier integration with conventional hardware and sensors.
  7. Lava software framework - This open-source framework includes tools to map algorithms onto neuromorphic hardware, simulate performance, and deploy across platforms. Lava encourages standardization and convergence in neuromorphic software development.

These Loihi 2 enhancements aim to scale up neuromorphic capabilities and use cases while lowering the barriers to leveraging this futuristic hardware. Rather than simply demonstrating potential, Intel seems focused on maturing the technology for real-world deployment.

Application Targets: Low-Power, Resilient Edge Intelligence 

While still emerging, neuromorphic computing appears well-matched to applications with tight latency, throughput, power, and resilience constraints unsuitable for conventional hardware. Use cases could include:

  • Efficient smart sensors for mobile and IoT devices
  • Real-time safety systems for drones and robots
  • Continually learning automation in complex environments 
  • Optimizing scheduling and planning problems
  • Accelerating specialized AI workloads 

In many edge settings, the ability of neuromorphic chips to quickly adapt to limited data is more valuable than maximizing abstract accuracy metrics. Loihi 2 brings this technology closer to supporting such always-on intelligent agents.

Competitive Landscape: Diverse Neuromorphic Approaches  

Intel is not alone in developing neuromorphic hardware. IBM offers the TrueNorth research chip, while startups like BrainChip and Rain Neuromorphic have their own spins on brain-inspired designs. Each approach has tradeoffs - Intel's Loihi 2 may support more advanced on-chip learning than TrueNorth while lagging analog solutions like Rain in potential efficiency.

Rather than a winner-takes-all competition, Intel believes collaboration across different neuromorphic methods is crucial to overcoming adoption challenges. They provide Loihi access to academic partners and hope to encourage convergence around common frameworks with Lava. If neuromorphic technology succeeds in carving out a niche in AI's expanding ecosystem, there may be room for multiple players.  

The Road Ahead: Years to Commercialization, Promise of Specialized AI

Though optimistic about neuromorphic computing's prospects, Intel concedes widespread commercial deployment could take years to materialize. Current Loihi 2 systems serve more as a proving ground for research than market-ready products. However, by spearheading hardware and software innovation alongside an open ecosystem of partners, Intel aims to transition neuromorphic technology from prototypes to commercial solutions smoothly.

Potential milestones on this roadmap include introducing neuromorphic co-processors to accelerate niche workloads, expanding to server-scale designs in the data center, and eventually even integrating brain-inspired processing into mainstream system architectures. While general human-level artificial intelligence remains distant, narrow applications of specialized AI seem well within reach for neuromorphic technology.

A Brain-Inspired Revolution in Computing Efficiency

By more closely mimicking biological rather than digital computation, neuromorphic chips offer radical improvements in capabilities like speed, adaptability, and efficiency critical for advanced intelligence. While past attempts have fallen short, Intel's Loihi research chips and the maturing ecosystem provide genuine hope that neuromorphic computing can successfully transition to commercial viability. Extending today's AI revolution with brain-like hardware could give the next breakthrough in specialized artificial intelligence.

With Loihi 2, Intel has reinforced its leadership in advancing this futuristic technology. Ultimately, realizing its full disruptive potential will likely hinge on collaboration across the improving hardware, accumulating use cases, and converging open software in years to come. The Cambrian AI explosion towards increasingly capable and efficient neural processing shows no signs of slowing down, thanks to initiatives like Intel's neuromorphic computing program!

Kabir M.
tag:blog.cprompt.ai,2013:Post/2061671 2023-12-04T20:00:00Z 2023-12-12T16:29:23Z Unlocking the Black Box: How Transformers Develop In-Context Learning

Most people using ChatGPT, Claude, or Bing neither know nor care that there is a core technological breakthrough behind these chatbot systems -- Google's innovation of the decade -- the Transformer architecture for natural language processing (NLP) used by large language models (LLMs).

Transformers have become the state-of-the-art in natural language processing, powering these chatbots, search engines, etc. But how exactly do these complex neural networks work? A new paper, "Birth of a Transformer: A Memory Viewpoint," peeks inside the black box to uncover fascinating insights. 

The paper introduces an ingenious synthetic dataset that allows researchers to carefully study how transformers balance learning from data patterns (global knowledge) versus knowledge provided in a specific context. Through detailed experiments on a simplified 2-layer transformer, the authors make several discoveries about how the network incrementally develops abilities like in-context learning. 

Their critical insight is to view the transformer's weight matrices as "associative memories" that store particular input-output pairs. Combined with theoretical analysis, this memory perspective clarifies how inductive biases emerge in self-attention and why the transformer architecture is so effective.

Top Takeaways on How Transformers Tick

  • Transformers first grasp global statistics and common data patterns before slower in-context learning develops. The global knowledge forms a strong baseline, which context then tweaks.
  • In-context prediction skills are enabled by an "induction head" mechanism spanning two attention heads. The first head copies relevant tokens, while the second uses that signal to anticipate what comes next in context. 
  • Weight matrices learn via gradient descent to behave like fast associative memories, storing associations between input and output embeddings. This emergent memorization ability fuels context learning.
  • Learning progresses top-down, with later layers training first to direct earlier layers where to focus. Feedback cycles between layers accelerate the acquisition of abilities.
  • Data distribution properties significantly impact how quickly the network picks up global versus in-context patterns. More diversity speeds up learning.

The memory viewpoint meshes nicely with what we already know about transformers. Self-attention layers select relevant tokens from the context, while feedforward layers leverage global statistics. The new perspective offers a unified framework for understanding how different components cooperate to balance these two crucial knowledge sources. 
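The associative-memory view can be illustrated with a few lines of linear algebra - a hedged toy demonstration of the general idea, not the paper's experimental setup; the dimensions and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 256, 5

# Random high-dimensional embeddings are nearly orthogonal to one another
inputs = rng.standard_normal((n_pairs, d)) / np.sqrt(d)
outputs = rng.standard_normal((n_pairs, d)) / np.sqrt(d)

# Store every input-output pair in a single matrix as a sum of outer products
W = sum(np.outer(v, u) for u, v in zip(inputs, outputs))

# Retrieval: multiplying by a stored input approximately recovers its output,
# because the cross terms between near-orthogonal embeddings nearly vanish
retrieved = W @ inputs[2]
similarities = outputs @ retrieved
best = int(np.argmax(similarities))  # index 2: retrieval finds the right pair
```

Gradient descent on a weight matrix can push it toward exactly this kind of sum of outer products, which is why the paper treats learned weights as fast associative memories.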

A Birth Story for Context Learning 

Concretely, the researchers designed a particular bigram language modeling task where some token pairs were globally consistent while others depended on the specific sequence. For instance, the pairing "Romeo & Juliet" might be typical, but a particular context could feature "Romeo & Ophelia". 

The transformer needs to learn global bigram statistics while also spotting in-sequence deviations. The authors witness the incremental development of context-handling abilities through careful probing of network activations during training. 
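A hypothetical recreation of such a dataset can clarify the setup - the names and probabilities below are invented for illustration, and the paper's actual construction differs in detail:

```python
import random

def make_sequence(length=8, p_override=0.5, rng=random):
    """Toy sequences where 'romeo' is usually followed by the global partner
    'juliet', but some sequences consistently substitute 'ophelia' instead."""
    vocab = ["romeo", "hamlet", "macbeth", "othello"]
    # The partner is chosen once and then held fixed for the whole sequence
    partner = "ophelia" if rng.random() < p_override else "juliet"
    seq = []
    for _ in range(length):
        token = rng.choice(vocab)
        seq.append(token)
        if token == "romeo":
            seq.append(partner)  # the in-sequence rule is consistent within a context
    return seq
```

A model that only learns global statistics will always predict "juliet"; in-context learning means noticing, within a single sequence, which partner this particular context actually uses.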

They introduce frozen randomness and simplifications like fixed embeddings to spotlight the emergence of crucial functionality in individual components. For example, the output weight matrix learns correct associations even when attention is uniform, creating a "bag-of-words" representation. The attention then gradually focuses on relevant tokens.

This stage-by-stage view reveals learning dynamics within transformers that prevailing theory struggled to explain. We witness clear "critical periods" where certain subskills develop before others can bootstrap.

The researchers mathematically confirm the cascading self-organization by tracking how gradients modify the weight matrices toward target associative memories. The theory corroborates the empirical findings on birth order, illuminating why later layers train first and how feedback between layers accelerates acquisition. So, in creating this miniature toy model of transformer development, the paper delivers valuable insights into how more complex language models learn abstract patterns, adapt to novel contexts, and balance different knowledge stores.


Q: What is an "induction head" in transformers?

An "induction head" is a mechanism inside transformers spanning two attention heads, enabling in-context learning. The first head copies relevant tokens from the context, while the second head uses that signal to anticipate the next token. This mechanism develops during transformer training.
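The copying-then-predicting behavior can be mimicked with a toy lookup rule - a drastic simplification of what the two attention heads jointly compute, not actual transformer code:

```python
def induction_predict(tokens):
    """Toy induction rule: find the most recent earlier occurrence of the
    final token and predict whatever followed it last time."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]  # head 1 locates the match, head 2 copies its successor
    return None                   # no earlier occurrence: fall back on global statistics
```

Given the context ["romeo", "ophelia", ..., "romeo"], this rule predicts "ophelia" even when global statistics favor a different completion - exactly the in-context override the paper studies.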

Q: How do weight matrices enable context learning?

The paper argues that weight matrices in transformers learn to behave like fast "associative memories" that store associations between input and output embeddings. This emergent ability to quickly memorize functional patterns fuels the model's capacity to adapt predictions based on context.

Q: Why does global learning tend to precede in-context learning?

Transformers first pick up on broader statistical patterns and common data regularities. This global knowledge forms a strong baseline. Then, later in training, the model begins layering the ability to tweak predictions based on the specific context on top of that baseline. So, global learning comes first to establish a foundation.  

Q: How does the training data distribution impact learning?

The diversity and properties of the training data distribution significantly impact how quickly the model picks up global versus in-context statistical patterns. More diversity in the data speeds up the learning of global and context-dependent knowledge.

Q: How could these insights help improve transformers?

The memory perspective and insights into staged learning could help developers better optimize transformers by shaping training data, pruning redundant attentions appropriately as skills develop, guiding layer-wise skill acquisition, and better balancing different knowledge stores like global statistics vs. context.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2056920 2023-12-04T20:00:00Z 2023-12-21T18:24:15Z Unleashing the Power of AI to Edit Images with a Few Words

Have you ever wanted to tweak a photo to change a person's appearance, add or remove objects, or give it an entirely different look but needed more advanced image editing skills? Emerging AI capabilities allow anyone to perform these edits with just a text description. 

A team of researchers from TU Darmstadt recently unveiled an AI technique called LEDITS++ that lets users make versatile image edits by giving the AI simple text instructions. Thanks to LEDITS++, editing photos is as easy as typing "add sunglasses and a hat" or "make it look like a painting."

The Power of Diffusion Models 

LEDITS++ builds on a category of AI models known as diffusion models. These models can generate highly realistic synthetic images from text prompts. However, directly editing real photos with diffusion models has been challenging up until now.
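Diffusion generators are typically steered by classifier-free guidance, which blends an unconditional noise prediction with a text-conditioned one. A schematic of that standard formula - a general sketch, not LEDITS++'s specific editing procedure:

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the text-conditioned one to strengthen prompt adherence."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With a scale of 1.0 the output equals the conditional prediction; larger scales amplify the influence of the text prompt at each denoising step.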

Previous editing attempts changed too much of the original photo or lacked the precision to make focused edits. The LEDITS++ method overcomes these hurdles with a lightweight yet surprisingly powerful approach. It edits images precisely while faithfully reconstructing the unchanged portions.

According to lead author Manuel Brack, "LEDITS++ facilitates versatile yet precise textual image manipulation with diffusion models." The method keeps edits focused only on relevant image regions based on the text prompt. For example, adding "sunglasses" will only modify a person's eyes and nose area, leaving everything else unchanged.

Limitless Editing Possibilities

The LEDITS++ technique supports an endless array of possible edits:

  • Add, remove, or replace objects 
  • Alter facial features and attributes  
  • Apply artistic filters and styles
  • Composite images by splicing together elements from multiple photos

Remarkably, LEDITS++ handles even simultaneous, multi-concept edits with ease. For instance, it can add glasses, a hat, and a smile to the same face in a single pass. The AI restricts each text-based edit to the appropriate region to avoid interference.

So, if you want to put your friend on a beach or turn your cat purple, LEDITS++ now makes it possible without specialized skills.

A Streamlined Workflow Powered by AI

The researchers designed LEDITS++ as an intuitive tool to augment human creativity. It eliminates the need for extensive manual tweaking or tuning. The method performs editing tasks in real-time, enabling rapid iteration. As lead author Brack explains: "We introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps." By offloading the heavy lifting to AI, LEDITS++ opens up creative possibilities for amateurs and professionals alike. It puts simple yet powerful image editing into anyone's hands.

The accessibility of this technology aligns with CPROMPT.AI's goal of allowing general users to build and share AI-powered apps. With CPROMPT.AI, you can turn an AI model like LEDITS++ into a customized web application and make image editing available to friends, family, coworkers, clients, and more.  

The Future of AI-Assisted Creativity

As algorithms progress, AI promises to augment human creativity in unprecedented ways. Methods like LEDITS++ demonstrate that advanced generative models can perform magical feats with minimal effort. These innovations foreshadow a future in which AI acts as a creative partner rather than just a tool.

While LEDITS++ focuses specifically on image editing, its paradigm could ultimately generalize to other creative domains. We may one day have AI systems that provide intuitive assistance with writing, music composition, graphic design, and more based on high-level user guidance. Such technology will further democratize creativity and help unlock every individual's unique creative potential. However, responsible stewardship remains imperative as these models can be misused to spread misinformation or inappropriate content. 

As we march toward this AI-enabled creative future, platforms like CPROMPT.AI will empower everyday users to harness these technologies for good. With some guidance text and a few clicks, you'll soon be able to flex your creativity like never before!


  • Diffusion models: AI models that create images by starting with random noise and enhancing the result over successive steps to match a text description.
  • Inversion: Working backward from a desired output to identify inputs that will produce that result. 
  • Classifier-free guidance: A technique that steers a diffusion model by blending its text-conditioned and unconditional predictions, removing the need for a separate classifier model.
  • Cross-attention: A mechanism in neural networks that identifies which parts of the input correlate with specific elements of provided guidance. 
Kabir M.
tag:blog.cprompt.ai,2013:Post/2058019 2023-12-03T18:40:34Z 2023-12-06T21:43:37Z The Billionaire Battle to Control Artificial Intelligence

Recently, I read a New York Times article titled Ego, Fear, and Money: How the A.I. Fuse Was Lit, which inspired the following post in a pseudo-timeline fashion to capture some of the notable happenings mentioned in that article.

Those of us in the tech industry have recently watched artificial intelligence explode from a science-fiction pipe dream into one of the most transformational technologies of our time. Companies pour billions into pursuing these capabilities, enticed by visions of tremendous profits and power if they can lead this new computational arms race.

Yet a parallel question runs alongside: will AI uplift humanity or destroy it? Prominent technologists sound warnings even as they rush to stake their claims. For them, the race is not just about greed - it is about survival. They believe that only by directing the process themselves can catastrophe be averted. But can they stay in control?

This cognitive dissonance between existential concern and unrestrained ambition has its roots in 2010 with the founding of DeepMind, the UK startup that ignited today's frenzy. Backed by billionaires like Elon Musk and Peter Thiel, DeepMind's mission was to pioneer safe artificial general intelligence (AGI) – AI that mimics human thinking. But good intentions quickly collided with corporate realities.

What followed over the next decade was a saga of clashing egos and philosophies amongst the tech elite obsessed with shaping AI in their image. Everything accelerated in 2022 when systems like GPT-3 and ChatGPT displayed abilities previously believed to be decades away. The trickle has become a flood – vast capital and talent are now sucked into this race every day, heedless of risks in the thirst for advantage.

Where will it end? Can ethics and oversight restrain the towering hubris and hostility fueling this technological arms race? The window to change course closes rapidly as billionaires vie for godlike powers of creation. But blind ambition could also spark conflagrations beyond any mortal's ability to control. The battle for the soul of artificial intelligence has only just begun.

2010: The Birth of DeepMind

In 2010, Demis Hassabis and his colleagues secured funding from Peter Thiel to launch DeepMind, an AI startup aimed at building "artificial general intelligence," or AGI. They believed that while AI posed risks, they were uniquely positioned to develop the technology safely.

Over the next two years, Hassabis built ties with Musk and impressed Larry Page with DeepMind AI systems that could learn to play Atari games. Seeing the promise, Google and Facebook soon entered a bidding war to acquire the London-based startup.

2012: The Talent Auction

In 2012, Geoffrey Hinton and his students published a breakthrough paper showing that neural networks could accurately recognize objects like flowers and dogs, sparking global interest in deep learning. Baidu offered Hinton's team $12M, which he declined.

But this set the stage for a "talent auction" at an AI conference at Lake Tahoe later that year. Google and Microsoft engaged in a bidding war for Hinton's team that ended with Google's $44M offer being accepted. Mark Zuckerberg also began aggressively recruiting for Facebook's AI lab.

2014: The Lost Ethics Board 

As the talent war accelerated, Hassabis decided that selling DeepMind was necessary to retain talent. After insisting on ethics safeguards, he sold DeepMind to Google for $650M in 2014, beating a higher bid from Facebook. The deal included an independent ethics board that Musk, given his stake, helped convene.

But after DeepMind's AlphaGo beat the world's top Go player, Lee Sedol, in 2016, shocking the community with its progress, the ethics board never met again. Hassabis tried but failed to regain independence from Google in 2017.

2015: The Breakup  

Frustrated over losing control of DeepMind, Musk broke from Page and helped launch the non-profit AI lab OpenAI in 2015, poaching key Google talent. But after tensions over pace and commercialization, Musk split and took his funding with him in 2018.

OpenAI then turned to Microsoft for $1B in funding, upsetting researchers like Dario Amodei over a perceived deprioritization of ethics and safety. This led Amodei and others to leave OpenAI and found a new company, Anthropic, in 2021.

2022: The Reveal 

Despite the talent departures, OpenAI continued to progress rapidly in secret. In August 2022, the company revealed GPT-4 to Bill Gates, shocking him as it aced an advanced biology exam and demonstrated critical-thinking abilities. Microsoft went on to embed the technology in Bing and other products.

Just months later, in November 2022, OpenAI publicly unveiled ChatGPT. User growth exploded instantly, taking the AI world by storm and resetting the technology landscape. OpenAI's valuation soon climbed past $80 billion, though internal tensions remained amid distrust.

The Present: An Unabated Arms Race  

As 2023 begins, the AI arms race set in motion over the past decade continues unchecked. Despite endless warnings, mistrust has compelled technologists and investors to plunge headlong into developing ever-more robust systems, hoping to dictate the terms of AI before someone else does.

Page races to catch up to OpenAI's sudden progress with Google's Bard chatbot after long dismissing such concerns. Musk and Altman's partnership lies in tatters as OpenAI transforms from its non-profit origins into one of the world's most valuable startups. Others like Anthropic and Meta also aim to stake their ground.

The future remains deeply uncertain. Will this technology elevate humanity or destroy it? Can ethics and priorities change course? As AI capabilities accelerate beyond expectations, the opportunity to meaningfully address risks slips further away. Powerful systems operate opaquely, beyond understanding or control.

For over a decade, the architects of this present terrain have been locked in self-interested competition while resisting regulations or limits. But the fuse lit by egos, distrust, and unchecked ambition continues to burn brighter. Billionaires race to erect their version of the future, heedless of what emerges for humankind when their creations exceed mortal grasp. Only then, too late, will the total costs of their hubris become clear.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2057400 2023-12-02T02:21:04Z 2024-01-20T18:30:52Z Turkey-Shoot Clusterfuck: OpenAI @Sama Saga and Lessons Learned

The drama surrounding artificial intelligence startup OpenAI and its partnership with Microsoft has all the hallmarks of a Silicon Valley soap opera. OpenAI's board abruptly fired CEO and co-founder Sam Altman last month, setting off a behind-the-scenes crisis at Microsoft, which has invested billions in the AI firm's technology.  

OpenAI has been at the leading edge of AI innovation, captivating the public last year with the launch of ChatGPT. This conversational bot can generate essays, poems, and computer code. Microsoft saw integrating OpenAI's technology into its software as key to upgrading its products and competing with rivals Google and Amazon in the red-hot AI race.  

The two companies forged an extensive partnership, with Microsoft investing over $10 billion into OpenAI. This collaboration led Microsoft to launch one of its most ambitious new products in years – a suite of AI "copilots" embedded into Word, Excel, and other Microsoft productivity tools. 

Dubbed Office Copilots, these AI assistants can write documents, analyze spreadsheets, and complete other tasks by having natural conversations with users. Microsoft planned a slow, phased introduction of this potentially transformative technology, first to select business customers and then gradually to millions of consumers worldwide.

Behind the scenes, however, tensions mounted between Altman and OpenAI's board. Altman is a classic Silicon Valley leader – visionary, ambitious, controlling. OpenAI's academic and non-profit-minded directors eventually clashed with Altman's hard-driving style.

So, without warning, OpenAI's board fired Altman. Stunned Microsoft CEO Satya Nadella learned of the move just minutes before the public announcement. Despite owning 49% of OpenAI, Microsoft had not been consulted on leadership changes at its AI partner.

The news set off unrest behind the scenes. Blindsided Microsoft executives urgently met to chart a response. OpenAI employees threatened mass resignations, with its chief technology officer quitting immediately. Recriminations flew externally over what one journalist called "idiocy" and "cloddery" by OpenAI's directors.

Microsoft swiftly developed contingency plans to navigate the crisis. It first supported OpenAI's interim CEO while seeking Altman's reinstatement. But the silent board refused to provide details or reverse course.

Microsoft then leveraged its power to reinstall Altman or rebuild OpenAI directly within Microsoft. As leadership paralysis worsened at OpenAI, Microsoft made its boldest play – inviting Altman to lead a lavishly funded new AI lab inside Microsoft.

OpenAI's entire staff essentially revolted, signing a petition threatening to join Altman at Microsoft unless OpenAI's board resigned and Altman was restored as CEO. Within 48 hours, Microsoft's nuclear option worked – the humbled OpenAI directors relented and reinstated Altman.

The saga illuminated challenging issues around developing AI responsibly. What's the right balance between unleashing progress and imposing caution? Can startups govern unprecedented technologies prudently? Does public transparency help or heighten risks?

Behind Microsoft's response was executive Kevin Scott, the company's chief technology officer. Having grown up poor in rural Virginia, Scott knew firsthand how technology could empower or polarize. He became determined to make AI "level the playing field" by making it accessible to ordinary people through natural conversation.

Scott quickly aligned with OpenAI's mission to ensure AI broadly benefits humanity. He respected OpenAI's talented staff, including optimistic chief scientist Ilya Sutskever, who fervently believes AI will soon solve humanity's most significant problems. Scott also connected with OpenAI chief technology officer Mira Murati over similarly humble backgrounds. Raised amid chaos in war-torn Albania, Murati learned perseverance against long odds, which instilled a balanced optimism: progress is possible, but only with thoughtful safeguards in place.

Such optimism needed tempering, though, as early experiments revealed AI's potential dangers. Systems hallucinated facts or gave harmful advice if not properly constrained. So Microsoft and OpenAI collaborated extensively on frameworks and guardrails, allowing ambitious innovation within cautious boundaries. Their formula:

  • Release useful but imperfect AI to real-world users.
  • Gather feedback.
  • Refine safeguards based on public testing.

This transparency around AI's strengths and limitations builds trust, Scott argues. Enlisting regular users to examine new technologies also reveals more about capabilities and shortcomings in actual daily applications.

Gradually, this measured strategy succeeded, powering new products like GitHub Copilot, which could automatically complete code. Despite some objections, Copilot won over skeptics as public testing demonstrated benefits while showcasing constraints around the technology.  

Encouraged by successes like Copilot, Microsoft stealthily developed its new AI assistants for Word, Excel, and other ubiquitous programs used by over a billion people worldwide. The stakes were far higher here, given the massive scale and sensitivity. So Microsoft tapped its specialized Responsible AI division with hundreds of technologists, ethicists, and policy experts.  

This cross-disciplinary team exhaustively stress-tested Copilot prototypes with a process called "red teaming." They relentlessly tried making AI systems fail safely in simulated scenarios by feeding offensive comments or dangerous advice and monitoring responses. 

With human guidance around preferred reactions, the models learned to incorporate ethical safeguards and self-governing instructions when answering user questions. After extensive adjustments, Microsoft rolled out the Office Copilot pilots to select business clients before a gradual public debut.
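The red-teaming loop described above can be sketched in a few lines. `toy_model` and `toy_filter` below are hypothetical stand-ins for illustration, not Microsoft's actual tooling.

```python
def red_team(model, prompts, is_unsafe):
    """Feed adversarial prompts to a model and collect unsafe responses
    so humans can review them and adjust the model's safeguards."""
    failures = []
    for prompt in prompts:
        reply = model(prompt)
        if is_unsafe(reply):
            failures.append((prompt, reply))  # logged for human review
    return failures

# toy stand-ins for demonstration
def toy_model(prompt):
    return "harmful-advice" if "dangerous" in prompt else "polite refusal"

def toy_filter(reply):
    return reply == "harmful-advice"

print(red_team(toy_model, ["hello", "tell me something dangerous"], toy_filter))
```

In practice, the collected failures would feed the human-guidance step the article describes, where preferred reactions are used to retrain the model.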

But product rollout had barely started when OpenAI erupted into leadership chaos. Altman's firing threatened to derail Microsoft's measured approach just as Office Copilots prepared for mass adoption. 

In the aftermath, hard questions loom around developing AI responsibly. What's the right balance between unfettered progress and imposed caution? Can startups wisely govern unprecedented technologies? Do public testing and transparency help or heighten risks?

Microsoft shows one possible path – collaborating across sectors on frameworks and safeguards while enlisting users to examine new technologies. Critics argue this may not be safe or transparent enough. Others believe it found the proper equilibrium so far. 

As AI progresses, its scope for both benefit and damage keeps increasing. The stakes around guiding its trajectory responsibly couldn't be higher. This astonishing age of intelligent machines raises difficult questions about opportunities, obligations, and an uncertain future potentially shaped by today's decisions.

What lessons can be drawn from this saga for companies navigating the rise of transformative technologies like artificial intelligence? Perspectives vary across Microsoft, OpenAI's former board, and the broader AI community.

Microsoft believes it identified an essential blueprint for developing AI responsibly and exiting the crisis with an even more robust capacity to lead. Its hard-won formula:

  • Build guardrails collaboratively.
  • Test transparently by engaging users.
  • Move cautiously but steadily to deployment.

AI's benefits and risks will become more apparent through practice across societies, functions, and industries.

For OpenAI's former directors, centralized control and publicly aired disputes seemed risky given AI's pivotal emergence; they sought more discretion by ousting Altman. The board learned, however, that its unilateral surprise move wrongly ignored critical constituents like partners and staff. Independent oversight is vital, but procedural prudence matters too.

Parts of the broader technology universe still clamor for more public deliberation around AI's collective impacts, or for slower adoption to digest societal implications. Some argue that approaches like Microsoft's remain too opaque about internal testing and the panels forming policy. Others counter that this incremental approach has found balance so far – ambitious innovation tempered by gathered feedback.

If anything is clear, it is that governing globe-spanning technologies that evolve daily is confounding. Multi-stakeholder collaboration helps check tendencies like short-termism, insularity, and the marginalizing of public interests. But cooperation gets messy among startups disrupting, corporations scaling, and academia deliberating.

Technical systems that centralize power or limit accountability also risk compounding historic inequities. So, in this vast transition, one lesson may be prudence about anyone's claim to have all the answers. Given technology's complexity and pace of change, humility itself may be the wisest path forward.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2057342 2023-12-01T22:47:37Z 2023-12-21T18:18:42Z The Promise of Seamless Cross-Language Communication

I am very interested in text-to-speech, speech-to-text, and speech-to-speech (translating one language to another), and I closely follow the Whisper project, the only open-source project out of OpenAI. When Dr. Yann LeCun recently shared a speech-to-speech project called SeamlessExpressive on 𝕏 (formerly Twitter), I wanted to try it out. Here is my video of testing it using the limited demo on their site:

I don't speak French, so I'm not sure how it came out from a translation and expression point of view, but it seems interesting. I tried Spanish as well, and it seemed to work the same way.

This project, called Seamless, developed by Meta AI scientists, enables real-time translation across multiple languages while preserving the emotion and style of the speaker's voice. This technology could dramatically improve communication between people who speak different languages.

The key innovation behind Seamless is that it performs direct speech-to-speech translation rather than breaking the process into separate speech recognition, text translation, and text-to-speech synthesis steps. This unified model is the first of its kind to:

  • Translate directly from speech in one language into another.  
  • Preserve aspects of the speaker's vocal style, like tone, pausing, rhythm, and emotion.
  • Perform streaming translation with low latency, translating speech as it is being spoken rather than waiting for the speaker to finish.

Seamless was created by combining three main components the researchers developed: 

  • SeamlessM4T v2 - An improved foundational translation model covering 100 languages.  
  • SeamlessExpressive - Captures vocal style and prosody features like emotion, pausing, and rhythm.
  • SeamlessStreaming - Enables real-time translation by translating speech incrementally.  

Bringing these pieces together creates a system where a Spanish speaker could speak naturally, conveying emotion through their voice, and the system would immediately output in French or Mandarin while retaining that expressive style. This moves us closer to the kind of seamless, natural translation seen in science fiction.
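The architectural difference can be sketched abstractly: a cascaded pipeline discards vocal style at the transcription step, while a direct model can carry it through. Everything below is a toy illustration with made-up functions, not the real Seamless API.

```python
TRANSLATIONS = {"hola mundo": "hello world"}  # toy phrase table

def asr(audio): return audio["text"]                      # transcription drops prosody
def mt(text): return TRANSLATIONS.get(text, text)         # text-to-text translation
def tts(text): return {"text": text, "style": "generic"}  # synthesis, generic voice

def cascaded_translate(audio):
    # three separate models chained together; vocal style is lost at step 1
    return tts(mt(asr(audio)))

def direct_translate(audio):
    # one end-to-end model: the target speech can inherit the source prosody
    return {"text": mt(asr(audio)), "style": audio["style"]}

src = {"text": "hola mundo", "style": "excited"}
print(cascaded_translate(src))  # translated, but in a generic voice
print(direct_translate(src))    # translated with the "excited" style preserved
```

The toy version makes the trade-off visible: both paths produce the same words, but only the direct path retains the speaker's style, which is what SeamlessExpressive is built to preserve.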

Overcoming Key Challenges

Creating a system like Seamless required overcoming multiple complex challenges in speech translation:  

Data Scarcity: High-quality translated speech data is scarce, especially for preserving emotion/style. The team developed innovative techniques to create new datasets.  

Multilinguality: Most speech translation research focuses on bilingual systems. Seamless translates among 100+ languages directly without needing to bridge through English.

Unified Models: Prior work relied on cascading separate recognition, translation, and synthesis models. Seamless uses end-to-end speech-to-speech models.  

Evaluation: New metrics were created to evaluate the preservation of vocal style and streaming latency.

The impacts of having effective multilingual speech translation could be immense in a world where language continues to divide people. As one of the researchers explained:

"Giving those with language barriers the ability to communicate in real-time without erasing their individuality could make prosaic activities like ordering food, communicating with a shopkeeper, or scheduling a medical appointment—all of which abilities non-immigrants take for granted—more ordinary."

Kabir M.
tag:blog.cprompt.ai,2013:Post/2057309 2023-12-01T19:14:58Z 2023-12-01T21:24:00Z Amazon AWS Re:Invent 2023 Ushers in a New Era for AWS

AWS recently held its annual re:Invent conference, showcasing exciting new offerings that demonstrate the company's continued leadership in cloud computing and artificial intelligence. This year's event had a strong focus on how AWS is pioneering innovations in generative AI to provide real business value to customers.

CEO Adam Selipsky and VP of Data and AI Swami Sivasubramanian headlined the event, announcing breakthrough capabilities spanning hardware, software, and services that mark an inflection point for leveraging AI. AWS is committed to progressing generative AI from leading-edge technology into an essential driver of productivity and insight across industries.

Highlights from Major Announcements

Here are some of the most notable announcements that give a glimpse into the cutting-edge of what AWS is building:

  • Amazon Q - A new AI-powered assistant designed for workplace collaboration that can generate content and code to boost team productivity.  
  • AWS Graviton4 and Trainium2 Chips – The latest generation AWS processor and accelerator chips engineered to enable heavy AI workloads like training and inference.  
  • Amazon Bedrock Expansion – New options to deploy and run custom models and automate AI workflows to simplify integration.
  • Amazon SageMaker Updates – Enhanced capabilities for novices and experts alike to build, train, tune and run machine learning models faster. 
  • Amazon Connect + Amazon Q - Combining AI assistance and customer service software to help agents respond to customers more effectively.

AWS underscored its commitment to an intelligent future with previews showcasing bleeding-edge innovation. This vision crystallizes how human-AI collaboration can transform customer experiences and business outcomes when generative AI becomes an integral part of solution stacks. Re:Invent 2023 ushered in this emerging era.

As the curtain falls on AWS re:Invent 2023, the message is clear: AWS is not just keeping up with the pace of technological evolution; it is setting it. Each announcement and innovation revealed at the event is a testament to AWS's unwavering commitment to shaping a future where technology is not just a tool but a catalyst for unimaginable growth and progress. The journey of AWS re:Invent 2023 is not just about celebrating achievements; it's about envisioning and building a future that's brighter, faster, and more connected than ever before.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2057298 2023-12-01T18:13:39Z 2023-12-01T18:13:39Z Celebrating a Powerhouse of AI: FAIR's First Decade

Today marks an important milestone for Meta's Fundamental AI Research (FAIR) team – 10 years of spearheading advancements in artificial intelligence. When FAIR first launched under the leadership of VP and Chief AI Scientist Yann LeCun in 2013, the field of AI was finding its way. He assembled a team of some of the keenest minds at the time to take on fundamental problems in the burgeoning domain of deep learning. Step by step, breakthrough upon breakthrough, FAIR's collective brilliance has expanded the horizons of what machines can perceive, reason, and generate.

The strides over a decade are simply striking. In object detection alone, we've gone from recognizing thousands of objects to real-time detection, instance segmentation, and even segmenting anything. FAIR's contributions in machine translation are similarly trailblazing – from pioneering unsupervised translation across 100 languages to the recent "No Language Left Behind" feat. 

And the momentum continues unabated. This year has been a standout for FAIR in research impact, with award-garnering innovations across subareas of AI. Groundbreaking new models like Llama are now publicly available—and FAIR's advancements already power products millions use globally.

While future progress will likely come from fusion rather than specialization, one thing is evident – FAIR remains peerless in its ability to solve AI's toughest challenges. With visionary researchers, a culture of openness, and the latitude to explore, they have their sights firmly fixed on the future.

So, to all those who contributed to this decade of ingenuity – congratulations. And here's to many more brilliant, accountable steps in unleashing AI's potential.

Kabir M.
tag:blog.cprompt.ai,2013:Post/2055808 2023-11-27T20:39:01Z 2023-11-28T16:46:21Z The Art of Reading Signals: Making Sense of Intent in the Age of AI

The images that emerged from Cuba in October 1962 shocked the Kennedy administration. Photos from a U-2 spy plane revealed Soviet missile sites under feverish construction just 90 miles off the coast of Florida. The installations posed a direct threat to the U.S. mainland, drastically altering the balance of power that had kept an uneasy peace. In a televised address on October 22, President Kennedy revealed the Soviet deception and announced a blockade to prevent further missiles from reaching Cuba. The world anxiously watched the crisis build over the next tension-filled week. 

Behind the scenes, critical signals were being misread on both sides. Soviet premier Nikita Khrushchev believed the United States knew of Moscow’s inferior strategic position relative to its superpower rival. In secret discussions with Kennedy, Khrushchev voiced dismay that his attempt to redress the imbalance was perceived as offensive rather than a deterrent. Kennedy, blindsided by photographs he never expected to see, questioned why the Soviets would take such a risk over an island nation of questionable strategic value. Faulty assumptions about intent magnified distrust and instability at the highest levels.

The perils of miscommunication that defined the Cuban Missile Crisis feel disturbingly resonant today. Nations compete for advantage in trade, technology, and security matters beyond the horizon of public visibility. Artificial intelligence powers more decisions than ever in governance, finance, transportation, health, and a growing array of sectors. Yet the intentions behind rapid AI progress often remain unclear even between ostensible partners, let alone competitors.

So, how can nations credibly signal intentions around artificial intelligence while managing risks?

The technology and national security policy worlds need prompt solutions – tailor-made channels enabling credible communication of intentions around artificial intelligence between governments, companies, researchers, and public stakeholders. To demystify the AI landscape, we will explore critical insights from a crucial recent analysis titled “Decoding Intentions: Artificial Intelligence and Costly Signals” by Andrew Imbrie, Owen Daniels, and Helen Toner. Ms. Toner recently came into the limelight during the OpenAI saga as one of the OpenAI board members who fired Sam Altman, the co-founder and since-reinstated CEO of OpenAI.

The core idea is that verbal statements or physical actions that impose political, economic, or reputational costs on the signaling nation or group can reveal helpful information about underlying capabilities, interests, incentives, and timelines between rivals. Their essential value and credibility lie in the price the sender would pay, in various forms, if their commitments or threats ultimately went unfulfilled. Such intentionally “costly signals” were critical, if also inevitably imperfect, tools that facilitated vital communication between American and Soviet leaders during the Cold War. This signaling model remains highly relevant for strategically navigating the cooperation and competition dynamics surrounding 21st-century technological transformation, including artificial intelligence.

The report identifies and defines four mechanisms for imposing costs that allow nations or companies employing them to signal information credibly:

Tying hands relies on public pledges before domestic or international audiences, be they voluntary commitments around privacy or binding legal restrictions mandating transparency. If guarantees made openly to constituents or partners go unmet down the line, political leaders can lose future elections, and firms may contend with angry users abandoning their platforms and services. Both scenarios exemplify the political and economic costs of reneging on promises.

Sunk costs center on significant one-time investments or resource allocations that cannot be fully recovered once expended. Governments steering funds toward research on AI safety techniques or companies dedicating large budgets for testing dangerous model behaviors signal long-standing directional buy-in. 

Installment costs entail incremental future payments or concessions instead of upfront costs. For instance, governments could agree to allow outside monitors regular and sustained access to continually verify properties of algorithmic systems already deployed and check that they still operate safely and as legally intended. 

Reducible costs differ by being paid mainly at the outset but with the potential to be partially offset over an extended period. Firms may invest heavily in producing tools that increase algorithmic model interpretability and transparency for users, allowing them to regain trust - and market share - via a demonstrated commitment to responsible innovation.

In assessing applications of these signaling logics, the analysis spotlights three illuminating case studies: military AI intentions between major rivals, messaging strains around U.S. promotion of “democratic AI,” and private-sector attempts to convey restraint regarding impactful language model releases.

Among the critical implications, we learn that credibly communicating values or intentions has grown more challenging for several reasons. Signals have become “noisier” overall amid increasingly dispersed loci of innovation across borders and non-governmental actors. Public stands meant to communicate commitments internally may inadvertently introduce tensions with partners who neither share the priorities expressed nor perceive them as applicable. However, calibrated signaling remains a necessary, if frequently messy, practice essential for stability. If policymakers expect to promote norms effectively around pressing technology issues like ubiquitous AI systems, they cannot simply rely on concealing development activities or capabilities from competitors.

Rather than a constraint, complexity creates chances for tailoring solutions. Political and industry leaders must actively work to send appropriate signals through trusted diplomatic, military-to-military, scientific, or corporate channels to reach their intended audiences. Even flawed messaging that clarifies assumptions, reassures observers, or binds hands carries value. It may aid comprehension, avoid misunderstandings that spark crises, or embed precedents encouraging responsible-innovation mandates more widely. To this end, cooperative multilateral initiatives laying ground rules around priorities like safety, transparency, and oversight constitute potent signals promoting favorable norms. They would help democratize AI access and stewardship for the public good rather than solely for competitive advantage.

When American and Soviet leaders secretly negotiated an end to the Cuban Missile Crisis, both sides recognized the urgent necessity of installing direct communication links and concrete verification measures, allowing them to signal rapidly during future tensions. Policymakers today should draw wisdom from this model and begin building diverse pathways for credible signaling right now before destabilizing accidents occur, not during crisis aftermaths. Reading accurate intent at scale will remain an art more than deterministic science for the foreseeable future.

Kabir M.