Teaching AI Agents to Make Real-World Decisions

Most reinforcement learning libraries stop at research benchmarks, leaving a gap between promising algorithms and production use. A new open-source software package called Pearl aims to close that gap by giving AI agents the tools they need to make decisions in the real world. Developed by researchers at Meta AI, Pearl provides a versatile framework for reinforcement learning (RL), a trial-and-error technique inspired by how humans and animals acquire skills.

The Core Idea Behind Pearl

At its core, Pearl is designed to handle the complexity of real-world sequential decision-making. It equips RL agents with capabilities like:

  • Intelligently exploring environments to gather valuable data
  • Summarizing complex histories to handle partial information  
  • Safely avoiding undesirable outcomes
  • Efficiently learning from offline datasets

This represents a significant upgrade from existing RL libraries, which tend to focus narrowly on core algorithms while neglecting the practical requirements of production systems.

Pearl's modular architecture makes it easy to mix and match components such as policy learning methods, exploration strategies, safety constraints, and neural network architectures. This flexibility empowers researchers and practitioners to tailor RL solutions to their needs.

Teaching an RL Agent to Balance a Pole

To understand Pearl in action, let's walk through a simple example of using it to teach an agent to balance a pole on a cart (a classic RL benchmark problem). 

We first instantiate a PearlAgent object, choosing a deep Q-learning policy learner and an ε-greedy exploration strategy to ensure a mix of exploration and exploitation. Our agent then repeatedly takes actions in the cart pole environment, observing the current state, receiving rewards or penalties, and storing the generated experiences in its replay buffer.

Behind the scenes after each episode, Pearl samples experience from the buffer to train the agent's neural network, steadily improving its policy. Over time, the agent learns to move the cart left or right to keep the pole balanced for as long as possible.
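The loop described above (act, observe, store, sample, learn) can be sketched in plain Python. This is an illustrative skeleton, not Pearl's actual API: the toy environment, the placeholder Q-values, and the hyperparameters are all invented for the example.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) experiences."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Sample uniformly; cap at the number of stored experiences.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Toy stand-in for the cart-pole loop: 2 actions, made-up rewards.
buffer = ReplayBuffer(capacity=1000)
q_values = [0.0, 0.0]            # placeholder Q-estimates for 2 actions
state = 0
for step in range(100):
    action = epsilon_greedy(q_values, epsilon=0.1)
    reward = random.random()     # real environment feedback would go here
    next_state = state + 1
    buffer.push((state, action, reward, next_state))
    state = next_state

batch = buffer.sample(32)        # experiences used to train the network
```

In the real setting, the sampled batch would feed a gradient step on the Q-network; here it simply demonstrates the store-then-sample cycle that makes off-policy learning sample-efficient.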

Key Takeaways

Pearl demonstrates how the modular building blocks of history, exploration, safety, and offline learning can produce sophisticated RL agents ready for real-world deployment. As the authors highlight, it is already used in industry applications like recommender systems and bidding optimization.

As AI advances, we need frameworks like Pearl that translate innovation into meaningful solutions for businesses and communities. With thoughtful design, RL could one day coordinate disaster relief efforts, allocate funding to scientific projects, guide public health programs, and more. 

By open-sourcing Pearl, Meta AI lowers organizations' barriers to building decision-making systems powered by state-of-the-art reinforcement learning. Now, anyone can access these capabilities for free and even turn their AI prompts into easy-to-use web apps via platforms like CPROMPT.AI


Glossary of Key Terms

  • Reinforcement learning: A training technique that rewards desirable behaviors and punishes mistakes instead of requiring labeled examples
  • Replay buffer: Temporary data storage used to recycle experiences to improve sample efficiency
  • Policy: The agent's decision-making model mapping states to actions

Enabling Efficient Parallel Function Calling in LLMs

Large language models (LLMs) like GPT-3 have shown remarkable language understanding and reasoning capabilities. This has expanded the scope of LLMs from content generation to solving complex problems across domains like math, coding, and question-answering. However, LLMs have inherent limitations: knowledge cutoffs, poor arithmetic skills, and no access to private data sources.

To overcome these limitations, recent works have focused on equipping LLMs with external function calling capabilities. This allows users to provide custom functions that the LLM can invoke to augment its skills. For instance, an LLM could call a calculator function for arithmetic operations or query a private database and summarize the results. The LLM selects suitable functions based on context and integrates their outputs to derive solutions.
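A minimal sketch of this dispatch pattern, assuming the model emits a JSON-like call specification. The `calculator` tool and the registry layout here are hypothetical illustrations, not any particular library's API:

```python
# Hypothetical registry of tools an LLM could be allowed to invoke.
def calculator(expression: str) -> float:
    # Restricted eval for simple arithmetic only (illustrative, not production-safe).
    allowed = set("0123456789+-*/(). ")
    assert set(expression) <= allowed, "unsupported characters"
    return eval(expression)

TOOLS = {"calculator": calculator}

def dispatch(call: dict):
    """Invoke the function named in an LLM-produced call specification."""
    return TOOLS[call["name"]](**call["arguments"])

# In practice the model would emit this structure; here it is hard-coded.
llm_output = {"name": "calculator", "arguments": {"expression": "17 * 24"}}
result = dispatch(llm_output)    # 408
```

The LLM never computes the arithmetic itself; it only chooses the function and arguments, and the framework integrates the returned value back into the conversation.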

While function calling expands the capabilities of LLMs, current approaches like ReAct execute functions sequentially. This means the LLM calls one function, reasons over the output, and then decides on the next function to call. This back-and-forth process continues until the LLM generates the final solution.

The sequential execution in ReAct has three key downsides:

  • High latency - Reasoning over each intermediate output becomes time-consuming for queries needing multiple sequential function calls.
  • Increased costs - Frequent promptings of the LLM to analyze each output drive up the token usage.
  • Lower accuracy - Concatenating intermediate outputs can sometimes confuse the LLM, leading to repetitive function calls or premature stopping.

This is where the paper "An LLM Compiler for Parallel Function Calling" comes in. It proposes a novel framework, LLMCompiler, that can efficiently orchestrate parallel function calling in LLMs.

The Core Idea Behind LLMCompiler

The core philosophy behind LLMCompiler is drawing inspiration from classical compilers, which optimize instruction execution in programs by identifying parts that can run in parallel. LLMCompiler applies similar concepts for efficient multi-function execution in LLMs.

Specifically, LLMCompiler has three key components:

  • LLM Planner: Analyzes user prompts and generates a graph of the necessary function calls and their dependencies.
  • Task Fetching Unit: Dynamically inspects the graph to dispatch independent function calls in parallel. 
  • Executor: Runs the dispatched function call tasks concurrently using associated tools.

Let's understand this with an example prompt:

"How much does Microsoft's market cap need to increase to exceed Apple's market cap?"

The LLM Planner breaks this down into four key tasks:

  • Search for Microsoft's market cap
  • Search for Apple's market cap 
  • Divide Apple's market cap by Microsoft's market cap
  • Generate textual response for the division result

Tasks 1 and 2 are independent searches that can run in parallel. Task 3 depends on the outputs of 1 and 2, while task 4 depends on 3.

The Task Fetching Unit identifies tasks 1 and 2 as parallelizable and dispatches them concurrently to the Executor. Once both complete, it dispatches task 3, substituting the actual market-cap values from the outputs of tasks 1 and 2. Finally, task 4 is executed to return the final response.
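The plan above can be sketched as a small dependency graph executed with Python's standard thread pool. This is an illustrative approximation of the Planner/Fetcher/Executor split, not LLMCompiler's actual code; the task ids, stub functions, and market-cap figures are made up:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical task graph for the market-cap prompt. Each entry maps a task id
# to (function, dependency ids); the search "tools" are stubs with invented numbers.
tasks = {
    1: (lambda r: 2.8, []),                           # Microsoft market cap (T$, stub)
    2: (lambda r: 3.0, []),                           # Apple market cap (T$, stub)
    3: (lambda r: r[2] / r[1], [1, 2]),               # ratio: Apple / Microsoft
    4: (lambda r: f"Ratio needed: {r[3]:.2f}", [3]),  # textual response
}

results, pending = {}, dict(tasks)
with ThreadPoolExecutor() as pool:
    while pending:
        # Task Fetching Unit: find tasks whose dependencies are all satisfied...
        ready = [t for t, (_, deps) in pending.items()
                 if all(d in results for d in deps)]
        # ...and dispatch them to the Executor concurrently.
        futures = {t: pool.submit(pending[t][0], results) for t in ready}
        for t, fut in futures.items():
            results[t] = fut.result()
            del pending[t]
```

On the first pass tasks 1 and 2 run in parallel; task 3 runs once both results land, and task 4 last — the same wave-by-wave dispatch the paper describes.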

This optimized orchestration of function calls provides significant benefits:

  • Faster execution due to parallel processing
  • Lower costs from fewer unnecessary LLM promptings
  • Increased accuracy by minimizing intermediate output interference

Analyzing LLMCompiler's Performance

The authors benchmarked LLMCompiler on different workloads - from embarrassingly parallel patterns to sequentially dependent functions. They tested popular LLMs like GPT-3.5 and the open-source LLaMA-2 model.

The results show that LLMCompiler provides consistent latency speedups of 1.8x to 3.74x over ReAct across benchmarks. The efficient planning also leads to a 3x-6x cost reduction in token usage. Further, LLMCompiler improves accuracy over ReAct by avoiding repetition and early stopping failures.

LLMCompiler also outperforms OpenAI's parallel function calling feature, which was released concurrently, demonstrating 1.35x faster execution and affirming its optimized orchestration of function calls.

Frameworks like LLMCompiler that optimize parallel function execution unlock new possibilities for builders. Prompt programming platforms like CPROMPT.AI can then democratize such capabilities for everyone.

CPROMPT.AI allows anyone to turn AI prompts into customizable web apps without coding. Users could build an app powered by an efficient LLM backend like LLMCompiler to solve complex problems. The app creator can customize functions that end users can leverage by simply describing their queries in natural language.

For instance, an investor may build a market analysis app using custom data queries and financial models. An engineer could create a CAD design troubleshooting app with PLMs and simulation functions. Such apps make efficient parallel function calling accessible to domain experts beyond AI researchers.  

With innovations like LLMCompiler and prompt programming platforms like CPROMPT.AI, users can build purpose-built apps enhanced by LLMs that efficiently tackle multifaceted problems in natural language. This can expand the real-world impact of large language models.

The LLMCompiler introduces an optimized compiler-like orchestration for parallel function calling in LLMs. By planning efficient execution graphs and concurrent function dispatches, the LLMCompiler unlocks faster and more affordable execution without compromising accuracy. Further combining such capabilities with accessible, prompt programming interfaces like CPROMPT.AI can democratize parallel problem-solving with LLMs beyond AI experts. As LLMs continue maturing into versatile reasoning engines, efficient multi-function orchestration will be vital to unlocking their full potential while managing resources prudently.

GitHub Repo for LLMCompiler



Glossary of Key Terms

  • ReAct: Framework for agents using LLMs to reason and take actions

Why We Shouldn't Humanize AI

Recently, I came across an article on VOX called Why it’s important to remember that AI isn’t human by Raphaël Millière and Charles Rathkopf. It made me think about the dozens of people I read and hear on 𝕏 (formerly Twitter) and 𝕏 Spaces who write or talk to ChatGPT or other LLMs as if they were interacting with a real human being. I use polite language in crafting my prompts because I am told that if the context of my input is closer to a strong, robust pattern, the model might be better at predicting my desired content; I don't say "please" because I think of it as a human. But what do you see when you talk to ChatGPT? A cold, emotionless bot spitting out responses? Or a friendly, helpful companion ready to converse for hours? Our instincts push us towards the latter, though the truth lies somewhere in between. We view all things, artificial intelligence included, through the lens of language. And therein lies the trouble.

Human language marked the pinnacle of human cognition. No other species could conjugate verbs, compose poems, or write legal briefs. Language remained uniquely ours until an AI startup called Anthropic released Claude - a large language model capable of debating ethics, critiquing sonnets, and explaining its workings with childlike clarity. 

Seemingly overnight, our exclusivity expired. Yet we cling to the assumption that only a human-like mind could produce such human-like words. When Claude chatters away, we subconsciously project intentions, feelings, and even an inner life onto its algorithms. This instinct to anthropomorphize seeps through our interactions with AI, guiding our perceptions down an erroneous path. As researchers Raphaël Millière and Charles Rathkopf explain, presuming language models function like people can "mislead" and "blind us to the potentially radical differences in the way humans and [AI systems] work."

Our brains constantly and unconsciously guess at meanings when processing ambiguous phrases. If I say, "She waved at him as the train left the station," you effortlessly infer I mean a person gestured farewell to someone aboard a departing locomotive. Easy. Yet, multiply such ambiguity across millions of neural network parameters, and deducing intended significances becomes more complex. Claude's coders imbued it with no personal motivations or desires. Any interpretation of its statements as possessing some undisclosed yearning or sentiment is sheer fabrication. 

Nonetheless, the impressiveness of Claude's conversational skills compels us to treat it more as a who than a what. Study participants provided more effective prompts when phrasing requests emotionally rather than neutrally. The Atlantic's James Somers admitted to considering Claude "a brilliant, earnest non-native English speaker" to interact with it appropriately. Without awareness, we slide into anthropomorphic attitudes.

The treacherous assumption underpinning this tendency is that Claude runs on the same psychological processes enabling human discourse. After all, if a large language model talks like a person, it thinks like one, too. Philosopher Paul Bloom calls this impulse psychological essentialism - an ingrained bias that things possess an inherent, hidden property defining their categorization. We extend such essentialist reasoning to minds, intuitively expecting a binary state of either minded or mindless. Claude seems too adept with words not to have a mind, so our brains automatically classify it as such.

Yet its linguistic mastery stems from algorithmic calculations wholly unrelated to human cognition. Insisting otherwise is anthropocentric chauvinism - dismissing capabilities differing from our own as inauthentic. Skeptics argue Claude merely predicts the next word rather than genuinely comprehending language. But as Millière and Rathkopf point out, this no more limits Claude's potential skills than natural selection constrains humanity's. Judging artificial intelligence by conformity to the human mind will only sell it short.

The temptation persists, sustained by a deep-rooted psychological assumption the authors dub the "all-or-nothing principle." We essentialize minds as present or absent in systems, allowing no gradient between them. Yet properties like consciousness exist along a spectrum with inherently fuzzy boundaries. Would narrowing Claude's knowledge bases or shrinking its neural networks eventually leave something non-minded? There is no clear cut-off separating minded from mindless AI, yet the all-or-nothing principle compels us to draw one, likely anchored to human benchmarks.

To properly evaluate artificial intelligence, Millière and Rathkopf advise adopting the empirical approach of comparative psychology. Animal cognition frequently defies anthropomorphic assumptions - observe an octopus instantaneously camouflaging itself. Similarly, unencumbered analysis of Claude's capacities will prove far more revealing than hamstrung comparisons to the human mind. Only a divide-and-conquer methodology tallying its strengths and weaknesses on its terms can accurately map large language models' contours.

The unprecedented eloquence of systems like Claude catches us off guard, triggering an instinctive rush toward the familiar. Yet their workings likely have little in common with our psychology. Progress lies not in noting where Claude falls short of human behavior but in documenting its capabilities under its own unique computational constraints. We can only understand what an inhuman intelligence looks like by resisting the temptation to humanize AI.

The Future of AI in Europe: What the Landmark EU Deal Means

The European Union recently reached a provisional political agreement on legislation regulating artificial intelligence (AI) systems and their use. This "Artificial Intelligence Act" is the first comprehensive legal framework for AI. It establishes obligations and restrictions for specific AI applications to protect fundamental rights while supporting AI innovation.

What does this deal cover, and what changes might it bring about? As AI becomes deeply integrated into products and services worldwide, Europeans and tech companies globally need to understand these new rules. This post breaks down critical aspects of the Act and what it could mean going forward.

Defining AI

First, what counts as an AI system under the Act? It defines an AI system as software developed with specific techniques that can predict, recommend, or decide outcomes affecting real-world environments and interactions. This means today's AI assistants, self-driving vehicles, facial recognition systems, and more would fall under the law.

Banned Uses of AI

Recognizing AI's potential threat to fundamental rights and democracy, specific applications are prohibited entirely:

  • Biometric identification systems that use sensitive personal characteristics like religious beliefs, sexual orientation, race, etc., to categorize people. Example: Software labeling individuals as LGBTQ+ without consent.
  • Scraping facial images from the internet or surveillance cameras to create recognition databases. Example: Companies scraping social media photos to build facial recognition systems.
  • Emotion recognition software in workplaces and schools. Example: Software gauging student engagement and boredom during online classes.
  • Social scoring systems judging trustworthiness or risk levels based on social behaviors. Example: Apps rating individuals' personality traits to determine access or opportunities.
  • AI that seeks to circumvent user free will or agency. Example: Chatbots that manipulate individuals into purchases by exploiting psychological vulnerabilities.
  • AI exploiting vulnerabilities of disadvantaged groups. Example: Lenders using income level data to steer low-income applicants towards unfavorable loan offers.

These bans address some of the most problematic uses of emerging AI capabilities. However, the most contentious issue proved to be biometric identification systems used for law enforcement.

Law Enforcement Exemptions 

The Act carves out certain narrow exemptions allowing law enforcement to use biometric identification, like facial recognition tech, in public spaces. However, these come with restrictions and safeguards.

Specific types of biometric ID systems are permitted, subject to prior judicial approval, only for strictly defined serious crimes and searches. Real-time scanning would have tight locational and time limits. 

For example, searches for trafficking victims or to prevent an imminent terrorist threat may use approved biometric tech for that specific purpose. Extensive databases of facial images or other biometrics can only be compiled with cause.

The rules seek to balance investigating significant crimes and protecting civil liberties. However, digital rights advocates argue any biometric surveillance normalizes intrusions disproportionately affecting marginalized communities. Companies building or providing such tech must closely track evolving EU guidance here.

High-Risk AI Systems

For AI applications classified as high-risk, like those affecting health, safety, fundamental rights, and more, strict obligations apply under the Act. Examples include autonomous vehicles, recruitment tools, credit scoring models, and AI used to determine access to public services.

Requirements will include risk assessments, documentation, transparency, human oversight, and more. There are also special evaluation and reporting procedures when high-risk AI systems seem likely to be involved in any breach of obligations.  

Citizens gain the right to file complaints over high-risk AI impacts and ask for explanations of algorithmic decisions affecting them. These provisions acknowledge the growing influence of opaque AI systems over daily life.

General AI and Future Advancements 

The rapid expansion of AI capabilities led policymakers to build in measures even for cutting-edge systems yet to be realized fully. General purpose AI, expected to become mainstream within 5-10 years, faces transparency rules around training data and documentation.

For high-impact general AI anticipated down the line, special model checks, risk mitigation processes, and incident reporting apply. So emerging AI applications like natural-language chatbots are on notice that they will eventually need to meet standards similar to those for high-risk systems.

Supporting Innovation  

Will these new obligations stifle European AI innovation and competitiveness? The Act tries to balance regulation with support for technology development, especially for smaller enterprises.

Regulatory sandboxes let companies test innovative AI in real-world conditions before deployment. Favorable market access procedures aid new market entrants. Requirements kick in only after an AI system is placed on the EU market.

Overall, the Act signals that human rights and ethics should lead development, not the other way around. But negotiators avoided imposing some of the most stringent restrictions tech companies opposed.

Fines for Violations

Failure to meet requirements results in fines of up to €35 million or 7% of a company's global turnover, with the harshest penalties reserved for the most serious violations, a substantial incentive for compliance.

What It Means for US Tech Companies

American tech giants like Microsoft, IBM, and Google, more deeply involved in European markets, will need to implement structures and processes adhering to the new rules. Smaller startups entering the EU marketplace will want to build compliance into products from the start.

Companies exporting AI software or devices to Europe must determine if products fall under high-risk categories or other designations mandating accountability steps. Strict data and documentation requirements around developing and updating AI systems demand additional staffing and oversight.  

While the Act avoids the most burdensome restrictions, adhering to transparency principles and ensuring human oversight of automated decisions requires investment. Tech lobbying failed to defeat obligations reinforcing ethical AI practices many researchers have long called for.

US policymakers have proposed federal guidelines and legislation governing AI systems and companies. However, nothing as comprehensive as the EU's regulatory approach has yet advanced. That may gradually change as the global impacts of the landmark European Act become more apparent in the coming years.

Glossary of Key Terms

  • Biometric identification systems: Technology using biological or behavioral traits – like facial features, fingerprints, gait, and voice – to identify individuals. Examples include facial recognition, fingerprint matching, and iris scans.
  • High-risk AI systems: AI technology that presents a significant potential risk of harm to health, safety, fundamental rights, and other areas defined by EU regulators. Self-driving cars and AI tools in critical infrastructure like hospitals exemplify high-risk systems.
  • General purpose AI: Artificial intelligence that can perform complex cognitive tasks across many industries and use cases. Sometimes called artificial general intelligence (AGI), it does not yet fully exist, but advanced AI systems exhibit some broad capabilities.
  • Regulatory sandbox: Controlled testing environments that allow developers to try innovative digital products/services while oversight agencies review functionality, risks, and effectiveness before full deployment or marketing.

The EU's Artificial Intelligence Act in a Nutshell

The EU's Artificial Intelligence Act aims to establish the first comprehensive legal framework governing AI systems. The main goals are to ensure AI respects existing EU laws and ethical principles while supporting innovation and business use.

Key provisions:

  • Creates a legal definition of an "AI system" in EU law encompassing various software-based technologies like machine learning and knowledge-based systems. The definition aims to be broad and flexible enough to adapt to future AI advances.  
  • Adopts a risk-based approach tailoring obligations depending on the threat level the AI system poses. AI applications posing "unacceptable risk" would be prohibited entirely, while "high-risk" systems would face stricter transparency, testing, and human oversight requirements before market access. "Limited risk" and "minimal risk" AI would have lighter or no additional obligations.
  • Explicitly bans specific dangerous AI uses, including systems exploiting vulnerable groups, scoring individuals based on social behaviors, and real-time remote biometric identification by law enforcement in public spaces.  
  • Imposes mandatory risk management, data governance, transparency and accuracy standards on "high-risk" AI systems used in critical sectors like healthcare and transport or impacting rights and safety. Requires third-party conformity assessments before high-risk systems can carry CE marking for EU market access.
  • Creates an EU database for registering high-risk AI systems and establishes national authorities to oversee compliance, address violations through fines and product recalls, and coordinate enforcement across borders.  
  • Seeks to boost EU AI innovation and investment through regulatory "sandboxes" where companies can test new systems, and through favorable market access rules, particularly helping small businesses and startups develop AI.

The Act's comprehensive scope and strict prohibitions aim to make the EU a leader in ethical and trustworthy AI while allowing beneficial business applications to flourish. But critics argue it could also impose costly burdens, potentially limiting AI investments and stifling innovation.


Q: Who does the AI Act target?

The Act mainly targets providers and users of AI systems based in the EU or exporting products and services to EU markets. So it applies to EU-based tech companies and major US firms like Meta, Alphabet, Microsoft, etc., serving EU users.

Q: What does the Act mean for big US tech companies? 

Major US tech firms deeply involved in EU markets will likely need to implement compliance structures around high-risk AI uses regarding transparency, testing requirements, risk assessment, and human oversight. This could mean sizable compliance costs.

Q: Does the Act ban any AI use by US companies?

Yes, the Act prohibits specific applications by all providers, including uses of AI deemed excessively harmful or dangerous, regardless of whether a system is high-risk. For example, AI uses exploiting vulnerable populations, applications enabling mass biometric surveillance, and AI tools circumventing individual rights.

Q: Will the Act limit investment in AI by US firms?  

Possibly. Compliance costs may deter US tech investments in developing high-risk AI systems for European markets. But the impact likely depends on how rigorously national regulators enforce obligations on companies.

Q: What does the Act mean for US AI startups eyeing EU markets?

The Act aims to support market access and innovation by smaller AI developers through measures like regulatory sandboxes to test new systems. However, meeting requirements around risk management and accuracy for high-risk applications could still prove burdensome for early-stage startups with limited resources.

Q: Could the Act influence AI regulation in the US?

The Act takes a much more active regulatory approach than the US federal government's guidelines. If successful, the comprehensive EU framework could inspire similar proposals for ethical AI guardrails in the US as calls for regulation of technology companies grow.

Q: How will average EU citizens benefit from the AI Act?  

By restricting specific dangerous uses of AI, the Act aims to protect EU citizens' digital rights and safety. Requirements around transparency should improve citizens' understanding of automated decisions impacting their lives regarding issues like credit eligibility and access to public services.  

Q: Will the Act make interacting with AI systems easier in the EU? 

Potentially, provisions prohibiting AI aimed explicitly at exploiting vulnerabilities could lead to systems that better respect human agency and choice when recommending purchases, content selections, and other areas that impact behavior.

Q: Could the Act limit the beneficial uses of AI for EU citizens?

Overly stringent restrictions on lower-risk AI could curb the development of innovations like virtual assistants and chatbots intended to help consumers. However, the Act predominantly targets high-risk uses while promoting voluntary codes of conduct for companies creating consumer AI.

Q: Will EU citizens have any say in how companies develop AI models?

The Act does not establish specific mechanisms for public participation in corporate AI design choices. However, by strengthening national regulators' powers, enhancing transparency, and allowing consumer complaints over biased outcomes, citizens gain new avenues to challenge issues created by AI systems affecting them.

Pushing AI's Limits: Contrasting New Evaluation Milestones

Artificial intelligence has breezed through test after test in recent years. But as capabilities advance, suitable benchmarks grow scarce. Inspired by a video posted on 𝕏 (formerly Twitter) by Thomas Wolf, the co-founder of Hugging Face, I compared the two benchmarks he discussed. These new benchmark datasets push progress to the frontiers of human knowledge itself. They suggest milestones grounded in versatile real-world competency rather than narrow prowess, and their very difficulty could accelerate breakthroughs.

GAIA and GPQA take complementary approaches to inspecting AI through lenses of assistant competence and expert oversight. Both craft hundreds of questions unsolvable for non-specialists despite earnest effort and unconstrained access to information. GPQA draws from cutting-edge biology, physics, and chemistry, seeking problems with undisputed solutions within those communities. GAIA emphasizes multi-step reasoning across everyday tasks like gathering data, parsing documents, and navigating the web.  

The datasets highlight the stubborn gaps that still yawn wide between the most advanced systems and typical humans. GPT-4 grazes 40 percent on GPQA, lagging the 65 percent target for area specialists; augmenting the model with an internet search tool barely budges results. Meanwhile, the best systems clock under 30 percent across GAIA's challenges, compared to above 90 percent for human respondents, stumbling over barriers like effectively handling multimedia information and executing logical plans.

These diagnoses of inhuman performance could guide progress. By homing in on precise shortfalls using explicit criteria, researchers can funnel efforts to deny AI problems any enduring place to hide. Projects conceived directly from such insights might swiftly lift capacities, much as targeted medical treatments heal pinpointed ailments. In this way, GAIA and GPQA represent attainable waypoints en route to broader abilities.

Reaching either milestone suggests unfolding mastery. Matching multifaceted personal assistants could precipitate technologies from conversational guides to robotic helpers, markedly upgrading everyday experience. Reliable oracles imparting insights beyond individual comprehension might aid in pushing back the frontiers of knowledge. Of course, with advanced powers should come progressive responsibility. But transformative tools developed hand in hand with human preferences provide paths to elevated prosperity.  

So these benchmark datasets now stand sentinel at the gates of expert-level skill, barring passage to systems falling fractionally short while ushering forward those set to surpass them. Such evaluations may thus shape the trajectory of innovations soon impacting our institutions, information, and lives.


Q: How are the GAIA and GPQA benchmarks different? 

GAIA emphasizes multi-step reasoning across everyday information like images or documents. GPQA provides expert-level problems with undisputed solutions within scientific communities.

Q: Why are difficult, decay-resistant benchmarks vital for AI progress?

They can advance integrated real-world skills by exposing precise gaps between state-of-the-art systems and human capacities.

Q: How could surpassing milestones like GAIA or GPQA impact society?   

They constitute waypoints en route to safe, beneficial technologies - from conversational aids to knowledge oracles - improving life while upholding priorities.


  • Oversight - Evaluating and directing intelligent systems accurately and accountably, even where individual knowledge limits checking outputs firsthand.  
  • Benchmark decay - The tendency for fixed benchmarks to become unchallenging as the systems they measure improve over time.

Unlocking Linear Speed for AI Models with Mamba

Modern AI systems rely heavily on complex neural network architectures called Transformers. While powerful, Transformers have a significant weakness - they slow down drastically when processing long sequences, like documents or genomes. This limits their practical use for real-world applications.  

Enter Mamba, a new AI model that retains the power of Transformers while overcoming their Achilles heel. In a recent paper, researchers from Carnegie Mellon University and Princeton University propose a way to make sequence modeling scale linearly. That means that, unlike Transformers, Mamba does not slow down drastically with longer inputs.

The Key Idea Behind Mamba

The core concept behind Mamba is a structured state space model (SSM). SSMs are similar to two classic neural networks - recurrent neural networks (RNN) and convolutional neural networks (CNN). They take an input sequence, pass it through an internal "state" that changes over time, and convert it to an output. Here is a small primer on these networks:

Structured State Space Models (SSMs)

SSMs model sequences by passing inputs through an internal "state" that changes over time. You can imagine the state as a container summarizing the relevant history up to a point. An SSM transforms the current input and state into a new state, which informs the following output. The critical advantage of SSMs is that their state remains compact even for very long sequences. This compressed representation allows efficient processing.
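The state-update idea above can be sketched in a few lines of Python. This is an illustrative toy, not Mamba's actual parameterization: the matrices `A`, `B`, and `C` are made-up numbers standing in for learned weights.

```python
def ssm_step(A, B, C, h, x):
    """One step of a minimal linear state space model.

    State update: h_t = A h_{t-1} + B x_t   (output: y_t = C h_t)
    """
    new_h = [sum(a * hj for a, hj in zip(row, h)) + b * x
             for row, b in zip(A, B)]
    y = sum(c * hj for c, hj in zip(C, new_h))
    return new_h, y

def ssm_scan(A, B, C, xs):
    """Fold a whole sequence through the state: one step per input, linear time."""
    h = [0.0] * len(B)               # compact state summarizing the history so far
    ys = []
    for x in xs:
        h, y = ssm_step(A, B, C, h, x)
        ys.append(y)
    return ys

# Toy parameters: a 2-dimensional state processing a scalar sequence
A = [[0.9, 0.0], [0.1, 0.8]]         # state transition
B = [1.0, 0.5]                       # input projection
C = [0.5, 0.5]                       # output projection
outputs = ssm_scan(A, B, C, [1.0, 0.0, 0.0, 0.0])
```

Note that the state `h` never grows with the sequence: however many inputs arrive, the model carries only two numbers forward, which is exactly the compactness the paragraph above describes.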

Convolutional Neural Networks (CNNs)

CNNs are neural networks that apply the mathematical operation of convolution. In simple terms, a CNN slides a small filter matrix over the input and detects patterns in local regions. Multiple filters can be applied in parallel to identify low-level motifs like edges or textures. CNNs work well for perceptual data like images, video, or audio. They are less suitable for modeling dependencies between distant elements of a sequence.
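As an illustration of the sliding-filter idea (a generic one-dimensional convolution, not code from the Mamba paper), the whole operation fits in a few lines:

```python
def conv1d(xs, kernel):
    """Slide a small filter over the input, responding to local patterns."""
    k = len(kernel)
    return [sum(kernel[j] * xs[i + j] for j in range(k))
            for i in range(len(xs) - k + 1)]

# An edge-detector-style filter: responds strongly where the signal jumps
signal = [0.0, 0.0, 1.0, 1.0, 1.0]
edges = conv1d(signal, [-1.0, 1.0])   # difference of adjacent values
```

The filter only ever sees `k` neighboring values at once, which is why convolutions excel at local patterns but struggle with long-range dependencies.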

Recurrent Neural Networks (RNNs)

In RNNs, the network contains loops that feed activation from previous time steps as input to the current step. This creates an implicit memory in the network to model long-range dependencies. For instance, an RNN can develop a nuanced understanding of language by remembering all the words seen up to a point. However, standard RNNs struggle with long sequences due to issues like vanishing gradients. Specialized RNN variants, such as LSTMs, address this limitation.
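A bare-bones scalar recurrent step (illustrative only; real RNNs use weight matrices and vectors) shows how the loop feeds the previous hidden state back into the current one, and also hints at why influence fades over long gaps:

```python
import math

def rnn_scan(w_h, w_x, xs):
    """Minimal scalar RNN: the hidden state carries memory between steps.

    h_t = tanh(w_h * h_{t-1} + w_x * x_t)
    """
    h = 0.0
    hs = []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)   # previous state loops back in
        hs.append(h)
    return hs

# A single early input followed by silence: its trace decays step by step
states = rnn_scan(0.5, 1.0, [1.0, 0.0, 0.0])
```

With a recurrent weight below 1, the early input's trace shrinks at every step, a toy version of the vanishing-signal problem described above.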

The core concepts are essentially:

  • SSMs - Compressed state with global sequence view
  • CNNs - Local patterns
  • RNNs - Sequential modeling with internal memory

SSMs are unique because their internal state can compress information from longer sequences into a compact form. This compressed state allows efficient processing no matter the input length. Prior SSM models worked well on continuous data like audio or images. But they struggled with information-dense discrete inputs like text. 

The creators of Mamba overcame this by adding a "selection mechanism" to SSMs. This lets Mamba focus only on relevant text parts, ignoring unnecessary bits. For example, when translating a sentence from English to French, Mamba would pay attention to the words while filtering out punctuation or filler words like "um."
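The selection idea can be caricatured as an input-dependent gate on the state update. This is a loose conceptual sketch, not the paper's actual parameterization: `relevance` stands in for a learned function of the input.

```python
def selective_step(h, x, relevance):
    """Input-dependent state update: relevant tokens overwrite the state more.

    gate near 1.0 -> token is absorbed; gate near 0.0 -> token is ignored.
    """
    gate = relevance(x)                      # computed from the input itself
    return (1.0 - gate) * h + gate * x

# Toy relevance function: treat zeros as filler to be skipped over
relevance = lambda x: 0.0 if x == 0.0 else 0.9

h = 0.0
for token in [5.0, 0.0, 0.0, 3.0]:
    h = selective_step(h, token, relevance)
```

Because the gate depends on the token, the filler values pass through without disturbing the state at all, while the informative values are folded in, which is the "focus only on relevant parts" behavior described above.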

Another innovation in Mamba is using GPU hardware capabilities efficiently during training. This enables much larger hidden state sizes compared to standard RNNs. More state capacity means storing more contextual information from the past.

Overall, these improvements give Mamba exceptional speed, with accuracy on par with or better than Transformer networks of similar size.

Key Facts About Mamba

  • 5x Faster Inference Than Transformers - Mamba displays over five times higher throughput than similarly sized Transformers when generating text or speech. For practical applications, this translates to much lower latency and cost.
  • Matches Bigger Transformers in Accuracy  - Empirical tests show Mamba develops a solid contextual understanding from self-supervised pretraining on large datasets. Despite having fewer parameters, it matches or exceeds bigger Transformer models on several language tasks.
  • Handles 1 Million Word Contexts  - Mamba is the first sub-quadratic model that continues improving with longer context, reaching up to 1 million words. Prior models degrade in performance beyond a point as context length increases. This opens possibilities for capturing more global structures, like full-length books.
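To see why sub-quadratic scaling matters at these context lengths, it helps to compare how rough cost grows for attention-style (quadratic) versus scan-style (linear) processing. The numbers below are illustrative unit counts, not real benchmarks:

```python
def quadratic_cost(n):
    """Self-attention compares every token pair: cost grows as n squared."""
    return n * n

def linear_cost(n):
    """A state space scan touches each token once: cost grows as n."""
    return n

# Doubling the context quadruples attention cost but only doubles scan cost
for n in [1_000, 10_000, 100_000, 1_000_000]:
    ratio = quadratic_cost(n) / linear_cost(n)
    print(f"context {n:>9,}: quadratic is {ratio:,.0f}x the linear cost")
```

At a million tokens, the quadratic approach pays a million times the per-token price of the linear one, which is why context lengths that are hopeless for standard attention become practical for Mamba.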

Real-World Implications

Mamba's linear computational complexity unlocks myriad new applications for large language models requiring real-time responsiveness. For instance, let's think about an intelligent prompt app created using CPROMPT.AI. The natural language interface can understand instructions spanning multiple sentences and respond immediately. Streaming applications like live speech transcription also become viable. And for sensitive use cases, the whole context stays on-device without needing roundtrips to the cloud.

Another benefit is the feasibility of much bigger foundation models in the future. Training costs and carbon emissions have been critical constraints on model scale so far. Mamba's efficiency may enable models with over a trillion parameters while staying within the practical limits of computing budgets and data center energy.


  • Transformers: A neural network architecture based on self-attention that became highly popular after powering pioneering large models like GPT-3 and DALL-E.
  • Structured State Space Model (SSM): A class of seq2seq models based on dynamical systems theory, which can trade off expressiveness and computational efficiency.
  • Selection Mechanism: Mamba introduced the method to make SSM transitions input-dependent, so it focuses only on relevant tokens.  
  • Throughput: Number of tokens processed per second. Higher is better.
  • Sub-quadratic: Algorithmic time complexity grows slower than quadratic. This includes linear and logarithmic time models.

Creating Realistic 3D Avatars from Images

Have you ever wished you could bring a photo or video of someone to life with a realistic 3D avatar that looks and moves just like them? That futuristic idea is quickly becoming a reality thanks to recent advances in artificial intelligence and computer graphics research. 

In a new paper published on arXiv, researchers from Tsinghua University, a national public university in Beijing, China, propose a method called "Gaussian Head Avatar," which can create highly detailed 3D head avatars from multi-view camera images. Their approach utilizes neural networks and an innovative 3D representation technique to model both the shape and motions of a person's head with unprecedented realism.

At the core of their technique is representing the 3D shape of the head using many discrete elements called "Gaussians." 

Each Gaussian is defined by properties like position, color, and opacity. Thousands of these Gaussians are collectively optimized to form the head avatar's visible surfaces. This approach has advantages over other 3D representations when it comes to efficiently rendering high-frequency details like skin pores, strands of hair, and wrinkles.
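To make the idea concrete, here is a toy 2D sketch, not the paper's actual renderer, of how one Gaussian with a center, width, color, and opacity contributes to a pixel; all function names and numbers are invented for illustration:

```python
import math

def gaussian_weight(px, py, cx, cy, sigma):
    """Falloff of a Gaussian centered at (cx, cy), evaluated at pixel (px, py)."""
    d2 = (px - cx) ** 2 + (py - cy) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))

def splat(pixel_color, gauss_color, opacity, weight):
    """Alpha-blend one Gaussian's color into a pixel, weighted by its falloff."""
    alpha = opacity * weight
    return tuple(alpha * g + (1.0 - alpha) * p
                 for g, p in zip(gauss_color, pixel_color))

# A red Gaussian centered on the pixel fully tints it; far away, it contributes
# essentially nothing
near = splat((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), opacity=1.0,
             weight=gaussian_weight(5.0, 5.0, 5.0, 5.0, sigma=2.0))
far = splat((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), opacity=1.0,
            weight=gaussian_weight(50.0, 50.0, 5.0, 5.0, sigma=2.0))
```

Summing thousands of such soft, overlapping contributions is what lets the representation capture fine detail like hair strands without hard polygon edges.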

The critical innovation is making the Gaussians dynamic, changing their properties based on facial expressions and head movements. This allows animating the avatar by providing images showing different expressions/poses. The animation is driven by neural networks that predict how the Gaussians need to move and change to match the provided images.
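Conceptually, animating the avatar amounts to a function that maps an expression code to per-Gaussian property changes. The sketch below is again an invented stand-in, not the authors' network; `toy_offsets` plays the role of the learned predictor:

```python
def animate_gaussians(gaussians, expression, predict_offsets):
    """Shift each Gaussian by offsets predicted from the expression code."""
    updated = []
    for g in gaussians:
        dx, dy, dz = predict_offsets(g["position"], expression)
        moved = dict(g)
        moved["position"] = tuple(p + d
                                  for p, d in zip(g["position"], (dx, dy, dz)))
        updated.append(moved)
    return updated

# Stand-in "network": open the mouth by pulling low Gaussians downward
def toy_offsets(position, expression):
    x, y, z = position
    mouth_open = expression[0]          # one scalar expression parameter
    return (0.0, -mouth_open if y < 0.0 else 0.0, 0.0)

head = [{"position": (0.0, 0.5, 0.0)},   # a Gaussian on the upper face
        {"position": (0.0, -0.5, 0.0)}]  # a Gaussian near the jaw
posed = animate_gaussians(head, expression=[0.2], predict_offsets=toy_offsets)
```

In the real system, a neural network replaces `toy_offsets` and predicts changes to all Gaussian properties, trained so that the rendered result matches images of the target expression.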

The results are extremely impressive 3D avatars rendered at 2K resolution with intricate details, even for complex expressions like an open laughing mouth. This level of photo-realism for virtual avatars opens up many possibilities for video game development, virtual reality, and visual effects for films/metaverse.  Here are some of the most exciting facts highlighted in this research:

  • Their technique needs only 16 camera views distributed across 120 degrees to capture the multi-view training data. This lightweight capture setup makes the avatar creation process much more practical.
  • The neural network predictions are regularized to avoid learning distortions not consistent across views. This forces the model to capture the actual 3D shape rather than view-dependent image transformations.
  • They designed the animation model to separately handle expressions and head movements. Expressions primarily drive the region near facial landmarks, while neck movement uses the pose. This decomposition matches how faces move.
  • A guidance model using implicit 3D representations is trained first to initialize the Gaussians before the main training. This allows robust fitting to hair and shoulders beyond the core face region.
  • Their avatars can realistically render both subtle and extreme expressions, like a wide-open mouth. The neural animation model does not suffer from the limits of traditional techniques.


  • Multi-view images: Multiple images of an object captured from different viewing angles
  • Neural Networks: Computing systems inspired by the human brain structure and capable of learning from data
  • 3D Representation: Mathematical definition of 3D shape using various primitives like point clouds, meshes, functions, etc.
  • Gaussians: Parametric surface element defined by a center and width resembling the Gaussian probability distribution
  • Rendering: Generating 2D images from the description of 3D scenes via simulation of the image formation process 
  • Implicit Representations: Defining 3D surface as a level set of a continuous function rather than explicit primitives

GPQA: Pushing the Boundaries of AI Evaluation

As artificial intelligence systems grow more capable, evaluating their progress becomes challenging. Benchmarks that were once difficult swiftly become saturated. This phenomenon clearly illustrates the rapid rate of advancements in AI. However, it also reveals flaws in how benchmarks are designed. Assessments must keep pace as abilities expand into new frontiers like law, science, and medicine. But simply pursuing tasks that are more difficult for humans misses crucial context. What's needed are milestones grounded in versatile real-world competency.

With this motivation, researchers introduced GPQA, an evaluation targeting the edge of existing expertise. The dataset comprises 448 multiple-choice questions from graduate-level biology, physics, and chemistry. Validation ensures both correctness per scientific consensus and extreme difficulty. The questions are deliberately pitched at the edge of expertise, designed to probe the scope of human knowledge itself. Even highly skilled non-experts failed to exceed 34% accuracy despite unrestricted web access and over 30 minutes per query on average.

Such hardness tests a key challenge of AI alignment - scalable oversight. As superhuman systems emerge, humans must retain meaningful supervision. But when tasks outstrip individual comprehension, determining truth grows precarious. GPQA probes precisely this scenario. Non-experts cannot solve the problems independently, yet ground truth remains clearly defined within specialist communities. The expertise gap is tangible but manageable. Oversight mechanisms must close this divide between fallible supervisors and increasingly capable systems.

The state of the art provides ample room for progress. Unaided, large language models like GPT-4 scored only 39% on GPQA's main line of inquiry. Still, their decent initial foothold confirms the promise of foundation models to comprehend complex questions. Combining retrieval tools with reasoning did not push success much further, hinting at subtleties in effectively utilizing internet resources. Ultimately, collaboration between humans and AI may unlock the best path forward - much as targeted experiments should illuminate effective oversight.

As benchmarks must challenge newly enhanced capabilities, datasets like GPQA inherently decay over time. This very quality that makes them leading indicators also demands continual redefinition. However, the methodology of sourcing and confirming questions from the frontier of expertise itself offers a template. Similar principles could shape dynamic suites tuned to drive and validate progress in perpetuity. In the meantime, systems that perform on par with humans across GPQA's spectrum of natural and open-ended problems would constitute a historic achievement - and arrive one step closer to beneficial artificial general intelligence engineered hand in hand with human well-being.


Q: What key capabilities does GPQA test?

GPQA spans skills like scientific reasoning, understanding technical language, gathering and contextualizing information, and drawing insights across questions.

Q: How are the questions generated and confirmed?  

Domain expert contractors devise graduate-level questions, explain their reasoning, and refine them based on feedback from peer experts. 

Q: Why create such a difficult benchmark dataset?

Tough questions probe the limits of existing knowledge crucial for oversight of more capable AI. They also decay slowly, maintaining relevance over time.


Scalable oversight: Reliably evaluating and directing advanced AI systems that exceed an individual human supervisor's abilities in a given area of expertise.

Foundation model - A system trained on broad data that can be adapted to many downstream tasks through techniques like fine-tuning; large language models are a significant class of foundation models.  

Benchmark decay: The tendency for benchmarks to become outdated and unchallenging as the systems they aim to evaluate continue rapid improvement.

The Rise of Gemini: Google's Moonshot Towards Revolutionary AI, Coming Soon to Pixel Phones

Google unveiled its most advanced AI system, Gemini, integrating it into its Bard conversational AI. Gemini aims to push the boundaries of what artificial intelligence can do by better understanding and reasoning about the natural world.  

Technically, Gemini is very impressive. Unlike GPT-3.5 and GPT-4, which were trained mainly on text, Gemini was also trained on images, audio, and video to enable more sophisticated reasoning. It exceeds previous state-of-the-art AI in 30 of 32 benchmark categories, spanning coding, math, medicine, and more. For example, it can read scientific papers and extract critical findings faster than human experts.

However, Google's hype that Gemini represents an imminent revolution has yet to match reality fully. Many complex AI problems, like reliably distinguishing truth from falsehood, still need to be solved. Google also over-promised previously with the botched launch of Bard earlier this year.

So, while Gemini represents an evolution in AI capabilities, responsible development of such powerful technology takes time. We should applaud Google's achievements with cautious optimism about real-world impact in the short term.  

Gemini re-establishes Google at the forefront of the race to develop advanced AI, now facing competition from OpenAI after the widespread buzz created by ChatGPT last year. But practical benefits will likely emerge slowly over years of incremental improvement.


  • Gemini, Google's newest AI model, is touted as its most capable across language, image, audio, video, and other tasks.
  • According to Google, Gemini exceeds previous state-of-the-art AI systems like GPT-3.5 and GPT-4 in 30 of 32 benchmark categories.
  • Real-world applications include scientific insight generation, explaining complex topics like math and physics, and coding. 
  • Google is incrementally rolling out Gemini across products like Bard, Pixel phones, Search, and more over the coming year.


Q: How does Gemini work?

Gemini is a neural network trained on massive datasets, including images, audio, and video, to understand and reason across different data types. Its advanced "multimodal" design allows connecting insights between them.

Q: Is Gemini safe to use?  

Google claims Gemini has undergone substantial safety testing, but any AI system this complex likely still has flaws. Responsible development is an ongoing process.

Q: What are Gemini's limitations? 

Like all AI today, Gemini still struggles with fully reliable reasoning. Issues like distinguishing truth from fiction remain unsolved and require human oversight.

Q: Who can access Google's Gemini AI?

Google plans first to release Gemini APIs to select partners over 2023-2024 before making it more broadly available to developers and enterprise customers.