Posts for Tag: AGI

Pushing AI's Limits: Contrasting New Evaluation Milestones

Artificial intelligence has breezed through test after test in recent years. But as capabilities advance, suitable benchmarks grow scarce. Inspired by a video posted on 𝕏 (formerly Twitter) by Thomas Wolf, the co-founder of Hugging Face, I compared the two benchmarks he discussed. These new benchmark datasets push evaluation to the frontiers of human knowledge itself. They suggest milestones grounded in versatile real-world competency rather than narrow prowess, and their very difficulty could accelerate breakthroughs.

GAIA and GPQA take complementary approaches to inspecting AI through lenses of assistant competence and expert oversight. Both craft hundreds of questions unsolvable for non-specialists despite earnest effort and unconstrained access to information. GPQA draws from cutting-edge biology, physics, and chemistry, seeking problems with undisputed solutions within those communities. GAIA emphasizes multi-step reasoning across everyday tasks like gathering data, parsing documents, and navigating the web.  

The datasets highlight stubborn gaps, still yawning wide, between the most advanced systems and typical humans. GPT-4 grazes 40 percent on GPQA, lagging the 65 percent target for area specialists. Augmenting the model with an internet search tool barely budges results. Meanwhile, top systems clock under 30 percent on GAIA's challenges, compared with above 90 percent for human respondents; the models stumble over barriers like effectively handling multimedia information and executing logical plans.

These diagnoses of subhuman performance could guide progress. By homing in on precise shortfalls using explicit criteria, researchers can concentrate their efforts exactly where AI still falls short. Projects conceived directly from such insights might swiftly lift capabilities, much as targeted medical treatments heal pinpointed ailments. In this way, GAIA and GPQA represent attainable waypoints en route to broader abilities.

Reaching either milestone suggests unfolding mastery. Matching multifaceted personal assistants could precipitate technologies from conversational guides to robotic helpers, markedly upgrading everyday experience. Reliable oracles imparting insights beyond individual comprehension might aid in pushing back the frontiers of knowledge. Of course, with advanced powers should come progressive responsibility. But transformative tools developed hand in hand with human preferences provide paths to elevated prosperity.  

So now AI benchmark datasets stand sentinel at the gates of expert-level skill, barring passage to systems that fall fractionally short while ushering forward those set to surpass them. Such evaluations may thus shape the trajectory of innovations soon impacting our institutions, information, and lives.


Q: How are the GAIA and GPQA benchmarks different? 

GAIA emphasizes multi-step reasoning across everyday information like images or documents. GPQA provides expert-level problems with undisputed solutions within scientific communities.

Q: Why are difficult, decaying benchmarks vital for AI progress?

They can advance integrated real-world skills by exposing precise gaps between state-of-the-art systems and human capacities.

Q: How could surpassing milestones like GAIA or GPQA impact society?   

They constitute waypoints en route to safe, beneficial technologies - from conversational aids to knowledge oracles - improving life while upholding human priorities.


  • Oversight - Evaluating and directing intelligent systems accurately and accountably, even where individual knowledge limits checking outputs firsthand.
  • Benchmark decay - The tendency for fixed benchmarks to become unchallenging as systems improve over time.

GPQA: Pushing the Boundaries of AI Evaluation

As artificial intelligence systems grow more capable, evaluating their progress becomes challenging. Benchmarks that were once difficult swiftly become saturated. This phenomenon clearly illustrates the rapid rate of advancements in AI. However, it also reveals flaws in how benchmarks are designed. Assessments must keep pace as abilities expand into new frontiers like law, science, and medicine. But simply pursuing tasks that are more difficult for humans misses crucial context. What's needed are milestones grounded in versatile real-world competency.

With this motivation, researchers introduced GPQA, an evaluation targeting the edge of existing expertise. The dataset comprises 448 multiple-choice questions from graduate-level biology, physics, and chemistry. Validation ensures both correctness per scientific consensus and extreme difficulty. The questions are hard by design, probing the scope of human knowledge itself. Even highly skilled non-experts failed to exceed 34% accuracy despite unrestricted web access and over 30 minutes per query on average.
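The evaluation protocol behind a benchmark like GPQA is essentially a scoring loop over multiple-choice records. The sketch below is a minimal illustration of that idea; the record fields ("question", "choices", "answer") and the sample items are my own assumptions, not GPQA's actual schema or content.

```python
import random

def score(records, predict):
    """Return the accuracy of `predict` over multiple-choice records."""
    correct = sum(1 for r in records if predict(r) == r["answer"])
    return correct / len(records)

# Tiny stand-in dataset (GPQA itself has 448 expert-written questions).
records = [
    {"question": "Which particle mediates the strong force?",
     "choices": ["photon", "gluon", "W boson", "graviton"],
     "answer": "gluon"},
    {"question": "Which base pairs with adenine in DNA?",
     "choices": ["cytosine", "guanine", "thymine", "uracil"],
     "answer": "thymine"},
]

def random_guesser(record):
    # A four-option random baseline scores ~25% in expectation,
    # which is the floor against which the 34% non-expert and
    # 39% GPT-4 results should be read.
    return random.choice(record["choices"])

accuracy = score(records, random_guesser)
```

Any model can be plugged in as the `predict` callable, which keeps the scoring harness independent of the system under test.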

Such hardness tests a key challenge of AI alignment - scalable oversight. As superhuman systems emerge, humans must retain meaningful supervision. But when tasks outstrip individual comprehension, determining truth grows precarious. GPQA probes precisely this scenario. Non-experts cannot solve its problems independently, yet ground truth remains clearly defined within specialist communities. The expertise gap is tangible but manageable. Oversight mechanisms must close this divide between fallible supervisors and systems whose outputs they cannot verify firsthand.

The state of the art leaves ample room for progress. Unaided, large language models like GPT-4 scored only 39% on GPQA's main question set. Still, their decent initial foothold confirms the promise of foundation models to comprehend complex questions. Adding retrieval tools took accuracy no further, hinting at subtleties in effectively using internet resources. Ultimately, collaboration between humans and AI may unlock the best path forward - much as targeted experiments should illuminate effective oversight.

As benchmarks must challenge newly enhanced capabilities, datasets like GPQA inherently decay over time. This very quality that makes them leading indicators also demands continual redefinition. However, the methodology of sourcing and confirming questions from the frontier of expertise itself offers a template. Similar principles could shape dynamic suites tuned to drive and validate progress in perpetuity. In the meantime, systems that perform on par with humans across GPQA's spectrum of natural and open-ended problems would constitute a historic achievement - and bring us one step closer to beneficial artificial general intelligence engineered hand in hand with human well-being.


Q: What key capabilities does GPQA test?

GPQA spans skills like scientific reasoning, understanding technical language, gathering and contextualizing information, and drawing insights across questions.

Q: How are the questions generated and confirmed?  

Domain-expert contractors devise graduate-level questions, explain their answers, and refine them based on feedback from peer experts.

Q: Why create such a difficult benchmark dataset?

Tough questions probe the limits of existing knowledge crucial for oversight of more capable AI. They also decay slowly, maintaining relevance over time.


Scalable oversight: Reliably evaluating and directing advanced AI systems that exceed an individual human supervisor's abilities in a given area of expertise.

Foundation model: A system trained on broad data that can be adapted to many downstream tasks through techniques like fine-tuning; large language models are a significant class of foundation models.

Benchmark decay: The tendency for benchmarks to become outdated and unchallenging as the systems they aim to evaluate continue rapid improvement.

Levels of AGI: The Path to Artificial General Intelligence

Artificial intelligence (AI) has seen tremendous progress recently, with systems like ChatGPT demonstrating impressive language abilities. However, current AI still falls short of human-level intelligence in many ways. So how close are we to developing true artificial general intelligence (AGI) - AI that can perform any intellectual task a human can?

A new paper from researchers at Google DeepMind proposes a framework for classifying and benchmarking progress towards AGI. The core idea is to evaluate AI systems based on their performance across diverse tasks, not just narrow capabilities like conversing or writing code. This allows us to understand how general vs specialized current AI systems are and track advancements in generality over time.

Why do we need a framework for thinking about AGI? Firstly, "AI" has become an overloaded term, often used synonymously with "AGI," even though systems are still far from human-level abilities. A clear framework helps set realistic expectations. Secondly, shared definitions enable the AI community to align on goals, measure progress, and identify risks at each stage. Lastly, policymakers need actionable advice on regulating AI; a nuanced, staged understanding of AGI is more valuable than treating it as a single endpoint.

Levels of AGI

The paper introduces "Levels of AGI" - a scale for classifying AI based on performance across various tasks. The levels range from 0 (Narrow non-AI) to 5 (Artificial Superintelligence exceeding all human abilities).

Within each level, systems can be categorized as either Narrow AI (specialized for a specific task) or General AI (able to perform well across many tasks). For instance, ChatGPT would be considered a Level 1 General AI ("Emerging AGI") - it can converse about many topics but makes frequent mistakes. Google's AlphaFold protein folding system is Level 5 Narrow AI ("Superhuman Narrow AI") - it far exceeds human abilities on its specialized task.

Higher levels correspond to increasing depth (performance quality) and breadth (generality) of capabilities. The authors emphasize that progress may be uneven - systems may "leapfrog" to higher generality before reaching peak performance. But both dimensions are needed to achieve more advanced AGI.
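The two dimensions above can be pictured as a simple rating function: performance sets the level, and generality decides whether a system counts as Narrow or General at that level. The sketch below is an illustrative reading of the paper's scheme; the percentile thresholds follow its rough definitions (Competent at the 50th percentile of skilled humans, Expert at the 90th, Virtuoso at the 99th, Superhuman above all humans), but the function and data layout are my own.

```python
# (threshold percentile vs. skilled humans, level number, level name)
LEVELS = [
    (100, 5, "Superhuman"),  # outperforms all humans
    (99, 4, "Virtuoso"),
    (90, 3, "Expert"),
    (50, 2, "Competent"),
    (1, 1, "Emerging"),      # comparable to an unskilled human
]

def classify(percentile, general):
    """Map performance percentile and breadth to an AGI level label."""
    for threshold, level, name in LEVELS:
        if percentile >= threshold:
            scope = "General AI (AGI)" if general else "Narrow AI"
            return level, f"Level {level} {name} {scope}"
    return 0, "Level 0 No AI"

# AlphaFold-style system: superhuman but specialized.
alphafold = classify(100, general=False)   # Level 5 Narrow AI
# ChatGPT-style system: broad but error-prone.
chatgpt = classify(10, general=True)       # Level 1 General AI
```

This also makes the "leapfrog" point concrete: a system can gain generality (the boolean) without its percentile moving, and vice versa.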

Principles for Defining AGI

In developing their framework for levels of AGI, the researchers identified six fundamental principles for defining artificial general intelligence in a robust, measurable way:

  • AGI should be evaluated based on system capabilities rather than internal mechanisms.
  • Both performance and generality must be separately measured, with performance indicating how well an AI accomplishes tasks and generality indicating the breadth of tasks it can handle.
  • The focus should be on assessing cognitive abilities like reasoning rather than physical skills.
  • An AI's capabilities should be evaluated based on its potential rather than deployment status.
  • Benchmarking should utilize ecologically valid real-world tasks that reflect skills people authentically value rather than convenient proxy tasks.
  • AGI should be thought of in terms of progressive levels rather than as a single endpoint to better track advancement and associated risks.

By following these principles, the levels of AGI aim to provide a definition and measurement framework to enable calibrated progress in developing beneficial AI systems.

Testing AGI Capabilities

The paper argues that shared benchmarks are needed to objectively evaluate where AI systems fall on the levels of AGI. These benchmarks should meet the above principles - assessing performance on a wide range of real-world cognitive tasks humans care about. 

Rather than a static set of tests, the authors propose a "living benchmark" that grows over time as humans identify new ways to demonstrate intelligence. Even complicated open-ended tasks like understanding a movie or novel should be included alongside more constrained tests. Such an AGI benchmark does not yet exist. However, developing it is an essential challenge for the community. With testing methodology aligned around the levels of AGI, we can build systems with transparent, measurable progress toward human abilities.
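The "living benchmark" idea amounts to an evaluation suite that accepts new tasks over time instead of freezing at publication. The class below is a hypothetical sketch of that structure, not anything proposed in the paper; the task format (a callable returning a score in [0, 1]) is an assumption for illustration.

```python
class LivingBenchmark:
    """A task registry that grows as new tests of intelligence are identified."""

    def __init__(self):
        self.tasks = {}  # task name -> callable(system) -> score in [0, 1]

    def register(self, name, task):
        # New tasks extend the suite rather than replacing it,
        # countering benchmark decay.
        self.tasks[name] = task

    def evaluate(self, system):
        # Score the system on every task registered so far.
        return {name: task(system) for name, task in self.tasks.items()}

bench = LivingBenchmark()
bench.register("arithmetic", lambda sys: 1.0 if sys("2+2") == "4" else 0.0)
results = bench.evaluate(lambda prompt: "4")
```

A system's profile across all registered tasks, rather than any single score, is what would place it on the levels of AGI.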

Responsible AGI Development 

The paper also relates AGI capabilities to considerations of risk and autonomy. More advanced AI systems may unlock new abilities like fully independent operation. However, increased autonomy does not have to follow automatically from greater intelligence. Thoughtfully chosen human-AI interaction modes can allow society to benefit from powerful AI while maintaining meaningful oversight. As capabilities grow, designers of AGI systems should carefully consider which tasks and decisions we choose to delegate vs monitor. Striking the right balance will ensure AI aligns with human values as progress continues.

Overall, the levels of AGI give researchers, companies, policymakers, and the broader public a framework for understanding and shaping the responsible development of intelligent machines. Benchmarking methodologies still need substantial work - but the path forward is clearer thanks to these guidelines for thinking about artificial general intelligence.

Top Facts from the Paper

  • Current AI systems have some narrow abilities resembling AGI but limited performance and generality compared to humans. ChatGPT is estimated to be a Level 1 "Emerging AGI."
  • Performance and generality (variety of tasks handled) are critical for evaluating progress.
  • Shared benchmarks are needed to objectively measure AI against the levels based on a diverse range of real-world cognitive tasks.
  • Increased autonomy should not be an automatic byproduct of intelligence - responsible development involves carefully considering human oversight.

The levels of AGI give us a framework to orient AI progress towards beneficial ends, not just technological milestones. Understanding current systems' capabilities and limitations provides the clarity needed to assess risks, set policies, and guide research positively. A standardized methodology for testing general intelligence remains an open grand challenge. 

But initiatives like Anthropic's AI safety research and this AGI roadmap from DeepMind researchers represent encouraging steps toward beneficial artificial intelligence.


Q: What are the levels of AGI?

The levels of AGI are a proposed framework for classifying AI systems based on their performance across a wide range of tasks. The levels range from 0 (Narrow Non-AI) to 5 (Artificial Superintelligence), with increasing capability in both depth (performance quality) and breadth (generality across tasks).

Q: Why do we need a framework like levels of AGI? 

A framework helps set expectations on AI progress, enables benchmarking and progress tracking, identifies risks at each level, and advises policymakers on regulating AI. Shared definitions allow coordination.

Q: How are performance and generality evaluated at the levels?

Performance refers to how well an AI system can execute specific tasks compared to humans. Generality refers to the variety of different tasks the system can handle. Both are central dimensions for AGI.

Q: What's the difference between narrow AI and general AI?

Narrow AI specializes in particular tasks, while general AI can perform well across various tasks. Each level of the framework includes both narrow and general categories.

Q: What are some examples of different AGI levels?

ChatGPT is currently estimated as a Level 1 "Emerging AGI." Google's AlphaFold is Level 5 "Superhuman Narrow AI" for protein folding. There are no examples yet of Level 3 or 4 General AI.

Q: How will testing determine an AI's level?

Shared benchmarks that measure performance on diverse real-world cognitive tasks are needed. This "living benchmark" will grow as new tests are added.

Q: What principles guided the levels of AGI design?

Fundamental principles include:

  • Focusing on capabilities over mechanisms.
  • Separating the evaluation of performance and generality.
  • Prioritizing cognitive over physical tasks.
  • Analyzing potential rather than deployment.
  • Using ecologically valid real-world tests.
  • Treating AGI as progressive levels rather than a single endpoint.

Q: How do the levels relate to autonomous systems?

Higher levels unlock greater autonomy, but it does not have to follow automatically. Responsible development involves carefully considering human oversight for AGI.

Q: How can the levels help with safe AGI development?

The levels allow for identifying risks and needed policies at each stage. Progress can be oriented towards beneficial ends by tracking capabilities, limitations, and risks.

Q: Are there any AGI benchmarks available yet?

No established benchmark exists yet, but developing standardized tests aligned with the levels of AGI is a significant challenge and opportunity for the AI community.


  • AGI - Artificial General Intelligence
  • Benchmark - Standardized tests to measure and compare the performance of AI systems
  • Cognitive - Relating to perception, reasoning, knowledge, and intelligence
  • Ecological validity - How well a test matches real-world conditions and requirements
  • Generality - The ability of an AI system to handle a wide variety of tasks
  • Human-AI interaction - How humans and AI systems communicate and collaborate
  • Performance - Quality with which an AI system can execute a particular task

Unveiling Open Souls PBC: Bridging the Gap between Humans and AI

I recently came across a tweet that mentioned a YouTube video and wanted to analyze the video without spending too much time. So I went to CPROMPT.AI, found a prompt app, and ran it on the YouTube video from the tweet, which produced the following analysis using GPT-4.

YouTube Video Analysis by GPT-4

Here is the video:

Have you ever wondered what it would be like to witness a groundbreaking moment in the interaction between humans and machines? Well, look no further! In this captivating video, we bring you an exclusive glimpse into last week's historic event at Betaworks, where Open Souls PBC showcased its mission of infusing artificial intelligence with the intangible qualities of humanity.

At first, skepticism filled the room as attendees pondered whether such a feat was even possible. However, as the presentation unfolded, doubts transformed into awe-inspiring fascination. The speaker expertly demonstrated how Open Souls PBC has successfully bridged the gap between humans and AI.

Throughout this six-minute video journey, you will witness firsthand how these pioneers are changing the game. By combining advanced technology with our innate human traits, they have unlocked a new realm of possibilities for both individuals and society as a whole.

Open Souls PBC believes that true progress lies not in replacing humans but in empowering them with AI capabilities. Their mission is to create intelligent systems that comprehend human emotions, empathize with our experiences, and adapt accordingly. This bold vision aims to enhance our lives by fostering harmonious collaboration between humans and machines.

The demonstration showcases how Open Souls PBC's cutting-edge technology enables AI algorithms to understand complex emotions expressed through facial expressions, gestures, and verbal cues. It's truly astonishing to witness how these algorithms can interpret subtle nuances once thought to be exclusive to human understanding.

Imagine a world where robots can detect when we're feeling down or stressed out and respond compassionately to provide comfort or assistance. Picture an AI system that effortlessly recognizes social cues during conversations and adapts responses accordingly. These are no longer just distant dreams but tangible realities brought closer daily by Open Souls PBC.

As you delve deeper into this mesmerizing video, you'll witness the audience's transformation from skeptics to believers. The power of Open Souls PBC's work becomes evident as each individual realizes they have just witnessed a pivotal moment in history: the convergence of humanity and artificial intelligence.

So, are you ready to embark on this incredible journey? Watch the video above and prepare to be amazed by what lies ahead. Witness for yourself how Open Souls PBC is revolutionizing our understanding of AI by infusing it with the essence of our own souls.

Open Souls PBC is redefining what it means to be human in an increasingly technological world. Join us as we explore the possibilities that lie at the intersection of humans and machines. Together, let's shape a future where AI not only understands us but also helps us become better versions of ourselves.

My Personal Reactions

I read the review and decided I should watch the video. I find this approach of anthropomorphizing AI to the level of a being with a soul really problematic. The presenter mentioned that the word "computer" was first used in the 1600s to refer to something or someone that does menial tasks.

The presenter, the founder of Open Souls PBC, whose mission is to create simulations that feel human, seems convinced that AI with soul simulation is the way forward. I'm sure some people would agree with him. However, I fall into the group that does not, as I believe AI is just artificial intelligence, not human.

AI mimics only the intelligence part of a human, enabling efficient, logical, and rational decision-making. This frees humans to engage our amazing gift of emotion. I'm not qualified to argue this point extensively, but this is how I feel as a human of average intelligence but above-average emotional quotient.
