I built a game where domain experts try to break frontier AI

What is "I built a game where domain experts try to break frontier AI"?

This platform invites domain experts to test frontier AI models by submitting specialist questions designed to expose weaknesses or failures. You participate by creating questions in your field of expertise, attempting to stump models like GPT-4 or Claude, and documenting where they struggle or give incorrect answers. The platform then verifies your submissions and pays contributors for confirmed failures. It serves researchers, AI developers, and domain specialists interested in understanding AI limitations in their areas, whilst helping build a public record of where current models genuinely fall short rather than relying on marketing claims.
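To make the contribution workflow more concrete, here is a minimal sketch of how a contributor might record a candidate failure before submitting it. The record structure below is purely illustrative and hypothetical; the platform's actual submission format is not documented here.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class FailureReport:
    """Illustrative record of one candidate AI failure (not the platform's real schema)."""
    domain: str            # e.g. "pharmacology", "UK contract law"
    question: str          # the expert question posed to the model
    model: str             # which frontier model was tested
    model_answer: str      # the answer the model actually gave
    expected_answer: str   # the answer a domain expert would accept
    why_it_fails: str      # short explanation of the error
    tested_on: date = field(default_factory=date.today)

report = FailureReport(
    domain="pharmacology",
    question="Which CYP450 interaction makes this drug combination contraindicated?",
    model="gpt-4",
    model_answer="(model's incorrect answer here)",
    expected_answer="(reference answer, ideally with a citation)",
    why_it_fails="Names the wrong enzyme and misses the contraindication.",
)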

Key Features

Submit expert questions

Create and submit domain-specific questions intended to expose AI model failures

Multi-model testing

Test your questions against several frontier AI models to find discrepancies (a rough code sketch of this workflow appears after this feature list)

Failure verification

Platform reviews submissions to confirm genuine AI failures before payment

Payment for verified fails

Earn money when your questions successfully stump AI models and pass verification

Freemium access

Use basic features at no cost; a paid tier appears to offer additional benefits
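As referenced under "Multi-model testing", here is a minimal sketch of sending the same expert question to two frontier models and collecting the answers side by side. It assumes you have API keys for the official openai and anthropic Python SDKs; the model names and the question are illustrative, and this is my own sketch rather than anything provided by the platform.

# Minimal sketch: ask one expert question of two frontier models and
# print the answers for side-by-side review. Model names are illustrative.
from openai import OpenAI          # official OpenAI SDK (assumed installed)
from anthropic import Anthropic    # official Anthropic SDK (assumed installed)

QUESTION = "In UK contract law, when does a 'battle of the forms' resolve in favour of the last shot?"

def ask_openai(question: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def ask_anthropic(question: str) -> str:
    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    answers = {
        "gpt-4o": ask_openai(QUESTION),
        "claude-3-5-sonnet": ask_anthropic(QUESTION),
    }
    # Disagreement between models is often a good signal that the question
    # probes a genuine weak spot worth documenting and submitting.
    for model, answer in answers.items():
        print(f"--- {model} ---\n{answer}\n")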

Pros & Cons

Advantages

  • Practical contribution to AI safety and research by documenting real model limitations
  • Straightforward way for experts to earn money by doing something within their domain knowledge
  • Builds transparent documentation of where frontier AI actually fails, useful for practitioners and researchers
  • No technical expertise required beyond domain knowledge in your field

Limitations

  • Payment depends on verification, so not all submitted questions will generate income
  • Limited to individuals with genuine expert knowledge; casual users unlikely to succeed
  • Verification process may be slow or have unclear criteria for what constitutes a 'failure'

Use Cases

Medical professionals testing AI diagnostic tools for accuracy in clinical edge cases

Lawyers reviewing AI legal research tools for incorrect interpretations of case law

Software engineers finding bugs in AI code generation models

Academics documenting domain-specific weaknesses in large language models

Industry specialists gathering evidence of AI limitations before deployment decisions