  1. Details about METR’s preliminary evaluation of Claude 3.5 ...

    METR evaluated Claude-3.5-Sonnet on tasks from both our general autonomy and AI R&D task suites. The general autonomy evaluations were performed similarly to our GPT-4o evaluation, and used …

  2. Techmeme: METR: Claude Opus 4.5 has a 50% task completion ...

    9 hours ago · METR: Claude Opus 4.5 has a 50% task completion time horizon of about 4 hours and 49 minutes, more than double that of Claude Opus 4 released earlier this year — We estimate that, on …

  3. Anthropic's models beat o3 in some time-horizon tests | METR ...

    In measurements using our set of multi-step software and reasoning tasks, Anthropic's Claude 4 Opus and Sonnet reach 50%-time-horizon point estimates of about 80 and 65 minutes, respectively. Note ...

  4. METR announced: Claude Opus 4.5 surpasses its predecessor in task completion ...

    The AI research organization METR has published its performance evaluation of Anthropic's newest AI model, Claude Opus 4.5. The organization's ...

  5. METR’s preliminary evaluation of o3 and o4-mini — LessWrong

    Apr 16, 2025 · Do we know how Claude would do if given a higher token budget? Maybe this isn't relevant as it never gets close to the budget, and it submits answers well before hitting the limit?

  6. An update on our preliminary evaluations of Claude 3.5 Sonnet ...

    Jan 31, 2025 · METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 release), and a pre-deployment checkpoint of OpenAI’s o1. In both cases, we failed to …

  7. In this introductory section, we briefly describe the models and our release decision process for them, including our decision to release Claude Opus 4 under the AI Safety Level 3 Standard and Claude …