![]() |
52 Weeks of CloudAuthor: Noah Gift
A weekly podcast on technical topics related to cloud computing including: MLOPs, LLMs, AWS, Azure, GCP, Multi-Cloud and Kubernetes. Language: en Genres: Education, How To, Technology Contact email: Get it Feed URL: Get it iTunes ID: Get it |
Listen Now...
ELO Ratings Questions
Episode 223
Thursday, 18 September, 2025
Key ArgumentThesis: Using ELO for AI agent evaluation = measuring noiseProblem: Wrong evaluators, wrong metrics, wrong assumptions Solution: Quantitative assessment frameworksThe Comparison (00:00-02:00)Chess ELOFIDE arbiters: 120hr trainingBinary outcome: win/lossTest-retest: r=0.95Cohen's κ=0.92AI Agent ELORandom users: Google engineer? CS student? 10-year-old?Undefined dimensions: accuracy? style? speed?Test-retest: r=0.31 (coin flip)Cohen's κ=0.42Cognitive Bias Cascade (02:00-03:30)Anchoring: 34% rating variance in first 3 secondsConfirmation: 78% selective attention to preferred featuresDunning-Kruger: d=1.24 effect sizeResult: Circular preferences (A>B>C>A)The Quantitative Alternative (03:30-05:00)Objective MetricsMcCabe complexity ≤20Test coverage ≥80%Big O notation comparisonSelf-admitted technical debtReliability: r=0.91 vs r=0.42Effect size: d=2.18Dream Scenario vs Reality (05:00-06:00)DreamWorld's best engineersAnnotated metricsStandardized criteriaReality Random internet usersNo expertise verificationSubjective preferencesKey StatisticsMetricChessAI AgentsInter-rater reliabilityκ=0.92κ=0.42Test-retestr=0.95r=0.31Temporal drift±10 pts±150 ptsHurst exponent0.890.31TakeawaysStop: Using preference votes as quality metricsStart: Automated complexity analysisROI: 4.7 months to break evenCitations MentionedKapoor et al. (2025): "AI agents that matter" - κ=0.42 findingSantos et al. (2022): Technical Debt Grading validationRegan & Haworth (2011): Chess arbiter reliability κ=0.92Chapman & Johnson (2002): 34% anchoring effectQuotable Moments"You can't rate chess with basketball fans""0.31 reliability? That's a coin flip with extra steps""Every preference vote is a data crime""The psychometrics are screaming"ResourcesTechnical Debt Grading (TDG) FrameworkPMAT (Pragmatic AI Labs MCP Agent Toolkit)McCabe Complexity CalculatorCohen's Kappa Calculator 🔥 Hot Course Offers:🤖 Master GenAI Engineering - Build Production AI Systems🦀 Learn Professional Rust - Industry-Grade Development📊 AWS AI & Analytics - Scale Your ML in Cloud⚡ Production GenAI on AWS - Deploy at Enterprise Scale🛠️ Rust DevOps Mastery - Automate Everything🚀 Level Up Your Career:💼 Production ML Program - Complete MLOps & Cloud Mastery🎯 Start Learning Now - Fast-Track Your ML Career🏢 Trusted by Fortune 500 TeamsLearn end-to-end ML engineering from industry veterans at PAIML.COM