All scorecards are generated using Claude (Anthropic). The model version is logged at generation time and stored in the scorecard JSON metadata. When grades are compared across scorecards generated by different model versions, the comparison notes the version difference as a limitation.
AI is used because it makes the platform viable at scale: a human-only research team could not cover senators, representatives, governors, mayors, and federal agencies simultaneously. The AI handles research and bucket assignment; the grade itself is computed deterministically by code after AI outputs are received.
The grade calculation is performed deterministically by the scoring pipeline code after AI outputs are received. The AI assigns bucket values; the code computes all arithmetic. Because the formula lives in code rather than in the prompt, prompt drift or changes in model behavior cannot alter it. An official's grade is a function of their record, not of a language model's output tendencies.
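The split between AI-assigned buckets and code-computed grades can be illustrated with a minimal sketch. The actual grade formula is not reproduced here, so the point values, the letter cutoffs, and the name compute_grade are all hypothetical stand-ins; only the D0 to D4 labels come from the delivery scores mentioned in this document. What the sketch shows is the property that matters: the grade is a pure function of the bucket list.

```python
from statistics import mean

# Hypothetical mapping from delivery buckets to points; the real rubric may differ.
BUCKET_POINTS = {"D0": 0, "D1": 1, "D2": 2, "D3": 3, "D4": 4}

# Hypothetical letter cutoffs on the 0-4 average, checked from highest to lowest.
GRADE_CUTOFFS = [(3.5, "A"), (2.5, "B"), (1.5, "C"), (0.5, "D")]


def compute_grade(buckets: list[str]) -> str:
    """Pure function: the same bucket list always yields the same grade."""
    if not buckets:
        raise ValueError("at least one scored promise is required")
    average = mean(BUCKET_POINTS[b] for b in buckets)
    for cutoff, letter in GRADE_CUTOFFS:
        if average >= cutoff:
            return letter
    return "F"
```

Under these assumed cutoffs, compute_grade(["D2", "D3"]) averages 2.5 and returns "B" on every run, regardless of which model version produced the bucket values.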
The AI prompt includes the full promise qualification criteria, the full scoring rubric with all bucket definitions, the Their Role lookup table, behavioral flag definitions, and explicit instructions to output structured JSON. The prompt does not include examples from specific officials, because such examples could anchor scoring to the performance level of a particular party. Low-confidence assessments are flagged as Contested or Limited Evidence rather than forced into a bucket.
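One way the pipeline might consume the structured JSON is sketched below. The JSON layout (an "assessments" list whose items carry "promise_id" and "bucket" fields) and the function names are assumptions made for illustration, as is the idea that Contested and Limited Evidence appear as values of the same field as the delivery buckets.

```python
import json

# Assumed label set: D0-D4 delivery buckets plus the two low-confidence flags.
LOW_CONFIDENCE = {"Contested", "Limited Evidence"}
ALLOWED_BUCKETS = {"D0", "D1", "D2", "D3", "D4"} | LOW_CONFIDENCE


def parse_assessments(raw: str) -> list[dict]:
    """Parse model output and reject any bucket label outside the rubric."""
    data = json.loads(raw)
    assessments = data["assessments"]  # hypothetical top-level key
    for item in assessments:
        bucket = item.get("bucket")
        if bucket not in ALLOWED_BUCKETS:
            raise ValueError(f"unexpected bucket: {bucket!r}")
    return assessments


def low_confidence_items(assessments: list[dict]) -> list[dict]:
    """Return the assessments flagged as low confidence instead of bucketed."""
    return [a for a in assessments if a["bucket"] in LOW_CONFIDENCE]
```

Rejecting unknown labels at parse time keeps malformed model output from reaching the arithmetic stage.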
The model version is logged in every scorecard JSON. Periodic consistency audits re-score a fixed set of test cases across model versions. Any systematic difference in classification triggers a methodology review before new scorecards are published.
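A consistency audit of this kind could be sketched as a comparison of two classification maps, one per model version. The function names, the case identifiers, and the max_rate tolerance are hypothetical; in particular, what counts as a "systematic" difference is not defined here, so the default below simply treats any disagreement as a trigger.

```python
def changed_cases(baseline: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """List the test cases whose bucket differs between two model versions."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same fixed test set")
    return sorted(case for case in baseline if baseline[case] != candidate[case])


def needs_methodology_review(
    baseline: dict[str, str],
    candidate: dict[str, str],
    max_rate: float = 0.0,  # hypothetical tolerance; 0.0 means any change triggers
) -> bool:
    """Return True when the disagreement rate exceeds the allowed tolerance."""
    if not baseline:
        raise ValueError("the test set must not be empty")
    rate = len(changed_cases(baseline, candidate)) / len(baseline)
    return rate > max_rate
```

A real audit would likely also look at the direction of the changes (for example, whether one version consistently scores higher), which this sketch does not attempt.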
Every scorecard carries an AI Draft flag until human review is complete. Human reviewers verify scores, check evidence, confirm flag applications, and may override any AI output, provided the override is documented. Scorecards with politically sensitive findings, D0 or D4 delivery scores, or any applied behavioral flag receive full review before publication.
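The review rules above can be expressed as a small gating check. This is a sketch under stated assumptions: the Scorecard fields and function names are hypothetical, and how a finding is marked politically sensitive is not specified, so it is modeled as a plain boolean.

```python
from dataclasses import dataclass, field


@dataclass
class Scorecard:
    # All field names are hypothetical, chosen for illustration only.
    delivery_scores: list[str]
    behavioral_flags: list[str] = field(default_factory=list)
    politically_sensitive: bool = False
    human_reviewed: bool = False


def is_ai_draft(card: Scorecard) -> bool:
    """A scorecard stays an AI Draft until human review is complete."""
    return not card.human_reviewed


def requires_full_review(card: Scorecard) -> bool:
    """Full review applies to sensitive findings, D0/D4 scores, or any flag."""
    has_extreme_score = any(s in {"D0", "D4"} for s in card.delivery_scores)
    return card.politically_sensitive or has_extreme_score or bool(card.behavioral_flags)
```

For example, a card with delivery scores ["D2", "D4"] requires full review because of the D4, even with no flags applied.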