Nonpartisan Government Accountability
PolicyLogic
How We Apply Our Methodology
INDEPENDENT & TRANSPARENT METHODOLOGY
AI Pipeline
How AI is used in PolicyLogic, what it determines, what it does not determine, and how the pipeline is designed to prevent grade drift across model versions.

The Role of AI

All scorecards are generated using Claude (Anthropic). The model version is logged at generation time and stored in the scorecard JSON metadata. When grades are compared across scorecards generated by different model versions, the comparison notes this as a limitation.
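To make the logging step concrete, here is a minimal sketch of how a model version might be stored in scorecard metadata and checked when two scorecards are compared. The field names (`metadata`, `model_version`, `generated_at`), the function names, and the version strings are assumptions for illustration, not PolicyLogic's actual schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_scorecard(official_id: str, assessments: list, model_version: str) -> dict:
    """Assemble a scorecard dict with the model version recorded in metadata.

    All field names here are hypothetical.
    """
    return {
        "official_id": official_id,
        "assessments": assessments,
        "metadata": {
            "model_version": model_version,
            "generated_at": datetime.now(timezone.utc).isoformat(),
        },
    }


def comparison_note(a: dict, b: dict) -> Optional[str]:
    """Return a limitation note if two scorecards used different model versions."""
    va = a["metadata"]["model_version"]
    vb = b["metadata"]["model_version"]
    if va != vb:
        return f"Generated by different model versions ({va} vs {vb}); compare grades with caution."
    return None


def to_json(scorecard: dict) -> str:
    """Serialize the scorecard, metadata included, for storage."""
    return json.dumps(scorecard, indent=2)
```

Keeping the version inside the same JSON document as the grades means the limitation note can be derived from the stored scorecards alone.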

AI is used because it makes the platform viable at scale: a human-only research team could not cover senators, representatives, governors, mayors, and federal agencies simultaneously. The AI handles research and bucket assignment. The grade itself is computed deterministically by code after the AI outputs are received.

What the AI Determines

What the AI Does Not Determine

The grade calculation is performed deterministically by the scoring pipeline code after AI outputs are received. The AI assigns bucket values; the code computes all arithmetic. This means the grade formula cannot be altered by prompt drift or model behavior changes. An official's grade is a function of their record, not of a language model's output tendencies.
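The split between AI-assigned buckets and code-computed arithmetic can be illustrated with a short sketch. The actual grade formula is not given in this section, so the averaging rule and letter cutoffs below are invented for illustration only. The sketch also assumes delivery buckets run from D0 (lowest) to D4 (highest).

```python
from statistics import mean
from typing import List

# Hypothetical cutoffs, for illustration only. The real formula is not stated here.
GRADE_CUTOFFS = [(3.5, "A"), (2.5, "B"), (1.5, "C"), (0.5, "D")]


def delivery_points(bucket: str) -> int:
    """Map a delivery bucket label such as 'D3' to its integer value (0 to 4)."""
    if len(bucket) != 2 or bucket[0] != "D" or bucket[1] not in "01234":
        raise ValueError(f"unknown delivery bucket: {bucket!r}")
    return int(bucket[1])


def compute_grade(buckets: List[str]) -> str:
    """Compute a letter grade from AI-assigned buckets using fixed arithmetic.

    The function is pure: the same bucket list always yields the same grade,
    regardless of which model produced the buckets.
    """
    if not buckets:
        raise ValueError("at least one bucket is required")
    average = mean(delivery_points(b) for b in buckets)
    for cutoff, letter in GRADE_CUTOFFS:
        if average >= cutoff:
            return letter
    return "F"
```

Because the function takes only bucket labels as input, a change in model behavior can affect a grade only by changing a bucket assignment, never by changing the arithmetic.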

Prompt Discipline

The AI prompt includes the full promise qualification criteria, the full scoring rubric with all bucket definitions, the Their Role lookup table, behavioral flag definitions, and explicit instructions to output structured JSON. The prompt does not include examples from specific officials, since such examples could anchor scoring to the performance level of a particular party. Low-confidence assessments are flagged as Contested or Limited Evidence rather than forced into a bucket.
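A sketch of how the structured JSON output might be validated before it reaches the scoring code is shown below. The field names (`promise_id`, `bucket`), the D0 to D4 bucket set, and the `status` values are assumptions. Only the Contested and Limited Evidence labels come from the text above.

```python
import json

DELIVERY_BUCKETS = {"D0", "D1", "D2", "D3", "D4"}  # assumed bucket labels
LOW_CONFIDENCE_LABELS = {"Contested", "Limited Evidence"}


def parse_assessment(raw: str) -> dict:
    """Parse one AI assessment and reject anything outside the rubric.

    Low-confidence labels are passed through as a status with no bucket,
    so they are never forced into a scored bucket.
    """
    data = json.loads(raw)
    for field in ("promise_id", "bucket"):
        if field not in data:
            raise ValueError(f"missing required field: {field}")
    bucket = data["bucket"]
    if bucket in LOW_CONFIDENCE_LABELS:
        return {"promise_id": data["promise_id"], "bucket": None, "status": bucket}
    if bucket not in DELIVERY_BUCKETS:
        raise ValueError(f"bucket not in rubric: {bucket!r}")
    return {"promise_id": data["promise_id"], "bucket": bucket, "status": "Scored"}
```

Rejecting unknown labels at this boundary keeps free-form model output from leaking into the grade arithmetic.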

Consistency Audits

Model version is logged in every scorecard JSON. Periodic consistency audits re-score a fixed set of test cases across model versions. Any systematic difference in classification triggers a methodology review before new scorecards are published.
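One way such an audit could be expressed is to compare the classifications two model versions give to the same fixed test cases. In this sketch the case identifiers, the function names, and the `max_rate` threshold are hypothetical. The default threshold of zero treats any disagreement as a trigger, which is one possible reading of "systematic difference".

```python
from typing import Dict, List


def audit_disagreements(baseline: Dict[str, str], candidate: Dict[str, str]) -> List[str]:
    """Return the sorted test-case ids whose classification differs between versions."""
    if set(baseline) != set(candidate):
        raise ValueError("both runs must cover the same fixed set of test cases")
    return sorted(cid for cid in baseline if baseline[cid] != candidate[cid])


def needs_methodology_review(
    baseline: Dict[str, str], candidate: Dict[str, str], max_rate: float = 0.0
) -> bool:
    """Report whether the disagreement rate exceeds the allowed threshold."""
    if not baseline:
        raise ValueError("the audit set must not be empty")
    rate = len(audit_disagreements(baseline, candidate)) / len(baseline)
    return rate > max_rate
```

A review flagged this way would block publication of new scorecards until the differences are explained.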

Symmetry testing: Before any scoring run, identical fact patterns are tested across party labels to check for systematic differences in AI classification. Any difference is a prompt or methodology issue to fix before the run goes live.
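The symmetry check can be sketched as a loop that classifies each fact pattern once per party label and reports any pattern whose result varies. The `classify` callable stands in for the real AI call and is an assumption, as are the function and parameter names.

```python
from typing import Callable, Iterable, List


def symmetry_failures(
    fact_patterns: Iterable[str],
    party_labels: Iterable[str],
    classify: Callable[[str, str], str],
) -> List[str]:
    """Return the fact patterns whose classification changes with the party label.

    `classify(facts, party)` is a hypothetical wrapper around the model call.
    """
    labels = list(party_labels)
    failures = []
    for facts in fact_patterns:
        results = {classify(facts, party) for party in labels}
        if len(results) > 1:  # the same facts produced different buckets
            failures.append(facts)
    return failures
```

An empty result means no asymmetry was detected on the tested patterns. It does not prove the absence of bias on patterns outside the test set.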

Human Review

Every scorecard carries an AI Draft flag until human review is complete. Human reviewers verify scores, check evidence, confirm flag applications, and have override authority, subject to documentation requirements. Scorecards with politically sensitive findings, D0 or D4 delivery scores, or any applied behavioral flags receive full review before publication.
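The routing rule in this paragraph could be sketched as follows. The scorecard field names (`assessments`, `bucket`, `behavioral_flags`, `politically_sensitive`, `human_review_complete`) are hypothetical, and the "Reviewed" status label is an assumption. Only the AI Draft label and the three triggers come from the text.

```python
def requires_full_review(scorecard: dict) -> bool:
    """Return True if any full-review trigger applies to the scorecard."""
    extreme_score = any(
        a.get("bucket") in {"D0", "D4"} for a in scorecard.get("assessments", [])
    )
    has_flags = bool(scorecard.get("behavioral_flags"))
    sensitive = bool(scorecard.get("politically_sensitive"))
    return extreme_score or has_flags or sensitive


def publication_status(scorecard: dict) -> str:
    """Keep the AI Draft label until a human review is recorded as complete."""
    return "Reviewed" if scorecard.get("human_review_complete") else "AI Draft"
```

In this sketch the draft label defaults to on, so a scorecard with no review record is still shown as an AI Draft.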