Methodology · PolicyLogic

AI Pipeline

How AI is used in PolicyLogic, what it determines, what it does not determine, and how the pipeline is designed to prevent grade drift across model versions.

The Role of AI

All scorecards are generated using Claude (Anthropic). The model version is logged at generation time and stored in the scorecard JSON metadata. When scorecards generated by different model versions are compared, the comparison notes this as a limitation.
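A minimal sketch of how such a comparison note could be produced from the stored metadata. The `metadata` and `model_version` key names and the wording of the note are assumptions for illustration, not the platform's actual schema.

```python
from typing import Optional

def comparison_note(card_a: dict, card_b: dict) -> Optional[str]:
    """Return a limitation note when two scorecards came from different model versions."""
    # Hypothetical layout: each scorecard JSON carries metadata.model_version.
    version_a = card_a["metadata"]["model_version"]
    version_b = card_b["metadata"]["model_version"]
    if version_a == version_b:
        return None  # same model version, no note needed
    return (
        "These scorecards were generated by different model versions "
        f"({version_a} vs. {version_b}); grade differences may partly reflect that."
    )
```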

AI is used because it makes the platform viable at scale: a human-only research team could not cover senators, representatives, governors, mayors, and federal agencies simultaneously. The AI handles research and bucket assignment. The grade itself is computed deterministically by code after AI outputs are received.

What the AI Determines

Identifying qualifying promises from campaign sources, applying the inclusion criteria.
Assigning each promise a Promise Type, Delivery bucket (D0–D4), Their Role score (using the lookup table), Difficulty bucket (H1–H3), Scale (S1–S3), Magnitude (M1–M3), and Clarity score (2–5).
Flagging applicable behavioral flags with a written rationale for each.
Writing a one-sentence evidence summary for each promise, with source citation.
Populating the context panel fields (term stage, legislature control, inherited conditions, external events).
Generating the scorecard narrative.
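Because the AI's assignments are drawn from fixed ranges, they can be checked mechanically before anything is scored. The sketch below validates one promise record against the ranges listed above. The JSON key names are hypothetical, and it does not model Promise Type, Their Role, or the Contested and Limited Evidence markers.

```python
import json

# Allowed values come from the ranges in the text; the key names are hypothetical.
ALLOWED = {
    "delivery": ("D0", "D1", "D2", "D3", "D4"),
    "difficulty": ("H1", "H2", "H3"),
    "scale": ("S1", "S2", "S3"),
    "magnitude": ("M1", "M2", "M3"),
    "clarity": (2, 3, 4, 5),
}

def validate_promise(raw: str) -> list[str]:
    """Return a list of problems found in one promise record; empty means valid."""
    record = json.loads(raw)
    problems = []
    for field, allowed in ALLOWED.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] not in allowed:
            problems.append(f"invalid {field}: {record[field]!r}")
    return problems
```

A record that fails validation would be rejected or sent back for regeneration rather than passed to the scoring code.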

What the AI Does Not Determine

The AI does not compute grades. The scoring pipeline code performs the grade calculation deterministically after AI outputs are received: the AI assigns bucket values, and the code computes all arithmetic. The grade formula therefore cannot be altered by prompt drift or changes in model behavior. Given the same bucket values, an official's grade is a function of their record, not of a language model's output tendencies.
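The separation can be illustrated with a pure function that takes bucket values and returns a number. The point values and the simple average below are placeholders invented for this sketch, not the PolicyLogic grade formula, and the `delivery` key name is hypothetical.

```python
# Placeholder point values: NOT the PolicyLogic formula. The ordering of D0-D4
# is assumed here only to show arithmetic living in code, not in the model.
DELIVERY_POINTS = {"D0": 0, "D1": 1, "D2": 2, "D3": 3, "D4": 4}

def average_delivery(promises: list[dict]) -> float:
    """Pure function: the same bucket values always produce the same number."""
    if not promises:
        raise ValueError("no promises to score")
    total = sum(DELIVERY_POINTS[p["delivery"]] for p in promises)
    return total / len(promises)
```

Since the function makes no model call and reads no external state, rerunning it on stored bucket values reproduces the same result regardless of which model version produced those values.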

Prompt Discipline

The AI prompt includes the full promise qualification criteria, the full scoring rubric with all bucket definitions, the Their Role lookup table, behavioral flag definitions, and explicit instructions to output structured JSON. The prompt does not include examples from specific officials, which could bias scoring for or against a particular party. Low-confidence assessments are flagged as Contested or Limited Evidence rather than forced into a bucket.

Consistency Audits

Model version is logged in every scorecard JSON. Periodic consistency audits re-score a fixed set of test cases across model versions. Any systematic difference in classification triggers a methodology review before new scorecards are published.
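One way to sketch the comparison step of such an audit: given the classifications two model versions produced for the same fixed test cases, report every field that changed. The data layout (test-case id mapped to a dict of bucket fields) is an assumption, and deciding whether the differences are "systematic" is left to the reviewer.

```python
def classification_diffs(baseline: dict, candidate: dict) -> dict:
    """Compare two runs keyed by test-case id; return the fields that changed."""
    diffs = {}
    for case_id, old in baseline.items():
        new = candidate.get(case_id, {})
        # Map each changed field to its (old value, new value) pair.
        changed = {
            field: (old[field], new.get(field))
            for field in old
            if old[field] != new.get(field)
        }
        if changed:
            diffs[case_id] = changed
    return diffs
```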

Symmetry testing: Before any scoring run, identical fact patterns are tested across party labels to check for systematic differences in AI classification. Any difference is a prompt or methodology issue to fix before the run goes live.
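A symmetry check of this kind could be sketched as follows. The `classify` callable stands in for whatever wraps the model call and is purely hypothetical, as is the idea that the party label is passed as a separate argument.

```python
from typing import Callable

def symmetry_failures(
    fact_patterns: list[str],
    party_labels: list[str],
    classify: Callable[[str, str], dict],
) -> list[str]:
    """Return the fact patterns whose classification changes with the party label."""
    failures = []
    for facts in fact_patterns:
        # Classify the identical facts once per party label.
        results = [classify(facts, party) for party in party_labels]
        if any(result != results[0] for result in results[1:]):
            failures.append(facts)
    return failures
```

An empty result means no fact pattern was classified differently across labels; any non-empty result would block the run until the prompt or methodology is fixed.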

Human Review

Every scorecard carries an AI Draft flag until human review is complete. Human reviewers verify scores, check evidence, and confirm flag applications; they may override any AI assignment, provided the override is documented. Scorecards with politically sensitive findings, D0 or D4 delivery scores, or any applied behavioral flag receive full review before publication.
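The full-review triggers can be expressed as a simple predicate. This is a sketch under assumed field names (`politically_sensitive`, `promises`, `delivery`, `behavioral_flags`); how a finding is judged politically sensitive is not specified here and is treated as an input.

```python
def needs_full_review(scorecard: dict) -> bool:
    """True when any full-review trigger applies (field names are hypothetical)."""
    if scorecard.get("politically_sensitive", False):
        return True
    for promise in scorecard.get("promises", []):
        # D0 and D4 are the two ends of the Delivery range.
        if promise.get("delivery") in ("D0", "D4"):
            return True
        # Any applied behavioral flag also triggers full review.
        if promise.get("behavioral_flags"):
            return True
    return False
```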