Methodology · PolicyLogic

Known Limitations

Every accountability system has gaps. These are ours — documented transparently so that users and researchers understand where to apply caution.

AI Training Data Bias

Mitigation: human review + cross-party audits

Claude's training data has an unknown partisan distribution. It may have seen more critical coverage of some officials than others, producing systematic scoring differences unrelated to actual delivery.

Source Availability Asymmetry

Mitigation: Limited Evidence flag + minimum source requirements

Governors of large states have extensive press coverage. Mayors of small cities may have one or two local sources. Low-coverage officials will tend toward lower scores because evidence of delivery is harder to find.

Their Role Subjectivity

Mitigation: lookup table + published rationale per promise

Assigning a 0.6 vs. 0.8 Their Role score involves judgment even with the lookup table. Inconsistency across AI runs remains possible for situations that don't match an anchor value.

Impact Scores Embed Value Judgments

Mitigation: Magnitude defined in terms of documented population affected

Designating a promise as M3 vs. M2 requires a judgment about significance. The framework defines Magnitude in terms of population affected, but application embeds assumptions about what matters.

High-Impact Failures Penalize More

Intentional design — disclosed on every scorecard

Impact is not modified by delivery. A failed high-impact promise drags down a grade more than a failed low-impact one. Ambitious officials who fail are graded more harshly than cautious ones who deliver.

Non-English-Language Coverage Gap

Mitigation: community-submitted promise nominations accepted

Promises made in small-market radio interviews or community events in languages other than English may not surface in web searches. This systematically disadvantages accountability for commitments made to non-English-speaking constituencies.

Long-Horizon Promises Mid-Term

Mitigation: Time Pressure adjustment + prominent term stage display

Officials scored mid-term appear to underdeliver on long-term promises simply because they haven't had time. The Time Pressure adjustment partially addresses this but does not eliminate it.

Model Version Drift

Mitigation: version logging + periodic consistency audits

As Claude is updated, scoring behavior may shift. Without periodic re-scoring audits, grade comparisons across scorecards generated by different model versions may reflect model drift rather than genuine differences in delivery.

Difficulty Is Coupled to Delivery

Intentional design — disclosed in the scoring framework

Difficulty points are earned in proportion to delivery, not independently. An ambitious promise that is never delivered earns zero difficulty credit, so a failed high-difficulty promise loses points on two axes at once. This is deliberate — it prevents officials from banking credit for promising hard things they never did — but it means Difficulty is not a standalone measure of ambition.

Found an Error?

If you find a factual error, a missing promise, or a score you believe is wrong, use the Report an Error link on any scorecard. Every submission is reviewed. Corrections are published in the Error Log.

View the Error Log →