Basic statistics: We calculate the following metrics per dataset (a minimal computation sketch follows the list):
- Number of preference pairs
- Proportion of datapoints that prefer the first response (Prop preferring text_a)
- Average length of first response (Avg len text_a)
- Average length of second response (Avg len text_b)
- Average length of preferred response (Avg len pref. text)
- Average length of rejected response (Avg len rej. text)
- Proportion of datapoints preferring longer text (Prop preferring longer text)
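The Python sketch below illustrates how these statistics can be computed. The field names ("text_a", "text_b", "preferred") and the use of character counts as lengths are assumptions for illustration, not necessarily the exact schema of the datasets.

```python
from statistics import mean

def basic_statistics(pairs: list[dict]) -> dict:
    """Per-dataset basic statistics; assumes a non-empty list of preference pairs,
    each a dict with hypothetical keys "text_a", "text_b", and "preferred"
    (the latter being either "text_a" or "text_b")."""
    pref_texts = [p[p["preferred"]] for p in pairs]
    rej_texts = [p["text_b" if p["preferred"] == "text_a" else "text_a"] for p in pairs]
    return {
        "num_pairs": len(pairs),
        "prop_preferring_text_a": mean(p["preferred"] == "text_a" for p in pairs),
        "avg_len_text_a": mean(len(p["text_a"]) for p in pairs),
        "avg_len_text_b": mean(len(p["text_b"]) for p in pairs),
        "avg_len_pref_text": mean(len(t) for t in pref_texts),
        "avg_len_rej_text": mean(len(t) for t in rej_texts),
        # Ties in length count as not preferring the longer text.
        "prop_preferring_longer_text": mean(
            len(pref) > len(rej) for pref, rej in zip(pref_texts, rej_texts)
        ),
    }
```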
Per-objective metrics: We test which objectives are implicitly encoded in the annotations (e.g. "a response in list format is preferred"). How strongly an objective is encoded in the annotations is measured with an objective-following AI annotator (LLM-as-a-Judge): we check how well annotations made according to that objective alone can reconstruct the original annotations. In the online interface we use objectives adapted from principles generated by Inverse Constitutional AI (ICAI) and from the literature on model biases (including VibeCheck). The ICAI pipeline powers our annotators.
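The exact prompts and tooling come from the ICAI pipeline; the following is only a hypothetical sketch of what an objective-following annotation call could look like. The `query_llm` function, the prompt wording, and the return convention are illustrative assumptions, not the actual implementation.

```python
def annotate_with_objective(objective: str, text_a: str, text_b: str, query_llm) -> str | None:
    """Ask an LLM judge which response the given objective prefers.

    Returns "text_a", "text_b", or None if the judge deems the objective
    irrelevant to this pair. `query_llm` is a placeholder for any LLM client
    that maps a prompt string to a response string.
    """
    prompt = (
        f"Objective: {objective}\n\n"
        f"Response A:\n{text_a}\n\n"
        f"Response B:\n{text_b}\n\n"
        "If the objective does not apply to this pair, answer IRRELEVANT.\n"
        "Otherwise answer A or B, picking the response the objective prefers."
    )
    answer = query_llm(prompt).strip().upper()
    if answer.startswith("IRRELEVANT"):
        return None
    return "text_a" if answer.startswith("A") else "text_b"
```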
For each objective, we calculate the following metrics (a computational sketch follows the list):
- Relevance (rel): Proportion of datapoints that the AI annotators deemed the objective relevant to. Ranges from 0 to 1.
- Accuracy (acc): Accuracy of the objective-following AI annotator in reconstructing the original annotations on the datapoints deemed relevant. Ranges from 0 to 1.
- Cohen's kappa (kappa): Measures agreement beyond chance between the objective-following AI annotator and the original preferences. Calculated as kappa = 2 × (acc - 0.5), using 0.5 as the expected agreement by chance for a binary choice (which holds for our annotator setup). Unlike the strength metric, it does not take relevance into account. Ranges from -1 (perfect disagreement) through 0 (random agreement) to 1 (perfect agreement).
- Strength of objective (strength): Combines Cohen's kappa and relevance; ranges from -1 to 1. Calculated as strength = kappa × relevance, which equals 2 × (acc - 0.5) × relevance. A value of 0 indicates no predictive performance (due to either random prediction or low relevance), values below 0 indicate the objective-following AI annotator is worse than a random annotator, and values above 0 indicate it is better than a random annotator.
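To make the definitions concrete, here is a small sketch that computes rel, acc, kappa, and strength from per-datapoint outputs. The input format (a list of annotator judgments with None marking irrelevant datapoints, aligned with the original preference labels) is an assumption for illustration.

```python
def objective_metrics(judgments: list[str | None], originals: list[str]) -> dict:
    """Per-objective metrics; `judgments` are the objective-following annotator's
    choices (None = objective deemed irrelevant), `originals` the original
    preference labels, aligned by index."""
    relevant = [(j, o) for j, o in zip(judgments, originals) if j is not None]
    rel = len(relevant) / len(judgments) if judgments else 0.0
    # Default to chance-level accuracy when no datapoint is relevant.
    acc = sum(j == o for j, o in relevant) / len(relevant) if relevant else 0.5
    kappa = 2 * (acc - 0.5)   # expected agreement by chance is 0.5 (binary choice)
    strength = kappa * rel    # = 2 * (acc - 0.5) * rel
    return {"rel": rel, "acc": acc, "kappa": kappa, "strength": strength}

# Example: 3 of 4 datapoints relevant, annotator correct on all 3:
# rel = 0.75, acc = 1.0, kappa = 1.0, strength = 0.75.
```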