Configuration

Some configuration options (grouping by column, selecting annotators) are only available when a single dataset is selected.

Results

Overall metrics

Annotation metrics

Basic statistics: We calculate the following metrics per dataset (a code sketch illustrating these calculations follows the list):

  • Number of preference pairs
  • Proportion of datapoints that prefer the first response (Prop preferring text_a)
  • Average length of first response (Avg len text_a)
  • Average length of second response (Avg len text_b)
  • Average length of preferred response (Avg len pref. text)
  • Average length of rejected response (Avg len rej. text)
  • Proportion of datapoints preferring longer text (Prop preferring longer text)
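
As a rough illustration, the Python sketch below computes these statistics from a list of preference pairs. The field names (text_a, text_b, preferred) and the character-based length measure are assumptions for illustration, not necessarily the app's actual data schema.

from statistics import mean

def basic_statistics(pairs: list[dict]) -> dict:
    """Each pair: {"text_a": str, "text_b": str, "preferred": "text_a" or "text_b"}."""
    len_a = [len(p["text_a"]) for p in pairs]
    len_b = [len(p["text_b"]) for p in pairs]
    len_pref = [len(p[p["preferred"]]) for p in pairs]
    len_rej = [len(p["text_b"] if p["preferred"] == "text_a" else p["text_a"]) for p in pairs]
    return {
        "num_pairs": len(pairs),
        "prop_preferring_text_a": mean(p["preferred"] == "text_a" for p in pairs),
        "avg_len_text_a": mean(len_a),
        "avg_len_text_b": mean(len_b),
        "avg_len_pref_text": mean(len_pref),
        "avg_len_rej_text": mean(len_rej),
        # ties in length count as not preferring the longer text
        "prop_preferring_longer_text": mean(lp > lr for lp, lr in zip(len_pref, len_rej)),
    }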

Per-objective metrics: We test what objectives are implicitly encoded in the annotations (e.g. "a response in list format is preferred"). How much an objective is encoded in the annotations is measured by using an objective-following AI annotator (LLM-as-a-Judge) and checking how well it can reconstruct the original annotations. In the online interface we use objectives that are adapted from principles generated by Inverse Constitutional AI (ICAI) and from the literature on model biases (including VibeCheck). The ICAI pipeline powers our annotators.
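
To make this reconstruction check concrete, here is a hypothetical sketch of the per-datapoint judgment step. judge_with_objective is a placeholder for an LLM-as-a-Judge call and is not part of the Feedback Forensics or ICAI API.

from typing import Literal

Judgment = Literal["text_a", "text_b", "not_relevant"]

def judge_with_objective(objective: str, text_a: str, text_b: str) -> Judgment:
    """Placeholder: prompt an LLM to pick the response that better satisfies
    the objective, or to report that the objective does not apply."""
    raise NotImplementedError("would call an LLM-as-a-Judge in practice")

def annotate_with_objective(objective: str, pairs: list[dict]) -> list[Judgment]:
    """One judgment per preference pair; these are later compared against the
    original human annotations to compute the metrics listed below."""
    return [judge_with_objective(objective, p["text_a"], p["text_b"]) for p in pairs]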

For each objective, we calculate the following metrics (a sketch combining them follows the list):

  • Relevance (rel): Proportion of datapoints that AI annotators deemed the objective relevant to. Ranges from 0 to 1.

  • Accuracy (acc): Accuracy of the objective-following AI annotator in reconstructing the original annotations, computed over the datapoints deemed relevant. Ranges from 0 to 1.

  • Cohen's kappa (kappa): Measures agreement beyond chance between the objective-following AI annotator and the original preferences. Calculated as kappa = 2 × (acc - 0.5), using 0.5 as the expected agreement by chance for a binary choice (which holds for our annotator setup). Unlike the strength metric, it doesn't take relevance into account. Ranges from -1 (perfect disagreement) through 0 (random agreement) to 1 (perfect agreement).

  • Strength of objective (strength): Combines Cohen's kappa and relevance; ranges from -1 to 1. Calculated as strength = kappa × relevance, which equals 2 × (acc - 0.5) × relevance. A value of 0 indicates no predictive performance (either due to random prediction or low relevance), values below 0 indicate the objective-following AI annotator performs worse than a random annotator, and values above 0 indicate it performs better than a random annotator.
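
A minimal sketch of how these four metrics fit together, assuming the per-datapoint judgment format from the earlier sketch ("text_a", "text_b" or "not_relevant"); this is illustrative, not the app's implementation.

def objective_metrics(judgments: list[str], preferences: list[str]) -> dict:
    """judgments[i] is "text_a", "text_b" or "not_relevant";
    preferences[i] is the original human choice ("text_a" or "text_b")."""
    relevant = [(j, p) for j, p in zip(judgments, preferences) if j != "not_relevant"]
    rel = len(relevant) / len(judgments) if judgments else 0.0
    # acc is only defined over relevant datapoints; fall back to chance level (0.5)
    acc = sum(j == p for j, p in relevant) / len(relevant) if relevant else 0.5
    kappa = 2 * (acc - 0.5)   # agreement beyond chance for a binary choice
    strength = kappa * rel    # = 2 × (acc - 0.5) × relevance
    return {"rel": rel, "acc": acc, "kappa": kappa, "strength": strength}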

Feedback Forensics app v0.2.1