
Configuration
Chatbot Arena
- 10k subsample of the 100k Chatbot Arena dataset released alongside the Arena Explorer work: crowdsourced human annotations in English, collected between June and August 2024.
- Source: https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k
AlpacaEval
- 648 cross-annotated human preference pairs used to validate AlpacaEval annotators.
- Source: https://huggingface.co/datasets/tatsu-lab/alpaca_eval/
PRISM
- ~8k human preference pairs from the PRISM dataset, focused on controversial topics and including extensive annotator information. Originally four-way annotations, subsampled by drawing one of the three rejected responses to obtain pairwise preferences (see the sketch below this entry).
- Source: https://huggingface.co/datasets/HannahRoseKirk/prism-alignment
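The following is a minimal, hypothetical sketch of the four-way-to-pairwise conversion described above; the field names and seed handling are illustrative assumptions, not the actual PRISM schema or our exact pipeline.

```python
import random

def to_pairwise(chosen: str, rejected: list[str], seed: int | None = None) -> dict:
    """Turn one four-way annotation (1 chosen, 3 rejected responses)
    into a single pairwise preference by sampling one rejected response."""
    rng = random.Random(seed)
    rejected_sample = rng.choice(rejected)  # pick 1 of the 3 rejected responses
    # Randomise which side the preferred response appears on to avoid a
    # positional bias in the resulting pairs.
    if rng.random() < 0.5:
        return {"text_a": chosen, "text_b": rejected_sample, "preferred_text": "text_a"}
    return {"text_a": rejected_sample, "text_b": chosen, "preferred_text": "text_b"}
```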
Anthropic helpful
- 5k subsample of human preference pairs favouring helpful responses, from Anthropic's RLHF dataset.
- Source: https://github.com/anthropics/hh-rlhf
Anthropic harmless
- 5k subsample of human preference pairs favouring harmless responses, from Anthropic's RLHF dataset.
- Source: https://github.com/anthropics/hh-rlhf
OLMo-2 0325 pref-mix
- 10k preference pairs randomly subsampled from the original 378k pairs used for fine-tuning Ai2's OLMo 2 model; synthetically generated via multiple different pipelines.
- Source: https://huggingface.co/datasets/allenai/olmo-2-0325-32b-preference-mix
MultiPref
- 10k preference pairs annotated by 4 human annotators, as well as GPT-4-based AI annotators.
- Source: https://huggingface.co/datasets/allenai/multipref
Arena (special)
- Arena results for Llama-4-Maverick-03-26-Experimental, combined with those for the public-weights version of Llama-4-Maverick.
- Source: https://huggingface.co/spaces/lmarena-ai/Llama-4-Maverick-03-26-Experimental_battles/tree/main/data
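As a rough illustration, any of the sources above can be loaded and subsampled with the Hugging Face datasets library. The sketch below uses the Chatbot Arena source; the split name, seed, and subsample size are assumptions, not the exact settings used here.

```python
from datasets import load_dataset

# Load the Chatbot Arena source (see URL above); the "train" split is assumed.
ds = load_dataset("lmarena-ai/arena-human-preference-100k", split="train")

# Draw a reproducible 10k random subsample.
subsample = ds.shuffle(seed=42).select(range(min(10_000, len(ds))))
print(len(subsample))
```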
Create separate results for data subsets grouped by this column's values. If no column is selected, the entire original dataset will be analyzed.
Results
Overall metrics
Annotation metrics
Basic statistics: We calculate the following metrics per dataset:
- Number of preference pairs
- Proportion of datapoints that prefer the first response (Prop preferring text_a)
- Average length of first response (Avg len text_a)
- Average length of second response (Avg len text_b)
- Average length of preferred response (Avg len pref. text)
- Average length of rejected response (Avg len rej. text)
- Proportion of datapoints preferring longer text (Prop preferring longer text)
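A minimal sketch of how these statistics could be computed from a pairwise-preference table is shown below; the column names (text_a, text_b, preferred_text) are assumptions for illustration, not the app's exact schema.

```python
import pandas as pd

def basic_statistics(df: pd.DataFrame) -> dict:
    """Compute the basic per-dataset statistics listed above."""
    len_a = df["text_a"].str.len()
    len_b = df["text_b"].str.len()
    prefers_a = df["preferred_text"] == "text_a"

    len_pref = len_a.where(prefers_a, len_b)  # length of the preferred response
    len_rej = len_b.where(prefers_a, len_a)   # length of the rejected response

    return {
        "Number of preference pairs": len(df),
        "Prop preferring text_a": prefers_a.mean(),
        "Avg len text_a": len_a.mean(),
        "Avg len text_b": len_b.mean(),
        "Avg len pref. text": len_pref.mean(),
        "Avg len rej. text": len_rej.mean(),
        "Prop preferring longer text": (len_pref > len_rej).mean(),
    }
```

When a grouping column is selected in the Configuration section, the same statistics would be computed per subset, e.g. via df.groupby(column) under the assumed schema.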
Per-objective metrics: We test which objectives are implicitly encoded in the annotations (e.g. "a response in list format is preferred"). How strongly an objective is encoded is measured by running an objective-following AI annotator (LLM-as-a-Judge) and checking how well it can reconstruct the original annotations. In the online interface we use objectives adapted from principles generated by Inverse Constitutional AI (ICAI) and from the literature on model biases (including VibeCheck). The ICAI pipeline powers our annotators.
For each objective, we calculate the following metrics:
- Relevance (rel): Proportion of datapoints that AI annotators deemed the objective relevant to. Ranges from 0 to 1.
- Accuracy (acc): Accuracy of the objective-following AI annotator in reconstructing the original annotations on datapoints deemed relevant. Ranges from 0 to 1.
- Cohen's kappa (kappa): Measures agreement beyond chance between the objective-following AI annotator and the original preferences. Calculated as kappa = 2 × (acc − 0.5), using 0.5 as the expected agreement by chance for a binary choice (which holds for our annotator setup). Unlike the strength metric, it does not take relevance into account. Ranges from -1 (perfect disagreement) through 0 (random agreement) to 1 (perfect agreement).
- Strength of objective (strength): Combines Cohen's kappa and relevance; ranges from -1 to 1. Calculated as strength = kappa × relevance, which equals 2 × (acc − 0.5) × relevance. A value of 0 indicates no predictive performance (either due to random prediction or low relevance), values below 0 indicate the objective-following AI annotator is worse than a random annotator, and values above 0 indicate it is better than a random annotator.
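Under the definitions above, the metrics combine as in the following minimal sketch; the per-datapoint boolean inputs are illustrative assumptions, not the app's actual data structures.

```python
def per_objective_metrics(relevant: list[bool], agrees: list[bool]) -> dict:
    """relevant[i]: the AI annotator deemed the objective relevant to datapoint i.
    agrees[i]: the objective-following annotator matched the original preference."""
    n = len(relevant)
    n_rel = sum(relevant)
    n_correct = sum(r and a for r, a in zip(relevant, agrees))

    rel = n_rel / n if n else 0.0
    acc = n_correct / n_rel if n_rel else 0.5   # accuracy among relevant datapoints
    kappa = 2 * (acc - 0.5)                     # chance agreement = 0.5 for a binary choice
    strength = kappa * rel                      # = 2 * (acc - 0.5) * rel
    return {"rel": rel, "acc": acc, "kappa": kappa, "strength": strength}
```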