Tag: eval

Label some data, align LLM-evaluators, and run the eval harness with each change.

23 Nov 2025 · 9 min · eval engineering production

Evaluation metrics, how to build eval datasets, eval methodology, and a review of several benchmarks.

22 Jun 2025 · 28 min · llm eval survey

Applying the scientific method, building via eval-driven development, and monitoring AI output.

20 Apr 2025 · 5 min · eval llm engineering

Look at and label your data, build and evaluate your LLM-evaluator, and optimize it against your labels.

27 Oct 2024 · 14 min · llm eval learning 🛠 🩷

Being a human judge at the Weights & Biases LLM-as-a-Judge Hackathon.

22 Sep 2024 · 2 min · llm eval

Use cases, techniques, alignment, finetuning, and critiques of LLM-evaluators.

18 Aug 2024 · 49 min · llm eval production survey 🔥

Evals for classification, summarization, translation, copyright regurgitation, and toxicity.

31 Mar 2024 · 33 min · llm eval survey

How to use open-source, permissive-use data and collect fewer labeled samples for our tasks.

Reference, context, and preference-based metrics, self-consistency, and catching hallucinations.

03 Sep 2023 · 23 min · llm eval survey

Thinking about recsys as interventional vs. observational, and inverse propensity scoring.

10 Apr 2022 · 8 min · recsys eval machinelearning

eugeneyan