eval (10)
Label some data, align LLM-evaluators, and run the eval harness with each change.
23 Nov 2025  ·  9 min  ·  eval engineering production
Evaluation metrics, how to build eval datasets, eval methodology, and a review of several benchmarks.
22 Jun 2025  ·  28 min  ·  llm eval survey
Applying the scientific method, building via eval-driven development, and monitoring AI output.
20 Apr 2025  ·  5 min  ·  eval llm engineering
Look at and label your data, build and evaluate your LLM-evaluator, and optimize it against your labels.
27 Oct 2024  ·  14 min  ·  llm eval learning
Being a human judge at the Weights & Biases LLM-as-a-Judge Hackathon
Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.
18 Aug 2024  ·  49 min  ·  llm eval production survey
Evals for classification, summarization, translation, copyright regurgitation, and toxicity.
31 Mar 2024  ·  33 min  ·  llm eval survey
How to use open-source, permissive-use data and collect fewer labeled samples for our tasks.
05 Nov 2023  ·  12 min  ·  llm eval machinelearning python
Reference, context, and preference-based metrics, self-consistency, and catching hallucinations.
03 Sep 2023  ·  23 min  ·  llm eval survey
Thinking about recsys as interventional vs. observational, and inverse propensity scoring.
10 Apr 2022  ·  8 min  ·  recsys eval machinelearning