Weights & Biases LLM-Evaluator Hackathon - Hackathon Judge

[ llm eval ] · 2 min read

This weekend, I had the opportunity to judge the Weights & Biases LLM-Judge Hackathon. Over two days, more than 100 people took part, and 15 teams demoed their work on day two. The teams built creative and practical projects such as constructing and validating knowledge graphs from documents, evaluating LLMs on MBTI traits and creativity, optimizing evaluation prompts, evaluating multi-turn conversations, and more.

I was invited to kick off the hackathon with a short talk, and took the chance to discuss:

  • Things to consider when using LLM-evaluators: What is our baseline? How will LLM-evaluators score responses? What metrics should we evaluate LLM-evaluators on?
  • A decision tree to decide on scoring methods, metrics, and evaluator vs. guardrail
  • Open questions on LLM-evaluator performance, alignment, and integration
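
To make the metrics question concrete, here is a minimal sketch of one common setup: treat human labels as the reference and score the LLM-evaluator's binary labels against them. The choice of precision, recall, and Cohen's kappa is an assumption for illustration, not a summary of what the talk recommended, and the function name `evaluator_metrics` and the sample labels are hypothetical.

```python
from collections import Counter


def evaluator_metrics(human, llm, positive=1):
    """Compare an LLM-evaluator's labels against human reference labels.

    Hypothetical helper for illustration: `human` and `llm` are equal-length
    lists of labels, and `positive` is the label treated as the positive class.
    """
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(human)
    pairs = list(zip(human, llm))

    # Confusion counts, with the human label as ground truth.
    tp = sum(h == positive and l == positive for h, l in pairs)
    fp = sum(h != positive and l == positive for h, l in pairs)
    fn = sum(h == positive and l != positive for h, l in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    # Cohen's kappa: agreement corrected for agreement expected by chance.
    observed = sum(h == l for h, l in pairs) / n
    human_counts, llm_counts = Counter(human), Counter(llm)
    labels = set(human) | set(llm)
    expected = sum(human_counts[c] * llm_counts[c] for c in labels) / n**2
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0

    return {"precision": precision, "recall": recall, "kappa": kappa}


if __name__ == "__main__":
    # Made-up labels: 1 = response flagged as a defect, 0 = response is fine.
    human_labels = [1, 1, 1, 0, 0, 0, 1, 0]
    llm_labels = [1, 1, 0, 0, 0, 1, 1, 0]
    print(evaluator_metrics(human_labels, llm_labels))
```

Kappa is included alongside precision and recall because raw agreement can look high when one label dominates. Whether these are the right metrics depends on the baseline and scoring method chosen.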


I was impressed by the level of effort and care that went into the demos, with some teams hacking until 10pm on Saturday night, when they had to be kicked out of the building. Judging from the demos, the teams accomplished A LOT in the span of one and a half days. The top team won Meta Ray-Bans for each member of the team.


Overall, everyone had a great time hacking and giving demos. I also hacked on something of my own and hope to share it soon. Yes, it’s also LLM-evaluator related, focused on the UX/UI with the goal of making labeling and evaluation more effective and fun.


If you found this useful, please cite this write-up as:

Yan, Ziyou. (Sep 2024). Weights & Biases LLM-Evaluator Hackathon - Hackathon Judge. eugeneyan.com. https://eugeneyan.com/speaking/hackathon-judge/.

or

@article{yan2024judge,
  title   = {Weights & Biases LLM-Evaluator Hackathon - Hackathon Judge},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2024},
  month   = {Sep},
  url     = {https://eugeneyan.com/speaking/hackathon-judge/}
}
