After my previous write-up on LLM patterns, I’ve received questions on how to match those patterns to various LLM problems. Thus, in this follow-up, we’ll discuss some potential problems faced when using LLMs and the patterns that help mitigate them.
Before we dive into it, I think it’s helpful to distinguish between external vs. internal LLMs.
External LLMs are models we don’t have full control over. We can’t fine-tune them, are constrained by rate/token limits, and may have concerns with sending them confidential or proprietary data. Nonetheless, they’re currently the SOTA in terms of quality and safety.
Internal LLMs are those we develop and host ourselves. While they may not have the constraints of external LLMs, we incur the cost of developing and hosting these LLMs. Also, we typically build them on top of open LLMs (unless we have a couple million to pre-train our own). Unfortunately, these open LLMs tend to be months, if not years, behind the best third-party provider LLMs.
With regard to patterns, I think it’s useful to understand the role data plays: Is data the main component or a by-product? Or is the pattern minimally related to data?
For example, evals and fine-tuning are tied to data. We can’t do evals without gold labels or references; we can’t fine-tune without demonstration data. In contrast, patterns such as caching, defensive UX, and guardrails have more to do with infra and UI than data.
On the other hand, patterns such as RAG and collecting user feedback lie somewhere in the middle. RAG provides data for in-context learning but needs to be backed by retrieval or search indices. User feedback helps with fine-tuning but needs to be collected via the right UX and refined via analysis and data pipelines.
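To make the "RAG needs a retrieval index" point concrete, here is a minimal sketch of retrieval feeding in-context learning. It assumes a toy in-memory document list and a naive token-overlap score; the names `retrieve` and `build_prompt`, and the prompt wording, are hypothetical and stand in for whatever search index and template you actually use.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents sharing the most tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Place the retrieved documents in the prompt as context for the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

In practice, the overlap score would likely be replaced by BM25 or embedding similarity over a proper index; the shape of the flow (retrieve, then assemble the prompt) stays the same.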
Here are some LLM problems and the patterns that help address them. Since we’ve previously discussed the patterns extensively, we’ll focus on the problems here.
Lack of performance metrics for our specific task. When benchmarking external and internal LLMs, we often want to—rightly or wrongly—summarize performance into a table of numbers. (Note: This is fiendishly difficult, especially for more abstract tasks where there are innumerable good outputs.) And as we tweak our systems via updating prompt templates, fine-tuning models, or improving RAG, we need a way to measure improvements or regressions. Finally, how do we measure if users like or dislike these new LLM-based features, and the impact of our tweaks?
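As a rough illustration of summarizing performance into a table of numbers, here is a minimal eval sketch. It assumes we already have gold references for each input, and it uses exact match and token-level F1 as stand-in metrics; these are my choice for the example, and they only suit tasks with short, well-defined answers. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Average each metric over a set of (prediction, gold reference) pairs."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equal, non-zero numbers of predictions and references")
    n = len(references)
    pairs = list(zip(predictions, references))
    return {
        "exact_match": sum(exact_match(p, r) for p, r in pairs) / n,
        "token_f1": sum(token_f1(p, r) for p, r in pairs) / n,
    }
```

Running `evaluate` on the same references before and after a prompt or model change gives a crude way to spot regressions. For abstract tasks with many good outputs, metrics like these will understate quality.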
External model performing poorly. This can be due to the model not being trained on recent data (e.g., ChatGPT’s knowledge cutoff of Sept 2021) or on proprietary data within your org (e.g., internal code and documents). Other concerns include hallucinations or incorrect responses because the model lacks sufficient or up-to-date context.
Internal model performing poorly. Open LLMs tend to perform poorly on specific tasks in our specific domain out of the box. This can lead to defects such as poor extraction or summarization, non-factual responses, going off-topic, or simply lack of fluency. These LLMs may also return harmful responses that we want to minimize. While the patterns that apply to external LLMs are also relevant, because it’s an internal model, we have a few more tricks at our disposal.
Constraints on external models. These constraints can be technical (e.g., rate limits, latency, models being excessively fine-tuned), legal (e.g., not sending confidential or private user data, copyright over and use of external LLM output), and financial (i.e., high cost of API calls). AFAIK, there are two viable solutions: Either negotiate a contract with your external LLM provider or develop and self-host your own LLMs. I suggest saintly patience for the former and the patterns below for the latter.
Latency exceeds UX requirements. Certain use cases require the entire LLM output to be available within a few hundred milliseconds, including running guardrails on the output. While streaming output helps a ton with the UX, it may not be viable for certain user experiences and interfaces (no, I’m not referring to chat).
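Caching, one of the patterns mentioned earlier, is one way to get the full output within a tight budget when requests repeat. The sketch below assumes an exact-match cache keyed on a normalized prompt; `ResponseCache` and the `generate` callable are hypothetical stand-ins for your own cache layer and model call, and a real system would also need eviction and invalidation.

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """Serve repeated prompts from memory instead of calling the model again."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate  # the slow model call (assumed interface)
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalize case and whitespace so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._generate(prompt)
        self._store[key] = response
        return response
```

Only cache hits are fast here; misses still pay the full model latency, so this helps most when the request distribution has a heavy head.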
Unreliable or unusable model output. I categorize these as syntactic vs. semantic errors. Syntactic errors occur when the model doesn’t adhere to a specific format such as JSON or a specific sentence structure, or the generated code or SQL doesn’t run. Semantic errors occur when the model output is harmful, non-factual, off-topic, or simply incoherent.
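Syntactic errors are the easier of the two to catch in code. As a minimal sketch, assuming the model was asked to return a JSON object with certain keys, a guardrail can reject anything that fails to parse or is missing a key. The function name `parse_json_output` and the example keys are hypothetical.

```python
import json
from typing import Iterable, Optional


def parse_json_output(raw: str, required_keys: Iterable[str]) -> Optional[dict]:
    """Return the parsed object, or None if the output is syntactically unusable."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON (e.g., the model wrapped it in prose)
    if not isinstance(parsed, dict):
        return None  # valid JSON, but not the object we asked for
    if not set(required_keys).issubset(parsed):
        return None  # object is missing at least one required key
    return parsed
```

A `None` result can trigger a retry or a fallback response. Semantic errors (harmful, non-factual, or off-topic output) pass this check untouched and need separate guardrails.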
Customer experience paper cuts. Machine learning models aren’t perfect—they will produce inaccurate output. The same goes for LLMs. Thus, after we’ve built a new LLM product/feature, how do we make it easy for users to explore and increase adoption? Also, how do we acknowledge that errors will happen, mitigate them, and earn trust over time?
Lack of visibility on customer impact. How do we know if our models are helping or hurting? Someone shared an anecdote of running an LLM-based customer support solution in prod for two weeks before discontinuing it: an A/B test showed that losses were 12x higher when using an LLM as a substitute for their support team!
Are there any key problems I’ve missed? Please let me know!
If you found this useful, please cite this write-up as:
Yan, Ziyou. (Aug 2023). How to Match LLM Patterns to Problems. eugeneyan.com. https://eugeneyan.com/writing/llm-problems/.
or
@article{yan2023llm-problems,
title = {How to Match LLM Patterns to Problems},
author = {Yan, Ziyou},
journal = {eugeneyan.com},
year = {2023},
month = {Aug},
url = {https://eugeneyan.com/writing/llm-problems/}
}