Evaluating the quality of outputs produced by LLMs is increasingly challenging due to the complexity of generative tasks and the length of responses. While language model-based evaluation has emerged as a scalable and cost-effective method for assessing these outputs, it comes with its own issues. This blog delves into the problems associated with current evaluation methods, the factors affecting LLM performance, and how we address these challenges.
The "vibe check" approach to evaluating LLMs involves subjective human judgments, often through A/B testing with crowd workers. While this method simulates real-world interactions and helps rank models based on usefulness and harmlessness, it has serious limitations:
Expensive: Human evaluation is costly and time-consuming, requiring coordination with annotators, custom web interfaces, detailed annotation instructions, data analysis, and the logistics of employing crowdworkers.
The recent paper - Human Feedback is not Gold Standard - highlights limitations and potential biases in using human preference scores for LLM evaluation and training, calling for more nuanced and objective approaches to assess model performance.
Impact of Confounders: The study explores the influence of two confounding factors—assertiveness and complexity—on human evaluations. They use instruction-tuned models to generate outputs varying in these dimensions and find that more assertive outputs tend to be perceived as more factually accurate, regardless of their actual factual content. This suggests that human evaluators might be misled by the confidence with which information is presented.
Subjectivity and Bias in Preference Scores: The authors hypothesize that preference scores, which are used to rate the quality of LLM outputs, are subjective and prone to biases. This implies that what one person prefers might not be universally agreed upon and could introduce unintended biases into the evaluation process.
Coverage of Crucial Error Criteria: While preference scores generally cover a wide range of evaluation criteria, they tend to under-represent certain critical aspects, notably factual accuracy. This means that models might be rated favorably even if they produce factually incorrect information, as long as human evaluators prefer the output.
Assertiveness in Model Outputs: The authors present preliminary evidence that training models using human feedback can disproportionately increase the assertiveness of their outputs. This raises concerns about the potential for models to become overly confident in their responses, which could further mislead users!
LLMs can be used to evaluate other LLMs, but this approach comes with its own set of challenges:
Correlation with human judgments: Open-source evaluator LLMs often do not correlate well enough with human judgments or proprietary LLMs, limiting their effectiveness in real-world scenarios.
Affordability: Proprietary models are not always affordable, making them impractical to integrate into evaluation pipelines.
Latency: LLMs can have high response times, which might be impractical for large-scale or real-time evaluation needs.
Compliance: Proprietary LLMs often lack transparency about their training data, raising concerns about fairness and compliance.
Biases in LLMs can have significant implications, affecting how they generate responses and evaluate text. Let's go through some of the research done in this field.
Nepotism Bias
LLM evaluators tend to favor text generated by themselves, in part because they can recognize their own outputs.
Paper - LLM Evaluators Recognize and Favor Their Own Generations
Fallacy Oversight Bias
This bias entails overlooking logical fallacies within arguments. LLMs might accept conclusions without critically evaluating the evidence, potentially propagating flawed or misleading information.
Paper - Humans or LLMs as the Judge? A Study on Judgement Biases
Authority Bias
Attributing greater credibility to statements from perceived authorities, irrespective of the evidence, characterizes this bias. LLMs might uncritically accept expert opinions without adequate scrutiny.
Paper - Humans or LLMs as the Judge? A Study on Judgement Biases
Beauty Bias
LLMs might favor aesthetically pleasing text, potentially overlooking the accuracy or reliability of the content.
Paper - Humans or LLMs as the Judge? A Study on Judgement Biases
Verbosity Bias
LLMs might equate quantity of information with quality, potentially prioritizing verbose text over succinct and accurate content.
Paper - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Positional Bias
LLM judges might favor responses based on where they appear in the prompt (for example, preferring whichever candidate answer is presented first in a pairwise comparison), potentially skewing comparative evaluations.
Paper - Large Language Models are not Fair Evaluators
Attention Bias (for lengthy text)
LLMs can sometimes miss contextual information located in the middle of lengthy text. This bias suggests that models attend more to the beginning and end of a passage, potentially leading to incomplete understanding or interpretation of the text.
Paper - Lost in the Middle: How Language Models Use Long Contexts
Sycophancy
LLM assistants tend to agree with the user even when the user is mistaken. They can change their correct answer to incorrect if they are asked, “Are you sure?”
Paper - Towards Understanding Sycophancy in Language Models
Researchers have tried many approaches to find a reliable way to evaluate the performance of generative models, but all of them suffer from the limitations mentioned earlier.
LLM-Derived Metrics: Developing metrics from LLM generation probabilities.
Ex. Looking for a Needle in a Haystack
Prompting LLMs: Querying existing LLMs with carefully designed evaluation prompts, which offers flexibility and interpretability (see the sketch after this list).
Ex. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Fine-Tuning LLMs: Using labeled evaluation data to fine-tune LLMs, improving their evaluation capabilities.
Ex. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
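To make the prompting-based approach concrete, here is a minimal sketch in the spirit of G-Eval, assuming the official OpenAI Python client. The prompt wording, the gpt-4o-mini model name, and the 1-5 coherence scale are illustrative assumptions, not the paper's exact setup.

```python
# Minimal prompting-based evaluation sketch (G-Eval-style scoring).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating a summary for coherence.
Evaluation steps:
1. Read the source document and the summary.
2. Check whether the summary presents ideas in a logical, well-structured order.
3. Assign a coherence score from 1 (incoherent) to 5 (fully coherent).

Source document:
{document}

Summary:
{summary}

Respond with only the integer score."""


def judge_coherence(document: str, summary: str, model: str = "gpt-4o-mini") -> int:
    """Ask an LLM judge for a 1-5 coherence score."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judgment for repeatability
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(document=document, summary=summary)}],
    )
    return int(response.choices[0].message.content.strip())
```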
What can we learn from some world-class companies? In a recent article, LinkedIn shared its comprehensive strategies to tackle LLM evaluation challenges.
Developing Guidelines
The team faced challenges establishing consistent evaluation guidelines, particularly in areas like job assessment. They needed to ensure that responses were not only factually accurate but also empathetic and helpful. This required meticulous attention to detail to maintain uniformity in annotator scores.
Scaling Annotation
Initially, the team relied on input from various team members for annotation. However, they recognized the need for a more systematic approach. They developed tooling and processes to enable consistent and diverse annotation, allowing them to evaluate up to 500 daily conversations and track metrics like overall quality score, hallucination rate, coherence, and style.
Automatic Evaluation
While manual evaluation provided valuable insights, the team recognized the need for automated evaluation to expedite the iteration process. They built model-based evaluators to estimate key metrics like overall quality, hallucination rate, and coherence. This approach aimed to enable faster experimentation and iteration cycles.
Error Analysis for Model Improvement
The team initially achieved basic functionality but faced challenges improving quality beyond 80%. They underestimated the complexity of detecting and mitigating hallucinations, leading to slower progress in quality improvement. To address this, they segmented errors and focused on fine-tuning LLMs to fix error cases, aiming to improve overall performance and consistency.
This is a great example of developing the right processes, with humans in the loop, to evaluate LLMs reliably.
At Galileo, we adhere to the principle of 'Evaluation First' — every process begins and ends with the thorough evaluation and inspection of your application. In our ongoing research into the capabilities of various LLMs, particularly in detecting hallucinations, we have developed two high-performance methods for evaluating LLMs.
We have been looking for reliable evaluation methods for years. Last year, we released ChainPoll, a pioneering technique that combines Chain-of-Thought prompting with polling to ensure robust and nuanced assessment.
Chain: The power of Chain-of-Thought Prompting
Chain-of-thought (CoT) prompting is a straightforward yet potent strategy for extracting more accurate answers from LLMs. By prompting the LLM to articulate its reasoning step by step before presenting the final answer, ChainPoll elicits more careful, better-grounded judgments.
Consider the analogy of solving a complex problem: Just as humans benefit from reflection before responding, LLMs excel when allowed to process information sequentially. CoT prompts facilitate this reflective process, significantly elevating the quality of generated responses.
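To make this concrete, here is a hypothetical example of how a direct hallucination-detection prompt might be rewritten in CoT style; the wording is illustrative and is not the exact prompt used in ChainPoll.

```python
# A direct judge prompt asks for a verdict immediately.
DIRECT_PROMPT = """Does the following answer contain information that is not
supported by the provided context? Reply "yes" or "no".

Context: {context}
Answer: {answer}"""

# A chain-of-thought judge prompt asks the model to reason step by step first,
# which typically yields more reliable verdicts.
COT_PROMPT = """Does the following answer contain information that is not
supported by the provided context?

Context: {context}
Answer: {answer}

Think step by step: list each claim in the answer, check whether the context
supports it, and finish with a final line of the form "Verdict: yes" or
"Verdict: no"."""
```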
Poll: Leveraging Response Diversity
ChainPoll extends CoT prompting by soliciting multiple, independently generated responses to the same prompt and aggregating these responses. By diversifying the pool of generated arguments, ChainPoll embraces the concept of self-consistency, wherein valid arguments converge towards the correct answer while invalid arguments scatter.
While closely related to self-consistency, ChainPoll introduces several key innovations. Unlike self-consistency, which relies on majority voting to select a single best answer, ChainPoll employs averaging to produce a nuanced score reflective of the LLM's level of certainty.
By capturing the breadth of responses and their associated levels of certainty, ChainPoll transcends simplistic binary judgments, offering a comprehensive evaluation framework adaptable to diverse use cases.
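Below is a minimal sketch of the polling-and-averaging step, assuming the OpenAI Python client and the hypothetical COT_PROMPT shown above; ChainPoll's production implementation differs in its prompts, parsing, and models.

```python
from openai import OpenAI

client = OpenAI()


def chainpoll_score(context: str, answer: str, n: int = 5,
                    model: str = "gpt-4o-mini") -> float:
    """Sample n independent CoT judgments and average the binary verdicts
    into a score in [0, 1] that reflects the judge's level of certainty."""
    response = client.chat.completions.create(
        model=model,
        n=n,              # n independent completions of the same prompt
        temperature=0.7,  # non-zero temperature so the reasoning chains differ
        messages=[{"role": "user",
                   "content": COT_PROMPT.format(context=context, answer=answer)}],
    )
    verdicts = []
    for choice in response.choices:
        text = choice.message.content.lower()
        # "Verdict: yes" means this judgment flagged unsupported content.
        verdicts.append(1.0 if "verdict: yes" in text else 0.0)
    # Averaging (rather than majority voting) yields a graded certainty score.
    return sum(verdicts) / len(verdicts)
```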
Galileo Luna, a suite of Evaluation Foundation Models (EFMs), represents a breakthrough in scalable, cost-effective, and accurate LLM evaluations. They are designed to address enterprise adoption and hallucination mitigation. Here are some key benefits and features.
Accuracy: Galileo Luna has proven to outperform all popular evaluation techniques, including our own ChainPoll methodology. The graph above shows results for the Context Adherence Luna EFM, which detects hallucinations in RAG-based systems. In testing against popular publicly available datasets covering multiple industry verticals, Luna proved to be 18% more accurate than OpenAI's GPT-3.5 at detecting hallucinations. We are seeing similar performance on evaluation tasks such as prompt injection, PII detection, and more.
For many of our customers focused on hallucination prevention, Luna has quickly become the ‘first line of defense’, especially at Fortune 500 scale. While we still recommend humans remain in the loop while working with Galileo, Luna has helped these organizations dramatically improve both the speed and accuracy of their evaluations.
Cost: Luna helps AI teams reduce evaluation cost in two ways. First, Luna replaces costly LLM-based evaluations, which, for some customers, exceed $1M per month at production scale; in our testing, Luna proved 97% cheaper than OpenAI's GPT-3.5 when evaluating production traffic! Second, Luna helps teams reduce their reliance on human-in-the-loop evaluations.
This matters particularly in production: an application receiving just 1 QPS of traffic handles roughly 31 million queries a year. If the AI team behind it used GPT-3.5 (to save costs, instead of GPT-4) as a judge to evaluate model outputs, then even assuming an average input size of 4k tokens, the cost of evaluation alone would be ~$650k for the year. The short calculation below sketches how an estimate like this comes together.
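This back-of-the-envelope calculator reproduces the shape of that estimate; the traffic and token figures come from the example above, while the per-token price is a hypothetical blended rate (real API pricing changes over time).

```python
# Back-of-the-envelope estimate of annual LLM-as-judge evaluation cost.
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

queries_per_second = 1
avg_tokens_per_eval = 4_000       # average input size per evaluation call
price_per_million_tokens = 5.0    # hypothetical blended $/1M tokens (assumption)

queries_per_year = queries_per_second * SECONDS_PER_YEAR      # ~31.5M queries
tokens_per_year = queries_per_year * avg_tokens_per_eval      # ~126B tokens
annual_cost = tokens_per_year / 1_000_000 * price_per_million_tokens

print(f"{queries_per_year:,} queries/year, ~${annual_cost:,.0f} in evaluation cost")
```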
Speed: Luna EFMs have been built to evaluate LLM responses in milliseconds. This allowed us to launch Galileo Protect in early May: a real-time GenAI firewall that intercepts inputs and responses as they occur. Protect requires ultra-low-latency model evaluation, and it would not have been possible without the many innovations we made to reach millisecond latencies with Galileo Luna without compromising evaluation accuracy. In our tests for hallucination detection, Luna proved 11x faster than GPT-3.5!
Diverse Evaluation Metrics: The EFMs are fine-tuned to evaluate various aspects such as toxicity, sexism, context adherence, chunk utilization, and completeness. These metrics are essential for deploying trustworthy AI and preventing hallucinations.
Scalability: The EFMs enable large-scale evaluations, supporting high-demand production use cases like chatbots requiring thousands of evaluations per minute or businesses generating tens of thousands of text outputs daily.
Overcoming issues of human vibe checks: Galileo's EFMs standardize evaluations, making them more systematic and reliable.
Overcoming issues of LLM-as-a-Judge: Galileo's EFMs, fine-tuned for specific evaluation tasks, offer a precise and cost-effective alternative.
By combining the innovative ChainPoll technique with the comprehensive Galileo Luna suite, Galileo provides a robust framework for LLM evaluation. This framework enables enterprises to scale their AI solutions efficiently, ensuring high-quality outputs and accelerating their path to production.
In conclusion, evaluating LLM outputs is an intricate challenge. Conventional methods such as vibe-check assessments or using GPT-4 as a judge carry significant drawbacks, including bias, cost, and latency. Galileo's ChainPoll and the Luna suite provide robust, cost-efficient evaluation frameworks that mitigate these biases and enhance reliability, making it easier for enterprises to build trustworthy GenAI applications and accelerate their path to production.
To learn more about the Luna family of Evaluation Foundation Models and see them in action, join us on June 18th for a live webinar or contact us for further information.