Large language models (LLMs) have shown tremendous capabilities, ranging from text summarization and classification to more complex tasks like code generation. However, we still lack a clear understanding of how to evaluate these models holistically. Traditional benchmarks tend to fall short: LLMs can handle broader and more advanced tasks than those benchmarks were designed for, and evaluation should move closer to real-world scenarios, since tasks that seemed impossible a few years ago have since been surpassed. Evaluating LLMs not only helps us understand their strengths; it also reveals their weaknesses and supports efforts toward interpretability.
LLMs are capable of many things and are not restricted to one specific task, especially with zero- and few-shot learning, which removes the need to fine-tune a model and lets it pick up latent concepts and tasks on the fly. However, things get more complicated when a task is complex and demands deep reasoning; evaluating LLMs sheds light on exactly these difficulties.
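To make the idea concrete, here is a minimal sketch of few-shot prompting: the task is specified entirely through in-context examples, with no fine-tuning. The sentiment-classification task, the example reviews, and the prompt template are all illustrative assumptions, not taken from the text.

```python
# Illustrative in-context examples for a hypothetical sentiment task.
examples = [
    ("The food was cold and the staff ignored us.", "negative"),
    ("Absolutely loved the atmosphere and the service.", "positive"),
]

def build_few_shot_prompt(query: str) -> str:
    """Assemble a prompt that teaches the task via in-context examples."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    # The model is expected to complete the final "Sentiment:" line.
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt("Great value for the price.")
print(prompt)
```

The resulting string would be sent to the model as-is; the examples do the work that fine-tuning would otherwise do.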
Additionally, companies are starting to integrate LLMs into their workflows. They need a way to measure whether a model, once updated, continues to perform well on the target task, so they do not end up in an unmanageable situation. Manual testing alone is rarely sufficient; a hybrid approach that combines automated checks with human review is usually the way to go.
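A minimal sketch of such a hybrid loop, under stated assumptions: a fixed regression suite is run automatically after each model update, and only failing cases are escalated for manual review. `call_model` is a hypothetical placeholder for whichever API you actually use, and the suite items are illustrative.

```python
def call_model(prompt: str) -> str:
    # Placeholder: replace with a real call to your deployed model's API.
    return "Paris is the capital of France."

# (prompt, substring the answer is expected to contain) -- illustrative cases.
REGRESSION_SUITE = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

def run_suite():
    """Run automated checks; return failing cases for human review."""
    needs_review = []
    for prompt, expected in REGRESSION_SUITE:
        answer = call_model(prompt)
        if expected not in answer:
            needs_review.append((prompt, answer))
    return needs_review

flagged = run_suite()
print(f"{len(flagged)} of {len(REGRESSION_SUITE)} cases flagged for manual review")
```

The substring check is deliberately crude; in practice the automated layer might use semantic similarity or an LLM judge, but the division of labor (machines filter, humans adjudicate) stays the same.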
Evaluating traditional ML models is relatively straightforward, because those models were trained for individual or narrow-scope tasks. LLMs force us to rethink how to evaluate models effectively and assess all the risks, especially now that these models are being deployed at a scale few would have thought possible a few years ago.
ChatGPT-like models suffer from a well-known phenomenon: hallucination. The model generates content that does not exist and steps outside the knowledge it was supposed to learn. The root cause is still being researched, but contributing factors include the enormous amount of training data the model is exposed to and the high expectations we humans place on it. Imagine the consequences in the medical domain: we cannot blindly integrate such a model without further exploring its weaknesses and making it more interpretable.
One critical concern about LLMs is data leakage: current benchmarks may contain data the model has already been trained on, which is also referred to as data contamination. In that case, evaluation metrics are not entirely fair, as the model can rely on its implicit memory instead of generalising.
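A rough sketch of how contamination is often screened for: measure n-gram overlap between a benchmark item and training documents, flagging items whose n-grams largely reappear verbatim in the corpus. The tokenizer (whitespace split), the choice of n, and the sample strings below are all illustrative assumptions.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    doc_grams = ngrams(training_doc, n)
    return len(item_grams & doc_grams) / len(item_grams)

# Illustrative strings: the benchmark item appears verbatim in the "training" doc.
doc = "the quick brown fox jumps over the lazy dog near the river bank today"
item = "the quick brown fox jumps over the lazy dog"
print(overlap_ratio(item, doc, n=5))  # high overlap suggests possible contamination
```

Real contamination audits work over billions of tokens and use hashed n-grams or suffix arrays for efficiency, but the underlying signal is the same: verbatim overlap between evaluation data and training data.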
LLMs attempt to encode world knowledge implicitly into billions of parameters. Can you imagine how much time it would take a human to master even a single domain?
To evaluate LLMs, several questions arise.
One project worth looking into is LLM-eval-survey, which consolidates resources and benchmarks for LLM evaluation.
As we place significant expectations on these models, our evaluation metrics should reflect that. Anyone considering integrating these models into their toolkit, whatever the domain, should be aware that without a clear understanding of how a model is evaluated, they may head in the wrong direction. This is particularly crucial when you do not own the LLM itself and only access it through an API.
Additional Sources and Recommended Reading: