Large language models (LLMs) have shown tremendous capabilities, ranging from text summarization and classification to more complex tasks like code generation. However, we still lack a clear understanding of how to evaluate these models holistically. Traditional benchmarks tend to fall short: LLMs can handle broader and more advanced tasks than those benchmarks were designed for, and evaluation should move closer to real-world scenarios, since tasks that seemed impossible a few years ago have since been surpassed. Evaluating LLMs not only helps us understand their strengths; it also reveals their weaknesses and supports efforts toward interpretability.
LLMs are capable of many things and are not restricted to one specific task, especially with zero- and few-shot learning, which removes the need to fine-tune a model and lets it pick up latent concepts and tasks on the fly. However, things get more complicated when a task is complex and demands deep reasoning; evaluating LLMs sheds light on exactly these difficulties.
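To make the idea concrete, here is a minimal sketch of few-shot prompting: the task is specified entirely through in-context examples, with no fine-tuning. The sentiment-classification task, the example reviews, and the prompt template are all illustrative assumptions, not taken from the text.

```python
# Illustrative in-context examples for a hypothetical sentiment task.
examples = [
    ("The food was cold and the staff ignored us.", "negative"),
    ("Absolutely loved the atmosphere and the service.", "positive"),
]

def build_few_shot_prompt(query: str) -> str:
    """Assemble a prompt that teaches the task via in-context examples."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    # The model is expected to complete the final "Sentiment:" line.
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt("Great value for the price.")
print(prompt)
```

The resulting string would be sent to the model as-is; the examples do the work that fine-tuning would otherwise do.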
Additionally, companies are starting to integrate LLMs into their workflows. They need a way to measure whether a model, once updated, continues to perform well on the target task, so they do not end up in an unmanageable situation. Manual testing alone is rarely sufficient; a hybrid approach that combines automated checks with human review is usually the way to go.
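A minimal sketch of such a hybrid loop, under stated assumptions: a fixed regression suite is run automatically after each model update, and only failing cases are escalated for manual review. `call_model` is a hypothetical placeholder for whichever API you actually use, and the suite items are illustrative.

```python
def call_model(prompt: str) -> str:
    # Placeholder: replace with a real call to your deployed model's API.
    return "Paris is the capital of France."

# (prompt, substring the answer is expected to contain) -- illustrative cases.
REGRESSION_SUITE = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

def run_suite():
    """Run automated checks; return failing cases for human review."""
    needs_review = []
    for prompt, expected in REGRESSION_SUITE:
        answer = call_model(prompt)
        if expected not in answer:
            needs_review.append((prompt, answer))
    return needs_review

flagged = run_suite()
print(f"{len(flagged)} of {len(REGRESSION_SUITE)} cases flagged for manual review")
```

The substring check is deliberately crude; in practice the automated layer might use semantic similarity or an LLM judge, but the division of labor (machines filter, humans adjudicate) stays the same.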
Evaluating traditional ML models is relatively straightforward, because those models were trained for individual or narrow-scope tasks. LLMs force us to rethink how to evaluate models effectively and assess all the risks, especially now that these models are being deployed at a scale few would have thought possible a few years ago.
ChatGPT-like models suffer from a well-known phenomenon: hallucination. The model generates content that does not exist and steps outside the knowledge it was supposed to learn. The root cause is still being researched, but contributing factors include the enormous amount of training data the model is exposed to and the high expectations we humans place on it. Imagine the consequences in the medical domain: we cannot blindly integrate such a model without further exploring its weaknesses and making it more interpretable.
One critical concern about LLMs is data leakage: current benchmarks may contain data the model has already been trained on, which is also referred to as data contamination. In that case, evaluation metrics are not entirely fair, as the model can rely on its implicit memory instead of generalising.
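A rough sketch of how contamination is often screened for: measure n-gram overlap between a benchmark item and training documents, flagging items whose n-grams largely reappear verbatim in the corpus. The tokenizer (whitespace split), the choice of n, and the sample strings below are all illustrative assumptions.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    doc_grams = ngrams(training_doc, n)
    return len(item_grams & doc_grams) / len(item_grams)

# Illustrative strings: the benchmark item appears verbatim in the "training" doc.
doc = "the quick brown fox jumps over the lazy dog near the river bank today"
item = "the quick brown fox jumps over the lazy dog"
print(overlap_ratio(item, doc, n=5))  # high overlap suggests possible contamination
```

Real contamination audits work over billions of tokens and use hashed n-grams or suffix arrays for efficiency, but the underlying signal is the same: verbatim overlap between evaluation data and training data.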
LLMs attempt to encode world knowledge implicitly into billions of parameters. Can you imagine how much time it would take a human to master even a single domain?
To evaluate LLMs, several questions arise.
One project worth looking into is LLM-eval-survey, which consolidates resources and benchmarks for LLM evaluation.
As we place significant expectations on these models, our evaluation metrics should reflect that. Anyone considering integrating these models into their toolkit, whatever the domain, should be aware that without a clear understanding of how a model is evaluated, they may head in the wrong direction. This is particularly crucial when you do not own the LLM itself and only access it through an API.
Additional Sources and Recommended Reading: