Ongoing developments in the field of artificial intelligence, especially in the language capabilities of AI, will affect various aspects of our lives in different ways. We cannot predict all the uses that will be made of technologies such as Large Language Models (LLMs), nor all the consequences of their use. But we can reasonably say that the effects will be significant, and we can reasonably fear that some of these effects will be harmful. Such concern is made even more reasonable by the fact that it is not just the consequences of LLMs of which we are ignorant; there is a lot we don’t know about what LLMs can do, how they do it, and to what extent. Given this ignorance, it is hard to believe that we are ready for the changes we have set in motion.
By now, many Daily Nous readers will have at least heard of GPT-3 (recall “Philosophers on GPT-3,” as well as this discussion and this one regarding its impact on teaching). But GPT-3 (still being upgraded) is just one of dozens of LLMs currently in existence (and GPT-4 is rumored to be released within the next few months).
Advances in this technology have prompted some researchers to begin to tackle our ignorance about it and produce the kind of knowledge that will be crucial to understanding it and determining the standards for its use. A prime example of this is work recently published by a large team of researchers from the Institute for Human-Centered Artificial Intelligence at Stanford University. Their project, “Holistic Evaluation of Language Models” (HELM), benchmarks 30 LLMs.
One of the objectives of benchmarking is transparency. As the team writes in a summary of their paper:
We need to know what this technology can and cannot do and what risks it poses, so that we can have both a deeper scientific understanding and a fuller account of its societal impact. Transparency is the essential first step towards both of these goals. But the AI community lacks the necessary transparency: many language models exist, but they are not compared on a unified standard, and even when language models are evaluated, the full range of societal considerations (e.g., fairness, robustness, uncertainty estimation, common sense knowledge, misinformation) has not been addressed in a unified way.
The paper documents the results of a substantial amount of work by 50 scholars to articulate and apply a set of standards to the ever-growing range of LLMs. Here is an excerpt from the abstract of the article:
We introduce holistic evaluation of language models (HELM) to improve the transparency of language models.
First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what is missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness).
Second, we take a multimetric approach: we measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of the 16 main scenarios whenever possible (87.5% of the time), ensuring that metrics beyond accuracy do not fall by the wayside, and that trade-offs between models and metrics are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to further analyze specific aspects (e.g. knowledge, reasoning, memorization/copyright, misinformation).
Third, we perform a large-scale evaluation of 30 important language models (covering open, restricted, and closed models) across all 42 scenarios, including 21 scenarios that were not previously used in mainstream LM evaluation. Before HELM, models were evaluated on average on only 17.9% of the main HELM scenarios, with some important models sharing no common scenario. We improve this to 96.0%: now all 30 models have been densely benchmarked on a set of core scenarios and metrics under standardized conditions.
Our evaluation surfaces 25 high-level findings regarding the interplay between different scenarios, metrics, and models. For full transparency, we publicly release all prompts and raw model completions for further analysis, along with a general modular toolkit to easily add new scenarios, models, metrics, and prompting strategies. We intend HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
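To make the abstract's structure concrete: the evaluation amounts to a grid of models × scenarios × metrics, with "coverage" being the fraction of model–scenario cells actually measured (the 17.9% vs. 96.0% figures above). The following is a minimal hypothetical sketch of such a grid, not the HELM toolkit itself; all names (`evaluate`, `coverage`, the dummy scorer) are illustrative assumptions.

```python
# Hypothetical sketch of a scenarios x metrics x models benchmark grid.
# This is NOT the HELM toolkit's API; it only illustrates the idea of
# dense, standardized coverage described in the abstract.
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    scenario: str
    metric: str
    score: float

def evaluate(models, scenarios, metrics, run_metric):
    """Run every metric on every (model, scenario) pair where feasible.

    run_metric returns a float score, or None when a combination
    cannot be measured (the abstract's "whenever possible" caveat).
    """
    results = []
    for model in models:
        for scenario in scenarios:
            for metric in metrics:
                score = run_metric(model, scenario, metric)
                if score is not None:
                    results.append(Result(model, scenario, metric, score))
    return results

def coverage(results, models, scenarios):
    """Fraction of (model, scenario) cells with at least one measurement."""
    covered = {(r.model, r.scenario) for r in results}
    return len(covered) / (len(models) * len(scenarios))

# Toy stand-in for a real evaluation harness:
models = ["model-a", "model-b"]
scenarios = ["qa", "summarization"]
metrics = ["accuracy", "robustness", "toxicity"]
results = evaluate(models, scenarios, metrics,
                   lambda m, s, met: 0.5)  # dummy scorer
print(coverage(results, models, scenarios))  # 1.0: every cell measured
```

The point of standardizing on one such grid is that every model is scored on the same cells under the same conditions, so trade-offs (say, accuracy vs. toxicity) are directly comparable across models rather than scattered across incompatible papers.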
One of the authors of this article is Thomas Icard, associate professor of philosophy at Stanford. His work on the HELM project has been primarily related to assessing the reasoning skills of LLMs. One of the things he emphasized about the project is that it aims to be a continuous and democratic evaluation process: “it aspires to democratize the continuous development of the benchmark task suite. In other words, what is reported in the document is only a first attempt to isolate a wide range of important tasks of interest, and it should grow and change over time.”
Philosophical questions in a wide range of fields – philosophy of science, epistemology, logic, philosophy of mind, cognitive science, ethics, social and political philosophy, philosophy of law, aesthetics, philosophy of education, etc. – are raised by the development and use of language models and by efforts (like HELM) to understand them. There’s a lot to work on here, philosophers. While some of you are already on it, it seems to me that there is a discrepancy between, on the one hand, the social significance and philosophical fertility of the subject and, on the other hand, the attention it actually receives from philosophers. I’m curious to hear what others think, and invite philosophers working on these issues to share links to their writings and/or descriptions of their projects in the comments.