Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement

The evaluation of artificial intelligence systems and components is crucial for the progress of the discipline. In this paper we describe and critically assess the different ways AI systems are evaluated, and the role of components and techniques in these systems. We first focus on the traditional task-oriented evaluation approach. We identify three kinds of evaluation: human discrimination, problem benchmarks and peer confrontation. We describe some of the limitations of the many evaluation schemes and competitions in these three categories, and follow the progression of some of these tests. We then focus on a less customary (and challenging) ability-oriented evaluation approach, where a system is characterised by its (cognitive) abilities, rather than by the tasks it is designed to solve. We discuss several possibilities: the adaptation of cognitive tests used for humans and animals, the development of tests derived from algorithmic information theory or more integrated approaches under the perspective of universal psychometrics. We analyse some evaluation tests from AI that are better positioned for an ability-oriented evaluation and discuss how their problems and limitations can possibly be addressed with some of the tools and ideas that appear within the paper. Finally, we enumerate a series of lessons learnt and generic guidelines to be used when an AI evaluation scheme is under consideration.

Notes

Worst-case performance and best-case performance are special cases of a rank-based aggregation (using the cumulative distribution of results), with other possibilities such as the median, the first decile, etc. Rank-based aggregation, especially worst-case performance, is more robust against systems that obtain good scores on many easy problems but perform poorly on the difficult ones.
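
As a hypothetical illustration (not taken from the paper), the following Python sketch reads these aggregations off the empirical distribution of one system's scores over a set of problems; the scores and the quantile choices are assumptions for illustration only.

```python
# Minimal sketch of rank-based aggregation of one system's results over a
# benchmark: worst case, first decile, median and best case are all read off
# the empirical (cumulative) distribution of scores. Scores are hypothetical.
import numpy as np

scores = np.array([0.91, 0.85, 0.40, 0.78, 0.95, 0.62, 0.88, 0.55, 0.70, 0.99])

def rank_aggregate(scores, q):
    """Score at quantile q of the empirical distribution of results
    (q=0.0: worst case, q=0.5: median, q=1.0: best case)."""
    return float(np.quantile(scores, q))

print("worst case  :", rank_aggregate(scores, 0.0))  # penalises poor results on hard problems
print("first decile:", rank_aggregate(scores, 0.1))  # milder rank-based variant
print("median      :", rank_aggregate(scores, 0.5))
print("best case   :", rank_aggregate(scores, 1.0))
print("mean        :", scores.mean())                # usual non-rank-based aggregation
```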

Note that this formula does not have the size of the instance as a parameter, and hence it is not comparable to the usual view of worst-case analysis of algorithms.

The distinction between white and black box can be enriched to consider those problems where the solution must be accompanied by a verification, proof or explanation (Hernández-Orallo 2000b; Alpcan et al. 2014).

Note that it is not uncommon, as we will see, for the set of problems from M to be chosen by the research team that is evaluating its own method, so the probability of choosing from M can be biased in such a way that the evaluation is actually a best-case one.

Statistical tests are not used to determine when a contestant can be said to be significantly better than another.
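
As an illustration of how such a comparison could be carried out, the sketch below applies a paired Wilcoxon signed-rank test (via SciPy) to the per-problem scores of two contestants; the scores, the choice of test and the 0.05 threshold are assumptions for illustration, not part of any actual competition protocol.

```python
# Hypothetical sketch: testing whether contestant A is significantly better
# than contestant B over the same set of benchmark problems, using a paired
# Wilcoxon signed-rank test. Scores are made up for illustration.
from scipy.stats import wilcoxon

scores_a = [0.91, 0.85, 0.40, 0.78, 0.95, 0.62, 0.88, 0.55]
scores_b = [0.89, 0.80, 0.35, 0.80, 0.90, 0.60, 0.84, 0.50]

stat, p_value = wilcoxon(scores_a, scores_b, alternative="greater")
if p_value < 0.05:
    print(f"A is significantly better than B (p = {p_value:.3f})")
else:
    print(f"No significant difference detected (p = {p_value:.3f})")
```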

The Chinese room argument was introduced by Searle (1980) to argue against the possibility of a machine having a mind: it compares a computer processing inputs and outputs as symbols with a person who knows no Chinese, placed in a room, receiving messages in Chinese that must be answered, also in Chinese, using a series of books to map inputs to outputs. Given the relevance of machine learning in AI nowadays, among other things, the argument has largely faded.

References

Acknowledgments

I thank the organisers of the AEPIA Summer School on Artificial Intelligence, held in September 2014, for giving me the opportunity to give a lecture on ‘AI Evaluation’. This paper was born out of and evolved through that lecture. The information about the many benchmarks and competitions discussed in this paper has been contrasted with information from and discussions with many people: M. Bedia, A. Cangelosi, C. Dimitrakakis, I. García-Varea, Katja Hofmann, W. Langdon, E. Messina, S. Mueller, M. Siebers and C. Soares. Figure 4 is courtesy of F. Martínez-Plumed. Finally, I thank the anonymous reviewers, whose comments have helped to significantly improve the balance and coverage of the paper. This work has been partially supported by the EU (FEDER) and the Spanish MINECO under Grants TIN 2013-45732-C4-1-P, TIN 2015-69175-C4-1-R and by Generalitat Valenciana PROMETEOII2015/013.

Author information

Authors and Affiliations

  1. José Hernández-Orallo, DSIC, Universitat Politècnica de València, Valencia, Spain