We have to tread carefully — photo by @beefchen

Evaluating NLP Models in 2020

Testing an AI system’s ability through spatial, temporal, causal and motivational entities or events in fictional stories

With language models we have to tread carefully. Developing, benchmarking and deploying natural-language processing (NLP) models must be a deliberate process that goes beyond traditional performance and evaluation measures.

  • “Work in computational linguistics is in some cases motivated from a scientific perspective, in that one is trying to provide a computational explanation for a particular linguistic or psycholinguistic phenomenon; in other cases the motivation may be more purely technological, in that one wants to provide a working component of a speech or natural language system.
  • Indeed, the work of computational linguists is incorporated into many working systems today, including speech recognition systems, text-to-speech synthesizers, automated voice response systems, web search engines, text editors, and language instruction materials, to name just a few.”

“Are current methods really enough to achieve the field’s ultimate goals? What even are those goals?”

Here he refers to two datasets:

  • SQuAD, the Stanford Question Answering Dataset.
  • HotpotQA, a question-answering dataset built around multi-step reasoning.

“State of the art” has practically become a proper noun: “We beat SOTA on SQuAD by 2.4 points!”

Not everyone agrees with this approach.

“Do recent ‘advances’ really translate into helping people solve problems?”

This is more than an abstract concern — it has real stakes for society.

To Test Machine Comprehension, Start by Defining Comprehension

He makes a rather witty remark on the current situation:

NLP researchers have been training to become professional sprinters by “glancing around the gym and adopting any exercises that look hard.”

What is healthy?


“One reliable technique is to probe the system’s model of the world, which can leave even the much-ballyhooed GPT-3 babbling about cycloptic blades of grass.”

As a side note, “cycloptic” means approaching or viewing a situation from a single perspective. You may have guessed as much; admittedly I had to Google it.

“…however systems are implemented, if they need to have faithful world models, then evaluations should systematically test whether they have faithful world models.”

This, according to Dunietz, is rarely done.

Existing attempts to go deeper include:

  • Asking questions that rely on multiple reasoning steps.
  • Aggregating many benchmarks.
  • Testing common sense, the focus of other researchers.

Dunietz and his co-authors instead propose probing a story’s world model through four kinds of questions:

  1. Spatial: Where are entities located, and how do they move?
  2. Temporal: What events occur, and when?
  3. Causal: How do events lead mechanistically to other events?
  4. Motivational: Why do the characters decide to take the actions they take?
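To make the idea concrete, the question families above can be turned into a small, systematic probe suite for a single fictional story. Everything in the sketch below — the story, the probe questions, the gold answers, and the `stub_model` stand-in — is my own invented illustration, not material from Dunietz’s paper, and the substring-based scoring is deliberately crude.

```python
# A minimal sketch of a "world model" probe suite for one fictional story.
# Story, questions, answers and model are invented for illustration only.

STORY = (
    "Anna left her flat in Copenhagen at dawn and cycled to the harbour. "
    "Because the bridge was raised, she waited, growing anxious about "
    "missing the ferry she needed to reach her job interview."
)

# One probe per category: spatial, temporal, causal, motivational.
PROBES = {
    "spatial": ("Where does Anna wait?", "at the bridge"),
    "temporal": ("What does Anna do first?", "leaves her flat"),
    "causal": ("Why does Anna have to wait?", "the bridge was raised"),
    "motivational": ("Why is Anna anxious?", "she might miss the ferry"),
}

def stub_model(story: str, question: str) -> str:
    """Stand-in for a real QA model; always returns the same span."""
    return "the bridge was raised"

def evaluate_probes(model) -> dict:
    """Score the model per category with a crude substring match."""
    scores = {}
    for category, (question, gold) in PROBES.items():
        prediction = model(STORY, question).lower()
        scores[category] = gold.lower() in prediction or prediction in gold.lower()
    return scores

print(evaluate_probes(stub_model))
```

A real evaluation would of course use a proper answer-matching metric and many stories, but the point survives the simplification: the stub model “answers” only the causal probe, so a per-category breakdown exposes exactly which facets of the story’s world it fails to track — something a single aggregate accuracy number would hide.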

Written by

AI Policy and Ethics. MSc student in Social Data Science at the University of Copenhagen. All views are my own.
