We have to tread carefully (photo by @beefchen)

Evaluating NLP Models in 2020

Testing an AI system’s ability through spatial, temporal, causal and motivational entities or events in fictional stories

With language models we have to tread carefully. Developing, benchmarking and deploying natural-language processing (NLP) models must be a careful process that looks beyond existing performance and evaluation measures.

Today I wanted to write about an opinion piece in the MIT Technology Review. The story was written on the 31st of July by Jesse Dunietz.

Jesse starts by commenting on the annual meeting of the Association for Computational Linguistics (ACL).

Apparently there was a change in the mood at the conference.

Previously, a technical flavour permeated the papers, research talks and chats.

The conference was in July.

But wait… computational linguistics?

According to the ACL website, this is the definition [I added three bullet points and bold]:

“Computational linguistics is the scientific study of language from a computational perspective. Computational linguists are interested in providing computational models of various kinds of linguistic phenomena.

  • These models may be “knowledge-based” (“hand-crafted”) or “data-driven” (“statistical” or “empirical”).
  • Work in computational linguistics is in some cases motivated from a scientific perspective in that one is trying to provide a computational explanation for a particular linguistic or psycholinguistic phenomenon; and in other cases the motivation may be more purely technological in that one wants to provide a working component of a speech or natural language system.
  • Indeed, the work of computational linguists is incorporated into many working systems today, including speech recognition systems, text-to-speech synthesizers, automated voice response systems, web search engines, text editors, language instruction materials, to name just a few.”

So, Dunietz says this conference felt different.

Conversations were introspective about core methods in natural-language processing (NLP).

NLP is considered the branch of AI focused on systems analysing human language.

Why is that?

Perhaps partly because of a new addition: a “Theme” track.

Papers for the new “Theme” track asked questions like:

“Are current methods really enough to achieve the field’s ultimate goals? What even are those goals?”

In this he refers to two articles.

Dunietz believes the angst is justified.

He works at a firm called Elemental Cognition, based in New York.

So, why should practitioners and theoretical researchers in NLP be angsty?

Maybe for a lack of evaluation, he argues.

He and his company believe the field needs to transform how it evaluates models.

(Quick disclaimer: I work as a consultant at KPMG, and they do a lot of evaluations.)

Dunietz argues the field has progressed, but the reading comprehension of NLP systems has mostly been measured on benchmark data sets.

→ These consist of thousands of questions, each accompanied by passages containing the answer.
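
To make this kind of benchmark concrete, here is a minimal sketch of how such a data set could be scored. Everything in it is an assumption for illustration: the `Example` class, the two toy items and the deliberately shallow `first_capitalised_word` "model" are made up, and the exact-match scoring is only in the spirit of SQuAD-style evaluation, not the official script.

```python
import re
import string
from dataclasses import dataclass


@dataclass
class Example:
    """One benchmark item: a passage, a question about it, and the gold answer."""
    passage: str
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_score(examples, predict) -> float:
    """Percentage of examples where the prediction equals the gold answer."""
    if not examples:
        return 0.0
    hits = sum(
        normalize(predict(ex.passage, ex.question)) == normalize(ex.answer)
        for ex in examples
    )
    return 100.0 * hits / len(examples)


def first_capitalised_word(passage: str, question: str) -> str:
    """Hypothetical 'model': a shallow heuristic that never reads the question."""
    for word in passage.split()[1:]:  # skip the sentence-initial word
        if word[0].isupper():
            return word.strip(string.punctuation)
    return ""


# Two invented items, standing in for the thousands in a real benchmark.
toy_benchmark = [
    Example(
        "The largest planet in our solar system is Jupiter.",
        "Which planet is the largest in our solar system?",
        "Jupiter",
    ),
    Example(
        "Yesterday Anna gave the book to Ben.",
        "Who received the book?",
        "Ben",
    ),
]

score = exact_match_score(toy_benchmark, first_capitalised_word)
print(f"Exact match: {score:.1f}%")
```

The heuristic gets one of the two items right without comprehending anything, which is the kind of spurious pattern a single percentage can hide.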

Deep neural networks swept the field in the mid-2010s.

With them they brought a leap in performance.

New datasets would emerge with even trickier questions [links by Dunietz].

  • The GLUE benchmark.
  • SQuAD, the Stanford Question Answering Dataset.
  • HotpotQA, a question answering dataset.

Progress entails ‘tweaking’ models to get more points. He says:

“State of the art” has practically become a proper noun: “We beat SOTA on SQuAD by 2.4 points!”

Not everyone agrees with this approach.

What Dunietz calls leaderboard-chasing can be troubling.

It can become an academic exercise, sometimes one of exploiting spurious patterns in the data.

He refers to:

“Do recent “advances” really translate into helping people solve problems?”

This is more than abstract — it has stakes for society.

“…when people imagine computers that comprehend language, they envision far more sophisticated behaviors: legal tools that help people analyze their predicaments; research assistants that synthesize information from across the web; robots or game characters that carry out detailed instructions.”

We need to wake up and smell the coffee.

That is, realise the truth about the current situation.

Dunietz is in one of the leading environments for technology, and he argues today’s models are not close to achieving the comprehension needed.

There is a gap between evaluations in papers and real-world ability.

He and his colleagues make this argument in a new paper:

To Test Machine Comprehension, Start by Defining Comprehension

He makes a rather witty remark on the current situation:

NLP researchers have been training to become professional sprinters by “glancing around the gym and adopting any exercises that look hard.”

What is healthy?

What is socially beneficial?

A percentage score cannot directly measure these aspects.

Recently I wrote that GPT-3 (new model from OpenAI) is amazing:

Yet the very same article showed that the model had dangerous limitations, with possible adverse social consequences:

Yes, some researchers or people within the field of AI will not be surprised.

However, people with no prior knowledge or limited understanding of the field of AI may certainly be – that is, most of society.

People in society trust experts, and NLP practitioners can certainly fall in this category.

A human reader builds a “mental model” of the world and can then hypothesise about counterfactual alternatives.

Can this be done?

Automated research assistants and game characters should be able to do this, Dunietz argues.

An NLP researcher can: “…stump a state-of-the-art reading comprehension system within a few tries”

“One reliable technique is to probe the system’s model of the world, which can leave even the much-ballyhooed GPT-3 babbling about cycloptic blades of grass.”

As a side note, cycloptic literally means one-eyed, like a Cyclops, and figuratively means approaching or viewing a situation with a single perspective. You may have assumed this, yet admittedly I had to Google it.

His overall argument is relatively simple:

“…however systems are implemented, if they need to have faithful world models, then evaluations should systematically test whether they have faithful world models.”

This, according to Dunietz, is rarely done.

The Allen Institute for AI has proposed other ways to harden the evaluations:

  • Targeting diverse linguistic structures.
  • Asking questions that rely on multiple reasoning steps.
  • Aggregating many benchmarks.

Other researchers have focused on testing common sense.

Most of these still focus on compiling questions.

Maybe one has to go beyond?

“We’re proposing a more fundamental shift: to construct more meaningful evaluations, NLP researchers should start by thoroughly specifying what a system’s world model should contain to be useful for downstream applications. We call such an account a ‘template of understanding.’”

One example is fictional stories.

These cannot be Googled.

His CEO David Ferrucci has proposed a four-part template for testing an AI system’s ability to understand stories.

  1. Spatial: Where is everything located and how is it positioned throughout the story?
  2. Temporal: What events occur and when?
  3. Causal: How do events lead mechanistically to other events?
  4. Motivational: Why do the characters decide to take the actions they take?
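
As a rough illustration of how the four parts above could guide the writing of test questions, here is a small sketch. It is not Elemental Cognition's implementation: the `Dimension` enum, the `TemplateQuestion` class, the helper functions and the toy story are all hypothetical names and data invented for this example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The four parts of the template, as listed above."""
    SPATIAL = "spatial"
    TEMPORAL = "temporal"
    CAUSAL = "causal"
    MOTIVATIONAL = "motivational"


@dataclass(frozen=True)
class TemplateQuestion:
    """A test question tagged with the part of the template it probes."""
    dimension: Dimension
    question: str


def coverage(questions):
    """Count how many questions probe each dimension, including zeros."""
    counts = Counter(q.dimension for q in questions)
    return {dim: counts.get(dim, 0) for dim in Dimension}


def missing_dimensions(questions):
    """List the dimensions that no question in the set probes."""
    return [dim for dim, n in coverage(questions).items() if n == 0]


# Invented story: "Mia left her umbrella at the cafe. On the walk home
# it started to rain, and she got soaked."
story_questions = [
    TemplateQuestion(Dimension.SPATIAL, "Where was the umbrella when the rain started?"),
    TemplateQuestion(Dimension.TEMPORAL, "Did the rain start before or after Mia left the cafe?"),
    TemplateQuestion(Dimension.CAUSAL, "Why did Mia get soaked?"),
]

gaps = missing_dimensions(story_questions)
print([dim.value for dim in gaps])
```

Here the check reports that nothing probes motivation yet, for instance why Mia chose to walk home. The point of the sketch is that the template comes first and the questions are written to cover it, instead of compiling questions and hoping they cover what matters.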

Personally, I think this is an interesting path to follow in times ahead.

Particularly considering that these systems can be racist or misogynist and can create adverse outcomes.

Consider Black Lives Matter, protests against climate change and the pandemic. Text matters, and we need to think carefully about how these systems are developed, benchmarked and deployed.

This is #500daysofAI and you are reading article 427. I am writing one new article about or related to artificial intelligence every day for 500 days.


