We have to tread carefully (photo by @beefchen)

Evaluating NLP Models in 2020

Testing an AI system’s ability to understand spatial, temporal, causal and motivational entities and events in fictional stories

With language models we have to tread carefully. Developing, benchmarking and deploying natural-language processing (NLP) models must be a considered process that goes beyond traditional performance and evaluation measures.

Today I wanted to write about an opinion piece in the MIT Technology Review, written on the 31st of July by Jesse Dunietz.

Jesse starts by commenting on the annual meeting of the Association for Computational Linguistics (ACL).

Apparently there was a change in the mood at this year’s conference, held in July.

Previously, a technical flavour permeated the papers, research talks and conversations.

But wait… computational linguistics?

According to the ACL website, this is the definition:

So, Dunietz says this conference felt different.

Conversations were introspective about the core methods of NLP.

NLP is the branch of AI focused on systems that analyse human language.

Why is that?

Perhaps partly due to a new addition: there was a ‘Theme’ track.

Papers for the new “Theme” track asked questions like:

“Are current methods really enough to achieve the field’s ultimate goals? What even are those goals?”

Here he refers to two articles.

Dunietz believes the angst is justified.

He works at a firm called Elemental Cognition, based in New York.

So, why should practitioners and theoretical researchers in NLP be angsty?

Maybe because of a lack of proper evaluation, he argues.

He and his company believe the field needs a transformation in evaluation.

Dunietz argues the field has progressed, but comprehension in NLP has mostly been measured with benchmark data sets.

→ These consist of thousands of questions, each accompanied by passages containing the answer.

Deep neural networks swept the field in the mid-2010s.

With them they brought a leap in performance.

New datasets would emerge with even trickier questions [links by Dunietz].

  • The GLUE benchmark.
  • SQuAD, the Stanford Question Answering Dataset.
  • HotpotQA, a question-answering dataset.
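As an aside, the scoring behind such leaderboards is fairly mechanical. Here is a minimal sketch, in Python, of the exact-match and token-overlap F1 metrics that SQuAD-style benchmarks typically report (the function names here are my own, not the official evaluation script's):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """Credit only if the normalized strings are identical."""
    return normalize(prediction) == normalize(gold)

def token_f1(prediction: str, gold: str) -> float:
    """Partial credit: harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A "2.4 points over SOTA" claim is usually an average of numbers like these over thousands of questions — which says little about whether the model understood anything.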

Progress entails ‘tweaking’ models to get more points. He says:

“State of the art” has practically become a proper noun: “We beat SOTA on SQuAD by 2.4 points!”

Not everyone agrees with this approach.

What Dunietz calls leaderboard-chasing can be troubling.

It is an academic exercise, sometimes one of exploiting spurious patterns in the data.

He refers to:

“Do recent “advances” really translate into helping people solve problems?”

This is more than abstract — it has stakes for society.

We need to wake up and smell the coffee.

That is, realise the truth about the current situation.

Dunietz is in one of the leading environments for technology, and he argues today’s models are not close to achieving the comprehension needed.

There is a gap between paper evaluations and real-world ability.

He and his colleagues argue as much in a new paper:

To Test Machine Comprehension, Start by Defining Comprehension

He makes a rather witty remark on the current situation:

NLP researchers have been training to become professional sprinters by “glancing around the gym and adopting any exercises that look hard.”

What is healthy?

What is socially beneficial?

A percentage score cannot readily measure these aspects.

Recently I wrote that GPT-3 (the new model from OpenAI) is amazing:

Yet the very same article showed the model has dangerous limitations with possible adverse social consequences:

Yes, some researchers or people within the field of AI will not be surprised.

However, people with no prior knowledge or limited understanding of the field of AI may certainly be – that is, most of society.

People in society trust experts, and NLP practitioners can certainly fall in this category.

A human reader builds a “mental model” of the world described in a story.

They can then hypothesise about counterfactual alternatives.

Can this be done?

Automated research assistants and game characters should be able to do this, Dunietz argues.

How can an NLP researcher test this? Dunietz writes:

“One reliable technique is to probe the system’s model of the world, which can leave even the much-ballyhooed GPT-3 babbling about cycloptic blades of grass.”

As a side note, “cycloptic” means approaching or viewing a situation from a single perspective. You may have assumed this, yet admittedly I had to Google it.
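Probing a system’s world model can be as simple as asking follow-up questions about a passage and checking the answers against expected facts. A hypothetical sketch of such a probe loop (the `generate` callable stands in for any text-completion model; all names are my own, not Dunietz’s):

```python
def probe_world_model(generate, passage, probes):
    """Ask follow-up questions about a passage and check each answer
    against an expected fact. Returns the fraction of probes passed.

    `generate` is any function mapping a prompt string to a completion
    string (for example, a call to a hosted language model).
    `probes` is a list of (question, expected_fact) pairs.
    """
    passed = 0
    for question, expected in probes:
        prompt = f"{passage}\nQ: {question}\nA:"
        answer = generate(prompt).strip().lower()
        if expected.lower() in answer:  # crude containment check
            passed += 1
    return passed / len(probes)
```

A real evaluation would need far more careful answer matching than substring containment, but even this crude loop exposes models that merely pattern-match the surface text.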

His overall argument is relatively simple:

“…however systems are implemented, if they need to have faithful world models, then evaluations should systematically test whether they have faithful world models.”

This, according to Dunietz, is rarely done.

The Allen Institute for AI has proposed other ways to harden the evaluations:

  • Targeting diverse linguistic structures.
  • Asking questions that rely on multiple reasoning steps.
  • Aggregating many benchmarks.

Other researchers have focused on testing common sense.

Most of these still focus on compiling questions.

Maybe one has to go beyond?

One example is fictional stories.

These cannot be Googled.

Elemental Cognition’s CEO, David Ferrucci, has proposed a four-part template for testing an AI system’s ability to understand stories.

  1. Spatial: Where is everything located and how is it positioned throughout the story?
  2. Temporal: What events occur and when?
  3. Causal: How do events lead mechanistically to other events?
  4. Motivational: Why do the characters decide to take the actions they take?
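The four dimensions above could be encoded as a small test-suite structure, so that a story benchmark stays balanced across all of them. A speculative sketch, assuming nothing about Ferrucci’s actual implementation (class and field names are my own):

```python
from dataclasses import dataclass
from enum import Enum

class Dimension(Enum):
    """Ferrucci's four dimensions of story comprehension."""
    SPATIAL = "spatial"            # where things are and how they are positioned
    TEMPORAL = "temporal"          # what events occur and when
    CAUSAL = "causal"              # how events mechanistically lead to others
    MOTIVATIONAL = "motivational"  # why characters act as they do

@dataclass
class StoryQuestion:
    """One probe of a system's model of a fictional story."""
    dimension: Dimension
    question: str
    gold_answer: str

def coverage(questions):
    """Count probes per dimension, to keep a test suite balanced."""
    counts = {d: 0 for d in Dimension}
    for q in questions:
        counts[q.dimension] += 1
    return counts
```

Because fictional stories cannot be looked up, a suite like this forces the system to build its answers from the text alone.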

Personally, I think this is an interesting path to follow in times ahead.

Particularly considering that such systems can be racist or misogynistic and create adverse outcomes.

Consider Black Lives Matter, protests against climate change and the pandemic. Text matters, and we need to think carefully about how these systems are developed, benchmarked and deployed.

AI Policy and Ethics at www.nora.ai. Student at University of Copenhagen MSc in Social Data Science. All views are my own. twitter.com/AlexMoltzau