Stanza Python NLP
Over the last few days I have been scratching the surface of tools for Natural Language Processing (NLP). In doing so I have touched on two packages: the Natural Language Toolkit (NLTK) and spaCy.
Today I thought I would consider another, this one released by the Stanford NLP Group at Stanford University. It is called Stanza. The group's earlier toolkit, CoreNLP, is written in Java; Stanza is a Python package, and it also includes a Python interface to CoreNLP.
What is the Stanford University NLP Group?
According to their website the Natural Language Processing Group at Stanford University is a:
“… team of faculty, postdocs, programmers and students who work together on algorithms that allow computers to process and understand human languages.”
Their work ranges from basic research in computational linguistics to key applications in human language technology, and covers areas such as:
- Sentence understanding.
- Automatic question answering.
- Machine translation.
- Syntactic parsing and tagging.
- Sentiment analysis.
- Models of text and visual scenes.
- Applications of natural language processing to the digital humanities and computational social sciences.
The group has members from both the Linguistics Department and the Computer Science Department, and it is also part of the Stanford AI Lab.
They claim Stanza provides implementations of fast neural network models for tokenization, multi-word token expansion, part-of-speech and morphological feature tagging, lemmatization, and dependency parsing using the Universal Dependencies formalism.
Pretrained models are provided for more than 70 human languages. That seems to be more than most other packages out there.
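To make that list of processors a little more concrete, here is a minimal sketch of how they might be chained in a Stanza pipeline. The helper names `word_rows` and `analyze` and the constant `PROCESSORS` are my own, not part of Stanza. The sketch assumes Stanza is installed (`pip install stanza`) and that the English models can be downloaded; whether the `mwt` processor is available for a given language depends on the model package and Stanza version, so treat the processor string as an assumption to adjust.

```python
# Processor names follow the steps described above; this exact string is an
# assumption and may need adjusting for a given language or Stanza version.
PROCESSORS = "tokenize,mwt,pos,lemma,depparse"


def word_rows(doc):
    """Flatten a parsed document into (text, lemma, upos, head, deprel) tuples.

    Works on any object that exposes `.sentences`, each with a `.words` list,
    which is how a Stanza Document is laid out.
    """
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            rows.append((word.text, word.lemma, word.upos, word.head, word.deprel))
    return rows


def analyze(text, lang="en"):
    """Hypothetical helper: download models, build a pipeline, parse `text`."""
    import stanza  # imported lazily so word_rows can be used without Stanza

    stanza.download(lang)  # fetches the pretrained models for `lang`
    nlp = stanza.Pipeline(lang=lang, processors=PROCESSORS)
    return word_rows(nlp(text))
```

Calling `analyze("The students parsed a sentence.")` should give one tuple per word, where `head` is the 1-based index of the word's syntactic head (0 for the root) and `deprel` is its Universal Dependencies relation. The first call downloads the models, so it needs a network connection.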
An InfoQ article from May 2020 included a comparison of Stanza with other NLP toolkits.
However, according to discussions on Reddit's r/LanguageTechnology, spaCy still seems to be superior in terms of speed.
The choice may therefore depend on whether you prioritize speed or utility. It was also mentioned that, because spaCy is more widely used, it may be easier to find solutions there when you run into issues.
There also seem to be more teaching materials available for spaCy.
That is great, but it is still important to keep track of developments in other packages, especially if one of them excels at a task or way of working that is specific to one of your projects.