Let’s Talk About spaCy
Industrial-Strength Natural Language Processing
The number of libraries for natural language processing (NLP) keeps growing. One of them is spaCy.
“spaCy (/speɪˈsiː/ spay-SEE) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython”
spaCy is often seen as a contender to the Natural Language Toolkit, more commonly known as NLTK. NLTK is a suite of libraries and programs for symbolic and statistical natural language processing for English, also written in Python.
According to a blog post from ActiveState, the two libraries differ in philosophy. NLTK was built by scholars and researchers as a tool to help you create complex NLP functions. “In contrast, spaCy is similar to a service: it helps you get specific tasks done.”
A further difference stems from the way in which these libraries were built:
- NLTK is essentially a string processing library: its functions typically take strings as input and return processed strings or lists of strings.
- spaCy takes an object-oriented approach. Processing a text returns objects (such as documents and tokens) instead of plain strings or arrays.
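To make the contrast concrete, here is a minimal sketch that tokenizes the same sentence with both libraries. It assumes both packages are installed; the sample sentence and variable names are made up for illustration. To avoid any downloads it uses NLTK's regex-based `wordpunct_tokenize` and a blank English pipeline from spaCy, so no trained model is involved.

```python
import spacy
from nltk.tokenize import wordpunct_tokenize
from spacy.tokens import Doc, Token

text = "spaCy returns objects, not strings."  # illustrative sample sentence

# NLTK: the tokenizer takes a string and returns a list of plain strings.
nltk_tokens = wordpunct_tokenize(text)
print(type(nltk_tokens[0]).__name__, nltk_tokens)

# spaCy: a blank English pipeline (tokenizer only, no trained components)
# returns a Doc object made up of Token objects.
nlp = spacy.blank("en")
doc = nlp(text)
print(type(doc).__name__, type(doc[0]).__name__)

# Each Token carries attributes alongside its text, for example its
# position in the Doc, its character offset, and a punctuation flag.
for token in doc:
    print(token.i, token.idx, token.text, token.is_punct)
```

The NLTK result is just text, so anything further (offsets, flags) has to be computed separately, whereas the spaCy tokens keep that information attached to the object.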
Another difference is that spaCy has grown to support over 50 languages.
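As a rough illustration of the multi-language support, the sketch below creates blank pipelines for a few language codes and tokenizes a sentence with each. The three languages and the sample sentences are arbitrary choices for this example, and a blank pipeline only provides language-specific tokenization rules. Trained models for tagging or parsing are separate downloads and are not available for every language.

```python
import spacy

# Illustrative samples; the language codes are ISO codes that spaCy accepts.
samples = {
    "en": "This is a sentence.",
    "de": "Das ist ein Satz.",
    "fr": "Ceci est une phrase.",
}

tokenized = {}
for lang_code, sentence in samples.items():
    # spacy.blank builds an empty pipeline with that language's tokenizer.
    nlp = spacy.blank(lang_code)
    doc = nlp(sentence)
    tokenized[lang_code] = [token.text for token in doc]
    print(nlp.lang, tokenized[lang_code])
```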
NLTK is often referred to in recent books about NLP; spaCy, however, seems somewhat more accessible. It provides fast and accurate syntactic analysis.
In 2015, independent researchers from Emory University and Yahoo! Labs showed that spaCy offered the fastest syntactic parser in the world and that its accuracy was within 1% of the best available (Choi et al., 2015). spaCy v2.0, released in 2017, is more accurate than any of the systems Choi et al. evaluated.
It is argued that the underlying philosophy of spaCy is providing a service rather than being a tool. Perhaps for that reason it is user-friendly and performs well.