NLP Frameworks#

Apache OpenNLP#

The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron-based machine learning.
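
To give a feel for the perceptron-based learning mentioned above, here is a minimal sketch in Python. It is not OpenNLP's Java API. The feature names, the toy training data and the helper functions are all assumptions made for this illustration. It trains a binary classifier that decides whether a period ends a sentence.

```python
from collections import defaultdict

def predict(weights, features):
    """Return +1 or -1 from the sign of the summed feature weights."""
    score = sum(weights[f] for f in features)
    return 1 if score > 0 else -1

def train_perceptron(examples, epochs=10):
    """Classic perceptron rule: on each mistake, shift weights toward the true label."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in examples:
            if predict(weights, features) != label:
                for f in features:
                    weights[f] += label
    return weights

# Hypothetical toy data: does this period end a sentence (+1) or not (-1)?
examples = [
    ({"next_is_upper", "prev_is_word"}, 1),
    ({"next_is_lower", "prev_is_abbrev"}, -1),
    ({"next_is_upper", "prev_is_abbrev"}, -1),
    ({"next_is_lower", "prev_is_word"}, -1),
]
weights = train_perceptron(examples)
print([predict(weights, feats) for feats, _ in examples])  # [1, -1, -1, -1]
```

A production system would use far richer features and averaged weights. The sketch only shows the mistake-driven update.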

The goal of the OpenNLP project is to create a mature toolkit for the tasks mentioned above. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.

Item	Value
SBB License	Apache License 2.0
Core Technology	Java
Project URL	http://opennlp.apache.org/
Source Location	http://opennlp.apache.org/source-code.html
Tag(s)	NLP

Apache Tika#

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Wrappers are available for using Tika from other programming languages, such as Julia or Python.
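
One common ingredient of file type detection is checking the leading "magic" bytes of a file. The sketch below illustrates that idea in plain Python. It is not Tika's API, and the signature table, the media type labels and the `detect_type` name are assumptions chosen for illustration. Real detectors combine several signals.

```python
# Hypothetical signature table: leading bytes -> media type label.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    # ZIP container, also used by .docx/.xlsx/.pptx files.
    (b"PK\x03\x04", "application/zip"),
    # OLE2 compound file, used by legacy .doc/.xls/.ppt files.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/x-ole-storage"),
]

def detect_type(data: bytes) -> str:
    """Return a media type based on the first bytes, or a generic fallback."""
    for magic, media_type in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return media_type
    return "application/octet-stream"

print(detect_type(b"%PDF-1.7\n..."))  # application/pdf
print(detect_type(b"plain text"))     # application/octet-stream
```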

Item	Value
SBB License	Apache License 2.0
Core Technology	Java
Project URL	https://tika.apache.org/
Source Location	https://tika.apache.org/
Tag(s)	NLP

BERT#

BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

The academic paper, which describes BERT in detail and provides full results on a number of tasks, is available at https://arxiv.org/abs/1810.04805.

Open source pre-trained NLP models from Google Research.

Item	Value
SBB License	Apache License 2.0
Core Technology	Python
Project URL	google-research/bert
Source Location	google-research/bert
Tag(s)	NLP

Bling Fire#

A lightning fast Finite State machine and REgular expression manipulation library. Bling Fire Tokenizer is a tokenizer designed for fast-speed and quality tokenization of Natural Language text. It mostly follows the tokenization logic of NLTK, except hyphenated words are split and a few errors are fixed.
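
The effect of splitting hyphenated words can be illustrated with a small regular-expression tokenizer. This is only a sketch of the behaviour described above, written with Python's standard `re` module. It does not use Bling Fire's API or its compiled finite state machines, and the `tokenize` helper is a hypothetical name.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens; hyphens become separate tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("A state-of-the-art tokenizer."))
# ['A', 'state', '-', 'of', '-', 'the', '-', 'art', 'tokenizer', '.']
```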

Item	Value
SBB License	MIT License
Core Technology	CPP
Project URL	Microsoft/BlingFire
Source Location	Microsoft/BlingFire
Tag(s)	NLP

ERNIE#

An implementation of ERNIE for language understanding, including pre-training models and fine-tuning tools.

ERNIE 2.0 is a continual pre-training framework for language understanding in which pre-training tasks can be incrementally built and learned through multi-task learning. In this framework, different customized tasks can be incrementally introduced at any time. For example, tasks such as named entity prediction, discourse relation recognition, and sentence order prediction are leveraged to enable the models to learn language representations.

Item	Value
SBB License	Apache License 2.0
Core Technology	Python
Project URL	PaddlePaddle/ERNIE
Source Location	PaddlePaddle/ERNIE
Tag(s)	NLP, Python

fastText#

fastText is a library for efficient learning of word representations and sentence classification. Models can later be reduced in size to even fit on mobile devices.
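
fastText's published approach represents a word partly through its character n-grams, with `<` and `>` marking the word boundaries. The sketch below only extracts those n-grams in plain Python. It does not train embeddings and is not the fastText API. The function name and defaults are assumptions for illustration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the boundary-marked word."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because related words share many n-grams, a model built on them can produce a representation even for words it has not seen during training.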

Created by Facebook Open Source and freely available. It is also used for the new search on Stack Overflow; see https://stackoverflow.blog/2019/08/14/crokage-a-new-way-to-search-stack-overflow/

Item	Value
SBB License	MIT License
Core Technology	CPP, Python
Project URL	https://fasttext.cc/
Source Location	facebookresearch/fastText
Tag(s)	NLP

Flair#

A very simple framework for state-of-the-art NLP. Developed by Zalando Research.

Flair is:

A powerful NLP library. Flair allows you to apply our state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and classification.
Multilingual. Thanks to the Flair community, we support a rapidly growing number of languages. We also now include ‘one model, many languages’ taggers, i.e. single models that predict PoS or NER tags for input text in various languages.
A text embedding library. Flair has simple interfaces that allow you to use and combine different word and document embeddings, including our proposed Flair embeddings, BERT embeddings and ELMo embeddings.
A PyTorch NLP framework. Our framework builds directly on PyTorch, making it easy to train your own models and experiment with new approaches using Flair embeddings and classes.

Item	Value
SBB License	MIT License
Core Technology	Python
Project URL	zalandoresearch/flair
Source Location	zalandoresearch/flair
Tag(s)	ML, NLP, Python

Gensim#

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
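
Similarity retrieval can be sketched with bag-of-words vectors and cosine similarity. The example below uses only the Python standard library, not Gensim's API. The helper names and the three-document corpus are assumptions for illustration. Gensim itself is built for corpora far too large for this naive approach.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(query, corpus):
    """Return the corpus document closest to the query."""
    q = bow(query)
    return max(corpus, key=lambda doc: cosine(q, bow(doc)))

corpus = [
    "human computer interaction",
    "graph of trees",
    "survey of user opinion of computer systems",
]
print(most_similar("computer interaction", corpus))  # human computer interaction
```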

Item	Value
SBB License	MIT License
Core Technology	Python
Project URL	RaRe-Technologies/gensim
Source Location	RaRe-Technologies/gensim
Tag(s)	ML, NLP, Python

Icecaps#

Microsoft Icecaps is an open-source toolkit for building neural conversational systems. Icecaps provides an array of tools from recent conversation modeling and general NLP literature within a flexible paradigm that enables complex multi-task learning setups.

Background information can be found here: https://www.aclweb.org/anthology/P19-3021

Item	Value
SBB License	MIT License
Core Technology	Python
Project URL	https://www.microsoft.com/en-us/research/project/microsoft-icecaps/
Source Location	microsoft/icecaps
Tag(s)	NLP, Python

jiant#

jiant is a software toolkit for natural language processing research, designed to facilitate work on multitask learning and transfer learning for sentence understanding tasks.

New software for the General Language Understanding Evaluation (GLUE) benchmark. This software can be used for evaluating and analyzing natural language understanding systems.

See also: https://super.gluebenchmark.com/

Item	Value
SBB License	MIT License
Core Technology	Python
Project URL	https://jiant.info/
Source Location	nyu-mll/jiant
Tag(s)	NLP, Python, Research

Neuralcoref#

State-of-the-art coreference resolution based on neural nets and spaCy.

NeuralCoref is a pipeline extension for spaCy 2.0 that annotates and resolves coreference clusters using a neural network. NeuralCoref is production-ready, integrated in spaCy’s NLP pipeline and easily extensible to new training datasets.

Item	Value
SBB License	MIT License
Core Technology	Python
Project URL	https://huggingface.co/coref/
Source Location	huggingface/neuralcoref
Tag(s)	ML, NLP, Python

NLP Architect#

NLP Architect is an open-source Python library for exploring the state-of-the-art deep learning topologies and techniques for natural language processing and natural language understanding. It is intended to be a platform for future research and collaboration.

Features:

Core NLP models used in many NLP tasks and useful in many NLP applications
Novel NLU models showcasing novel topologies and techniques
Optimized NLP/NLU models showcasing different optimization algorithms on neural NLP/NLU models
Model-oriented design:
- Train and run models from the command line.
- API for using models for inference in Python.
- Procedures to define custom processes for training, inference or anything related to processing.
- CLI sub-system for running procedures
Based on optimized Deep Learning frameworks:
Essential utilities for working with NLP models – Text/String pre-processing, IO, data-manipulation, metrics, embeddings.

Item	Value
SBB License	Apache License 2.0
Core Technology	Python
Project URL	http://nlp_architect.nervanasys.com/
Source Location	NervanaSystems/nlp-architect
Tag(s)	ML, ML Tool, NLP, Python

NLTK (Natural Language Toolkit)#

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

NLTK is known as a good learning platform, but it is not designed to robustly serve millions of customers. It is ideal for experimenting, but not always the right choice for business use cases.

See also the free online book (published by O'Reilly).

Item	Value
SBB License	Apache License 2.0
Core Technology	Python
Project URL	http://www.nltk.org
Source Location	nltk/nltk
Tag(s)	NLP

Pattern#

Pattern is a web mining module for Python. It has tools for:

Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser
Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
Network Analysis: graph centrality and visualization.

Item	Value
SBB License	BSD License 2.0 (3-clause, New or Revised) License
Core Technology	Python
Project URL	clips/pattern
Source Location	clips/pattern
Tag(s)	ML, NLP, Web scraping

Rant#

Rant is an all-purpose procedural text engine that is most simply described as the opposite of Regex. It has been refined to include a dizzying array of features for handling everything from the most basic of string generation tasks to advanced dialogue generation, code templating, automatic formatting, and more.

The goal of the project is to enable developers of all kinds to automate repetitive writing tasks with a high degree of creative freedom.

Features:

Recursive, weighted branching with several selection modes
Queryable dictionaries
Automatic capitalization, rhyming, English indefinite articles, and multi-lingual number verbalization
Print to multiple separate outputs
Probability modifiers for pattern elements
Loops, conditional statements, and subroutines
Fully-functional object model
Import/Export resources easily with the .rantpkg format
Compatible with Unity 2017
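
The idea behind recursive, weighted branching can be sketched in a few lines of Python. This is not Rant syntax or the Rant object model. The grammar format, the weights and the `expand` function are assumptions made for this illustration.

```python
import random

def expand(symbol, grammar, rng):
    """Recursively expand a symbol by choosing one of its weighted branches."""
    if symbol not in grammar:
        return symbol  # plain text, nothing left to expand
    branches = grammar[symbol]
    weights = [weight for weight, _ in branches]
    _, parts = rng.choices(branches, weights=weights, k=1)[0]
    return "".join(expand(part, grammar, rng) for part in parts)

# Hypothetical grammar: each symbol maps to (weight, sequence of parts) branches.
grammar = {
    "greeting": [(3, ["Hello, ", "name", "!"]), (1, ["Hi ", "name", "."])],
    "name": [(1, ["world"]), (1, ["friend"])],
}

rng = random.Random(42)  # seeded so repeated runs give the same text
print(expand("greeting", grammar, rng))
```

With these weights the first greeting form is picked about three times as often as the second.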

Item	Value
SBB License	MIT License
Core Technology	.NET
Source Location	rant-lang/rant
Tag(s)	.NET, ML, NLP, text generation

SpaCy#

Industrial-strength Natural Language Processing (NLP) with Python and Cython

Features:

Non-destructive tokenization
Named entity recognition
Support for 26+ languages
13 statistical models for 8 languages
Pre-trained word vectors
Easy deep learning integration
Part-of-speech tagging
Labelled dependency parsing
Syntax-driven sentence segmentation
Built in visualizers for syntax and NER
Convenient string-to-hash mapping
Export to numpy data arrays
Efficient binary serialization
Easy model packaging and deployment
State-of-the-art speed
Robust, rigorously evaluated accuracy
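
"Non-destructive tokenization" means the original text can be rebuilt exactly from the tokens. The sketch below shows one way to get that property in plain Python, by storing the whitespace that followed each token. It is not spaCy's implementation or API. The `Token` type and both helpers are assumptions for illustration, and leading whitespace is not handled.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    text: str        # the token itself
    whitespace: str  # whitespace that followed it in the source text

def tokenize(text: str) -> List[Token]:
    """Split into word/punctuation tokens, remembering trailing whitespace."""
    pattern = re.compile(r"(\w+|[^\w\s])(\s*)")
    return [Token(m.group(1), m.group(2)) for m in pattern.finditer(text)]

def detokenize(tokens: List[Token]) -> str:
    """Rebuild the original text from tokens and their whitespace."""
    return "".join(tok.text + tok.whitespace for tok in tokens)

text = "Hello, world!  It's  fine."
tokens = tokenize(text)
print([tok.text for tok in tokens])
# ['Hello', ',', 'world', '!', 'It', "'", 's', 'fine', '.']
print(detokenize(tokens) == text)  # True
```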

Item	Value
SBB License	MIT License
Core Technology	Python
Project URL	https://spacy.io/
Source Location	explosion/spaCy
Tag(s)	NLP

Stanford CoreNLP#

Stanford CoreNLP provides a set of human language technology tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and syntactic dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get the quotes people said, etc.

Choose Stanford CoreNLP if you need:

An integrated NLP toolkit with a broad range of grammatical analysis tools
A fast, robust annotator for arbitrary texts, widely used in production
A modern, regularly updated package, with the overall highest quality text analytics
Support for a number of major (human) languages
Available APIs for most major modern programming languages
Ability to run as a simple web service

Item	Value
SBB License	GNU General Public License (GPL) 3.0
Core Technology	Java
Project URL	https://stanfordnlp.github.io/CoreNLP/
Source Location	stanfordnlp/CoreNLP
Tag(s)	NLP

Sumeval#

A well-tested, multi-language evaluation framework for text summarization.
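
A widely used family of summarization metrics is ROUGE-N, which measures n-gram overlap between a candidate summary and a reference. The following is a minimal sketch of ROUGE-N recall using only the Python standard library. It is independent of sumeval's API, and the whitespace tokenization and function names are simplifying assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(summary, reference, n=1):
    """Fraction of reference n-grams that also occur in the summary."""
    ref = ngrams(reference.lower().split(), n)
    summ = ngrams(summary.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((ref & summ).values())  # clipped counts
    return overlap / total

print(rouge_n_recall("the cat sat", "the cat sat on the mat", n=1))  # 0.5
print(rouge_n_recall("the cat sat", "the cat sat on the mat", n=2))  # 0.4
```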

Item	Value
SBB License	Apache License 2.0
Core Technology	Python
Project URL	chakki-works/sumeval
Source Location	chakki-works/sumeval
Tag(s)	NLP, Python

Texar-PyTorch#

Texar-PyTorch is a toolkit that aims to support a broad set of machine learning tasks, especially natural language processing and text generation. Texar provides a library of easy-to-use ML modules and functionalities for composing arbitrary models and algorithms. The tool is designed for both researchers and practitioners for fast prototyping and experimentation.

Texar-PyTorch integrates many of the best features of TensorFlow into PyTorch, delivering highly usable and customizable modules superior to PyTorch native ones.

Item	Value
SBB License	Apache License 2.0
Core Technology	Python
Project URL	https://asyml.io/
Source Location	asyml/texar-pytorch
Tag(s)	ML, NLP, Python

TextBlob: Simplified Text Processing#

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

Features

Noun phrase extraction
Part-of-speech tagging
Sentiment analysis
Classification (Naive Bayes, Decision Tree)
Language translation and detection powered by Google Translate
Tokenization (splitting text into words and sentences)
Word and phrase frequencies
Parsing
n-grams
Word inflection (pluralization and singularization) and lemmatization
Spelling correction
Add new models or languages through extensions
WordNet integration
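
To illustrate the Naive Bayes classification listed above, here is a minimal multinomial Naive Bayes sketch with add-one smoothing in plain Python. It does not use TextBlob's API. The function names and the toy training sentences are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        word_counts[label].update(text.lower().split())
        label_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, doc_count in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(doc_count / total_docs)  # log prior
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set.
model = train([
    ("I love this sandwich", "pos"),
    ("this is an amazing place", "pos"),
    ("I do not like this restaurant", "neg"),
    ("I am tired of this stuff", "neg"),
])
print(classify("I love this place", model))  # pos
print(classify("not like", model))           # neg
```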

Item	Value
SBB License	MIT License
Core Technology	Python
Project URL	https://textblob.readthedocs.io/en/dev/
Source Location	sloria/textblob
Tag(s)	NLP, Python

Thinc#

Thinc is the machine learning library powering spaCy. It features a battle-tested linear model designed for large sparse learning problems, and a flexible neural network model under development for spaCy v2.0.

Thinc is a lightweight deep learning library that offers an elegant, type-checked, functional-programming API for composing models, with support for layers defined in other frameworks such as PyTorch, TensorFlow and MXNet. You can use Thinc as an interface layer, a standalone toolkit or a flexible way to develop new models.
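
The functional style of composing models can be sketched without any framework. In the example below, each "layer" is just a function from a list of floats to a list of floats, and a combinator chains them together. The names and types here are hypothetical stand-ins written for this illustration, not Thinc's own implementation.

```python
from functools import reduce
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]

def chain(*layers: Layer) -> Layer:
    """Compose layers left to right into a single callable."""
    def forward(xs: List[float]) -> List[float]:
        return reduce(lambda acc, layer: layer(acc), layers, xs)
    return forward

def scale(factor: float) -> Layer:
    """Layer factory: multiply every input by a constant."""
    return lambda xs: [factor * x for x in xs]

def relu(xs: List[float]) -> List[float]:
    """Replace negative values with zero."""
    return [max(0.0, x) for x in xs]

model = chain(scale(2.0), relu)
print(model([-1.0, 0.5]))  # [0.0, 1.0]
```

Since the composed model is itself a `Layer`, it can be passed to `chain` again to build larger models.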

Thinc is a practical toolkit for implementing models that follow the “Embed, encode, attend, predict” architecture. It’s designed to be easy to install, efficient for CPU usage and optimised for NLP and deep learning with text – in particular, hierarchically structured input and variable-length sequences.

Item	Value
SBB License	MIT License
Core Technology	Python
Project URL	https://thinc.ai/
Source Location	explosion/thinc
Tag(s)	ML, ML Framework, NLP, Python

Torchtext#

Data loaders and abstractions for text and NLP. Built on PyTorch.
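
What a text data loader typically does can be sketched in plain Python: map tokens to integer ids and pad each batch to a common length. This is a conceptual sketch only. It does not use Torchtext or PyTorch, and the helper names and the special tokens `<pad>` and `<unk>` are assumptions for illustration.

```python
def build_vocab(sentences, specials=("<pad>", "<unk>")):
    """Assign an integer id to each special token and each distinct word."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in sentences:
        for tok in sentence.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def batches(sentences, vocab, batch_size):
    """Yield batches of id lists, padded to the longest sentence in each batch."""
    for start in range(0, len(sentences), batch_size):
        chunk = [
            [vocab.get(tok, vocab["<unk>"]) for tok in s.split()]
            for s in sentences[start:start + batch_size]
        ]
        width = max(len(ids) for ids in chunk)
        yield [ids + [vocab["<pad>"]] * (width - len(ids)) for ids in chunk]

sentences = ["a b c", "a", "d e"]
vocab = build_vocab(sentences)
print(list(batches(sentences, vocab, batch_size=2)))
# [[[2, 3, 4], [2, 0, 0]], [[5, 6]]]
```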

Item	Value
SBB License	BSD License 2.0 (3-clause, New or Revised) License
Core Technology	Python
Project URL	pytorch/text
Source Location	pytorch/text
Tag(s)	NLP

Transformers#

Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

Features

As easy to use as pytorch-transformers
As powerful and concise as Keras
High performance on NLU and NLG tasks
Low barrier to entry for educators and practitioners

State-of-the-art NLP for everyone:

Deep learning researchers
Hands-on practitioners
AI/ML/NLP teachers and educators

Lower compute costs, smaller carbon footprint

Researchers can share trained models instead of always retraining
Practitioners can reduce compute time and production costs
8 architectures with over 30 pretrained models, some in more than 100 languages

Item	Value
SBB License	Apache License 2.0
Core Technology	Python
Project URL	https://huggingface.co/transformers/
Source Location	huggingface/transformers
Tag(s)	NLP, Python

End of SBB list
