NLP Frameworks#
Apache OpenNLP#
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.
The goal of the OpenNLP project is to create a mature toolkit for the tasks mentioned above. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.
Item | Value
---|---
SBB License | Apache License 2.0
Core Technology | Java
Project URL |
Source Location |
Tag(s) | NLP
Apache Tika#
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
Several wrappers are available for using Tika from other programming languages, such as Julia or Python.
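As a rough illustration, a minimal sketch using the community tika-python wrapper (an assumption; the wrapper needs a Java runtime and starts or connects to a Tika server behind the scenes), with a hypothetical input file:

```python
# Sketch: extract text and metadata from any supported file type via tika-python
# (pip install tika). The file name is a hypothetical example.
from tika import parser

parsed = parser.from_file("example.pdf")
print(parsed["metadata"])  # extracted metadata (content type, author, ...)
print(parsed["content"])   # extracted plain text
```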
Item | Value
---|---
SBB License | Apache License 2.0
Core Technology | Java
Project URL |
Source Location |
Tag(s) | NLP
BERT#
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
The academic paper that describes BERT in detail and provides full results on a number of tasks can be found here: https://arxiv.org/abs/1810.04805.
The OSS NLP pre-trained models and training code are published by Google Research.
Item | Value
---|---
SBB License | Apache License 2.0
Core Technology | Python
Project URL |
Source Location |
Tag(s) | NLP
Bling Fire#
A lightning-fast finite state machine and regular expression manipulation library. The Bling Fire Tokenizer is a tokenizer designed for fast, high-quality tokenization of natural language text. It mostly follows the tokenization logic of NLTK, except that hyphenated words are split and a few errors are fixed.
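A minimal sketch of the tokenizer using the blingfire Python bindings (assuming `pip install blingfire`); the two helper functions shown are part of the package's basic API:

```python
# Sketch: sentence and word tokenization with the blingfire Python bindings.
from blingfire import text_to_sentences, text_to_words

text = "Mr. Smith bought cheapsite.com for 1.5 million dollars. He paid a lot."
print(text_to_sentences(text))  # one sentence per output line
print(text_to_words(text))      # space-separated tokens
```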
Item | Value
---|---
SBB License | MIT License
Core Technology | CPP
Project URL |
Source Location |
Tag(s) | NLP
ERNIE#
An implementation of ERNIE for language understanding (including pre-training models and fine-tuning tools).
ERNIE 2.0 is a continual pre-training framework for language understanding in which pre-training tasks can be incrementally built and learned through multi-task learning. In this framework, different customized tasks can be incrementally introduced at any time. For example, tasks such as named entity prediction, discourse relation recognition, and sentence order prediction are leveraged in order to enable the models to learn language representations.
Item | Value
---|---
SBB License | Apache License 2.0
Core Technology | Python
Project URL |
Source Location |
Tag(s) | NLP, Python
fastText#
fastText is a library for efficient learning of word representations and sentence classification. Models can later be reduced in size so that they even fit on mobile devices.
Created by Facebook Open Source and now available to everyone. It is also used for the new search on Stack Overflow, see https://stackoverflow.blog/2019/08/14/crokage-a-new-way-to-search-stack-overflow/
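A minimal sketch of supervised text classification with the fasttext Python bindings, assuming a hypothetical training file train.txt with one `__label__<class> <text>` example per line:

```python
# Sketch: train, predict and shrink a fastText classifier (pip install fasttext).
import fasttext

model = fasttext.train_supervised(input="train.txt", epoch=10)  # hypothetical training file
labels, probabilities = model.predict("which baking dish is best for banana bread ?")
print(labels, probabilities)

# Quantize to reduce the model size, e.g. so it can fit on mobile devices.
model.quantize(input="train.txt", retrain=True)
model.save_model("model.ftz")
```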
Item | Value
---|---
SBB License | MIT License
Core Technology | CPP, Python
Project URL |
Source Location |
Tag(s) | NLP
Flair#
A very simple framework for state-of-the-art NLP. Developed by Zalando Research.
Flair is:
- A powerful NLP library. Flair allows you to apply our state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and classification.
- Multilingual. Thanks to the Flair community, we support a rapidly growing number of languages. We also now include 'one model, many languages' taggers, i.e. single models that predict PoS or NER tags for input text in various languages.
- A text embedding library. Flair has simple interfaces that allow you to use and combine different word and document embeddings, including our proposed Flair embeddings, BERT embeddings and ELMo embeddings.
- A Pytorch NLP framework. Our framework builds directly on Pytorch, making it easy to train your own models and experiment with new approaches using Flair embeddings and classes.
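As a quick illustration, a minimal named entity recognition sketch; the "ner" model name refers to Flair's pre-trained English tagger and is downloaded on first use:

```python
# Sketch: tag named entities in a sentence with a pre-trained Flair model.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")
sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence)

for entity in sentence.get_spans("ner"):
    print(entity)  # entity text, span and predicted label
```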
Item | Value
---|---
SBB License | MIT License
Core Technology | Python
Project URL |
Source Location |
Tag(s) | ML, NLP, Python
Gensim#
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
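As a rough illustration of the topic-modelling workflow, a minimal LDA sketch on a toy corpus (the documents and parameters are made up for illustration):

```python
# Sketch: build a dictionary, a bag-of-words corpus and an LDA topic model.
from gensim import corpora, models

texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "trees", "minors", "survey"],
]
dictionary = corpora.Dictionary(texts)                 # token -> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())
```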
Item | Value
---|---
SBB License | MIT License
Core Technology | Python
Project URL |
Source Location |
Tag(s) | ML, NLP, Python
Icecaps#
Microsoft Icecaps is an open-source toolkit for building neural conversational systems. Icecaps provides an array of tools from recent conversation modeling and general NLP literature within a flexible paradigm that enables complex multi-task learning setups.
Background information can be found here: https://www.aclweb.org/anthology/P19-3021
Item | Value
---|---
SBB License | MIT License
Core Technology | Python
Project URL | https://www.microsoft.com/en-us/research/project/microsoft-icecaps/
Source Location |
Tag(s) | NLP, Python
jiant#
jiant is a software toolkit for natural language processing research, designed to facilitate work on multitask learning and transfer learning for sentence understanding tasks.
It is new software for the General Language Understanding Evaluation (GLUE) benchmark and can be used for evaluating and analyzing natural language understanding systems.
See also: https://super.gluebenchmark.com/
Item | Value
---|---
SBB License | MIT License
Core Technology | Python
Project URL |
Source Location |
Tag(s) | NLP, Python, Research
Neuralcoref#
State-of-the-art coreference resolution based on neural nets and spaCy.
NeuralCoref is a pipeline extension for spaCy 2.0 that annotates and resolves coreference clusters using a neural network. NeuralCoref is production-ready, integrated in spaCy’s NLP pipeline and easily extensible to new training datasets.
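A minimal sketch of resolving coreference clusters with NeuralCoref as a spaCy 2.x pipeline extension (assuming `pip install neuralcoref` and an installed English spaCy model):

```python
# Sketch: add NeuralCoref to a spaCy pipeline and inspect coreference clusters.
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("My sister has a dog. She loves him.")
print(doc._.has_coref)       # True if any coreference cluster was found
print(doc._.coref_clusters)  # e.g. [My sister: [My sister, She], a dog: [a dog, him]]
```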
Item | Value
---|---
SBB License | MIT License
Core Technology | Python
Project URL |
Source Location |
Tag(s) | ML, NLP, Python
NLP Architect#
NLP Architect is an open-source Python library for exploring the state-of-the-art deep learning topologies and techniques for natural language processing and natural language understanding. It is intended to be a platform for future research and collaboration.
Features:
- Core NLP models used in many NLP tasks and useful in many NLP applications
- Novel NLU models showcasing novel topologies and techniques
- Optimized NLP/NLU models showcasing different optimization algorithms on neural NLP/NLU models
- Model-oriented design:
  - Train and run models from the command line.
  - API for using models for inference in Python.
  - Procedures to define custom processes for training, inference or anything related to processing.
  - CLI sub-system for running procedures
- Based on optimized Deep Learning frameworks
- Essential utilities for working with NLP models – text/string pre-processing, IO, data manipulation, metrics, embeddings.
Item | Value
---|---
SBB License | Apache License 2.0
Core Technology | Python
Project URL |
Source Location |
Tag(s) | ML, ML Tool, NLP, Python
NLTK (Natural Language Toolkit)#
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.
NLTK is known as a good learning platform, but it is not designed to robustly serve millions of customers. It is ideal for experimenting, but for some business use cases it is not always the right choice.
Check also the (free) online book (published by O'Reilly).
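As a quick illustration, a minimal sketch of tokenization, part-of-speech tagging and a WordNet lookup (the nltk.download() calls fetch the required corpora and models on first run; exact resource names can vary a little between NLTK versions):

```python
# Sketch: basic NLTK usage for tokenization, POS tagging and WordNet lookup.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet

tokens = word_tokenize("NLTK makes it easy to experiment with language data.")
print(pos_tag(tokens))                  # [('NLTK', 'NNP'), ('makes', 'VBZ'), ...]
print(wordnet.synsets("language")[:3])  # a few WordNet senses for "language"
```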
Item | Value
---|---
SBB License | Apache License 2.0
Core Technology | Python
Project URL |
Source Location |
Tag(s) | NLP
Pattern#
Pattern is a web mining module for Python. It has tools for:
- Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser
- Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
- Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
- Network Analysis: graph centrality and visualization.
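A minimal sketch of Pattern's NLP helpers (assuming `pip install pattern`); sentiment() returns a (polarity, subjectivity) pair and parse() produces POS-tagged, chunked output:

```python
# Sketch: part-of-speech tagging, sentiment analysis and inflection with Pattern.
from pattern.en import parse, sentiment, pluralize

print(sentiment("The movie attempts to be surreal by incorporating various time paradoxes."))
print(parse("I eat pizza with a fork."))  # POS-tagged and chunked sentence
print(pluralize("analysis"))              # -> "analyses"
```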
Item | Value
---|---
SBB License | BSD License 2.0 (3-clause, New or Revised) License
Core Technology | Python
Project URL |
Source Location |
Tag(s) | ML, NLP, Web scraping
Rant#
Rant is an all-purpose procedural text engine that is most simply described as the opposite of Regex. It has been refined to include a dizzying array of features for handling everything from the most basic of string generation tasks to advanced dialogue generation, code templating, automatic formatting, and more.
The goal of the project is to enable developers of all kinds to automate repetitive writing tasks with a high degree of creative freedom.
Features:
- Recursive, weighted branching with several selection modes
- Queryable dictionaries
- Automatic capitalization, rhyming, English indefinite articles, and multi-lingual number verbalization
- Print to multiple separate outputs
- Probability modifiers for pattern elements
- Loops, conditional statements, and subroutines
- Fully-functional object model
- Import/Export resources easily with the .rantpkg format
- Compatible with Unity 2017
Item | Value
---|---
SBB License | MIT License
Core Technology | .NET
Source Location |
Tag(s) | .NET, ML, NLP, text generation
.NET, ML, NLP, text generation |
SpaCy#
Industrial-strength Natural Language Processing (NLP) with Python and Cython
Features:
- Non-destructive tokenization
- Named entity recognition
- Support for 26+ languages
- 13 statistical models for 8 languages
- Pre-trained word vectors
- Easy deep learning integration
- Part-of-speech tagging
- Labelled dependency parsing
- Syntax-driven sentence segmentation
- Built-in visualizers for syntax and NER
- Convenient string-to-hash mapping
- Export to numpy data arrays
- Efficient binary serialization
- Easy model packaging and deployment
- State-of-the-art speed
- Robust, rigorously evaluated accuracy
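A minimal sketch of the core pipeline, assuming the small English model has been installed (e.g. `python -m spacy download en_core_web_sm`):

```python
# Sketch: tokenization, POS tags, dependency labels and named entities with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.pos_, token.dep_)

for ent in doc.ents:
    print(ent.text, ent.label_)
```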
Item | Value
---|---
SBB License | MIT License
Core Technology | Python
Project URL |
Source Location |
Tag(s) | NLP
Stanford CoreNLP#
Stanford CoreNLP provides a set of human language technology tools. It can give the base forms of words, their parts of speech and whether they are names of companies, people, etc.; normalize dates, times and numeric quantities; mark up the structure of sentences in terms of phrases and syntactic dependencies; indicate which noun phrases refer to the same entities; indicate sentiment; extract particular or open-class relations between entity mentions; and get the quotes people said.
Choose Stanford CoreNLP if you need:
- An integrated NLP toolkit with a broad range of grammatical analysis tools
- A fast, robust annotator for arbitrary texts, widely used in production
- A modern, regularly updated package, with the overall highest quality text analytics
- Support for a number of major (human) languages
- Available APIs for most major modern programming languages
- Ability to run as a simple web service
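Because CoreNLP can run as a simple web service, one way to use it from other languages is plain HTTP. A minimal sketch in Python, assuming a CoreNLP server has already been started locally (for example with `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000`):

```python
# Sketch: query a locally running CoreNLP server over HTTP and print token annotations.
import json
import requests

props = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}
response = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="Stanford University is located in California.".encode("utf-8"),
)
annotation = response.json()
for token in annotation["sentences"][0]["tokens"]:
    print(token["word"], token["pos"], token["ner"])
```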
Item | Value
---|---
SBB License | GNU General Public License (GPL) 3.0
Core Technology | Java
Project URL |
Source Location |
Tag(s) | NLP
Sumeval#
A well-tested, multi-language evaluation framework for text summarization.
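A minimal scoring sketch based on the calculator classes the project documents (assuming `pip install sumeval`); the example summary and reference strings are made up:

```python
# Sketch: compute ROUGE-1 and BLEU scores with sumeval's calculator classes.
from sumeval.metrics.rouge import RougeCalculator
from sumeval.metrics.bleu import BLEUCalculator

rouge = RougeCalculator(stopwords=True, lang="en")
rouge_1 = rouge.rouge_n(
    summary="I went to the Mars from my living town.",
    references="I went to Mars",
    n=1,
)
print("ROUGE-1:", rouge_1)

bleu = BLEUCalculator()
print("BLEU:", bleu.bleu("I am waiting on the beach", "He is walking on the beach"))
```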
Item | Value
---|---
SBB License | Apache License 2.0
Core Technology | Python
Project URL |
Source Location |
Tag(s) | NLP, Python
Texar-PyTorch#
Texar-PyTorch is a toolkit aiming to support a broad set of machine learning tasks, especially natural language processing and text generation. Texar provides a library of easy-to-use ML modules and functionalities for composing whatever models and algorithms you need. The tool is designed for both researchers and practitioners for fast prototyping and experimentation.
Texar-PyTorch integrates many of the best features of TensorFlow into PyTorch, delivering highly usable and customizable modules superior to PyTorch native ones.
Item | Value
---|---
SBB License | Apache License 2.0
Core Technology | Python
Project URL |
Source Location |
Tag(s) | ML, NLP, Python
TextBlob: Simplified Text Processing#
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Features
- Noun phrase extraction
- Part-of-speech tagging
- Sentiment analysis
- Classification (Naive Bayes, Decision Tree)
- Language translation and detection powered by Google Translate
- Tokenization (splitting text into words and sentences)
- Word and phrase frequencies
- Parsing
- n-grams
- Word inflection (pluralization and singularization) and lemmatization
- Spelling correction
- Add new models or languages through extensions
- WordNet integration
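A minimal sketch of the API described above (assuming the TextBlob corpora have been fetched once with `python -m textblob.download_corpora`):

```python
# Sketch: POS tags, noun phrases and sentiment with TextBlob.
from textblob import TextBlob

blob = TextBlob("TextBlob makes simple NLP tasks almost trivial. I love it!")

print(blob.tags)          # part-of-speech tags
print(blob.noun_phrases)  # noun phrase extraction
print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)
```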
Item | Value
---|---
SBB License | MIT License
Core Technology | Python
Project URL |
Source Location |
Tag(s) | NLP, Python
Thinc#
Thinc is the machine learning library powering spaCy. It features a battle-tested linear model designed for large sparse learning problems, and a flexible neural network model under development for spaCy v2.0.
Thinc is a lightweight deep learning library that offers an elegant, type-checked, functional-programming API for composing models, with support for layers defined in other frameworks such as PyTorch, TensorFlow and MXNet. You can use Thinc as an interface layer, a standalone toolkit or a flexible way to develop new models.
Thinc is a practical toolkit for implementing models that follow the “Embed, encode, attend, predict” architecture. It’s designed to be easy to install, efficient for CPU usage and optimised for NLP and deep learning with text – in particular, hierarchically structured input and variable-length sequences.
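A rough sketch of the functional, composable API in Thinc v8+; the layer names and arguments follow the thinc.api combinators, but details are version-dependent and worth checking against the documentation:

```python
# Sketch: compose and initialize a small feed-forward classifier with Thinc.
import numpy
from thinc.api import chain, Relu, Softmax

model = chain(Relu(nO=64, dropout=0.2), Relu(nO=64, dropout=0.2), Softmax())

X = numpy.zeros((8, 20), dtype="float32")  # 8 samples, 20 features (dummy data)
Y = numpy.zeros((8, 3), dtype="float32")   # 3 classes, one-hot targets (dummy data)
model.initialize(X=X, Y=Y)                 # infer missing layer dimensions from data
predictions = model.predict(X)
print(predictions.shape)                   # (8, 3)
```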
Item | Value
---|---
SBB License | MIT License
Core Technology | Python
Project URL |
Source Location |
Tag(s) | ML, ML Framework, NLP, Python
Torchtext#
Data loaders and abstractions for text and NLP. Built on PyTorch.
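The torchtext API has changed considerably between releases; as a rough sketch, basic tokenization and vocabulary building in a recent version that provides get_tokenizer and build_vocab_from_iterator:

```python
# Sketch: tokenize raw text and build a vocabulary with torchtext utilities.
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
docs = [
    "PyTorch makes NLP data loading easier.",
    "Torchtext provides the text abstractions.",
]

vocab = build_vocab_from_iterator((tokenizer(d) for d in docs), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

print([vocab[token] for token in tokenizer("PyTorch NLP")])  # token ids
```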
Item | Value
---|---
SBB License | BSD License 2.0 (3-clause, New or Revised) License
Core Technology |
Project URL |
Source Location |
Tag(s) | NLP
Transformers#
Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with over 32 pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
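As a quick illustration, a minimal sketch that loads a pretrained BERT checkpoint through the Auto* classes and encodes one sentence (weights are downloaded on first use; a recent transformers and PyTorch installation is assumed):

```python
# Sketch: encode a sentence with a pretrained BERT model via the Auto* classes.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers makes pretrained models easy to use.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)
```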
Features
- As easy to use as pytorch-transformers
- As powerful and concise as Keras
- High performance on NLU and NLG tasks
- Low barrier to entry for educators and practitioners
State-of-the-art NLP for everyone:
- Deep learning researchers
- Hands-on practitioners
- AI/ML/NLP teachers and educators
Lower compute costs, smaller carbon footprint:
- Researchers can share trained models instead of always retraining
- Practitioners can reduce compute time and production costs
- 8 architectures with over 30 pretrained models, some in more than 100 languages
Item | Value
---|---
SBB License | Apache License 2.0
Core Technology | Python
Project URL |
Source Location |
Tag(s) | NLP, Python
End of SBB list