An accurate natural language detection library for Golang

What does this library do?

Its task is simple: it tells you which language a given piece of text is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases include routing e-mails to the customer service department in the right geographic region, based on the language of each e-mail.

Why does this Golang library exist?

Language detection is often done as part of large machine learning frameworks or natural language processing applications. If you don’t need the full functionality of those systems, or don’t want to learn how to use them, a small, flexible library comes in handy.

So far, the only other comprehensive open-source library in the Go ecosystem for this task is Whatlanggo. Unfortunately, it has two major drawbacks:

  1. Detection only works with quite lengthy text fragments. It does not provide adequate results for very short text snippets such as Twitter messages.
  2. The more languages take part in the decision process, the less accurate the detection results become.

Lingua aims to eliminate these problems. She needs almost no configuration and yields accurate results on both long and short text, even on single words and phrases. She draws on both rule-based and statistical methods but does not use any dictionaries of words. She does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.
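To make the idea of combining rule-based and statistical methods more concrete, here is a toy sketch that uses only the Go standard library. It is not Lingua's implementation or API. The names `profiles` and `detect` are hypothetical, and the trigram lists are hand-picked for illustration rather than taken from any trained model. The rule-based step assigns text in Greek script directly to Greek. The statistical step counts how many character trigrams of the input appear in each language's small profile and picks the language with the most hits.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// profiles maps a language name to a few frequent character trigrams.
// ASSUMPTION: these values are hand-picked for illustration only.
var profiles = map[string]map[string]bool{
	"english": {"the": true, "and": true, "ing": true, "ion": true,
		"tha": true, "ent": true, "her": true, "hat": true},
	"german": {"der": true, "die": true, "und": true, "sch": true,
		"ein": true, "ich": true, "cht": true, "gen": true},
	"french": {"les": true, "des": true, "que": true, "ent": true,
		"ion": true, "ait": true, "eur": true, "our": true},
}

// detect is a hypothetical helper. It returns the guessed language and
// false if no unambiguous guess is possible.
func detect(text string) (string, bool) {
	runes := []rune(strings.ToLower(text))

	// Rule-based step: Greek script is treated as a unique marker for Greek.
	for _, r := range runes {
		if unicode.Is(unicode.Greek, r) {
			return "greek", true
		}
	}

	// Statistical step: count trigram hits per language profile.
	bestLang, bestScore, tie := "", 0, false
	for lang, trigrams := range profiles {
		score := 0
		for i := 0; i+3 <= len(runes); i++ {
			if trigrams[string(runes[i:i+3])] {
				score++
			}
		}
		switch {
		case score > bestScore:
			bestLang, bestScore, tie = lang, score, false
		case score == bestScore && score > 0:
			tie = true
		}
	}
	if bestScore == 0 || tie {
		return "", false
	}
	return bestLang, true
}

func main() {
	samples := []string{
		"the weather is nice and the sun is shining",
		"ich und der hund gehen in die schule",
		"καλημέρα κόσμε",
	}
	for _, s := range samples {
		if lang, ok := detect(s); ok {
			fmt.Printf("%q -> %s\n", s, lang)
		} else {
			fmt.Printf("%q -> unknown\n", s)
		}
	}
}
```

A profile of eight trigrams per language is far too small for real use. It only shows why statistical evidence becomes scarce on very short input, and why a script-based rule can settle a case without any statistics at all.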

Which languages are supported?

Compared to other language detection libraries, Lingua’s focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, the following 75 languages are supported: