Natural language detection

Overview

franc

Detect the language of text.

What’s so cool about franc?

  1. franc can support more languages(†) than any other library
  2. franc is packaged with support for 82, 187, or 406 languages
  3. franc has a CLI

† - Based on the UDHR, the most translated document in the world.

What’s not so cool about franc?

franc supports many languages, which means it’s easily confused on small samples. Make sure to pass it big documents to get reliable results.

Install

npm:

npm install franc

This installs the franc package, with support for 187 languages (those with 1 million or more speakers). franc-min (82 languages, 8M or more speakers) and franc-all (all 406 possible languages) are also available. Finally, install franc-cli to get the CLI.
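
All three packages expose the same API, so switching between them is a one-line change. A minimal sketch, assuming franc-min is installed:

var franc = require('franc-min')

franc('All human beings are born free and equal in dignity and rights') // => 'eng'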

Browser builds for franc-min, franc, and franc-all are available on GitHub Releases.

Use

var franc = require('franc')

franc('Alle menslike wesens word vry') // => 'afr'
franc('এটি একটি ভাষা একক IBM স্ক্রিপ্ট') // => 'ben'
franc('Alle menneske er fødde til fridom') // => 'nno'

franc('') // => 'und' (language code that stands for undetermined)

// You can change what’s too short (default: 10):
franc('the') // => 'und'
franc('the', {minLength: 3}) // => 'sco'

franc.all

console.log(franc.all('O Brasil caiu 26 posições'))

Yields:

[ [ 'por', 1 ],
  [ 'src', 0.8797557538750587 ],
  [ 'glg', 0.8708313762329732 ],
  [ 'snn', 0.8633161108501644 ],
  [ 'bos', 0.8172851103804604 ],
  ... 116 more items ]

only

console.log(franc.all('O Brasil caiu 26 posições', {only: ['por', 'spa']}))

Yields:

[ [ 'por', 1 ], [ 'spa', 0.799906059182715 ] ]

ignore

console.log(franc.all('O Brasil caiu 26 posições', {ignore: ['src', 'glg']}))

Yields:

[ [ 'por', 1 ],
  [ 'snn', 0.8633161108501644 ],
  [ 'bos', 0.8172851103804604 ],
  [ 'hrv', 0.8107092531705026 ],
  [ 'lav', 0.810239549084077 ],
  ... 114 more items ]

CLI

Install:

npm install franc-cli --global

Use:

CLI to detect the language of text

Usage: franc [options] <string>

Options:

  -h, --help                    output usage information
  -v, --version                 output version number
  -m, --min-length <number>     minimum length to accept
  -o, --only <string>           allow languages
  -i, --ignore <string>         disallow languages
  -a, --all                     display all guesses

Examples:

# output language
$ franc "Alle menslike wesens word vry"
# afr

# output language from stdin (expects utf8)
$ echo "এটি একটি ভাষা একক IBM স্ক্রিপ্ট" | franc
# ben

# ignore certain languages
$ franc --ignore por,glg "O Brasil caiu 26 posições"
# src

# output language from stdin with only
$ echo "Alle mennesker er født frie og" | franc --only nob,dan
# nob

Supported languages

Package     Languages   Speakers
franc-min   82          8M or more
franc       187         1M or more
franc-all   406         -

Language code

Note that franc returns ISO 639-3 codes (three-letter codes), not ISO 639-1 or ISO 639-2 codes. See also GH-10 and GH-30.

To get more info about the languages represented by ISO 639-3, use iso-639-3. There is also an index available to map ISO 639-3 to ISO 639-1 codes, iso-639-3/to-1.json, but note that not all 639-3 codes can be represented in 639-1.
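
For example, a minimal sketch of such a mapping, assuming the iso-639-3 package is installed alongside franc and that to-1.json is an object keyed by ISO 639-3 codes:

var franc = require('franc')
var iso6393To1 = require('iso-639-3/to-1.json')

var code = franc('Alle menslike wesens word vry') // => 'afr'
iso6393To1[code] // => 'af' (undefined when no ISO 639-1 equivalent exists)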

Ports

Franc has been ported to several other programming languages.

The works franc is derived from have themselves also been ported to other languages.

Derivation

Franc is a derivative work from guess-language (Python, LGPL), guesslanguage (C++, LGPL), and Language::Guess (Perl, GPL). Their creators granted me the rights to distribute franc under the MIT license: respectively, Kent S. Johnson, Jacob R. Rideout, and Maciej Ceglowski.

License

MIT © Titus Wormer

Comments
  • Add support for BCP 47 and output IANA language subtags

    By default, Franc returns ISO 639-3 three-letter language tags, as listed in the Supported languages table.

    We would like Franc to alternatively support outputting IANA language subtags as an option, in compliance with the W3C recommendation for specifying the value of the lang attribute in HTML (and the xml:lang attribute in XML) documents.

    (Two- and three-letter) IANA language codes are used as the primary language subtags in the language tag syntax defined by the IETF's BCP 47, which may be further specified by adding subtags for "extended language", script, region, dialect variants, etc. (RFC 5646 describes the syntax in full). Adding such fine-grained secondary qualifiers is, I guess, out of Franc's scope, but it would nevertheless be very helpful if Franc could at least return the IANA primary language subtags, which, used stand-alone, still comply with the spec.

    On the Web, as the IETF and W3C agree, IANA language subtags and BCP 47 seem to be the de facto industry standard (at least more so than ISO 639-3). Moreover, the naming convention for TeX hyphenation pattern files (such as those used by, i.a., OpenOffice) uses ISO 639 codes, which overlap better with IANA language subtags, too.

    If Franc output IANA language subtags, the return values could be used as-is, without any further post-processing or re-mapping, in, for example, CSS rules specifying hyphenation:

    @media print {
      :lang(nl) { hyphenate-patterns: url(hyphenation/hyph-nl.pat); }
    }
    

    @wooorm:

    1. What is the rationale for Franc defaulting to ISO 639-3 (only)? Is it a "better" standard, and, if so, why?
    2. If you agree it would be a good idea for Franc to support BCP 47 and output IANA language subtags as an option, how would you prefer it to be implemented, and would you accept a PR? (We'd happily contribute.) Would it suffice to add and map them in data/support.json? (A rough sketch follows below.)
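
    For reference, BCP 47 picks the shortest available ISO 639 code as the primary language subtag, so a user-land shim is already possible today. A rough sketch, relying, as an assumption, on the iso-639-3 package's to-1.json index; the helper name is made up:

    var franc = require('franc')
    var iso6393To1 = require('iso-639-3/to-1.json')

    // Prefer the two-letter ISO 639-1 code (BCP 47's shortest-code rule),
    // falling back to the three-letter ISO 639-3 code when none exists.
    function toBcp47PrimarySubtag(text) {
      var code = franc(text)
      return iso6393To1[code] || code
    }

    toBcp47PrimarySubtag('Alle menslike wesens word vry') // => 'af'
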
    opened by rhythmus 12
  • Reference of source document

    It seems that NONE of the languages have sources for the data.json 3-gram model. Is it possible to provide the source documents for each language so that we can review the material and possibly generate 2-gram and 4-gram (or 2/3-, 3/4-, or 2/3/4-gram combo) models?

    opened by DonaldTsang 10
  • Problems with franc and Uzbek (uzb, uzn, uzs)

    I have been testing franc and found that Uzbek (my native language) is not detected properly. I tested with large data sets. Can I make a contribution? Also, there is an issue with the naming convention of the language code here: 'uzn' (Northern Uzbek) has never been a language in linguistics, but I wonder how it became an ISO 639 identifier.

    opened by muminoff 10
  • BUG: Basic tests show that franc is extremely inaccurate

    > franc.all('Hola amiga', { only: [ 'eng', 'spa', 'por', 'ita', 'fra' ] })
    [
      [ 'spa', 1 ],
      [ 'ita', 0.9323770491803278 ],
      [ 'fra', 0.5942622950819672 ],
      [ 'por', 0.5368852459016393 ],
      [ 'eng', 0 ]
    ]
    > franc.all('Hola mi amiga', { only: [ 'eng', 'spa', 'por', 'ita', 'fra' ] })
    [
      [ 'ita', 1 ],
      [ 'spa', 0.6840958605664488 ],
      [ 'fra', 0.6318082788671024 ],
      [ 'por', 0.08714596949891062 ],
      [ 'eng', 0 ]
    ]
    > franc.all('Ciao amico!', { only: [ 'eng', 'spa', 'por', 'ita', 'fra' ] })
    [
      [ 'spa', 1 ],
      [ 'por', 0.9940758293838863 ],
      [ 'ita', 0.9170616113744076 ],
      [ 'eng', 0.6232227488151658 ],
      [ 'fra', 0.46563981042654023 ]
    ]
    

    These results are all completely inaccurate.

    opened by niftylettuce 8
  • Make MAX_LENGTH an options parameter

    Hello!

    First of all, thank you for this wonderful project.

    It seems that franc limits the text sample to analyse to a hard-coded 2048 characters in these lines:

    https://github.com/wooorm/franc/blob/5842af9c1a74ffb47ebe3307bfc61cf29b6e842e/packages/franc/index.js#L21 https://github.com/wooorm/franc/blob/5842af9c1a74ffb47ebe3307bfc61cf29b6e842e/packages/franc/index.js#L93

    Could this MAX_LENGTH const be part of options? It seems to me the limit exists for speed reasons, but I care more about accuracy than speed.

    I am reading web pages that have parts in more than one language and need to detect the most-used language, but maybe the first 2048 characters are in the less-used language.

    Sorry if I misinterpreted the code and it is not doing what I thought.
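
    Until then, one possible user-land workaround is to classify the document in chunks that fit under the limit and tally the guesses. A rough sketch; the detectLong helper, the chunk size, and the majority vote are all just assumptions:

    var franc = require('franc')

    // Classify each chunk separately and return the most frequent guess,
    // so text beyond the first 2048 characters still influences the result.
    function detectLong(text, chunkSize) {
      var counts = {}
      for (var i = 0; i < text.length; i += chunkSize) {
        var guess = franc(text.slice(i, i + chunkSize))
        if (guess !== 'und') counts[guess] = (counts[guess] || 0) + 1
      }
      var codes = Object.keys(counts)
      codes.sort(function (a, b) { return counts[b] - counts[a] })
      return codes[0] || 'und'
    }

    // `pageText` is a hypothetical string holding the whole page's text.
    detectLong(pageText, 2048)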

    opened by porkopek 8
  • Explain the output of 'all'

    The results of 'all' consist of a language code and a score. I've guessed that the lowest number is the detected language, but what can be learned from the score? It doesn't seem to be documented.

    I'm looking to detect the language of job titles in English and French only (because Canada). I was getting results all over the place using just franc(jobTitle), but by whitelisting English and French and then applying a threshold to the score I was able to tune in a much more accurate result (still a 3.92% error rate over 1020 job titles, but it was in the 25% range before the threshold). Is this a good use for the score, or am I just getting lucky?
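
    For the curious, the setup described above could look roughly like this; the jobTitleLanguage helper and the 0.9 cut-off are arbitrary assumptions to tune against your own data:

    var franc = require('franc')

    // Restrict guesses to English and French, then treat the result as
    // undetermined when the runner-up scores too close to the winner.
    function jobTitleLanguage(title) {
      var guesses = franc.all(title, {only: ['eng', 'fra']})
      if (guesses.length > 1 && guesses[1][1] > 0.9) return 'und'
      return guesses[0][0]
    }

    jobTitleLanguage('Chef de projet informatique') // => 'fra' (ideally)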

    opened by stockholmux 8
  • Problems with latin alphabet languages

    A term like "yellow flicker beat" suggests German, with English (correct) quite far below.

    Can you explain how this would work?

    I would like to use franc in combination with a spell checker, first detecting the language and then looking up correct words with a spell checker using the identified language.

    opened by djui 8
  • Some Japanese are detected as Chinese mandarin

    Hi, I see something strange about Japanese detection.

    If I put in text translated to Japanese by Google Translate: 裁判の周辺のラオスにUターンした元元兵士

    the lib detects it and returns 'jpn'. But if I put in Japanese text from Yahoo Japan or Amazon Japan: ここ最近、よく拡散されたつぶやきや画像をまとめてご紹介。気になるも

    it returns 'cmn'. Does anyone know why?

    opened by ThisIsRoy1 7
  • Consistency on ISO standards for easier integration.

    Revisiting #10: I think it's great that you support languages not found in the other ISO standards.

    But for those that can be found, the fact that Franc sometimes returns the 2T code and other times the 2B code makes it really hard to map without huge lists.

    For instance:

    • arm matches the 2B code for Armenian, but not 2T or 3, which are 'hye'
    • ces, on the other hand, matches 2T and 3, while 2B is 'cze'

    Returning one scheme or the other without consistency makes integration with these standards difficult.

    I agree that for languages that can't be found in the other standards a solution must be devised, and that is great! But for those that do match, adhering to one scheme or the other would be very helpful.

    Thanks, best regards, Rafa.

    opened by RafaPolit 6
  • Getting weird results

    Hey @wooorm, am I doing something wrong here?

    > apps.forEach(app => console.log(franc(app.description), app.description))
    
    eng A universal clipboard managing app that makes it easy to access your clipboard from anywhere on any device
    fra 5EPlay CSGO Client
    nob Open-source Markdown editor built for desktop
    eng Communication tool to optimize the connection between people
    vmw Wireless HDMI
    eng An RSS and Atom feed aggregator
    eng A work collaboration product that brings conversation to your files.
    src Pristine Twitter app
    dan A Simple Friendly Markdown Note.
    nno An open source trading platform
    eng A hackable text editor for the 21 st Century
    eng One workspace open to all designers and developers
    nya A place to work + a way to work
    cat An experimental P2P browser
    sco Focused team communications
    sco Bitbloq is a tool to help children to learn and create programs for a microcontroller or robot, and to load them easily.
    eng A simple File Encryption application for Windows. Encrypt your bits.
    eng Markdown editor witch clarity +1
    eng Text editor with the power or Markdown
    eng Open-sourced note app for programmers
    sco Web browser that automatically blocks ads and trackers
    bug Facebook Messenger app
    dan Markdown editor for Mac / Windows / Linux
    fra Desktop build status notifications
    sco Group chat for global teams
    src Your rubik's cube solves
    sco Orthodox web file manager with console and editor
    cat Game development tools
    sco RPG style coding application
    deu Modern browser without tabs
    eng Your personal galaxy of inspiration
    sco A menubar/taskbar Gmail App for Windows, macOS and Linux.
    
    opened by zeke 6
  • Inaccurate detection examples

    Here are just a few inaccuracies I've come across testing this package:

    franc('iphone unlocked') // returns 'ibb' instead of 'eng'
    franc('new refrigerator') // returns 'dan' instead of 'eng'
    franc('макбук копмьютер очень хороший') // returns 'kir' instead of 'rus'
    
    opened by demisx 6
  • Improved accuracy for small documents

    I'd like to play with patching franc, or making some alternative to it, that can detect the language of small documents much more accurately.

    First of all, is this something that could be interesting to merge into franc itself?

    Secondly, I'm almost clueless about language classification; could trying the following things make sense?

    1. Storing more than 300 trigrams, maybe 400 or so.
    2. Using quadgrams or bigrams rather than trigrams.
    3. Extracting the trigrams from a longer and more diverse document than the UDHR.

    From a shallow reading of this paper on n-grams, it sounds to me like n-grams may be fundamentally ill-suited for short documents, because there just isn't enough data to reliably reconstruct the top 300 or so n-grams from them, maybe 🤔.

    CLD3 seems to feed unigrams, bigrams, and trigrams to some neural network, and that somehow works much better for smaller texts. I'm not sure how or why; maybe that's the way to go.

    Any other ideas that I should try?
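
    For anyone who wants to experiment, the trigram tallying at the heart of this family of classifiers is small enough to prototype directly. A minimal sketch, not franc's actual implementation; the cleanup only strips ASCII punctuation and digits:

    // Count character trigrams in a cleaned, padded string: the kind of
    // profile a franc-style classifier compares against its language models.
    function trigrams(text) {
      var clean = ' ' + text.toLowerCase().replace(/[\u0021-\u0040]+/g, ' ').replace(/\s+/g, ' ').trim() + ' '
      var counts = {}
      for (var i = 0; i + 3 <= clean.length; i++) {
        var gram = clean.slice(i, i + 3)
        counts[gram] = (counts[gram] || 0) + 1
      }
      return counts
    }

    trigrams('Hello world') // => { ' he': 1, 'hel': 1, 'ell': 1, ... }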

    opened by fabiospampinato 19
  • Probability normalization

    Currently franc often returns a probability close to 1 for many languages; IMO all these probabilities should be normalized to add up to 1.

    Also, there always seems to be a language at the top with probability 1. This makes it difficult to judge how sure the "model" is about the detection, which would be another interesting data point to have.
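
    A user-land normalization is straightforward in the meantime. A sketch, assuming the current shape of franc.all's output; the normalized helper is made up:

    var franc = require('franc')

    // Rescale the relative scores so they sum to 1, giving something
    // closer to a probability distribution over the candidate languages.
    function normalized(text, options) {
      var guesses = franc.all(text, options)
      var total = guesses.reduce(function (sum, pair) { return sum + pair[1] }, 0)
      return guesses.map(function (pair) {
        return [pair[0], total > 0 ? pair[1] / total : 0]
      })
    }

    normalized('O Brasil caiu 26 posições', {only: ['por', 'spa']})
    // => roughly [['por', 0.56], ['spa', 0.44]]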

    opened by fabiospampinato 3
  • Some Chinese sentences are detected as Japanese

    Sentence 1:

    特別推薦的必訪店家「ヤマシロヤ」,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面

    franc returns jpn with score 1, while Google Translate correctly detects Chinese.

    Sentence 2:

    特別推薦的必訪店家,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面

    franc returns cmn with score 1, and Google Translate also correctly detects Chinese.

    Sentence 1 is almost entirely Chinese characters, with just 5 katakana characters, yet its result is incorrectly jpn.

    Sentence 2 is entirely Chinese characters, and its result is correctly cmn.

    Maybe the result is related to #77

    opened by kewang 3
  • Use languages' alphabets to make detection more accurate

    Что это за язык? is a Russian sentence, which is detected as Bulgarian (bul 1, rus 0.938953488372093, mkd 0.9353197674418605). However, neither Bulgarian nor Macedonian has the letters э and ы in its alphabet.

    Same with Чекаю цієї хвилини., which is Ukrainian but is detected as Northern Uzbek with probability 1, whereas Ukrainian gets only 0.33999999999999997. However, the letters є and ї are used only in Ukrainian, whereas the Uzbek Cyrillic alphabet doesn't include as many as five letters from this sentence, namely ю, ц, і, є, and ї.

    I know that Franc is not supposed to be good with short input strings, but taking alphabets into account seems to be a promising way to improve accuracy.
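
    As a rough illustration of the idea, a post-processing filter could drop candidates whose alphabets cannot produce the input. A sketch with a hand-rolled, illustrative (and far from exhaustive) exclusion table; the helper and the letter sets are assumptions:

    var franc = require('franc')

    // Letters that do NOT occur in each language's alphabet (illustrative only).
    var excluded = {
      bul: /[эыєії]/,
      mkd: /[эыєії]/,
      uzn: /[юцієї]/
    }

    function detectWithAlphabets(text) {
      var guesses = franc.all(text).filter(function (pair) {
        var letters = excluded[pair[0]]
        return !letters || !letters.test(text)
      })
      return guesses.length > 0 ? guesses[0][0] : 'und'
    }

    // 'это' and 'язык' contain э and ы, so bul and mkd are filtered out.
    detectWithAlphabets('Что это за язык?')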

    opened by thorn0 15