Python character encoding detector

Overview

Chardet: The Universal Character Encoding Detector

Detects
  • ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
  • EUC-KR, ISO-2022-KR, Johab (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  • ISO-8859-5, windows-1251 (Bulgarian)
  • ISO-8859-1, windows-1252 (Western European languages)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)

Note

Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily disabled until we can retrain the models.

Requires Python 3.7+.

Installation

Install from PyPI:

pip install chardet
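
Once installed, basic use from Python looks like this (a minimal sketch; detect expects bytes and returns a dict with encoding, confidence, and language keys):

import chardet

raw = open('somefile', 'rb').read()  # bytes, not str
result = chardet.detect(raw)
# e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
print(result['encoding'], result['confidence'])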

Documentation

For users, docs are now available at https://chardet.readthedocs.io/.

Command-line Tool

chardet comes with a command-line script which reports on the encodings of one or more files:

% chardetect somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0
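
For large files, the same result can be produced incrementally from Python with UniversalDetector, feeding the detector chunks until it is confident (a minimal sketch of that pattern):

from chardet.universaldetector import UniversalDetector

def detect_file(path):
    detector = UniversalDetector()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(4096), b''):
            detector.feed(chunk)
            if detector.done:  # confident enough; skip the rest of the file
                break
    detector.close()
    return detector.result  # dict with encoding, confidence, and language

print(detect_file('somefile'))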

About

This is a continuation of Mark Pilgrim's excellent original chardet port from C, and Ian Cordasco's charade Python 3-compatible fork.

Maintainer: Dan Blanchard
Comments
  • New language models added; old inaccurate models were rebuilt. Hungarian test files changed. Script for language-model building added

    Text in the Hungarian language can't contain many English words inside the detected text. For example, XML files may contain more English words because of tag names and similar tokens. This detector is based on letter frequency. A second problem arises if the Hungarian text has many sentences in uppercase.

    opened by ghost 28
  • Modified filter_english_with_letters to mimic the behavior of Mozilla's version.

    This change helps pass three unit tests that were failing before. I have also tested the changes by comparing the output of this function with Mozilla's version over some large randomly generated byte strings and so far so good.

    opened by rsnair2 19
  • Add Python 3 support (and drop support for < 2.6)

    Most of the credit for this goes to @bsidhom. I just took his Python 3 port and added a bunch of __future__ imports and the occasional from io import open to make it backward compatible with 2.6 and 2.7.

    I did some minor clean-up things like sorting imports and things like that. Oh and I added a .gitattributes file to ensure line endings are consistent.

    @erikrose Are you still actively maintaining this? I notice there are a few outstanding pull requests and I just want to make sure the version on PyPI is 2/3 compatible soon.

    opened by dan-blanchard 18
  • Certain input creates extremely long runtime and memory leak

    I am using chardet as part of a web crawler written in Python 3. I noticed that over time (many hours), the program consumes all memory. I narrowed the problem down to a single call of the chardet.detect() method for certain web pages.

    After some testing, it seems that chardet has problems with some special input, and I managed to get a sample of such an input. On my machine it consumes about 220 MB of memory (although the input is only 2.5 MB) and takes about 1 minute 22 seconds to process (in contrast to 43 ms when the file is truncated to about 2 MB). It does not seem to be limited to Python 3; in Python 2 the memory consumption is even worse (312 MB).

    Versions:

    Fedora release 20 (Heisenbug) x86_64
    chardet-2.2.1 (via pip)
    python3-3.3.2-11.fc20.x86_64
    python-2.7.5-11.fc20.x86_64

    How to reproduce:

    I cannot attach any files to this issue so I uploaded them to my dropbox account: https://www.dropbox.com/sh/26dry8zj18cv0m1/sKgP_E44qx/chardet_test.zip Please let me know of a better place where to put it if necessary. Here is an overview of the content and the results:

    setup='import chardet; html = open("mem_leak_html.txt", "rb").read()'
    python3 -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
    # produces: 10 loops, best of 3: 43 ms per loop
    python3 -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
    # produces: 1 loops, best of 3: 1min 22s per loop
    python3 mem_leak_test.py
    # produces:
    # Good input left 2.65 MB of unfreed memory.
    # Bad input left 220.16 MB of unfreed memory.
    
    python -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
    # produces: 10 loops, best of 3: 41.7 ms per loop
    python -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
    # produces: 10 loops, best of 3: 111 sec per loop
    python mem_leak_test.py
    # produces:
    # Good input left 3.00 MB of unfreed memory.
    # Bad input left 312.00 MB of unfreed memory.
    
    mem_leak_test.py:
    import resource
    import chardet
    import gc
    
    # ru_maxrss is reported in KB on Linux, so this yields MB
    mem_use = lambda: resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    html = open("mem_leak_html.txt", "rb").read()
    
    def test(desc, instr):
        gc.collect()
        mem_start = mem_use()
        chardet.detect(instr)    
        gc.collect()
        mem_used = mem_use() - mem_start
        print('%s left %.2f MB of unfreed memory.' % (desc, mem_used))    
    
    test('Good input', html[:2543482])
    test('Bad input', html[:2543483])
    
    bug help wanted 
    opened by radeklat 17
  • UTF detection when missing Byte Order Mark

    This change adds heuristic detection of UTF-16 and UTF-32 files when they are missing their byte order marks.

    At present we have no strategy for detecting the format of these files. Feel free to give feedback on the PR by the way, happy to have other eyes on it.

    Note that I report these files as UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE rather than UTF-16 and UTF-32. This is justified for the following reasons: it is quite material whether the data is little-endian or big-endian, and Python will not decode these files correctly when simply passed decode("utf-16"), because it has to assume an endianness. This matters less for files with BOMs, as UTF-aware readers will generally inspect the BOM and auto-decode the file, but for data missing the BOM it is really important to be told the endianness to have any hope of reading the file.

    We could change the reporting of files with BOMs to include the endianness as well, but I've avoided taking a position on that for the moment.
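
    To illustrate why the endianness matters (a sketch; CPython's bare utf-16 codec assumes little-endian on typical little-endian platforms when no BOM is present):

    data = 'hi'.encode('utf-16-be')  # b'\x00h\x00i' -- big-endian, no BOM
    data.decode('utf-16-be')         # 'hi' -- correct
    data.decode('utf-16')            # '\u6800\u6900' -- mojibake from the wrong endianness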

    opened by jpz 16
  • Changing license

    This feels strange to be posing as a question, since I'm one of the co-maintainers, but @sigmavirus24 and @erikrose, do you know if it's okay/legal for us to change the license of chardet? Because it was started by Mark Pilgrim I feel like it's kind of a nebulous question, because he's not someone you can just email, and he has nothing to do with development anymore. I would really like to change the license to at least be MPL, since that's what the C++ version is, and our setup currently mirrors that code pretty closely.

    I'm not a fan of the LGPL and feel weird having a project I work on use it.

    question 
    opened by dan-blanchard 15
  • Add detection for MacRoman encoding

    MacRoman is not in particularly common use anymore, as it has been deprecated by Mac OS for over a decade. However, there are programs such as Microsoft Office for Mac that didn't get the memo, and will often output in MacRoman when they write plain text files.

    This patch allows chardet to correctly detect MacRoman, instead of calling it something random and incorrect like ISO-8859-2. The MacRoman detector works similarly to the Latin-1 detector, but starts at a lower probability.

    I hope this is the right way to do it. There is surprisingly little support in chardet for adding a new single-byte Latin encoding.

    opened by rspeer 15
  • Add Hypothesis-based test of chardet

    The concept here is pretty simple: this tries to test the invariant that if a string comes from valid Unicode and is encoded in one of the chardet-supported encodings, then chardet should detect some encoding.

    More nuanced tests are possible (e.g. asserting that the string should be decodable from the detected encoding) but given that even this test is demonstrating a bunch of bugs this seemed to be a good starting point.

    This is (more or less) the test that caught #65, #64 and #63. #62 had one extra line in it to try to reencode the data as the reported format.

    Notes:

    1. This test is currently failing. I'm pretty sure this is because of issues it's finding in the code, not issues with the test.
    2. min_size=100 is to rule out bugs that come solely from the length, prompted by your saying that short strings aren't really supported. Anecdotally all of the bugs that have been found so far don't depend on the length and min_size=1 would have been fine (leaving min_size alone is also valid, but I assume '' having a None encoding is intended behaviour)
    opened by DRMacIver 14
  • Failing to guess a single MS-apostrophe

    I have a page of ASCII text with a single Microsoft apostrophe, chr(8217), that is detected as ISO-8859-2.

    #1. Create problematic sample
    >>> s = 'today' + chr(8217) + 's research'
    >>> s
    'today’s research'
    >>> b = s.encode('windows-1252')
    >>> b
    b'today\x92s research'
    
    #2. Attempt to decode it
    >>> chardet.detect(b)
    {'encoding': 'ISO-8859-2', 'confidence': 0.8060609643099236}
    >>> b.decode('ISO-8859-2')
    'today\x92s research'
    
    #3. Now try the correct encoding
    >>> b.decode('windows-1252')
    'today’s research'
    

    This text is very typical of anything created using a Microsoft editor. Furthermore, the latest version of Firefox detects it correctly. I am using Python 3.3. Any help is appreciated.

    opened by shompol 13
  • Add upstream changes and clean up where possible

    This is very much a work in progress at the moment, and I'm just creating the PR to make it easier for me to keep track of Travis results.

    I have a few goals for this branch:

    1. Pull in changes from Mozilla's upstream code. There aren't as many as I had initially expected but there are some.
    2. Improve PEP8 compliance all over the place. The previous maintainers tried to keep variable names identical to the C code, presumably to ease the comparison with the Mozilla code, but we're going to be diverging from upstream after pulling in the changes mentioned in 1. Basically, Mozilla seems very likely to abandon their character encoding detector in the near future and switch to using ICU, but ICU doesn't support all of the codecs we currently do, because it is more web-focused. If our goal here is to be a truly universal character encoding detector, we'll need to go our own way in the future in that respect.
    3. Make the unit tests pass, or at the very least make it obvious that the tests are actually failing (instead of ignoring the failures like our current Travis build does).

    So far, I've done a little bit of point 1 and updated the Travis testing setup to use nose and report test coverage via Coveralls.

    enhancement 
    opened by dan-blanchard 13
  • Don't indicate byte order for UTF-16/32 with given BOM

    If passed a string starting with \xff\xfe (little-endian byte order mark) or \xfe\xff (big-endian byte order mark), the encoding is detected as UTF-16LE, UTF-32LE, UTF-16BE or UTF-32BE respectively.

    However, as the byte order mark is given in the string, the encoding should be simply UTF-16 or UTF-32. Otherwise bytes.decode() will fail or preserve the byte order mark:

    s = 'foo'.encode('UTF-16')
    encoding = chardet.detect(s)['encoding']  # "UTF-16LE"
    s.decode(encoding)                        # "\ufefffoo"
    
    s = codecs.BOM_BE + 'foo'.encode('UTF-16BE')
    encoding = chardet.detect(s)['encoding']  # "UTF-16BE"
    s.decode(encoding)                        # "\ufefffoo"
    

    Hence, code that uses chardet to detect the encoding before decoding data would need to wrap chardet.detect in the following inconvenient and counter-intuitive way:

    encoding = chardet.detect(enc)['encoding']
    if encoding in ('UTF-16LE', 'UTF-16BE'):
      dec = enc.decode('UTF-16')
    elif encoding in ('UTF-32LE', 'UTF-32BE'):
      dec = enc.decode('UTF-32')
    else:
      dec = enc.decode(encoding)
    

    This PR changes the behavior to return simply UTF-16 or UTF-32 when a byte order mark is found, so that the detected encoding can be passed unchanged to bytes.decode().

    opened by snoack 12
  • Allow running of the package via `python3 -m chardet ...`

    I want to be able to execute the chardet main script (packaged as an executable) by running python3 -m chardet .... Currently it doesn't work; it would be great if it did.
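
    A conventional way to support this (a hypothetical sketch, not the project's actual implementation) is a chardet/__main__.py that delegates to the existing CLI entry point in chardet/cli/chardetect.py:

    # chardet/__main__.py -- hypothetical module enabling `python3 -m chardet`
    from chardet.cli.chardetect import main

    if __name__ == '__main__':
        main()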

    opened by DeflateAwning 1
  • Documentation licensed only for non-commercial and personal use found

    Hi,

    In the file, 'https://github.com/chardet/chardet/blob/main/tests/windows-1255-hebrew/hydepark.hevre.co.il.7957.xml', we have found the following license text:

    " This copy is for your personal, non-commercial use only. To order presentation-ready copies for distribution to your colleagues, clients or customers, use the Order Reprints tool at the bottom of any article or visit: www.djreprints.com. " This may cause issues even for open source projects that allows commercial use.

    Can you please let us know if there is an option to retain the file even for commercial use? Is it possible to remove the content that is only for non-commercial and personal use?

    Regards, Rahul

    opened by rahulmohang 0
  • Fix broken CP949 state machine

    Abstract

    The current CP949 state machine has some false positives and incorrectly marks valid CP949 text as an error. This PR rewrites the state transition table to comply with the CP949 specification.

    Details

    These are some cases in which a false-positive error occurs in the current implementation.

    • (0xAD68) The first byte, 0xAD, is classified as class 8, and in the START state class 8 transitions to the ERROR state. But this is valid CP949.

    • (0xC652) The first byte is classified as class 9 and the second byte as class 5. In the START state, class 9 transitions to state 6, and in state 6, class 5 transitions to the ERROR state. But this is valid CP949.
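
    Both sequences can be sanity-checked against Python's built-in cp949 codec, which accepts them (a quick check consistent with the claims above):

    b'\xad\x68'.decode('cp949')  # decodes without error, so 0xAD68 is valid CP949
    b'\xc6\x52'.decode('cp949')  # likewise for 0xC652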

    Test

    I have tested the new state machine (to-be) against all characters in CP949 with the following code, and it printed Success. Testing the current implementation (as-is) instead prints Error! at byte 15479.

    from chardet.codingstatemachine import CodingStateMachine
    from chardet.mbcssm import CP949_SM_MODEL
    
    sm = CodingStateMachine(CP949_SM_MODEL)
    
    with open('./path/to/cp949-chars.txt', 'rb') as f:
        data = f.read()
    
    for i, byte in enumerate(data):
        state = sm.next_state(byte)

        # MachineState.ERROR == 1 in chardet.enums
        if state == 1:
            print("Error! at byte %d" % i)
            break

    if state != 1:
        print("Success! :)")
    

    I couldn't upload the CP949 characters to the test fixtures folder, as that would make the test fail because of the frequency-based probing, which would not successfully identify the data as CP949 (it is just a plain listing of all possible CP949 characters).

    opened by HelloWorld017 2
  • chardet 5.0 KeyError with Python 3.10 on Windows

    Yesterday I encountered a strange CI failure for our Windows GitHub CI workflows which had been running fine until then. The Python 3.7 job passed fine but the Python 3.10 job failed.

    https://github.com/deluge-torrent/deluge/actions/workflows/ci.yml?query=branch%3Adevelop

    The only difference I could find from a diff of the logs was the new chardet 5.0.0 being pulled in. So I pinned chardet to 4.0.0 and CI is passing again.

    GitHub Actions Environment:

    Virtual Environment: windows-2022 (20220626.1)
    Python 3.10.5
    

    Just to note that I also tested with windows-2019 and the same error occurs.

    The traceback is rather cryptic since it comes from pytest but this is all there is from the job:

    INTERNALERROR> Traceback (most recent call last):
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\_pytest\main.py", line 264, in wrap_session
    INTERNALERROR>     config._do_configure()
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\_pytest\config\__init__.py", line 995, in _do_configure
    INTERNALERROR>     self.hook.pytest_configure.call_historic(kwargs=dict(config=self))
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_hooks.py", line 277, in call_historic
    INTERNALERROR>     res = self._hookexec(self.name, self.get_hookimpls(), kwargs, False)
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_manager.py", line 80, in _hookexec
    INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_callers.py", line 60, in _multicall
    INTERNALERROR>     return outcome.get_result()
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_result.py", line 60, in get_result
    INTERNALERROR>     raise ex[1].with_traceback(ex[2])
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_callers.py", line 39, in _multicall
    INTERNALERROR>     res = hook_impl.function(*args)
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\_pytest\faulthandler.py", line 27, in pytest_configure
    INTERNALERROR>     import faulthandler
    INTERNALERROR>   File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
    INTERNALERROR>   File "<frozen importlib._bootstrap>", line 171, in __enter__
    INTERNALERROR>   File "<frozen importlib._bootstrap>", line 123, in acquire
    INTERNALERROR> KeyError: 1832
    
    opened by cas-- 4
  • test_detect_all_and_detect_one_should_agree fails on Python 3.11b3

    $ python3.11 --version
    Python 3.11.0b3
    $ python3.11 -m venv _e
    $ . _e/bin/activate
    (_e) $ pip install -e .
    (_e) $ pip install pytest hypothesis
    (_e) $ pytest
    

    results in:

    ====================================================== FAILURES ======================================================
    ____________________________________ test_detect_all_and_detect_one_should_agree _____________________________________
    
    txt = 'Ā𐀀', enc = 'utf-8', _ = HypothesisRandom(generated data)
    
        @given(
            st.text(min_size=1),
            st.sampled_from(
                [
                    "ascii",
                    "utf-8",
                    "utf-16",
                    "utf-32",
                    "iso-8859-7",
                    "iso-8859-8",
                    "windows-1255",
                ]
            ),
            st.randoms(),
        )
        @settings(max_examples=200)
        def test_detect_all_and_detect_one_should_agree(txt, enc, _):
            try:
                data = txt.encode(enc)
            except UnicodeEncodeError:
                assume(False)
            try:
                result = chardet.detect(data)
                results = chardet.detect_all(data)
    >           assert result["encoding"] == results[0]["encoding"]
    E           AssertionError: assert None == 'utf-8'
    
    test.py:183: AssertionError
    
    The above exception was the direct cause of the following exception:
    
        @given(
    >       st.text(min_size=1),
            st.sampled_from(
                [
                    "ascii",
                    "utf-8",
                    "utf-16",
                    "utf-32",
                    "iso-8859-7",
                    "iso-8859-8",
                    "windows-1255",
                ]
            ),
            st.randoms(),
        )
    
    test.py:160: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    txt = 'Ā𐀀', enc = 'utf-8', _ = HypothesisRandom(generated data)
    
        @given(
            st.text(min_size=1),
            st.sampled_from(
                [
                    "ascii",
                    "utf-8",
                    "utf-16",
                    "utf-32",
                    "iso-8859-7",
                    "iso-8859-8",
                    "windows-1255",
                ]
            ),
            st.randoms(),
        )
        @settings(max_examples=200)
        def test_detect_all_and_detect_one_should_agree(txt, enc, _):
            try:
                data = txt.encode(enc)
            except UnicodeEncodeError:
                assume(False)
            try:
                result = chardet.detect(data)
                results = chardet.detect_all(data)
                assert result["encoding"] == results[0]["encoding"]
            except Exception as exc:
    >           raise RuntimeError(f"{result} != {results}") from exc
    E           RuntimeError: {'encoding': None, 'confidence': 0.0, 'language': None} != [{'encoding': 'utf-8', 'confidence': 0.505, 'language': ''}]
    
    test.py:185: RuntimeError
    ----------------------------------------------------- Hypothesis -----------------------------------------------------
    Falsifying example: test_detect_all_and_detect_one_should_agree(
        txt='Ā𐀀', enc='utf-8', _=HypothesisRandom(generated data),
    )
    ============================================== short test summary info ===============================================
    FAILED test.py::test_detect_all_and_detect_one_should_agree - RuntimeError: {'encoding': None, 'confidence': 0.0, '...
    ================================ 1 failed, 375 passed, 6 xfailed, 1 xpassed in 9.79s =================================
    

    The same steps succeed with Python 3.10.4.

    opened by musicinmybrain 3
Releases (latest: 5.1.0)
  • 5.1.0 (Dec 1, 2022)

    Features

    • Add should_rename_legacy argument to most functions, which will rename older encodings to their more modern equivalents (e.g., GB2312 becomes GB18030) (#264, @dan-blanchard); see the sketch after this list
    • Add capital letter sharp S and ISO-8859-15 support (#222, @SimonWaldherr)
    • Add a prober for MacRoman encoding (#5 updated as c292b52a97e57c95429ef559af36845019b88b33, Rob Speer and @dan-blanchard)
    • Add --minimal flag to chardetect command (#214, @dan-blanchard)
    • Add type annotations to the project and run mypy on CI (#261, @jdufresne)
    • Add support for Python 3.11 (#274, @hugovk)
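
    A sketch of the new should_rename_legacy argument (hypothetical sample text; exact results depend on the input):

    import chardet

    data = ('这是一段用来测试编码检测的简体中文文本。' * 5).encode('gb2312')  # hypothetical sample
    chardet.detect(data)                             # may report GB2312
    chardet.detect(data, should_rename_legacy=True)  # reports the modern superset GB18030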

    Fixes

    • Clarify LGPL version in License trove classifier (#255, @musicinmybrain)
    • Remove support for EOL Python 3.6 (#260, @jdufresne)
    • Remove unnecessary guards for non-falsey values (#259, @jdufresne)

    Misc changes

    • Switch to Python 3.10 release in GitHub actions (#257, @jdufresne)
    • Remove setup.py in favor of build package (#262, @jdufresne)
    • Run tests on macos, Windows, and 3.11-dev (#267, @dan-blanchard)
  • 5.0.0 (Jun 25, 2022)

    ⚠️ This release is the first release of chardet that no longer supports Python < 3.6 ⚠️

    In addition to that change, it features the following user-facing changes:

    • Added a prober for Johab Korean (#207, @grizlupo)
    • Added a prober for UTF-16/32 BE/LE (#109, #206, @jpz)
    • Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, and Turkish, which should help prevent future errors with those languages
    • Improved XML tag filtering, which should improve accuracy for XML files (#208)
    • Tweaked SingleByteCharSetProber confidence to match latest uchardet (#209)
    • Made detect_all return child prober confidences (#210)
    • Updated examples in docs (#223, @domdfcoding)
    • Documentation fixes (#212, #224, #225, #226, #220, #221, #244, from too many contributors to mention)
    • Minor performance improvements (#252, @deedy5)
    • Add support for Python 3.10 when testing (#232, @jdufresne)
    • Lots of little development cycle improvements, mostly thanks to @jdufresne
  • 4.0.0 (Dec 10, 2020)

    ⚠️ This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support 3.6+ ⚠️

    Major Changes

    This release is multiple years in the making, and provides some quality of life improvements to chardet. The primary user-facing changes are:

    1. Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See #121 for details)
    2. The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.
    3. There is now a chardet.detect_all function that returns a list of possible encodings for the input with associated confidences (see the sketch after this list).
    4. We have dropped support for Python 2.6, 3.4, and 3.5 as they are all past end-of-life.
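
    A sketch of detect_all (hypothetical sample; candidates come back ordered by confidence):

    import chardet

    raw = 'Résumé à la façade'.encode('windows-1252')  # hypothetical sample
    for guess in chardet.detect_all(raw):
        print(guess)  # each candidate is a dict with encoding and confidence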

    The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).

    Benchmarks

    Running on a MacBook Pro (15-inch, 2018) with 2.2GHz 6-core i7 processor and 32GB RAM

    old version (chardet 3.0.4)

    Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep  8 2020, 12:19:42)
    [Clang 11.0.3 (clang-1103.0.32.62)]
    --------------------------------------------------------------------------------
    Calls per second for each encoding:
    ascii: 25559.439366240098
    big5: 7.187002209518091
    cp932: 4.71090956645177
    cp949: 2.937256786994428
    euc-jp: 4.870580412090848
    euc-kr: 6.6910755971933416
    euc-tw: 87.71098043480079
    gb2312: 6.614302607154443
    ibm855: 27.595893549680685
    ibm866: 29.93483661732791
    iso-2022-jp: 3379.5052775763434
    iso-2022-kr: 26181.67290886392
    iso-8859-1: 120.63424740403983
    iso-8859-5: 32.65106262196898
    iso-8859-7: 62.480089080556084
    koi8-r: 13.72481001727257
    maccyrillic: 33.018537255804496
    shift_jis: 4.996013583677438
    tis-620: 14.323112928341818
    utf-16: 166771.53081510935
    utf-32: 198782.18009478672
    utf-8: 13.966236809766901
    utf-8-sig: 193732.28637413395
    windows-1251: 23.038910006925768
    windows-1252: 99.48409117053738 
    windows-1255: 6.336261495718825
    
    Total time: 357.05358052253723s (10.054513372323958 calls per second)
    

    new version (chardet 4.0.0)

    
    Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep  8 2020, 12:19:42)
    [Clang 11.0.3 (clang-1103.0.32.62)]
    --------------------------------------------------------------------------------
    Calls per second for each encoding:
    ascii: 38176.31067961165
    big5: 12.86915132656389
    cp932: 4.656400877065864
    cp949: 7.282976434315926
    euc-jp: 4.329381447610525
    euc-kr: 8.16386823884839
    euc-tw: 90.230745070368
    gb2312: 14.248865889128146
    ibm855: 33.30225548069821
    ibm866: 44.181691968506
    iso-2022-jp: 3024.2295767539117
    iso-2022-kr: 25055.57945041816
    iso-8859-1: 59.25262902122995
    iso-8859-5: 39.7069713674529
    iso-8859-7: 61.008422013862194
    koi8-r: 41.21560517643845
    maccyrillic: 31.402474369805002
    shift_jis: 4.9091652743515155
    tis-620: 14.408875278821073
    utf-16: 177349.00634249471
    utf-32: 186413.51111111112
    utf-8: 108.62174360115105
    utf-8-sig: 181965.46637744035
    windows-1251: 43.16933400329809
    windows-1252: 211.27653358317968
    windows-1255: 16.15113643694104
    
    Total time: 268.0230791568756s (13.394368915143872 calls per second)
    
    
    

    Thank you to @aaaxx, @edumco, @hrnciar, @hroncok, @jdufresne, @mdamien, @saintamh, and @xeor for submitting pull requests, and to all of our users for being patient with how long this release has taken.

    Full changelog

    • Convert single-byte charset probers to use nested dicts for language models (#121) @dan-blanchard
    • Add API option to get all the encodings confidence (#111) @mdamien
    • Make sure pyc files are not in tarballs (d7c7343) @dan-blanchard
    • Add benchmark script (d702545, 8dccd00, 726973e, 71a0fad) @dan-blanchard
    • Include license file in the generated wheel package (#141) @jdufresne
    • Drop support for Python 2.6 (#143) @jdufresne
    • Remove unused coverage configuration (#142) @jdufresne
    • Document the chardet package as suitable for production (#144) @jdufresne
    • Pass python_requires argument to setuptools (#150) @jdufresne
    • Update pypi.python.org URL to pypi.org (#155) @jdufresne
    • Typo fix (#159) @saintamh
    • Support pytest 4, don't apply marks directly to parameters (PR #174, Issue #173) @hroncok
    • Test Python 3.7 and 3.8 and document support (#175) @jdufresne
    • Drop support for end-of-life Python 3.4 (#181) @jdufresne
    • Workaround for distutils bug in python 2.7 (#165) @xeor
    • Remove deprecated license_file from setup.cfg (#182) @jdufresne
    • Remove deprecated 'sudo: false' from Travis configuration (#200) @jdufresne
    • Add testing for Python 3.9 (#201) @jdufresne
    • Adds explicit os and distro definitions (#140) @edumco
    • Remove shebang from nonexecutable script (#192) @hrnciar
    • Remove use of deprecated 'setup.py test' (#187) @jdufresne
    • Remove unnecessary numeric placeholders from format strings (#176) @jdufresne
    • Update links (#152) @aaaxx
    • Remove shebang and executable bit from chardet/cli/chardetect.py (#171) @jdufresne
    • Handle weird logging edge case in universaldetector.py (056a2a4) @dan-blanchard
    • Switch from Travis to GitHub Actions (#204) @dan-blanchard
    • Properly set CharsetGroupProber.state to FOUND_IT (PR #203, Issue #202) @dan-blanchard
    • Add language to detect_all output (1e208b7) @dan-blanchard
  • 3.0.4 (Jun 8, 2017)

    This minor bugfix release just fixes some packaging and documentation issues:

    • Fix issue with setup.py where pytest_runner was always being installed. (PR #119, thanks @zmedico)
    • Make sure test.py is included in the manifest (PR #118, thanks @zmedico)
    • Fix a bunch of old URLs in the README and other docs. (PRs #123 and #129, thanks @qfan and @jdufresne)
    • Update documentation to no longer imply we test/support Python 3 versions before 3.3 (PR #130, thanks @jdufresne)
  • 3.0.3 (May 16, 2017)

  • 3.0.2 (Apr 12, 2017)

    Fixes an issue where detect would sometimes return None instead of a dict with the keys encoding, language, and confidence (Issue #113, PR #114).

  • 3.0.1 (Apr 11, 2017)

  • 3.0.0 (Apr 11, 2017)

    This release is long overdue, but still mostly serves as a placeholder for the impending 4.0.0 release, which will have retrained models for better accuracy. For now, this release will get the following improvements up on PyPI:

    • Added support for Turkish ISO-8859-9 detection (PR #41, thanks @queeup)
    • Commented out large unused sections of Big5 and EUC-KR tables to save memory (8bc4b89)
    • Removed Python 3.2 from testing, but added 3.4-3.6
    • Ensure that stdin is open with mode 'rb' for chardetect CLI. (PR #38, thanks @lpsinger)
    • Fixed chardetect crash with non-ascii file names (PR #39, thanks @nkanaev)
    • Made naming conventions more Pythonic throughout (no more mTypicalPositiveRatio, and instead typical_positive_ratio)
    • Modernized test scripts and infrastructure so we've got Travis testing and all that stuff
    • Rename filter_without_english_words to filter_international_words and make it match current Mozilla implementation (PR #44, thanks @rsnair2)
    • Updated filter_english_letters to match C implementation (c6654595)
    • Temporarily disabled Hungarian ISO-8859-2 and Windows-1250 detection because it is very inaccurate (da6c0a079)
    • Allow CLI sub-package to be importable (PR #55)
    • Add a Hypothesis-based test (PR #66, thanks @DRMacIver)
    • Strip endianness from UTF with BOM predictions so that the encoding can be passed directly to bytes.decode() (PR #73, thanks @snoack)
    • Fixed broken links in docs (PR #90, thanks @roskakori)
    • Added early exit to chardetect when encoding is detected instead of looping through entire file (PR #103, thanks @jpz)
    • Use bytearray objects internally instead of wrap_ord calls, which provides a nice performance boost across the board (PR #106)
    • Add language property to probers and UniversalDetector results (PR #180)
    • Mark the 5 known test failures as such so we can have more useful Travis build results in the meantime (d588407)
  • 2.3.0 (Oct 7, 2014)

    In this release, we:

    • Added support for CP932 detection (thanks to @hashy).
    • Fixed an issue where UTF-8 with a BOM would not be detected as UTF-8-SIG (#8); see the sketch after this list.
    • Modified chardetect to use argparse for argument parsing.
    • Moved docs to a gh-pages branch. You can now access them at http://chardet.github.io.
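
    A sketch of the BOM fix (the UTF-8 byte order mark is b'\xef\xbb\xbf'):

    import chardet

    chardet.detect('hello'.encode('utf-8-sig'))  # BOM present -> reported as UTF-8-SIG
    chardet.detect('hello'.encode('utf-8'))      # no BOM; pure ASCII input is reported as ascii
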
  • 2.2.1 (Oct 21, 2014)

  • 2.2.0 (Oct 21, 2014)
