Python character encoding detector

Overview

Chardet: The Universal Character Encoding Detector

Detects
  • ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
  • EUC-KR, ISO-2022-KR, Johab (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  • ISO-8859-5, windows-1251 (Bulgarian)
  • ISO-8859-1, windows-1252 (Western European languages)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)

Note

Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily disabled until we can retrain the models.

Requires Python 3.6+.

Installation

Install from PyPI:

pip install chardet

Documentation

For users, docs are now available at https://chardet.readthedocs.io/.

Command-line Tool

chardet comes with a command-line script which reports on the encodings of one or more files:

% chardetect somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0
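
The same report can be produced from Python. The following is a minimal sketch, not the chardetect implementation: `format_result` and `report` are hypothetical helper names, and the detector is passed in as a callable so the sketch runs without chardet installed. It assumes the detector returns a dict with `encoding` and `confidence` keys, as `chardet.detect` does in the issue reports further down this page.

```python
def format_result(name, result):
    """Render one detection result in the style of the chardetect output above."""
    if result.get("encoding"):
        return f"{name}: {result['encoding']} with confidence {result['confidence']}"
    # Assumption: how a failed detection is reported is a choice made for this sketch.
    return f"{name}: no result"


def report(paths, detect):
    """Run `detect` (any callable taking bytes and returning a dict) on each file."""
    lines = []
    for path in paths:
        with open(path, "rb") as handle:
            lines.append(format_result(path, detect(handle.read())))
    return lines


# With chardet installed, the real detector can be passed in:
#     import chardet
#     print("\n".join(report(["somefile", "someotherfile"], chardet.detect)))
```

Passing the detector as an argument keeps the formatting logic easy to exercise with a stub.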

About

This is a continuation of Mark Pilgrim's excellent original chardet port from C, and Ian Cordasco's charade Python 3-compatible fork.

maintainer: Dan Blanchard
Comments
  • New language models added; old inaccurate models rebuilt. Hungarian test files changed. Script for language model building added

    Detection degrades when Hungarian text contains many English words. XML files, for example, can contain more English words because of tag names and similar markup. The detector is based on letter frequency. A second problem arises when the Hungarian text has many sentences in uppercase.

    opened by ghost 28
  • Modified filter_english_with_letters to mimic the behavior of Mozilla's version.

    This change helps pass three unit tests that were failing before. I have also tested the changes by comparing the output of this function with Mozilla's version over some large randomly generated byte strings and so far so good.

    opened by rsnair2 19
  • Add Python 3 support (and drop support for < 2.6)

    Most of the credit for this goes to @bsidhom. I just took his Python 3 port and added a bunch of __future__ imports and the occasional from io import open to make it backward compatible with 2.6 and 2.7.

    I did some minor clean-up things like sorting imports and things like that. Oh and I added a .gitattributes file to ensure line endings are consistent.

    @erikrose Are you still actively maintaining this? I notice there are a few outstanding pull requests and I just want to make sure the version on PyPI is 2/3 compatible soon.

    opened by dan-blanchard 18
  • Certain input creates extremely long runtime and memory leak

    I am using chardet as part of a web crawler written in python3. I noticed that over time (many hours), the program consumes all memory. I narrowed down the problem to a single call of chardet.detect() method for certain web pages.

    After some testing, it seems that chardet has a problem with certain inputs, and I managed to get a sample of one. On my machine it consumes about 220 MB of memory (although the input is only 2.5 MB) and takes about 1 minute 22 seconds to process (in contrast to 43 ms when the file is truncated to about 2 MB). It does not seem to be limited to python3; in python2 the memory consumption is even worse (312 MB).

    Versions:

    Fedora release 20 (Heisenbug) x86_64
    chardet-2.2.1 (via pip)
    python3-3.3.2-11.fc20.x86_64
    python-2.7.5-11.fc20.x86_64

    How to reproduce:

    I cannot attach any files to this issue so I uploaded them to my dropbox account: https://www.dropbox.com/sh/26dry8zj18cv0m1/sKgP_E44qx/chardet_test.zip Please let me know of a better place where to put it if necessary. Here is an overview of the content and the results:

    setup='import chardet; html = open("mem_leak_html.txt", "rb").read()'
    python3 -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
    # produces: 10 loops, best of 3: 43 ms per loop
    python3 -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
    # produces: 1 loops, best of 3: 1min 22s per loop
    python3 mem_leak_test.py
    # produces:
    # Good input left 2.65 MB of unfreed memory.
    # Bad input left 220.16 MB of unfreed memory.
    
    python -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
    # produces: 10 loops, best of 3: 41.7 ms per loop
    python -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
    # produces: 10 loops, best of 3: 111 sec per loop
    python mem_leak_test.py
    # produces:
    # Good input left 3.00 MB of unfreed memory.
    # Bad input left 312.00 MB of unfreed memory.
    
    mem_leak_test.py:
    import gc
    import resource

    import chardet


    def mem_use():
        # Peak resident set size in MB (ru_maxrss is reported in KB on Linux).
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


    with open("mem_leak_html.txt", "rb") as f:
        html = f.read()


    def test(desc, instr):
        # Collect garbage before and after so only surviving memory is counted.
        gc.collect()
        mem_start = mem_use()
        chardet.detect(instr)
        gc.collect()
        mem_used = mem_use() - mem_start
        print('%s left %.2f MB of unfreed memory.' % (desc, mem_used))


    test('Good input', html[:2543482])
    test('Bad input', html[:2543483])
    
    bug help wanted 
    opened by radeklat 17
  • UTF detection when missing Byte Order Mark

    This change adds heuristic detection of UTF-16 and UTF-32 files when they are missing their byte order marks.

    At present we have no strategy for detecting the format of these files. Feel free to give feedback on the PR by the way, happy to have other eyes on it.

    Note that I report these files as UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE rather than UTF-16 and UTF-32. This is justified because it matters a great deal whether the data is little endian or big endian: Python will not decode these files correctly when passed decode("utf-16"), since Python assumes an endianness. This matters less for files with BOMs, as UTF-aware readers will generally inspect the BOM and auto-decode the file, but for files missing the BOM it is really important to be told the endianness in order to read the file.

    We could change the reporting of files with a BOM to include the endianness as well, but I've avoided having an opinion on that for the moment.

    opened by jpz 16
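
To illustrate the kind of heuristic the item above describes, here is a minimal sketch. It is not the code from the pull request: `guess_utf16_byte_order` and its `min_ratio` threshold are assumptions made for illustration. It relies on the fact that mostly-ASCII text encoded as UTF-16 has a zero byte in every other position, and which position depends on the byte order.

```python
def guess_utf16_byte_order(data, min_ratio=0.7):
    """Guess the byte order of BOM-less, mostly-ASCII UTF-16 data.

    Returns "UTF-16LE", "UTF-16BE", or None when the pattern is unclear.
    """
    if len(data) < 2 or len(data) % 2:
        return None
    units = len(data) // 2
    even_zeros = data[0::2].count(0)  # high bytes if the data is big endian
    odd_zeros = data[1::2].count(0)   # high bytes if the data is little endian
    if odd_zeros / units >= min_ratio and even_zeros == 0:
        return "UTF-16LE"
    if even_zeros / units >= min_ratio and odd_zeros == 0:
        return "UTF-16BE"
    return None
```

A check like this only works for text dominated by characters below U+0100; other scripts need a different signal.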
  • Changing license

    This feels strange to be posing as a question, since I'm one of the co-maintainers, but @sigmavirus24 and @erikrose, do you know if it's okay/legal for us to change the license of chardet? Because it was started by Mark Pilgrim I feel like it's kind of a nebulous question, because he's not someone you can just email, and he has nothing to do with development anymore. I would really like to change the license to at least be MPL, since that's what the C++ version is, and our setup currently mirrors that code pretty closely.

    I'm not a fan of the LGPL and feel weird having a project I work on use it.

    question 
    opened by dan-blanchard 15
  • Add detection for MacRoman encoding

    MacRoman is not in particularly common use anymore, as it has been deprecated by Mac OS for over a decade. However, there are programs such as Microsoft Office for Mac that didn't get the memo, and will often output in MacRoman when they write plain text files.

    This patch allows chardet to correctly detect MacRoman, instead of calling it something random and incorrect like ISO-8859-2. The MacRoman detector works similarly to the Latin-1 detector, but starts at a lower probability.

    I hope this is the right way to do it. There is surprisingly little support in chardet for adding a new single-byte Latin encoding.

    opened by rspeer 15
  • Add Hypothesis based test of chardet

    The concept here is pretty simple: This tries to test for the invariant that if a string comes from valid unicode and is encoded in one of the chardet supported encodings then chardet should detect some encoding.

    More nuanced tests are possible (e.g. asserting that the string should be decodable from the detected encoding) but given that even this test is demonstrating a bunch of bugs this seemed to be a good starting point.

    This is (more or less) the test that caught #65, #64 and #63. #62 had one extra line in it to try to reencode the data as the reported format.

    Notes:

    1. This test is currently failing. I'm pretty sure this is because of issues it's finding in the code, not issues with the test.
    2. min_size=100 is to rule out bugs that come solely from the length, prompted by your saying that short strings aren't really supported. Anecdotally all of the bugs that have been found so far don't depend on the length and min_size=1 would have been fine (leaving min_size alone is also valid, but I assume '' having a None encoding is intended behaviour)
    opened by DRMacIver 14
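
The invariant in the item above can be sketched without Hypothesis. The following uses plain `random` sampling instead of Hypothesis strategies, so it is a simplified stand-in rather than the test from the pull request. `find_counterexample` is a hypothetical name, and the detector is passed in as a callable that returns a dict with an `encoding` key.

```python
import random


def find_counterexample(detect, encodings, trials=200, min_size=100, seed=0):
    """Return (text, encoding) for which `detect` reports no encoding, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Printable ASCII can be encoded in every encoding listed by the caller.
        text = "".join(chr(rng.randint(0x20, 0x7E)) for _ in range(min_size))
        encoding = rng.choice(encodings)
        data = text.encode(encoding)
        if detect(data)["encoding"] is None:
            return text, encoding
    return None


# With chardet installed, the real detector can be checked like this:
#     import chardet
#     print(find_counterexample(chardet.detect, ["ascii", "utf-8", "utf-16"]))
```

Unlike Hypothesis, this sketch does not shrink failing inputs, so any counterexample it returns is as long as `min_size`.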
  • Failing to guess a single MS-apostrophe

    I have a page of text in ASCII with a single Microsoft-apostrophe chr(8217) detected as ISO-8859-2.

    #1. Create problematic sample
    >>> s = 'today' + chr(8217) + 's research'
    >>> s
    'today’s research'
    >>> b = s.encode('windows-1252')
    >>> b
    b'today\x92s research'
    
    #2. Attempt to decode it
    >>> chardet.detect(b)
    {'encoding': 'ISO-8859-2', 'confidence': 0.8060609643099236}
    >>> b.decode('ISO-8859-2')
    'today\x92s research'
    
    #3. Now try the correct encoding
    >>> b.decode('windows-1252')
    'today’s research'
    

    This text is very typical of anything created using a Microsoft editor. Furthermore, the latest version of Firefox detects it correctly. I am using Python 3.3. Any help is appreciated.

    opened by shompol 13
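
One possible caller-side workaround for the case above is sketched here. It is an assumption for illustration, not chardet behavior: `decode_with_cp1252_fallback` is a hypothetical helper that retries with windows-1252 when the detected encoding yields C1 control characters (U+0080 to U+009F), which rarely appear in real text.

```python
def has_c1_controls(text):
    """True if the text contains characters in the C1 control range."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)


def decode_with_cp1252_fallback(data, detected_encoding):
    """Decode with the detected encoding, retrying as windows-1252 if suspicious."""
    text = data.decode(detected_encoding)
    if has_c1_controls(text):
        try:
            return data.decode("windows-1252")
        except UnicodeDecodeError:
            # A few bytes are undefined in windows-1252; keep the first result.
            pass
    return text
```

For the sample in the report, `decode_with_cp1252_fallback(b'today\x92s research', 'ISO-8859-2')` takes the fallback path because byte 0x92 decodes to a C1 control character in ISO-8859-2.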
  • Add upstream changes and clean up where possible

    This is very much a work in progress at the moment, and I'm just creating the PR to make it easier for me to keep track of Travis results.

    I have a few goals for this branch:

    1. Pull in changes from Mozilla's upstream code. There aren't as many as I had initially expected but there are some.
    2. Improve PEP8 compliance all over the place. The previous maintainers tried to keep variable names identical to the C code, presumably to ease the comparison with the Mozilla code, but we're going to be diverging from upstream after pulling in the changes mentioned in 1. Basically, Mozilla seems very likely to abandon their character encoding detector in the near future and switch to using ICU, but ICU doesn't support all of the codecs we currently do, because it is more web-focused. If our goal here is to be a truly universal character encoding detector, we'll need to go our own way in the future in that respect.
    3. Make the unit tests pass, or at the very least make it obvious that the tests are actually failing (instead of ignoring the failures like our current Travis build does).

    So far, I've done a little bit of point 1 and updated the Travis testing setup to use nose and report test coverage via Coveralls.

    enhancement 
    opened by dan-blanchard 13
  • Don't indicate byte order for UTF-16/32 with given BOM

    If passed a byte string starting with \xff\xfe (little-endian byte order mark) or \xfe\xff (big-endian byte order mark), the encoding is detected as UTF-16LE, UTF-32LE, UTF-16BE or UTF-32BE respectively.

    However, as the byte order mark is given in the string, the encoding should be simply UTF-16 or UTF-32. Otherwise bytes.decode() will fail or preserve the byte order mark:

    import codecs
    import chardet

    s = 'foo'.encode('UTF-16')                # BOM followed by native byte order
    encoding = chardet.detect(s)['encoding']  # "UTF-16LE"
    s.decode(encoding)                        # "\ufefffoo"

    s = codecs.BOM_BE + 'foo'.encode('UTF-16BE')
    encoding = chardet.detect(s)['encoding']  # "UTF-16BE"
    s.decode(encoding)                        # "\ufefffoo"
    

    Hence code that uses chardet to detect the encoding before decoding data would need to wrap chardet.detect in the following inconvenient and counter-intuitive way:

    encoding = chardet.detect(enc)['encoding']
    if encoding in ('UTF-16LE', 'UTF-16BE'):
        dec = enc.decode('UTF-16')
    elif encoding in ('UTF-32LE', 'UTF-32BE'):
        dec = enc.decode('UTF-32')
    else:
        dec = enc.decode(encoding)
    

    This PR changes the behavior to return simply UTF-16 or UTF-32 respectively when a byte order mark is found, so that the detected encoding can be passed unchanged to bytes.decode().

    opened by snoack 12
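
The decoding difference behind the item above can be checked with the standard library alone. This is a small sketch: `bom_encoding` is a hypothetical helper, not part of chardet, that maps a leading byte order mark to the generic codec name so the mark is consumed during decoding.

```python
import codecs

# UTF-32 marks are checked first because the UTF-32 LE mark starts with
# the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32"),
    (codecs.BOM_UTF32_BE, "UTF-32"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
]


def bom_encoding(data):
    """Return the generic codec name for data starting with a BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_BE + "foo".encode("utf-16-be")
with_generic = sample.decode(bom_encoding(sample))  # the BOM is consumed
with_explicit = sample.decode("utf-16-be")          # the BOM is kept as U+FEFF
```

The generic codecs read the mark to pick a byte order and drop it, while the endian-specific codecs treat it as an ordinary character.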
  • Allow running of the package via `python3 -m chardet ...`

    I want to be able to execute the chardet main script (packaged as an executable) by running python3 -m chardet .... Currently it doesn't work. Would be great if it did work.

    opened by DeflateAwning 1
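
Python runs a package with `-m` by executing its `__main__.py`. The sketch below demonstrates the mechanism on a throwaway package; `demo_pkg`, its `cli.main` function and `run_demo_package` are hypothetical names used only for this illustration, and nothing here is taken from chardet's source.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_demo_package(args):
    """Build a throwaway package with a __main__.py and run it via `-m`."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        # Stand-in for a package's existing command-line entry point.
        (pkg / "cli.py").write_text(
            "import sys\n"
            "def main(argv=None):\n"
            "    argv = sys.argv[1:] if argv is None else argv\n"
            "    print('would inspect: ' + ', '.join(argv))\n"
        )
        # `python -m demo_pkg` executes this file.
        (pkg / "__main__.py").write_text("from .cli import main\nmain()\n")
        env = dict(os.environ, PYTHONPATH=tmp)
        done = subprocess.run(
            [sys.executable, "-m", "demo_pkg", *args],
            cwd=tmp, env=env, capture_output=True, text=True, check=True,
        )
        return done.stdout.strip()
```

A two-line `__main__.py` that delegates to the existing entry point is usually all a package needs to support this invocation.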
  • Documentation licensed only to non-commercial and personal use found

    Hi,

    In the file, 'https://github.com/chardet/chardet/blob/main/tests/windows-1255-hebrew/hydepark.hevre.co.il.7957.xml', we have found the following license text:

    "This copy is for your personal, non-commercial use only. To order presentation-ready copies for distribution to your colleagues, clients or customers, use the Order Reprints tool at the bottom of any article or visit: www.djreprints.com." This may cause issues even for open source projects that allow commercial use.

    Can you please let us know if there is an option to retain the file even for commercial use? Is it possible to remove the content that is only for non-commercial and personal use?

    Regards, Rahul

    opened by rahulmohang 0
  • Fix broken CP949 state machine

    Abstract

    The current CP949 state machine produces false-positive errors: it incorrectly marks valid CP949 text as an error. This PR rewrites the state transition table to comply with the CP949 specification.

    Details

    These are some cases in which a false-positive error can occur in the current implementation.

    • (0xAD68) The first byte, 0xAD, is classified as class 8. In the START state, class 8 makes a transition to the ERROR state, but this is a valid CP949 sequence.

    • (0xC652) The first byte is classified as class 9 and the second byte as class 5. In the START state, class 9 makes a transition to State 6, and in State 6, class 5 makes a transition to the ERROR state, but this is a valid CP949 sequence.

    Test

    I have tested the new state machine (To-Be) against all characters in CP949 with the following code, and it returned Success. When I tested the current implementation (As-Is), it showed Error! at byte 15479.

    from chardet.codingstatemachine import CodingStateMachine
    from chardet.enums import MachineState
    from chardet.mbcssm import CP949_SM_MODEL

    sm = CodingStateMachine(CP949_SM_MODEL)

    with open('./path/to/cp949-chars.txt', 'rb') as f:
        data = f.read()

    # Feed every byte to the state machine and stop at the first error.
    for i, byte in enumerate(data):
        if sm.next_state(byte) == MachineState.ERROR:
            print("Error! at byte %d" % i)
            break
    else:
        # The loop finished without reaching the ERROR state.
        print("Success! :)")
    

    I couldn't add the CP949 character listing to the test fixtures folder, because it would make the tests fail: the frequency-based probing does not identify it as CP949, since it is just a plain listing of all possible CP949 characters.

    opened by HelloWorld017 2
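
For readers unfamiliar with this style of validation, here is a toy table-driven state machine. It is a sketch under loud assumptions: the byte classes, states and transition table below describe an invented two-byte encoding, not CP949 and not chardet's actual tables. It only shows how a byte class plus the current state selects the next state, and how a trail byte in the ASCII letter range (as in 0xAD68 above) can still be valid.

```python
START, ERROR, NEED_TRAIL = 0, 1, 2


def byte_class(byte):
    """Map a byte to a class index for the toy encoding."""
    if byte <= 0x40:
        return 0  # plain ASCII, never a trail byte
    if byte <= 0x7F:
        return 1  # ASCII that may also serve as a trail byte
    if byte in (0x80, 0xFF):
        return 2  # never valid
    return 3      # 0x81-0xFE: lead byte or trail byte


# TRANSITIONS[state][byte class] gives the next state.
TRANSITIONS = {
    START:      (START, START, ERROR, NEED_TRAIL),
    NEED_TRAIL: (ERROR, START, ERROR, START),
    ERROR:      (ERROR, ERROR, ERROR, ERROR),
}


def first_error(data):
    """Return the index of the first invalid byte, or None if none is found.

    A sequence that ends while a trail byte is still expected is not reported.
    """
    state = START
    for index, byte in enumerate(data):
        state = TRANSITIONS[state][byte_class(byte)]
        if state == ERROR:
            return index
    return None
```

A single wrong entry in a table like `TRANSITIONS` is enough to reject valid input, which is the kind of defect the item above reports.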
  • chardet 5.0 KeyError with Python 3.10 on Windows

    Yesterday I encountered a strange CI failure for our Windows GitHub CI workflows which had been running fine until then. The Python 3.7 job passed fine but the Python 3.10 job failed.

    https://github.com/deluge-torrent/deluge/actions/workflows/ci.yml?query=branch%3Adevelop

    The only difference I could find from a diff of the logs was the new chardet 5.0.0 being pulled in. So I pinned chardet to 4.0.0 and CI is passing again.

    GitHub Actions Environment:

    Virtual Environment: windows-2022 (20220626.1)
    Python 3.10.5
    

    Note that I also tested with windows-2019 and the same error occurs.

    The traceback is rather cryptic since it comes from pytest but this is all there is from the job:

    INTERNALERROR> Traceback (most recent call last):
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\_pytest\main.py", line 264, in wrap_session
    INTERNALERROR>     config._do_configure()
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\_pytest\config\__init__.py", line 995, in _do_configure
    INTERNALERROR>     self.hook.pytest_configure.call_historic(kwargs=dict(config=self))
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_hooks.py", line 277, in call_historic
    INTERNALERROR>     res = self._hookexec(self.name, self.get_hookimpls(), kwargs, False)
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_manager.py", line 80, in _hookexec
    INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_callers.py", line 60, in _multicall
    INTERNALERROR>     return outcome.get_result()
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_result.py", line 60, in get_result
    INTERNALERROR>     raise ex[1].with_traceback(ex[2])
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_callers.py", line 39, in _multicall
    INTERNALERROR>     res = hook_impl.function(*args)
    INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\_pytest\faulthandler.py", line 27, in pytest_configure
    INTERNALERROR>     import faulthandler
    INTERNALERROR>   File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
    INTERNALERROR>   File "<frozen importlib._bootstrap>", line 171, in __enter__
    INTERNALERROR>   File "<frozen importlib._bootstrap>", line 123, in acquire
    INTERNALERROR> KeyError: 1832
    
    opened by cas-- 4
  • test_detect_all_and_detect_one_should_agree fails on Python 3.11b3

    $ python3.11 --version
    Python 3.11.0b3
    $ python3.11 -m venv _e
    $ . _e/bin/activate
    (_e) $ pip install -e .
    (_e) $ pip install pytest hypothesis
    (_e) $ pytest
    

    results in:

    ====================================================== FAILURES ======================================================
    ____________________________________ test_detect_all_and_detect_one_should_agree _____________________________________
    
    txt = 'Ā𐀀', enc = 'utf-8', _ = HypothesisRandom(generated data)
    
        @given(
            st.text(min_size=1),
            st.sampled_from(
                [
                    "ascii",
                    "utf-8",
                    "utf-16",
                    "utf-32",
                    "iso-8859-7",
                    "iso-8859-8",
                    "windows-1255",
                ]
            ),
            st.randoms(),
        )
        @settings(max_examples=200)
        def test_detect_all_and_detect_one_should_agree(txt, enc, _):
            try:
                data = txt.encode(enc)
            except UnicodeEncodeError:
                assume(False)
            try:
                result = chardet.detect(data)
                results = chardet.detect_all(data)
    >           assert result["encoding"] == results[0]["encoding"]
    E           AssertionError: assert None == 'utf-8'
    
    test.py:183: AssertionError
    
    The above exception was the direct cause of the following exception:
    
        @given(
    >       st.text(min_size=1),
            st.sampled_from(
                [
                    "ascii",
                    "utf-8",
                    "utf-16",
                    "utf-32",
                    "iso-8859-7",
                    "iso-8859-8",
                    "windows-1255",
                ]
            ),
            st.randoms(),
        )
    
    test.py:160: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    txt = 'Ā𐀀', enc = 'utf-8', _ = HypothesisRandom(generated data)
    
        @given(
            st.text(min_size=1),
            st.sampled_from(
                [
                    "ascii",
                    "utf-8",
                    "utf-16",
                    "utf-32",
                    "iso-8859-7",
                    "iso-8859-8",
                    "windows-1255",
                ]
            ),
            st.randoms(),
        )
        @settings(max_examples=200)
        def test_detect_all_and_detect_one_should_agree(txt, enc, _):
            try:
                data = txt.encode(enc)
            except UnicodeEncodeError:
                assume(False)
            try:
                result = chardet.detect(data)
                results = chardet.detect_all(data)
                assert result["encoding"] == results[0]["encoding"]
            except Exception as exc:
    >           raise RuntimeError(f"{result} != {results}") from exc
    E           RuntimeError: {'encoding': None, 'confidence': 0.0, 'language': None} != [{'encoding': 'utf-8', 'confidence': 0.505, 'language': ''}]
    
    test.py:185: RuntimeError
    ----------------------------------------------------- Hypothesis -----------------------------------------------------
    Falsifying example: test_detect_all_and_detect_one_should_agree(
        txt='Ā𐀀', enc='utf-8', _=HypothesisRandom(generated data),
    )
    ============================================== short test summary info ===============================================
    FAILED test.py::test_detect_all_and_detect_one_should_agree - RuntimeError: {'encoding': None, 'confidence': 0.0, '...
    ================================ 1 failed, 375 passed, 6 xfailed, 1 xpassed in 9.79s =================================
    

    The same steps succeed with Python 3.10.4.

    opened by musicinmybrain 3
Releases(5.1.0)
  • 5.1.0(Dec 1, 2022)

    Features

    • Add should_rename_legacy argument to most functions, which will rename older encodings to their more modern equivalents (e.g., GB2312 becomes GB18030) (#264, @dan-blanchard)
    • Add capital letter sharp S and ISO-8859-15 support (#222, @SimonWaldherr)
    • Add a prober for MacRoman encoding (#5 updated as c292b52a97e57c95429ef559af36845019b88b33, Rob Speer and @dan-blanchard )
    • Add --minimal flag to chardetect command (#214, @dan-blanchard)
    • Add type annotations to the project and run mypy on CI (#261, @jdufresne)
    • Add support for Python 3.11 (#274, @hugovk)

    Fixes

    • Clarify LGPL version in License trove classifier (#255, @musicinmybrain)
    • Remove support for EOL Python 3.6 (#260, @jdufresne)
    • Remove unnecessary guards for non-falsey values (#259, @jdufresne)

    Misc changes

    • Switch to Python 3.10 release in GitHub actions (#257, @jdufresne)
    • Remove setup.py in favor of build package (#262, @jdufresne)
    • Run tests on macos, Windows, and 3.11-dev (#267, @dan-blanchard)
  • 5.0.0(Jun 25, 2022)

    ⚠️ This release is the first release of chardet that no longer supports Python < 3.6 ⚠️

    In addition to that change, it features the following user-facing changes:

    • Added a prober for Johab Korean (#207, @grizlupo)
    • Added a prober for UTF-16/32 BE/LE (#109, #206, @jpz)
    • Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, and Turkish, which should help prevent future errors with those languages
    • Improved XML tag filtering, which should improve accuracy for XML files (#208)
    • Tweaked SingleByteCharSetProber confidence to match latest uchardet (#209)
    • Made detect_all return child prober confidences (#210)
    • Updated examples in docs (#223, @domdfcoding)
    • Documentation fixes (#212, #224, #225, #226, #220, #221, #244 from too many to mention)
    • Minor performance improvements (#252, @deedy5)
    • Add support for Python 3.10 when testing (#232, @jdufresne)
    • Lots of little development cycle improvements, mostly thanks to @jdufresne
  • 4.0.0(Dec 10, 2020)

    ⚠️ This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support 3.6+ ⚠️

    Major Changes

    This release is multiple years in the making, and provides some quality of life improvements to chardet. The primary user-facing changes are:

    1. Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See #121 for details)
    2. The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.
    3. There is now a chardet.detect_all function that returns a list of possible encodings for the input with associated confidences.
    4. We have dropped support for Python 2.6, 3.4, and 3.5 as they are all past end-of-life.
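
The short-circuiting in item 2 can be pictured with a toy group of probers. This is a sketch of the idea only: `State`, `BomProber`, `CountingProber` and `feed_group` are hypothetical names, not chardet's classes, and real probers accumulate state across calls.

```python
from enum import Enum


class State(Enum):
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2


class BomProber:
    """Toy prober that is certain as soon as it sees a UTF-8 signature."""
    name = "UTF-8-SIG"

    def feed(self, data):
        return State.FOUND_IT if data.startswith(b"\xef\xbb\xbf") else State.NOT_ME


class CountingProber:
    """Toy prober that never decides and records how often it was asked."""
    name = "undecided"

    def __init__(self):
        self.calls = 0

    def feed(self, data):
        self.calls += 1
        return State.DETECTING


def feed_group(probers, data):
    """Feed each prober in turn and stop at the first definite match."""
    for prober in probers:
        if prober.feed(data) is State.FOUND_IT:
            return prober.name  # later probers are never consulted
    return None
```

When an early prober reports a definite match, the remaining probers are skipped, which is where the time saving comes from.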

    The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).

    Benchmarks

    Running on a MacBook Pro (15-inch, 2018) with 2.2GHz 6-core i7 processor and 32GB RAM

    old version (chardet 3.0.4)

    Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep  8 2020, 12:19:42)
    [Clang 11.0.3 (clang-1103.0.32.62)]
    --------------------------------------------------------------------------------
    Calls per second for each encoding:
    ascii: 25559.439366240098
    big5: 7.187002209518091
    cp932: 4.71090956645177
    cp949: 2.937256786994428
    euc-jp: 4.870580412090848
    euc-kr: 6.6910755971933416
    euc-tw: 87.71098043480079
    gb2312: 6.614302607154443
    ibm855: 27.595893549680685
    ibm866: 29.93483661732791
    iso-2022-jp: 3379.5052775763434
    iso-2022-kr: 26181.67290886392
    iso-8859-1: 120.63424740403983
    iso-8859-5: 32.65106262196898
    iso-8859-7: 62.480089080556084
    koi8-r: 13.72481001727257
    maccyrillic: 33.018537255804496
    shift_jis: 4.996013583677438
    tis-620: 14.323112928341818
    utf-16: 166771.53081510935
    utf-32: 198782.18009478672
    utf-8: 13.966236809766901
    utf-8-sig: 193732.28637413395
    windows-1251: 23.038910006925768
    windows-1252: 99.48409117053738 
    windows-1255: 6.336261495718825
    
    Total time: 357.05358052253723s (10.054513372323958 calls per second)
    

    new version (chardet 4.0.0)

    
    Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep  8 2020, 12:19:42)
    [Clang 11.0.3 (clang-1103.0.32.62)]
    --------------------------------------------------------------------------------
    .......................................................................................................................................................................................................................................................................................................................................................................
    Calls per second for each encoding:
    ascii: 38176.31067961165
    big5: 12.86915132656389
    cp932: 4.656400877065864
    cp949: 7.282976434315926
    euc-jp: 4.329381447610525
    euc-kr: 8.16386823884839
    euc-tw: 90.230745070368
    gb2312: 14.248865889128146
    ibm855: 33.30225548069821
    ibm866: 44.181691968506
    iso-2022-jp: 3024.2295767539117
    iso-2022-kr: 25055.57945041816
    iso-8859-1: 59.25262902122995
    iso-8859-5: 39.7069713674529
    iso-8859-7: 61.008422013862194
    koi8-r: 41.21560517643845
    maccyrillic: 31.402474369805002
    shift_jis: 4.9091652743515155
    tis-620: 14.408875278821073
    utf-16: 177349.00634249471
    utf-32: 186413.51111111112
    utf-8: 108.62174360115105
    utf-8-sig: 181965.46637744035
    windows-1251: 43.16933400329809
    windows-1252: 211.27653358317968
    windows-1255: 16.15113643694104
    
    Total time: 268.0230791568756s (13.394368915143872 calls per second)
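
To compare the two listings, the per-encoding speedup is the new calls-per-second figure divided by the old one. The sketch below copies three encodings and the two total times from the output above; the selection of encodings and the helper name `speedups` are choices made for this illustration.

```python
# Calls per second copied from the two benchmark listings above.
OLD = {
    "utf-8": 13.966236809766901,
    "koi8-r": 13.72481001727257,
    "iso-8859-1": 120.63424740403983,
}
NEW = {
    "utf-8": 108.62174360115105,
    "koi8-r": 41.21560517643845,
    "iso-8859-1": 59.25262902122995,
}
OLD_TOTAL_SECONDS = 357.05358052253723
NEW_TOTAL_SECONDS = 268.0230791568756


def speedups(old, new):
    """Ratio of new to old throughput for each encoding (above 1 means faster)."""
    return {name: new[name] / old[name] for name in old}


ratios = speedups(OLD, NEW)
overall = OLD_TOTAL_SECONDS / NEW_TOTAL_SECONDS
for name, ratio in sorted(ratios.items()):
    print(f"{name}: {ratio:.2f}x")
print(f"overall: {overall:.2f}x")
```

In these figures utf-8 detection is roughly 7.8 times faster, while iso-8859-1 came out slower in this particular run.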
    
    
    

    Thank you to @aaaxx, @edumco, @hrnciar, @hroncok, @jdufresne, @mdamien, @saintamh, and @xeor for submitting pull requests, and to all of our users for being patient with how long this release has taken.

    Full changelog

    • Convert single-byte charset probers to use nested dicts for language models (#121) @dan-blanchard
    • Add API option to get all the encodings confidence (#111) @mdamien
    • Make sure pyc files are not in tarballs (d7c7343) @dan-blanchard
    • Add benchmark script (d702545, 8dccd00, 726973e, 71a0fad) @dan-blanchard
    • Include license file in the generated wheel package (#141) @jdufresne
    • Drop support for Python 2.6 (#143) @jdufresne
    • Remove unused coverage configuration (#142) @jdufresne
    • Doc the chardet package suitable for production (#144) @jdufresne
    • Pass python_requires argument to setuptools (#150) @jdufresne
    • Update pypi.python.org URL to pypi.org (#155) @jdufresne
    • Typo fix (#159) @saintamh
    • Support pytest 4, don't apply marks directly to parameters (PR #174, Issue #173) @hroncok
    • Test Python 3.7 and 3.8 and document support (#175) @jdufresne
    • Drop support for end-of-life Python 3.4 (#181) @jdufresne
    • Workaround for distutils bug in python 2.7 (#165) @xeor
    • Remove deprecated license_file from setup.cfg (#182) @jdufresne
    • Remove deprecated 'sudo: false' from Travis configuration (#200) @jdufresne
    • Add testing for Python 3.9 (#201) @jdufresne
    • Adds explicit os and distro definitions (#140) @edumco
    • Remove shebang from nonexecutable script (#192) @hrnciar
    • Remove use of deprecated 'setup.py test' (#187) @jdufresne
    • Remove unnecessary numeric placeholders from format strings (#176) @jdufresne
    • Update links (#152) @aaaxx
    • Remove shebang and executable bit from chardet/cli/chardetect.py (#171) @jdufresne
    • Handle weird logging edge case in universaldetector.py (056a2a4) @dan-blanchard
    • Switch from Travis to GitHub Actions (#204) @dan-blanchard
    • Properly set CharsetGroupProber.state to FOUND_IT (PR #203, Issue #202) @dan-blanchard
    • Add language to detect_all output (1e208b7) @dan-blanchard
  • 3.0.4(Jun 8, 2017)

    This minor bugfix release just fixes some packaging and documentation issues:

    • Fix issue with setup.py where pytest_runner was always being installed. (PR #119, thanks @zmedico)
    • Make sure test.py is included in the manifest (PR #118, thanks @zmedico)
    • Fix a bunch of old URLs in the README and other docs. (PRs #123 and #129, thanks @qfan and @jdufresne)
    • Update documentation to no longer imply we test/support Python 3 versions before 3.3 (PR #130, thanks @jdufresne)
  • 3.0.3(May 16, 2017)

  • 3.0.2(Apr 12, 2017)

    Fixes an issue where detect would sometimes return None instead of a dict with the keys encoding, language, and confidence (Issue #113, PR #114).
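Callers that still have to run against a release affected by this issue could guard against the None case themselves. The following is only a minimal sketch: `normalize_result` is a hypothetical helper, not part of chardet, and the default values it fills in are assumptions chosen for illustration.

```python
def normalize_result(result):
    """Return a dict with the keys encoding, confidence, and language.

    `result` is whatever a detect-style call returned: either None or a
    dict that may be missing some keys. Hypothetical helper, not chardet API.
    """
    # Assumed defaults for the illustration; not taken from chardet itself.
    defaults = {"encoding": None, "confidence": 0.0, "language": None}
    if result is None:
        return dict(defaults)
    # Values from `result` win; missing keys fall back to the defaults.
    return {**defaults, **result}
```

With such a wrapper, code that reads `result["encoding"]` does not need a separate None check.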

  • 3.0.1(Apr 11, 2017)

  • 3.0.0(Apr 11, 2017)

    This release is long overdue, but still mostly serves as a placeholder for the impending 4.0.0 release, which will have retrained models for better accuracy. For now, this release will get the following improvements up on PyPI:

    • Added support for Turkish ISO-8859-9 detection (PR #41, thanks @queeup)
    • Commented out large unused sections of Big5 and EUC-KR tables to save memory (8bc4b89)
    • Removed Python 3.2 from testing, but added 3.4 - 3.6
    • Ensure that stdin is open with mode 'rb' for chardetect CLI. (PR #38, thanks @lpsinger)
    • Fixed chardetect crash with non-ascii file names (PR #39, thanks @nkanaev)
    • Made naming conventions more Pythonic throughout (no more mTypicalPositiveRatio, and instead typical_positive_ratio)
    • Modernized test scripts and infrastructure so we've got Travis testing and all that stuff
    • Rename filter_without_english_words to filter_international_words and make it match current Mozilla implementation (PR #44, thanks @rsnair2)
    • Updated filter_english_letters to match C implementation (c6654595)
    • Temporarily disabled Hungarian ISO-8859-2 and Windows-1250 detection because it is very inaccurate (da6c0a079)
    • Allow CLI sub-package to be importable (PR #55)
    • Add a Hypothesis-based test (PR #66, thanks @DRMacIver)
    • Strip endianness from UTF with BOM predictions so that the encoding can be passed directly to bytes.decode() (PR #73, thanks @snoack)
    • Fixed broken links in docs (PR #90, thanks @roskakori)
    • Added early exit to chardetect when encoding is detected instead of looping through entire file (PR #103, thanks @jpz)
    • Use bytearray objects internally instead of wrap_ord calls, which provides a nice performance boost across the board (PR #106)
    • Add language property to probers and UniversalDetector results (PR #180)
    • Mark the 5 known test failures as such so we can have more useful Travis build results in the meantime (d588407)
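The reason for stripping endianness from BOM-based predictions (PR #73 above) can be seen with the standard library alone. This sketch does not call chardet; the sample string and variable names are made up for the illustration. It shows how Python's generic `utf-16` codec and the endian-specific `utf-16-le` codec treat a leading byte order mark differently.

```python
import codecs

text = "héllo"
# UTF-16 little-endian bytes preceded by the matching byte order mark.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic codec reads the BOM, picks the byte order, and drops the BOM.
decoded_generic = data.decode("utf-16")

# The endian-specific codec treats the BOM as an ordinary character,
# so U+FEFF is left at the start of the decoded string.
decoded_le = data.decode("utf-16-le")

print(repr(decoded_generic))
print(repr(decoded_le))
```

Reporting the generic name for data that starts with a BOM therefore lets the result go straight into `bytes.decode()` without a stray U+FEFF in the output.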
  • 2.3.0(Oct 7, 2014)

    In this release, we:

    • Added support for CP932 detection (thanks to @hashy).
    • Fixed an issue where UTF-8 with a BOM would not be detected as UTF-8-SIG (#8).
    • Modified chardetect to use argparse for argument parsing.
    • Moved docs to a gh-pages branch. You can now access them at http://chardet.github.io.
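The UTF-8-SIG fix (#8 above) matters because the two codec names behave differently when decoding. The sketch below uses only the standard library, not chardet, and the sample bytes are invented for the illustration.

```python
import codecs

# UTF-8 encoded text with a leading byte order mark (EF BB BF).
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain utf-8 keeps the BOM as a U+FEFF character at the start.
with_bom = data.decode("utf-8")

# utf-8-sig recognizes the BOM and removes it while decoding.
without_bom = data.decode("utf-8-sig")

print(repr(with_bom))
print(repr(without_bom))
```

A detector that reports UTF-8-SIG for such input gives callers the codec name that decodes to clean text.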
  • 2.2.1(Oct 21, 2014)

  • 2.2.0(Oct 21, 2014)

Owner
Character Encoding Detector