pysimdjson

Python bindings for the simdjson project, a SIMD-accelerated JSON parser. If SIMD instructions are unavailable a fallback parser is used, making pysimdjson safe to use anywhere.

Bindings are currently tested on OS X, Linux, and Windows for Python versions 3.5 to 3.9.

📝 Documentation

The latest documentation can be found at https://pysimdjson.tkte.ch.

If you've checked out the source code (for example to review a PR), you can build the latest documentation by running cd docs && make html.

🎉 Installation

If binary wheels are available for your platform, you can install from pip with no further requirements:

pip install pysimdjson

Binary wheels are available for the following:

| Platform | py3.5 | py3.6 | py3.7 | py3.8 | py3.9 | pypy3 |
|---|---|---|---|---|---|---|
| OS X (x86_64) | y | y | y | y | y | y |
| Windows (x86_64) | x | x | y | y | y | x |
| Linux (x86_64) | y | y | y | y | y | x |
| Linux (ARM64) | y | y | y | y | y | x |

If binary wheels are not available for your platform, you'll need a C++11-capable compiler to compile the sources:

pip install pysimdjson --no-binary :all:

Both simdjson and pysimdjson support FreeBSD and Linux on ARM when built from source.

Development and Testing

This project comes with a full test suite. To install development and testing dependencies, use:

pip install -e ".[test]"

To also install 3rd party JSON libraries used for running benchmarks, use:

pip install -e ".[benchmark]"

To run the tests, just type pytest. To also run the benchmarks, use pytest --runslow.

To properly test on Windows, you need both a recent version of Visual Studio (VS) and VS2015, patch 3. Older versions of CPython required C/C++ extensions to be built with the same version of VS as the interpreter. Use the Developer Command Prompt to easily switch between versions.

How It Works

This project uses pybind11 to generate the low-level bindings on top of the simdjson project. You can use it just like the built-in json module, or use the simdjson-specific API for much better performance.

import simdjson
doc = simdjson.loads('{"hello": "world"}')
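
The snippet below is a minimal sketch of the simdjson-specific API described in the next section; parse() returns a lazy proxy rather than a plain dict:

import simdjson

parser = simdjson.Parser()
doc = parser.parse(b'{"hello": "world"}')
# Keys are only turned into Python objects when you access them.
print(doc['hello'])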

🚀 Making things faster

pysimdjson provides an API compatible with the built-in json module for convenience, and this API is pretty fast (beating or tying all other Python JSON libraries). However, it also provides a simdjson-specific API that can perform significantly better.
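
Because the compatible API mirrors the standard library, switching an existing code path over can be as simple as changing the import. A minimal sketch (the file name is only an example):

import simdjson as json

with open('example.json', 'rb') as src:  # hypothetical input file
    data = json.load(src)

print(json.dumps(data))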

Don't load the entire document

95% of the time spent loading a JSON document into Python is spent in the creation of Python objects, not the actual parsing of the document. You can avoid all of this overhead by ignoring parts of the document you don't want.

pysimdjson supports this in two ways: JSON pointers via at_pointer(), or proxies for objects and lists.

import simdjson
parser = simdjson.Parser()
doc = parser.parse(b'{"res": [{"name": "first"}, {"name": "second"}]}')

For the sample above, we really just want the second entry in res; we don't care about anything else. We can get it in two ways:

assert doc['res'][1]['name'] == 'second' # True
assert doc.at_pointer('/res/1/name') == 'second' # True

Both of these approaches will be much faster than using load()/loads(), since they avoid loading the parts of the document we don't care about.
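
To see the difference on your own hardware, here is a rough timing sketch using timeit (the iteration count is arbitrary and the numbers vary by machine):

import timeit

import simdjson

payload = b'{"res": [{"name": "first"}, {"name": "second"}]}'
parser = simdjson.Parser()

full = timeit.timeit(lambda: simdjson.loads(payload)['res'][1]['name'], number=100_000)
lazy = timeit.timeit(lambda: parser.parse(payload).at_pointer('/res/1/name'), number=100_000)

print(f'full load: {full:.3f}s  pointer lookup: {lazy:.3f}s')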

Both Object and Array have a mini property that returns their entire content as a minified Python str. A message router, for example, could parse the document, retrieve a single property (the destination), and forward the payload without ever turning the rest into a Python object. Here's a (bad) example:

import simdjson
# Assumes Flask and a Redis client, purely for illustration.
from flask import Flask, request
from redis import Redis

app = Flask(__name__)
redis = Redis()

@app.route('/store', methods=['POST'])
def store():
    parser = simdjson.Parser()
    doc = parser.parse(request.data)
    redis.set(doc['key'], doc.mini)

With this, doc could contain thousands of objects, but the only one loaded into a Python object was key, and we even minified the content as we went.

Re-use the parser.

One of the easiest performance gains if you're working on many documents is to re-use the parser.

import simdjson
parser = simdjson.Parser()

for i in range(0, 100):
    doc = parser.parse(b'{"a": "b"}')

This will drastically reduce the number of allocations being made, as it will reuse the existing buffer when possible. If it's too small, it'll grow to fit.

📈 Benchmarks

pysimdjson compares well against most libraries for the default load/loads(), which creates full Python objects immediately.

pysimdjson performs significantly better when only part of the document is of interest. For each test file we show the time taken to completely deserialize the document into Python objects, as well as the time to get the deepest key in each file. The second approach avoids all unnecessary object creation.
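
For reference, the two measurements correspond roughly to the following sketch (the pointer below is only illustrative, not the actual deepest key used by the benchmark suite):

import simdjson

parser = simdjson.Parser()

with open('jsonexamples/canada.json', 'rb') as src:
    raw = src.read()

# Full deserialization: every object and array becomes a Python object.
full_doc = simdjson.loads(raw)

# Deepest-key lookup: only the values along a single path are materialized.
lazy_doc = parser.parse(raw)
value = lazy_doc.at_pointer('/type')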

jsonexamples/canada.json deserialization

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| simdjson-{canada} | 10.67130 | 22.89260 | 0.00465 | 60.30257 |
| yyjson-{canada} | 11.29230 | 29.90640 | 0.00568 | 53.27890 |
| orjson-{canada} | 11.90260 | 34.88260 | 0.00507 | 54.49605 |
| ujson-{canada} | 18.17060 | 48.99410 | 0.00718 | 36.24892 |
| simplejson-{canada} | 39.24630 | 52.62860 | 0.00483 | 21.81617 |
| rapidjson-{canada} | 41.04930 | 53.10800 | 0.00445 | 21.19078 |
| json-{canada} | 44.68320 | 59.44410 | 0.00440 | 19.71509 |

jsonexamples/canada.json deepest key

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| simdjson-{canada} | 3.21360 | 6.88010 | 0.00044 | 285.83978 |
| yyjson-{canada} | 10.62770 | 46.10050 | 0.01000 | 43.29310 |
| orjson-{canada} | 12.54010 | 39.16080 | 0.00779 | 44.28928 |
| ujson-{canada} | 17.93980 | 35.44960 | 0.00697 | 36.78481 |
| simplejson-{canada} | 38.58160 | 54.33290 | 0.00699 | 21.37382 |
| rapidjson-{canada} | 40.69030 | 58.23460 | 0.00700 | 20.30349 |
| json-{canada} | 43.88300 | 65.04480 | 0.00722 | 18.55929 |

jsonexamples/twitter.json deserialization

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| orjson-{twitter} | 2.36070 | 14.03050 | 0.00123 | 346.94307 |
| simdjson-{twitter} | 2.41350 | 12.01550 | 0.00117 | 359.49272 |
| yyjson-{twitter} | 2.48130 | 12.03680 | 0.00112 | 353.03313 |
| ujson-{twitter} | 2.62890 | 11.39370 | 0.00090 | 346.87994 |
| simplejson-{twitter} | 3.34600 | 11.08840 | 0.00098 | 270.58797 |
| json-{twitter} | 3.35270 | 11.82610 | 0.00116 | 260.01943 |
| rapidjson-{twitter} | 4.29320 | 13.81980 | 0.00128 | 197.91107 |

jsonexamples/twitter.json deepest key

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| simdjson-{twitter} | 0.33840 | 0.67200 | 0.00002 | 2800.32496 |
| orjson-{twitter} | 2.38460 | 13.53120 | 0.00131 | 352.70788 |
| yyjson-{twitter} | 2.48180 | 13.67470 | 0.00156 | 320.56731 |
| ujson-{twitter} | 2.65230 | 11.65150 | 0.00125 | 331.69430 |
| json-{twitter} | 3.34910 | 12.44890 | 0.00116 | 263.25854 |
| simplejson-{twitter} | 3.35760 | 15.61900 | 0.00137 | 262.36758 |
| rapidjson-{twitter} | 4.31870 | 12.77490 | 0.00119 | 201.86510 |

jsonexamples/github_events.json deserialization

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| orjson-{github_events} | 0.18080 | 0.67020 | 0.00004 | 5041.29485 |
| simdjson-{github_events} | 0.19470 | 0.61450 | 0.00003 | 4725.63489 |
| yyjson-{github_events} | 0.19710 | 0.53970 | 0.00004 | 4584.50870 |
| ujson-{github_events} | 0.23760 | 1.33490 | 0.00004 | 3904.08715 |
| json-{github_events} | 0.29030 | 1.32040 | 0.00009 | 3034.22530 |
| simplejson-{github_events} | 0.30210 | 0.82260 | 0.00005 | 3067.99997 |
| rapidjson-{github_events} | 0.33010 | 0.92400 | 0.00005 | 2793.93274 |

jsonexamples/github_events.json deepest key

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| simdjson-{github_events} | 0.03630 | 0.66110 | 0.00001 | 25259.19598 |
| orjson-{github_events} | 0.18210 | 0.71230 | 0.00003 | 5073.48086 |
| yyjson-{github_events} | 0.20030 | 0.61270 | 0.00003 | 4589.71299 |
| ujson-{github_events} | 0.24260 | 1.05100 | 0.00007 | 3644.08240 |
| json-{github_events} | 0.29310 | 2.38770 | 0.00011 | 2967.79019 |
| simplejson-{github_events} | 0.30580 | 1.39670 | 0.00007 | 2931.01646 |
| rapidjson-{github_events} | 0.33340 | 0.80440 | 0.00004 | 2795.27887 |

jsonexamples/citm_catalog.json deserialization

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| orjson-{citm_catalog} | 5.40140 | 17.76900 | 0.00314 | 130.33847 |
| yyjson-{citm_catalog} | 5.77340 | 23.09490 | 0.00421 | 113.78942 |
| simdjson-{citm_catalog} | 6.00620 | 26.87570 | 0.00444 | 104.41073 |
| ujson-{citm_catalog} | 6.34300 | 25.06400 | 0.00473 | 96.01414 |
| simplejson-{citm_catalog} | 9.54910 | 23.96350 | 0.00392 | 78.99315 |
| json-{citm_catalog} | 10.21250 | 23.52610 | 0.00329 | 78.72180 |
| rapidjson-{citm_catalog} | 10.81700 | 21.85400 | 0.00343 | 73.94939 |

jsonexamples/citm_catalog.json deepest key

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| simdjson-{citm_catalog} | 0.81040 | 2.11090 | 0.00015 | 1088.17698 |
| orjson-{citm_catalog} | 5.37260 | 18.37890 | 0.00451 | 120.86345 |
| yyjson-{citm_catalog} | 5.61430 | 23.18500 | 0.00548 | 110.29924 |
| ujson-{citm_catalog} | 6.25850 | 30.79090 | 0.00604 | 95.50805 |
| simplejson-{citm_catalog} | 9.36560 | 24.44860 | 0.00510 | 77.50571 |
| json-{citm_catalog} | 10.07650 | 25.29490 | 0.00450 | 76.18267 |
| rapidjson-{citm_catalog} | 10.69120 | 27.84880 | 0.00493 | 70.98005 |

jsonexamples/mesh.json deserialization

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| yyjson-{mesh} | 2.33710 | 13.01130 | 0.00171 | 331.50569 |
| simdjson-{mesh} | 2.52960 | 13.19230 | 0.00159 | 311.37935 |
| orjson-{mesh} | 2.88770 | 12.13010 | 0.00152 | 287.31080 |
| ujson-{mesh} | 3.64020 | 18.23620 | 0.00227 | 193.35645 |
| json-{mesh} | 5.97130 | 13.58290 | 0.00136 | 150.01621 |
| rapidjson-{mesh} | 7.54270 | 16.14480 | 0.00155 | 119.37806 |
| simplejson-{mesh} | 8.64370 | 16.35320 | 0.00136 | 106.25888 |

jsonexamples/mesh.json deepest key

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| simdjson-{mesh} | 1.02020 | 2.74930 | 0.00013 | 919.93044 |
| yyjson-{mesh} | 2.30970 | 13.06730 | 0.00182 | 347.76076 |
| orjson-{mesh} | 2.85260 | 12.41860 | 0.00156 | 290.19432 |
| ujson-{mesh} | 3.59400 | 16.68610 | 0.00227 | 201.03704 |
| json-{mesh} | 5.96300 | 19.18900 | 0.00185 | 146.04645 |
| rapidjson-{mesh} | 7.43860 | 16.32260 | 0.00164 | 121.84979 |
| simplejson-{mesh} | 8.62160 | 21.89280 | 0.00221 | 101.30905 |

jsonexamples/gsoc-2018.json deserialization

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| simdjson-{gsoc-2018} | 5.52590 | 16.27430 | 0.00178 | 145.59797 |
| yyjson-{gsoc-2018} | 5.62040 | 16.46250 | 0.00168 | 155.97459 |
| orjson-{gsoc-2018} | 5.78420 | 13.87300 | 0.00140 | 148.84293 |
| simplejson-{gsoc-2018} | 7.76200 | 15.26480 | 0.00142 | 114.98827 |
| ujson-{gsoc-2018} | 7.96570 | 21.53840 | 0.00188 | 110.29162 |
| json-{gsoc-2018} | 8.63300 | 19.26320 | 0.00172 | 102.78744 |
| rapidjson-{gsoc-2018} | 10.55570 | 19.20210 | 0.00159 | 85.84087 |

jsonexamples/gsoc-2018.json deepest key

| Name | Min (μs) | Max (μs) | StdDev | Ops |
|---|---|---|---|---|
| simdjson-{gsoc-2018} | 1.56020 | 4.20200 | 0.00024 | 570.15046 |
| yyjson-{gsoc-2018} | 5.49930 | 14.89760 | 0.00158 | 161.14242 |
| orjson-{gsoc-2018} | 5.72650 | 15.88270 | 0.00160 | 153.18169 |
| simplejson-{gsoc-2018} | 7.70780 | 18.78120 | 0.00169 | 116.90299 |
| ujson-{gsoc-2018} | 7.91720 | 21.35300 | 0.00227 | 103.06755 |
| json-{gsoc-2018} | 8.65190 | 19.99580 | 0.00188 | 103.86934 |
| rapidjson-{gsoc-2018} | 10.52410 | 20.98870 | 0.00158 | 87.78973 |

Comments
  • Rewrite for code quality and move to simdjson 0.4.*. (Issue #31)

    This will become the version 2.0.0 release.

    • [x] Update embedded simdjson to 0.3.0 (#31)
    • [x] Update embedded simdjson to 0.4.0 (#31)
    • [x] Move from cython to pybind11
    • [ ] Rewrite documentation
    • [ ] Better CI-generated benchmarks against json, ujson, rapidjson, and orjson.
    • [x] Try to match the json.load, json.loads, json.dump and json.dumps interfaces. Will impact performance over the native simdjson API but users want plug-and-play.
    • [x] Move from appveyor and circleci to github actions for CI tasks.
    • [x] simdjson no longer requires C++17. We can greatly expand the versions of Python on Windows we can provide binary wheels for. This comes from older versions of CPython requiring C extensions to be built with the same compiler the interpreter was.
    packaging 
    opened by TkTech 44
  • The Python overhead is about 95% of the processing time

    From simdjson/scripts/javascript, I generated a file called large.json. In C++, parsing this file takes about 0.25 s.

    $parse large.json
    Min:  0.252188 bytes read: 203130424 Gigabytes/second: 0.805471
    

    I wrote the following Python script...

    import simdjson
    from timeit import default_timer as timer

    with open('large.json', 'rb') as fin:
        x = fin.read()

    for i in range(10):
        start = timer()
        doc = simdjson.loads(x)
        end = timer()
        print(end - start)
    

    I get...

    $ time python3 test.py
    3.471898762974888
    3.9210079659242183
    3.3614078611135483
    3.72252986789681
    3.7506914171390235
    3.756883286871016
    3.752689895918593
    3.751842977013439
    3.7484844669234008
    (...)
    

    If my analysis is correct (and it could be wrong), pysimdjson takes 3.7 s to parse the file, and of that, 0.25 s are due to simdjson, leaving about 95% of the processing time to overhead.

    I know that this is known, but I wanted to provide a data point.

    opened by lemire 24
  • This parser can't support a document that big

    [email protected]:~$ time python convert-to-pickle.py
    Traceback (most recent call last):
      File "convert-to-pickle.py", line 10, in <module>
        data = simdjson.loads(ch.read())
      File "/usr/local/lib/python3.8/dist-packages/simdjson/__init__.py", line 52, in loads
        return parser.parse(s, True)
      File "simdjson/csimdjson.pyx", line 468, in csimdjson.Parser.parse
    ValueError: This parser can't support a document that big

    invalid zero-effort 
    opened by ghost 17
  • File causes a crash in pysimdjson (reliably)

    I am copying over issue https://github.com/simdjson/simdjson/issues/921 from simdjson. We do not see a crash in simdjson itself, but there is a crash in pysimdjson:

    import simdjson
    a = open("test.txt").read()
    b = simdjson.loads(a.encode())
    

    Using the file https://github.com/simdjson/simdjson/files/4749603/test.txt

    opened by lemire 17
  • Unable to serialize simdjson Objects into Pickle

    Hello all!

    When I try to serialize simdjson Object into Pickle, I get the following error:

    TypeError: self.c_element,self.c_parser cannot be converted to a Python object for pickling

    Would it be possible to add support for serializing/pickling simdjson instances directly, without converting them to dict? If not pickling, then at least an ability to serialize into .json would be lovely as well.
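
    In the meantime, a possible workaround (a sketch that assumes the as_dict()/as_list() helpers on the proxy types) is to materialize the proxy into plain Python objects before pickling:

    import pickle
    import simdjson

    parser = simdjson.Parser()
    doc = parser.parse(b'{"hello": "world"}')

    # Proxies reference the parser's internal buffer, so convert to a plain
    # dict first; the resulting object pickles normally.
    payload = pickle.dumps(doc.as_dict())
    print(pickle.loads(payload))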

    opened by vovavili 9
  • Fairly high overhead on the boundary Python/C++

    We are parsing a very high number of ~2KB JSON files in our Python-based application.

    • The native (C++) SIMDJSON library delivers ~700k parser cycles per second.
    • pysimdjson delivers ~350k parser cycles per second.
    • The Cython-based PoC implementation (in-house, so far) delivers ~700k parser cycles per second (very close to C++ implementation).

    I also conducted a rather artificial test of how many parser cycles I can get with a basically empty JSON document ({}). The issue here is quite visible: the overhead of the Python<->pysimdjson boundary crossing is high relative to other possible implementations.

    A "parser cycle" is defined as one call to parser.parse(json) on an existing parser instance.

    I'm not 100% sure if this is a priority of this library, so feel free to close this one as irrelevant.
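
    For reference, a minimal sketch of the kind of "parser cycle" micro-benchmark described above, using timeit and a reused Parser (numbers are machine-dependent):

    import timeit

    import simdjson

    parser = simdjson.Parser()
    cycles = 100_000

    elapsed = timeit.timeit(lambda: parser.parse(b'{}'), number=cycles)
    print(f'{cycles / elapsed:,.0f} parser cycles per second')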

    opened by ateska 9
  • Segfault when not assigning the parser to a variable

    Here is a Python session that segfaults:

    >>> import simdjson
    >>> pa=simdjson.Parser().parse('{"a": 9999}')
    >>> pa["a"]
    zsh: segmentation fault (core dumped)  python
    

    And here is one that works:

    >>> import simdjson
    >>> p = simdjson.Parser()
    >>> pa = p.parse('{"a": 9999}')
    >>> pa["a"]
    9999
    

    It's unclear to me why the first one segfaults, and it looks like a bug?

    I imagine the parser is garbage collected by Python in the first example, but it's still clearly in use by the "pa" variable?

    bug 
    opened by palkeo 9
  • Consider upgrading to simdjson 0.4

    Version 0.4 of simdjson is now available

    Highlights

    • Test coverage has been greatly improved and we have resolved many static-analysis warnings on different systems.

    New features:

    • We added a fast (8GB/s) minifier that works directly on JSON strings.
    • We added a fast (10GB/s) UTF-8 validator that works directly on strings (any strings, including non-JSON).
    • The array and object elements have a constant-time size() method.

    Performance:

    • Performance improvements to the API (type(), get<>()).
    • The parse_many function (ndjson) has been entirely reworked. It now uses a single secondary thread instead of several new threads.
    • We have introduced a faster UTF-8 validation algorithm (lookup3) for all kernels (ARM, x64 SSE, x64 AVX).

    System support:

    • C++11 support for older compilers and systems.
    • FreeBSD support (and tests).
    • We support the clang front-end compiler (clangcl) under Visual Studio.
    • It is now possible to target ARM platforms under Visual Studio.
    • The simdjson library will never abort or print to standard output/error.

    Version 0.3 of simdjson is now available

    Highlights

    • Multi-Document Parsing: Read a bundle of JSON documents (ndjson) 2-4x faster than doing it individually. API docs / Design Details
    • Simplified API: The API has been completely revamped for ease of use, including a new JSON navigation API and fluent support for error code and exception styles of error handling with a single API. Docs
    • Exact Float Parsing: Now simdjson parses floats flawlessly without any performance loss (https://github.com/simdjson/simdjson/pull/558). Blog Post
    • Even Faster: The fastest parser got faster! With a shiny new UTF-8 validator and meticulously refactored SIMD core, simdjson 0.3 is 15% faster than before, running at 2.5 GB/s (where 0.2 ran at 2.2 GB/s).

    Minor Highlights

    • Fallback implementation: simdjson now has a non-SIMD fallback implementation, and can run even on very old 64-bit machines.
    • Automatic allocation: as part of API simplification, the parser no longer has to be preallocated; it will adjust automatically when it encounters larger files.
    • Runtime selection API: We've exposed simdjson's runtime CPU detection and implementation selection as an API, so you can tell what implementation we detected and test with other implementations.
    • Error handling your way: Whether you use exceptions or check error codes, simdjson lets you handle errors in your style. APIs that can fail return simdjson_result, letting you check the error code before using the result. But if you are more comfortable with exceptions, skip the error code and cast straight to T, and exceptions will be thrown automatically if an error happens. Use the same API either way!
    • Error chaining: We also worked to keep non-exception error-handling short and sweet. Instead of having to check the error code after every single operation, now you can chain JSON navigation calls like looking up an object field or array element, or casting to a string, so that you only have to check the error code once at the very end.
    opened by lemire 8
  • Windows 3.6 Binary?

    Hi! Thanks again for this fantastic project. ^_^

    I ran into some CI errors where my Windows 64-bit builds were dying due to compile errors with CPython 3.6. I noticed there's no wheel on PyPI for it.

    Would it be possible to fix?

    Thanks!

    enhancement packaging 
    opened by william-silversmith 6
  • Pysimdjson fails to install on python 3.6

      Using cached https://files.pythonhosted.org/packages/9b/f6/c63260f8788574de8fdd0bbe70f803328cb058141c0903ba29637d89f863/pysimdjson-2.5.0.tar.gz
    Installing collected packages: pysimdjson
      Running setup.py install for pysimdjson ... error
        Complete output from command /home/ubuntu/ctix-2/venv/bin/python3.6 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-wzmco2i3/pysimdjson/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-jelehspx/install-record.txt --single-version-externally-managed --compile --install-headers /home/ubuntu/ctix-2/venv/include/site/python3.6/pysimdjson:
        running install
        running build
        running build_py
        creating build
        creating build/lib.linux-x86_64-3.6
        creating build/lib.linux-x86_64-3.6/simdjson
        copying simdjson/__init__.py -> build/lib.linux-x86_64-3.6/simdjson
        running build_ext
        building 'csimdjson' extension
        creating build/temp.linux-x86_64-3.6
        creating build/temp.linux-x86_64-3.6/simdjson
        x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include -I/home/ubuntu/ctix-2/venv/include -I/usr/include/python3.6m -c simdjson/binding.cpp -o build/temp.linux-x86_64-3.6/simdjson/binding.o -std=c++11
        In file included from /home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include/pybind11/pytypes.h:12:0,
                         from /home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include/pybind11/cast.h:13,
                         from /home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include/pybind11/attr.h:13,
                         from /home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include/pybind11/pybind11.h:44,
                         from simdjson/binding.cpp:5:
        /home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include/pybind11/detail/common.h:112:20: fatal error: Python.h: No such file or directory
        compilation terminated.
        error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    
        ----------------------------------------
    Command "/home/ubuntu/ctix-2/venv/bin/python3.6 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-wzmco2i3/pysimdjson/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-jelehspx/install-record.txt --single-version-externally-managed --compile --install-headers /home/ubuntu/ctix-2/venv/include/site/python3.6/pysimdjson" failed with error code 1 in /tmp/pip-install-wzmco2i3/pysimdjson/
    You are using pip version 19.0, however version 20.2.4 is available.
    You should consider upgrading via the 'pip install --upgrade pip' command.
    opened by anudeepsamaiya 6
  • Build binary packages using clang-cl on Windows

    Support for clang-cl is coming. As part of the PR that allows CPython to build against clang-cl, distutils is updated to build with clang-cl (https://github.com/python/cpython/pull/18371). Once this PR is merged and a new CPython release includes it we can start using it for our binary releases.

    Clang has reached a point where it's safe enough for us to use with CPython builds compiled with MSVC2015 or newer. https://clang.llvm.org/docs/MSVCCompatibility.html

    This would alleviate poor Windows performance caused by MSVC issues (https://github.com/simdjson/simdjson/issues/847, but not entirely, https://github.com/simdjson/simdjson/issues/848).

    We only need to do this if upstream simdjson doesn't figure out what's up with MSVC. @lemire

    enhancement packaging blocked 
    opened by TkTech 6
  • Float aware mini

    simdjson minify drops the trailing '.0' from floats, which is fine by JSON spec, but matters in practice. For example, Elasticsearch dynamic field type detection is affected. In general, Python distinguishes between int and float, so various type guarantees may fail. The dump/load cycle should not convert types for a few byte gain. Let users explicitly convert types, if they need to.

    This modifies minify, so it does not drop the '.0'.

    Note: simdjson started dropping '.0' with d0821adf0e7934f27a8eb5c2fe9b8254e4.
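
    To illustrate why the dropped '.0' matters, a small standard-library example (this uses the stdlib json module, not the minifier itself):

    import json

    # Minifying "1.0" down to "1" changes the Python type on the next load:
    assert isinstance(json.loads('1.0'), float)
    assert isinstance(json.loads('1'), int)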

    opened by edgarsi 8
  • Performance penalty when reading items

    I'm getting increased latency in my application from simdjson but I can't figure out why.

    This is a snippet from profiling the function that gets items from the simdjson object. The time is in seconds.

       Ordered by: internal time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           26    0.017    0.001    0.017    0.001 {method 'get' of 'csimdjson.Object' objects}
    
    

    When I time getting individual items from the same object I get timings of about 15 microseconds which seems comparable with getting items from a normal python dictionary. However when I test the whole function the performance is much worse.

    opened by jonathan-kosgei 1
  • Improve user experience of memory safety.

    We've added a check in v4 (https://github.com/TkTech/pysimdjson/blob/master/simdjson/csimdjson.pyx#L437) that prevents parsing new documents while references continue to exist to the old one. This is correct, in that it ensures no errors. I wasn't terribly happy with this, but it's better than segfaulting.

    It has downsides:

    • It sucks as a user (https://github.com/TkTech/pysimdjson/issues/53#issuecomment-850494991), where you might have to del the old objects, even if you didn't intend to use them again. Very un-pythonic.
    • Doesn't work on PyPy, where del is unreliable. The objects may not be garbage collected until much later.

    Brainstorming welcome. Alternatives:

    • Probably the easiest approach would be for a Parser to keep a list of Object and Array proxies that hold a reference to it, and set a dirty bit on them when parse() is called with a different document (see the sketch after this list). The performance of this would probably be unacceptable - I might be wrong.
    • Use the new parse_into_document() and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as Object or Array reference it.
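
    As a rough illustration of that first alternative, here is a toy, pure-Python sketch of the dirty-bit idea (not the real Cython internals; all names here are hypothetical):

    import weakref

    class _Proxy:
        # Toy stand-in for csimdjson.Object / csimdjson.Array.
        def __init__(self, data):
            self._data = data
            self._valid = True

        def __getitem__(self, key):
            if not self._valid:
                raise RuntimeError('proxy refers to a document that was re-parsed')
            return self._data[key]

    class _Parser:
        # Toy parser that invalidates old proxies instead of refusing to re-parse.
        def __init__(self):
            self._live = []

        def parse(self, data):
            for ref in self._live:          # set the dirty bit on proxies from the old document
                proxy = ref()
                if proxy is not None:
                    proxy._valid = False
            self._live.clear()
            proxy = _Proxy(data)
            self._live.append(weakref.ref(proxy))
            return proxy

    p = _Parser()
    old = p.parse({'a': 1})
    new = p.parse({'b': 2})
    print(new['b'])                         # fine
    try:
        old['a']                            # the old proxy was invalidated
    except RuntimeError as exc:
        print(exc)
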
    enhancement help wanted 
    opened by TkTech 3
  • Provide the ability to link to system simdjson

    Bundling a library is a serious sin in our book, so provide the ability to link to the system library. I've also done some refactoring to avoid exponential growth of Extension calls. The default behavior remains the same, so it shouldn't affect existing users.

    That said, the patch isn't perfect. It still uses the bundled headers instead of system headers but it should be good enough for us.

    opened by mgorny 2
  • Expose document_stream interface

    The pysimdjson library could support our document_stream interface (parse_many function). It is well tested as of release 0.7 (with fuzz testing) and works well today. It supports streams of indefinite size.

    See https://github.com/simdjson/simdjson/blob/master/doc/parse_many.md

    Related to https://github.com/TkTech/pysimdjson/issues/70

    enhancement 
    opened by lemire 4