CaskDB is a disk-based, embedded, persistent, key-value store based on the Riak's bitcask paper, written in Python.

Overview

CaskDB - Disk based Log Structured Hash Table Store

made-with-python build codecov MIT license

architecture

CaskDB is a disk-based, embedded, persistent, key-value store based on the Riak's bitcask paper, written in Python. It is more focused on the educational capabilities than using it in production. The file format is platform, machine, and programming language independent. Say, the database file created from Python on macOS should be compatible with Rust on Windows.

This project aims to help anyone, even a beginner in databases, build a persistent database in a few hours. There are no external dependencies; only the Python standard library is enough.

If you are interested in writing the database yourself, head to the workshop section.

Features

  • Low latency for reads and writes
  • High throughput
  • Easy to back up / restore
  • Simple and easy to understand
  • Store data much larger than the RAM

Limitations

Most of the following limitations are of CaskDB. However, there are some due to design constraints by the Bitcask paper.

  • Single file stores all data, and deleted keys still take up the space
  • CaskDB does not offer range scans
  • CaskDB requires keeping all the keys in the internal memory. With a lot of keys, RAM usage will be high
  • Slow startup time since it needs to load all the keys in memory

Dependencies

CaskDB does not require any external libraries to run. For local development, install the packages from requirements_dev.txt:

pip install -r requirements_dev.txt

Installation

PyPi is not used for CaskDB yet (issue #5), and you'd have to install it directly from the repository by cloning.

Usage

disk: DiskStorage = DiskStore(file_name="books.db")
disk.set(key="othello", value="shakespeare")
author: str = disk.get("othello")
# it also supports dictionary style API too:
disk["hamlet"] = "shakespeare"

Prerequisites

The workshop is for intermediate-advanced programmers. Knowing Python is not a requirement, and you can build the database in any language you wish.

Not sure where you stand? You are ready if you have done the following in any language:

  • If you have used a dictionary or hash table data structure
  • Converting an object (class, struct, or dict) to JSON and converting JSON back to the things
  • Open a file to write or read anything. A common task is dumping a dictionary contents to disk and reading back

Workshop

NOTE: I don't have any workshops scheduled shortly. Follow me on Twitter for updates. Drop me an email if you wish to arrange a workshop for your team/company.

CaskDB comes with a full test suite and a wide range of tools to help you write a database quickly. A Github action is present with an automated tests runner, code formatter, linter, type checker and static analyser. Fork the repo, push the code, and pass the tests!

Throughout the workshop, you will implement the following:

  • Serialiser methods take a bunch of objects and serialise them into bytes. Also, the procedures take a bunch of bytes and deserialise them back to the things.
  • Come up with a data format with a header and data to store the bytes on the disk. The header would contain metadata like timestamp, key size, and value.
  • Store and retrieve data from the disk
  • Read an existing CaskDB file to load all keys

Tasks

  1. Read the paper. Fork this repo and checkout the start-here branch
  2. Implement the fixed-sized header, which can encode timestamp (uint, 4 bytes), key size (uint, 4 bytes), value size (uint, 4 bytes) together
  3. Implement the key, value serialisers, and pass the tests from test_format.py
  4. Figure out how to store the data on disk and the row pointer in the memory. Implement the get/set operations. Tests for the same are in test_disk_store.py
  5. Code from the task #2 and #3 should be enough to read an existing CaskDB file and load the keys into memory

Use make lint to run mypy, black, and pytype static analyser. Run make test to run the tests locally. Push the code to Github, and tests will run on different OS: ubuntu, mac, and windows.

Not sure how to proceed? Then check the hints file which contains more details on the tasks and hints.

Hints

  • Check out the documentation of struck.pack for serialisation methods in Python
  • Not sure how to come up with a file format? Read the comment in the format module

What next?

I often get questions about what is next after the basic implementation. Here are some challenges (with different levels of difficulties)

Level 1:

  • Crash safety: the bitcask paper stores CRC in the row, and while fetching the row back, it verifies the data
  • Key deletion: CaskDB does not have a delete API. Read the paper and implement it
  • Instead of using a hash table, use a data structure like the red-black tree to support range scans
  • CaskDB accepts only strings as keys and values. Make it generic and take other data structures like int or bytes.

Level 2:

  • Hint file to improve the startup time. The paper has more details on it
  • Implement an internal cache which stores some of the key-value pairs. You may explore and experiment with different cache eviction strategies like LRU, LFU, FIFO etc.
  • Split the data into multiple files when the files hit a specific capacity

Level 3:

  • Support for multiple processes
  • Garbage collector: keys which got updated and deleted remain in the file and take up space. Write a garbage collector to remove such stale data
  • Add SQL query engine layer
  • Store JSON in values and explore making CaskDB as a document database like Mongo
  • Make CaskDB distributed by exploring algorithms like raft, paxos, or consistent hashing

Name

This project was named cdb earlier and now renamed to CaskDB.

Line Count

$ tokei -f format.py disk_store.py
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Python                  2          391          261          103           27
-------------------------------------------------------------------------------
 disk_store.py                      204          120           70           14
 format.py                          187          141           33           13
===============================================================================
 Total                   2          391          261          103           27
===============================================================================

License

The MIT license. Please check LICENSE for more details.

Owner
I git stuff done
A minimal configuration for a dockerized kafka project.

Docker Kafka Quickstart A minimal configuration for a dockerized kafka project. Usage: Run this command to build kafka and zookeeper containers, and c

Nouamane Tazi 5 Jan 12, 2022
Python library and cli util for https://www.zerochan.net/

Zerochan Library for Zerochan.net with pics parsing and downloader included! Features CLI utility for pics downloading from zerochan.net Library for c

kiriharu 10 Oct 11, 2022
This is a Python program I wrote to simulate the solar system with 79 lines of code.

Solar System With Python This is a Python program I wrote to simulate the solar system with 79 lines of code. Required modules tkinter, math, time Why

Mehmet Aydoğmuş 1 Oct 26, 2021
Material de apoio da oficina de SAST apresentada pelo CAIS no Webinar de 28/05/21.

CAIS-CAIS Conjunto de Aplicações Intencionamente Sem-Vergonha do CAIS Material didático do Webinar "EP1. Oficina - Práticas de análise estática de cód

Fausto Filho 14 Jul 25, 2022
This package tries to emulate the behaviour of syntax proposed in PEP 671 via a decorator

Late-Bound Arguments This package tries to emulate the behaviour of syntax proposed in PEP 671 via a decorator. Usage Mention the names of the argumen

Shakya Majumdar 0 Feb 06, 2022
Just messing around with AI for fun coding 😂

Python-AI Projects 🤖 World Clock ⏰ ⚙︎ Steps to run world-clock.py file Download and open the file in your Python IDE. Run the file a type the name of

Danish Saleem 0 Feb 10, 2022
Meower a social media platform written in Scratch 3.0 and Python

Meower Meower is a social media platform written in Scratch 3.0 and Python, ported to HTML for self-hosting. Try Beta 4.6 Changelog for 4.6 Start impl

Meower Media Co. 23 Dec 02, 2022
This repo contains scripts that add functionality to xbar.

xbar-custom-plugins This repo contains scripts that add functionality to xbar. Usage You have to add scripts to xbar plugin folder. If you don't find

osman uygar 1 Jan 10, 2022
Python Script to add OpenGapps, Magisk, libhoudini translation library and libndk translation library to waydroid !

Waydroid Extras Script Script to add gapps and other stuff to waydroid ! Installation/Usage "lzip" is required for this script to work, install it usi

Casu Al Snek 331 Jan 02, 2023
Packaging tools for shanty services.

parcel Packaging tools for shanty services. What? Services are docker containers deployed by shanty on a hosting appliance. Each service consists of t

0 Jan 20, 2022
Manipulation OpenAI Gym environments to simulate robots at the STARS lab

liegroups Python implementation of SO2, SE2, SO3, and SE3 matrix Lie groups using numpy or PyTorch. [Documentation] Installation To install, cd into t

STARS Laboratory 259 Dec 11, 2022
Hopefully the the next-generation backend server of bgm.tv

Hopefully the the next-generation backend server of bgm.tv

Bangumi 475 Jan 01, 2023
Google Scholar App Using Python

Google Scholar App Watch the tutorial video How to build a Google Scholar App | Streamlit #30 Demo Launch the web app: Reproducing this web app To rec

Chanin Nantasenamat 4 Jun 05, 2022
Flexible constructor to create dynamic list of heterogeneous properties for some kind of entity

Flexible constructor to create dynamic list of heterogeneous properties for some kind of entity. This set of helpers useful to create properties like contacts or attributes for describe car/computer/

Django Stars 24 Jul 21, 2022
A tool converting rpk (记乎) to apkg (Anki Package)

RpkConverter This tool is used to convert rpk file to Anki apkg. 如果遇到任何问题,请发起issue,并描述情况。如果转换rpk出现问题,请将文件发到邮箱 ssqyang [AT] outlook.com,我会debug并修复问题。 下

9 Nov 01, 2021
Basic-Killfeed - A simple DayZ Console Killfeed

Basic-Killfeed A simple DayZ Console Killfeed. Setup Install Python Version 3.10

Nick 1 Apr 25, 2022
A python based app to improve your presentation workflow

Presentation Remote A remote made for making presentations easier by enabling all the members to have access to change the slide and control the flow

Parnav 1 Oct 28, 2021
AHP Calculator - A method for organizing and evaluating complicated decisions, using Maths and Psychology

AHP Calculator - A method for organizing and evaluating complicated decisions, using Maths and Psychology

16 Aug 08, 2022
Project repository of Apache Airflow, deployed on Docker in Amazon EC2 via GitLab.

Airflow on Docker in EC2 + GitLab's CI/CD Personal project for simple data pipeline using Airflow. Airflow will be installed inside Docker container,

Ammar Chalifah 13 Nov 29, 2022
chiarose(XCR) based on chia(XCH) source code fork, open source public chain

chia-rosechain 一个无耻的小活动 | A shameless little event 如果您喜欢这个项目,请点击star 将赠送您520朵玫瑰,可以去 facebook 留下您的(xcr)地址,和github用户名。 If you like this project, please

ddou123 376 Dec 14, 2022