Extract XML from the OS X dictionaries.

Overview

Before You Start

Apple-peeler was written using python 3.9 (but it should be trivial to support earlier versions of python 3.5+).

Installation

pip install apple-peeler

Dependencies

BeautifulSoup 4, lxml, and click

Usage

Apple likes to move around the dictionaries location from macOS version to macOS version. So if the dictionaries are no longer at the path below you can tell apple-peeler where to look by exporting DICT_BASE in your environment or using the --base option below.

export DICT_BASE="/System/Library/AssetsV2/com_apple_MobileAsset_DictionaryServices_dictionaryOSX/"

After that, useage is straightforward.

Usage: apple-peeler [OPTIONS]

Extract XML from Apple Dictionary files.

Options:
--base DIRECTORY                The root directory of the OS X dictionaries.
                                (Default: /System/Library/AssetsV2/com_apple
                                _MobileAsset_DictionaryServices_dictionaryOS
                                X/) [Env var DICT_BASE]
--out DIRECTORY                 The path to place extracted XML files.
-d, --dictionary [
    all|Arabic - English|Danish|Duden Dictionary Data Set I|Dutch|
    Dutch - English|French|French - English|French - German|German - English|
    Hebrew|Hindi|Hindi - English|Indonesian - English|Italian|
    Italian - English|Korean|Korean - English|New Oxford American Dictionary|
    Norwegian|Oxford American Writer's Thesaurus|
    Oxford Dictionary of English|Oxford Thesaurus of English|
    Polish - English|Portuguese|Portuguese - English|Russian|
    Russian - English|Sanseido Super Daijirin|
    Sanseido The WISDOM English-Japanese Japanese-English Dictionary|
    Simplified Chinese - English|Simplified Chinese - Japanese|Spanish|
    Spanish - English|Swedish|Thai|Thai - English|
    The Standard Dictionary of Contemporary Chinese|Traditional Chinese|
    Traditional Chinese - English|Turkish|Vietnamese - English]
                                The dictionary to extract or 'all'.
                                (Default: all) [Accepts multiple]
--format-xml / --no-format-xml  Format the XML files using BeautifulSoup.
                                (Default: False)
--debug                         Output debug information to STDERR.
                                (Default: False)
--help                          Show this message and exit.

Introduction

I need a ton of dictionary data for prototyping my learning a language tool, Parsnip, and licensing 40 dictionaries seems too expensive for a bootstrapper prototyping / working on an MVP (I look forward to the day this is no longer true). [Note: I am not planning to redistribute or otherwise use the data in an unlicensed manner.]

Parsnip uses Natural Language Processing and Dictionaries to decouple the word <-> sentence tug-of-war that's existed as long as flashcards have been used for language learning. I.e., should I make a word (concept) or a sentence (example) flashcard?

I care about what words I know for tracking purposes, but I want those words in context when I'm practicing. So the learning system breaks down sentences into lemmas (or dictionary form of a word) and a database of example sentences that the words appear in. This resolves the conceptual tug-of-war for flashcards.

But by removing reference data from the flashcards themselves, I need to integrate reference material directly into Parsnip's UI. JMDict is a great open source project for this, but that only covers a single language. So, I've been keeping my eyes open for people working on extracting the data from Apple's bundled dictionaries.

This has been a community effort that's spanned several years. My contribution is to collect the results, clear up some details about the file format, and package it into a general command-line tool.

References

This is inspired by Reverse-Engineering Apple Dictionary. And the discussion on Hacker News Hacker News: Reverse-Engineering Apple Dictionary (2020). Special thanks to tim-- and enragedcacti who introduced me to binwalk. And dunham who mentioned the random bytes looking like ints of payload sizes.

Additionally, I've found these posts informative:

You might also like...
extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file.

GetTsite python Package extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file. Install $ pip install Get

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Parsel Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with re

Finally decent dictionaries based on Wiktionary for your beloved eBook reader.
Finally decent dictionaries based on Wiktionary for your beloved eBook reader.

eBook Reader Dictionaries Finally, decent dictionaries based on Wiktionary for your beloved eBook reader. Dictionaries Catalan 🚧 Ελληνικά (help welco

Generates password lists/dictionaries based on keywords written in python3.

dicbyru Introduction Generates password lists/dictionaries based on keywords. It uses the keywords and adds capital letters, numbers and special chara

This utility synchronises spelling dictionaries from various tools with each other.

This utility synchronises spelling dictionaries from various tools with each other. This way the words that have been trained on MS Office are also correctly checked in vim or Firefox. And vice versa of course.

Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

Converts XML to Python objects

untangle Documentation Converts XML to a Python object. Siblings with similar names are grouped into a list. Children can be accessed with parent.chil

Python module that makes working with XML feel like you are working with JSON

xmltodict xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec": print(json.dumps(xmltod

Create Open XML PowerPoint documents in Python

python-pptx is a Python library for creating and updating PowerPoint (.pptx) files. A typical use would be generating a customized PowerPoint presenta

The lxml XML toolkit for Python

What is lxml? lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory

PubMed Mapper: A Python library that map PubMed XML to Python object

pubmed-mapper: A Python Library that map PubMed XML to Python object 中文文档 1. Philosophy view UML Programmatically access PubMed article is a common ta

Simple app for visual editing of Page XML files

Name nw-page-editor - Simple app for visual editing of Page XML files. Version: 2021.02.22 Description nw-page-editor is an application for viewing/ed

PAGE XML format collection for document image page content and more
PAGE XML format collection for document image page content and more

PAGE-XML PAGE XML format collection for document image page content and more For an introduction, please see the following publication: http://www.pri

A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file
A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

Udemy Scraper A Web Scraper built with beautiful soup, that fetches udemy course information. Installation Virtual Environment Firstly, it is recommen

A repository that shares tuning results of trained models generated by TensorFlow / Keras. Post-training quantization (Weight Quantization, Integer Quantization, Full Integer Quantization, Float16 Quantization), Quantization-aware training. TensorFlow Lite. OpenVINO. CoreML. TensorFlow.js. TF-TRT. MediaPipe. ONNX. [.tflite,.h5,.pb,saved_model,tfjs,tftrt,mlmodel,.xml/.bin, .onnx]
A python script to convert an ucompressed Gnucash XML file to a text file for Ledger and hledger.

README 1 gnucash2ledger gnucash2ledger is a Python script based on the Github Gist by nonducor (nonducor/gcash2ledger.py). This Python script will tak

Json2Xml tool will help you convert from json COCO format to VOC xml format in Object Detection Problem.

JSON 2 XML All codes assume running from root directory. Please update the sys path at the beginning of the codes before running. Over View Json2Xml t

Txt2Xml tool will help you convert from txt COCO format to VOC xml format in Object Detection Problem.

TXT 2 XML All codes assume running from root directory. Please update the sys path at the beginning of the codes before running. Over View Txt2Xml too

Releases(v0.1.1)
Owner
Joshua Olson
Joshua Olson
Parse URLs for DOIs, PubMed identifiers, PMC identifiers, arXiv identifiers, etc.

citation-url Parse URLs for DOIs, PubMed identifiers, PMC identifiers, arXiv identifiers, etc. This module has a single parse() function that takes in

Charles Tapley Hoyt 2 Feb 12, 2022
Homebase Name Changer for Fortnite: Save the World.

Homebase Name Changer This program allows you to change the Homebase name in Fortnite: Save the World. How to use it? After starting the HomebaseNameC

PRO100KatYT 7 May 21, 2022
Compute the fair market value (FMV) of staking rewards at time of receipt.

tendermint-tax A tool to help calculate the tax liability of staking rewards on Tendermint chains. Specifically, this tool calculates the fair market

5 Jan 07, 2022
New time-based UUID formats which are suited for use as a database key

uuid6 New time-based UUID formats which are suited for use as a database key. This module extends immutable UUID objects (the UUID class) with the fun

26 Dec 30, 2022
Script for generating Hearthstone card spoilers & checklists

This is a script for generating text spoilers and set checklists for Hearthstone. Installation & Running Python 3.6 or higher is required. Copy/clone

John T. Wodder II 1 Oct 11, 2022
✨ Un générateur de lien raccourcis en fonction d'un lien totalement fait en Python par moi, et en français.

Shorter Link ❗ Un générateur de lien raccourcis en fonction d'un lien totalement fait en Python par moi, et en français. Dépendences : pip install pys

MrGabin 3 Jun 06, 2021
A simple, console based nHentai Code Generator

nHentai Code Generator A simple, console based nHentai Code Generator. How to run? Windows Android Windows Make sure you have python and git installed

5 Jun 02, 2022
Local backup made easy, with Python and shutil

KTBackup BETA Local backup made easy, with Python and shutil Features One-command backup and restore Minimalistic (only using stdlib) Convenient direc

kelptaken 1 Dec 27, 2021
Python HTTP Agent Parser

Features Fast Detects OS and Browser. Does not aim to be a full featured agent parser Will not turn into django-httpagentparser ;) Usage import ht

Shekhar 213 Dec 06, 2022
Check the basic quality of any dataset

Data Quality Checker in Python Check the basic quality of any dataset. Sneak Peek Read full tutorial at Medium. Explore the app Requirements python 3.

MalaDeep 8 Feb 23, 2022
Casefy (/keɪsfaɪ/) is a lightweight Python package to convert the casing of strings

Casefy (/keɪsfaɪ/) is a lightweight Python package to convert the casing of strings. It has no third-party dependencies and supports Unicode.

Diego Miguel Lozano 12 Jan 08, 2023
Greenery - tools for parsing and manipulating regular expressions

Greenery - tools for parsing and manipulating regular expressions

qntm 242 Dec 15, 2022
Simple profile athena generator for Fortnite Private Servers.

Profile-Athena-Generator A simple profile athena generator for Fortnite Private Servers. This profile athena generrator features: Item variants Get al

Fevers 10 Aug 27, 2022
Experimental python optimistic rollup fraud-proof generation

Macula Experimental python optimistic rollup fraud-proof generation tech by @protolambda. Working on a python version for brevity and simplicity. See

Diederik Loerakker 30 Sep 01, 2022
A meme error handler for python

Pwython OwO what's this? Pwython is project aiming to fill in one of the biggest problems with python, which is that it is slow lacks owoified text. N

SystematicError 23 Jan 15, 2022
A small utility that sorts your files.

FileSorter A small utility that sorts your files. TODO: Scan directory to find files(thanks @corruptmemry for this!) Split extensions to determine fil

2 Jun 16, 2022
This utility lets you draw using your laptop's touchpad on Linux.

FingerPaint This utility lets you draw using your laptop's touchpad on Linux. Pressing any key or clicking the touchpad will finish the drawing

Wazzaps 95 Dec 17, 2022
A simulator for xkcd 2529's weirdly concrete problem

What is this? This is a quick hack implementation of a simulator for xkcd 2529's weirdly concrete problem. This is barely tested and I suck at computa

Reuben Steenekamp 6 Oct 27, 2021
API Rate Limit Decorator

ratelimit APIs are a very common way to interact with web services. As the need to consume data grows, so does the number of API calls necessary to re

Tomas Basham 575 Jan 05, 2023
A collection of tools for biomedical research assay analysis in Python.

waltlabtools A collection of tools for biomedical research assay analysis in Python. Key Features Analysis for assays such as digital ELISA, including

Tyler Dougan 1 Apr 18, 2022