Analyse japanese ebooks using MeCab to determine the difficulty level for japanese learners

Overview

japanese-ebook-analysis

This aim of this project is to make analysing the contents of a japanese ebook easy and streamline the process for non-technical users. You can analyse an ebook, and see the following information:

  • The length of the book in words
  • The length of the book in characters
  • The number of unique words used in the book
  • The number of unique words that are only used once in the book
  • The percentage of unique words that are only used once
  • The number of unique characters used
  • The number of unique characters that are only used once
  • The percentage of unique characters that are only used once
  • A list of all the words used in the book as well as how often they are used
  • A list of all the characters used in the book as well as how often they are used

For text processing, we use MeCab

Usage

Currently, the project is not deployed anywhere, so to use the service, you will need to follow the steps below in the development section to get the server running.

  1. Upload a .epub file containing japanese text to the server
  2. The server will redirect you to a page showing you information about the ebook. You can then also click the 'See more details' button to see all the generated data, including a list of all the words used together with how many occurences there are for each word, and the same for the characters as well.

Development

  1. Clone repository: git clone https://github.com/christofferaakre/japanese-ebook-analysis.git
  2. Make sure you have mecab set up on your system. See http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/
    (Only required if you will actually upload ebooks or run the analyse_epub.py script), which you will not need to do to contribute to other parts of the app. for a good guide on how to set it up.
  3. Install python dependencies: pip install -r requirements.txt
  4. Install other dependencies (these all need to be in your system path):
    • pandoc
  5. Run ./app.py to start the flask dev server

Contributing

I'm very happy for any happy contributions! Before contributing, please have a look at CONTRIBUTING.md.

To see what needs work on, have a look at the repo's Issues and its Pull requests.

Feel free to submit your own issue or pull request about a new feature or anything else. When submitting a pull request, don't be afraid to modify any of the files; I'm not very attached to the coding style used in the repo.

Owner
Christoffer Aakre
Christoffer Aakre
Translators - is a library which aims to bring free, multiple, enjoyable translation to individuals and students in Python

Translators - is a library which aims to bring free, multiple, enjoyable translation to individuals and students in Python

UlionTse 907 Dec 27, 2022
This is an incredibly powerful calculator that is capable of many useful day-to-day functions.

Description 💻 This is an incredibly powerful calculator that is capable of many useful day-to-day functions. Such functions include solving basic ari

Jordan Leich 37 Nov 19, 2022
This repository is home to the Optimus data transformation plugins for various data processing needs.

Transformers Optimus's transformation plugins are implementations of Task and Hook interfaces that allows execution of arbitrary jobs in optimus. To i

Open Data Platform 37 Dec 14, 2022
This is Assignment1 code for the Web Data Processing System.

This is a Python program to Entity Linking by processing WARC files. We recognize entities from web pages and link them to a Knowledge Base(Wikidata).

3 Dec 04, 2022
Easy, fast, effective, and automatic g-code compression!

Getting to the meat of g-code. Easy, fast, effective, and automatic g-code compression! MeatPack nearly doubles the effective data rate of a standard

Scott Mudge 97 Nov 21, 2022
Binary LSTM model for text classification

Text Classification The purpose of this repository is to create a neural network model of NLP with deep learning for binary classification of texts re

Nikita Elenberger 1 Mar 11, 2022
This is a Prototype of an Ai ChatBot "Tea and Coffee Supplier" using python.

Ai-ChatBot-Python A chatbot is an intelligent system which can hold a conversation with a human using natural language in real time. Due to the rise o

1 Oct 30, 2021
News-Articles-and-Essays - NLP (Topic Modeling and Clustering)

NLP T5 Project proposal Topic Modeling and Clustering of News-Articles-and-Essays Students: Nasser Alshehri Abdullah Bushnag Abdulrhman Alqurashi OVER

2 Jan 18, 2022
Codes to pre-train Japanese T5 models

t5-japanese Codes to pre-train a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts. The model is available at https://hug

Megagon Labs 37 Dec 25, 2022
Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph",

K-BERT Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph", which is implemented based on the UER framework. R

Weijie Liu 834 Jan 09, 2023
NLP: SLU tagging

NLP: SLU tagging

北海若 3 Jan 14, 2022
Natural language Understanding Toolkit

Natural language Understanding Toolkit TOC Requirements Installation Documentation CLSCL NER References Requirements To install nut you need: Python 2

Peter Prettenhofer 119 Oct 08, 2022
Python generation script for BitBirds

BitBirds generation script Intro This is published under MIT license, which means you can do whatever you want with it - entirely at your own risk. Pl

286 Dec 06, 2022
Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products

Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products.

Leah Pathan Khan 2 Jan 12, 2022
An open source library for deep learning end-to-end dialog systems and chatbots.

DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. DeepPavlov is designed for development of production re

Neural Networks and Deep Learning lab, MIPT 6k Dec 30, 2022
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 01, 2022
Simplified diarization pipeline using some pretrained models - audio file to diarized segments in a few lines of code

simple_diarizer Simplified diarization pipeline using some pretrained models. Made to be a simple as possible to go from an input audio file to diariz

Chau 65 Dec 30, 2022
Japanese synonym library

chikkarpy chikkarpyはchikkarのPython版です。 chikkarpy is a Python version of chikkar. chikkarpy は Sudachi 同義語辞書を利用し、SudachiPyの出力に同義語展開を追加するために開発されたライブラリです。

Works Applications 48 Dec 14, 2022
Share constant definitions between programming languages and make your constants constant again

Introduction Reconstant lets you share constant and enum definitions between programming languages. Constants are defined in a yaml file and converted

Natan Yellin 47 Sep 10, 2022
Behavioral Testing of Clinical NLP Models

Behavioral Testing of Clinical NLP Models This repository contains code for testing the behavior of clinical prediction models based on patient letter

Betty van Aken 2 Sep 20, 2022