Brownant is a web data extracting framework.

Overview

Build Status Coverage Status PyPI Version Downloads Wheel Status

Brownant

Brownant is a lightweight web data extracting framework.

Who uses it?

At the moment, dongxi.douban.com (a.k.a. Douban Dongxi) uses Brownant in production environment.

Installation

$ pip install brownant

Links

Issues

If you want to report bugs or request features, please create issues on GitHub Issues.

Contributes

You can send a pull reueqst on GitHub.

You might also like...
a small library for extracting rich content from urls
a small library for extracting rich content from urls

A small library for extracting rich content from urls. what does it do? micawber supplies a few methods for retrieving rich metadata about a variety o

A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.

The project is based on older versions of tesseract and other tools, and is now superseded by another project which allows for more granular control o

A machine learning software for extracting information from scholarly documents

GROBID GROBID documentation Visit the GROBID documentation for more detailed information. Summary GROBID (or Grobid, but not GroBid nor GroBiD) means

A Poetry plugin for dynamically extracting the package version.

Poetry Version Plugin A Poetry plugin for dynamically extracting the package version. It can read the version from a file __init__.py with: # __init__

discovering subdomains, hidden paths, extracting unique links

python-website-crawler discovering subdomains, hidden paths, extracting unique links pip install -r requirements.txt discover subdomain: You can give

Quick script for automatically extracting syscall numbers for an OS

Syscalls-Extractor Quick script for automatically extracting syscall numbers for an OS $ python3 .\syscalls-extractor.py --help usage: syscalls-extrac

Extracting Summary Knowledge Graphs from Long Documents

GraphSum This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other basel

A python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

PRIN/SPRIN: On Extracting Point-wise Rotation Invariant Features
PRIN/SPRIN: On Extracting Point-wise Rotation Invariant Features

PRIN/SPRIN: On Extracting Point-wise Rotation Invariant Features Overview This repository is the Pytorch implementation of PRIN/SPRIN: On Extracting P

Python script for extracting audio from video files and creating Mel spectrograms

video2spectrogram About This package is meant to automate the process of extracting audio files from videos and saving the plots computed from these a

Sample scripts to show extracting details directly from the AIQUM database

Sample scripts to show extracting details directly from the AIQUM database

 Extracting frames from video and create video using frames
Extracting frames from video and create video using frames

Extracting frames from video and create video using frames This program uses opencv library to extract the frames from video and create video from ext

An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text This repo aims at providing an easy to use and efficient code for extracting image &

Malicious Document IoC Extractor is a collection of scripts that helps extracting IoCs from various maldoc families.

MDIExtractor Malicious Document IoC Extractor (MDIExtractor) is a collection of scripts that helps extracting IoCs from various maldoc families. Prere

Analysiscsv.py for extracting analysis and exporting as CSV
Analysiscsv.py for extracting analysis and exporting as CSV

wcc_analysis Lichess page documentation: https://lichess.org/page/world-championships Each WCC has a study, studies are fetched using: https://lichess

This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

Utility for Extracting all passwords from ConnectWise Automate
Utility for Extracting all passwords from ConnectWise Automate

CWA Password Extractor Utility for Extracting all passwords from ConnectWise Automate (E.g. while migrating to a new system). Outputs a csv file with

A Telegram bot to extracting text from images. All languages supported.

OCR Bot A Telegram bot to extracting text from images. All languages supported. Deploy to Heroku Local Deploying Clone the repo git clone https://gith

WIP: extracting Geometry utilities from datacube-core

odc.geo This is still work in progress. This repository contains geometry related code extracted from Open Datacube. For details and motivation see OD

Comments
  • Provide some example code?

    Provide some example code?

    Right now in order to know the purpose framework('data extracting framework' is accurate enough but is a vague term without examples) we have to click though 'Documents' and then go to this page. The usage and purpose of this framework should be explicitly displayed on README.md and the demo scripts should be put in repository. My two cents.

    enhancement 
    opened by CNBorn 3
  • change the lib desc: crawling -> extracting.

    change the lib desc: crawling -> extracting.

    With the suggest of @CNBorn, this branch changed the library description by using "extracting" instead of "crawling".

    @CNBorn @VeryCB Please review my changeset. Thank you.

    opened by tonyseek 3
  • fix using lxml and requests gets unicode error.

    fix using lxml and requests gets unicode error.

    when I use The Declarative Demo,I receive this error - Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

    use content instead of text, it can work:-)

    opened by Caratpine 1
  • Known Bugs

    Known Bugs

    • [x] The UnicodeDecodeError will be raised while path_info including non-ascii characters. Convert it into ascii string with urllib.quote.
    • [x] The url.hostname may be None for invalid URLs. Prevent those invalid input.
    bug 
    opened by tonyseek 1
Releases(0.1.6)
Zotero2Readwise - A Python Library to retrieve annotations and notes from Zotero and upload them to your Readwise

Zotero ➡️ Readwise zotero2readwise is a Python library that retrieves all Zotero

Essi Alizadeh 49 Dec 20, 2022
fast python port of arc90's readability tool, updated to match latest readability.js!

python-readability Given a html document, it pulls out the main body text and cleans it up. This is a python port of a ruby port of arc90's readabilit

Yuri Baburov 2.2k Dec 28, 2022
RSS feed generator website with user friendly interface

RSS feed generator website with user friendly interface

Alexandr Nesterenko 331 Jan 02, 2023
Web Content Retrieval for Humans™

Lassie Lassie is a Python library for retrieving basic content from websites. Usage import lassie lassie.fetch('http://www.youtube.com/watch?v

Mike Helmick 571 Dec 29, 2022
Pythonic HTML Parsing for Humans™

Requests-HTML: HTML Parsing for Humans™ This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. When us

Python Software Foundation 12.9k Jan 01, 2023
a small library for extracting rich content from urls

A small library for extracting rich content from urls. what does it do? micawber supplies a few methods for retrieving rich metadata about a variety o

Charles Leifer 588 Dec 27, 2022
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 3k Jan 03, 2023
Combine XPath, CSS Selectors and JSONPath for Web data extracting.

Data Extractor Combine XPath, CSS Selectors and JSONPath for Web data extracting. Quickstarts Installation Install the stable version from PYPI. pip i

林玮 (Jade Lin) 27 Oct 22, 2022
Fast and robust date extraction from web pages, with Python or on the command-line

Find original and updated publication dates of any web page. From the command-line or within Python, all the steps needed from web page download to HTML parsing, scraping, and text analysis are inclu

Adrien Barbaresi 60 Dec 14, 2022
Open clone of OpenAI's unreleased WebText dataset scraper.

Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.

Joshua C Peterson 471 Dec 30, 2022
News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

Lucas Ou-Yang 12.3k Jan 01, 2023
Every web site provides APIs.

Toapi Overview Toapi give you the ability to make every web site provides APIs. Version v2.0.0, Completely rewrote. More elegant. More pythonic v1.0.0

Jiuli Gao 3.3k Jan 05, 2023
Convert HTML to Markdown-formatted text.

html2text html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to

Alireza Savand 1.3k Dec 31, 2022
Brownant is a web data extracting framework.

Brownant Brownant is a lightweight web data extracting framework. Who uses it? At the moment, dongxi.douban.com (a.k.a. Douban Dongxi) uses Brownant i

Douban Inc. 157 Jan 06, 2022
Export your data from Xiami

Xiami Exporter 导出虾米音乐的个人数据,功能: 导出歌曲为 json 收藏歌曲 收藏专辑 播放列表 导出收藏艺人为 json 导出收藏专辑为 json 导出播放列表为 json (个人和收藏) 将导出的数据整理至 sqlite 数据库 收藏歌曲 收藏艺人 收藏专辑 播放列表 下载已导出

Xiao Meng 59 Nov 13, 2021
Web-Extractor - Simple Tool To Extract IP-Adress From Website

IP-Adress Extractor Simple Tool To Extract IP-Adress From Website Socials: Langu

ميخائيل 7 Jan 16, 2022
Github Actions采集RSS, 打造无广告内容优质的头版头条超赞宝藏页

Github Actions Rss (garss, 嘎RSS! 已收集69个RSS源, 生成时间: 2021-02-26 11:23:45) 信息茧房是指人们关注的信息领域会习惯性地被自己的兴趣所引导,从而将自己的生活桎梏于像蚕茧一般的“茧房”中的现象。

zhaoolee 721 Jan 02, 2023