PyQuery-based scraping micro-framework.

Related tags

Web Crawlingdemiurge
Overview

demiurge

PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x.

Build Status

Documentation: http://demiurge.readthedocs.org

Installing demiurge

$ pip install demiurge

Quick start

Define items to be scraped using a declarative (Django-inspired) syntax:

import demiurge

class TorrentDetails(demiurge.Item):
    label = demiurge.TextField(selector='strong')
    value = demiurge.TextField()

    def clean_value(self, value):
        unlabel = value[value.find(':') + 1:]
        return unlabel.strip()

    class Meta:
        selector = 'div#specifications p'

class Torrent(demiurge.Item):
    url = demiurge.AttributeValueField(
        selector='td:eq(2) a:eq(1)', attr='href')
    name = demiurge.TextField(selector='td:eq(2) a:eq(2)')
    size = demiurge.TextField(selector='td:eq(3)')
    details = demiurge.RelatedItem(
        TorrentDetails, selector='td:eq(2) a:eq(2)', attr='href')

    class Meta:
        selector = 'table.maintable:gt(0) tr:gt(0)'
        base_url = 'http://www.mininova.org'


>>> t = Torrent.one('/search/ubuntu/seeds')
>>> t.name
'Ubuntu 7.10 Desktop Live CD'
>>> t.size
u'695.81\xa0MB'
>>> t.url
'/get/1053846'
>>> t.html
u'<td>19\xa0Dec\xa007</td><td><a href="/cat/7">Software</a></td><td>...'

>>> results = Torrent.all('/search/ubuntu/seeds')
>>> len(results)
116
>>> for t in results[:3]:
...     print t.name, t.size
...
Ubuntu 7.10 Desktop Live CD 695.81 MB
Super Ubuntu 2008.09 - VMware image 871.95 MB
Portable Ubuntu 9.10 for Windows 559.78 MB
...

>>> t = Torrent.one('/search/ubuntu/seeds')
>>> for detail in t.details:
...     print detail.label, detail.value
... 
Category: Software > GNU/Linux
Total size: 695.81 megabyte
Added: 2467 days ago by Distribution
Share ratio: 17 seeds, 2 leechers
Last updated: 35 minutes ago
Downloads: 29,085

See documentation for details: http://demiurge.readthedocs.org

Why demiurge?

Plato, as the speaker Timaeus, refers to the Demiurge frequently in the Socratic dialogue Timaeus, c. 360 BC. The main character refers to the Demiurge as the entity who "fashioned and shaped" the material world. Timaeus describes the Demiurge as unreservedly benevolent, and hence desirous of a world as good as possible. The world remains imperfect, however, because the Demiurge created the world out of a chaotic, indeterminate non-being.

http://en.wikipedia.org/wiki/Demiurge

Contributors

  • Martín Gaitán (@mgaitan)
Comments
  • Reausable cleaning functions

    Reausable cleaning functions

    You can now add a "clean" kwarg containing a function to a field.

    This makes it easy to use quick filtering (I want this data to be an int) and to re-use functions such as parsedatetime.

        score = demiurge.TextField(selector=".score .upvoted", clean=int)
    
    opened by traverseda 5
  • proof of concept: subitem field

    proof of concept: subitem field

    short rationale: Sometimes I need to scrap a page to retrieve the actual links where the items are. I would like a way to nest Item classes, analog (in some way) to a in ForeignKey / ManyToManyField in Django.

    This is a first PR as a proof of concept, to discuss the idea and its API.

    opened by mgaitan 5
  • RelatedItems only work across urls

    RelatedItems only work across urls

    An obvious use of RelatedItems (or a similar construct) is recursively mapping a comment tree. Right now there's no elegant way to do that.

    An example

    http://pastebin.com/WDL4RjkE


    Reading through the actual code, I think I might be wrong about this. I'll try and make the docs clearer.

    opened by traverseda 2
  • Use lib

    Use lib "requests" for downloading

    I'm right now making use of https://pypi.python.org/pypi/requests-cache which creates a cache of the downloaded stuff magically, and it's awesome. So, I would like to be able to take advantage of it using demiurge.

    I don't know if just as an option or as a replacement of pyquery downloader.

    What do you think?

    opened by jmansilla 2
  • docs: fix simple typo, ocurrence -> occurrence

    docs: fix simple typo, ocurrence -> occurrence

    There is a small typo in docs/index.rst.

    Should read occurrence rather than ocurrence.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 1
  • Fix when no selector defined

    Fix when no selector defined

    the default selector is the whole page ('html') but this is applied through PyQuery.find wich traverses down. example:

    In [2]: PyQuery('<html>hello</html>').find('html')
    Out[2]: []
    
    In [3]: PyQuery('<html>hello</html>')('html')
    Out[3]: [<html>]
    
    opened by mgaitan 1
  • support self reference in RelatedItem

    support self reference in RelatedItem

    RelatedItem('self'). Also, the relateditem's item class could be given by its name (i.eRelatedItem("ItemClass")` ) A typical use case is a listing page with a "next page" link.

    opened by mgaitan 0
Releases(v0.2)
Poolbooru gelscraper - a simple python script for scraping images off gelbooru pools.

poolbooru_gelscraper a simple python script for scraping images off gelbooru pools. modules required:requests_html, and os by default saves files with

savantshuia 1 Jan 02, 2022
河南工业大学 完美校园 自动校外打卡

HAUT-checkin 河南工业大学自动校外打卡 由于github actions存在明显延迟,建议直接使用腾讯云函数 特点 多人打卡 使用简单,仅需账号密码以及用于微信推送的uid 自动获取上一次打卡信息用于打卡 向所有成员微信单独推送打卡状态 完美校园服务器繁忙时造成打卡失败会自动重新打卡

36 Oct 27, 2022
Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers

Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers.

Louie Cai 13 Oct 15, 2022
Auto Join: A GitHub action script to automatically invite everyone to the organization who star your repository.

Auto Invite To The Organization By Star A GitHub Action script to automatically invite everyone to your organization that stars your repository. What

Max Base 11 Dec 11, 2022
一个m3u8视频流下载脚本

一个Python的m3u8流视频下载脚本 介绍 m3u8流视频日益常见,目前好用的下载器也有很多,我把之前自己写的一个小脚本分享出来,供广大网友使用。写此程序的目的在于给视频下载爱好者提供一个下载样例,可直接调用,勿再重复造轮子。 使用方法 在python中直接运行程序或进行外部调用 import

Nchu 0 Oct 10, 2021
An IpVanish Proxies Scraper

EzProxies Tired of searching for good proxies for hours? Just get an IpVanish account and get thousands of good proxies in few seconds! Showcase Watch

11 Nov 13, 2022
Instagram profile scrapper with python

IG Profile Scrapper Instagram profile Scrapper Just type the username, and boo! :D Instalation clone this repo to your computer git clone https://gith

its Galih 6 Nov 07, 2022
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

Adrien Barbaresi 704 Jan 06, 2023
a small library for extracting rich content from urls

A small library for extracting rich content from urls. what does it do? micawber supplies a few methods for retrieving rich metadata about a variety o

Charles Leifer 588 Dec 27, 2022
robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

Joshua Carp 3.7k Dec 27, 2022
AssistScraper - program for /r/nba to use to find list of all players a player assisted and how many assists each player recieved

AssistScraper - program for /r/nba to use to find list of all players a player assisted and how many assists each player recieved

5 Nov 25, 2021
Tool to scan for secret files on HTTP servers

snallygaster Finds file leaks and other security problems on HTTP servers. what? snallygaster is a tool that looks for files accessible on web servers

Hanno Böck 2k Dec 28, 2022
京东秒杀商品抢购Python脚本

Jd_Seckill 非常感谢原作者 https://github.com/zhou-xiaojun/jd_mask 提供的代码 也非常感谢 https://github.com/wlwwu/jd_maotai 进行的优化 主要功能 登陆京东商城(www.jd.com) cookies登录 (需要自

Andy Zou 1.5k Jan 03, 2023
A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

Xuye (Chris) Qin 1.5k Jan 04, 2023
Using Python and Pushshift.io to Track stocks on the WallStreetBets subreddit

wallstreetbets-tracker Using Python and Pushshift.io to Track stocks on the WallStreetBets subreddit.

91 Dec 08, 2022
Current Antarctic large iceberg positions derived from ASCAT and OSCAT-2

Iceberg Locations Antarctic large iceberg positions derived from ASCAT and OSCAT-2. All data collected here are from the NASA SCP website Overview Thi

Joel Hanson 5 Jul 27, 2022
A package that provides you Latest Cyber/Hacker News from website using Web-Scraping.

cybernews A package that provides you Latest Cyber/Hacker News from website using Web-Scraping. Latest Cyber/Hacker News Using Webscraping Developed b

Hitesh Rana 4 Jun 02, 2022
🐞 Douban Movie / Douban Book Scarpy

Python3-based Douban Movie/Douban Book Scarpy crawler for cover downloading + data crawling + review entry.

Xingbo Jia 1 Dec 03, 2022
Extract gene TSS site form gencode/ensembl/gencode database GTF file and export bed format file.

GetTss python Package extract gene TSS site form gencode/ensembl/gencode database GTF file and export bed format file. Install $ pip install GetTss Us

laojunjun 6 Nov 21, 2022
对于有验证码的站点爆破,用于安全合法测试

使用方法 python3 main.py + 配置好的文件 python3 main.py Verify.json python3 main.py NoVerify.json 以上分别对应有验证码的demo和无验证码的demo Tips: 你可以以域名作为配置文件名字加载:python3 main

47 Nov 09, 2022