A Scrapy spider that uses Postgres as the database, Squid as a proxy server, Redis for de-duplication, and Splash to render JavaScript, all in a microservices architecture built with Docker and Docker Compose.

Overview

This is George's Scraping Project

  • To get started, cd into the theZoo directory and run:

  • chmod +x script.sh

  • then: ./script.sh

  • This will spin up a Postgres container, the Python environment, a Redis container, a Squid container (for the proxy) and a Splash container

  • The Docker container automatically runs the JS spider, which is the most complicated one. The other spiders are located under the spider directory, and there are some tests under the /validate directory. These tests use pandas to run SQL queries against Postgres to confirm that the data was added to the DB.

  • The project took me 2 days to complete. I spent most of my time learning about Docker Compose and container networking, as well as the rotating proxies and user agents people add to their spiders.

Below I have outlined the steps I took as I completed the project

Docker

  • I downloaded the Docker Desktop application for macOS
  • Then, as I read through the PDF, I looked up Docker images for the technologies used and found images for Postgres, Squid, Splash and Redis

Python Environment

  • I set up a Python virtual environment in my IDE and developed the whole project there, so that my packages stayed isolated and did not conflict with the global packages on my machine. Once I had finished and tested the spiders to make sure they worked properly, I dockerized everything and zipped it up to turn in
  • Packages I downloaded: pip, setuptools, wheel, Scrapy, Pandas, SQLAlchemy, scrapy-splash, scrapy-redis and psycopg2-binary
  • I created a requirements.txt file listing my package versions from pip, so the environment is easy to replicate
  • The models.py file contains the SQLAlchemy code and the database schema
  • The pipelines.py file is where our data is sent to Postgres
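
The pipeline step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses the standard library's sqlite3 as a stand-in for Postgres and SQLAlchemy so that it runs without any services, and the class name QuotePipeline, the table name quotes, and the fields text and author are all assumptions.

```python
import sqlite3


class QuotePipeline:
    """Sketch of a Scrapy item pipeline that stores each scraped item in a table.

    sqlite3 stands in for Postgres/SQLAlchemy here; all names are hypothetical.
    """

    def __init__(self, database=":memory:"):
        self.database = database
        self.connection = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.connection = sqlite3.connect(self.database)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)"
        )

    def process_item(self, item, spider):
        # Scrapy calls this for every item the spider yields.
        self.connection.execute(
            "INSERT INTO quotes (text, author) VALUES (?, ?)",
            (item["text"], item["author"]),
        )
        self.connection.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        self.connection.close()
```

In a Scrapy project, a class like this is enabled by listing it in the ITEM_PIPELINES setting.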

The Default Spider

This crawler grabs quotes from the Default endpoint using pagination.

The data is scraped and sent to Postgres, and also written to a JSON file called items.json.

The Scroll Spider

This crawler uses scrolling to grab quotes from the Scroll endpoint.

Previously I had used a Puppeteer-like bot where you can specify how far the bot should scroll to load the data you want. I did not know how to do that with Scrapy, so I looked for an alternative method. I found that the data is still paginated in the underlying request. When you inspect the page in Chrome's developer tools, a console log names the page you are on, so I looked at the request body and found how the data was being loaded. At this point I could have used the requests library, but instead I worked out how to do it with Scrapy. This scraper works the same way as the default one: the page number is added to the end of the URL to retrieve the next batch of data.
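
The page-by-page approach described above might look something like this sketch. The endpoint URL and the JSON field names (page, has_next, quotes) are assumptions for illustration; the real values are whatever the browser's network tab shows for the Scroll endpoint.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the URL seen in the network tab.
API_URL = "https://example.com/api/quotes"


def page_url(page):
    """Build the URL for one page of results."""
    return f"{API_URL}?{urlencode({'page': page})}"


def parse_page(body):
    """Return (quotes, next_url) for one JSON response body.

    Assumes a payload shaped like
    {"page": 1, "has_next": true, "quotes": [...]}.
    """
    data = json.loads(body)
    next_url = page_url(data["page"] + 1) if data.get("has_next") else None
    return data["quotes"], next_url
```

Inside a Scrapy callback, the spider would yield each quote and, while next_url is not None, yield a new scrapy.Request for next_url with the same callback.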

The JS Spider

This crawler uses a JS rendering service called Splash to query the JavaScript endpoint in order to grab the quotes.

I had to add Splash-specific middlewares to the Scrapy settings to make this work. I also defined a Splash service in my Docker Compose file to hold the Splash instance. After that, the scraping worked just like the default spider.
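
For reference, the middleware entries documented by scrapy-splash look roughly like the fragment below. The host name splash in SPLASH_URL is an assumption (it presumes the Compose service is named splash); the project's actual settings may differ.

```python
# settings.py (fragment)

# Assumes the Compose service is named "splash"; 8050 is Splash's default HTTP port.
SPLASH_URL = "http://splash:8050"

# Downloader middlewares and priorities as listed in the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that de-duplicates repeated Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
```

With these in place, a spider yields scrapy_splash.SplashRequest(url, callback, args={"wait": 0.5}) instead of a plain scrapy.Request so that the page is rendered before parsing.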

The Login Spider

This crawler scrapes the input field for the CSRF token. It then submits a form request, authenticates, and scrapes the rest of the data as the default spider does.
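
The token-scraping step could be sketched with the standard library alone, as below. The field names csrf_token, username and password are assumptions about the login form, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Collect the value of the <input> whose name matches field_name."""

    def __init__(self, field_name="csrf_token"):
        super().__init__()
        self.field_name = field_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == self.field_name:
            self.token = attributes.get("value")


def build_login_form(html, username, password):
    """Return the form fields to POST, including the scraped CSRF token."""
    parser = CsrfTokenParser()
    parser.feed(html)
    if parser.token is None:
        raise ValueError("no CSRF token found in the login page")
    return {
        "csrf_token": parser.token,
        "username": username,
        "password": password,
    }
```

In Scrapy itself, FormRequest.from_response(response, formdata={...}) pre-fills hidden inputs such as the token from the page's form, so only the credentials need to be supplied.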

Notes

  • In the settings file I added a realistic browser user agent so the requests look like they come from a real person. I also added the item pipeline and some configuration for the Docker containers, plus a download delay of 2 seconds so that the scraper does not scrape too fast.

  • Adding the proxy was a bit tricky for me. I tried a project called Scylla, but it did not work with my environment, so I looked for alternatives. I ended up using Squid: I created a Docker image for it and added the proxy configuration in the middleware.py file.

  • The pause/resume scraping functionality comes from setting SCHEDULER_PERSIST to True in the settings, using the scrapy-redis package.

  • Before containerizing this application I had never used Docker Compose, SQLAlchemy or Redis, so I learned them quickly in order to integrate them into my project.
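
The settings and proxy middleware mentioned in these notes might look roughly like the sketch below. The user agent string, the host names redis and squid (assumed Compose service names), and the class name SquidProxyMiddleware are all illustrative assumptions; the scrapy-redis setting names are the ones documented by that package.

```python
# settings.py (fragment)
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)  # example value only
DOWNLOAD_DELAY = 2  # seconds to wait between requests

# scrapy-redis: keep the request queue and the seen-request set in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # do not clear the queue when the spider stops
REDIS_URL = "redis://redis:6379"  # assumes a Compose service named "redis"


# middleware.py (fragment)
class SquidProxyMiddleware:
    """Hypothetical downloader middleware that routes requests through Squid."""

    def __init__(self, proxy_url="http://squid:3128"):
        # Assumes a Compose service named "squid"; 3128 is Squid's default port.
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads this meta key.
        request.meta["proxy"] = self.proxy_url
        return None  # let Scrapy continue processing the request
```

Because the queue persists in Redis, stopping the spider and starting it again resumes from the pending requests instead of starting over.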

Potential Features in the Future

  • I did not collect much metadata, but I saw a package called scrapy-magicfields and would have liked to use it to add timestamps and scraped URLs to the DB items

  • I did not set up GUI tools for Postgres and Redis to make the data easier to view; this would have been a nice addition

  • Since only the JS spider is triggered by the script and the others are run manually, I only set up a single table. For a more distributed process, I think separate models and tables for each spider would have been better. I wanted to reuse the code, so I left it as it is.

  • Cron job functionality

Owner
George Reyes
currently looking for a job