Python tutorial for implementing Oxylabs' Residential Proxies with AIOHTTP

Overview

Integrating Oxylabs' Residential Proxies with AIOHTTP

Requirements for the Integration

For the integration to work, you'll need the aiohttp library, Python 3.6 or higher, and Residential Proxies.
If you don't have the aiohttp library installed, you can install it with pip:

pip install aiohttp

You can get Residential Proxies here: https://oxylabs.io/products/residential-proxy-pool

Proxy Authentication

There are two ways to authenticate proxies with aiohttp.
The first is to pass the credentials separately from the proxy URL using aiohttp.BasicAuth:

import aiohttp

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"

async def fetch():
    async with aiohttp.ClientSession() as session:
        proxy_auth = aiohttp.BasicAuth(USER, PASSWORD)
        async with session.get(
                "http://ip.oxylabs.io",
                proxy=f"http://{END_POINT}",
                proxy_auth=proxy_auth,
        ) as resp:
            print(await resp.text())

The second is to pass the authentication credentials in the proxy URL itself:

import aiohttp

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"

async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get(
                "http://ip.oxylabs.io",
                proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
        ) as resp:
            print(await resp.text())
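
Note that fetch() is a coroutine, so neither snippet runs on its own. A minimal way to execute it (assuming Python 3.7+, where asyncio.run() is available) looks like this:

import asyncio

# Run the fetch() coroutine defined in either example above.
asyncio.run(fetch())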

To use your own proxies, replace the user and pass values with your Oxylabs account credentials.

Testing Proxies

To see if the proxy is working, try visiting https://ip.oxylabs.io. If everything is working correctly, it will return the IP address of the proxy you're currently using.
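
You can run the same check from code. Below is a minimal sketch that reuses the credentials from the examples above and prints the exit IP assigned to the request:

import asyncio

import aiohttp

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"


async def check_proxy():
    async with aiohttp.ClientSession() as session:
        async with session.get(
                "http://ip.oxylabs.io",
                proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
        ) as resp:
            # Each request through the residential pool may exit
            # through a different IP address.
            print(f"Proxy IP: {await resp.text()}")


asyncio.run(check_proxy())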

Sample Project: Extracting Data From Multiple Pages

To better understand how residential proxies can be used for asynchronous data extraction, we wrote a sample project that scrapes product listing data and saves the output to a CSV file. Proxy rotation lets us send multiple requests at once with little risk of CAPTCHAs or IP blocks, which makes the web scraping process fast and efficient: you can extract data from thousands of products in a matter of seconds.

import asyncio
import time
import sys
import os

import aiohttp
import pandas as pd
from bs4 import BeautifulSoup

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"

# Generate a list of URLs to scrape.
url_list = [
    f"https://books.toscrape.com/catalogue/category/books_1/page-{page_num}.html"
    for page_num in range(1, 51)
]


async def parse_data(text, results_list):
    soup = BeautifulSoup(text, "lxml")
    for product_data in soup.select("ol.row > li > article.product_pod"):
        data = {
            "title": product_data.select_one("h3 > a")["title"],
            "url": product_data.select_one("h3 > a").get("href")[5:],
            "product_price": product_data.select_one("p.price_color").text,
            "stars": product_data.select_one("p")["class"][1],
        }
        results_list.append(data)  # Fill results_list by reference.
        print(f"Extracted data for a book: {data['title']}")


async def fetch(session, sem, url, results_list):
    async with sem:
        async with session.get(
            url,
            proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
        ) as response:
            await parse_data(await response.text(), results_list)


async def create_jobs(results_list):
    sem = asyncio.Semaphore(4)  # Limit concurrency to 4 simultaneous requests.
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *[fetch(session, sem, url, results_list) for url in url_list]
        )


if __name__ == "__main__":
    results = []
    start = time.perf_counter()

    # Different EventLoopPolicy must be loaded if you're using Windows OS.
    # This helps to avoid "Event Loop is closed" error.
    if sys.platform.startswith("win") and sys.version_info >= (3, 8):
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

    try:
        asyncio.run(create_jobs(results))
    except Exception as e:
        print(e)
        print("We broke, but there might still be some results")

    print(
        f"\nTotal of {len(results)} products from {len(url_list)} pages "
        f"gathered in {time.perf_counter() - start:.2f} seconds.",
    )
    df = pd.DataFrame(results)
    df["url"] = df["url"].map(
        lambda x: "".join(["https://books.toscrape.com/catalogue", x])
    )
    filename = "scraped-books.csv"
    df.to_csv(filename, encoding="utf-8-sig", index=False)
    print(f"\nExtracted data can be found at {os.path.join(os.getcwd(), filename)}")

If you want to test the project's script yourself, you'll need to install some additional packages. To do that, download the requirements.txt file and use pip:

pip install -r requirements.txt

If you're having any trouble integrating proxies with aiohttp and this guide didn't help, feel free to contact Oxylabs customer support at [email protected].
