Build your own proxy pool with Kuaidaili (mom no longer has to worry about my IP being blocked)
2022-06-26 06:13:00 【An Muxi】
Build your own IP proxy pool with Kuaidaili
Kuaidaili URL: https://www.kuaidaili.com/free
Note: this is only a record of my own learning!!!
Do not use it for commercial purposes.
The proxy pools available online are not very beginner-friendly, so I built my own IP proxy pool; now I don't have to worry about my IP getting blocked by anti-scraping measures!!!
Knowledge points:
- Use faker to randomize the User-Agent
- Save the data to MongoDB so it can be retrieved at any time
- Pick a random IP from the database to use in place of the local IP
# -*- coding: utf-8 -*-
# @Time: 2020-10-07 15:20
# @Author: Have a bottle of anmuxi
# @File: build_proxy_pool.py
# @ Start a good day @[email protected]
"""
Brief introduction:
    1. Use Kuaidaili to build your own IP proxy pool
    2. Store the valid crawled IPs in MongoDB
    3. Randomly select one valid IP from MongoDB

Usage:
    1. First call get_proxy(page_num) and assign the result to a variable
    2. Then call read_ip() with no arguments and assign the result to a variable
"""
# ----------------------------------------------------------------------------------------------------------------------
import random
import requests
from lxml import etree
import time
from faker import Factory
from pymongo import MongoClient
# ----------------------------------------- Generate random User-Agent request headers -----------------------------------------------------
def get_user_agent(num):
    """Generate num random User-Agent headers with faker."""
    factory = Factory.create()
    user_agent = []
    for i in range(num):
        user_agent.append({'User-Agent': factory.user_agent()})
    return user_agent
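# Illustrative output of get_user_agent(2) (a sketch; the exact strings vary per run):
#   [{'User-Agent': 'Mozilla/5.0 ...'},
#    {'User-Agent': 'Opera/9.80 ...'}]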
# ----------------------------------------- Crawl proxy IPs from Kuaidaili -----------------------------------------------------
def get_proxy(page_num):
    """Crawl proxy IPs and check their validity.

    :param page_num: the number of pages to crawl
    :return: proxies_list_use: the valid crawled proxy IPs
    """
    headers_list = get_user_agent(5)
    # each record has the format {'http_type': 'HTTP', 'ip_port': 'ip:port'}
    proxies_list = []
    for i in range(1, page_num + 1):
        print('Crawling all proxy IPs on page {}'.format(i))
        headers = random.choice(headers_list)  # pick a random User-Agent per page
        base_url = 'https://www.kuaidaili.com/free/inha/{}/'.format(i)
        page_text = requests.get(url=base_url, headers=headers).text
        tree1 = etree.HTML(page_text)
        tr_list = tree1.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')
        for tr in tr_list:
            http_type = tr.xpath('./td[@data-title="类型"]/text()')[0]
            ip = tr.xpath('./td[@data-title="IP"]/text()')[0]
            port = tr.xpath('./td[@data-title="PORT"]/text()')[0]
            proxies_list.append({'http_type': http_type, 'ip_port': ip + ':' + port})
        time.sleep(1)  # be polite: one request per second
    proxies_list_use = check_ip(proxies_list)
    save_ip(proxies_list_use)
    return proxies_list_use
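# Shape of one record appended above (illustrative values, not a real proxy):
#   {'http_type': 'HTTP', 'ip_port': '123.45.67.89:8080'}
# check_ip() turns each record into a requests-style proxies dict before testing it.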
# ----------------------------------------- Use Baidu to test the validity of the crawled proxy IPs -------------------------------------------------
def check_ip(proxies_list):
    """Check the quality of the proxy IPs by visiting Baidu through each of them.

    The 0.1 s timeout is deliberately strict, so only fast proxies survive.
    :param proxies_list: the crawled proxy records
    :return: the records that responded in time
    """
    headers = random.choice(get_user_agent(5))
    can_use = []
    for proxy in proxies_list:
        try:
            # requests expects lowercase scheme keys, e.g. {'http': 'ip:port'}
            proxies = {proxy['http_type'].lower(): proxy['ip_port']}
            response = requests.get('https://www.baidu.com', headers=headers,
                                    proxies=proxies, timeout=0.1)
            if response.status_code == 200:
                can_use.append(proxy)
        except Exception as e:
            print('Proxy IP error:', e)
    return can_use
# ----------------------------------------- Persist to MongoDB so valid proxy IPs can be fetched later ------------------------------
def save_ip(ip_list):
    """Store the crawled proxy IPs in MongoDB.

    :param ip_list: the valid crawled IP records (list)
    """
    if not ip_list:  # insert_many() raises on an empty list
        return
    client = MongoClient()
    time_info = time.strftime("%Y-%m-%d", time.localtime())
    collection = client['kuaidaili'][time_info + '_proxies']
    collection.insert_many(ip_list)
# ----------------------------------------- Read a valid IP from MongoDB to disguise the local IP when crawling other URLs ------------------------------
def read_ip():
    """Randomly pick one valid IP from MongoDB.

    :return: proxy: a requests-style proxies dict built from a random record
    """
    client = MongoClient()
    time_info = time.strftime("%Y-%m-%d", time.localtime())
    collection = client['kuaidaili'][time_info + '_proxies']
    ip_list = list(collection.find())
    record = random.choice(ip_list)
    proxy = {record['http_type'].lower(): record['ip_port']}
    return proxy
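# Sketch of how read_ip() would be used when crawling another site
# (example.com is a placeholder target, not from the original post):
#   requests.get('https://example.com', proxies=read_ip(), timeout=5)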
# ------------------------------------------------- The main function main() ----------------------------------------------------------
def main():
    useful_ip = get_proxy(5)  # crawl, validate and store proxies from 5 pages
    proxy = read_ip()         # then pick one stored proxy at random
    print(proxy)
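# Entry-point guard so the script can be run directly (assumes a local MongoDB
# instance is reachable on the default host and port):
if __name__ == '__main__':
    main()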
The data stored in MongoDB is as follows (proxy IPs crawled on 2020-10-07):
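A hedged sketch of one stored document under the schema above (the _id is generated automatically by MongoDB, and the address is made up for illustration):

{'_id': ObjectId('5f7d6c1e8b3f4a2d9e0f1a2b'), 'http_type': 'HTTP', 'ip_port': '123.45.67.89:8080'}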

Web scraping is as deep as the sea; the cost of learning it is huge!!!