Crawler exercise (IV) -- IP address problem
2022-06-29 01:54:00 【Recall_ Perseverance】
Preface:
Most websites inspect the request headers of incoming traffic to decide whether a request comes from a real browser or from a script.
The reason: Python's requests library sends the following headers by default:
Host: 127.0.0.1:5000
User-Agent: python-requests/2.21.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
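You can confirm this yourself: a requests Session exposes the default headers it will attach to every request. A minimal sketch:

import requests

# Print the headers requests attaches by default -- the User-Agent
# plainly identifies the client as python-requests
session = requests.Session()
for name, value in session.headers.items():
    print(name + ': ' + value)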
The anti-crawler check on the server side, as a minimal Flask app:

from flask import Flask, request

app = Flask(__name__)

@app.route('/getInfo')
def hello_world():
    # Requests whose User-Agent starts with 'python' are treated as crawlers
    if str(request.headers.get('User-Agent')).startswith('python'):
        return "Kid, using a crawler, huh? Get lost"
    else:
        return "Pretend there's a lot of data here"

if __name__ == "__main__":
    app.run(debug=True)
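On the crawler side, a check like this is trivially bypassed by sending a browser-like User-Agent. A minimal sketch, assuming the Flask demo above is running locally on port 5000:

import requests

# Default headers: the server sees 'python-requests/...' and rejects us
print(requests.get('http://127.0.0.1:5000/getInfo').text)

# Spoofed User-Agent: the same endpoint now returns the data
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
print(requests.get('http://127.0.0.1:5000/getInfo', headers=headers).text)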
Notes:
There are plenty of free proxies online; a quick search turns up piles of them, but they are unstable.
A better approach is to build your own IP proxy pool.
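Once you have a proxy, routing a request through it with requests takes one extra argument. A minimal sketch; the proxy address below is a placeholder, not a real proxy:

import requests

# Placeholder proxy address -- substitute one from your own pool
proxies = {'http': 'http://10.10.1.10:3128'}
# httpbin echoes back the IP it sees, so you can confirm the proxy works
resp = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
print(resp.text)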
IP proxy pool
Building your own proxy pool: crawl the free proxy IPs published on Xici Proxy (xicidaili.com). The site appears to be offline now, so any free-proxy site with a similar table layout can be substituted.
Tutorial link:
https://www.jianshu.com/p/2daa34a435df
from bs4 import BeautifulSoup
import requests
from urllib import request, error
import threading

lock = threading.Lock()


def getProxy(url):
    # Open the txt file in append mode
    proxyFile = open('proxy.txt', 'a')
    # Set a browser UA so the site does not block us
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
    }
    # page is how many pages of IPs we fetch; here we take the first 9 pages
    for page in range(1, 10):
        # By inspecting the URL we find that site URL + page number is the
        # page we need; page has to be converted to str
        urls = url + str(page)
        # Fetch the page source with requests
        rsp = requests.get(urls, headers=headers)
        html = rsp.text
        # Parse the HTML with BeautifulSoup (specify the parser explicitly)
        soup = BeautifulSoup(html, 'html.parser')
        # The data sits in the <tr> tags of the <table> whose id is ip_list
        trs = soup.find('table', id='ip_list').find_all('tr')  # this is a list
        # Loop over the list, skipping the header row
        for item in trs[1:]:
            # Every <tr> contains <td> tags
            tds = item.find_all('td')
            # Some <img> tags are missing, so we need a check here
            if tds[0].find('img') is None:
                nation = 'Unknown'
                locate = 'Unknown'
            else:
                nation = tds[0].find('img')['alt'].strip()
                locate = tds[3].text.strip()
            # Extract each field from the td list
            ip = tds[1].text.strip()
            port = tds[2].text.strip()
            anony = tds[4].text.strip()
            protocol = tds[5].text.strip()
            speed = tds[6].find('div')['title'].strip()
            time = tds[8].text.strip()
            # Write the data to the txt file in a fixed format so it is
            # easy to read back later
            proxyFile.write('%s|%s|%s|%s|%s|%s|%s|%s\n' % (nation, ip, port, locate, anony, protocol, speed, time))
    proxyFile.close()


def verifyProxyList():
    verifiedFile = open('verified.txt', 'a')
    while True:
        lock.acquire()
        ll = inFile.readline().strip()
        lock.release()
        if len(ll) == 0:
            break
        line = ll.split('|')
        ip = line[1]
        port = line[2]
        realip = ip + ':' + port
        code = verifyProxy(realip)
        if code == 200:
            lock.acquire()
            print("---Success:" + ip + ":" + port)
            verifiedFile.write(ll + "\n")
            lock.release()
        else:
            print("---Failure:" + ip + ":" + port)
    verifiedFile.close()


def verifyProxy(ip):
    '''
    Verify that a proxy works by requesting a page through it
    '''
    requestHeader = {
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
    }
    url = "http://www.baidu.com"
    # Fill in the proxy address
    proxy = {'http': ip}
    # Create the ProxyHandler
    proxy_handler = request.ProxyHandler(proxy)
    # Create the opener
    proxy_opener = request.build_opener(proxy_handler)
    # Install the opener
    request.install_opener(proxy_opener)
    try:
        req = request.Request(url, headers=requestHeader)
        rsq = request.urlopen(req, timeout=5.0)
        code = rsq.getcode()
        return code
    except error.URLError:
        return None


if __name__ == '__main__':
    # Truncate both files so every run starts clean
    open('proxy.txt', 'w').close()
    open('verified.txt', 'w').close()
    getProxy("http://www.xicidaili.com/nn/")
    getProxy("http://www.xicidaili.com/nt/")
    getProxy("http://www.xicidaili.com/wn/")
    getProxy("http://www.xicidaili.com/wt/")
    # Open the scraped list only after it has been written
    inFile = open('proxy.txt')
    all_thread = []
    for i in range(30):
        t = threading.Thread(target=verifyProxyList)
        all_thread.append(t)
        t.start()
    for t in all_thread:
        t.join()
    inFile.close()
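The script writes working proxies to verified.txt, one pipe-separated record per line. A crawler can then load that file and rotate through the proxies. A minimal sketch of the consuming side; the field order matches the write format above:

import random
import requests

def load_verified(path='verified.txt'):
    # Each line is nation|ip|port|locate|anony|protocol|speed|time
    proxies = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split('|')
            if len(parts) >= 3:
                proxies.append(parts[1] + ':' + parts[2])
    return proxies

pool = load_verified()
proxy = random.choice(pool)
resp = requests.get('http://httpbin.org/ip',
                    proxies={'http': 'http://' + proxy}, timeout=5)
print(resp.text)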
Follow-up improvement:
A good open-source IP proxy pool:
https://github.com/Python3WebSpider/ProxyPool.git
As the project's GitHub page notes, you first need to install Redis. Tutorial:
https://www.runoob.com/redis/redis-install.html
Then put the proxy pool's files in a directory of your choice and cd into that directory from the command line.

Then start the proxy pool (the project's run.py):
from proxypool.scheduler import Scheduler
import argparse

parser = argparse.ArgumentParser(description='ProxyPool')
parser.add_argument('--processor', type=str, help='processor to run')
args = parser.parse_args()

if __name__ == '__main__':
    # if processor set, just run it
    if args.processor:
        getattr(Scheduler(), f'run_{args.processor}')()
    else:
        Scheduler().run()
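With the scheduler running, the project also serves a small web API (port 5555 by default, per its README) from which a crawler can fetch a random verified proxy. A minimal sketch, assuming the default host and port:

import requests

# Ask the pool for one random usable proxy
proxy = requests.get('http://127.0.0.1:5555/random', timeout=5).text.strip()
print('using proxy:', proxy)

# Route a request through it
resp = requests.get('http://httpbin.org/ip',
                    proxies={'http': 'http://' + proxy}, timeout=5)
print(resp.text)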