Crawl with requests
2022-07-03 11:05:00 【hflag168】
Crawling with Requests
1 The HTTP protocol
1.1 Overview of the HTTP protocol
HTTP stands for Hypertext Transfer Protocol. It is a stateless application-layer protocol based on a request/response model. The position of HTTP in the TCP/IP protocol stack is shown in the figure below:
![Position of HTTP in the TCP/IP protocol stack](/img/a3/28e2335da45d27a5c6a676cb0038b6.jpg)
HTTP usually runs on top of TCP. It can also run on top of a TLS or SSL layer, in which case it is what we usually call HTTPS. The default port is 80 for HTTP and 443 for HTTPS.
1.2 The HTTP request/response model
In HTTP, the client always initiates the request and the server returns a response, as shown in the figure below:
![HTTP request/response model](/img/3d/b91b009fe1a02c1a06e3176c1aedd2.jpg)
HTTP is a stateless protocol: the protocol itself keeps no link between a client's current request and its previous one.
1.3 Workflow
A single HTTP operation generally consists of the following steps:
First, the client establishes a connection with the server.
The client sends a resource request to the server.
The server receives the request and returns a corresponding response.
The client receives the returned information and parses it.
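The four steps can be sketched with nothing but the standard library. This is a minimal illustration, not how Requests works internally: it starts a throwaway local server so that no network access is needed, and the names `HelloHandler` and `fetch_once` are made up for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch_once():
    # Port 0 asks the operating system for any free port
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Step 1: the client establishes a connection with the server
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.connect()
        # Step 2: the client sends a resource request
        conn.request("GET", "/")
        # Step 3: the server handles the request and responds
        response = conn.getresponse()
        # Step 4: the client reads and parses what came back
        status = response.status
        text = response.read().decode("utf-8")
        conn.close()
        return status, text
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_once())
```

Each call opens a fresh connection and the server remembers nothing between calls, which is the stateless behaviour described in section 1.2.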
1.4 HTTP request methods
HTTP/1.1 defines eight methods (also called "verbs") for operating on a given resource in different ways; PATCH was standardized later, in RFC 5789. The most commonly used request methods are:
| Method | Description |
|---|---|
| GET | Requests the resource at the URL |
| HEAD | Requests only the response headers for the resource at the URL, that is, the resource's header information |
| POST | Submits new data to be appended to the resource at the URL |
| PUT | Stores a resource at the URL, replacing whatever was there |
| PATCH | Requests a partial update of the resource at the URL, changing only part of it |
| DELETE | Requests deletion of the resource stored at the URL |
1.5 URL
A URL (Uniform Resource Locator), commonly called a web address, is the standard address of a resource on the Internet.
HTTP uses URLs to identify and locate resources.
1.5.1 URL format
http://host[:port][path]
host: a valid Internet host domain name or IP address.
port: the port number; the default is 80.
path: the path of the requested resource.
1.5.2 URL examples
https://www.cup.edu.cn/ refers to the campus home page of China University of Petroleum (Beijing).
https://www.cup.edu.cn/cise refers to the cise resource directory under that host, which is the home page of the School of Information Science and Engineering.
A URL can be understood as the Internet path for reaching a resource over HTTP; each URL corresponds to one data resource.
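To make the host, port and path parts concrete, here is a small sketch that splits a URL with the standard library's `urllib.parse.urlsplit`. The helper `describe_url` and the `DEFAULT_PORTS` table are illustrative names, not part of any library.

```python
from urllib.parse import urlsplit

# Default ports mentioned in section 1.1
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Break a URL into the scheme, host, port and path parts described above."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when the URL does not name one
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path or "/",
    }


if __name__ == "__main__":
    print(describe_url("https://www.cup.edu.cn/cise"))
```

For the second example URL this reports the host `www.cup.edu.cn`, the default HTTPS port 443 and the path `/cise`.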
2 The Requests library
Requests is an elegant and simple HTTP library for Python, built for human beings. With Requests you can send HTTP/1.1 requests very easily: there is no need to add query strings to URLs by hand or to form-encode POST data yourself.
Requests is a third-party library, so it must be installed before use. We recommend the Anaconda distribution, which ships with Requests and its dependencies already installed.
2.1 Request and Response objects
Whenever you call requests.get() or one of its companion methods, two main things happen. First, a Request object is built and sent to the server to request or query some resource. Second, once the server responds, a Response object is generated. It contains all the information returned by the server as well as the Request object that was originally created.
Here is a simple request that fetches some information from https://httpbin.org:
>>> import requests
>>> r=requests.get("https://httpbin.org")
To access the headers returned by the server, use:
>>> r.headers
{'Date': 'Sun, 18 Apr 2021 11:15:57 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '9593', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
To see the headers that were sent to the server, access them through the response's request attribute:
>>> r.request.headers
{'User-Agent': 'python-requests/2.19.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
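The request half of this pair can also be inspected before anything is sent. The sketch below builds a `requests.Request` by hand and lets a `requests.Session` prepare it, which fills in the same default headers shown above. The `/get` path and the `q` parameter are assumptions chosen for illustration, and nothing is transmitted over the network.

```python
import requests

# Build the Request object explicitly instead of calling requests.get()
request = requests.Request(
    "GET",
    "https://httpbin.org/get",               # assumed endpoint, for illustration
    params={"q": "http"},                    # assumed query parameter
    headers={"Accept": "application/json"},
)

with requests.Session() as session:
    # prepare_request merges the session defaults (User-Agent, Accept-Encoding, ...)
    # with the headers given above; it does not send anything
    prepared = session.prepare_request(request)

print(prepared.method)
print(prepared.url)
print(dict(prepared.headers))
```

Passing `prepared` to `session.send()` would perform the actual exchange and return the Response object.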
2.2 Main interface
All of Requests' functionality can be accessed through seven methods, each of which returns an instance of the Response object. Of these, requests.request() is the most important: the other six are built on top of it.
2.2.1 requests.request(method, url, **kwargs)
This method constructs and sends a Request.
Parameters:
method: the HTTP method of the request, such as GET, HEAD, POST or PUT.
url: the address of the HTTP request.
**kwargs: parameters that control the request; all are optional.
Return value: requests.Response
2.2.1.1 Control parameters of request
**kwargs covers 13 optional parameters, described below:
params: a dictionary or byte sequence, added to the url as query parameters.
>>> import requests
>>> r = requests.request('GET','https://httpbin.org', params={'key1':'val1', 'key2':'val2'})
>>> print(r.url)
https://httpbin.org/?key1=val1&key2=val2
data: a dictionary, byte sequence or file object, sent as the body of the Request.
>>> import requests
>>> r = requests.post('https://httpbin.org/post', data={'key':'value'})
>>> r = requests.post('https://httpbin.org/post', data='main content')
json: data in JSON format, sent as the body of the Request.
>>> import requests
>>> kv = {'key':"value"}
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)
headers: a dictionary of custom HTTP headers.
>>> hd={'user-agent': "Chrome/10"}  # disguise the client as a browser
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)
>>> r.request.headers
{'user-agent': 'Chrome/10', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '0'}
cookies: a dictionary or CookieJar holding the cookies of the Request.
auth: a tuple, enabling HTTP authentication.
files: a dictionary, used to upload files.
>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)
timeout: the timeout, in seconds.
>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)
proxies: a dictionary that sets the proxy servers used for access; login credentials can be included. Going through a proxy makes the request harder to trace back to its origin.
>>> pxs = {'http': 'http://user:password@10.10.10.1:1234', 'https':'https://10.10.10.1:4321'}  # placeholder credentials and proxy addresses
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
allow_redirects: True/False, default True; switch for following redirects.
stream: True/False, default False; when False the response body is downloaded immediately, when True the download is deferred until the content is accessed.
verify: True/False, default True; switch for verifying the SSL certificate.
cert: path to a local SSL certificate.
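Several of these control parameters are usually combined in one call. The following is a rough sketch of a defensive wrapper; `safe_get` is a made-up helper name and the 10-second default is an arbitrary choice, not a recommendation from the Requests project.

```python
import requests


def safe_get(url, timeout=10):
    """Fetch a URL and return the Response, or None if anything goes wrong."""
    try:
        response = requests.get(
            url,
            timeout=timeout,        # give up instead of waiting indefinitely
            allow_redirects=True,   # follow redirects (the default)
            verify=True,            # check the server's SSL certificate (the default)
        )
        # Turn 4xx and 5xx status codes into exceptions
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class of the errors Requests raises,
        # including timeouts, connection failures and malformed URLs
        print("request failed:", exc)
        return None
    return response


if __name__ == "__main__":
    # A URL without a scheme is rejected before any network traffic happens
    print(safe_get("not-a-url"))
```

Catching `requests.RequestException` rather than using a bare `except` keeps unrelated errors such as `KeyboardInterrupt` from being swallowed.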
2.2.2 requests.get(url, params=None, **kwargs)
Sends a request using the HTTP GET method.
Main parameters:
url: the URL of the page to fetch.
params: extra parameters for the url, as a dictionary or byte stream; optional.
**kwargs: 12 access control parameters.
Return value: requests.Response
2.2.3 requests.head(url, **kwargs)
Sends a request using the HTTP HEAD method.
Parameters:
url: the URL of the page to fetch.
**kwargs: 13 access control parameters.
Return value: requests.Response
2.2.4 requests.post(url, data=None, json=None, **kwargs)
Sends a request using the HTTP POST method.
Parameters:
url: the URL of the target page.
data: a dictionary, byte sequence or file object, sent as the body of the Request.
json: data in JSON format, sent as the body of the Request.
**kwargs: 11 access control parameters.
Return value: requests.Response
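The difference between `data` and `json` shows up in how the request body is encoded. The sketch below prepares two POST requests without sending them and prints what would go over the wire; the httpbin URL is reused from the earlier examples and the payload is invented for illustration.

```python
import json

import requests

URL = "https://httpbin.org/post"
payload = {"key": "value"}  # illustrative payload

# data= with a dictionary is form-encoded
form_request = requests.Request("POST", URL, data=payload).prepare()
# json= serializes the same dictionary as a JSON document
json_request = requests.Request("POST", URL, json=payload).prepare()

print(form_request.headers["Content-Type"], form_request.body)
print(json_request.headers["Content-Type"], json.loads(json_request.body))
```

Which one to use depends on what the server expects: HTML form handlers usually want the form encoding, while most web APIs want JSON.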
2.2.5 requests.put(url, data=None, **kwargs)
Sends a request using the HTTP PUT method.
Parameters:
url: the URL of the target resource.
data: a dictionary, byte sequence or file object, sent as the body of the Request.
json: data in JSON format, sent as the body of the Request (passed as a keyword argument).
**kwargs: 11 further access control parameters.
Return value: requests.Response
2.2.6 requests.patch(url, data=None, **kwargs)
Sends a request using the HTTP PATCH method.
Parameters:
url: the URL of the target resource.
data: a dictionary, byte sequence or file object, sent as the body of the Request.
json: data in JSON format, sent as the body of the Request (passed as a keyword argument).
**kwargs: 11 further access control parameters.
Return value: requests.Response
2.2.7 requests.delete(url, **kwargs)
Sends a request using the HTTP DELETE method.
Parameters:
url: the URL of the resource to delete.
**kwargs: 13 access control parameters.
Return value: requests.Response
3 Case study: crawling the cover images of all free courses on the imooc website
Open https://www.imooc.com/course/list, inspect the addresses of the cover images and work out the pattern they follow. The complete crawler is as follows:
import requests
import re
url = "https://www.imooc.com/course/list"
# Request the page
try:
    r = requests.get(url)
    r.raise_for_status()
except requests.RequestException:
    # Without the page there is nothing to parse, so stop here
    raise SystemExit("Something went wrong!")
else:
    r.encoding = r.apparent_encoding
# Save the page content in the variable html
html = r.text
# Use a regular expression to find the addresses of all images
images = re.findall(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)', html)
# Build the complete addresses
images_url = ['https:' + url for url in images]
i = 0
for url in images_url:
    # Extract the file extension of the image
    suffix = re.search(r'\w{3}$', url).group()
    # Build the local path of the image; the images directory must already exist
    filename = './images/{}.{}'.format(i, suffix)
    # Download the binary content of each image and store it in a local file
    r = requests.get(url)
    with open(filename, 'wb') as f:
        f.write(r.content)
    i += 1
The code above has two shortcomings: the directory for storing the images must be created by hand in advance, and an extra counter variable i is needed to name the images. The following version addresses both:
import requests
import re
import os
url = "https://www.imooc.com/course/list"
# Request the page
try:
    r = requests.get(url)
    r.raise_for_status()
except requests.RequestException:
    # Without the page there is nothing to parse, so stop here
    raise SystemExit("Something went wrong!")
else:
    r.encoding = r.apparent_encoding
# Save the page content in the variable html
html = r.text
# Use a regular expression to find the addresses of all images
images = re.findall(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)', html)
# Build the complete addresses
images_url = ['https:' + url for url in images]
# Create the directory for storing images, or empty it if it already exists
if not os.path.isdir('imgs'):
    os.mkdir('imgs')
else:
    for name in os.listdir('imgs'):
        os.remove(os.path.join('imgs', name))
# Use enumerate to number the images while traversing the list
for i, url in enumerate(images_url):
    suffix = re.search(r'\w{3}$', url).group()
    filename = './imgs/{}.{}'.format(i, suffix)
    r = requests.get(url)
    with open(filename, 'wb') as f:
        f.write(r.content)
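The extraction and naming steps can be tried without touching the network. The sketch below runs the same regular expression over a made-up HTML fragment (`SAMPLE_HTML` is invented for this example and does not come from the real page) and derives local file names with `os.path.splitext`, which is one possible alternative to matching the last three characters of the URL.

```python
import os
import re

# Hypothetical HTML fragment standing in for the downloaded course page
SAMPLE_HTML = '''
<img src="//img1.mukewang.com/5f0e8a9b0001.jpg">
<img data-original="//img3.mukewang.com/abc_123.png">
<img src="//static.example.com/logo.gif">
'''

# Same pattern as in the crawler above
IMAGE_PATTERN = re.compile(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)')


def extract_image_urls(html):
    """Return complete https URLs for every cover image found in html."""
    return ['https:' + path for path in IMAGE_PATTERN.findall(html)]


def local_filename(index, url, directory='imgs'):
    """Build a path such as imgs/0.jpg from the image's position and extension."""
    extension = os.path.splitext(url)[1]  # includes the leading dot, e.g. ".jpg"
    return os.path.join(directory, '{}{}'.format(index, extension))


if __name__ == "__main__":
    for i, image_url in enumerate(extract_image_urls(SAMPLE_HTML)):
        print(image_url, '->', local_filename(i, image_url))
```

The third tag in the fragment is skipped because its host and extension do not match the pattern, which is the filtering the crawler relies on.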
You are welcome to watch the Python video course: https://www.bilibili.com/video/BV1sh411Q7mz