Crawl with requests
2022-07-03 11:05:00 【hflag168】
Crawling with Requests
1 The HTTP protocol
1.1 Overview of the HTTP protocol
HTTP is the abbreviation of Hyper Text Transfer Protocol. It is a stateless application-layer protocol based on a request-response model. The position of HTTP in the TCP/IP protocol stack is shown in the figure below:
![Position of HTTP in the TCP/IP protocol stack](/img/a3/28e2335da45d27a5c6a676cb0038b6.jpg)
HTTP is usually carried over TCP. It can also be carried over a TLS or SSL layer, in which case it is what we commonly call HTTPS. The default port for HTTP is 80, and the default port for HTTPS is 443.
1.2 The HTTP request-response model
In HTTP, the client always initiates the request and the server returns the response, as shown in the figure below:
![The HTTP request-response model](/img/3d/b91b009fe1a02c1a06e3176c1aedd2.jpg)
HTTP is a stateless protocol: the current request from a client has no connection with the previous request from the same client.
1.3 Workflow
A single HTTP operation generally consists of the following steps:
1. The client establishes a connection with the server.
2. The client sends a resource request to the server.
3. The server receives the request and responds accordingly.
4. The client receives the returned information and parses it.
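The four steps above can be sketched with nothing but the Python standard library. This is only an illustration under stated assumptions: the `HelloHandler` class, the `run_workflow` function, and the `/index.html` path are hypothetical names chosen for this example, and a throwaway server on 127.0.0.1 stands in for a real website.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default request log


def run_workflow():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Step 1: the client establishes a connection with the server.
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.connect()
        # Step 2: the client sends a resource request.
        conn.request("GET", "/index.html")
        # Step 3: the server handles the request and responds.
        response = conn.getresponse()
        # Step 4: the client reads the returned information and parses it.
        status = response.status
        content_type = response.getheader("Content-Type")
        text = response.read().decode("utf-8")
        conn.close()
        return status, content_type, text
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_workflow())
```

Libraries such as Requests perform the same steps internally, so crawler code rarely has to manage the connection by hand.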
1.4 HTTP request methods
HTTP/1.1 defines eight methods (also called "verbs") that operate on a specified resource in different ways. The common request methods are listed below:
| Method | Description |
|---|---|
| GET | Requests the resource at the URL |
| HEAD | Requests the response headers for the resource at the URL, that is, only the header information of the resource |
| POST | Submits new data to be appended to the resource at the URL |
| PUT | Stores a resource at the URL, overwriting the resource originally at that location |
| PATCH | Partially updates the resource at the URL, that is, changes part of its content |
| DELETE | Deletes the resource stored at the URL |
1.5 URL
A URL (Uniform Resource Locator) is what we usually call a web address: the address of a standard resource on the Internet.
HTTP uses URLs to identify and locate resources.
1.5.1 URL format
http://host[:port][path]
- host: a valid Internet host domain name or IP address.
- port: the port number; the default port is 80.
- path: the path of the requested resource.
1.5.2 URL examples
https://www.cup.edu.cn/ refers to the home page of the campus website of China University of Petroleum (Beijing).
https://www.cup.edu.cn/cise refers to the cise directory under the same host domain, which is the home page of the School of Information Science and Engineering.
A URL can be understood as the Internet path through which a resource is accessed over HTTP. One URL corresponds to one data resource.
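To make the host, port, and path parts concrete, here is a small sketch that splits a URL with the standard-library function `urllib.parse.urlsplit`. The helper name `describe_url` and the `DEFAULT_PORTS` table are assumptions made for this illustration; they fill in the default ports 80 and 443 mentioned earlier when the URL does not name a port.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes discussed in this article.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Split a URL into scheme, host, port, and path (hypothetical helper)."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when none is written in the URL.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path or "/",
    }


if __name__ == "__main__":
    print(describe_url("https://www.cup.edu.cn/cise"))
    print(describe_url("http://example.com:8080"))
```

For the second example URL in section 1.5.2, the helper reports the host `www.cup.edu.cn`, port 443 (because the scheme is https), and the path `/cise`.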
2 The Requests library
Requests is an elegant and simple HTTP library for Python, built for human beings. With requests you can send HTTP/1.1 requests very easily: there is no need to add query strings to the URL by hand or to form-encode POST data yourself.
requests is a third-party library, so it must be installed before use. We suggest using the Anaconda distribution, which already includes the requests library and its dependencies.
2.1 The Request and Response objects
Whenever you call requests.get() or one of its sibling methods, two main things happen. First, a Request object is built and sent to the server to request or query some resource. Second, once the server responds, a Response object is generated. It contains all the information returned by the server, as well as the Request object that was originally created.
Here is an example request that fetches some information from https://httpbin.org:
>>> import requests
>>> r=requests.get("https://httpbin.org")
To access the header information returned by the server, use the following code:
>>> r.headers
{'Date': 'Sun, 18 Apr 2021 11:15:57 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '9593', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
To get the header information that was sent to the server, go through the request object attached to the response, for example:
>>> r.request.headers
{'User-Agent': 'python-requests/2.19.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
2.2 Main interface
All of the functionality of Requests can be accessed through the following 7 methods. They all return an instance of the Response object. Among them, requests.request() is the most important: it is the basis of the other 6 methods.
2.2.1 requests.request(method, url, **kwargs)
This method constructs and sends a Request.
Parameters:
- method: the HTTP request method to use, such as GET, HEAD, POST, or PUT.
- url: the address of the HTTP request.
- **kwargs: parameters that control the access; all are optional.
Return value: requests.Response
2.2.1.1 Control parameters of request
**kwargs covers the 13 optional parameters described below:
- params: a dictionary or byte sequence, appended to url as query parameters.
>>> import requests
>>> r = requests.request('GET', 'https://httpbin.org', params={'key1': 'val1', 'key2': 'val2'})
>>> print(r.url)
https://httpbin.org/?key1=val1&key2=val2
- data: a dictionary, byte sequence, or file object, used as the body of the Request.
>>> import requests
>>> r = requests.post('https://httpbin.org/post', data={'key': 'value'})
>>> r = requests.post('https://httpbin.org/post', data='main content')
- json: data in JSON format, used as the body of the Request.
>>> import requests
>>> kv = {'key': "value"}
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)
- headers: a dictionary of custom HTTP headers.
>>> hd = {'user-agent': "Chrome/10"}  # pretend to be a browser
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)
>>> r.request.headers
{'user-agent': 'Chrome/10', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '0'}
- cookies: a dictionary or CookieJar; the cookies of the Request.
- auth: a tuple; supports HTTP authentication.
- files: a dictionary; used to transfer files.
>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)
- timeout: the timeout, in seconds.
>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)
- proxies: a dictionary that sets the proxy servers used for access; login credentials can be included. Going through a proxy hides the crawler's own address, which makes the request harder to trace back.
>>> pxs = {'http': 'http://user:[email protected]:1234', 'https': 'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
- allow_redirects: True/False, default True; switch for following redirects.
- stream: True/False, default False; when False the response body is downloaded immediately, when True the download is deferred until the content is accessed.
- verify: True/False, default True; switch for verifying the SSL certificate.
- cert: path to a local SSL certificate.
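The effect of params, data, json, and headers can be inspected without touching the network by building a request and preparing it, but not sending it. The sketch below uses `requests.Request(...).prepare()`, which returns the `PreparedRequest` that would go on the wire. The helper name `prepare` and the variable names are assumptions made for this illustration.

```python
import json

import requests


def prepare(method, url, **kwargs):
    """Build the request that would be sent, without sending it (hypothetical helper)."""
    return requests.Request(method, url, **kwargs).prepare()


# params: encoded into the query string of the URL.
get_req = prepare("GET", "https://httpbin.org", params={"key1": "val1", "key2": "val2"})

# data: a dictionary is form-encoded into the request body.
form_req = prepare("POST", "https://httpbin.org/post", data={"key": "value"})

# json: the object is serialized as JSON in the request body.
json_req = prepare("POST", "https://httpbin.org/post", json={"key": "value"})

# headers: custom headers are attached as given.
ua_req = prepare("GET", "https://httpbin.org", headers={"user-agent": "Chrome/10"})

if __name__ == "__main__":
    print(get_req.url)
    print(form_req.headers["Content-Type"], form_req.body)
    print(json_req.headers["Content-Type"], json.loads(json_req.body))
    print(ua_req.headers["user-agent"])
```

Note that data and json both end up in the request body; they differ in the encoding and in the Content-Type header that Requests sets for you.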
2.2.2 requests.get(url, params=None, **kwargs)
Sends a request using the HTTP GET method.
Main parameters:
- url: the URL of the page to get.
- params: extra parameters for the URL, as a dictionary or byte stream; optional.
- **kwargs: the 12 access control parameters.
Return value: requests.Response
2.2.3 requests.head(url, **kwargs)
Sends a request using the HTTP HEAD method.
Parameters:
- url: the URL of the page to get.
- **kwargs: the 13 access control parameters.
Return value: requests.Response
2.2.4 requests.post(url, data=None, json=None, **kwargs)
Sends a request using the HTTP POST method.
Parameters:
- url: the URL of the target page.
- data: a dictionary, byte sequence, or file object, used as the body of the Request.
- json: data in JSON format, used as the body of the Request.
- **kwargs: the 11 access control parameters.
Return value: requests.Response
2.2.5 requests.put(url, data=None, **kwargs)
Sends a request using the HTTP PUT method.
Parameters:
- url: the URL of the target page.
- data: a dictionary, byte sequence, or file object, used as the body of the Request.
- json: data in JSON format, used as the body of the Request (passed as a keyword argument).
- **kwargs: the 11 access control parameters.
Return value: requests.Response
2.2.6 requests.patch(url, data=None, **kwargs)
Sends a request using the HTTP PATCH method.
Parameters:
- url: the URL of the target page.
- data: a dictionary, byte sequence, or file object, used as the body of the Request.
- json: data in JSON format, used as the body of the Request (passed as a keyword argument).
- **kwargs: the 11 access control parameters.
Return value: requests.Response
2.2.7 requests.delete(url, **kwargs)
Sends a request using the HTTP DELETE method.
Parameters:
- url: the URL of the target page.
- **kwargs: the 13 access control parameters.
Return value: requests.Response
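Because every one of these functions returns a Response (or raises an exception when the request cannot be made), crawler code usually wraps the call in a small helper. The following is a minimal sketch, not part of the Requests API: the function name `fetch_text` and the default timeout of 10 seconds are assumptions for illustration. It relies on `Response.raise_for_status()`, which raises for 4xx and 5xx status codes, and on `requests.RequestException`, the base class of the exceptions Requests raises.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the decoded text of a page, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions.
        r.raise_for_status()
    except requests.RequestException as exc:
        print("Request failed:", exc)
        return None
    # Guess the encoding from the content before decoding the text.
    r.encoding = r.apparent_encoding
    return r.text


if __name__ == "__main__":
    html = fetch_text("https://httpbin.org")
    print(None if html is None else len(html))
```

Catching `requests.RequestException` rather than using a bare `except:` keeps unrelated errors, such as a typo in the code or a keyboard interrupt, from being silently swallowed.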
3 Case study: crawling the cover images of all free courses on the imooc website
Open https://www.imooc.com/course/list, inspect the address of each image, and work out the pattern that the image addresses follow. The complete crawler code is as follows:
import requests
import re

url = "https://www.imooc.com/course/list"
# Request the page
try:
    r = requests.get(url)
    r.raise_for_status()
except requests.RequestException:
    print("Something went wrong!")
else:
    r.encoding = r.apparent_encoding
    # Save the page text in the variable html
    html = r.text
    # Use a regular expression to find the addresses of all images
    images = re.findall(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)', html)
    # Build complete addresses
    images_url = ['https:' + u for u in images]
    i = 0
    for img_url in images_url:
        # Extract the suffix (file extension) of the image file
        suffix = re.search(r'\w{3}$', img_url).group()
        # Build the storage path of the downloaded image; the images directory must be created in advance
        filename = './images/{}.{}'.format(i, suffix)
        # For each image, fetch its binary content and store it in a local file
        r = requests.get(img_url)
        with open(filename, 'wb') as f:
            f.write(r.content)
        i += 1
The code above is unsatisfactory in two respects: first, the directory for storing the images must be created locally in advance; second, an extra counter variable i is introduced just to name the images. The following optimized code addresses both problems:
import requests
import re
import os

url = "https://www.imooc.com/course/list"
# Request the page
try:
    r = requests.get(url)
    r.raise_for_status()
except requests.RequestException:
    print("Something went wrong!")
else:
    r.encoding = r.apparent_encoding
    # Save the page text in the variable html
    html = r.text
    # Use a regular expression to find the addresses of all images
    images = re.findall(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)', html)
    # Build complete addresses
    images_url = ['https:' + u for u in images]
    # Create the directory for the images, or empty it if it already exists
    if not os.path.isdir('imgs'):
        os.mkdir('imgs')
    else:
        for name in os.listdir('imgs'):
            os.remove(os.path.join('imgs', name))
    # Use enumerate to traverse the list together with an index
    for i, img_url in enumerate(images_url):
        suffix = re.search(r'\w{3}$', img_url).group()
        filename = './imgs/{}.{}'.format(i, suffix)
        r = requests.get(img_url)
        with open(filename, 'wb') as f:
            f.write(r.content)
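The extraction and naming logic of the crawler can be tried offline on a snippet of HTML. The sketch below reuses the regular expression from the code above; everything else is an assumption made for illustration: the helper names `extract_image_urls` and `local_name`, the removal of duplicate addresses, and the `sample_html` string, whose image addresses are invented and only follow the pattern described in this article.

```python
import os
import re
from urllib.parse import urlsplit

# The same pattern used by the crawler above.
IMAGE_PATTERN = re.compile(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)')


def extract_image_urls(html):
    """Return absolute image URLs in page order, skipping duplicates."""
    urls = []
    for match in IMAGE_PATTERN.findall(html):
        full = 'https:' + match
        if full not in urls:
            urls.append(full)
    return urls


def local_name(index, url, folder='imgs'):
    """Build the local file path, taking the extension from the URL path."""
    suffix = os.path.splitext(urlsplit(url).path)[1]  # includes the dot
    return os.path.join(folder, '{}{}'.format(index, suffix))


# Invented sample markup; real pages will differ.
sample_html = '''
<img src="//img1.mukewang.com/5f0a1b2c0001abcd.jpg">
<img src="//img3.mukewang.com/5e9d8c7b0001ef01.png">
<img src="//img1.mukewang.com/5f0a1b2c0001abcd.jpg">
'''

urls = extract_image_urls(sample_html)
names = [local_name(i, u) for i, u in enumerate(urls)]

if __name__ == "__main__":
    for u, n in zip(urls, names):
        print(u, '->', n)
```

Taking the extension with `os.path.splitext` instead of the pattern `\w{3}$` is a design choice: it would keep working if the crawler were later extended to extensions that are not exactly three characters long.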
Welcome to the Python video course: https://www.bilibili.com/video/BV1sh411Q7mz