Crawl with requests
2022-07-03 11:05:00 【hflag168】
1 The HTTP protocol
1.1 Overview of the HTTP protocol
HTTP stands for Hypertext Transfer Protocol. It is a stateless application-layer protocol based on a request-response model. The position of HTTP in the TCP/IP protocol stack is shown in the figure below:
![Position of HTTP in the TCP/IP protocol stack](/img/a3/28e2335da45d27a5c6a676cb0038b6.jpg)
HTTP usually runs on top of TCP. Sometimes it runs on top of a TLS or SSL layer instead, and that combination is what we commonly call HTTPS. The default port is 80 for HTTP and 443 for HTTPS.
1.2 The HTTP request-response model
In HTTP, the client always initiates the request and the server returns a response, as shown in the figure below:
![HTTP request-response model](/img/3d/b91b009fe1a02c1a06e3176c1aedd2.jpg)
HTTP is a stateless protocol: the current request from a client has no inherent connection to that client's previous request.
1.3 Workflow
A single HTTP operation generally consists of the following steps:
1. The client establishes a connection with the server.
2. The client sends a resource request to the server.
3. The server receives the request and responds accordingly.
4. The client receives the returned information and parses it.
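The four steps above can be traced with nothing but a raw socket. The following is a minimal sketch, not how Requests works internally: `HelloHandler` and the throwaway local server are hypothetical and exist only so that the example runs without Internet access.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical local server, used only so the example runs offline."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# Step 1: the client establishes a TCP connection with the server.
with socket.create_connection((host, port), timeout=5) as conn:
    # Step 2: the client sends a resource request.
    request = "GET / HTTP/1.1\r\nHost: {}:{}\r\nConnection: close\r\n\r\n".format(host, port)
    conn.sendall(request.encode("ascii"))
    # Step 3: the server responds; read until it closes the connection.
    chunks = []
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)

server.shutdown()
server.server_close()

# Step 4: the client parses what came back.
raw = b"".join(chunks)
head, _, body = raw.partition(b"\r\n\r\n")
status_line = head.split(b"\r\n")[0].decode("ascii")
status_code = int(status_line.split()[1])
print(status_line, body)
```

Everything before the blank line (`\r\n\r\n`) is the status line plus headers, and everything after it is the body. A library such as Requests hides these details behind a single function call.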
1.4 HTTP request methods
The HTTP/1.1 protocol defines eight methods (also called "verbs") that operate on a given resource in different ways. The common request methods are listed below:
| Method | Description |
|---|---|
| GET | Request the resource at the URL |
| HEAD | Request only the response headers for the resource at the URL, that is, the resource's header information |
| POST | Submit data to be appended to the resource at the URL |
| PUT | Store a resource at the URL, overwriting the resource that was there |
| PATCH | Partially update the resource at the URL, changing only part of it |
| DELETE | Delete the resource stored at the URL |
1.5 URL
A URL (Uniform Resource Locator) is what we usually call a web address: the address of a standard resource on the Internet.
HTTP uses URLs to identify and locate resources.
1.5.1 URL format
http://host[:port][path]
host: a valid Internet host domain name or IP address.
port: the port number; the default is 80.
path: the path of the requested resource.
1.5.2 URL examples
https://www.cup.edu.cn/ refers to the campus homepage of China University of Petroleum (Beijing).
https://www.cup.edu.cn/cise refers to the resource in the cise directory under the same host, namely the homepage of the School of Information Science and Engineering.
A URL can be understood as the Internet path through which HTTP accesses a resource; each URL corresponds to one data resource.
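To make the host, port, and path parts concrete, here is a small sketch that splits a URL with the standard library's `urllib.parse.urlsplit`. The helper name `describe` and the `default_ports` table are assumptions made for this illustration.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a URL into the scheme, host, port and path parts."""
    parts = urlsplit(url)
    default_ports = {"http": 80, "https": 443}
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        # An explicit port wins; otherwise fall back to the scheme's default.
        "port": parts.port or default_ports.get(parts.scheme),
        "path": parts.path or "/",
    }


info = describe("https://www.cup.edu.cn/cise")
print(info)
```

When the URL carries no explicit port, the sketch fills in 80 for http and 443 for https, matching the defaults mentioned in section 1.1.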
2 The Requests library
Requests is an elegant and simple HTTP library for Python, built for human beings. With Requests you can send HTTP/1.1 requests very easily: there is no need to add query strings to the URL by hand or to form-encode POST data yourself.
Requests is a third-party library, so it must be installed before use. We recommend the Anaconda distribution, which already includes Requests and its dependencies.
2.1 Request and Response objects
Whenever you call requests.get() or one of its sibling methods, two main things happen. First, a Request object is built and sent to the server to request or query some resource. Second, once the server answers, a Response object is generated. It contains all the information returned by the server, as well as the Request object that was originally created.
Here is an example request that fetches some information from https://httpbin.org:
>>> import requests
>>> r=requests.get("https://httpbin.org")
If we want to access the headers returned by the server, we can use the following code:
>>> r.headers
{'Date': 'Sun, 18 Apr 2021 11:15:57 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '9593', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
However, if we want the headers that were sent to the server, we access them through the response's request attribute, for example:
>>> r.request.headers
{'User-Agent': 'python-requests/2.19.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
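The "build a Request, then send it" split can also be observed directly. The sketch below builds a `requests.Request` by hand and calls `prepare()`, which yields the exact method, URL, and headers that would go on the wire; nothing is sent. The header value `demo-client/0.1` is a made-up example.

```python
import requests

# Build a Request object by hand; nothing is sent over the network here.
req = requests.Request(
    "GET",
    "https://httpbin.org/get",
    headers={"User-Agent": "demo-client/0.1"},  # hypothetical header value
)
prepared = req.prepare()  # the method, URL and headers that would be sent

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

To actually send it, the prepared request could be handed to `requests.Session().send(prepared)`, which returns a Response whose `request` attribute is this same prepared object.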
2.2 Main interface
All of Requests' functionality can be accessed through 7 methods, each of which returns an instance of the Response object. Among them, requests.request() is the most important: it is the basis of the other 6 methods.
2.2.1 requests.request(method, url, **kwargs)
This method constructs and sends a Request.
Parameters:
method: the HTTP method of the request, such as GET, HEAD, POST, or PUT.
url: the address of the HTTP request.
**kwargs: parameters that control the access; all are optional.
Return value: requests.Response
2.2.1.1 Control parameters of request
**kwargs covers 13 optional parameters, described as follows:
params: a dictionary or byte sequence, appended to the url as query parameters.
>>> import requests
>>> r = requests.request('GET', 'https://httpbin.org', params={'key1': 'val1', 'key2': 'val2'})
>>> print(r.url)
https://httpbin.org/?key1=val1&key2=val2
data: a dictionary, byte sequence, or file object, sent as the body of the Request.
>>> import requests
>>> r = requests.post('https://httpbin.org/post', data={'key': 'value'})
>>> r = requests.post('https://httpbin.org/post', data='main content')
json: data in JSON format, sent as the body of the Request.
>>> import requests
>>> kv = {'key': 'value'}
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)
headers: a dictionary of custom HTTP headers.
>>> hd = {'user-agent': 'Chrome/10'}  # pretend to be a browser
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)
>>> r.request.headers
{'user-agent': 'Chrome/10', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '0'}
cookies: a dictionary or CookieJar; the cookies sent with the Request.
auth: a tuple; enables HTTP authentication.
files: a dictionary; used to upload files.
>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)
timeout: the timeout, in seconds.
>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)
proxies: a dictionary that maps protocols to proxy servers, which may include login credentials; going through a proxy makes the requests harder to trace back.
>>> pxs = {'http': 'http://user:[email protected]:1234', 'https': 'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
allow_redirects: True/False, default True; switch for following redirects.
stream: True/False, default False; switch for deferring the download of the response content (when False, the content is downloaded immediately).
verify: True/False, default True; switch for verifying the SSL certificate.
cert: path to a local SSL certificate.
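How params, data, and json end up in the outgoing request can be inspected without any network traffic. This sketch reuses the `requests.Request(...).prepare()` approach and only prints what would be sent; the variable names are illustrative.

```python
import json

import requests

url = "https://httpbin.org/post"
payload = {"key": "value"}

# prepare() builds the outgoing request without sending it.
query_req = requests.Request(
    "GET", "https://httpbin.org/get", params={"key1": "val1", "key2": "val2"}
).prepare()
form_req = requests.Request("POST", url, data=payload).prepare()
text_req = requests.Request("POST", url, data="main content").prepare()
json_req = requests.Request("POST", url, json=payload).prepare()

print(query_req.url)                                          # params go into the URL
print(form_req.headers["Content-Type"], form_req.body)        # dict data is form-encoded
print(text_req.headers.get("Content-Type"), text_req.body)    # str data is sent as-is
print(json_req.headers["Content-Type"], json_req.body)        # json is serialized
```

The practical difference: a dictionary passed as data is form-encoded, a string passed as data is sent unchanged with no Content-Type added, and anything passed as json is serialized with a JSON Content-Type.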
2.2.2 requests.get(url, params=None, **kwargs)
Sends a request using the HTTP GET method.
Main parameters:
url: the URL of the page to fetch.
params: extra parameters for the url, as a dictionary or byte stream; optional.
**kwargs: the remaining 12 control parameters.
Return value: requests.Response
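In a crawler, a GET call is usually wrapped so that a single failed page does not stop the whole run. The helper below is one possible sketch; the name `fetch_text` and the choice to return None on failure are assumptions, not part of Requests.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the page text, or None if the request fails for any reason."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.RequestException:
        # Covers connection errors, timeouts, bad status codes and invalid URLs.
        return None
    r.encoding = r.apparent_encoding  # guess the encoding from the content
    return r.text


# Usage: html = fetch_text("https://httpbin.org")
```

Catching `requests.RequestException` rather than using a bare except keeps unrelated errors, such as a KeyboardInterrupt, from being swallowed.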
2.2.3 requests.head(url, **kwargs)
Sends a request using the HTTP HEAD method.
Parameters:
url: the URL of the page to fetch.
**kwargs: the 13 control parameters.
Return value: requests.Response
2.2.4 requests.post(url, data=None, json=None, **kwargs)
Sends a request using the HTTP POST method.
Parameters:
url: the URL of the page to request.
data: a dictionary, byte sequence, or file object, sent as the body of the Request.
json: data in JSON format, sent as the body of the Request.
**kwargs: the remaining 11 control parameters.
Return value: requests.Response
2.2.5 requests.put(url, data=None, **kwargs)
Sends a request using the HTTP PUT method.
Parameters:
url: the URL of the resource to store.
data: a dictionary, byte sequence, or file object, sent as the body of the Request.
json: data in JSON format, sent as the body of the Request (passed through **kwargs).
**kwargs: the remaining 11 control parameters.
Return value: requests.Response
2.2.6 requests.patch(url, data=None, **kwargs)
Sends a request using the HTTP PATCH method.
Parameters:
url: the URL of the resource to update.
data: a dictionary, byte sequence, or file object, sent as the body of the Request.
json: data in JSON format, sent as the body of the Request (passed through **kwargs).
**kwargs: the remaining 11 control parameters.
Return value: requests.Response
2.2.7 requests.delete(url, **kwargs)
Sends a request using the HTTP DELETE method.
Parameters:
url: the URL of the resource to delete.
**kwargs: the 13 control parameters.
Return value: requests.Response
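That the shortcut methods are thin wrappers around request() can be checked against a throwaway local server. The sketch below is illustrative only: `EchoMethodHandler` is a hypothetical handler that replies with the method it received, and a `requests.Session` (which offers the same get/post/... interface as the top-level functions) is used so that proxy settings from the environment can be switched off for the local test.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class EchoMethodHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that replies with the method it received."""

    def _reply(self):
        # Consume any request body so the connection closes cleanly.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        body = self.command.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if self.command != "HEAD":  # a HEAD response carries headers only
            self.wfile.write(body)

    do_GET = do_HEAD = do_POST = do_PUT = do_PATCH = do_DELETE = _reply

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), EchoMethodHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:{}/".format(server.server_port)

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

shortcuts = {
    "GET": session.get,
    "HEAD": session.head,
    "POST": session.post,
    "PUT": session.put,
    "PATCH": session.patch,
    "DELETE": session.delete,
}

seen = {}
for method, shortcut in shortcuts.items():
    via_shortcut = shortcut(url, timeout=5)
    via_request = session.request(method, url, timeout=5)
    # Both calls should have sent the same HTTP method.
    seen[method] = (via_shortcut.request.method, via_request.request.method)

session.close()
server.shutdown()
server.server_close()
print(seen)
```

Each pair in `seen` holds the method sent by the shortcut and the method sent by the generic request() call; they match for all six verbs.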
3 Case study: crawling the cover images of all free courses on imooc
Open https://www.imooc.com/course/list, inspect the format of the image addresses, and work out the pattern they follow. The complete crawling code is as follows:
import re
import requests

url = "https://www.imooc.com/course/list"

# Request the page; stop if the request fails
try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()
except requests.RequestException:
    print("Something went wrong!")
    raise SystemExit(1)

r.encoding = r.apparent_encoding
# Save the page content in the variable html
html = r.text

# Use a regular expression to find the addresses of all images
images = re.findall(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)', html)
# Build the complete addresses
image_urls = ['https:' + path for path in images]

i = 0
for image_url in image_urls:
    # Extract the file extension of the image
    suffix = re.search(r'\w{3}$', image_url).group()
    # Build the local path of the downloaded file; the images directory must already exist
    filename = './images/{}.{}'.format(i, suffix)
    # Fetch the binary content of each image and write it to a local file
    r = requests.get(image_url, timeout=10)
    with open(filename, 'wb') as f:
        f.write(r.content)
    i += 1
The code above has two shortcomings: first, the directory for storing the images has to be created locally in advance; second, a counter variable i is introduced just to name the images. The following code addresses both problems:
import os
import re
import requests

url = "https://www.imooc.com/course/list"

# Request the page; stop if the request fails
try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()
except requests.RequestException:
    print("Something went wrong!")
    raise SystemExit(1)

r.encoding = r.apparent_encoding
# Save the page content in the variable html
html = r.text

# Use a regular expression to find the addresses of all images
images = re.findall(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)', html)
# Build the complete addresses
image_urls = ['https:' + path for path in images]

# Create the directory for storing images, or empty it if it already exists
if not os.path.isdir('imgs'):
    os.mkdir('imgs')
else:
    for name in os.listdir('imgs'):
        os.remove(os.path.join('imgs', name))

# Use enumerate to traverse the list together with an index
for i, image_url in enumerate(image_urls):
    suffix = re.search(r'\w{3}$', image_url).group()
    filename = './imgs/{}.{}'.format(i, suffix)
    r = requests.get(image_url, timeout=10)
    with open(filename, 'wb') as f:
        f.write(r.content)
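The extraction and naming logic can be tried out without touching the network. The sketch below applies the same regular expression to a hypothetical HTML fragment (the image names are invented) and computes the file names the crawler would write; it swaps the `\w{3}$` search for `os.path.splitext`, which also copes with extensions that are not three characters long.

```python
import os
import re

# Hypothetical HTML fragment standing in for the real course list page.
sample_html = '''
<img class="course-banner" src="//img1.mukewang.com/5f0e8a1b0001abcd.jpg">
<img class="course-banner" src="//img3.mukewang.com/5d9c62a30001c0ff.png">
<img src="//www.imooc.com/static/img/logo.gif">
'''

pattern = r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)'
images = re.findall(pattern, sample_html)
image_urls = ['https:' + path for path in images]

# Map each URL to the local file name the crawler would write.
filenames = []
for i, image_url in enumerate(image_urls):
    suffix = os.path.splitext(image_url)[1]  # includes the dot, e.g. ".jpg"
    filenames.append('imgs/{}{}'.format(i, suffix))

print(image_urls)
print(filenames)
```

The third `<img>` tag is skipped because its host does not match the `img\d.mukewang.com` pattern, so only the two cover images are collected.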
You are welcome to watch the accompanying Python video course: https://www.bilibili.com/video/BV1sh411Q7mz