Crawl with requests
2022-07-03 11:05:00 【hflag168】
Crawling with Requests
1 The HTTP Protocol
1.1 Overview of the HTTP Protocol
HTTP is the abbreviation of HyperText Transfer Protocol. It is a stateless, application-layer protocol based on a request-response model. The position of HTTP in the TCP/IP protocol stack is shown in the figure below:
HTTP is usually carried over TCP, and sometimes over a TLS or SSL layer; in that case it is what we commonly call HTTPS. The default port for HTTP is 80, and the default port for HTTPS is 443.
1.2 The HTTP Request-Response Model
With HTTP, the client always initiates the request and the server returns the response, as shown in the figure below:
HTTP is a stateless protocol, which means that a client's current request has no relationship to its previous request.
1.3 Workflow
An HTTP transaction generally consists of the following steps (a minimal sketch of one round trip follows the list):
1. The client establishes a connection to the server.
2. The client sends a resource request to the server.
3. The server receives the request and returns the corresponding response.
4. The client receives the returned data and parses it.
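The sketch below is not part of the original text; it uses Python's built-in http.client module against httpbin.org (an arbitrary test host chosen here) to make the four steps explicit:

```python
import http.client

# Step 1: the client establishes a (TLS) connection to the server.
conn = http.client.HTTPSConnection("httpbin.org", 443, timeout=10)

# Step 2: the client sends a resource request.
conn.request("GET", "/get")

# Step 3: the server receives the request and returns a response.
resp = conn.getresponse()

# Step 4: the client receives the returned data and parses it.
print(resp.status, resp.reason)        # e.g. 200 OK
body = resp.read().decode("utf-8")     # the raw response body as text
print(body[:200])                      # first 200 characters

conn.close()
```

The Requests library introduced later wraps all four steps into a single function call.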
1.4 HTTP Request Methods
The HTTP/1.1 protocol defines eight methods (also called "verbs") that operate on a specified resource in different ways. The most common request methods are listed below; the sketch after the table shows how they map onto Requests calls:
Method | Description |
---|---|
GET | Request the resource at the URL location |
HEAD | Request the response headers of the resource at the URL location, i.e. obtain only the resource's header information |
POST | Append new data to the resource at the URL location |
PUT | Store a resource at the URL location, overwriting the resource originally there |
PATCH | Request a partial update of the resource at the URL location, i.e. change part of that resource |
DELETE | Request deletion of the resource stored at the URL location |
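As a quick illustration (a sketch, not from the original text), each of these methods has a corresponding convenience function in the Requests library introduced in the next chapter; httpbin.org simply echoes requests back, so it is a safe place to experiment:

```python
import requests

# One convenience function per HTTP method; httpbin.org echoes the request back.
r = requests.get("https://httpbin.org/get")                        # GET
r = requests.head("https://httpbin.org/get")                       # HEAD: headers only
r = requests.post("https://httpbin.org/post", data={"k": "v"})     # POST
r = requests.put("https://httpbin.org/put", data={"k": "v"})       # PUT
r = requests.patch("https://httpbin.org/patch", data={"k": "v"})   # PATCH
r = requests.delete("https://httpbin.org/delete")                  # DELETE
print(r.status_code)  # 200 if the last request succeeded
```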
1.5 URL
URL (Uniform Resource Locator) stands for uniform resource locator, i.e. a web address. It is the standard address of a resource on the Internet. The HTTP protocol uses the URL as the identifier for locating a resource.
1.5.1 URL Format
http://host[:port][path]
- host: a legal Internet host domain name or IP address
- port: the port number; the default port is 80
- path: the path of the requested resource
(An example of splitting a URL into these components follows the list.)
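As an aside not present in the original text, Python's standard urllib.parse module can split a URL into exactly these components, which is a quick way to check the format (the URL below is just a made-up example):

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com:8080/path/to/resource")
print(parts.scheme)    # 'http'
print(parts.hostname)  # 'www.example.com'
print(parts.port)      # 8080
print(parts.path)      # '/path/to/resource'
```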
1.5.2 URL Examples
https://www.cup.edu.cn/ refers to the campus-network homepage of China University of Petroleum (Beijing).
https://www.cup.edu.cn/cise refers to the resources under the cise directory of that host, i.e. the homepage of the College of Information Science and Engineering.
A URL can be understood this way: it is the Internet path through which the HTTP protocol accesses a resource, and one URL corresponds to one data resource.
2 The Requests Library
Requests is an elegant and simple HTTP library for Python, built for human beings. With requests you can send HTTP/1.1 requests very easily: there is no need to manually add query strings to your URLs, and no need to form-encode your POST data.
requests is a third-party library, so it must be installed before use. We suggest using the Anaconda integrated environment, which already ships with the requests library and its dependencies.
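If you are not using Anaconda, a typical way to install and verify the library is sketched below (the exact version string printed will depend on your installation):

```python
# From the command line:  pip install requests
# Then verify the installation from Python:
import requests
print(requests.__version__)   # prints the installed version string
```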
2.1 The Request and Response Objects
Whenever you call requests.get() or one of its companion methods, two main things happen. First, a Request object is constructed; it will be sent to the server to request or query some resource. Second, once the request gets a response back from the server, a Response object is generated. It contains all the information returned by the server, and it also includes the originally created Request object.
Here is an example of a request used to get some information from https://httpbin.org:
>>> import requests
>>> r = requests.get("https://httpbin.org")
If we want to access the header information returned by the server, we can use the following code:
>>> r.headers
{'Date': 'Sun, 18 Apr 2021 11:15:57 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '9593', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
However, if we want to see the headers that were sent to the server, we access them through the response's request object, for example:
>>> r.request.headers
{'User-Agent': 'python-requests/2.19.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
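Beyond the headers, a Response object exposes several other frequently used attributes. The short list below is a sketch, not part of the original text; several of these attributes appear again in the case study at the end of this article:

```python
>>> r.status_code         # HTTP status code of the response, e.g. 200
>>> r.encoding            # text encoding guessed from the response headers
>>> r.apparent_encoding   # text encoding guessed from the content itself
>>> r.text                # response body decoded as text (str)
>>> r.content             # response body as raw bytes
>>> r.url                 # final URL of the response (after any redirects)
```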
2.2 The Main Interface
All of the functionality of Requests can be accessed through 7 methods, and each of them returns an instance of the Response object. Among them, requests.request() is the most important one: it is the basis of the other 6 methods.
2.2.1 requests.request(method, url, **kwargs)
This method constructs and sends a Request.
Parameters:
- method: the HTTP request method to send, which can be GET, HEAD, POST, PUT, etc.
- url: the address of the HTTP request
- **kwargs: access control parameters, all of them optional
Return value: requests.Response
2.2.1.1 Control Parameters of request
**kwargs covers 13 optional parameters in total. They are described below, and a combined example follows the list:
params: a dictionary or byte sequence, appended to the url as query parameters.
>>> import requests
>>> r = requests.request('GET', 'https://httpbin.org', params={'key1': 'val1', 'key2': 'val2'})
>>> print(r.url)
https://httpbin.org/?key1=val1&key2=val2
data: a dictionary, byte sequence, or file object, used as the content of the Request.
>>> import requests
>>> r = requests.post('https://httpbin.org/post', data={'key': 'value'})
>>> r = requests.post('https://httpbin.org/post', data='main content')
json: data in JSON format, used as the content of the Request.
>>> import requests
>>> kv = {'key': 'value'}
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)
headers: a dictionary of custom HTTP headers.
>>> hd = {'user-agent': 'Chrome/10'}  # disguise the request as a browser
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)
>>> r.request.headers
{'user-agent': 'Chrome/10', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '0'}
cookies: a dictionary or CookieJar, the cookies to send with the Request.
auth: a tuple, supporting HTTP authentication.
files: a dictionary, used to transfer files.
>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)
timeout: the timeout in seconds.
>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)
proxies: a dictionary that sets the proxy servers to route requests through; login credentials can be included. Using proxies makes it harder to trace the crawler back to its source.
>>> pxs = {'http': 'http://user:pass@10.10.10.1:1234', 'https': 'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
allow_redirects: True/False, default True; switch for following redirects.
stream: True/False, default False; when True the response body is streamed instead of being downloaded immediately.
verify: True/False, default True; switch for verifying the SSL certificate.
cert: path to a local SSL client certificate.
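As a combined illustration (a sketch, not from the original text), several of these control parameters can be passed together in a single call:

```python
import requests

# Combine several control parameters in one request:
# query parameters, a custom header, a timeout, and redirect handling.
r = requests.request(
    'GET',
    'https://httpbin.org/get',
    params={'q': 'python'},               # appended to the URL as ?q=python
    headers={'user-agent': 'Chrome/10'},  # custom User-Agent header
    timeout=10,                           # give up after 10 seconds
    allow_redirects=True,                 # follow redirects (the default)
)
print(r.url)          # https://httpbin.org/get?q=python
print(r.status_code)  # 200 if everything went well
```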
2.2.2 requests.get(url, params=None, **kwargs)
Sends a request with the HTTP GET method.
Main parameters:
- url: the url of the page to fetch
- params: extra parameters appended to the url, in dictionary or byte-stream format; optional
- **kwargs: 12 access control parameters
Return value: requests.Response
(A defensive template built around requests.get() is sketched below.)
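The sketch below is not from the original article, but it shows a commonly used defensive pattern for fetching a page with requests.get(); the same idea (raise_for_status() plus apparent_encoding) appears in the case study at the end of this article:

```python
import requests

def fetch_html(url, timeout=10):
    """Fetch a page and return its decoded text, or None on failure."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise an exception for 4xx/5xx responses
    except requests.RequestException as e:
        print("Request failed:", e)
        return None
    r.encoding = r.apparent_encoding      # guess the encoding from the content
    return r.text

html = fetch_html("https://www.imooc.com/course/list")
if html is not None:
    print(html[:200])                     # first 200 characters of the page
```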
2.2.3 requests.head(url, **kwargs)
Sends a request with the HTTP HEAD method.
Parameters:
- url: the url of the page to fetch
- **kwargs: 13 access control parameters
Return value: requests.Response
(A small HEAD example follows.)
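A HEAD request retrieves only the response headers, which makes it a cheap way to inspect a resource. A quick sketch, not from the original text:

```python
>>> import requests
>>> r = requests.head('https://httpbin.org/get')
>>> r.headers['Content-Type']    # header fields are available as usual
'application/json'
>>> r.text                       # the body of a HEAD response is empty
''
```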
2.2.4 requests.post(url, data=None, json=None, **kwargs)
Sends a request with the HTTP POST method.
Parameters:
- url: the url of the page to post to
- data: a dictionary, byte sequence, or file object, used as the content of the Request
- json: data in JSON format, used as the content of the Request
- **kwargs: 11 access control parameters
Return value: requests.Response
(The difference between posting data and posting json is illustrated below.)
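The data and json parameters end up in different parts of the request body. The sketch below is not from the original text; it uses httpbin.org, which echoes back what it received, to show the difference:

```python
import requests

payload = {'key': 'value'}

# data= sends a form-encoded body (Content-Type: application/x-www-form-urlencoded)
r1 = requests.post('https://httpbin.org/post', data=payload)
print(r1.json()['form'])   # {'key': 'value'}

# json= sends a JSON body (Content-Type: application/json)
r2 = requests.post('https://httpbin.org/post', json=payload)
print(r2.json()['json'])   # {'key': 'value'}
```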
2.2.5 requests.put(url, data=None, **kwargs)
Sends a request with the HTTP PUT method.
Parameters:
- url: the url of the resource to store
- data: a dictionary, byte sequence, or file object, used as the content of the Request
- json: data in JSON format, used as the content of the Request
- **kwargs: 11 access control parameters
Return value: requests.Response
2.2.6 requests.patch(url, data=None, **kwargs)
Sends a request with the HTTP PATCH method.
Parameters:
- url: the url of the resource to update
- data: a dictionary, byte sequence, or file object, used as the content of the Request
- json: data in JSON format, used as the content of the Request
- **kwargs: 11 access control parameters
Return value: requests.Response
(A short comparison of PUT and PATCH follows.)
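Conceptually, PUT replaces the whole resource while PATCH changes only part of it; how that is handled is up to the server. The sketch below (not from the original text) only shows how the two calls are made, again against httpbin.org, which simply echoes the request:

```python
import requests

# PUT: send the complete new representation of the resource.
r = requests.put('https://httpbin.org/put',
                 data={'name': 'Alice', 'age': 30})
print(r.json()['form'])    # the whole record was sent

# PATCH: send only the fields that should change.
r = requests.patch('https://httpbin.org/patch',
                   data={'age': 31})
print(r.json()['form'])    # only the changed field was sent
```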
2.2.7 requests.delete(url, **kwargs)
Sends a request with the HTTP DELETE method.
Parameters:
- url: the url of the resource to delete
- **kwargs: 13 access control parameters
Return value: requests.Response
3 Case Study: Crawling the Cover Images of All Free Courses on the imooc Website
Open https://www.imooc.com/course/list, inspect the format of the individual image addresses, and work out the pattern they follow. The complete crawling code is as follows:
import requests
import re
url = "https://www.imooc.com/course/list"
# Request the page content
try:
    r = requests.get(url)
    r.raise_for_status()
except:
    print("Something went wrong!")
else:
    r.encoding = r.apparent_encoding
    # Save the page content in the variable html
    html = r.text
    # Use a regular expression to find the addresses of all images
    images = re.findall(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)', html)
    # Build the complete addresses
    images_url = ['https:' + url for url in images]
    i = 0
    for url in images_url:
        # Extract the extension (the last three characters) of the image file
        suffix = re.search(r'\w{3}$', url).group()
        # Build the path for the downloaded file; the images directory must be created in advance
        filename = './images/{}.{}'.format(i, suffix)
        # For each image, fetch its binary content and store it in a local file
        f = open(filename, 'wb')
        r = requests.get(url)
        f.write(r.content)
        f.close()
        i += 1
The code above has two unsatisfactory aspects: first, the directory for storing the images has to be created locally in advance; second, a separate counter variable i is introduced just to name the images. The following code addresses both problems:
import requests
import re
import os
url = "https://www.imooc.com/course/list"
# Request the page content
try:
    r = requests.get(url)
    r.raise_for_status()
except:
    print("Something went wrong!")
else:
    r.encoding = r.apparent_encoding
    # Save the page content in the variable html
    html = r.text
    # Use a regular expression to find the addresses of all images
    images = re.findall(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)', html)
    # Build the complete addresses
    images_url = ['https:' + url for url in images]
    # Create the directory for the images, or empty it if it already exists
    if not os.path.isdir('imgs'):
        os.mkdir('imgs')
    else:
        os.chdir('./imgs')
        for f1 in os.listdir():
            os.remove(f1)
        os.chdir('..')
    # Use enumerate to traverse the list together with an index
    for i, url in enumerate(images_url):
        # Extract the extension (the last three characters) of the image file
        suffix = re.search(r'\w{3}$', url).group()
        filename = './imgs/{}.{}'.format(i, suffix)
        f = open(filename, 'wb')
        r = requests.get(url)
        f.write(r.content)
        f.close()
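As a further optional refinement that is not part of the original article, the download loop could reuse a single Session, derive the extension with os.path.splitext instead of slicing the last three characters, and pass a timeout. A sketch under those assumptions:

```python
import os
import requests

def download_images(images_url, folder='imgs', timeout=10):
    """Download every image URL into folder, naming the files by their index."""
    os.makedirs(folder, exist_ok=True)           # create the folder if it does not exist
    with requests.Session() as session:          # reuse one connection for all downloads
        for i, url in enumerate(images_url):
            ext = os.path.splitext(url)[1]       # e.g. '.jpg' or '.png'
            filename = os.path.join(folder, '{}{}'.format(i, ext))
            resp = session.get(url, timeout=timeout)
            resp.raise_for_status()
            with open(filename, 'wb') as f:
                f.write(resp.content)
```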
Welcome to the Python video course: https://www.bilibili.com/video/BV1sh411Q7mz