
Crawl with requests

2022-07-03 11:05:00 【hflag168】

Crawling with Requests

1 The HTTP protocol

1.1 Overview of the HTTP protocol

HTTP is short for Hypertext Transfer Protocol. It is a stateless application-layer protocol based on a request-response model. The position of HTTP in the TCP/IP protocol stack is shown in the figure below:

[Figure: position of HTTP in the TCP/IP protocol stack (image not available)]

HTTP is usually carried over TCP. It can also be carried over a TLS or SSL layer, in which case it is what we commonly call HTTPS. The default port for HTTP is 80, and the default port for HTTPS is 443.

1.2 The HTTP request-response model

In HTTP, the client always initiates the request and the server returns a response, as shown in the figure below:

[Figure: the HTTP request-response model between client and server (image not available)]

HTTP is a stateless protocol: the protocol itself keeps no link between a client's current request and its previous one.

1.3 Workflow

A single HTTP operation generally consists of the following steps:

1. The client establishes a connection with the server.
2. The client sends a resource request to the server.
3. The server receives the request and returns the corresponding response.
4. The client receives the returned information and parses it.
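
The four steps can be tried out locally without touching the Internet. The following is only a minimal sketch using Python's standard library: `HelloHandler` and `fetch` are hypothetical names introduced for illustration, and the tiny server on 127.0.0.1 stands in for a real web server.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(host, port, path):
    """Walk through the four steps of one HTTP operation."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.connect()                   # step 1: establish the connection
        conn.request("GET", path)        # step 2: send the resource request
        resp = conn.getresponse()        # step 3: the server's response arrives
        return resp.status, resp.read()  # step 4: read and interpret it
    finally:
        conn.close()


# Port 0 lets the operating system pick a free port for the demo server.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    status, body = fetch("127.0.0.1", server.server_port, "/")
finally:
    server.shutdown()
    server.server_close()

print(status, body)
```

Each call to `fetch` opens a fresh connection and carries no memory of earlier calls, which is the statelessness described in section 1.2.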

1.4 HTTP request methods

HTTP/1.1 defines eight methods (also called "verbs") for operating on a specified resource in different ways; PATCH was added later in a separate specification. The following are the common request methods:

Method	Description
GET	Request the resource at the URL
HEAD	Request only the response headers for the resource at the URL, without the body
POST	Submit new data to be appended to the resource at the URL
PUT	Store a resource at the URL, replacing the resource that was there
PATCH	Partially update the resource at the URL, changing only part of its content
DELETE	Delete the resource stored at the URL

1.5 `URL`

A URL (Uniform Resource Locator), commonly known as a web address, is the standard address of a resource on the Internet.

HTTP uses URLs to identify and locate resources.

1.5.1 `URL` Format

http://host[:port][path]

host: a valid Internet host domain name or IP address.
port: the port number; for http the default is 80.
path: the path of the requested resource.

1.5.2 `URL` Example

https://www.cup.edu.cn/ refers to the home page of the China University of Petroleum (Beijing) website.

https://www.cup.edu.cn/cise refers to the cise resource under that host, which is the home page of the School of Information Science and Engineering.

A URL can be understood as the Internet path used to access a resource over HTTP; each URL corresponds to one data resource.
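
To make the host, port, and path parts concrete, a URL can be taken apart with the standard library. This is a small sketch rather than anything specific to Requests: `describe` is a hypothetical helper, and `http://example.com:8080/a/b` is a made-up address used only to show an explicit port.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return (host, port, path) for an http/https URL, filling in the default port."""
    parts = urlsplit(url)
    default_ports = {"http": 80, "https": 443}
    # parts.port is None when the URL does not name a port explicitly
    port = parts.port if parts.port is not None else default_ports.get(parts.scheme)
    return parts.hostname, port, parts.path or "/"


print(describe("https://www.cup.edu.cn/cise"))
print(describe("http://example.com:8080/a/b"))
```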

2 The Requests library

Requests is an elegant and simple HTTP library for Python, built for human beings. With Requests, sending HTTP/1.1 requests is very easy: there is no need to manually add query strings to URLs or to form-encode POST data.

Requests is a third-party library, so it must be installed before use. We suggest using the Anaconda distribution, which already includes Requests and its dependencies.

2.1 The `Request` and `Response` objects

Whenever you call requests.get() or one of its sibling methods, two main things happen. First, a Request object is built and sent to the server to request or query some resource. Second, once the server responds, a Response object is generated. It contains all the information returned by the server, as well as the Request object that was originally created.

Here is a simple request that fetches some information from https://httpbin.org:

>>> import requests
>>> r=requests.get("https://httpbin.org")

To access the headers returned by the server, use the following code:

>>> r.headers
{'Date': 'Sun, 18 Apr 2021 11:15:57 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '9593', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}

To see the headers that were sent to the server, access them through the response's request attribute instead:

>>> r.request.headers
{'User-Agent': 'python-requests/2.19.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
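
The first of the two stages, building the request, can also be inspected on its own without contacting any server. The sketch below assumes Requests is installed; the `demo-client/0.1` User-Agent is a made-up value, and the default header values printed will depend on the installed version.

```python
import requests

# Stage 1: describe the request. Nothing is sent yet.
req = requests.Request(
    "GET",
    "https://httpbin.org/get",
    headers={"User-Agent": "demo-client/0.1"},  # hypothetical value
)

# A Session merges its default headers into the prepared request.
with requests.Session() as session:
    prepared = session.prepare_request(req)

print(prepared.method, prepared.url)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")

# Stage 2 would be session.send(prepared), which performs the exchange
# and returns a Response; it is left out here to stay offline.
```

The headers printed here are the same kind of data that `r.request.headers` exposes after a real request has completed.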

2.2 Main interface

All of Requests' functionality can be accessed through 7 methods, each of which returns an instance of the Response object. Of these, requests.request() is the most important: the other 6 methods are built on top of it.

2.2.1 `requests.request(method, url, **kwargs)`

This method constructs and sends a Request.

Parameters:

method: the HTTP method of the request, such as GET, HEAD, POST, or PUT.
url: the URL to request.
**kwargs: parameters that control the request; all are optional.

Return value: requests.Response

2.2.1.1 request control parameters

The 13 optional **kwargs parameters covered here are described below:

params: a dictionary or byte sequence, added to the URL as query parameters.

>>> import requests
>>> r = requests.request('GET', 'https://httpbin.org', params={'key1': 'val1', 'key2': 'val2'})
>>> print(r.url)
https://httpbin.org/?key1=val1&key2=val2

data: a dictionary, byte sequence, or file object, sent as the body of the Request.

>>> import requests
>>> r = requests.post('https://httpbin.org/post', data={'key': 'value'})
>>> r = requests.post('https://httpbin.org/post', data='main content')

json: data in JSON format, sent as the body of the Request.

>>> import requests
>>> kv = {'key': 'value'}
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)

headers: a dictionary of custom HTTP headers.

>>> hd = {'user-agent': 'Chrome/10'}  # pose as a browser
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)
>>> r.request.headers
{'user-agent': 'Chrome/10', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '0'}

cookies: a dictionary or CookieJar holding the cookies to send with the Request.
auth: a tuple, used for HTTP authentication.
files: a dictionary, used to upload files.

>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)

timeout: the timeout, in seconds.

>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)

proxies: a dictionary that sets the proxy servers used for the request; login credentials can be included. Going through a proxy makes the request harder to trace back to its origin.

>>> pxs = {'http': 'http://user:[email protected]:1234', 'https': 'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)

allow_redirects: True/False, default True; whether redirects are followed.
stream: True/False, default False; when False the response body is downloaded immediately, when True the download is deferred until the content is accessed.
verify: True/False, default True; whether the server's SSL certificate is verified.
cert: path to a local SSL client certificate.
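
To see how `params`, `data`, and `json` shape the outgoing request, the request can be prepared without being sent. This is an offline sketch assuming Requests is installed; the httpbin.org addresses are only labels here, since nothing goes over the network.

```python
import json

import requests

# params: encoded into the query string of the URL
get_req = requests.Request(
    "GET", "https://httpbin.org/get", params={"key1": "val1", "key2": "val2"}
).prepare()
print(get_req.url)

# data (a dictionary): form-encoded into the request body
form_req = requests.Request(
    "POST", "https://httpbin.org/post", data={"key": "value"}
).prepare()
print(form_req.headers["Content-Type"], form_req.body)

# json: serialized to JSON in the request body
json_req = requests.Request(
    "POST", "https://httpbin.org/post", json={"key": "value"}
).prepare()
print(json_req.headers["Content-Type"], json.loads(json_req.body))
```

The same dictionary therefore produces two different bodies and Content-Type headers depending on whether it is passed as `data` or as `json`, which matters when a server accepts only one of the two formats.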

2.2.2 `requests.get(url, params=None, **kwargs)`

Sends an HTTP GET request.

Main parameters:

url: the URL of the page to fetch.
params: extra parameters for the URL, as a dictionary or byte stream; optional.
**kwargs: the 12 remaining control parameters.

Return value: requests.Response

2.2.3 `requests.head(url, **kwargs)`

Sends an HTTP HEAD request.

Parameters:

url: the URL of the page to request.
**kwargs: the 13 control parameters.

Return value: requests.Response

2.2.4 `requests.post(url, data=None, json=None, **kwargs)`

Sends an HTTP POST request.

Parameters:

url: the URL to send the request to.
data: a dictionary, byte sequence, or file object, sent as the body of the Request.
json: data in JSON format, sent as the body of the Request.
**kwargs: the 11 remaining control parameters.

Return value: requests.Response

2.2.5 `requests.put(url, data=None, **kwargs)`

Sends an HTTP PUT request.

Parameters:

url: the URL to send the request to.
data: a dictionary, byte sequence, or file object, sent as the body of the Request.
json: data in JSON format, sent as the body of the Request.
**kwargs: the 11 remaining control parameters.

Return value: requests.Response

2.2.6 `requests.patch(url, data=None, **kwargs)`

Sends an HTTP PATCH request.

Parameters:

url: the URL to send the request to.
data: a dictionary, byte sequence, or file object, sent as the body of the Request.
json: data in JSON format, sent as the body of the Request.
**kwargs: the 11 remaining control parameters.

Return value: requests.Response

2.2.7 `requests.delete(url, **kwargs)`

Sends an HTTP DELETE request.

Parameters:

url: the URL of the resource to delete.
**kwargs: the 13 control parameters.

Return value: requests.Response
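
The statement in section 2.2 that the other six methods are built on requests.request() can be observed without sending anything. The sketch below relies on an assumption about the internal layout of Requests, namely that the helpers in `requests.api` call a module-level `request` function; this holds in the versions I am aware of but is not part of the documented interface.

```python
from unittest import mock

import requests

url = "https://httpbin.org/get"  # never contacted: the call below is intercepted

# Swap out the request() function that the helper methods delegate to,
# so the delegation can be observed instead of performed.
with mock.patch("requests.api.request") as fake_request:
    requests.get(url, params={"key1": "val1"})

args, kwargs = fake_request.call_args
print(args, kwargs)
```

The recorded call shows the method name and URL passed positionally and `params` forwarded as a keyword argument, so `requests.get(url, ...)` behaves like `requests.request('GET', url, ...)`.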

3 Case study: crawling the cover images of all free courses on the imooc website

Open https://www.imooc.com/course/list, inspect the format of each image address, and work out the pattern the addresses follow. The complete crawler is as follows:

import re
import sys

import requests

url = "https://www.imooc.com/course/list"

# Request the course list page; stop if the request fails
try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()
except requests.RequestException as e:
    print("Something went wrong!", e)
    sys.exit(1)

# Guess the encoding from the content, then keep the page text in html
r.encoding = r.apparent_encoding
html = r.text

# Use a regular expression to find the addresses of all images
images = re.findall(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)', html)

# Build complete addresses
images_url = ['https:' + u for u in images]

i = 0
for img_url in images_url:
    # Extract the file extension of the image
    suffix = re.search(r'\w{3}$', img_url).group()
    # Build the local path; the images directory must already exist
    filename = './images/{}.{}'.format(i, suffix)

    # Download the binary content of each image and store it in a local file
    r = requests.get(img_url, timeout=10)
    with open(filename, 'wb') as f:
        f.write(r.content)

    i += 1

The code above has two shortcomings: the directory for storing the images must be created in advance, and an extra counter variable i is needed to name the files. The following version addresses both:

import os
import re
import sys

import requests

url = "https://www.imooc.com/course/list"

# Request the course list page; stop if the request fails
try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()
except requests.RequestException as e:
    print("Something went wrong!", e)
    sys.exit(1)

# Guess the encoding from the content, then keep the page text in html
r.encoding = r.apparent_encoding
html = r.text

# Use a regular expression to find the addresses of all images
images = re.findall(r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)', html)

# Build complete addresses
images_url = ['https:' + u for u in images]

# Create the directory for the images, or empty it if it already exists
if not os.path.isdir('imgs'):
    os.mkdir('imgs')
else:
    for name in os.listdir('imgs'):
        os.remove(os.path.join('imgs', name))

# Use enumerate to number the images while traversing the list
for i, img_url in enumerate(images_url):
    suffix = re.search(r'\w{3}$', img_url).group()
    filename = './imgs/{}.{}'.format(i, suffix)

    r = requests.get(img_url, timeout=10)
    with open(filename, 'wb') as f:
        f.write(r.content)
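
The extraction step of the case study can be checked offline before running the full crawler. In the sketch below, `sample_html` is a made-up fragment (the real page markup may differ) and the file names in it are hypothetical; only the two regular expressions come from the code above.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded page
sample_html = '''
<img src="//img1.mukewang.com/abc123.png">
<img data-original="//img3.mukewang.com/5f0e9d2a0001.jpg">
<img src="//static.example.com/logo.png">
'''

# Same pattern as in the crawler: protocol-relative image addresses
pattern = r'//img\d\.mukewang\.com/\w+\.(?:png|jpg)'
images = re.findall(pattern, sample_html)

# Prepend the scheme to build complete addresses
images_url = ['https:' + u for u in images]

# The last three characters of each address give the file extension
suffixes = [re.search(r'\w{3}$', u).group() for u in images_url]

print(images_url)
print(suffixes)
```

Only the two addresses on an `imgN.mukewang.com` host are picked up; the third image is skipped because its host does not match the pattern. Note that the three-character extension match would need adjusting for extensions such as `jpeg`.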

Welcome to the Python video course: https://www.bilibili.com/video/BV1sh411Q7mz

Copyright notice
This article was written by [hflag168]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202150531477997.html
