Requests library simple method usage notes
2022-07-29 08:33:00 【Mr match】
1 urllib
A brief overview is enough for this module.

A simple demo:

from urllib import request
response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf8'))
Four submodules:

- urllib.request - open and read URLs.
- urllib.error - contains the exceptions raised by urllib.request.
- urllib.parse - parse URLs.
- urllib.robotparser - parse robots.txt files.

Below we mainly introduce the methods that are less familiar or most commonly used.
1.1 urllib.request.Request
request.Request(
    url,                   # required
    data=None,             # must be bytes (a byte stream) if provided
    headers={},
    origin_req_host=None,
    unverifiable=False,
    method=None,
)
demo

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66',
    'Host': 'httpbin.org'
}
data_dict = {
    'name': 'test_user'
}
data = bytes(parse.urlencode(data_dict), encoding='utf8')  # first URL-encode the dict, then convert it to bytes
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
# output
"""
{
  "args": {},
  "data": "",
  "files": {},
  "form": {              # the content we uploaded shows up here
    "name": "test_user"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "14",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66",
    "X-Amzn-Trace-Id": "Root=1-62c7e429-2dfdef7313887be20022ab31"
  },
  "json": null,
  "origin": "60.168.149.12",
  "url": "http://httpbin.org/post"
}
"""
Headers can also be added via add_header():

req.add_header('User-Agent','XXXXX')
1.2 Handler classes

Handlers are mainly used for other, more advanced operations (cookies, proxy handling, etc.).

- BaseHandler class: the base class of all handlers (to be supplemented when needed)
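As a minimal sketch of the idea (the proxy dict is left empty here; a real proxy URL would be an assumption), an opener combining cookie and proxy handlers can be built like this:

```python
from urllib import request
from http.cookiejar import CookieJar

# HTTPCookieProcessor stores cookies in a CookieJar and re-sends them on
# later requests; ProxyHandler would route traffic through the given proxies.
jar = CookieJar()
opener = request.build_opener(
    request.HTTPCookieProcessor(jar),
    request.ProxyHandler({}),  # empty = no proxy; e.g. {'http': 'http://127.0.0.1:8080'}
)
# opener.open('http://www.baidu.com') would now collect cookies into jar
```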
1.3 Exception handling: urllib.error

The urllib.error module defines the exception classes for errors raised by urllib.request; the base exception class is URLError.

urllib.error contains two exception classes: URLError and HTTPError.
# URLError has only one attribute: reason
from urllib import request, error

try:
    response = request.urlopen('https://baidumatches999.com')
except error.URLError as e:
    print(e.reason)
HTTPError is a subclass of URLError and has three attributes:

- code: status code
- reason
- headers
from urllib import request, error

try:
    response = request.urlopen('https://baidu.com/test.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
"""
Not Found
404
Content-Length: 206
Content-Type: text/html; charset=iso-8859-1
Date: Fri, 08 Jul 2022 08:24:56 GMT
Server: Apache
Connection: close
"""
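Because HTTPError is a subclass of URLError, a handler that catches both must list HTTPError first; a small sketch (the function name is ours):

```python
from urllib import request, error

def fetch(url):
    try:
        with request.urlopen(url, timeout=5) as resp:
            return resp.read()
    except error.HTTPError as e:    # must come before URLError: HTTPError subclasses it
        return 'HTTP error %d: %s' % (e.code, e.reason)
    except error.URLError as e:     # catches DNS failures, refused connections, etc.
        return 'URL error: %s' % e.reason
```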
1.4 urllib.parse

A URL divides into six parts: scheme (protocol), netloc (domain), path, params (parameters), query (query string), and fragment (anchor):

scheme://netloc/path;params?query#fragment

This module is mainly used to process URLs: splitting, joining, and so on.
urlparse(): identify and split a URL into its parts.

- urlstring: required, the URL to be parsed
- scheme: the protocol to assume when the URL itself carries none
- allow_fragments: whether to recognize the fragment part
Other related functions:

- urlparse(): split a URL into its six parts
- urlsplit(): like urlparse() but without separating params (five parts)
- urlunsplit(): the inverse of urlsplit(), assembles a URL from its parts
- urljoin(): merge a base URL with a relative link
- parse_qsl(): parse a query string into a list of (key, value) tuples
- quote(): convert content into URL-encoded (percent-encoded) format
# parse
from urllib.parse import quote

url = "http://www.baidu.com/" + quote('你好')
url
"""
'http://www.baidu.com/%E4%BD%A0%E5%A5%BD'
"""
- unquote(): decode URL-encoded content back to the original text
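A quick sketch of how these functions behave (the example URL is made up for illustration):

```python
from urllib.parse import urlparse, urljoin, unquote

# urlparse splits a URL into its six named parts
parts = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(parts.scheme, parts.netloc, parts.path, parts.params, parts.query, parts.fragment)
# http www.baidu.com /index.html user id=5 comment

# urljoin resolves a relative link against a base URL
print(urljoin('http://www.baidu.com/a/b.html', 'c.html'))
# http://www.baidu.com/a/c.html

# unquote reverses quote()
print(unquote('http://www.baidu.com/%E4%BD%A0%E5%A5%BD'))
```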
1.5 The Robots protocol

The robots protocol (also called the crawler protocol or crawler rules) lets a website declare, via a robots.txt file, which pages may be crawled and which may not; search engines read robots.txt to decide whether a page is allowed to be crawled. However, the robots protocol is not a firewall: it has no enforcement power, and a crawler can ignore robots.txt entirely and grab snapshots of the pages anyway.

The robotparser module is used to parse robots.txt.
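A minimal sketch of robotparser; to avoid a network call the rules are fed in directly here (normally you would use set_url() plus read(), and the example.com rules are made up):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally: rp.set_url('http://example.com/robots.txt'); rp.read()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])
print(rp.can_fetch('*', 'http://example.com/index.html'))  # True
print(rp.can_fetch('*', 'http://example.com/private/x'))   # False
```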
2 The Requests module

requests makes many operations more convenient than urllib. The examples below compare the two libraries on the same tasks.
2.1 Basic usage

The most frequently used method is get():

import requests

url = 'XXXX'
r = requests.get(url)

post(), put() and delete() issue the corresponding requests in the same way.
- json

import requests

r = requests.get('http://httpbin.org/get')
print(r.text)
print(r.json())  # parse the response body as JSON
"""
Output:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-62c80370-7ed4a0a10c7b27106031d457"
  },
  "origin": "60.168.149.12",
  "url": "http://httpbin.org/get"
}
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0', 'X-Amzn-Trace-Id': 'Root=1-62c80370-7ed4a0a10c7b27106031d457'}, 'origin': '60.168.149.12', 'url': 'http://httpbin.org/get'}
"""
2.2 Request headers

Many websites will not return content unless we add a request header. Take Zhihu's explore page as an example:
import requests
r=requests.get('https://www.zhihu.com/explore')
print(r.text)
"""
Output:
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>openresty</center>
</body>
</html>
"""
After adding the request header, the content can be fetched normally.
import requests
headers={
'User-Agent':'Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/103.0.5060.66'
}
r=requests.get('https://www.zhihu.com/explore',headers=headers)
print(r.text)
Other request-header fields are also worth mastering; for example, Referer and Cookie are sometimes useful for getting past anti-scraping measures.

See the earlier post: [Crawlers] Web basics - response headers, request headers, http & https, status codes
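As a sketch, such fields go into the same headers dict; every value below is a placeholder, not a working credential:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 ...',        # placeholder UA string
    'Referer': 'https://www.zhihu.com/',    # page we supposedly navigated from
    'Cookie': 'session_id=XXXX',            # placeholder cookie string
}
# r = requests.get('https://www.zhihu.com/explore', headers=headers)  # network call, not run here
```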
2.3 post

def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """
We won't list another demo here; data is submitted simply by passing it in. The parameters are similar to those of request() below, and some of request()'s other parameters can be used here too, such as files (which lets us upload files).
# upload a file
import requests

files = {
    'file': open("./data/test.png", 'rb')
}
r = requests.post('http://httpbin.org/post', files=files)
print(r.text)
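The request body that post() builds from a data dict can also be inspected offline with a prepared request, without sending anything over the network:

```python
import requests

# Build the request object but do not send it
req = requests.Request('POST', 'http://httpbin.org/post', data={'name': 'test_user'})
p = req.prepare()

print(p.body)                     # name=test_user
print(p.headers['Content-Type'])  # application/x-www-form-urlencoded
```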
2.4 requests.request

Let's look at the request() function to see which parameters it supports.
def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``)
        for multipart encoding upload. ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``,
        3-tuple ``('filename', fileobj, 'content_type')`` or a 4-tuple
        ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object
        containing additional headers to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
        the server's TLS certificate, or a string, in which case it must be a path
        to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'https://httpbin.org/get')
      >>> req
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)
2.5 cookies

Cookies are data that some websites store on the user's local machine in order to identify the user and track the session; they can be used to maintain session state.

When the client requests the server for the first time, the response carries a Set-Cookie field, and the client browser stores the cookie information.

On subsequent visits, the browser submits the cookie along with the request, and the server uses it to determine the session state.
import requests

r = requests.get('http://www.baidu.com')
cookies = r.cookies  # a RequestsCookieJar
for key, value in cookies.items():
    print(key + ':' + value)

"""
BAIDUID:974A
BIDUPSID:974A9F
PSTM:1657
"""
2.6 Session maintenance

Introducing the Session object, which maintains the same session across requests. (The difference is like using one browser versus several separate ones.)

- Demo1: equivalent to accessing through two different browsers
import requests
requests.get('http://httpbin.org/cookies/set/number/1234567')
r=requests.get('http://httpbin.org/cookies')
print(r.text)
"""
Output:
{
  "cookies": {}
}
"""
You can also try visiting these two URLs in a browser to understand this better.
- Demo2:Session object
import requests
sess=requests.session()
sess.get('http://httpbin.org/cookies/set/number/1234567')
r=sess.get('http://httpbin.org/cookies')
print(r.text)
"""
Output:
{
  "cookies": {
    "number": "1234567"
  }
}
"""
2.7 The Prepared Request object

For this part, just read the source code directly. With the Request object, a request can be treated as an independent object. A small demo:
from requests import Request, Session

url = 'http://www.baidu.com'
headers = {}

s = Session()
req = Request('GET', url)  # headers could also be passed here
prepared = s.prepare_request(req)
r = s.send(prepared)
print(r.text)
As a supplement, you can try to work out the difference between Request.prepare() and Session.prepare_request().

The Advanced Usage chapter of the Requests 2.28.1 documentation covers this, but not very clearly; to be supplemented after further use.
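The practical difference can be shown offline: Request.prepare() uses only the request's own settings, while Session.prepare_request() merges in session-level state (the X-Env header below is a made-up example of such state):

```python
import requests

s = requests.Session()
s.headers.update({'X-Env': 'demo'})  # session-level header (hypothetical)

req = requests.Request('GET', 'http://www.baidu.com', headers={'X-Req': '1'})

p1 = req.prepare()            # only the request's own settings
p2 = s.prepare_request(req)   # request settings merged with the session's

print('X-Env' in p1.headers)  # False
print('X-Env' in p2.headers)  # True
```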