当前位置:网站首页>Use of crawler manual 02 requests
Use of crawler manual 02 requests
2022-07-06 00:58:00 【A little black sauce】
Requests Use
The goal is : list Requests Common functions , Convenient access .
One . GET request
1. Basic usage
import requests
r = requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text[:100])
print(r.cookies)
Running results :
<class 'requests.models.Response'>
200
<class 'str'>
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charse
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
2. Portability parameter (params Parameters )
import requests
data = {
'name': 'germey',
'age': 25
}
r = requests.get('https://httpbin.org/get', params=data)
print(r.text)
print(type(r.text))
print(r.json())
print(type(r.json()))
Running results : The return is json character string , Can be called directly json() Turn to dictionary
{
"args": {
"age": "25",
"name": "germey"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.22.0",
"X-Amzn-Trace-Id": "Root=1-620913df-3f5acf216ce4775c03354c66"
},
"origin": "192.168.1.1",
"url": "https://httpbin.org/get?name=germey&age=25"
}
<class 'str'>
{'args': {'age': '25', 'name': 'germey'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0', 'X-Amzn-Trace-Id': 'Root=1-620913df-3f5acf216ce4775c03354c66'}, 'origin': '192.168.1.1', 'url': 'https://httpbin.org/get?name=germey&age=25'}
<class 'dict'>
3. Grab binary data ( picture , Audio , video )
import requests
r = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
f.write(r.content)
Writing into a file can save binary data .
4. Add request header ( add to UA, agent , Anti theft chain and other parameters )
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('https://httpbin.org/get', headers=headers)
print(r.text)
Running results :
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-62091775-6e03f96c398c288d0537dc48"
},
"origin": "192.168.1.1",
"url": "https://httpbin.org/get"
}
Two . POST request
import requests
data = {
'name': 'germey', 'age': '25'}
r = requests.post("https://www.httpbin.org/post", data=data)
print(r.text)
Running results :form The parameter in is POST Submitted parameters
{
"args": {},
"data": "",
"files": {},
"form": {
"age": "25",
"name": "germey"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "18",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "www.httpbin.org",
"User-Agent": "python-requests/2.22.0",
"X-Amzn-Trace-Id": "Root=1-62091826-3152bf7a1515a77d072a49b8"
},
"json": null,
"origin": "192.168.1.1",
"url": "https://www.httpbin.org/post"
}
3、 ... and . Respond to
1. Get response information
import requests
r = requests.get('https://httpbin.org/get')
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)
Running results :
<class 'int'> 200
<class 'requests.structures.CaseInsensitiveDict'> {'Date': 'Sun, 13 Feb 2022 14:48:27 GMT', 'Content-Type': 'application/json', 'Content-Length': '308', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>
<class 'str'> https://httpbin.org/get
<class 'list'> []
2. Perform different operations according to the response status code
import requests
r = requests.get('https://ssr1.scrape.center/')
exit() if not r.status_code == requests.codes.ok else print('Request Successfully')
Running results : Ternary expression , sentence 1 if Conditions else sentence 2, Execute statement if condition is true 1, If the condition is false, execute the statement 2
Request Successfully
Four . Advanced usage
1. Upload files
import requests
files = {
'file': open('favicon.ico', 'rb')}
r = requests.post('https://www.httpbin.org/post', files=files)
print(r.text)
Running results :file Part is the uploaded file information , Binary content is partially omitted
{
"args": {},
"data": "",
"files": {
"file": "data:application/octet-stream;base64,AAABAAI..."
},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "6665",
"Content-Type": "multipart/form-data; boundary=d4f2a858cbaf82294e38dc9d13398861",
"Host": "www.httpbin.org",
"User-Agent": "python-requests/2.22.0",
"X-Amzn-Trace-Id": "Root=1-62091be0-0620970e4ce4ad2744525587"
},
"json": null,
"origin": "192.168.1.1",
"url": "https://www.httpbin.org/post"
}
2. Cookie Set up
Method 1 : Take from the response
import requests
r = requests.get('https://www.baidu.com')
print(r.cookies)
for key, value in r.cookies.items():
print(key + '=' + value)
Running effect :
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315
Method 2 : Copy from browser , Put in the request header
import requests
headers = {
"Cookie": "PSTM=1642921828; BAIDUID=5D0366F594005428ED23CD344FFF8AB4:FG=1; BIDUPSID=C513DE1AABFEE265656D91ED68FFD990; BD_UPN=12314753; __yjs_duid=1_28bc45f4f2a7966986384542e284e0061642937695718; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; delPer=0; BD_CK_SAM=1; PSINO=5; BAIDUID_BFESS=5D0366F594005428ED23CD344FFF8AB4:FG=1; H_PS_645EC=75f2y6ETpxsjOyy%2Fh8Ps97om%2BQzF1lQ85oIBJWVzxjZ7a8PaM7IA2umnH%2Bw; COOKIE_SESSION=6097_1_9_9_0_12_1_0_8_9_3_1_0_0_0_0_0_1644493227_1644763447%7C9%2386341_20_1644493217%7C7; BD_HOME=1; BDUSS=NSdVpwWFAtd1I3YUg0MkNwek5WVEd0SmV1N2FOYkNjZHNPV1VGRVByVVJxekJpSVFBQUFBJCQAAAAAAAAAAAEAAABmANkpQXBwaGFvAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABEeCWIRHgliYz; BDUSS_BFESS=NSdVpwWFAtd1I3YUg0MkNwek5WVEd0SmV1N2FOYkNjZHNPV1VGRVByVVJxekJpSVFBQUFBJCQAAAAAAAAAAAEAAABmANkpQXBwaGFvAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABEeCWIRHgliYz; H_PS_PSSID=35104_31253_35833_35488_34584_35490_35871_35802_35326_26350_35883_35724_35877_35746; sug=3; sugstore=1; ORIGIN=0; bdime=0; BA_HECTOR=a4242h810h2k0005p71h0i7h00q"
}
r = requests.get("http://www.baidu.com/", headers=headers)
print(r.text)
Method 3 : structure RequestsCookieJar object
cookies = 'PSTM=1642921828; BAIDUID=5D0366F594005428ED23CD344FFF8AB4:FG=1; BIDUPSID=C513DE1AABFEE265656D91ED68FFD990; BD_UPN=12314753; __yjs_duid=1_28bc45f4f2a7966986384542e284e0061642937695718; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; delPer=0; BD_CK_SAM=1; PSINO=5; BAIDUID_BFESS=5D0366F594005428ED23CD344FFF8AB4:FG=1; H_PS_645EC=75f2y6ETpxsjOyy%2Fh8Ps97om%2BQzF1lQ85oIBJWVzxjZ7a8PaM7IA2umnH%2Bw; COOKIE_SESSION=6097_1_9_9_0_12_1_0_8_9_3_1_0_0_0_0_0_1644493227_1644763447%7C9%2386341_20_1644493217%7C7; BD_HOME=1; BDUSS=NSdVpwWFAtd1I3YUg0MkNwek5WVEd0SmV1N2FOYkNjZHNPV1VGRVByVVJxekJpSVFBQUFBJCQAAAAAAAAAAAEAAABmANkpQXBwaGFvAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABEeCWIRHgliYz; BDUSS_BFESS=NSdVpwWFAtd1I3YUg0MkNwek5WVEd0SmV1N2FOYkNjZHNPV1VGRVByVVJxekJpSVFBQUFBJCQAAAAAAAAAAAEAAABmANkpQXBwaGFvAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABEeCWIRHgliYz; H_PS_PSSID=35104_31253_35833_35488_34584_35490_35871_35802_35326_26350_35883_35724_35877_35746; sug=3; sugstore=1; ORIGIN=0; bdime=0; BA_HECTOR=a4242h810h2k0005p71h0i7h00q'
jar = requests.cookies.RequestsCookieJar()
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}
for cookie in cookies.split(';'):
key, value = cookie.split('=', 1)
jar.set(key, value)
r = requests.get('http://www.baidu.com/', cookies=jar, headers=headers)
print(r.text)
3. Session maintain
Keep logged in Cookies
import requests
requests.get('https://www.httpbin.org/cookies/set/number/123456789')
r = requests.get('https://www.httpbin.org/cookies')
print(r.text)
s = requests.Session()
s.get('https://www.httpbin.org/cookies/set/number/123456789')
r = s.get('https://www.httpbin.org/cookies')
print(r.text)
Running results :
{
"cookies": {}
}
{
"cookies": {
"number": "123456789"
}
}
4. SSL Certificate validation
problem : Some websites will verify SSL certificate , You cannot access without a certificate file
import requests
response = requests.get('https://ssr2.scrape.center/')
print(response.status_code)
Running results : Report errors
requests.exceptions.SSLError: HTTPSConnectionPool(host='ssr2.scrape.center', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
resolvent : Cancel SSL verification
import requests
response = requests.get('https://ssr2.scrape.center/', verify=False)
print(response.status_code)
Running results : The alarm , But the response code 200
C:\Users\Apphao\anaconda3\lib\site-packages\urllib3\connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'ssr2.scrape.center'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,
200
Cancel the terminal alarm printing by closing the alarm information or capturing the alarm information
import requests
from requests.packages import urllib3
urllib3.disable_warnings()
response = requests.get('https://ssr2.scrape.center/', verify=False)
print(response.status_code)
import logging
import requests
logging.captureWarnings(True)
response = requests.get('https://ssr2.scrape.center/', verify=False)
print(response.status_code)
Running results :
200
200
You can also specify SSL Certificate file
import logging
import requests
r = requests.get('https://ssr2.scrape.center/', cert=('/path/server.crt', '/path/server.key'))
print(response.status_code)
5. timeout
wait for 1 second , If the timeout occurs, an exception is thrown
import requests
r = requests.get('https://www.httpbin.org/get', timeout=1)
print(r.status_code)
Connect to the server and wait 1 second , Read server wait 2 second
import requests
r = requests.get('https://www.httpbin.org/get', timeout=(1,2)
print(r.status_code)
Wait forever , Copy None Or not timeout Parameters
import requests
r = requests.get('https://www.httpbin.org/get', timeout=None)
print(r.status_code)
6. Identity Authentication
Mode one : Pass on HTTPBasicAuth object
import requests
from requests.auth import HTTPBasicAuth
r = requests.get('https://ssr3.scrape.center/', auth=HTTPBasicAuth('admin', 'admin'))
print(r.status_code)
Mode two : Direct tuple ( convenient )
import requests
r = requests.get('https://ssr3.scrape.center/', auth=('admin', 'admin'))
print(r.status_code)
Mode three : Use OAuth1 modular
import requests
from requests_oauthlib import OAuth1
url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',
'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
r = requests.get(url, auth=auth)
print(r.status_code)
7. Agent settings
import requests
proxies = {
'http': 'http://10.10.10.10:1080',
'https': 'https://10.10.10.10:1080'
}
requests.get('https://httpbin.org/get', proxies=proxies)
8. Prepared Request
requests Internal implementation , After mastering it, you can control it more carefully and flexibly requests request
from requests import Request, Session
url = 'https://httpbin.org/post'
data = {
'name': 'germey'}
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}
s = Session()
req = Request('POST', url, data=data, headers=headers)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)
Running effect : Hair harmony POST Request the same
{
"args": {},
"data": "",
"files": {},
"form": {
"name": "germey"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "11",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-62092798-72199d8761a339441af9b39c"
},
"json": null,
"origin": "192.168.1.1",
"url": "https://httpbin.org/post"
}
边栏推荐
- Recursive method converts ordered array into binary search tree
- 95后CV工程师晒出工资单,狠补了这个,真香...
- Leetcode 44 Wildcard matching (2022.02.13)
- esxi的安装和使用
- Comment faire votre propre robot
- Introduction of motor
- Dynamic programming -- linear DP
- The value of applet containers
- Spark-SQL UDF函数
- What is the most suitable book for programmers to engage in open source?
猜你喜欢
从 1.5 开始搭建一个微服务框架——调用链追踪 traceId
Arduino六足机器人
I'm interested in watching Tiktok live beyond concert
Browser reflow and redraw
[groovy] JSON serialization (convert class objects to JSON strings | convert using jsonbuilder | convert using jsonoutput | format JSON strings for output)
Free chat robot API
The third season of ape table school is about to launch, opening a new vision for developers under the wave of going to sea
Five challenges of ads-npu chip architecture design
MCU realizes OTA online upgrade process through UART
After 95, the CV engineer posted the payroll and made up this. It's really fragrant
随机推荐
1791. Find the central node of the star diagram / 1790 Can two strings be equal by performing string exchange only once
[groovy] JSON serialization (convert class objects to JSON strings | convert using jsonbuilder | convert using jsonoutput | format JSON strings for output)
BiShe - College Student Association Management System Based on SSM
云导DNS和知识科普以及课堂笔记
ADS-NPU芯片架构设计的五大挑战
Zhuhai laboratory ventilation system construction and installation instructions
Idea remotely submits spark tasks to the yarn cluster
Meta AI西雅图研究负责人Luke Zettlemoyer | 万亿参数后,大模型会持续增长吗?
Daily practice - February 13, 2022
Model analysis of establishment time and holding time
Spark-SQL UDF函数
Programmer growth Chapter 9: precautions in real projects
Set data real-time update during MDK debug
Dede collection plug-in free collection release push plug-in
Spark SQL空值Null,NaN判断和处理
VSphere implements virtual machine migration
Common API classes and exception systems
Gartner发布2022-2023年八大网络安全趋势预测,零信任是起点,法规覆盖更广
Spark AQE
Four commonly used techniques for anti aliasing