Crawler 01: Basic principles of web crawlers
2022-07-05 18:26:00 【Twinkling stars】
1. What is a crawler
An automated program that requests web pages and extracts data from them.
2. Basic workflow of a crawler
- Send the request
- Get the response content
- Parse the content
- Save the data
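The four steps above can be sketched as one small function each. This is a minimal illustration, not a complete crawler: it uses only the Python standard library so it runs without extra installs, the names `fetch`, `parse_title` and `save` are hypothetical, and the demo feeds a hard-coded page to the parser instead of calling `fetch`, so no network access is needed.

```python
import json
import tempfile
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Steps 1 and 2: send the request and return the response body as text."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def parse_title(html):
    """Step 3: extract the page title from the HTML."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def save(record, path):
    """Step 4: write the extracted data to a JSON file."""
    Path(path).write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")


# Offline demo: a hard-coded page stands in for fetch(url).
sample_html = "<html><head><title> Example page </title></head><body></body></html>"
record = {"title": parse_title(sample_html)}
out_path = Path(tempfile.gettempdir()) / "crawler_demo.json"
save(record, out_path)
print(record)
```

In a real run, `parse_title(fetch(url))` would replace the hard-coded page.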
3. What are Request and Response
3.1 What a Request contains
Request method: mainly GET and POST
* Viewing the request method
* Types of request methods
HTTP/1.0
1. GET: retrieves data
Usually passes parameters to the back end and is used to fetch a set of data
2. POST: submits data to the server
Mostly used for logins: you send the server some information and it returns a result
3. HEAD: retrieves only the response headers, without the body
HTTP/1.1
4. PUT: sends data to the server to create or replace a resource
Sometimes used for registration: you send the server some information and it stores it
5. DELETE: deletes a resource
For example deleting a comment or a microblog post
6. CONNECT: asks a proxy to open a tunnel to the target server [rarely used]
7. OPTIONS: asks the server which request methods it supports; the server must allow this request
Later extension (RFC 5789)
8. PATCH: sends data to the server to modify part of a resource
Mostly used to update user profile data
* The difference between GET and POST
Request URL
URL: Uniform Resource Locator, the address of the requested resource
Request headers
Configuration information for the request
Request body
A GET request usually has no body.
A POST request carries its data in the body, typically as form data, for example login information.
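The parts of a request, and the GET/POST difference in where parameters go, can be seen by building two request objects without sending them. This sketch uses the standard library's `urllib.request.Request`; the parameter names, the `demo-crawler/0.1` User-Agent and the login fields are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: parameters travel in the URL query string; there is no body.
query = urlencode({"q": "crawler", "page": 1})
get_req = Request(
    "http://httpbin.org/get?" + query,
    headers={"User-Agent": "demo-crawler/0.1"},  # hypothetical client name
)

# POST: parameters travel in the request body as URL-encoded form data.
form = urlencode({"username": "alice", "password": "secret"}).encode("ascii")
post_req = Request(
    "http://httpbin.org/post",
    data=form,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# Nothing is sent here; we only inspect method, URL and body.
for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```

Passing either object to `urllib.request.urlopen` would actually send it.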
3.2 What a Response contains
Response status
Status code: 200 means success, 3xx (for example 301 or 302) means redirect, 404 means not found
Response headers
Response body
Preview: the content of the response itself
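The first digit of a status code gives its class, which is usually all a crawler needs in order to decide what to do next. A small sketch using the standard library's `http.HTTPStatus`; the helper name `describe_status` is hypothetical.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (class name, reason phrase) for an HTTP status code."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not registered in the standard library
        phrase = "unknown"
    return classes.get(code // 100, "unknown"), phrase


for code in (200, 301, 404, 500):
    print(code, *describe_status(code))
```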
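4 Example
import requests
# Send a GET request and keep the Response object
r = requests.get('https://www.baidu.com/')
print(type(r))          # type of the Response object
print(r.status_code)    # HTTP status code
print(type(r.text))     # type of the response body
print(r.text)           # response body as text
print(r.cookies)        # cookies set by the server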
Output:
<class 'requests.models.Response'>
200
<class 'str'>
<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
<RequestsCookieJar[<Cookie BIDUPSID=992C3B26F4C4D09505C5E959D5FBC005 for .baidu.com/>, <Cookie
PSTM=1472227535 for .baidu.com/>, <Cookie __bsi=15304754498609545148_00_40_N_N_2_0303_C02F_N_N_N_0
for .www.baidu.com/>, <Cookie BD_NOT_HTTPS=1 for www.baidu.com/>]>
This prints the type of the Response object, the status code, the type of the response body, the body itself, and the cookies.
The output shows that the return value is of type requests.models.Response, the response body is a string (str), and the cookies are of type RequestsCookieJar.
Calling get() issues a GET request. The other request methods are just as short, each a single call:
r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')
5 What kinds of data can be scraped
6 How to parse the data
7 Why is what I scrape different from what I see in the browser
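The raw response is what the server sent before any JavaScript ran. The browser then executes that JavaScript and loads more data through Ajax and similar requests, so the rendered page differs from what the crawler fetched; those requests have to be analyzed as well.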
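When the data arrives through an Ajax request, the response body is often JSON rather than HTML, and it can be parsed directly. A minimal sketch: the payload below is hypothetical, and real endpoints define their own field names.

```python
import json

# Hypothetical body of an Ajax response; real endpoints use their own fields.
ajax_body = (
    '{"page": 1, "items": ['
    '{"id": 1, "title": "first"}, '
    '{"id": 2, "title": "second"}]}'
)

data = json.loads(ajax_body)  # JSON text -> Python dicts and lists
titles = [item["title"] for item in data["items"]]
print(titles)
```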
8 How to handle JavaScript rendering
9 How to save data
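Two common options are a plain-text file and a database. The sketch below writes the same made-up rows as CSV and into SQLite, both of which ship with Python; the table name `pages` and the fields are assumptions for illustration, and everything stays in memory so nothing is written to disk.

```python
import csv
import io
import sqlite3

rows = [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]

# Plain text: CSV. An in-memory buffer is used here; to write to disk,
# pass a file opened with newline="" instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "title"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Relational database: SQLite, using named placeholders filled from each dict.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO pages (id, title) VALUES (:id, :title)", rows)
conn.commit()
stored = conn.execute("SELECT id, title FROM pages ORDER BY id").fetchall()
print(stored)
```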