Requests crawler implements a simple web page collector
2022-07-24 07:30:00 【Can't fail I】
Implementing a simple web page collector
- Goal: crawl the Sogou results page for whatever keyword the user specifies
1 A new concept
Dynamic parameters:
- If the requested URL carries parameters and we want to change those parameters dynamically, we must:
- 1. Wrap the parameters in a dictionary of key-value pairs
- 2. Pass that dictionary to the params argument of the get method
- 3. Remove the parameters from the original URL that carried them
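The three steps above can be sketched offline with the standard library alone. This is only an illustration: split_query is a hypothetical helper name, the sample URL is made up, and urlencode stands in for roughly what requests does with the params dictionary when it builds the final URL.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def split_query(url):
    """Return (base_url, params): the URL without its query string, and the query as a dict."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    base_url = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
    return base_url, params


# A URL as it might be copied from the address bar (illustrative value).
original = 'https://www.sogou.com/web?query=hello'

# Steps 1 and 3: move the parameters into a dictionary and strip them from the URL.
base_url, params = split_query(original)

# The parameter can now be changed dynamically.
params['query'] = 'python'

# Step 2 would be requests.get(url=base_url, params=params);
# here we only rebuild the URL to see the result without touching the network.
final_url = base_url + '?' + urlencode(params)
print(base_url)
print(params)
print(final_url)
```

Keeping the base URL and the parameters apart means the keyword can come from user input without any manual string concatenation.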
2 Demonstration
Next, I will demonstrate crawling the Sogou search results for “hello Mr. tree”, with the code and its output
import requests

keyWord = input('Enter a keyword to search: ')
# The URL takes request parameters. To crawl the pages for different keywords,
# the parameters carried by the URL must be made dynamic.
params = {
    'query': keyWord
}
url = 'https://www.sogou.com/web'
# params (a dictionary): holds the parameters carried by the request URL
response = requests.get(url=url, params=params)
page_text = response.text
fileName = keyWord + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(fileName, 'crawled successfully!')
The screenshot of the run shows that the page was crawled successfully

But when I opened the crawled page, it looked different from the real page
Real page

Crawled page 
The crawled page may also contain garbled text
3 Solving the problems
First, fix the garbled text
Garbled text appears when the encoding used to decode the response differs from the encoding the page was originally written in; the two must match. The fix is to set the response encoding explicitly before reading response.text.
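To see why a mismatch produces garbled text, here is a small offline sketch. It assumes the page is UTF-8 and that the client decoded it as ISO-8859-1, which, as far as I know, is the fallback requests uses when a text response does not declare a charset. The sample string is just an illustration.

```python
# Bytes a server might send for a UTF-8 encoded page (illustrative sample).
original_text = '搜狗搜索'
raw = original_text.encode('utf-8')

# Decoding with the wrong encoding does not fail, it silently yields garbled text.
garbled = raw.decode('iso-8859-1')

# Decoding with the encoding the page was written in recovers the text.
correct = raw.decode('utf-8')

print(garbled == original_text)   # False
print(correct == original_text)   # True
```

Assigning 'utf-8' to response.encoding asks requests to do the second, correct decoding when response.text is read.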
# Change the encoding used to decode the response data.
# Reading encoding returns the encoding requests inferred for the response data; assigning to it changes the encoding used to decode that data.
response.encoding = 'utf-8'
Next, fix the page display problem: an 【abnormal request】 causes data to be missing from the response
- Abnormal request
- The website's backend has detected that the request was sent by a crawler rather than by a browser. (Any request not sent by a browser is treated as abnormal.)
- How does the backend know whether a request was sent by a browser?
- It checks the User-Agent field in the request headers
- What is User-Agent?
- The identity of the request carrier
- What is a request carrier?
- A browser
- A browser's identity is uniform and fixed, and can be read with a packet-capture tool.
- A crawler program
- Its identity differs from a browser's
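The difference in identity can be shown with a minimal sketch. The looks_like_browser rule below is hypothetical (real sites use their own, usually more elaborate checks), and the example reads the default User-Agent of Python's standard urllib client so that it runs without network access; requests announces itself in a similar way, as python-requests followed by its version.

```python
import urllib.request

# User-Agent of a real browser, as copied from the developer tools.
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36')


def looks_like_browser(user_agent):
    """Hypothetical server-side rule: browser identities start with 'Mozilla/'."""
    return user_agent.startswith('Mozilla/')


# The identity urllib sends when no User-Agent is set explicitly.
default_ua = dict(urllib.request.build_opener().addheaders)['User-agent']

print(default_ua)
print(looks_like_browser(default_ua))   # False: flagged as a crawler
print(looks_like_browser(BROWSER_UA))   # True: accepted as a browser
```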
The code that solves this problem is given in section 6
4 How to get the browser's User-Agent
- Press F12 on any web page.
- Click Network
- Refresh the page
- Select any request in the list and open the Headers tab on the right
- Find User-Agent

5 The second anti-crawling mechanism
UA detection: the website's backend checks the User-Agent of each request to decide whether the request is abnormal.
Counter-strategy:
- UA spoofing (UA camouflage): because many websites apply UA detection, every crawler we write from now on will include UA spoofing by default.
- Spoofing process:
- Copy the User-Agent value of a browser request from the packet-capture tool, wrap it in a dictionary, and pass that dictionary to the headers argument of the request method (get, post).
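The process can be checked without sending anything. The sketch below uses the standard library's urllib.request.Request only because building such an object never touches the network; with requests, the same dictionary is what goes into headers=. The POST body is an illustrative value, not something Sogou is known to accept.

```python
import urllib.request

# User-Agent value copied from the browser's developer tools.
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36')

# Step 1: wrap the value in a dictionary.
headers = {'User-Agent': BROWSER_UA}

url = 'https://www.sogou.com/web'

# Step 2: attach the same dictionary to a GET and to a POST request.
# Creating a Request object does not send it.
get_req = urllib.request.Request(url + '?query=hello', headers=headers)
post_req = urllib.request.Request(url, data=b'query=hello', headers=headers)

for req in (get_req, post_req):
    # urllib stores header names capitalized, hence 'User-agent'.
    print(req.get_method(), req.get_header('User-agent') == BROWSER_UA)
```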
6 Code for fixing the abnormal request
Store the User-Agent value copied from the browser in a dictionary named headers
Then pass the headers dictionary to the headers argument of the request method
import requests

keyWord = input('Enter a keyword to search: ')
# The URL takes request parameters. To crawl the pages for different keywords,
# the parameters carried by the URL must be made dynamic.
params = {
    'query': keyWord
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
url = 'https://www.sogou.com/web'
# params (a dictionary): holds the parameters carried by the request URL
# headers: added to implement UA spoofing
response = requests.get(url=url, params=params, headers=headers)
# Change the encoding used to decode the response data.
# Reading encoding returns the encoding requests inferred for the response data;
# assigning to it changes the encoding used to decode that data.
response.encoding = 'utf-8'
page_text = response.text
fileName = keyWord + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(fileName, 'crawled successfully!')
The page is now crawled successfully
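One fragile spot in the script is that the keyword is used directly as the file name, so a keyword containing characters such as / or ? may fail to save on some systems. A possible workaround, assuming a hypothetical helper called safe_file_name that is not part of the original script:

```python
import re


def safe_file_name(keyword, suffix='.html'):
    """Hypothetical helper: replace characters that are unsafe in file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', keyword.strip())
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or 'page') + suffix


print(safe_file_name('hello Mr. tree'))   # hello_Mr._tree.html
print(safe_file_name('a/b?c'))            # a_b_c.html
```

In the script, fileName = safe_file_name(keyWord) would then take the place of keyWord + '.html'.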
Follow the column for more details