Crawler notes (1) - urllib
2022-06-27 22:30:00 【997and】
Concept
1. A crawler is a program that fetches web pages by URL and extracts the useful information from them.
2. The program imitates a browser: it sends a request to the server and receives the response.
Crawler core
1. Fetching web pages
2. Parsing the data
3. The hard part: the contest between crawlers and anti-crawler measures
Crawler uses
1. Data analysis / hand-built data sets
2. Cold start for social apps
3. Public opinion monitoring
4. Competitor monitoring
Crawler classification
General-purpose crawlers:
1. Examples: search engines such as 360, Baidu, and Google
2. Functions: visit pages, fetch data, store data, process data, provide a retrieval service
3. robots protocol: a convention, not an enforced rule. A site adds a robots.txt file stating which content must not be crawled, but nothing technically prevents a crawler from ignoring it.
4. Site ranking:
(1) Ranking by PageRank score
(2) Baidu paid (bidding) ranking
5. Shortcomings:
(1) Most of the captured data is useless
(2) Data cannot be fetched precisely according to a user's needs
Focused crawlers:
1. Function: implement a crawler for a specific need and grab only the data you need
2. Design approach:
(1) Decide which URL to crawl -> work out how to obtain that URL
(2) Simulate a browser and access the URL over HTTP to get the HTML returned by the server
(3) Parse the HTML string
Anti-crawler measures
1. User-Agent
The user agent, UA for short, is a special header string that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins, and so on.
2. Proxy IP
Xici proxy
Kuaidaili
(1) Transparent proxy: the target server knows you are using a proxy and also knows your real IP.
(2) Anonymous proxy: the target server knows you are using a proxy but does not know your real IP.
(3) High-anonymity proxy: the target server knows neither that you are using a proxy nor your real IP.
3. CAPTCHA on access
CAPTCHA-solving platforms, cloud CAPTCHA-solving platforms, Super
4. Dynamically loaded pages
The site returns JavaScript rather than the page's real data; selenium drives a real browser to send the request.
5. Data encryption
Analyze the JS code
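urllib library usage
urllib ships with Python, so nothing needs to be installed.
One type and six methods
import urllib.request
url = 'http://www.baidu.com'  # example URL, not given in the original notes
response = urllib.request.urlopen(url)
print(type(response))
# <class 'http.client.HTTPResponse'>
# The four read calls below are alternatives: each one continues from where the previous one stopped
content = response.read()  # read the whole body as bytes
content = response.read(5)  # read at most 5 bytes
content = response.readline()  # read one line
content = response.readlines()  # read line by line until the end
print(response.getcode())  # status code; 200 means the request succeeded
print(response.geturl())  # the URL that was fetched
print(response.getheaders())  # the response headers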
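To make the one type and six methods concrete, here is a minimal sketch that can run without internet access. It is not from the original notes: `DemoHandler`, `start_demo_server`, and the three-line body are hypothetical helpers that stand in for a real website, and the empty `ProxyHandler` is only there so a system proxy does not intercept the loopback request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"line one\nline two\nline three\n"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: always answers with the same three-line body."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # silence the per-request log line


def start_demo_server():
    """Start the demo server on a free local port and return it."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Assumption: bypass any system proxy, since the demo only talks to 127.0.0.1.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

server = start_demo_server()
url = f"http://127.0.0.1:{server.server_port}/"

response = urllib.request.urlopen(url)
response_type = type(response)          # http.client.HTTPResponse
first_bytes = response.read(5)          # read(n): at most n bytes
rest_of_line = response.readline()      # readline(): up to the next newline
remaining_lines = response.readlines()  # readlines(): every line that is left
status = response.getcode()             # HTTP status code
final_url = response.geturl()           # URL that was fetched
headers = response.getheaders()         # list of (name, value) tuples
response.close()

server.shutdown()
server.server_close()

print(response_type, first_bytes, rest_of_line, remaining_lines, status)
```

Because the response is a stream, `read(5)`, `readline()` and `readlines()` each pick up where the previous call stopped; calling `read()` first would leave nothing for the others.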
Download
urllib.request.urlretrieve(url, filename)
For a picture, right-click it and copy the image address.
For a video, inspect the page and take the src of the video element.
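A small sketch of `urlretrieve` follows. To keep it runnable offline, it assumes a local stand-in file addressed through a `file://` URL; in real use `url` would be the copied image address or the video `src`. The file names are made up for illustration.

```python
import tempfile
import urllib.request
from pathlib import Path

# Assumption: a local file stands in for the remote picture or video,
# so the sketch runs without network access.
work_dir = Path(tempfile.mkdtemp())
source = work_dir / "source.jpg"
source.write_bytes(b"pretend these bytes are an image")
url = source.as_uri()  # file:///... URL pointing at the stand-in

filename = work_dir / "downloaded.jpg"  # where the download is saved
saved_path, headers = urllib.request.urlretrieve(url, filename)

print(saved_path)
print(Path(saved_path).read_bytes() == source.read_bytes())
```

`urlretrieve` returns the path it wrote to together with the response headers, so the caller can check what was saved.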
Request object customization
URL structure
Default ports: http 80, https 443
Parts of a URL: protocol, host, port, path (e.g. s), parameters (e.g. wd)
Other common default ports: mysql 3306, oracle 1521, redis 6379, mongodb 27017
# urlopen has no headers parameter, so the headers dictionary cannot be passed to it directly; wrap the URL in a Request object instead
# The positional order is (url, data, headers), so pass url and headers as keyword arguments
request = urllib.request.Request(url=url,headers=headers)
response = urllib.request.urlopen(request)
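The URL parts and the customized request can be inspected without sending anything. The sketch below assumes an illustrative Baidu search URL (path `/s`, parameter `wd`) and a placeholder User-Agent string; neither value comes from the original notes.

```python
import urllib.parse
import urllib.request

# Assumption: an illustrative Baidu search URL with path /s and parameter wd.
url = "https://www.baidu.com/s?wd=python"

parts = urllib.parse.urlsplit(url)
print(parts.scheme)    # protocol
print(parts.hostname)  # host
print(parts.port)      # None when the URL relies on the default port (443 for https)
print(parts.path)      # path
print(parts.query)     # parameters

# Assumption: a placeholder User-Agent; copy a real one from your browser.
headers = {"User-Agent": "Mozilla/5.0 (placeholder UA for illustration)"}

# Keyword arguments are needed because the positional order is (url, data, headers).
request = urllib.request.Request(url=url, headers=headers)

print(request.get_method())              # GET, because no data was supplied
print(request.get_header("User-agent"))  # Request stores header names capitalized
```

Passing `request` to `urllib.request.urlopen` would then send the request with the custom header attached.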
Encoding and decoding
get:
The quote method percent-encodes Chinese characters into ASCII, e.g. urllib.parse.quote('周杰伦') (Jay Chou)
The urlencode method: when there are multiple parameters, it percent-encodes a dictionary and joins the key=value pairs with &
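A short sketch of both methods. The `wd` parameter follows the notes; the second parameter and the base URL are assumptions added only to show how several pairs are joined.

```python
import urllib.parse

name = "周杰伦"  # "Jay Chou" in Chinese

# quote: percent-encode a single value
encoded_name = urllib.parse.quote(name)
print(encoded_name)  # %E5%91%A8%E6%9D%B0%E4%BC%A6

# urlencode: percent-encode a whole dictionary and join the pairs with &
# Assumption: the second parameter is made up for illustration.
params = {"wd": name, "sex": "男"}
query = urllib.parse.urlencode(params)
print(query)  # wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7

# For a GET request the encoded parameters are appended to the URL.
url = "https://www.baidu.com/s?" + query
print(url)

# unquote reverses the encoding
print(urllib.parse.unquote(encoded_name) == name)
```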
post:
# The parameters of a POST request must be encoded
data = urllib.parse.urlencode(data).encode('utf-8')
# The parameters of a POST request are not appended to the URL
request = urllib.request.Request(url=url,data=data,headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)
1. The parameters of a POST request must be encoded with urlencode
2. After urlencode, encode must be called to turn the string into bytes
3. The parameters are passed through the data argument of the Request object
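The three rules can be checked without sending anything. In this sketch the endpoint, the form field, and the User-Agent are placeholders chosen for illustration, not values from the original notes.

```python
import urllib.parse
import urllib.request

# Assumptions: placeholder endpoint, form field, and User-Agent.
url = "https://example.com/translate"
headers = {"User-Agent": "Mozilla/5.0 (placeholder UA for illustration)"}
form = {"kw": "spider"}

encoded = urllib.parse.urlencode(form)  # rule 1: urlencode gives a str
data = encoded.encode("utf-8")          # rule 2: encode turns it into bytes

# rule 3: the parameters travel in the data argument, not in the URL
request = urllib.request.Request(url=url, data=data, headers=headers)

print(type(encoded).__name__, type(data).__name__)
print(request.get_method())  # POST, because data is not None
print(request.full_url)      # unchanged: nothing was appended to the URL
```

Supplying `data` is what switches the request from GET to POST; the URL itself stays untouched.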

As shown in the picture, this is Baidu Translate's detailed translation.
The cookie is very important here.
ajax GET request
First 10 pages of the movie rankings
ajax POST request

See the Ajax request in the picture below
Exceptions in crawlers
URLError
HTTPError is a subclass of URLError
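Because HTTPError is a subclass of URLError, it has to be caught first. The sketch below shows one possible arrangement; `fetch` is a hypothetical helper, and the unsupported scheme is used only because it triggers URLError without any network access.

```python
import urllib.error
import urllib.request


def fetch(url):
    """Return ('ok', body) or (error kind, detail) instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return "ok", response.read()
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        return "HTTPError", error.code
    except urllib.error.URLError as error:
        # No usable answer: bad host, refused connection, unknown scheme, etc.
        return "URLError", error.reason


print(issubclass(urllib.error.HTTPError, urllib.error.URLError))

# Offline check: an unsupported scheme makes urlopen raise URLError.
kind, detail = fetch("nosuchscheme://example.invalid/")
print(kind, detail)
```

If the two `except` clauses were swapped, the URLError branch would also swallow every HTTPError and the status code would never be reported separately.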
Cookie login
With the cookie from a logged-in session in the request headers, you can access any page that requires login.
Basic use of handlers
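The original illustration for this section is missing, so the following is only a sketch of the usual three steps (handler, opener, open), assuming a placeholder URL and User-Agent. `fetch_with_opener` is a hypothetical helper; it is defined but not called here because it needs network access.

```python
import urllib.request

# Assumption: placeholder URL and User-Agent, for illustration only.
url = "http://www.baidu.com"
headers = {"User-Agent": "Mozilla/5.0 (placeholder UA for illustration)"}
request = urllib.request.Request(url=url, headers=headers)

handler = urllib.request.HTTPHandler()         # 1. create a handler
opener = urllib.request.build_opener(handler)  # 2. build an opener from it


def fetch_with_opener():
    """3. Send the request through the opener (needs network access)."""
    with opener.open(request, timeout=10) as response:
        return response.read().decode("utf-8")


# Show which handlers the opener ended up with, without sending anything.
print([type(h).__name__ for h in opener.handlers])
```

On its own this behaves like `urlopen`; the point of going through an opener is that other handlers, such as a proxy or cookie handler, can be plugged in at step 1.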

Proxy

Proxy pool
proxies_pool = [
    {'http': '118.24.219.151:1681711111'},
    {'http': '118.24.219.151:1681722222'},
]
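One way to use such a pool is to pick an entry at random and route requests through it with a ProxyHandler. This is a sketch under assumptions: the addresses come from the reserved documentation range 192.0.2.0/24 and will not work as real proxies, and `fetch_via_proxy` is a hypothetical helper that is defined but not called.

```python
import random
import urllib.request

# Assumption: placeholder addresses from the documentation range 192.0.2.0/24;
# replace them with working proxies.
proxies_pool = [
    {"http": "192.0.2.10:8080"},
    {"http": "192.0.2.11:8080"},
]

proxies = random.choice(proxies_pool)           # pick one proxy at random
handler = urllib.request.ProxyHandler(proxies)  # route http traffic through it
opener = urllib.request.build_opener(handler)


def fetch_via_proxy(url):
    """Send one request through the chosen proxy (needs a working proxy)."""
    headers = {"User-Agent": "Mozilla/5.0 (placeholder UA for illustration)"}
    request = urllib.request.Request(url=url, headers=headers)
    with opener.open(request, timeout=10) as response:
        return response.read().decode("utf-8")


print(proxies)
```

Choosing a different entry for each run spreads requests over several IP addresses, which is the purpose of keeping a pool rather than a single proxy.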