Crawler notes (1) - urllib
2022-06-27 22:30:00 【997and】
Concept
1. A crawler is a program that fetches web pages by URL and extracts the useful information from them.
2. The program imitates a browser: it sends a request to the server and receives the response.
Crawler core
1. Fetching the web page
2. Parsing the data
3. The hard part: the ongoing contest between crawlers and anti-crawler measures
Crawler uses
1. Data analysis / hand-built datasets
2. Cold start for social apps
3. Public opinion monitoring
4. Competitor monitoring
Crawler classification
General-purpose crawlers:
1. Examples: search engines such as 360, Baidu, and Google
2. Functions: visit pages, fetch data, store data, process data, provide retrieval services
3. robots protocol: a convention, not an enforced restriction. A site adds a robots.txt file stating which of its content must not be crawled.
4. Website ranking:
(1) Ranking by PageRank value
(2) Baidu paid ranking
5. Drawbacks:
(1) Most of the fetched data is useless
(2) Data cannot be fetched precisely according to what the user needs
Focused crawlers:
1. Function: implement a crawler for a specific need and fetch only the data required
2. Design approach:
(1) Decide which URL to crawl -> work out how to obtain that URL
(2) Imitate a browser and access the URL over HTTP to get the HTML returned by the server
(3) Parse the HTML string
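The three design steps above can be sketched with the standard library alone. This is a minimal sketch rather than the notes' own code: `LinkCollector`, `fetch_html`, `extract_links`, and the sample HTML are hypothetical names introduced for illustration, and `html.parser` merely stands in for whichever parser is actually used.

```python
import urllib.request
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url, user_agent="Mozilla/5.0"):
    """Steps 1-2: request the URL like a browser and return the HTML text."""
    request = urllib.request.Request(url=url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def extract_links(html):
    """Step 3: parse the HTML string and return the links it contains."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Sample HTML stands in for a downloaded page, so no network access is needed.
sample_html = '<html><body><a href="/page/1">one</a><a href="/page/2">two</a></body></html>'
links = extract_links(sample_html)
print(links)
```

For a real page, `extract_links(fetch_html(url))` would chain the steps; the decoding assumes the page is UTF-8.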
Anti-crawling measures
1. User-Agent
The User-Agent (UA) is a special request header string that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins, and so on.
2. Proxy IP
Xici Proxy
Kuaidaili
(1) Transparent proxy: the target server knows you are using a proxy and also knows your real IP.
(2) Anonymous proxy: the target server knows you are using a proxy but does not know your real IP.
(3) High-anonymity proxy: the target server neither knows you are using a proxy nor knows your real IP.
3. CAPTCHA on access
CAPTCHA-solving platforms, cloud CAPTCHA-solving platforms, Super
4. Dynamically loaded pages
The site returns JS rather than the real page data; selenium can drive a real browser to send the request.
5. Data encryption
Analyze the JS code.
Using the urllib library
urllib ships with Python.
One type and six methods
print(type(response))
# HTTPResponse type
content = response.read()  # read the whole body
content = response.read(5)  # read 5 bytes
content = response.readline()  # read one line
content = response.readlines()  # read line by line until the end
print(response.getcode())  # status code; 200 means the request succeeded
print(response.geturl())  # the URL that was requested
print(response.getheaders())  # the response headers
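To see the one type and six methods in action without depending on a real website, the sketch below starts a throwaway local server and inspects a single response. `DemoHandler`, `inspect_response`, and `BODY` are hypothetical names for this illustration; only the `urllib.request` calls come from the notes.

```python
import os
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Keep any system proxy setting from intercepting the local demo request.
os.environ["no_proxy"] = "127.0.0.1"

BODY = b"line one\nline two\nline three\n"


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local server (hypothetical) so the example needs no real website."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def inspect_response(url):
    """Call each of the six methods on one response and collect the results."""
    with urllib.request.urlopen(url) as response:
        return {
            "type": type(response).__name__,
            "status": response.getcode(),
            "url": response.geturl(),
            "headers": dict(response.getheaders()),
            "first_5_bytes": response.read(5),
            "rest_of_line": response.readline(),
            "remaining_lines": response.readlines(),
        }


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    info = inspect_response(f"http://127.0.0.1:{server.server_port}/")
finally:
    server.shutdown()
    server.server_close()

for key, value in info.items():
    print(key, "=", value)
```

Note that the reads are consumed in order: after `read(5)`, `readline()` returns only the rest of the first line, and `readlines()` returns whatever is left.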
Download
urllib.request.urlretrieve(url, filename)
For an image: right-click it and copy the image address.
For a video: inspect the page and find the src of the video element.
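A minimal sketch of `urlretrieve` follows. The `download` wrapper and the file names are hypothetical, and a local `file://` URL stands in for a copied image or video address so the example runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def download(url, filename):
    """Save the resource at `url` to the local path `filename`."""
    saved_path, _headers = urllib.request.urlretrieve(url, filename)
    return saved_path


# Create a stand-in "remote" file and address it with a file:// URL.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.bin"
source.write_bytes(b"fake image bytes")

target = workdir / "copy.bin"
saved = download(source.as_uri(), str(target))
print(saved, target.read_bytes())
```

With a real image or video address, only the first argument changes; the second argument decides the local file name and extension.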
Request object customization
URL composition
Default ports: http 80 / https 443
Parts: protocol, host, port, path (e.g. s), parameters (e.g. wd)
Other common ports: mysql 3306, oracle 1521, redis 6379, mongodb 27017
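The URL parts listed above can be pulled apart with `urllib.parse`. This is a small sketch: `describe_url` and `DEFAULT_PORTS` are hypothetical helpers, and the Baidu search URL is an assumed example chosen to match the `s` path and `wd` parameter mentioned in the notes.

```python
from urllib.parse import parse_qs, urlsplit

# Default ports used when the URL does not spell one out.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Split a URL into protocol, host, port, path, and parameters."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "parameters": parse_qs(parts.query),
    }


info = describe_url("https://www.baidu.com/s?wd=python")
print(info)
```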
# Customize the request object: urlopen() has no headers parameter, so a headers dict cannot be passed to it directly
# Request's second positional parameter is data, not headers, so pass the arguments by keyword
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
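The keyword-argument point can be checked without sending anything. In the sketch below the URL and the User-Agent string are placeholders, and `fetch` is a hypothetical helper that is defined but not called.

```python
import urllib.request

url = "https://www.example.com/"  # placeholder address
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # sample UA

# Correct: headers passed by keyword.
request = urllib.request.Request(url=url, headers=headers)
print(request.get_method(), request.header_items())

# Pitfall: the second positional argument is data, so the dict is treated
# as a request body and no headers are set at all.
wrong = urllib.request.Request(url, headers)
print(wrong.get_method(), wrong.header_items())


def fetch(req):
    """Send the prepared request and return the decoded body (assumes UTF-8)."""
    with urllib.request.urlopen(req) as response:
        return response.read().decode("utf-8")
```

`Request` stores header names in capitalized form, which is why the header shows up as `User-agent` in `header_items()`.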
Encoding and decoding
get:
The quote method converts Chinese text into percent-encoded ASCII: urllib.parse.quote('周杰伦') (Jay Chou)
The urlencode method: when there are multiple parameters, it converts a dict into percent-encoded key=value pairs joined by &
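A short sketch of both GET helpers follows. The `sex` parameter and the Baidu search URL are assumptions added for illustration; only `quote` and `urlencode` come from the notes.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

# quote: percent-encode a single value.
keyword = "周杰伦"
encoded_keyword = quote(keyword)
print(encoded_keyword)
print(unquote(encoded_keyword))  # decoding restores the original text

# urlencode: percent-encode a whole dict and join the pairs with &.
params = {"wd": "周杰伦", "sex": "男"}
query = urlencode(params)
print(query)

# For a GET request the encoded query is appended to the URL.
search_url = "https://www.baidu.com/s?" + query
print(search_url)
```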
post:
# POST parameters must be encoded
data = urllib.parse.urlencode(data).encode('utf-8')
# POST parameters are not appended to the URL
request = urllib.request.Request(url=url, data=data, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)
1. POST parameters must be encoded with urlencode
2. The encoded string must then be converted to bytes by calling encode
3. The parameters are passed to the Request object through its data argument
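The three POST rules above can be verified on the request object itself before anything is sent. This is a sketch under assumptions: the endpoint, the `kw` form field, and the `post` helper are hypothetical and not taken from the notes.

```python
import urllib.parse
import urllib.request

url = "https://www.example.com/translate"  # placeholder endpoint
headers = {"User-Agent": "Mozilla/5.0"}  # sample UA
form = {"kw": "spider"}  # hypothetical form field

# Rules 1 and 2: urlencode the dict, then encode the string to bytes.
data = urllib.parse.urlencode(form).encode("utf-8")
print(type(data).__name__, data)

# Rule 3: the bytes go into the Request object, not onto the URL.
request = urllib.request.Request(url=url, data=data, headers=headers)
print(request.get_method(), request.full_url)


def post(req):
    """Send the prepared POST request and return the decoded body (assumes UTF-8)."""
    with urllib.request.urlopen(req) as response:
        return response.read().decode("utf-8")
```

Supplying `data` is what switches the request method from GET to POST; the URL itself stays free of query parameters.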

As pictured, this is the request for Baidu Translate's detailed translation.
The cookie is very important here.
ajax GET request
Example: the first 10 pages of the movie rankings
ajax POST request

See the Ajax request in the picture below.
Exceptions in crawlers
URLError
HTTPError is a subclass of URLError
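Because HTTPError is a subclass of URLError, the more specific exception has to be caught first. The sketch below shows that ordering; `fetch_or_report` is a hypothetical helper, and the made-up URL scheme is only there to trigger a URLError without touching the network.

```python
import urllib.error
import urllib.request


def fetch_or_report(url):
    """Return ("ok", body) on success, or a short tag describing the failure."""
    try:
        with urllib.request.urlopen(url) as response:
            return "ok", response.read()
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        return "http_error", error.code
    except urllib.error.URLError as error:
        # The request never got a valid answer (bad host, bad scheme, ...).
        return "url_error", str(error.reason)


print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
result = fetch_or_report("no-such-scheme://example")
print(result)
```

If the two `except` clauses were swapped, the URLError branch would swallow every HTTPError as well.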
Cookie login
Sending the cookie obtained after logging in lets you access any page that requires a login.
Basic use of handlers
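The notes give no code for this part, so the following is a minimal sketch of the usual handler, opener, open sequence in `urllib.request`. The URL, the User-Agent string, and the `fetch_with_opener` helper are placeholders, and nothing is sent when the block runs.

```python
import urllib.request

url = "https://www.example.com/"  # placeholder address
headers = {"User-Agent": "Mozilla/5.0"}  # sample UA
request = urllib.request.Request(url=url, headers=headers)

# 1. Create a handler object.
handler = urllib.request.HTTPHandler()
# 2. Build an opener from the handler.
opener = urllib.request.build_opener(handler)
print(type(opener).__name__)


def fetch_with_opener(req):
    """3. Call open() on the opener instead of urlopen() (assumes UTF-8)."""
    with opener.open(req) as response:
        return response.read().decode("utf-8")
```

The same three steps apply to other handlers; swapping `HTTPHandler` for a different handler class changes the behaviour without changing the calling code.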

Proxy

Proxy pool
proxies_pool = [
    {'http': '118.24.219.151:1681711111'},
    {'http': '118.24.219.151:1681722222'},
]
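One way a pool like this might be used is to pick an entry at random and route requests through it with `ProxyHandler`. This is a sketch under assumptions: the addresses below are hypothetical values from a documentation-only IP range, not working proxies, and `build_proxy_opener` is a hypothetical helper.

```python
import random
import urllib.request

# Hypothetical pool: 203.0.113.0/24 is reserved for documentation,
# so these entries are placeholders rather than real proxies.
proxies_pool = [
    {"http": "203.0.113.10:8080"},
    {"http": "203.0.113.11:8080"},
]


def build_proxy_opener(pool):
    """Pick one proxy at random and return (opener, chosen proxy)."""
    proxies = random.choice(pool)
    handler = urllib.request.ProxyHandler(proxies=proxies)
    return urllib.request.build_opener(handler), proxies


opener, chosen = build_proxy_opener(proxies_pool)
print(chosen)
```

Requests would then be sent with `opener.open(request)`; building a fresh opener per request rotates through the pool.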