Crawler notes (1) - urllib
2022-06-27 22:30:00 【997and】
Concept
1. A crawler is a program that fetches web pages by URL and extracts the useful information from them.
2. The program imitates a browser: it sends a request to the server and receives the response.
Crawler core
1. Fetching web pages
2. Parsing the data
3. Difficulty: the contest between crawlers and anti-crawling measures
Crawler uses
1. Data analysis / building datasets by hand
2. Cold start for social apps
3. Public opinion monitoring
4. Competitor monitoring
Crawler classification
General-purpose crawlers:
1. Examples: search engines such as 360, Baidu and Google
2. Functions: visit pages, fetch data, store data, process data, provide a search service
3. robots protocol: a convention, not an enforced restriction. A site adds a robots.txt file stating which of its content must not be crawled.
4. Site ranking:
(1) Ranking by PageRank score
(2) Baidu paid (bidding) ranking
5. Drawbacks:
(1) Most of the fetched data is useless
(2) Data cannot be fetched precisely according to the user's needs
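The robots protocol mentioned in this list can be checked from Python with the standard-library urllib.robotparser. The following is only a minimal sketch: the rules, the crawler name my-crawler and the example.com URLs are made up for illustration. A real crawler would usually load the site's own file with set_url() and read() instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(url, user_agent="my-crawler"):
    """Return True if the sample rules above permit fetching url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed("https://example.com/index.html"))         # True
print(allowed("https://example.com/private/data.html"))  # False
```

Because the protocol is only a convention, nothing stops a crawler from ignoring the answer; the check is a matter of good manners.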
Focused crawlers:
1. Function: implement a crawler for a specific need and fetch only the data required
2. Design approach:
(1) Determine the URL to crawl -> work out how to obtain that URL
(2) Simulate a browser and access the URL over HTTP to get the HTML returned by the server
(3) Parse the HTML string
Anti-crawling measures
1. User-Agent
The User-Agent, UA for short, is a special request header string that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins, etc.
2. Proxy IP
Xici proxy
Kuai proxy
(1) Transparent proxy: the target server knows you are using a proxy and also knows your real IP.
(2) Anonymous proxy: the target server knows you are using a proxy but does not know your real IP.
(3) High-anonymity proxy: the target server neither knows you are using a proxy nor knows your real IP.
3. CAPTCHA
CAPTCHA-solving platforms: cloud coding platform, Super
4. Dynamically loaded pages
The site returns JS data rather than the real page data; selenium drives a real browser to send the request
5. Data encryption
Analyze the JS code
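urllib library usage
urllib ships with Python, so nothing needs to be installed.
One type and six methods
import urllib.request
url = 'http://www.example.com'  # placeholder URL
response = urllib.request.urlopen(url)
print(type(response))
# HTTPResponse type
# The four read calls below are alternatives: each call consumes part of the stream
content = response.read()  # read the whole body as bytes
content = response.read(5)  # return 5 bytes
content = response.readline()  # read one line
content = response.readlines()  # read line by line until the end
print(response.getcode())  # return the status code; 200 means the request succeeded
print(response.geturl())  # return the URL
print(response.getheaders())  # return the response headers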
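The read methods all pull from the same stream, so their order matters. The sketch below calls them in sequence on one response to show how each continues where the previous one stopped. To stay runnable offline it opens a data: URL (an assumption of this example, not part of the notes); for such a URL the object is not an HTTPResponse and getcode() returns None, while a real http URL would give an HTTPResponse and normally 200.

```python
import urllib.request

def summarize(response):
    """Exercise the read methods in sequence on one response object."""
    info = {
        "type": type(response).__name__,
        "status": response.getcode(),       # None for non-HTTP responses
        "url": response.geturl(),
        # headers as a dict; on an HTTPResponse, getheaders() returns
        # the same pairs as a list of tuples
        "headers": dict(response.headers),
    }
    info["first_5_bytes"] = response.read(5)    # consumes 5 bytes
    info["rest_of_line"] = response.readline()  # continues after them
    info["other_lines"] = response.readlines()  # everything that is left
    return info

# Offline demo: urlopen also accepts data: URLs, so no server is needed.
with urllib.request.urlopen("data:text/plain,alpha-one%0Abeta-two%0A") as resp:
    print(summarize(resp))
```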
Download
urllib.request.urlretrieve(url,filename)
For a picture, right-click it and copy the image address.
For a video, inspect the page and use the src of the video element.
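A minimal sketch of urlretrieve in use. The wrapper name download and the demo.txt file name are made up for this example, and a data: URL stands in for a real image or video address so that the snippet runs without a network connection.

```python
import os
import tempfile
import urllib.request

def download(url, filename):
    """Save the resource at url to filename and return the local path."""
    path, _headers = urllib.request.urlretrieve(url, filename)
    return path

# Offline demo with a data: URL; for a real image or video, pass the
# address copied from the page (the img or video src) instead.
target = os.path.join(tempfile.mkdtemp(), "demo.txt")
saved = download("data:text/plain,hello", target)
with open(saved, "rb") as f:
    print(f.read())  # b'hello'
```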
Request object customization
URL composition
Default ports: http 80, https 443
A URL is made up of protocol, host, port, path (e.g. s) and parameters (e.g. wd)
Other default ports: mysql 3306, oracle 1521, redis 6379, mongodb 27017
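The URL parts listed above can be pulled apart with urllib.parse. This is a small sketch: the helper name describe_url and the DEFAULT_PORTS table are made up for the example, and the search value python is only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

DEFAULT_PORTS = {"http": 80, "https": 443}

def describe_url(url):
    """Split a URL into protocol, host, port, path and parameters."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        # urlsplit reports None when the URL has no explicit port,
        # so fall back to the protocol's default.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "parameters": parse_qs(parts.query),
    }

print(describe_url("https://www.baidu.com/s?wd=python"))
```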
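# Customize the request object: urlopen() has no headers parameter, so the headers dict cannot be passed to it directly
# The second positional parameter of Request is data, not headers, so pass headers as a keyword argument
import urllib.request
url = 'https://www.baidu.com'  # placeholder URL
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; copy the full value from your browser
request = urllib.request.Request(url=url,headers=headers)
response = urllib.request.urlopen(request)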
Encoding and decoding
get:
The quote method converts Chinese characters into percent-encoded ASCII, e.g. urllib.parse.quote('周杰伦')
The urlencode method: when there are several parameters, it encodes a whole dictionary the same way and joins the pairs with &
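A short sketch of both methods side by side. The parameter names wd and page are illustrative; unquote is used at the end only to show that the encoding is reversible.

```python
from urllib.parse import quote, urlencode, unquote

# quote: percent-encode a single non-ASCII value.
name = quote("周杰伦")
print(name)   # %E5%91%A8%E6%9D%B0%E4%BC%A6

# urlencode: encode a whole dict and join the pairs with '&'.
query = urlencode({"wd": "周杰伦", "page": 1})
print(query)  # wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&page=1

# For a GET request the encoded query is appended to the URL.
url = "https://www.baidu.com/s?" + query
print(unquote(url))
```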
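post:
import urllib.parse
import urllib.request
url = 'https://example.com/translate'  # placeholder URL
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder header
data = {'kw': 'spider'}  # placeholder form data
# The parameters of a POST request must be encoded
data = urllib.parse.urlencode(data).encode('utf-8')
# The parameters of a POST request are not appended to the URL
request = urllib.request.Request(url=url,data=data,headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)
1. The parameters of a POST request must be encoded with urlencode.
2. After encoding, encode() must be called to turn the string into bytes.
3. The parameters are passed to the Request object (data=), not put into the URL.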
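The three rules can be checked without sending anything, because a Request object exposes its method and body. In this sketch the helper name build_post_request, the endpoint, the form field and the header value are all hypothetical.

```python
import urllib.parse
import urllib.request

def build_post_request(url, form, headers):
    """Build a POST Request following the three rules above."""
    encoded = urllib.parse.urlencode(form)   # rule 1: urlencode the dict
    body = encoded.encode("utf-8")           # rule 2: str -> bytes
    # rule 3: the body goes into the Request object, not into the URL
    return urllib.request.Request(url=url, data=body, headers=headers)

request = build_post_request(
    "https://example.com/translate",
    {"kw": "spider"},
    {"User-Agent": "Mozilla/5.0"},
)
print(request.get_method())  # POST, because data is not None
print(request.data)          # b'kw=spider'
print(urllib.request.Request("https://example.com/").get_method())  # GET
```

Passing the encoded string without calling encode() fails when the request is sent, since urllib requires the POST body to be bytes.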

[Figure: Baidu Translate, detailed translation]
The cookie is very important.
Ajax GET request
Example: the first 10 pages of the movie rankings
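Paged Ajax endpoints usually differ only in their query string, so the ten pages can be generated in a loop. The sketch below only builds the requests. The endpoint, the parameter names start and limit, and the page size of 20 are assumptions made for illustration; the real values have to be copied from the Ajax request shown in the browser's developer tools.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and page size.
BASE_URL = "https://example.com/top_list?"
PAGE_SIZE = 20

def build_page_request(page, headers=None):
    """Return the GET Request for a 1-based page number."""
    params = {"start": (page - 1) * PAGE_SIZE, "limit": PAGE_SIZE}
    url = BASE_URL + urllib.parse.urlencode(params)
    return urllib.request.Request(url=url, headers=headers or {})

for page in range(1, 11):  # pages 1 to 10
    print(build_page_request(page).full_url)
```

Each request can then be sent with urllib.request.urlopen and the decoded content written to one file per page.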
Ajax POST request
[Figure: the Ajax request]
Exceptions in crawlers
URLError
HTTPError is a subclass of URLError
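Because HTTPError is a subclass of URLError, its except clause has to come first, otherwise the URLError clause would catch both. A minimal sketch follows; the opener parameter and the two stand-in functions are inventions of this example so that it runs offline, and in normal use fetch(url) simply calls urlopen.

```python
import urllib.error
import urllib.request

def fetch(url, opener=urllib.request.urlopen):
    """Return (ok, text). opener is injectable so the demo runs offline."""
    try:
        with opener(url) as response:
            return True, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Must come first: HTTPError is a subclass of URLError.
        return False, f"HTTP error {err.code}"
    except urllib.error.URLError as err:
        return False, f"URL error: {err.reason}"

def always_404(url):
    # Stand-in for a server that answers 404.
    raise urllib.error.HTTPError(url, 404, "Not Found", None, None)

def unreachable(url):
    # Stand-in for a host that cannot be reached.
    raise urllib.error.URLError("host not found")

print(fetch("https://example.com/missing", opener=always_404))
print(fetch("https://example.com/", opener=unreachable))
```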
Cookie login
With the cookie obtained after logging in, you can access any page
Basic use of handlers
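Handler use generally follows three steps: create a handler, build an opener from it, and call open on the opener instead of urlopen. The sketch below performs the first two steps and leaves the network call commented out. The helper name build_basic_opener, the URL and the header value are placeholders.

```python
import urllib.request

def build_basic_opener():
    """Create a handler and wrap it in an opener."""
    handler = urllib.request.HTTPHandler()         # 1. create a handler
    opener = urllib.request.build_opener(handler)  # 2. build an opener
    return opener

opener = build_basic_opener()
# Placeholder request; nothing is sent until opener.open(request) runs.
request = urllib.request.Request(
    url="http://www.example.com",
    headers={"User-Agent": "Mozilla/5.0"},
)
# response = opener.open(request)                  # 3. send the request
# content = response.read().decode("utf-8")
print(type(opener).__name__)  # OpenerDirector
```

The benefit over plain urlopen is that other handlers, such as ProxyHandler, can be swapped in at step 1 without changing the rest of the code.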

Proxy

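Proxy pool
proxies_pool = [
    # Placeholder entries: the port part exceeds the maximum port number 65535, so replace them with working proxies
    {'http': '118.24.219.151:1681711111'},
    {'http': '118.24.219.151:1681722222'},
]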
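A proxy pool is typically combined with random.choice and ProxyHandler so that each run goes out through a different address. This is a sketch under assumptions: the helper name build_proxy_opener and the 127.0.0.1 addresses are placeholders, not working proxies, and the network call is left commented out.

```python
import random
import urllib.request

# Placeholder pool; replace with proxies that actually work.
EXAMPLE_POOL = [
    {"http": "127.0.0.1:8001"},
    {"http": "127.0.0.1:8002"},
]

def build_proxy_opener(pool):
    """Pick one proxy at random and return (opener, chosen_proxy)."""
    proxies = random.choice(pool)
    handler = urllib.request.ProxyHandler(proxies=proxies)
    return urllib.request.build_opener(handler), proxies

opener, chosen = build_proxy_opener(EXAMPLE_POOL)
print(chosen)
# response = opener.open(request)  # request: a urllib.request.Request
```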