My colleague couldn't figure out Selenium in half a month, so I walked him through it in half an hour! An easy hands-on demo of crawling Taobao [easy to understand]
2022-07-05 13:07:00 【Full stack programmer webmaster】
Hello everyone, nice to meet you again. I'm your friend Quan Jun.
A colleague of mine has just started learning Python for work. He spent half a month on the Selenium toolkit without figuring it out, and it nearly drove him bald, so he finally came to me for answers.
I explained it to him with a Taobao crawler as the example, and he got it in under an hour. It's a crawler project a beginner can follow.
Before we start crawling, there are a few concepts to understand. This crawler uses Selenium.
What is Selenium?
Selenium is a web automation testing tool that can drive a browser automatically. To drive a given browser you need to install the matching driver; for example, to drive Chrome through Selenium you must install chromedriver, and its version must match your Chrome version.
Once that's done, install Selenium:
pip install selenium -i https://pypi.tuna.tsinghua.edu.cn/simple
1. Importing the modules
First, we import the module:
from selenium import webdriver
We'll use several other modules later, so let's list them all up front:
import sys
import time
import json
from lxml import etree
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
2. Browser initialization
Next comes browser initialization:
browser = webdriver.Chrome()
Many browsers can be used: Android, BlackBerry, IE, and so on. To use another browser, just download the corresponding browser driver.
Since I only have the Chrome driver installed, Chrome is used here. You can download the driver yourself.
The chromedriver download mirror for Chrome:
http://npm.taobao.org/mirrors/chromedriver/
3. Logging in and getting the page
The first problem to solve is logging in. Don't log in by typing the account directly: Taobao's anti-crawling measures are particularly strict, and if it detects that you are a bot it won't let you log in at all.
So I used another login method: scanning a QR code with Alipay. Request the Alipay QR-code login page URL:
def loginTB():
    browser.get(
        'https://auth.alipay.com/login/index.htm?loginScene=7&goto=https%3A%2F%2Fauth.alipay.com%2Flogin%2Ftaobao_trust_login.htm%3Ftarget%3Dhttps%253A%252F%252Flogin.taobao.com%252Fmember%252Falipay_sign_dispatcher.jhtml%253Ftg%253Dhttps%25253A%25252F%25252Fwww.taobao.com%25252F¶ms=VFBMX3JlZGlyZWN0X3VybD1odHRwcyUzQSUyRiUyRnd3dy50YW9iYW8uY29tJTJG')
This takes you to the Alipay QR-code login page.
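Incidentally, that long URL nests several layers of URL-encoded redirects. A small standard-library sketch (the `goto` value below is the goto parameter extracted from the URL above) shows how each `unquote` call peels off one layer of encoding:

```python
from urllib.parse import unquote

# The goto parameter from the login URL above, still percent-encoded.
goto = ('https%3A%2F%2Fauth.alipay.com%2Flogin%2Ftaobao_trust_login.htm'
        '%3Ftarget%3Dhttps%253A%252F%252Flogin.taobao.com%252Fmember'
        '%252Falipay_sign_dispatcher.jhtml%253Ftg%253Dhttps%25253A%25252F'
        '%25252Fwww.taobao.com%25252F')

# One unquote() peels one layer: the target= value inside is still encoded,
# because each redirect in the chain encodes its own continuation URL.
decoded = unquote(goto)
print(decoded)
```

So the login page bounces through Alipay's trust-login endpoint, then Taobao's dispatcher, and finally lands back on www.taobao.com.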
I set a wait of up to 180 seconds here for the search box to appear. It won't actually wait the full 180 seconds: this is an explicit wait, and as soon as the element appears the waiting stops.
Then find the search box and enter the keyword to search.
# Set an explicit wait for the search box to appear
wait = WebDriverWait(browser, 180)
wait.until(EC.presence_of_element_located((By.ID, 'q')))
# Find the search box, enter the search keyword, and click search
text_input = browser.find_element_by_id('q')
text_input.send_keys('food')
btn = browser.find_element_by_xpath('//*[@id="J_TSearchForm"]/div[1]/button')
btn.click()
4. Parsing the data
After getting the page, parse out the product data we need. The lxml parsing library is used here, with XPath expressions selecting the child nodes directly.
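The article doesn't show the body of the parsing function, so here is a minimal sketch of what it might look like. The function name `parse_html` comes from the crawl loop below, but the XPath class names and the `shop_data` list are illustrative assumptions; Taobao changes its page structure often, so the real selectors must be read off the live page:

```python
from lxml import etree

shop_data = []  # parsed items, later dumped to shop_data.json

def parse_html(page_source):
    """Parse one search-result page and collect title/price for each product."""
    html = etree.HTML(page_source)
    # Hypothetical item container and field selectors -- adjust to the live page.
    for item in html.xpath('//div[contains(@class, "item")]'):
        title = ''.join(item.xpath('.//a[contains(@class, "title")]//text()')).strip()
        price = ''.join(item.xpath('.//strong/text()')).strip()
        if title:
            shop_data.append({'title': title, 'price': price})
```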
5. Crawling the pages
Searching shows the first page of matching products, but one page isn't enough: the crawler keeps moving to the next page to collect product information from multiple pages. An infinite loop is used here that walks through the product pages until the last one:
def loop_get_data():
    page_index = 1
    while True:
        print("=================== Crawling page {} ===================".format(page_index))
        print("Current page URL: " + browser.current_url)
        # Parse the data
        parse_html(browser.page_source)
        # Set an explicit wait for the next-page button
        wait = WebDriverWait(browser, 60)
        wait.until(EC.presence_of_element_located((By.XPATH, '//a[@class="J_Ajax num icon-tag"]')))
        time.sleep(1)
        try:
            # Use an action chain to scroll to the next-page button
            write = browser.find_element_by_xpath('//li[@class="item next"]')
            ActionChains(browser).move_to_element(write).perform()
        except NoSuchElementException as e:
            print("Crawling finished; there is no next page!")
            print(e)
            sys.exit(0)
        time.sleep(0.2)
        # Click next
        a_href = browser.find_element_by_xpath('//li[@class="item next"]')
        a_href.click()
        page_index += 1
6. Finishing the crawl
Finally, call loginTB() and loop_get_data(), both written above. loop_get_data() runs its own while loop, so no further calls are needed.
When the crawl finishes, the data is saved to a shop_data.json file.
The crawled results are as follows:
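The save step isn't shown in the article either; here is a minimal sketch, assuming the parsed items were collected into a list of dicts (the file name shop_data.json comes from the article; the function and list names are assumptions):

```python
import json

def save_data(shop_data, path='shop_data.json'):
    # ensure_ascii=False keeps Chinese product titles readable in the file
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(shop_data, f, ensure_ascii=False, indent=2)
```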
The pages targeted by this crawler can be swapped out. If you need the source code, comment "taobao" in the comments section and I'll send it by private message, or ask me any question you run into while crawling.
Publisher: Full Stack Programmer. Please indicate the source when reprinting: https://javaforall.cn/149590.html Original link: https://javaforall.cn