My colleague couldn't figure out Selenium after half a month, so I walked him through it in half an hour! An easy demonstration of scraping Taobao [beginner-friendly]
2022-07-05 13:07:00 【Full stack programmer webmaster】
Hello everyone, good to see you again. I'm your friend Quan Jun.
A colleague of mine had just started learning Python for work and spent half a month trying to pick up Selenium without success. It nearly drove him bald, so he finally came to me for answers.
I explained it to him with a Taobao crawler as the example, and he understood it in under an hour. It's a crawler project that beginners can follow.
There are a few concepts to understand before we start; this crawler is built on Selenium.
What is Selenium?
Selenium is a web automation testing tool that can drive a browser automatically. Whichever browser you want to drive, you need to install the corresponding driver. For example, to control Chrome through Selenium you have to install chromedriver, and its version must match your Chrome version.
Once the driver is in place, install Selenium:
pip install selenium -i https://pypi.tuna.tsinghua.edu.cn/simple
1. Importing the modules
First, we import the webdriver module:
from selenium import webdriver
We will use other modules later as well; here they all are up front:
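The original import list did not survive extraction, so here is a plausible reconstruction, containing exactly the modules the later snippets actually use (WebDriverWait, EC, By, ActionChains, NoSuchElementException, time, sys, lxml, json):

```python
import sys
import time
import json

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import NoSuchElementException
from lxml import etree
```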
2. Browser initialization
Next comes browser initialization:
browser = webdriver.Chrome()
Many browsers are supported: Android, BlackBerry, IE, and so on. To use a different browser, just download the corresponding driver for it.
Since I only have the Google Chrome driver installed, Chrome is what's used here; you can download the driver yourself.
The driver corresponding to Google Chrome:
http://npm.taobao.org/mirrors/chromedriver/
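If chromedriver isn't on your PATH, you can point Selenium at it explicitly. This uses the Selenium 3 style `executable_path` argument, consistent with the `find_element_by_*` calls used below; the path itself is a placeholder for wherever you unpacked the driver:

```python
from selenium import webdriver

# Placeholder path: point this at your downloaded chromedriver binary.
browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
```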
3. Logging in and getting the page
The first thing to solve is login. Don't log in by typing your account credentials directly: Taobao's anti-scraping measures are particularly strict, and if it detects that you are a bot it won't let you log in at all.
So I used a different login method, Alipay QR-code login, and request the URL of the Alipay QR-code login page.
def loginTB():
    browser.get(
        'https://auth.alipay.com/login/index.htm?loginScene=7&goto=https%3A%2F%2Fauth.alipay.com%2Flogin%2Ftaobao_trust_login.htm%3Ftarget%3Dhttps%253A%252F%252Flogin.taobao.com%252Fmember%252Falipay_sign_dispatcher.jhtml%253Ftg%253Dhttps%25253A%25252F%25252Fwww.taobao.com%25252F&params=VFBMX3JlZGlyZWN0X3VybD1odHRwcyUzQSUyRiUyRnd3dy50YW9iYW8uY29tJTJG')
This jumps to the Alipay QR-code login screen.
I set a wait of up to 180 seconds here for the search box to appear. It won't actually wait 180 seconds; it's an explicit wait, so as soon as the element appears the waiting stops.
Then find the search box, enter the keyword, and search.
# Explicit wait: wait until the search box appears
wait = WebDriverWait(browser, 180)
wait.until(EC.presence_of_element_located((By.ID, 'q')))
# Find the search box, type the search keyword, and click search
text_input = browser.find_element_by_id('q')
text_input.send_keys('food')
btn = browser.find_element_by_xpath('//*[@id="J_TSearchForm"]/div[1]/button')
btn.click()
4. Parsing the data
After getting the page, the next step is to parse out the product data we need. The lxml parsing library is used here, with XPath expressions selecting the child nodes directly.
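The article doesn't show `parse_html` itself, so here is a minimal sketch of the idea: feed `browser.page_source` to lxml and pull fields out with XPath. The class names below are placeholders, since Taobao's real markup changes frequently and would need to be inspected in the browser's DevTools:

```python
from lxml import etree

def parse_html(page_source):
    # Parse the raw HTML string returned by browser.page_source
    doc = etree.HTML(page_source)
    items = []
    # Placeholder selectors: substitute the real Taobao item/title/price classes
    for node in doc.xpath('//div[@class="item"]'):
        title = ''.join(node.xpath('.//a[@class="title"]/text()')).strip()
        price = ''.join(node.xpath('.//strong[@class="price"]/text()')).strip()
        items.append({'title': title, 'price': price})
    return items
```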
5. Crawling the pages
After searching, the product listings appear, but they span more than one page, so we keep moving to the next page to collect product information from all of them. An infinite loop is used here, walking through the product pages until there are none left.
def loop_get_data():
    page_index = 1
    while True:
        print("=================== Scraping page {} ===================".format(page_index))
        print("Current page URL: " + browser.current_url)
        # Parse the data on the current page
        parse_html(browser.page_source)
        # Explicit wait: wait until the next-page button appears
        wait = WebDriverWait(browser, 60)
        wait.until(EC.presence_of_element_located((By.XPATH, '//a[@class="J_Ajax num icon-tag"]')))
        time.sleep(1)
        try:
            # Use an action chain to scroll to the next-page button
            write = browser.find_element_by_xpath('//li[@class="item next"]')
            ActionChains(browser).move_to_element(write).perform()
        except NoSuchElementException as e:
            print("Crawling finished, no next page!")
            print(e)
            sys.exit(0)
        time.sleep(0.2)
        # Click "next page"
        a_href = browser.find_element_by_xpath('//li[@class="item next"]')
        a_href.click()
        page_index += 1
6. Finishing the crawler
The last step is to call loginTB() and loop_get_data(), both written above. parse_html() is already called inside loop_get_data()'s while loop, so it doesn't need to be called separately.
When the crawl finishes, the results are saved to a shop_data.json file.
The scraped results look like this (screenshot not preserved):
The pages targeted by this crawler can be swapped for others. If you need the source code, comment "taobao" in the comments section and I can send it by private message, or ask me about any problems you hit while scraping.
Publisher: Full-stack programmer webmaster. Please credit the source when reprinting: https://javaforall.cn/149590.html Original link: https://javaforall.cn