My colleague couldn't figure out Selenium for half a month, so I sorted it out for him in half an hour! An easy walkthrough of crawling Taobao [easy to understand]
2022-07-05 13:07:00 【Full stack programmer webmaster】
Hello everyone, good to see you again. I'm your friend Quan Jun.
Because of work needs, a colleague of mine had just started learning Python. He spent half a month on the Selenium tool without understanding it, which nearly made him bald, and he finally came to me for an explanation.
So I explained it to him with a Taobao crawler as the example, and he figured it out in less than an hour. It's a crawler project that beginners can follow.
There are a few concepts to understand before the crawler starts; this crawler uses selenium.
What is selenium?
Selenium is a web automation testing tool that can drive a browser automatically. Whichever browser you want to drive, you need to install the corresponding driver. For example, to drive Chrome through Selenium you must install chromedriver, and its version has to match your Chrome version.
Once that's done, install selenium:
pip install selenium -i https://pypi.tuna.tsinghua.edu.cn/simple
1. Import the module
First, we import the module:
from selenium import webdriver
Other modules will be used later; they are all listed at once below:
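The snippets later in the article use WebDriverWait, expected_conditions, By, ActionChains, NoSuchElementException, time and sys, plus lxml for parsing and json for saving. A sketch of the full import block, reconstructed from those usages (not necessarily the author's exact header), would be:

# Import block reconstructed from the calls used later in this article (a sketch)
import sys
import time
import json

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from lxml import etree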
2. Initialize the browser
Next comes the browser initialization:
browser = webdriver.Chrome()
Many browsers can be used: Android, BlackBerry, IE, and so on. To use a different browser, just download its corresponding driver. Since I only installed the Chrome driver, Chrome is used here; you can download the driver yourself.
The chromedriver that matches Chrome can be downloaded from:
http://npm.taobao.org/mirrors/chromedriver/
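If chromedriver isn't on your PATH, you can also point Selenium at the downloaded binary directly. A minimal sketch using the Selenium 3 style API that this article uses (the path is a placeholder):

# Selenium 3 style: pass the location of the downloaded chromedriver explicitly
# ('/path/to/chromedriver' is a placeholder; replace it with your own path)
browser = webdriver.Chrome(executable_path='/path/to/chromedriver')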
3. Log in and get the page
The first problem to solve is logging in. Don't type the account credentials to log in directly, because Taobao's anti-crawling is particularly strict: if it detects that you are a bot, it won't let you log in, and its checks around login are very tight.
So another login method is used here: Alipay QR-code login. The request goes to the URL of the Alipay QR-code login page.
def loginTB():
    browser.get(
        'https://auth.alipay.com/login/index.htm?loginScene=7&goto=https%3A%2F%2Fauth.alipay.com%2Flogin%2Ftaobao_trust_login.htm%3Ftarget%3Dhttps%253A%252F%252Flogin.taobao.com%252Fmember%252Falipay_sign_dispatcher.jhtml%253Ftg%253Dhttps%25253A%25252F%25252Fwww.taobao.com%25252F&params=VFBMX3JlZGlyZWN0X3VybD1odHRwcyUzQSUyRiUyRnd3dy50YW9iYW8uY29tJTJG')
This jumps to the Alipay QR-code login page.
An explicit wait of up to 180 seconds is set here for the search box to appear. It won't actually wait 180 seconds: it's an explicit wait, so as soon as the element shows up the waiting stops.
Then find the search box and type in the keyword to search.
    # Set an explicit wait until the search box appears
    wait = WebDriverWait(browser, 180)
    wait.until(EC.presence_of_element_located((By.ID, 'q')))
    # Find the search box, enter the search keyword and click search
    text_input = browser.find_element_by_id('q')
    text_input.send_keys('food')
    btn = browser.find_element_by_xpath('//*[@id="J_TSearchForm"]/div[1]/button')
    btn.click()
4. Parse the data
After the page is obtained, parse the data and pull out the product information you need. The lxml parsing library is used here, selecting the child nodes directly with XPath.
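The article doesn't show parse_html itself, so here is a minimal sketch of what it could look like with lxml and XPath. The XPath class names ("item", "title", "price", "shop") and the shop_data list are assumptions for illustration; Taobao's markup changes, so the selectors will likely need adjusting:

shop_data = []  # collected items, dumped to shop_data.json at the end (assumption)

def parse_html(html):
    # Build an lxml element tree from the rendered page source
    doc = etree.HTML(html)
    # Each product card is assumed to be a div whose class contains "item"
    for item in doc.xpath('//div[contains(@class, "item")]'):
        # text() nodes are joined because Taobao splits text across nested spans
        title = ''.join(item.xpath('.//div[contains(@class, "title")]//text()')).strip()
        price = ''.join(item.xpath('.//div[contains(@class, "price")]//text()')).strip()
        shop = ''.join(item.xpath('.//div[contains(@class, "shop")]//text()')).strip()
        if title:
            shop_data.append({'title': title, 'price': price, 'shop': shop})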
5. Crawl the pages
After searching from the search box, the product listings appear, but there is more than one page, so the crawler keeps clicking through to the next page to gather product information from multiple pages. An endless while loop is used here that keeps paging through the product listings.
def loop_get_data():
    page_index = 1
    while True:
        print("=================== Crawling page {} ===================".format(page_index))
        print("Current page URL: " + browser.current_url)
        # Parse the data on the current page
        parse_html(browser.page_source)
        # Set an explicit wait for the paging controls to appear
        wait = WebDriverWait(browser, 60)
        wait.until(EC.presence_of_element_located((By.XPATH, '//a[@class="J_Ajax num icon-tag"]')))
        time.sleep(1)
        try:
            # Use an action chain to scroll to the next-page button
            write = browser.find_element_by_xpath('//li[@class="item next"]')
            ActionChains(browser).move_to_element(write).perform()
        except NoSuchElementException as e:
            print("Crawling finished, there is no next page!")
            print(e)
            sys.exit(0)
        time.sleep(0.2)
        # Click the next-page button
        a_href = browser.find_element_by_xpath('//li[@class="item next"]')
        a_href.click()
        page_index += 1
6. Finish the crawl
Finally, call loginTB() and loop_get_data(), the two functions written above. loop_get_data() keeps running inside its own while loop, so it only needs to be called once.
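Putting the two calls together, the entry point might look like this (a sketch; the __main__ guard is my own addition, the function names come from the article):

if __name__ == '__main__':
    loginTB()         # open the Alipay QR-code login page, then search once logged in
    loop_get_data()   # page through the search results until there is no next page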
When the crawl finishes, the data is saved to a shop_data.json file.
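The saving step isn't shown in the article. A minimal sketch, assuming the parsed items were collected into a shop_data list as in the parse_html sketch above; note that loop_get_data() calls sys.exit(0) on the last page, so in practice the dump would need to run before that (for example by replacing sys.exit(0) with return):

# Write the collected items to shop_data.json
# (ensure_ascii=False keeps Chinese product titles readable)
with open('shop_data.json', 'w', encoding='utf-8') as f:
    json.dump(shop_data, f, ensure_ascii=False, indent=2)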
The crawled results are shown in a screenshot in the original post.
The web page crawled here can be swapped for something else. If you need the source code, comment "taobao" in the comments section and I'll send it to you by private message, or ask me any questions you run into while crawling.
Publisher: Full-stack programmer webmaster. Please credit the source when reprinting: https://javaforall.cn/149590.html  Original link: https://javaforall.cn