My colleague couldn't figure out selenium after half a month, so I explained it to him in half an hour! An easy walkthrough of crawling Taobao [beginner friendly]
2022-07-05 13:07:00 【Full stack programmer webmaster】
Hello everyone, we meet again. I'm your friend Quan Jun.
A colleague of mine just started learning Python for work. He struggled with selenium for half a month, and it nearly made him bald, so he finally came to me for answers. I explained it to him with a Taobao crawler example, and he figured it out in under an hour. It's a crawler project that beginners can understand.
There are a few concepts to understand before we start. This crawler uses selenium.
What is selenium?
Selenium is a web automation testing tool that can drive a browser automatically. To drive a given browser you have to install the matching driver; for example, to control Chrome through selenium you need chromedriver, and its version must correspond to your Chrome version.
With the driver in place, install selenium:

pip install selenium -i https://pypi.tuna.tsinghua.edu.cn/simple

1. Importing the modules
First, import the webdriver module:

from selenium import webdriver

Other modules will be used later; I'll list them all up front.
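The full import list isn't shown in the article, so here it is reconstructed from what the rest of the code actually uses: WebDriverWait, expected_conditions, By, and ActionChains from selenium, NoSuchElementException for the end-of-pages check, plus time and sys. The json and lxml etree imports are assumptions based on the parsing and saving steps described below.

import sys
import time
import json

from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException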
2. Initializing the browser

Next comes browser initialization:

browser = webdriver.Chrome()

Many browsers can be used: Android, BlackBerry, IE, and so on. To use another browser, just download its corresponding driver.
I only have the Chrome driver installed, so Chrome is used here; you can download the drivers yourself.

The chromedriver matching Chrome can be downloaded from:

http://npm.taobao.org/mirrors/chromedriver/
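If chromedriver is not on your PATH, you can point selenium at the downloaded binary explicitly. This is the selenium 3 style of doing so, matching the find_element_by_* calls used below; the path is just a placeholder:

from selenium import webdriver

# executable_path is how selenium 3 names the driver binary
# (placeholder path; adjust to wherever you unpacked chromedriver)
browser = webdriver.Chrome(executable_path='/path/to/chromedriver')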
3. Logging in and getting the page
The first problem to solve is logging in. Don't log in by entering the account directly: Taobao's anti-crawling is particularly strict, and if it detects that you are a bot it won't let you log in. So I used a different login method, Alipay QR-code login, by requesting the URL of the Alipay QR-code login page.
def loginTB():
    # Open the Alipay QR-code login page, which redirects back to taobao.com
    browser.get(
        'https://auth.alipay.com/login/index.htm?loginScene=7&goto=https%3A%2F%2Fauth.alipay.com%2Flogin%2Ftaobao_trust_login.htm%3Ftarget%3Dhttps%253A%252F%252Flogin.taobao.com%252Fmember%252Falipay_sign_dispatcher.jhtml%253Ftg%253Dhttps%25253A%25252F%25252Fwww.taobao.com%25252F&params=VFBMX3JlZGlyZWN0X3VybD1odHRwcyUzQSUyRiUyRnd3dy50YW9iYW8uY29tJTJG')

This jumps to the Alipay QR-code login screen.
Here I set a wait of 180 seconds for the search box to appear. It won't actually wait the full 180 seconds; this is an explicit wait, so as soon as the element appears the waiting stops.

Then find the search box, enter the keyword, and search.
    # Explicit wait: block until the search box appears (up to 180 s)
    wait = WebDriverWait(browser, 180)
    wait.until(EC.presence_of_element_located((By.ID, 'q')))
    # Find the search box, enter the search keyword, and click search
    text_input = browser.find_element_by_id('q')
    text_input.send_keys('food')
    btn = browser.find_element_by_xpath('//*[@id="J_TSearchForm"]/div[1]/button')
    btn.click()

4. Parsing the data
After getting the page, the next step is to parse the data and extract the product information we need. The lxml parsing library is used here, selecting child nodes directly with XPath.
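The article doesn't show parse_html itself (it's called in the loop below), so here is a minimal sketch of what it could look like, using the etree import listed earlier. The XPath selectors for the product cards, titles, prices, and shop names are assumptions about Taobao's result-page markup, not taken from the original:

# Records accumulated across pages; dumped to shop_data.json at the end
shop_data = []

def parse_html(page_source):
    # Build an lxml tree from the rendered page source
    doc = etree.HTML(page_source)
    # One node per product card (selector is an assumption)
    for item in doc.xpath('//div[contains(@class, "m-itemlist")]//div[contains(@class, "item")]'):
        shop_data.append({
            # Titles are split across child nodes, so join the text fragments
            'title': ''.join(item.xpath('.//div[contains(@class, "title")]//text()')).strip(),
            'price': ''.join(item.xpath('.//div[contains(@class, "price")]//text()')).strip(),
            'shop': ''.join(item.xpath('.//div[contains(@class, "shop")]//text()')).strip(),
        })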
5. Crawling the pages

After searching, the product listing appears, but there is more than one page: the crawler keeps moving to the next page to collect product information from every page. An endless loop is used here, paging through until the product pages run out.
def loop_get_data():
    page_index = 1
    while True:
        print("=================== Crawling page {} ===================".format(page_index))
        print("Current page URL: " + browser.current_url)
        # Parse the data on the current page
        parse_html(browser.page_source)
        # Explicit wait: wait for the pagination controls to appear
        wait = WebDriverWait(browser, 60)
        wait.until(EC.presence_of_element_located((By.XPATH, '//a[@class="J_Ajax num icon-tag"]')))
        time.sleep(1)
        try:
            # Scroll the next-page button into view with an action chain
            write = browser.find_element_by_xpath('//li[@class="item next"]')
            ActionChains(browser).move_to_element(write).perform()
        except NoSuchElementException as e:
            print("Crawling finished: there is no next page!")
            print(e)
            sys.exit(0)
        time.sleep(0.2)
        # Click the next-page button
        a_href = browser.find_element_by_xpath('//li[@class="item next"]')
        a_href.click()
        page_index += 1

6. Finishing the crawler
Finally, call loginTB() and then loop_get_data(), both defined above. loop_get_data() pages forward inside its own while loop, so it only needs to be called once.
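Putting it together, the entry point is just those two calls (the __main__ guard is my own addition):

if __name__ == '__main__':
    loginTB()         # log in via the Alipay QR code
    loop_get_data()   # then page through the search results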
When the crawl is finished, the data is saved to a shop_data.json file.
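The saving code isn't shown in the article either; here is a minimal sketch, assuming the records were collected into the shop_data list from the parsing sketch above and using the json import listed earlier. Since loop_get_data() ends with sys.exit(0), calling this at the end of each parse_html() run is one simple way to make sure nothing is lost:

def save_data():
    # Overwrite shop_data.json with everything collected so far;
    # ensure_ascii=False keeps Chinese product titles readable
    with open('shop_data.json', 'w', encoding='utf-8') as f:
        json.dump(shop_data, f, ensure_ascii=False, indent=2)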
The results of the crawl are as follows.

The pages this crawler targets can be swapped for others. If you need the source code, comment "taobao" in the comments section and I'll send it by private message, or ask me about any problems you run into while crawling.
Publisher: Full Stack Programmer webmaster. Please credit the source when reprinting: https://javaforall.cn/149590.html Original link: https://javaforall.cn