Do a crawler project by yourself
2022-07-27 08:31:00 【Sizchristov】
This article uses a Python crawler to collect part of the information published on the xxx website and save it into an Excel worksheet. The project mainly exercises Python crawling, database access, and Excel file operations. The first code snippet follows below. I am a complete programming beginner, and I simply want to share my approach here:
from bs4 import BeautifulSoup
import requests, random, fake_useragent
import redis, re, time
import pandas as pd

# Pick a random IP from the proxy pool
def get_proxy():
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=0, decode_responses=True)
    total = int(r.zcard('proxies:universal'))      # current size of the proxy pool
    start = 0 if total < 100 else total - 100      # only consider the 100 highest-scored proxies
    ip_list = r.zrange('proxies:universal', start, total, withscores=True)
    random_ip = random.choices(ip_list)[0][0]
    return random_ip
Because xxx has strong anti-crawling measures, a self-built IP proxy pool is used here to randomize the requesting IP address, and only the top 100 highest-ranked addresses are drawn from. For building the pool I mainly followed this article: build your own IP proxy pool. There are of course other online write-ups on self-built proxy pools, and readers who know good ones are welcome to share them. For those who do not want to pay for proxy IPs, a self-built pool is a good choice; paid high-anonymity proxies with good connectivity are, of course, harder to detect.
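For completeness, here is a minimal sketch of the write side that get_proxy expects: a maintenance script that scores proxies in the same proxies:universal sorted set. The key name and Redis connection mirror the code above, but the scoring policy (start at 10, reward on success, drop at zero) and the probe URL are my own assumptions for illustration, not taken from the referenced proxy-pool article.

```python
import redis, requests

r = redis.Redis(host='127.0.0.1', port=6379, db=0, decode_responses=True)

def add_proxy(proxy):
    # Add a proxy such as '1.2.3.4:8080' with a neutral starting score (assumed policy).
    if r.zscore('proxies:universal', proxy) is None:
        r.zadd('proxies:universal', {proxy: 10})

def check_proxy(proxy, timeout=5):
    # Probe the proxy and adjust its score; the probe endpoint is just an example.
    try:
        requests.get('https://httpbin.org/ip',
                     proxies={'http': 'http://' + proxy, 'https': 'http://' + proxy},
                     timeout=timeout)
        r.zincrby('proxies:universal', 1, proxy)   # reward proxies that still work
    except requests.RequestException:
        score = r.zincrby('proxies:universal', -1, proxy)
        if score <= 0:
            r.zrem('proxies:universal', proxy)     # drop proxies that keep failing
```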
# Initialize the crawler
def _init(url, city):
    ua = fake_useragent.UserAgent()
    browsers = ua.data_browsers
    # pick a random browser family, then a random UA string from it
    random_ua = random.choices(browsers[random.choices(ua.data_randomize)[0]])[0]
    headers = {
        'User-Agent': random_ua,
        'referer': "https://{}.anjuke.com/sale/p1/".format(city),
        'cookie': ''
    }
    agent_ip = get_proxy()
    # route both http and https requests through the proxy
    proxy = {"http": "http://{}".format(agent_ip), "https": "http://{}".format(agent_ip)}
    response = requests.get(url, headers=headers, proxies=proxy)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup
# Verify whether the visit succeeded
def test_ping(soup):
    head = soup.find('head')
    title = head.find('title').string
    # the check string is the title of Anjuke's human-verification page (Chinese in the original)
    if 'Access validation - Anju' in title:
        print('Visit failed')
        return False
    else:
        print('Visit succeeded')
        print(title)
        return True

The initialization part fakes a random browser User-Agent header. The cookie in the code you have to obtain yourself; for lack of time I did not write code to fetch it automatically. Although these counter-measures against the site's anti-crawling are fairly complete, they still fall short of working every time, so after fetching a URL you need to verify whether the visit actually succeeded.
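Since a single request can still land on the verification page, a small retry wrapper around _init and test_ping can keep drawing a fresh proxy and User-Agent until one request gets through. This helper is a sketch I added for illustration; the retry count and back-off interval are arbitrary choices.

```python
def fetch_page(url, city, max_retries=3):
    # Hypothetical helper: retry with a new random proxy/UA until test_ping() passes.
    for attempt in range(max_retries):
        soup = _init(url, city)
        if test_ping(soup):
            return soup
        time.sleep(random.randint(5, 10))  # back off before trying another proxy
    return None  # the caller decides what to do if every attempt hit the verification page
```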
# Get all the cities
def get_city_info():
    url = 'https://nc.anjuke.com/sale/'
    web = _init(url, 'nc')
    _ul = web.find_all('ul', {'class': 'city-list'})
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=1, decode_responses=True)
    li_end = 0
    pattern = re.compile('/[a-z]+')   # extract the city abbreviation from the link
    for ul in _ul:
        _a = ul.find_all('a')
        for i in range(len(_a)):
            a = _a[i]
            city_info = re.search(pattern, a.attrs['href']).group().replace('/', '')
            r.hset('city_info', '{}'.format(li_end + i), '{}'.format(city_info))
        li_end += len(_a)

First, all the cities listed on the xxx page are fetched and saved into the Redis database. There are many tutorials online on working with Redis, and the usage here is very simple, so I will not go into it.
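To quickly confirm what was written, the mapping can be read straight back from Redis; a short check (the field numbers and abbreviations in the comment are only example values):

```python
r = redis.Redis(host='127.0.0.1', port=6379, db=1, decode_responses=True)
city_map = r.hgetall('city_info')            # e.g. {'0': 'bj', '1': 'sh', ...} (illustrative values)
print(len(city_map), 'cities saved')
for field, city in list(city_map.items())[:5]:
    print(field, city)                       # spot-check the first few entries
```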
# Save the listing URLs to be crawled into the Redis database
def get_redis(update=False, target_city=None, page=1):
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=1, decode_responses=True)
    check_point = int(r.hget('checkpoint', '0'))   # int() instead of eval() to read the saved index safely
    cities = list(r.hgetall('city_info').values())
    for city in cities[check_point:]:
        if update is False and target_city is None:
            time.sleep(random.randint(5, 10))
            url = 'https://{0}.anjuke.com/sale/p{1}/'.format(city, page)
            web = _init(url, city)
            flag = test_ping(web)
            if flag is False:
                break
            else:
                _a = web.find_all('a', {'class': 'property-ex'})
                for i in range(len(_a)):
                    a = _a[i]
                    href = a.attrs['href']
                    r.hset('my_city:{0}'.format(city), '{}'.format(i), '{}'.format(href))
                r.hset('checkpoint', '0', '{}'.format(cities.index(city)))
        elif update is True and target_city is not None:
            city = target_city
            time.sleep(random.randint(5, 10))
            url = 'https://{0}.anjuke.com/sale/p{1}/'.format(city, page)
            web = _init(url, city)
            flag = test_ping(web)
            if flag is False:
                break
            else:
                _a = web.find_all('a', {'class': 'property-ex'})
                for i in range(len(_a)):
                    a = _a[i]
                    href = a.attrs['href']
                    r.hset('my_city:{0}'.format(city), '{}'.format(i), '{}'.format(href))

After getting all the city information, each city is visited in turn to crawl the second-hand-house listing URLs on the current page and save them into the Redis database, so they can be pulled up at any time. In the access code I use three parameters to control whether to update, which target city to update, and which page to update. Because the data on the xxx site changes constantly, saved pages are likely to become invalid if they are not visited for a while and then have to be re-crawled; passing the target_city parameter makes such a directed update possible.
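As a usage sketch, a full first pass and a directed refresh of a single city would be called like this (the city abbreviation 'nc' and the page number are only examples):

```python
# First pass: walk every city from the saved checkpoint onward, page 1 of each listing.
get_redis()

# Directed update: re-crawl only one city, e.g. page 2 of 'nc'
# (any abbreviation stored in city_info works; 'nc' is just an example).
get_redis(update=True, target_city='nc', page=2)
```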
# Checkpoint
def checkpoint(code='0000'):
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=1, decode_responses=True)
    if code != '0000':
        r.hset('checkpoint', '0', code)
    try:
        current_point = r.hget('checkpoint', '0')
        return [int(current_point[0:2]), int(current_point[2:])]
    except TypeError:
        # hget returned None: the checkpoint was never written, so initialize it first
        r.hset('checkpoint', '0', code)
        current_point = r.hget('checkpoint', '0')
        return [int(current_point[0:2]), int(current_point[2:])]
# Get the detailed data for each house
def get_house_data():
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=1, decode_responses=True)
    getPoint = checkpoint()
    cityCode, listCode = getPoint[0], getPoint[1]
    pattern = re.compile(r'[\d+]*(\.{1})?\d+')  # match a positive real number
    # Loop over the saved listing URLs
    while cityCode <= 65:
        city = r.hget('city_info', '{}'.format(cityCode))
        url = r.hget('my_city:{}'.format(city), '{}'.format(listCode))
        web = _init(url, city)
        time.sleep(random.randint(5, 10))
        flag = test_ping(web)
        if flag is False:
            break
        try:
            houseAvgPrice = re.search(pattern, web.find('div', {'class': 'maininfo-avgprice-price'}).string).group()
            item_01 = web.find('div', {'class': 'maininfo-model-item maininfo-model-item-1'})
            item_02 = web.find('div', {'class': 'maininfo-model-item maininfo-model-item-2'})
            item_03 = web.find('div', {'class': 'maininfo-model-item maininfo-model-item-3'})
            houseModelNum = item_01.find_all('i', {'class': 'maininfo-model-strong-num'})
            houseBedRoomNum = houseModelNum[0].string
            houseSaloonNum = houseModelNum[1].string
            try:
                houseBathroomNum = houseModelNum[2].string
            except IndexError:
                houseBathroomNum = '0'
            houseModelPos = item_01.find('div', {'class': 'maininfo-model-weak'}).string
            houseTotalArea = item_02.find('i', {'class': 'maininfo-model-strong-num'}).string
            houseFitmentLv = item_02.find('div', {'class': 'maininfo-model-weak'}).string
            houseToward = item_03.find('i', {'class': 'maininfo-model-strong-text'}).string
            houseAge = re.search(pattern, item_03.find('div', {'class': 'maininfo-model-weak'}).string).group()
            houseName = web.find('div', {'class': 'crumbs crumbs-middle'}).find_all('a')[-1].string
            span = web.find_all('span', {'class': 'houseInfo-main-item-name'})[1]
            houseProperty = span.string
            communityAvgPrice = re.search(pattern, web.find('span', {'class': 'monthchange-money'}).string).group()
            tr = web.find_all('div', {'class': 'community-info-tr'})
            communityPopulation = re.search(pattern,
                                            tr[0].find('p', {'class': 'community-info-td-value'}).string).group()
            div = tr[1].find('div', {'class': 'community-info-td'})
            communityPropertyPrice = re.search(pattern,
                                               div.find('p', {'class': 'community-info-td-value'}).string).group()
            div2 = tr[1].find('div', {'class': 'community-info-td community-info-right'})
            communityHouse = div2.find_all('p')[-1].string
            communityGreening = re.search(pattern, tr[2].find('p', {'class': 'community-info-td-value'}).string).group()
            # Write the record into Excel
            excelName = 'Anjuke data collection form.xlsx'
            Data = pd.read_excel(excelName, sheet_name='Sheet2')
            row = str(len(Data.index) + 1)
            Data.at[row, 'House name'] = '{}:'.format(city) + houseName
            Data.at[row, 'Community historical average price'] = communityAvgPrice
            Data.at[row, 'Bedrooms'] = houseBedRoomNum
            Data.at[row, 'Living rooms'] = houseSaloonNum
            Data.at[row, 'Bathrooms'] = houseBathroomNum
            Data.at[row, 'Area'] = houseTotalArea
            Data.at[row, 'Floor'] = houseModelPos
            Data.at[row, 'Decoration'] = houseFitmentLv
            Data.at[row, 'Age'] = houseAge
            Data.at[row, 'Property type'] = houseProperty
            Data.at[row, 'Orientation'] = houseToward
            Data.at[row, 'Property fee'] = communityPropertyPrice
            Data.at[row, 'Households in community'] = communityPopulation
            Data.at[row, 'Greening rate'] = communityGreening
            Data.at[row, 'Plot ratio'] = communityHouse
            Data.at[row, 'Unit price'] = houseAvgPrice
            Data.to_excel(excelName, index=False, sheet_name='Sheet2')
        except Exception as e:
            print(e)
        # Record the current position
        listCode += 1
        if listCode == 60:
            cityCode += 1
            listCode = 0
        code = '{:02d}{:02d}'.format(cityCode, listCode)
        print(code)
        checkpoint(code)

This part is the core of the code. First a checkpoint is created to record the point reached in the crawl, so the program can be stopped and restarted at any time without worrying about crawling the same pages twice. I use a four-digit number as the checkpoint: the first two digits are the city code and the last two are the index of the corresponding URL. Inside the core loop a regular expression matches the numeric values that are needed, and the pandas library saves the data to Excel, appending a new row below the last existing row each time. To avoid problems with missing data, try/except is used to skip listings with missing values.
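To make the four-digit format concrete, here is a tiny round-trip sketch matching what the loop above writes and what checkpoint() reads back (a zero-padded format with width 2 is just a compact way to build the same string):

```python
def encode_checkpoint(city_code, list_code):
    # Two digits for the city index, two for the listing index, e.g. (3, 7) -> '0307'.
    return '{:02d}{:02d}'.format(city_code, list_code)

def decode_checkpoint(code):
    # Mirrors the slicing done in checkpoint(): first two digits, then the rest.
    return int(code[:2]), int(code[2:])

assert encode_checkpoint(3, 7) == '0307'
assert decode_checkpoint('0307') == (3, 7)
```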
# Clear the database
def clear_data():
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=1, decode_responses=True)
    cities = list(r.hgetall('city_info').values())
    for city in cities:
        maxLen = len(list(r.hgetall('my_city:{}'.format(city)).keys()))
        for i in range(maxLen):
            r.hdel('my_city:{}'.format(city), '{}'.format(i))
        print('database my_city:{} cleared'.format(city))
    print('The database has been emptied!')

Finally, after all the saved URLs have been crawled, the entries in the current database are cleared so that new pages can be crawled and recorded again.
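A small design note: since each city's URLs live in a single hash key, the same cleanup can be done without looping over every field by deleting the key itself. A minimal alternative sketch:

```python
def clear_data_simple():
    # Same effect as clear_data(): drop each my_city:<city> hash with one DEL per city.
    r = redis.Redis(host='127.0.0.1', password=None, port=6379, db=1, decode_responses=True)
    for city in r.hgetall('city_info').values():
        r.delete('my_city:{}'.format(city))
        print('database my_city:{} cleared'.format(city))
    print('All listing-URL hashes have been emptied!')
```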