Do a crawler project by yourself
2022-07-27 08:31:00 【Sizchristov】
This article uses a Python crawler to collect part of the information published on the xxx website and save it into an Excel worksheet. The project mainly exercises Python crawling, database access, and Excel file operations. The first code snippet follows below. I am a complete programming beginner, and I simply want to share my approach here:
from bs4 import BeautifulSoup
import requests, random, fake_useragent
import redis, re, time
import pandas as pd

# Pick a random IP from the proxy pool
def get_proxy():
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=0, decode_responses=True)
    total = int(r.zcard('proxies:universal'))      # current size of the proxy pool
    start = 0 if total < 100 else total - 100      # only consider the 100 highest-scored proxies
    ip_list = r.zrange('proxies:universal', start, total, withscores=True)
    random_ip = random.choices(ip_list)[0][0]
    return random_ip
Because xxx has strong anti-crawling measures, a self-built IP proxy pool is used here to randomize the requesting IP address, and only the top 100 highest-ranked addresses are drawn from. For building the pool I mainly followed this article: build your own IP proxy pool. There are of course other online write-ups on self-built proxy pools, and readers who know good ones are welcome to share them. For those who do not want to pay for proxy IPs, a self-built pool is a good choice; paid high-anonymity proxies with good connectivity are, of course, harder to detect.
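For completeness, here is a minimal sketch of the write side that get_proxy expects: a maintenance script that scores proxies in the same proxies:universal sorted set. The key name and Redis connection mirror the code above, but the scoring policy (start at 10, reward on success, drop at zero) and the probe URL are my own assumptions for illustration, not taken from the referenced proxy-pool article.

```python
import redis, requests

r = redis.Redis(host='127.0.0.1', port=6379, db=0, decode_responses=True)

def add_proxy(proxy):
    # Add a proxy such as '1.2.3.4:8080' with a neutral starting score (assumed policy).
    if r.zscore('proxies:universal', proxy) is None:
        r.zadd('proxies:universal', {proxy: 10})

def check_proxy(proxy, timeout=5):
    # Probe the proxy and adjust its score; the probe endpoint is just an example.
    try:
        requests.get('https://httpbin.org/ip',
                     proxies={'http': 'http://' + proxy, 'https': 'http://' + proxy},
                     timeout=timeout)
        r.zincrby('proxies:universal', 1, proxy)   # reward proxies that still work
    except requests.RequestException:
        score = r.zincrby('proxies:universal', -1, proxy)
        if score <= 0:
            r.zrem('proxies:universal', proxy)     # drop proxies that keep failing
```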
# Initialize the crawler
def _init(url, city):
    ua = fake_useragent.UserAgent()
    browsers = ua.data_browsers
    # pick a random browser family, then a random UA string from it
    random_ua = random.choices(browsers[random.choices(ua.data_randomize)[0]])[0]
    headers = {
        'User-Agent': random_ua,
        'referer': "https://{}.anjuke.com/sale/p1/".format(city),
        'cookie': ''
    }
    agent_ip = get_proxy()
    # route both http and https requests through the proxy
    proxy = {"http": "http://{}".format(agent_ip), "https": "http://{}".format(agent_ip)}
    response = requests.get(url, headers=headers, proxies=proxy)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup
# Verify whether the visit succeeded
def test_ping(soup):
    head = soup.find('head')
    title = head.find('title').string
    # the check string is the title of Anjuke's human-verification page (Chinese in the original)
    if 'Access validation - Anju' in title:
        print('Visit failed')
        return False
    else:
        print('Visit succeeded')
        print(title)
        return True

The initialization part fakes a random browser User-Agent header. The cookie in the code you have to obtain yourself; for lack of time I did not write code to fetch it automatically. Although these counter-measures against the site's anti-crawling are fairly complete, they still fall short of working every time, so after fetching a URL you need to verify whether the visit actually succeeded.
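Since a single request can still land on the verification page, a small retry wrapper around _init and test_ping can keep drawing a fresh proxy and User-Agent until one request gets through. This helper is a sketch I added for illustration; the retry count and back-off interval are arbitrary choices.

```python
def fetch_page(url, city, max_retries=3):
    # Hypothetical helper: retry with a new random proxy/UA until test_ping() passes.
    for attempt in range(max_retries):
        soup = _init(url, city)
        if test_ping(soup):
            return soup
        time.sleep(random.randint(5, 10))  # back off before trying another proxy
    return None  # the caller decides what to do if every attempt hit the verification page
```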
# Get all the cities
def get_city_info():
    url = 'https://nc.anjuke.com/sale/'
    web = _init(url, 'nc')
    _ul = web.find_all('ul', {'class': 'city-list'})
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=1, decode_responses=True)
    li_end = 0
    pattern = re.compile('/[a-z]+')   # extract the city abbreviation from the link
    for ul in _ul:
        _a = ul.find_all('a')
        for i in range(len(_a)):
            a = _a[i]
            city_info = re.search(pattern, a.attrs['href']).group().replace('/', '')
            r.hset('city_info', '{}'.format(li_end + i), '{}'.format(city_info))
        li_end += len(_a)

First, all the cities listed on the xxx page are fetched and saved into the Redis database. There are many tutorials online on working with Redis, and the usage here is very simple, so I will not go into it.
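To quickly confirm what was written, the mapping can be read straight back from Redis; a short check (the field numbers and abbreviations in the comment are only example values):

```python
r = redis.Redis(host='127.0.0.1', port=6379, db=1, decode_responses=True)
city_map = r.hgetall('city_info')            # e.g. {'0': 'bj', '1': 'sh', ...} (illustrative values)
print(len(city_map), 'cities saved')
for field, city in list(city_map.items())[:5]:
    print(field, city)                       # spot-check the first few entries
```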
# Save the listing URLs to be crawled into the Redis database
def get_redis(update=False, target_city=None, page=1):
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=1, decode_responses=True)
    check_point = int(r.hget('checkpoint', '0'))   # int() instead of eval() to read the saved index safely
    cities = list(r.hgetall('city_info').values())
    for city in cities[check_point:]:
        if update is False and target_city is None:
            time.sleep(random.randint(5, 10))
            url = 'https://{0}.anjuke.com/sale/p{1}/'.format(city, page)
            web = _init(url, city)
            flag = test_ping(web)
            if flag is False:
                break
            else:
                _a = web.find_all('a', {'class': 'property-ex'})
                for i in range(len(_a)):
                    a = _a[i]
                    href = a.attrs['href']
                    r.hset('my_city:{0}'.format(city), '{}'.format(i), '{}'.format(href))
                r.hset('checkpoint', '0', '{}'.format(cities.index(city)))
        elif update is True and target_city is not None:
            city = target_city
            time.sleep(random.randint(5, 10))
            url = 'https://{0}.anjuke.com/sale/p{1}/'.format(city, page)
            web = _init(url, city)
            flag = test_ping(web)
            if flag is False:
                break
            else:
                _a = web.find_all('a', {'class': 'property-ex'})
                for i in range(len(_a)):
                    a = _a[i]
                    href = a.attrs['href']
                    r.hset('my_city:{0}'.format(city), '{}'.format(i), '{}'.format(href))

After getting all the city information, each city is visited in turn to crawl the second-hand-house listing URLs on the current page and save them into the Redis database, so they can be pulled up at any time. In the access code I use three parameters to control whether to update, which target city to update, and which page to update. Because the data on the xxx site changes constantly, saved pages are likely to become invalid if they are not visited for a while and then have to be re-crawled; passing the target_city parameter makes such a directed update possible.
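As a usage sketch, a full first pass and a directed refresh of a single city would be called like this (the city abbreviation 'nc' and the page number are only examples):

```python
# First pass: walk every city from the saved checkpoint onward, page 1 of each listing.
get_redis()

# Directed update: re-crawl only one city, e.g. page 2 of 'nc'
# (any abbreviation stored in city_info works; 'nc' is just an example).
get_redis(update=True, target_city='nc', page=2)
```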
# Checkpoint
def checkpoint(code='0000'):
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=1, decode_responses=True)
    if code != '0000':
        r.hset('checkpoint', '0', code)
    try:
        current_point = r.hget('checkpoint', '0')
        return [int(current_point[0:2]), int(current_point[2:])]
    except TypeError:
        # hget returned None: the checkpoint was never written, so initialize it first
        r.hset('checkpoint', '0', code)
        current_point = r.hget('checkpoint', '0')
        return [int(current_point[0:2]), int(current_point[2:])]
# Get the detailed data for each house
def get_house_data():
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=1, decode_responses=True)
    getPoint = checkpoint()
    cityCode, listCode = getPoint[0], getPoint[1]
    pattern = re.compile(r'[\d+]*(\.{1})?\d+')  # match a positive real number
    # Loop over the saved listing URLs
    while cityCode <= 65:
        city = r.hget('city_info', '{}'.format(cityCode))
        url = r.hget('my_city:{}'.format(city), '{}'.format(listCode))
        web = _init(url, city)
        time.sleep(random.randint(5, 10))
        flag = test_ping(web)
        if flag is False:
            break
        try:
            houseAvgPrice = re.search(pattern, web.find('div', {'class': 'maininfo-avgprice-price'}).string).group()
            item_01 = web.find('div', {'class': 'maininfo-model-item maininfo-model-item-1'})
            item_02 = web.find('div', {'class': 'maininfo-model-item maininfo-model-item-2'})
            item_03 = web.find('div', {'class': 'maininfo-model-item maininfo-model-item-3'})
            houseModelNum = item_01.find_all('i', {'class': 'maininfo-model-strong-num'})
            houseBedRoomNum = houseModelNum[0].string
            houseSaloonNum = houseModelNum[1].string
            try:
                houseBathroomNum = houseModelNum[2].string
            except IndexError:
                houseBathroomNum = '0'
            houseModelPos = item_01.find('div', {'class': 'maininfo-model-weak'}).string
            houseTotalArea = item_02.find('i', {'class': 'maininfo-model-strong-num'}).string
            houseFitmentLv = item_02.find('div', {'class': 'maininfo-model-weak'}).string
            houseToward = item_03.find('i', {'class': 'maininfo-model-strong-text'}).string
            houseAge = re.search(pattern, item_03.find('div', {'class': 'maininfo-model-weak'}).string).group()
            houseName = web.find('div', {'class': 'crumbs crumbs-middle'}).find_all('a')[-1].string
            span = web.find_all('span', {'class': 'houseInfo-main-item-name'})[1]
            houseProperty = span.string
            communityAvgPrice = re.search(pattern, web.find('span', {'class': 'monthchange-money'}).string).group()
            tr = web.find_all('div', {'class': 'community-info-tr'})
            communityPopulation = re.search(pattern,
                                            tr[0].find('p', {'class': 'community-info-td-value'}).string).group()
            div = tr[1].find('div', {'class': 'community-info-td'})
            communityPropertyPrice = re.search(pattern,
                                               div.find('p', {'class': 'community-info-td-value'}).string).group()
            div2 = tr[1].find('div', {'class': 'community-info-td community-info-right'})
            communityHouse = div2.find_all('p')[-1].string
            communityGreening = re.search(pattern, tr[2].find('p', {'class': 'community-info-td-value'}).string).group()
            # Write the record into Excel
            excelName = 'Anjuke data collection form.xlsx'
            Data = pd.read_excel(excelName, sheet_name='Sheet2')
            row = str(len(Data.index) + 1)
            Data.at[row, 'House name'] = '{}:'.format(city) + houseName
            Data.at[row, 'Community historical average price'] = communityAvgPrice
            Data.at[row, 'Bedrooms'] = houseBedRoomNum
            Data.at[row, 'Living rooms'] = houseSaloonNum
            Data.at[row, 'Bathrooms'] = houseBathroomNum
            Data.at[row, 'Area'] = houseTotalArea
            Data.at[row, 'Floor'] = houseModelPos
            Data.at[row, 'Decoration'] = houseFitmentLv
            Data.at[row, 'Age'] = houseAge
            Data.at[row, 'Property type'] = houseProperty
            Data.at[row, 'Orientation'] = houseToward
            Data.at[row, 'Property fee'] = communityPropertyPrice
            Data.at[row, 'Households in community'] = communityPopulation
            Data.at[row, 'Greening rate'] = communityGreening
            Data.at[row, 'Plot ratio'] = communityHouse
            Data.at[row, 'Unit price'] = houseAvgPrice
            Data.to_excel(excelName, index=False, sheet_name='Sheet2')
        except Exception as e:
            print(e)
        # Record the current position
        listCode += 1
        if listCode == 60:
            cityCode += 1
            listCode = 0
        code = '{:02d}{:02d}'.format(cityCode, listCode)
        print(code)
        checkpoint(code)

This part is the core of the code. First a checkpoint is created to record the point reached in the crawl, so the program can be stopped and restarted at any time without worrying about crawling the same pages twice. I use a four-digit number as the checkpoint: the first two digits are the city code and the last two are the index of the corresponding URL. Inside the core loop a regular expression matches the numeric values that are needed, and the pandas library saves the data to Excel, appending a new row below the last existing row each time. To avoid problems with missing data, try/except is used to skip listings with missing values.
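To make the four-digit format concrete, here is a tiny round-trip sketch matching what the loop above writes and what checkpoint() reads back (a zero-padded format with width 2 is just a compact way to build the same string):

```python
def encode_checkpoint(city_code, list_code):
    # Two digits for the city index, two for the listing index, e.g. (3, 7) -> '0307'.
    return '{:02d}{:02d}'.format(city_code, list_code)

def decode_checkpoint(code):
    # Mirrors the slicing done in checkpoint(): first two digits, then the rest.
    return int(code[:2]), int(code[2:])

assert encode_checkpoint(3, 7) == '0307'
assert decode_checkpoint('0307') == (3, 7)
```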
# Clear the database
def clear_data():
    ip = '127.0.0.1'
    password = None
    r = redis.Redis(host=ip, password=password, port=6379, db=1, decode_responses=True)
    cities = list(r.hgetall('city_info').values())
    for city in cities:
        maxLen = len(list(r.hgetall('my_city:{}'.format(city)).keys()))
        for i in range(maxLen):
            r.hdel('my_city:{}'.format(city), '{}'.format(i))
        print('database my_city:{} cleared'.format(city))
    print('The database has been emptied!')

Finally, after all the saved URLs have been crawled, the entries in the current database are cleared so that new pages can be crawled and recorded again.
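A small design note: since each city's URLs live in a single hash key, the same cleanup can be done without looping over every field by deleting the key itself. A minimal alternative sketch:

```python
def clear_data_simple():
    # Same effect as clear_data(): drop each my_city:<city> hash with one DEL per city.
    r = redis.Redis(host='127.0.0.1', password=None, port=6379, db=1, decode_responses=True)
    for city in r.hgetall('city_info').values():
        r.delete('my_city:{}'.format(city))
        print('database my_city:{} cleared'.format(city))
    print('All listing-URL hashes have been emptied!')
```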