「Experience」Web Crawlers in Real-World Work『Implementation Part』
2022-06-30 17:46:00 【小火龙说数据】
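Estimated reading time: 10 min
Reading advice: this installment is a code walkthrough. Bookmark it and work through it in your spare time.
Pain point addressed: many readers have questions about web crawlers. 小火龙 wants to explain the basic principles of a crawler in plain language and show how to implement one with a short piece of code, so you can get started quickly. The article is aimed at crawler beginners.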
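00
Preface
In the previous article, 小火龙 covered the basic principles of web crawlers; see the『Theory Part』to catch up. In this one, 小火龙 walks you through writing a simple crawler step by step, with the code included, so you can try it yourself.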
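01
Crawler background
The five-year LPR for mortgages was cut recently, which lowers the borrowing cost of buying a home. That made me curious about how prices in different parts of Beijing compare between now (May 2022) and before the pandemic (May 2019). The data comes from 「58同城」.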
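02
Crawler code
The previous article,『Theory Part』, covered the general workflow of a crawler, so I will not repeat it here and will go straight to the code.
A quick overview of the code structure: there are four input files and two output files. (Everything could be written in a single file, but it is split into several to keep responsibilities decoupled. Focus on the "core crawler file".)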
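Before running anything, it helps to have the folders in place. The sketch below is an assumption based on the paths that appear in the code (conf/conf.ini, lib/, spider/, log/log.txt, ./data/); the helper name `ensure_project_dirs` and the exact layout are illustrative, not part of the original project. It matters because `logging.FileHandler` and `open(..., "w")` both fail if the target directory does not exist.

```python
import os

# Assumed layout, inferred from the paths used in the code; adjust if yours differs.
REQUIRED_DIRS = ["conf", "lib", "spider", "log", "data"]


def ensure_project_dirs(base_dir):
    """Create the folders the crawler reads from and writes to, if missing.

    Returns the names of the folders that had to be created.
    """
    created = []
    for name in REQUIRED_DIRS:
        path = os.path.join(base_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(name)
    return created
```

Calling `ensure_project_dirs(".")` from the folder that holds run.py should be enough, since the data and log paths in the code are relative to the working directory.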
【Input】
Main entry file (run.py): the entry point that launches the core program.
# encoding:utf-8
from configparser import ConfigParser, ExtendedInterpolation
import logging
import sys
import os

FILE_PATH = os.path.dirname(os.path.realpath(__file__))
CONF_PATH = FILE_PATH + '/conf/conf.ini'

# Make the modules under lib/ and spider/ importable
sys.path.append(FILE_PATH + '/lib')
sys.path.append(FILE_PATH + '/spider')
from InfoSpider import InfoSpider


# Set up the root logger to write to log/log.txt (the log/ directory must already exist)
def init_logger():
    logger = logging.getLogger()
    logger.setLevel(level=logging.INFO)
    handler = logging.FileHandler("log/log.txt")
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger


# Load the configuration file
def init_conf(config_path):
    cf = ConfigParser(interpolation=ExtendedInterpolation())
    cf.read(config_path, encoding='utf-8')
    return cf


if __name__ == '__main__':
    logger = init_logger()
    conf = init_conf(CONF_PATH)
    # Crawl the 58 data
    info_spider = InfoSpider(conf)
    info_spider.getHouseInfo()

Core crawler file (InfoSpider.py): implements the crawler's core functionality.
# encoding:utf-8
import logging
import sys
import os
import urllib.request, urllib.parse
import re
import time
import random

FILE_PATH = os.path.dirname(os.path.realpath(__file__))
FILE_PATH_MAIN = os.path.join(FILE_PATH, "../")

# Make the modules under lib/ importable before importing from it
sys.path.append(FILE_PATH_MAIN + '/lib')
from GeneralObj import GeneralObj

logger = logging.getLogger(__name__)
obj = GeneralObj()


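class InfoSpider(object):
    def __init__(self, conf):
        self.conf = conf
        # Load the User-Agent and IP pools from conf.ini.
        # User-Agents are separated by a literal \001; IPs by commas.
        # Note: ip_list is loaded here but not used anywhere else in this file.
        ua_conf = self.conf.get('spider_58', 'ua')
        self.ua_list = ua_conf.strip().split('\\001')
        ip_conf = self.conf.get('spider_58', 'ip')
        self.ip_list = ip_conf.strip().split(',')
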
    def getHouseInfo(self):
        begin_time = time.time()
        # Step 1: collect the URL of each district
        self.getHouseInfo_area()
        # Step 2: collect the URL of each business area within the districts
        self.getHouseInfo_page()
        # Step 3: fetch the price data for each business area
        self.getHouseInfo_detail()
        end_time = time.time()
        cost_time = (end_time - begin_time) // 60
        logger.info('cost time(min) : %i' % cost_time)
        print('cost time(min) : %i' % cost_time)

    def getHouseInfo_area(self):
        logger.info('begin1 getHouseInfo_area url')
        print('begin1 getHouseInfo_area url')
        main_url = 'https://www.58.com/fangjiawang/shi-2022-100/'
        # Send the request with a randomly chosen User-Agent
        headers = {"User-Agent": random.choice(self.ua_list)}
        req = urllib.request.Request(main_url, headers=headers)
        data = urllib.request.urlopen(req).read()
        data = data.decode('utf-8')
        # Extract the district list, then every link in it except the first entry
        data_filter = re.findall('<ul class="sel-sec" data-v-4571decc>(.*?)</ul>', data)
        data_list = re.findall('<a href="(.*?)"', data_filter[0])[1:]
        obj.recordsToFile(data_list, './data/getHouseInfo_area.txt', type=1, delimiter="\t", mode="w")
        logger.info('finish1 getHouseInfo_area url')
        print('finish1 getHouseInfo_area url')

    def getHouseInfo_page(self):
        logger.info('begin2 getHouseInfo_page url')
        print('begin2 getHouseInfo_page url')
        area_url_list = obj.fileToRecords('./data/getHouseInfo_area.txt', 1)
        area_url_num = len(area_url_list)
        area_url_counts = 1
        for line in area_url_list:
            headers = {"User-Agent": random.choice(self.ua_list)}
            req = urllib.request.Request(line, headers=headers)
            data = urllib.request.urlopen(req).read()
            data = data.decode('utf-8')
            # Extract the business-area list, then every link in it except the first entry
            data_filter = re.findall('<ul class="sel-thi" data-v-4571decc>(.*?)</ul>', data)
            data_list = re.findall('<a href="(.*?)"', data_filter[0])[1:]
            # Append mode: delete the file before re-running to avoid duplicate URLs
            obj.recordsToFile(data_list, './data/getHouseInfo_page.txt', type=1, delimiter="\t", mode="a+")
            logger.info('getHouseInfo_page url progress: %i/%i' % (area_url_counts, area_url_num))
            print('getHouseInfo_page url progress: %i/%i' % (area_url_counts, area_url_num))
            # Pause between requests to avoid hammering the site
            time.sleep(2)
            area_url_counts += 1

    def getHouseInfo_detail(self):
        logger.info('begin3 getHouseInfo_detail url')
        print('begin3 getHouseInfo_detail url')
        detail_url_list = obj.fileToRecords('./data/getHouseInfo_page.txt', 1)
        detail_url_num = len(detail_url_list)
        detail_url_counts = 1
        for line in detail_url_list:
            try:
                # The 2019 page has the same URL as the 2022 page with the year swapped
                address2022 = line
                address2019 = line.replace('shi-2022-100', 'shi-2019-100')
                headers = {"User-Agent": random.choice(self.ua_list)}
                req2022 = urllib.request.Request(address2022, headers=headers)
                req2019 = urllib.request.Request(address2019, headers=headers)
                data2022 = urllib.request.urlopen(req2022).read().decode('utf-8')
                data2019 = urllib.request.urlopen(req2019).read().decode('utf-8')
                area_list = []
                # From the 2022 page: district heading, business-area name, May 2022 price
                data_bigarea = re.findall('<div class="m-t mt20" data-v-2a40ba50>2022(.*?)各板块二手房均价</div>', data2022)
                area_list.append(data_bigarea[0])
                data_area = re.findall('2022年(.*?)房价走势图', data2022)
                area_list.append(data_area[0])
                data_price_2022 = re.findall('2022年5月房价</b><span data-v-2a40ba50>(.*?)元/㎡', data2022)
                area_list.append(data_price_2022[0])
                # From the 2019 page: May 2019 price
                data_price_2019 = re.findall('2019年5月房价</b><span data-v-2a40ba50>(.*?)元/㎡', data2019)
                area_list.append(data_price_2019[0])
                obj.recordsToFile(area_list, './data/getHouseInfo_detail.txt', type=4, delimiter="\t", mode="a+")
                logger.info('getHouseInfo_detail progress: %i/%i content:%s' % (detail_url_counts, detail_url_num, area_list))
                print('getHouseInfo_detail progress: %i/%i content:%s' % (detail_url_counts, detail_url_num, area_list))
            except Exception:
                # A request failed or a pattern did not match: record the URL and move on
                logger.info('getHouseInfo_detail progress: %i/%i NO CONTENT:%s' % (detail_url_counts, detail_url_num, line))
                print('getHouseInfo_detail progress: %i/%i NO CONTENT:%s' % (detail_url_counts, detail_url_num, line))
            # Pause between requests whether or not this one succeeded
            time.sleep(2)
            detail_url_counts += 1

Configuration file (conf.ini): holds variables that change often (only part of it is shown here).
[spider_58]
ua = Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)\001Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)\001Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1
ip = 125.112.76.113,110.85.124.116

Custom helper file (GeneralObj.py): holds user-defined helper functions.
# encoding:utf-8
import logging
import sys, os

FILE_PATH = os.path.dirname(os.path.realpath(__file__))
# Add the sibling data/ directory to the module search path
sys.path.append(os.path.join(FILE_PATH, '../data'))


class GeneralObj(object):
    def __init__(self):
        self.logger = logging.getLogger(__name__)

    # Write records to a local file.
    #   type=1: list of strings, one per line
    #   type=2: list of lists, each inner list joined by delimiter
    #   type=3: a single string, written as-is
    #   type=4: one list, joined by delimiter into a single line
    def recordsToFile(self, records, file_name, type=1, delimiter="\t", mode="w"):
        # utf-8 is set explicitly because the records contain Chinese text
        with open(file_name, mode, encoding='utf-8') as f:
            if type == 1:
                for line in records:
                    f.write(line + "\n")
            elif type == 2:
                for line in records:
                    f.write(delimiter.join(line) + "\n")
            elif type == 3:
                f.write(records)
            elif type == 4:
                f.write(delimiter.join(records) + "\n")

    # Read a local file into a list.
    #   type=1: one stripped string per line
    #   type=2: each line split by delimiter into a list
    def fileToRecords(self, file_name, type=1, delimiter="\t", mode="r"):
        lists = []
        with open(file_name, mode, encoding='utf-8') as f:
            for line in f:
                if type == 1:
                    lists.append(line.strip())
                elif type == 2:
                    lists.append(line.strip().split(delimiter))
        return lists

【Output】
Log file (log.txt): records the run log, which makes troubleshooting easier when something goes wrong.
Crawl result file (data.txt): holds the crawl results.
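One gap worth noting in the files above: InfoSpider.py reads an `ip` list from conf.ini but never routes any request through it. If you want to rotate proxies as well as User-Agents, the following is a minimal sketch using only the standard library. The function names and the `port` parameter are assumptions for illustration; conf.ini lists bare IP addresses without ports, so you would need to supply the real port of each proxy yourself.

```python
import random
import urllib.request


def build_request(url, ua_list):
    """Return a Request carrying a randomly chosen User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": random.choice(ua_list)})


def make_proxy_opener(ip_list, port=8080):
    """Return (opener, proxy_url) with traffic routed through a random proxy.

    `port` is an assumption: the ip entries in conf.ini carry no port.
    """
    proxy = "http://%s:%d" % (random.choice(ip_list), port)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler), proxy


# Usage inside a spider method might look like this (not executed here):
#   req = build_request(main_url, self.ua_list)
#   opener, _ = make_proxy_opener(self.ip_list)
#   data = opener.open(req, timeout=10).read().decode('utf-8')
```

Free proxy addresses tend to go stale quickly, so passing a `timeout` and catching failures per request is a sensible companion to this approach.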
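03
Data analysis
Below are the results after crawling: the overall figures for Beijing's six core urban districts, and the five business areas with the largest price increases in each district.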
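The ranking itself is simple arithmetic on the crawl output. As a rough sketch, assuming each line of ./data/getHouseInfo_detail.txt holds four tab-separated fields in the order the crawler writes them (district heading, business area, May 2022 price, May 2019 price), the growth rate is (price2022 - price2019) / price2019. The function names below are illustrative, and the comma stripping in `parse_price` is an assumption about how the site formats numbers.

```python
from collections import defaultdict


def parse_price(text):
    """Convert a scraped price string to a float (thousands separators are an assumption)."""
    return float(text.replace(",", ""))


def top_growth(lines, n=5):
    """Group records by district and return the n business areas with the highest growth."""
    by_district = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed lines
        district, area, p2022, p2019 = fields
        try:
            new, old = parse_price(p2022), parse_price(p2019)
        except ValueError:
            continue  # skip lines whose prices are not numeric
        if old <= 0:
            continue  # avoid division by zero
        by_district[district].append((area, (new - old) / old))
    return {
        district: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for district, rows in by_district.items()
    }
```

To run it on real output, open the result file with `encoding='utf-8'` and pass the file object (or its lines) to `top_growth`.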
That is all for this installment.