Crawling the Zhejiang Industry and Trade news pages
2022-07-04 10:26:00 【weixin_ forty-six million three hundred and sixty-four thousand】
import re
from urllib.parse import urljoin

import chardet
import requests
from lxml import etree

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}


def fetch(url):
    # Download a page and decode it with the encoding chardet detects.
    rep = requests.get(url, headers=HEADERS, timeout=10)
    rep.encoding = chardet.detect(rep.content)["encoding"]
    return rep


def strip_label(field):
    # Return the value after the first colon (ASCII or full-width).
    # str.lstrip(label) removes a set of characters, not a prefix, so it can
    # also eat the first characters of the value itself.
    return re.split(r"[::]", field, maxsplit=1)[-1].strip()


def jiexi(rep):
    # Parse one article page into a dict: title, author, source, time, body.
    et = etree.HTML(rep.text)
    biaoti = et.xpath("//h2/text()")[0]
    # The first "zz" div holds author and source, separated by whitespace.
    zz1 = et.xpath("//div[@class='zz'][1]/text()")[0].split()
    zuozhe = strip_label(zz1[0])
    laiyuan = strip_label(zz1[1])
    # The second "zz" div holds the release time after two non-breaking spaces.
    zz2 = et.xpath("//div[@class='zz'][2]/text()")[0].split("\xa0\xa0")
    shijian = strip_label(zz2[1])
    # Join all body text nodes, then drop every run of whitespace.
    zw = "".join(et.xpath("//div[@class='nr-content-con fl']/div[1]//text()"))
    zhengwen = "".join(zw.split())
    return {
        "biaoti": biaoti,
        "zuozhe": zuozhe,
        "laiyuan": laiyuan,
        "shijian": shijian,
        "zhengwen": zhengwen,
    }


# Listing pages: the newest is xyxw.htm, older ones count down from 358.htm.
url_list = ["http://www.zjitc.net/xwzx/xyxw.htm"]
for i in range(1, 10):
    url_list.append("http://www.zjitc.net/xwzx/xyxw/" + str(359 - i) + ".htm")

for url in url_list:
    et = etree.HTML(fetch(url).text)
    # Article links are relative ("../" or "../../"), so resolve each one
    # against the listing page it was found on.
    ul_list = []
    for li in et.xpath("//div[@class='right-1']/ul/li"):
        href = li.xpath("./a/@href")[0]
        ul_list.append(urljoin(url, href))
    # Fetch and parse every article linked from this listing page.
    wenzhang = []
    for ul in ul_list:
        wenzhang.append(jiexi(fetch(ul)))
    print(wenzhang)
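
Two string-handling steps in the script are easy to get wrong: turning the relative `../` links from a listing page into absolute URLs, and removing the label in front of each metadata field. The sketch below uses only the standard library and made-up sample values (the href and the field string are hypothetical, not taken from the live site). It shows `urljoin` resolving a relative link, and why `str.lstrip` is a risky way to drop a prefix. `strip_label` is an illustrative helper name.

```python
import re
from urllib.parse import urljoin

# Hypothetical sample values for illustration; not copied from the live site.
listing_url = "http://www.zjitc.net/xwzx/xyxw/358.htm"
href = "../../info/1011/12345.htm"

# urljoin resolves "../" segments against the page the link was found on.
article_url = urljoin(listing_url, href)
print(article_url)


def strip_label(field):
    """Return the text after the first ASCII or full-width colon."""
    return re.split(r"[::]", field, maxsplit=1)[-1].strip()


field = "author:arthur"
# lstrip treats its argument as a set of characters, so it keeps stripping
# while the next character is any of a, u, t, h, o, r or ":".
print(repr(field.lstrip("author:")))
# Splitting on the first colon keeps the value intact.
print(strip_label(field))
```

Here `lstrip` consumes the whole string, because every letter of the value also occurs in the label, while splitting on the first colon returns the value unchanged.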