Crawling the Zhejiang Industry and Trade news pages
2022-07-04 10:26:00 【weixin_ forty-six million three hundred and sixty-four thousand】
import re
from urllib.parse import urljoin

import chardet
import requests
from lxml import etree

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}


def strip_label(text):
    # Drop a leading "label:" prefix. str.lstrip() removes a set of characters,
    # not a prefix, so cut at the first ASCII or full-width colon instead.
    return re.split(r"[:：]", text, maxsplit=1)[-1].strip()


def jiexi(rep):
    # Parse one article page into a dict of its fields.
    et = etree.HTML(rep.text)
    zz = et.xpath("//div[@class='zz'][1]/text()")[0].split()
    fabu = et.xpath("//div[@class='zz'][2]/text()")[0].split("\xa0\xa0")
    zw = "".join(et.xpath("//div[@class='nr-content-con fl']/div[1]//text()"))
    return {
        "biaoti": et.xpath("//h2/text()")[0],  # title
        "zuozhe": strip_label(zz[0]),  # author
        "laiyuan": strip_label(zz[1]),  # source
        "shijian": strip_label(fabu[1]),  # release time
        "zhengwen": "".join(zw.split()),  # body text with all whitespace removed
    }


# Index pages: the newest one has a fixed name, the older ones count down from 358.
url_list = ["http://www.zjitc.net/xwzx/xyxw.htm"]
url_list += [f"http://www.zjitc.net/xwzx/xyxw/{359 - i}.htm" for i in range(1, 10)]

wenzhang = []
for url in url_list:
    response = requests.get(url, headers=headers)
    response.encoding = chardet.detect(response.content)["encoding"]
    et = etree.HTML(response.text)
    for li in et.xpath("//div[@class='right-1']/ul/li"):
        # Article links are relative ("../..." or "../../..."), so resolve
        # them against the index page they were found on.
        ul = urljoin(url, li.xpath("./a/@href")[0])
        rep = requests.get(ul, headers=headers)
        rep.encoding = chardet.detect(rep.content)["encoding"]
        wenzhang.append(jiexi(rep))
print(wenzhang)
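
The article links on the index pages are relative (they start with `../` or `../../`), and the author, source and time fields carry a leading label. Both are easy to mishandle with `str.lstrip`, which removes a set of characters rather than a literal prefix. The following is a minimal sketch of that difference, not part of the original script. The hrefs and the `author:alice` string are made-up values for illustration, and `urljoin` and `partition` are one possible alternative.

```python
from urllib.parse import urljoin

# Hypothetical hrefs, for illustration only; the real pages may use other paths.
first_page = "http://www.zjitc.net/xwzx/xyxw.htm"
older_page = "http://www.zjitc.net/xwzx/xyxw/358.htm"

# urljoin resolves each relative link against the page it was found on.
link_a = urljoin(first_page, "../info/1001/2002.htm")
link_b = urljoin(older_page, "../../info/1001/2002.htm")
print(link_a)
print(link_b)

# lstrip() treats its argument as a set of characters, not as a prefix:
# every leading 'a', 'u', 't', 'h', 'o', 'r' or ':' is removed.
stripped = "author:alice".lstrip("author:")
print(stripped)

# partition() cuts at the first separator and leaves the value intact.
value = "author:alice".partition(":")[2]
print(value)
```

Both links resolve to the same absolute URL under the site root, while `lstrip` also eats the first letter of the value.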