Crawl Zhejiang industry and trade news page
2022-07-04 10:26:00 【weixin_46364000】
import requests
import chardet
from lxml import etree

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
ul_head = "http://www.zjitc.net/"


def strip_label(text, label):
    """Drop a leading field label and its colon (ASCII or full-width)."""
    text = text.strip()
    if text.startswith(label):
        text = text[len(label):]
    return text.lstrip(":：").strip()


def jiexi(rep):
    """Parse one article page into a dict: title, author, source, time, body."""
    et = etree.HTML(rep.text)
    biaoti = et.xpath("//h2/text()")[0]
    # The first "zz" block holds the author and the source, separated by whitespace.
    # Assumption: the page labels are the Chinese words below; this copy of the
    # post shows them translated as "author", "source" and "Release time".
    zz = et.xpath('//div[@class="zz"][1]/text()')[0].split()
    zuozhe = strip_label(zz[0], "作者")
    laiyuan = strip_label(zz[1], "来源")
    # The second "zz" block holds the release time after two non-breaking spaces.
    shijian = et.xpath('//div[@class="zz"][2]/text()')[0].split("\xa0\xa0")[1]
    shijian = strip_label(shijian, "发布时间")
    # Join every text node of the article body and remove all whitespace.
    nodes = et.xpath('//div[@class="nr-content-con fl"]/div[1]//text()')
    zhengwen = "".join("".join(nodes).split())
    return {
        "biaoti": biaoti,
        "zuozhe": zuozhe,
        "laiyuan": laiyuan,
        "shijian": shijian,
        "zhengwen": zhengwen,
    }


# The first listing page has its own URL; older pages count down from 358.
url_list = ["http://www.zjitc.net/xwzx/xyxw.htm"]
for i in range(1, 10):
    url_list.append("http://www.zjitc.net/xwzx/xyxw/" + str(359 - i) + ".htm")

for url in url_list:
    response = requests.get(url, headers=headers)
    response.encoding = chardet.detect(response.content)["encoding"]
    et = etree.HTML(response.text)
    ul_list = []
    for li in et.xpath('//div[@class="right-1"]/ul/li'):
        href = li.xpath("./a/@href")[0]
        # Article links are relative ("../" or "../../"). lstrip removes any
        # leading run of "." and "/" characters, which covers both forms.
        ul_list.append(ul_head + href.lstrip("./"))
    # Fetch and parse every article linked from this listing page.
    wenzhang = []
    for ul in ul_list:
        rep = requests.get(ul, headers=headers)
        rep.encoding = chardet.detect(rep.content)["encoding"]
        wenzhang.append(jiexi(rep))
    print(wenzhang)
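The crawler builds absolute article links by trimming the leading "../" segments from each href and prepending the site root. As a hedged alternative, the standard library's urllib.parse.urljoin can resolve a relative href against the listing page it came from, so the number of "../" segments does not need special handling. This is only a sketch: resolve_links is a hypothetical helper name, and the sample hrefs are made up for illustration, not taken from the site.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Resolve each relative href against the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical hrefs, for illustration only; real paths on the site may differ.
first_page = "http://www.zjitc.net/xwzx/xyxw.htm"
later_page = "http://www.zjitc.net/xwzx/xyxw/358.htm"

print(resolve_links(first_page, ["../info/1001/1.htm"]))
print(resolve_links(later_page, ["../../info/1001/2.htm"]))
```

Both calls resolve to a path directly under the site root, which matches what the trimming approach produces for these two link forms.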