Crawl Zhejiang industry and trade news page
2022-07-04 10:26:00 【weixin_ forty-six million three hundred and sixty-four thousand】
from urllib.parse import urljoin

import chardet
import requests
from lxml import etree

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}


def fetch(url):
    """Download a page and set its encoding from the detected charset."""
    rep = requests.get(url, headers=HEADERS, timeout=10)
    rep.encoding = chardet.detect(rep.content)["encoding"]
    return rep


def strip_label(text, label):
    """Remove a leading label if present.

    str.lstrip() strips a set of characters, not a prefix, so it can also
    eat the first characters of the value itself.
    """
    text, label = text.strip(), label.strip()
    return text[len(label):].strip() if text.startswith(label) else text


def jiexi(rep):
    """Parse one article page into a dict: title, author, source, time, body."""
    et = etree.HTML(rep.text)
    biaoti = et.xpath("//h2/text()")[0]
    # The first "zz" div holds author and source; the second holds the release time.
    zz1 = et.xpath("//div[@class='zz'][1]/text()")[0].split()
    zz2 = et.xpath("//div[@class='zz'][2]/text()")[0].split("\xa0\xa0")
    # NOTE: the label strings below appear in translated form in this copy of
    # the post. They must match the label text on the live page (presumably
    # the Chinese originals), otherwise the labels are left in the values.
    zuozhe = strip_label(zz1[0], " author :")
    laiyuan = strip_label(zz1[1], " source :")
    shijian = strip_label(zz2[1], " Release time :")
    # Concatenate every text node of the body and drop all whitespace.
    texts = et.xpath("//div[@class='nr-content-con fl']/div[1]//text()")
    zhengwen = "".join("".join(texts).split())
    return {
        "biaoti": biaoti,
        "zuozhe": zuozhe,
        "laiyuan": laiyuan,
        "shijian": shijian,
        "zhengwen": zhengwen,
    }


# Listing pages: the first page has a fixed name, the next nine count down from 358.
url_list = ["http://www.zjitc.net/xwzx/xyxw.htm"]
for i in range(1, 10):
    url_list.append("http://www.zjitc.net/xwzx/xyxw/" + str(359 - i) + ".htm")

# Collect the article links from every listing page.
ul_list = []
for url in url_list:
    et = etree.HTML(fetch(url).text)
    for li in et.xpath("//div[@class='right-1']/ul/li"):
        # The hrefs are relative ("../" or "../../"); urljoin resolves them
        # against the listing page URL instead of stripping characters.
        ul_list.append(urljoin(url, li.xpath("./a/@href")[0]))

# Fetch and parse each article.
wenzhang = [jiexi(fetch(ul)) for ul in ul_list]
print(wenzhang)
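A note on the relative links: `str.lstrip` removes any leading characters that belong to the given set, not a literal prefix, so stripping `"../"` only works as long as the remaining path does not itself begin with a dot or a slash. The sketch below compares that with `urllib.parse.urljoin` from the standard library. The listing URL follows the pattern used in the script, but the `href` values are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

# Listing page URL in the same pattern as the script; hrefs are hypothetical.
page = "http://www.zjitc.net/xwzx/xyxw/358.htm"
href_deep = "../../info/1001/2002.htm"
href_dot = "../.hidden/x.htm"  # path segment that itself starts with "."

# lstrip treats "../" as the character set {".", "/"}, so it also removes
# the leading dot of ".hidden".
stripped = href_dot.lstrip("../")

# urljoin resolves each "../" against the directory of the page URL.
resolved_deep = urljoin(page, href_deep)
resolved_dot = urljoin(page, href_dot)

print(stripped)       # hidden/x.htm
print(resolved_deep)  # http://www.zjitc.net/info/1001/2002.htm
print(resolved_dot)   # http://www.zjitc.net/xwzx/.hidden/x.htm
```

Resolving against the page URL also removes the need to check how many `../` segments a link starts with.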