Crawling the Zhejiang Industry and Trade news pages
2022-07-04 10:26:00 【weixin_ forty-six million three hundred and sixty-four thousand】
import re
from urllib.parse import urljoin

import chardet
import requests
from lxml import etree

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}


def after_colon(text):
    # Drop a leading "label:" (ASCII or full-width colon) and keep the value.
    # str.lstrip() is not used here: it strips a set of characters, not a prefix,
    # so it can also eat the first characters of the value itself.
    return re.split(r"[:：]", text, maxsplit=1)[-1].strip()


def fetch(url):
    # Download a page and let chardet guess the encoding from the raw bytes.
    rep = requests.get(url, headers=HEADERS)
    rep.encoding = chardet.detect(rep.content)["encoding"]
    return rep


def jiexi(rep):
    # Parse one article page into a dict of its fields.
    et = etree.HTML(rep.text)
    biaoti = et.xpath("//h2/text()")[0]  # title
    # The first "zz" div holds the author and source labels, separated by whitespace.
    zz1 = et.xpath("//div[@class='zz'][1]/text()")[0].split()
    zuozhe = after_colon(zz1[0])  # author
    laiyuan = after_colon(zz1[1])  # source
    # The second "zz" div holds the release time after two non-breaking spaces.
    zz2 = et.xpath("//div[@class='zz'][2]/text()")[0]
    shijian = after_colon(zz2.split("\xa0\xa0")[1])  # release time
    # Body text: join every text node, then remove all whitespace.
    nodes = et.xpath("//div[@class='nr-content-con fl']/div[1]//text()")
    zhengwen = "".join("".join(nodes).split())
    return {
        "biaoti": biaoti,
        "zuozhe": zuozhe,
        "laiyuan": laiyuan,
        "shijian": shijian,
        "zhengwen": zhengwen,
    }


# Listing pages: the front page first, then the numbered pages 358 down to 350.
url_list = ["http://www.zjitc.net/xwzx/xyxw.htm"]
for i in range(1, 10):
    url_list.append("http://www.zjitc.net/xwzx/xyxw/" + str(359 - i) + ".htm")

wenzhang = []  # parsed articles from all listing pages
for url in url_list:
    et = etree.HTML(fetch(url).text)
    for li in et.xpath("//div[@class='right-1']/ul/li"):
        href = li.xpath("./a/@href")[0]
        # Article links are relative ("../..." or "../../..."), so resolve
        # them against the URL of the listing page they came from.
        wenzhang.append(jiexi(fetch(urljoin(url, href))))
print(wenzhang)
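A note on trimming labels such as the author or source prefix from scraped text: `str.lstrip` takes a set of characters rather than a literal prefix, so it can silently remove the start of the value as well. The sketch below illustrates the difference with a made-up sample string; `strip_label` is a hypothetical helper, not part of the original script.

```python
def strip_label(text: str, label: str) -> str:
    """Remove `label` only when it is a literal prefix of `text`."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):].strip()
    return text


# Made-up sample value, chosen so that the name reuses letters of the label.
sample = "author: arthur"

# lstrip() removes every leading character found in the set
# {'a', 'u', 't', 'h', 'o', 'r', ':', ' '}, so the name disappears too.
stripped_as_set = sample.lstrip("author: ")

# Prefix removal keeps the value intact.
stripped_as_prefix = strip_label(sample, "author:")

print(repr(stripped_as_set))
print(repr(stripped_as_prefix))
```

On Python 3.9 and later, `str.removeprefix` does the same job as the helper's prefix check.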