How to extract dates from web pages?
2022-06-24 22:10:00 【Blue92120】
Although the accuracy of news body text extraction is relatively high, the release time is extracted with regular expressions, so the results for that field are sometimes unsatisfactory.
Recently I found a third-party Python library called htmldate. In my tests, it extracts the release time of news articles more accurately. Let's see how to use it.
First, install it with pip:
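To make the limitation concrete, here is a minimal sketch of regex-based date extraction. The pattern, the function name extract_date, and the two sample HTML snippets are assumptions made up for illustration. They are not the code used in the original extractor.

```python
import re

# Hypothetical pattern: matches numeric dates such as 2022-03-09 or 2022/3/9
DATE_PATTERN = re.compile(r'(\d{4})[-/](\d{1,2})[-/](\d{1,2})')


def extract_date(text):
    """Return the first numeric date found as YYYY-MM-DD, or None if nothing matches."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    year, month, day = match.groups()
    return f'{year}-{int(month):02d}-{int(day):02d}'


# A numeric date is found
numeric = extract_date('<span class="time">2022-03-09 10:00</span>')
# A date written out in words is missed by this pattern
written = extract_date('<span class="time">March 9, 2022</span>')
print(numeric, written)
```

Every new date format needs another pattern, which is why a single regular expression tends to miss some pages.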
python3 -m pip install htmldate
Then we use Requests or Selenium to fetch the source code of the page:
import requests
from htmldate import find_date

# Download the page and decode the raw bytes as UTF-8
html = requests.get('https://www.kingname.info/2022/03/09/this-is-gnelist/').content.decode('utf-8')
# find_date returns the detected date as a string (YYYY-MM-DD by default)
date = find_date(html)
print(date)
The result is shown in the figure below:

And this article was indeed published on March 9.

Let's look at another example, a NetEase News article: "Encourage each other, enhance friendship (Wonderful bloom) | Paralympic Games | Chinese delegation | Snowboarding | Win gold _ NetEase government affairs" [2]. The release time of this news item is shown in the figure below:

Now let's use Requests to fetch its source code and then extract the release time:

The release date is indeed correct, but why was the time of day lost? To keep the hours, minutes, and seconds, add the outputformat parameter. Its value is a format string of the kind you would pass to datetime.strftime:
find_date(html, outputformat='%Y-%m-%d %H:%M:%S')
The result is shown in the figure:
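As a quick reference for what the format string controls, the sketch below applies the same two format strings with the standard library's datetime.strftime. The timestamp is a made-up value chosen for illustration and is not taken from the news page.

```python
from datetime import datetime

# Hypothetical timestamp, used only to illustrate the format directives
published = datetime(2022, 3, 9, 14, 30, 5)

# Date only: the same directives as htmldate's default output format
date_only = published.strftime('%Y-%m-%d')
# Date plus hours, minutes, and seconds
with_time = published.strftime('%Y-%m-%d %H:%M:%S')

print(date_only)
print(with_time)
```

Here %H, %M, and %S are the hour (24-hour clock), minute, and second, so leaving them out of the format string drops the time of day from the output.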

Besides the page source code, find_date also accepts a URL or an lxml DOM object. For example:
from lxml.html import fromstring
# Parse the HTML source fetched earlier into an lxml element tree
selector = fromstring(html)
date = find_date(selector)