
How to extract dates from web pages?

2022-06-24 22:10:00 【Blue92120】

News body text can be extracted with fairly high accuracy, but because the publication time is extracted with regular expressions, the results for that field are sometimes unsatisfactory.
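For context, a regex-based extractor of the kind described above might look like the following minimal sketch. The pattern and the function name `extract_datetime` are assumptions made for illustration, not the code actually used by the author.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical pattern: matches timestamps such as "2022-03-09 14:30:05"
# or "2022/03/09 14:30". Any other layout is silently missed.
DATE_PATTERN = re.compile(
    r"(\d{4})[-/](\d{1,2})[-/](\d{1,2})\s+(\d{1,2}):(\d{2})(?::(\d{2}))?"
)


def extract_datetime(html: str) -> Optional[datetime]:
    """Return the first timestamp found in the HTML, or None if nothing matches."""
    match = DATE_PATTERN.search(html)
    if match is None:
        return None
    year, month, day, hour, minute, second = match.groups()
    try:
        return datetime(int(year), int(month), int(day),
                        int(hour), int(minute), int(second or 0))
    except ValueError:
        # The digits matched the pattern but are not a real date (e.g. month 13).
        return None


print(extract_datetime("<span>2022-03-09 14:30:05</span>"))
print(extract_datetime("<span>March 9, 2022</span>"))
```

The second call finds nothing because the date is written in a layout the pattern does not cover, which is the kind of miss that makes a purely regex-based approach unreliable across many sites.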

Recently I found a third-party Python library called htmldate. In my tests, it extracts the publication time of news articles more accurately. Let's see how to use it.

First, install it with pip:

python3 -m pip install htmldate

Then we fetch the page source with Requests or Selenium and pass it to find_date:

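import requests
from htmldate import find_date

# Download the page and decode the raw response bytes as UTF-8.
html = requests.get('https://www.kingname.info/2022/03/09/this-is-gnelist/').content.decode('utf-8')

# find_date returns the publication date as a string (YYYY-MM-DD by default).
date = find_date(html)
print(date)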

The result is shown in the figure below:

The publication date of that article is indeed March 9.
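The snippet above decodes the response as UTF-8, which raises UnicodeDecodeError on pages served in another encoding (some Chinese sites, for example, have used GBK). The sketch below is one possible way to guard against that before handing the text to find_date. The helper name `decode_html` and the list of fallback encodings are assumptions for illustration, not part of htmldate or Requests.

```python
from typing import Sequence


def decode_html(raw: bytes, encodings: Sequence[str] = ("utf-8", "gb18030")) -> str:
    """Decode raw page bytes, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and replace undecodable
    # bytes, so date extraction can still be attempted on the remaining text.
    return raw.decode(encodings[0], errors="replace")


# A GBK-encoded fragment is not valid UTF-8, so the gb18030 fallback decodes it.
gbk_bytes = "发布时间".encode("gbk")
print(decode_html(gbk_bytes))
```

With Requests, the raw bytes are available as `response.content`, so `decode_html(response.content)` could replace the `.content.decode('utf-8')` call.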

Let's look at another example from NetEase News: "Encourage each other, enhance friendship (Wonderful bloom) | Paralympic Games | Chinese delegation | Snowboarding | Win gold _ NetEase government affairs" [2]. The publication time of this article is shown in the figure below:

Now let's use Requests to fetch its source code and then extract the publication time:

The publication date is correct, but what happened to the time of day? To keep the hours, minutes, and seconds, pass the outputformat parameter. Its value is a format string of the kind you would pass to datetime.strftime:

find_date(html, outputformat='%Y-%m-%d %H:%M:%S')

The result is shown in the figure below:
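To make the format string concrete, the sketch below applies both formats with the standard library's datetime.strftime. The timestamp is made up for illustration and is not taken from either article. '%Y-%m-%d' is htmldate's default outputformat, which is why the time of day was dropped earlier.

```python
from datetime import datetime

# Illustrative timestamp only; not the real publication time of any article.
published = datetime(2022, 3, 9, 14, 30, 5)

# Date-only format: the hours, minutes, and seconds are discarded.
date_only = published.strftime('%Y-%m-%d')

# Format that also keeps the time of day.
with_time = published.strftime('%Y-%m-%d %H:%M:%S')

print(date_only)   # 2022-03-09
print(with_time)   # 2022-03-09 14:30:05
```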