Crawling exercise: scraping the Henan Agricultural University notices
2022-07-06 07:07:00 【Rong AI holiday】
Henan Agricultural University notices
Introduction to web crawlers
With the rapid development of the Internet, the World Wide Web has become the carrier of a vast amount of information, and extracting and using that information effectively has become a huge challenge. Search engines, such as the traditional general-purpose engines AltaVista, Baidu, Yahoo!, and Google, help people retrieve information and serve as the entrance and guide to the Web. However, these general-purpose search engines have some limitations:
(1) Users in different fields and with different backgrounds often have different retrieval goals, and the results returned by a general-purpose search engine contain a large number of pages the user does not care about.
(2) A general-purpose search engine aims to maximize network coverage, which further deepens the conflict between limited search-engine server resources and unlimited network data resources.
(3) As data on the Web grows richer and network technology keeps developing, images, databases, audio/video, and other multimedia data appear in large quantities. General-purpose search engines are often powerless with this information-dense, loosely structured data and cannot discover or retrieve it well.
(4) General-purpose search engines mostly offer keyword-based retrieval and have difficulty supporting queries based on semantic information.
To address these problems, focused crawlers, which fetch relevant web resources in a targeted way, emerged. A focused crawler is a program that automatically downloads web pages: it selectively visits pages and follows links on the Web according to a preset goal, fetching only the information it needs. Unlike a general-purpose web crawler, a focused crawler does not pursue broad coverage; its goal is to fetch pages related to a specific topic and to prepare data for topic-oriented user queries.
Crawling the Henan Agricultural University notices page
Purpose of the crawl
The new semester is starting, and I want to follow the school's news to see whether there is any information about the start of term. As a coding exercise, the program scrapes and stores the data from the Henan Agricultural University notices page:
https://www.henau.edu.cn/news/xwgg/index.shtml
Analysis steps
1. Import libraries
import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm
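Apart from csv, which is in the standard library, these are third-party packages. Assuming a standard Python environment, they can be installed with pip (lxml is needed because the code below uses BeautifulSoup's 'lxml' parser):
pip install requests beautifulsoup4 lxml tqdm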
2. Analyze the web pages
Compare the following URLs:
https://www.henau.edu.cn/news/xwgg/index.shtml
https://www.henau.edu.cn/news/xwgg/index_2.shtml
https://www.henau.edu.cn/news/xwgg/index_3.shtml
The pattern is obvious: page 1 is index.shtml, and page N (for N ≥ 2) is index_N.shtml, so a loop can generate every page URL, as shown in the sketch below.
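As a minimal sketch, the page URLs can be generated directly from this pattern (assuming, as in the code below, that the site currently has 4 listing pages):

base_url = "https://www.henau.edu.cn/news/xwgg/"
# Page 1 has no number suffix; pages 2 and up are index_N.shtml
urls = [base_url + "index.shtml"] + [base_url + "index_%s.shtml" % i for i in range(2, 5)]
print(urls)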
3. Single page analysis
Inspecting the pages shows that they all share the same general structure; the information we need just has to be selected from the .news_list container, where each <li> holds a date in a <span> and a title in an <a>.
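A minimal sketch of extracting one page, assuming the .news_list structure described above still matches the live site:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.henau.edu.cn/news/xwgg/index.shtml",
                    headers={'User-Agent': 'Mozilla/5.0'})
resp.encoding = resp.apparent_encoding  # let requests detect the page encoding
soup = BeautifulSoup(resp.text, 'lxml')
for li in soup.select('.news_list ul li'):
    span, a = li.find('span'), li.find('a')
    if span and a:
        print(span.string, a.string)  # date, title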
4. Save the file
with open('notices.csv', 'w', newline='', encoding='utf-8-sig') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(['date', 'title'])
    fileWriter.writerows(info_list)
- The open() call must be written with newline='' and encoding='utf-8-sig'; utf-8-sig writes a UTF-8 byte-order mark, which prevents garbled Chinese characters when the CSV is opened in Excel.
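A quick illustration of what utf-8-sig adds compared with plain utf-8:

# utf-8-sig prepends the BOM bytes EF BB BF, which Excel uses to detect UTF-8
print('date'.encode('utf-8'))      # b'date'
print('date'.encode('utf-8-sig'))  # b'\xef\xbb\xbfdate'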
Crawl results
Code
Explanation of key parts of the code
t = li.find('span')
if t:
    date = t.string
t = li.find('a')
if t:
    title = t.string
Be sure to use find here: each <li> contains only one <span>, and find returns that single element directly, whereas find_all would return a list. Also check whether t actually has a value before reading t.string; otherwise an <li> missing the expected tag would raise an error. See the small demonstration below.
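A minimal illustration of the difference, on a made-up HTML fragment:

from bs4 import BeautifulSoup

li = BeautifulSoup('<li><span>2022-07-06</span><a>Title</a></li>', 'lxml').li
print(li.find('span'))      # a single Tag: <span>2022-07-06</span>
print(li.find_all('span'))  # always a list: [<span>2022-07-06</span>]
print(li.find('b'))         # None -- hence the `if t:` guard before t.string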
for i in tqdm(range(1, 5)):
    if i == 1:
        url = base_url + "index.shtml"
    else:
        url = base_url + "index_%s.shtml" % i
This is the core of the paginated crawl: page 1 has no number suffix, so it is handled as a special case.
Full code
import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm

def get_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
    }
    # headers must be passed as a keyword argument, not positionally
    resp = requests.get(url, headers=headers)
    # let requests detect the page encoding to avoid garbled text
    resp.encoding = resp.apparent_encoding
    page_text = resp.text
    soup = BeautifulSoup(page_text, 'lxml')
    lists = soup.select('.news_list ul li')
    info_list_page = []
    for li in lists:
        date = None
        t = li.find('span')
        if t:
            date = t.string
        t = li.find('a')
        if t:
            title = t.string
            info_list_page.append([date, title])
    return info_list_page

def main():
    info_list = []
    base_url = "https://www.henau.edu.cn/news/xwgg/"
    print("Crawling the notices...........")
    for i in tqdm(range(1, 5)):
        if i == 1:
            url = base_url + "index.shtml"
        else:
            url = base_url + "index_%s.shtml" % i
        info_page = get_page(url)
        info_list += info_page
    print("Crawl finished!!!!!!!!!!")
    with open('notices.csv', 'w', newline='', encoding='utf-8-sig') as file:
        fileWriter = csv.writer(file)
        fileWriter.writerow(['date', 'title'])
        fileWriter.writerows(info_list)

if __name__ == "__main__":
    main()
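As a quick sanity check, assuming the script wrote notices.csv as above, the file can be read back with the same csv module:

import csv

with open('notices.csv', encoding='utf-8-sig') as f:
    rows = list(csv.reader(f))
print(rows[0])        # header row: ['date', 'title']
print(len(rows) - 1)  # number of notices scraped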
If this article helped you, I hope you'll give it a like and some support. Thank you very much!