Crawling the National Laws and Regulations Database
2022-06-27 19:40:00 【Mozi】
Part1 Experimental purpose
The era of big data has arrived, and web crawling has become an indispensable technique of this era. The targets of a crawler are very rich: text, images, video, and any other structured or unstructured data can all be crawled. Enterprises need data to analyze user behavior, to find the shortcomings of their own products, and to study competitor information. Individuals can write a crawler (developed here in the Spyder IDE) to obtain the information they need in advance, for example to grab train tickets, register for courses, or collect information on postgraduate admission transfers. This article takes crawling the National Laws and Regulations Database as its example.
Author: Fuyan, Finance 1903, School of Economics and Management, Jiangxi Agricultural University
Part2 Experimental steps
1 Observe the web page
Browsing the site, we notice that when we open a specific law or regulation the address in the browser does not change, which tells us the page is dynamic: its content is loaded by background requests.
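We can double-check this from code with a minimal sketch (it assumes the site root seen later in the API address; content injected by JavaScript will be absent from the raw HTML):

import requests

html = requests.get('https://flk.npc.gov.cn/').text
# If a law title you can see in the browser does not appear in the raw HTML,
# the list is filled in later by JavaScript, i.e. the page is dynamic.
print('中华人民共和国民法典' in html)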


2 Request the web page
Right-click in the browser and choose "Inspect", open the "Network" tab, select the "Fetch/XHR" filter, and refresh the page. Searching with Ctrl+F for a known entry and clicking the request that appears, we open "Preview" and find the fields we need: title, enacting authority, legal nature, timeliness, release date, and so on. Clicking "Headers" shows the real address of the request.

3 Try to get the information on the first page
Use the requests library to request the API; the request method is GET.


Under "Headers" we can confirm that the request method is GET; the "Payload" tab shows the parameters carried by the GET request. The request code is as follows:
import requests

items = []
url = 'https://flk.npc.gov.cn/api/?type=flfg&xlwj=05&searchType=title%3Baccurate&sortTr=f_bbrq_s%3Bdesc&gbrqStart=&gbrqEnd=&sxrqStart=&sxrqEnd=&sort=true&page=1&size=10&_=1654157294070'
r = requests.get(url)
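Before parsing, a quick sanity check that the request succeeded and the response has the shape seen in the Preview pane (a sketch; the result key comes from that preview):

print(r.status_code)  # expect 200
print(r.json()['result'].keys())  # should include 'data', per the Preview pane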
4 Parse and store the data
Because the page returns data in JSON format, we can read the title, enacting authority, legal nature, timeliness, and release date we need through dictionary access. How do we get the web link of each regulation? Clicking the first entry, we find that the URL suffix of its detail page is stored in the record's url field, so joining it with the site root gives the complete link to the regulation's detail page.


We then access the returned dictionary layer by layer to extract all the information on one page:
json_data = r.json()  # json_data avoids shadowing the json module
pagelist = json_data['result']['data']
for page in pagelist:
    office = page['office']
    title = page['title']
    type_ = page['type']  # type_ avoids shadowing the built-in type()
    date = page['publish']
    link = 'https://flk.npc.gov.cn' + page['url']  # join the site root with the stored suffix
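Printing the first record is an easy way to confirm the field names line up with what the Preview pane showed:

first = pagelist[0]
print(first['title'], first['office'], first['type'], first['publish'])
print('https://flk.npc.gov.cn' + first['url'])  # full detail-page link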
5 Loop to crawl all pages of regulation data
The key to crawling every page is to find the "page turning" rule in the real address. Clicking page 1, page 2, and page 3 in turn, we find that the addresses differ only in the page parameter: page 1 has page=1, page 2 has page=2, page 3 has page=3, and so on.
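Since only the page parameter changes, we can turn the real address into a template and substitute the page number (a sketch; the {} placeholder replaces the original page=1):

url_template = ('https://flk.npc.gov.cn/api/?type=flfg&xlwj=05'
                '&searchType=title%3Baccurate&sortTr=f_bbrq_s%3Bdesc'
                '&gbrqStart=&gbrqEnd=&sxrqStart=&sxrqEnd='
                '&sort=true&page={}&size=10&_=1654157294070')
print(url_template.format(3))  # prints the real address of page 3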
We nest the code above inside a for loop and use pandas to store the data. Running the code crawls the first 10 pages of regulation information automatically and stores them in the file 666661.csv. The complete code is as follows:
import requests
import pandas as pd

items = []
url = 'https://flk.npc.gov.cn/api/?type=flfg&xlwj=05&searchType=title%3Baccurate&sortTr=f_bbrq_s%3Bdesc&gbrqStart=&gbrqEnd=&sxrqStart=&sxrqEnd=&sort=true&page={}&size=10&_=1654157294070'
for i in range(1, 11):  # pages 1 to 10
    r = requests.get(url.format(i))
    json_data = r.json()
    pagelist = json_data['result']['data']
    for page in pagelist:
        office = page['office']
        title = page['title']
        type_ = page['type']  # type_ avoids shadowing the built-in type()
        date = page['publish']
        link = 'https://flk.npc.gov.cn' + page['url']  # full link to the detail page
        items.append([title, office, type_, date, link])

df = pd.DataFrame(items, columns=['title', 'office', 'type', 'date', 'url'])
df.to_csv('666661.csv', encoding='utf-8-sig')
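If you extend the crawl to many more pages, it is friendlier to the server to pause between requests and to send a browser User-Agent header; a minimal sketch of the change (the header value is just an example, any common browser string works):

import time

headers = {'User-Agent': 'Mozilla/5.0'}  # example browser identity
for i in range(1, 11):
    r = requests.get(url.format(i), headers=headers)
    # ... parse and append as in the loop above ...
    time.sleep(1)  # wait a second between pages to keep the request rate low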
Finally, the crawled results are as follows:
