Crawler web page parsing: the object.element-name method
2022-07-27 00:21:00 【For a long time, the duck will become a goose】
Parsing web pages with the 'html.parser' parser
In a web page's HTML document, find the charset attribute of the <meta> element. This attribute stores the page's encoding, which can be used as the value of the Response object's encoding attribute.
What a request returns is an HTML page. Parsing the data means converting the HTML document into a Python object that a Python program can work with. The tool that parses an HTML document is called a parser.
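As a rough illustration of reading the charset attribute, the sketch below parses a small, made-up HTML string rather than a live page. The sample markup and the variable names are assumptions for this example; find() is a standard bs4 method that the article itself does not cover.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; in a real crawler this would be res.text
html = '<html><head><meta charset="utf-8"><title>Demo</title></head><body></body></html>'

bs = BeautifulSoup(html, 'html.parser')
# Find the first <meta> element that carries a charset attribute
meta_tag = bs.find('meta', charset=True)
# Fall back to None when the page does not declare a charset
charset = meta_tag['charset'] if meta_tag is not None else None
print(charset)
```

The value found this way could then be assigned to res.encoding before reading res.text.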
The bs4 library
bs4 is a third-party library that parses HTML documents by means of a parser.
Parsing an HTML document is the process of instantiating the BeautifulSoup class to obtain a BeautifulSoup object.
Import
# Use from ... import to bring in the BeautifulSoup class from the bs4 library
from bs4 import BeautifulSoup
Syntax for instantiating the BeautifulSoup class: BeautifulSoup(html, 'html.parser')
# Import the requests library
import requests
# Import the BeautifulSoup class from bs4
from bs4 import BeautifulSoup
# URL of the web page for "The Mob"
url = 'https://wp.forchange.cn/psychology/11069/'
# Request the page and assign the result to the variable res
res = requests.get(url)
# Print the response status code to check whether the request succeeded
print(res.status_code)
# Set the encoding of the response content
res.encoding = 'utf-8'
# Parse the requested page with BeautifulSoup and the 'html.parser' parser
bs = BeautifulSoup(res.text, 'html.parser')
# Print the parsing result to inspect it
print(bs)
Extracting data
Among the objects we already know:
1) A list is a sequence: its elements are arranged in order, and values are read by index;
2) A dictionary is a mapping: keys correspond to values, and values are read by key;
3) An Excel worksheet object is organized as cells arranged in rows and columns, and values can be read by row, by column, or by cell.
A BeautifulSoup object, by contrast, has a tree structure that mirrors the layer-by-layer nesting of HTML.
Each node in a BeautifulSoup object is itself another Python object: a Tag object. For example, the <html> node, the <head> node, the <body> node, and the <header> node are all Tag objects.
The BeautifulSoup object represents the whole HTML document, and Tag objects correspond one-to-one with the elements in that document. To find a particular HTML element, we only need the program to find the corresponding Tag object.
Just as elements in an HTML document are nested inside other elements, the nodes in a BeautifulSoup object are nested layer by layer. For example, the <html> node contains the <head> node and the <body> node; the <body> node contains the <svg> node, the <header> node, and so on.
In the tree view of a BeautifulSoup object, for two directly connected nodes, the upper node is called the parent node of the lower node, and the lower node is called the child node of the upper node.
Nodes in a BeautifulSoup object correspond one-to-one with the elements of the HTML document (not with the tags of those elements).
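The tree relationships described above can be checked directly. The following is a minimal sketch on a made-up HTML string; the markup and variable names are assumptions for this example, and find(), .parent, and .children are standard bs4 features that the article does not introduce.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample document, written without whitespace between elements
html = '<html><head><title>Demo</title></head><body><div id="header"><h1>Hello</h1></div></body></html>'
bs = BeautifulSoup(html, 'html.parser')

h1_tag = bs.find('h1')
# Every element node in the tree is a Tag object
print(isinstance(h1_tag, Tag))
# The parent of <h1> is the <div> that encloses it
parent_name = h1_tag.parent.name
print(parent_name)
# The child nodes of <html> are <head> and <body>
child_names = [child.name for child in bs.find('html').children]
print(child_names)
```

Note that with indented or multi-line HTML, .children would also yield the whitespace text between elements, which is why the sample string is kept on one line.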
Extracting nodes by their nesting relationships in the BeautifulSoup object
The BeautifulSoup object.element-name method
.element-name is an operation shared by BeautifulSoup objects and Tag objects. It returns a descendant node nested in the current node. The element name is the name of an element in the HTML document, and the result of .element-name is a Tag object.
For a BeautifulSoup object, the syntax is BeautifulSoup object.element name.
Because every Tag object is nested inside the BeautifulSoup object, BeautifulSoup object.element name can in principle reach any node. However, if the HTML document contains several elements with the same name, only the Tag object for the first matching element is returned.
from bs4 import BeautifulSoup
html = '''
<html>
<head>
<meta charset="utf-8">
<title> Dachuan God's reptilian world </title>
</head>
<body>
<div id="header">
<h1> Sichuan God teaches you HTML</h1>
</div>
<div class="poems" id="section1">
<h2> In the Quiet Night </h2>
<h3> Li Bai ( The tang dynasty )</h3>
<p> abed, I see a silver light , The frost on the ground .<br> look at the bright moon , Bow your head and think of your hometown .</p>
</div>
<div class="poems" id="section2">
<h2> Early onset Baidicheng </h2>
<h3> Li Bai ( The tang dynasty )</h3>
<p> Leaving at dawn the White King crowned with rainbow cloud , I have sailed a thousand miles through Three Gorges in a day .<br> With monkeys' sad adieus the riverbanks are loud , My boat has left ten thousand mountains far away .</p>
</div>
</body>
</html>
'''
# Parse the HTML document
bs = BeautifulSoup(html, 'html.parser')
# Use .div to get the <div> node
div_tag = bs.div
# Print the result
print(div_tag)
bs.div uses .div to get the <div> node.
As long as you know the name of an HTML element, you can get its Tag object. But this only returns the Tag object for the first element in the BeautifulSoup object whose name matches.
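To make the "first match only" behaviour concrete, here is a small sketch on a shortened, made-up version of the page. The markup is an assumption for this example, and find_all() is a standard bs4 method, not something the article covers, shown only as one way to reach the remaining elements.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample with three <div> elements
html = '''
<div id="header"><h1>Title</h1></div>
<div class="poems" id="section1"><h2>Poem one</h2></div>
<div class="poems" id="section2"><h2>Poem two</h2></div>
'''
bs = BeautifulSoup(html, 'html.parser')

# .div returns only the first <div> in the document
first_div = bs.div
print(first_div['id'])

# find_all('div') returns every <div>, in document order
div_ids = [div['id'] for div in bs.find_all('div')]
print(div_ids)
```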
The Tag object.element-name method
It returns the first Tag object nested inside the given Tag object whose element name matches.
from bs4 import BeautifulSoup
html = '''
<html>
<head>
<meta charset="utf-8">
<title> Dachuan God's reptilian world </title>
</head>
<body>
<div id="header">
<h1> Sichuan God teaches you HTML</h1>
</div>
<div class="poems" id="section1">
<h2> In the Quiet Night </h2>
<h3> Li Bai ( The tang dynasty )</h3>
<p> abed, I see a silver light , The frost on the ground .<br> look at the bright moon , Bow your head and think of your hometown .</p>
</div>
<div class="poems" id="section2">
<h2> Early onset Baidicheng </h2>
<h3> Li Bai ( The tang dynasty )</h3>
<p> Leaving at dawn the White King crowned with rainbow cloud , I have sailed a thousand miles through Three Gorges in a day .<br> With monkeys' sad adieus the riverbanks are loud , My boat has left ten thousand mountains far away .</p>
</div>
</body>
</html>
'''
# Parse the HTML document
bs = BeautifulSoup(html, 'html.parser')
# Use .div to get the <div> node
div_tag = bs.div
# Get the <h1> node from the <div> node
h1_tag = div_tag.h1
# Print the result
print(h1_tag)
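Because Tag object.element name searches only inside that Tag object, the same element name can give different results depending on where the search starts. The sketch below illustrates this on a shortened, made-up page; the markup is an assumption, and find() with an id argument is a standard bs4 call used here only to pick a starting Tag.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample with two poem sections
html = '''
<div id="header"><h1>Title</h1></div>
<div class="poems" id="section1"><h2>Poem one</h2></div>
<div class="poems" id="section2"><h2>Poem two</h2></div>
'''
bs = BeautifulSoup(html, 'html.parser')

# Starting from the whole document, .h2 is the first <h2> anywhere
doc_h2 = bs.h2.text
print(doc_h2)

# Starting from the second section, .h2 is the first <h2> inside that <div>
section2 = bs.find('div', id='section2')
section_h2 = section2.h2.text
print(section_h2)
```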
The .element-name operation also has an extended usage
.element name.element name.…… can be chained, on both BeautifulSoup objects and Tag objects.
Chaining works because of the nesting relationships between the BeautifulSoup object and Tag objects, and between Tag objects themselves. For example, bs.div.h1 also gets the <h1> node.
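One thing to watch when chaining: if any step finds no matching element, bs4 returns None for that step, and continuing the chain raises an AttributeError. The following is a minimal sketch on made-up markup; the sample HTML and variable names are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first <div> holds an <h1> but no <h2>
html = '<html><body><div id="header"><h1>Title</h1></div><div id="section1"><h2>Poem</h2></div></body></html>'
bs = BeautifulSoup(html, 'html.parser')

# A short chain and a fully spelled-out chain reach the same node
h1_tag = bs.div.h1
same_tag = bs.html.body.div.h1
print(h1_tag == same_tag)

# The first <div> contains no <h2>, so this step yields None
missing = bs.div.h2
# Guard against None before reading attributes such as .text
h2_text = missing.text if missing is not None else ''
print(repr(h2_text))
```

Checking for None at the step that might fail keeps a crawler from crashing on pages whose layout differs from what was expected.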


