Data extraction 2
2022-07-27 07:59:00 【Horse sailing】
Data extraction with regular expressions
Goals: master the common syntax of regular expressions,
master the common methods of the re module,
master the usage of the raw string prefix r.
1. What is a regular expression?
A regular expression is a pattern string built from predefined special characters and combinations of them; this pattern string expresses a filtering logic to apply to strings.
2. Common regular expression syntax
Knowledge points:
- characters in regular expressions
- predefined character classes
- quantifiers

There are more regex constructs than can be covered here; look the rest up as needed. For example, alternation ("or") is written with |.
Practice: what does the code below print?
import re
str1 = '<meta http-equiv="content-type" content="text/html;charset=utf-8"/>adacc/sd/sdef/24'
result = re.findall(r'<.*>', str1)
print(result)
3. Common methods of the re module
- pattern.match: match from the beginning of the string (find one)
- pattern.search: scan the whole string (find one)
- pattern.findall: find all matches

In greedy mode, once the expression matches, it consumes as many characters as possible.
In non-greedy mode, once the expression matches, it consumes as few characters as possible.
findall returns a list; if nothing matches, the list is empty.
re.findall("\d", "aef5teacher2") >>>> ['5', '2']
pattern.sub: replace matches
re.sub(r"\d", "_", "aef5teacher2") >>>> 'aef_teacher_'
re.compile: compile a pattern
Returns a pattern object p with the same methods as the re module, except that the pattern string (and any flags such as re.S) is passed to compile once instead of to every call.
p = re.compile(r"\d", re.S)
p.findall("aef5teacher2") >>>> ['5', '2']
4. Usage of the Python raw string r
Definition: in a raw string, every character is taken literally; there are no escape sequences and no unprintable characters. Raw strings exist mainly for patterns full of special characters. For example, the raw-string form of "\n" is r"\n" (a backslash followed by the letter n).
The length of a raw string:
len('\n')
# result 1
len(r'\n')
# result 2
'\n'[0]
# result '\n'
r'\n'[0]
# result '\\'
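This is why regex patterns are usually written as raw strings: r'\d' and the escaped form '\\d' denote the same two-character pattern, but the raw string states the intent directly. A quick sketch:

import re

s = "room 42"
print(re.findall(r'\d', s))   # ['4', '2'] -- raw string: the backslash passes through literally
print(re.findall('\\d', s))   # ['4', '2'] -- the same pattern, written with a doubled backslash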
Data extraction with the lxml module and XPath
Goals: understand what XPath is, understand the lxml module, and master XPath syntax.
lxml is a high-performance Python html/xml parser; with XPath we can quickly locate specific elements and retrieve node information.
1. The lxml module and XPath syntax
To extract specific content from text in html or xml form, we need to master the lxml module and XPath syntax.
- The lxml module can use XPath rule syntax to quickly locate specific elements in an HTML/XML document and retrieve node information (text content, attribute values).
- XPath (XML Path Language) is a language for finding information in HTML/XML documents; it can traverse the elements and attributes of an HTML/XML document.
- W3School documentation: http://www.w3school.com.cn/xpath/index.asp
- Extracting data from xml or html requires using the lxml module and XPath syntax together.
2. Google browser xpath helper Installation and use of plug-ins
If you want to make use of lxml Module extract data , We need to master xpath Rule of grammar . Next let's get to know xpath helper plug-in unit , It can help us practice xpath grammar **( See courseware for installation package – Tools folder )**
download Chrome plug-in unit XPath Helper
- Can be in chrome Download in App Store
take rar Extract the compressed package to the current folder
Open Google browser ----> Three points in the upper right corner ----> More tools ----> add-in

In the extended program interface , Click the switch in the upper right corner , After entering the developer mode , take xpath Drag folder into , Release the mouse

- installation is complete , check

3. XPath node relationships
(Figures: the node tree of an XML document and the relationships between XPath nodes; the original images are unavailable.)
4. XPath syntax: basic node selection
- XPath uses path expressions to select a node or a set of nodes in an XML document.
- These path expressions look very similar to the paths we use in an ordinary computer file system.
- When you select a tag with the Chrome plugin, the selected tag gains the attribute class="xh-highlight".
XPath syntax for locating nodes and extracting attributes or text content:
| Expression | Description |
|---|---|
| nodename | selects the element with this name |
| / | selects from the root node, or steps from one element to the next |
| // | selects matching nodes anywhere in the document, regardless of position |
| . | selects the current node |
| .. | selects the parent of the current node |
| @ | selects attributes |
| text() | selects text |
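A minimal, hedged sketch of these path expressions with lxml (the bookstore XML is an invented sample):

from lxml import etree

xml = '<bookstore><book><title lang="eng">Harry Potter</title><price>29.99</price></book><book><title lang="eng">Learning XML</title><price>39.95</price></book></bookstore>'
root = etree.XML(xml)
print(root.xpath('/bookstore/book/title/text()'))  # ['Harry Potter', 'Learning XML'] -- path from the root
print(root.xpath('//title/@lang'))                 # ['eng', 'eng'] -- attributes, anywhere in the document
book = root.xpath('//book')[0]
print(book.xpath('./title/text()'))                # ['Harry Potter'] -- . continues from the current node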
5. XPath syntax: predicates
Predicates select specific nodes by attribute value, index, and so on.
5.1 Predicate syntax
| Path expression | Result |
|---|---|
| //title[@lang="eng"] | selects all title elements whose lang attribute value is "eng" |
| /bookstore/book[1] | selects the first book child of bookstore |
| /bookstore/book[last()] | selects the last book child of bookstore |
| /bookstore/book[last()-1] | selects the second-to-last book child of bookstore |
| /bookstore/book[position()>1] | selects the book children of bookstore from the second onwards |
| //book/title[text()='Harry Potter'] | among all title elements under book, selects only those whose text is "Harry Potter" |
| /bookstore/book[price>35.00]/title | selects the title elements of the book elements in bookstore whose price element value is greater than 35.00 |
5.2 About XPath indices
- in XPath, the first element is at position 1
- the last element is last()
- the second-to-last is last()-1
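A small sketch of the index rules on an invented list:

from lxml import etree

root = etree.XML('<ul><li>a</li><li>b</li><li>c</li></ul>')
print(root.xpath('/ul/li[1]/text()'))             # ['a'] -- positions start at 1, not 0
print(root.xpath('/ul/li[last()]/text()'))        # ['c'] -- the last element
print(root.xpath('/ul/li[last()-1]/text()'))      # ['b'] -- the second-to-last
print(root.xpath('/ul/li[position()>1]/text()'))  # ['b', 'c'] -- from the second onwards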
6. XPath syntax: other common node selections
Use of //
- //a : all a elements in the current html page
- bookstore//book : all book elements anywhere under bookstore
Use of @
- //a/@href : the href of every a element
- //title[@lang="eng"] : title tags with lang="eng"
Use of text()
- //a/text() : the text directly under every a element
- //a[text()='next page'] : the a tag whose text is "next page"
- a//text() : all text anywhere under a
Finding specific nodes
- //a[1] : selects the first a
- //a[last()] : the last one
- //a[position()<4] : the first three
contains()
- //a[contains(text(),"next page")] : the a tags whose text contains the words "next page"
- //a[contains(@class,'n')] : the a tags whose class contains "n"
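A brief, hedged demonstration of //, @, text() and contains() together (the page snippet is invented):

from lxml import etree

html = etree.HTML('<div><a class="n1" href="/page2">next page</a><a class="x" href="/top">back to top</a></div>')
print(html.xpath('//a/@href'))                         # ['/page2', '/top'] -- every a's href
print(html.xpath('//a[text()="next page"]/@href'))     # ['/page2'] -- exact text match
print(html.xpath('//a[contains(@class, "n")]/@href'))  # ['/page2'] -- class contains "n"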
7. Installing and using the lxml module
lxml is a third-party module; install it before use.
7.1 Installing lxml
We extract from the xml or html response content obtained by sending requests:
pip/pip3 install lxml
7.2 What a crawler extracts from html
- text content from tags
- attribute values from tags
- for example, the value of an a tag's href attribute, to obtain a URL and keep crawling
7.3 Using the lxml module
Import the etree library from lxml:
from lxml import etree
Use etree.HTML to turn an html string (bytes or str type) into an Element object. The Element object has an xpath method that returns a list of results:
html = etree.HTML(text)
ret_list = html.xpath("xpath rule string")
The list returned by the xpath method falls into three cases:
- an empty list: the xpath rule located no elements
- a list of strings: the xpath rule matched text content or attribute values
- a list of Element objects: the xpath rule matched tags, and each Element in the list supports further xpath calls
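A tiny sketch of the three cases on invented markup:

from lxml import etree

html = etree.HTML('<div><a href="/x">go</a></div>')
print(html.xpath('//span'))      # [] -- nothing located
print(html.xpath('//a/@href'))   # ['/x'] -- a list of strings (attribute values)
print(html.xpath('//a'))         # [<Element a at 0x...>] -- Elements; each supports further .xpath() calls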
Example of using the lxml module
Run the following code and inspect the printed results:
from lxml import etree
text = ''' <div> <ul> <li class="item-1"> <a href="link1.html">first item</a> </li> <li class="item-1"> <a href="link2.html">second item</a> </li> <li class="item-inactive"> <a href="link3.html">third item</a> </li> <li class="item-1"> <a href="link4.html">fourth item</a> </li> <li class="item-0"> <a href="link5.html">fifth item</a> </ul> </div> '''
html = etree.HTML(text)
# get the list of hrefs and the list of titles
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
# assemble each pair into a dictionary
for href in href_list:
    item = {}
    item["href"] = href
    item["title"] = title_list[href_list.index(href)]
    print(item)
Practice
In the html string below, each li tag whose class is item-1 represents one piece of news data. Extract the text content and the link of its a tag and assemble them into a dictionary.
text = ''' <div> <ul> <li class="item-1"><a>first item</a></li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-inactive"><a href="link3.html">third item</a></li> <li class="item-1"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a> </ul> </div> '''
Note:
- group first, then extract the data; this avoids mixing up values that belong to different items
- guard against missing values
- continue extracting inside each group, as in the snippet below
from lxml import etree

html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")  # group first: one Element per news item
for li in li_list:
    item = {}
    item["href"] = li.xpath("./a/@href")[0] if len(li.xpath("./a/@href")) > 0 else None
    item["title"] = li.xpath("./a/text()")[0] if len(li.xpath("./a/text()")) > 0 else None
    print(item)
##### Knowledge point: master locating elements with XPath syntax in the lxml module and extracting attribute values or text content
##### Using the etree.tostring function of the lxml module
Run the code below and compare the original html string with the printed output.
from lxml import etree
html_str = ''' <div> <ul> <li class="item-1"><a href="link1.html">first item</a></li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-inactive"><a href="link3.html">third item</a></li> <li class="item-1"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a> </ul> </div> '''
html = etree.HTML(html_str)
handled_html_str = etree.tostring(html).decode()
print(handled_html_str)
Phenomenon and conclusion
Comparing the result with the original:
- missing parts are completed automatically: the unclosed li tag is closed, and html and body tags are added
<html><body><div> <ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul> </div> </body></html>
Conclusion:
- lxml.etree.HTML(html_str) automatically completes missing tags
- the lxml.etree.tostring function converts an Element object back into an html string
- if a crawler uses lxml to extract data, it should base the extraction on the completed string returned by lxml.etree.tostring
After-class practice
First use
We use lxml to parse HTML code. A simple example:
# lxml_test.py
# use the etree library from lxml
from lxml import etree
html = ''' <div> <ul> <li class="item-0"><a href="link1.html">first item</a></li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-inactive"><a href="link3.html">third item</a></li> <li class="item-1"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a> <!-- note: a closing </li> tag is deliberately missing here --> </ul> </div> '''
# use etree.HTML to parse the string into an HTML document
xml_doc = etree.HTML(html)
# serialize the HTML document back into a string
html_doc = etree.tostring(xml_doc)
print(html_doc)
Output:
<html><body>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body></html>
lxml automatically corrects the html code: in this example it not only closes the li tag but also adds the body and html tags.
Reading from a file:
Besides parsing strings directly, lxml can also read content from files. Create a new hello.html file:
<!-- hello.html -->
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
Then use the etree.parse() method to read the file:
# lxml_parse.py
from lxml import etree
# read the external file hello.html; pass an HTML parser so the document is completed as before
html = etree.parse('./hello.html', etree.HTMLParser())
result = etree.tostring(html, pretty_print=True).decode()
print(result)
The output is the same as before:
<html><body>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body></html>
XPath examples
1. Get all <li> tags
# xpath_li.py
from lxml import etree

html = etree.parse('hello.html')
print(type(html))  # show the return type of etree.parse()
result = html.xpath('//li')
print(result)  # print the list of <li> Element objects
print(len(result))
print(type(result))
print(type(result[0]))
Output:
<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>
2. Get the class attribute of all <li> tags
# xpath_li.py
from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//li/@class')
print(result)
Output:
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
3. Get the <a> tags under <li> whose href is link1.html
# xpath_li.py
from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//li/a[@href="link1.html"]')
print(result)
Output:
[<Element a at 0x10ffaae18>]
4. Get all <span> tags under <li>
# xpath_li.py
from lxml import etree
html = etree.parse('hello.html')
# result = html.xpath('//li/span')
# note: the line above is wrong:
# / selects only child elements, and <span> is not a child of <li>, so use a double slash here
result = html.xpath('//li//span')
print(result)
Output:
[<Element span at 0x10d698e18>]
5. Get all class attributes under the <a> tags inside <li>
# xpath_li.py
from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//li/a//@class')
print(result)
Output:
['bold']
6. Get the href attribute value of the <a> inside the last <li>
# xpath_li.py
from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//li[last()]/a/@href')
# the predicate [last()] selects the last element
print(result)
Output:
['link5.html']
7. Get the content of the second-to-last element
# xpath_li.py
from lxml import etree
<a href="www.xxx.com">abcd</a>
html = etree.parse('hello.html')
result = html.xpath('//li[last()-1]/a')
# the .text attribute returns the element's content
print(result[0].text)
Output:
fourth item
8. Get the tag name of the element whose class value is "bold"
# xpath_li.py
from lxml import etree
html = etree.parse('hello.html')
result = html.xpath('//*[@class="bold"]')
# the .tag attribute returns the tag name
print(result[0].tag)
Output:
span
Data extraction with the BeautifulSoup module and CSS selectors (extension)
"""
BeautifulSoup is an efficient web-page parsing library that can extract data from HTML or XML files.
It supports different parsers: for HTML, for XML, and for HTML5.
It is a powerful, friendly and convenient parsing library and a staple crawler tool:
it processes pages efficiently, supports multiple parsers, and lets us grab page information
conveniently without writing regular expressions.
"""
# install: pip3 install beautifulsoup4
# tag selectors
### Selecting by tag
#### .string -- get the text content of a node
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="title" name="dromouse"><b><span>The Dormouse's story</span></b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """
from bs4 import BeautifulSoup  # import the package
soup = BeautifulSoup(html, 'lxml')  # arg 1: the html to parse; arg 2: the parser
# print(soup.prettify())  # pretty-print the completed code
print(soup.html.head.title.string)
print(soup.title.string)  # title is a node; the .string attribute returns its string text
# select the whole head, including the tag itself
print(soup.head)  # everything inside head, tag included
print(soup.p)  # returns the first match
### Getting the name
#### .name -- get the name of the tag itself
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # the result is the tag name itself --> title
print(soup.p.name)  # --> get the tag name
### Getting attribute values
#### .attrs -- get attribute values
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title asdas" name="abc" id = "qwe"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/123" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])  # get the value of the p tag's name attribute
print(soup.p.attrs['id'])  # get the value of the p tag's id attribute
print(soup.p['id'])  # a second way to write it
print(soup.p['class'])  # class is returned as a list
print(soup.a['href'])  # likewise, only the first match is returned
### Nested selection
The tags must form a parent-child chain.
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The abc Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.body.p.b.string)  # walk down layer by layer

### Child and descendant nodes
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story"> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"> <span>Elsie</span> </a> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> and they lived at the bottom of a well. </p> <p class="story">...</p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# the tag selector only returns part of the content, not all of it -- how do we get everything?
# the .contents attribute returns the tag's child nodes as a list
# print(soup.p.contents)  # all child nodes of the p tag, as a list
for i in soup.p.contents:
    print(i)
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story"> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"> <span>Elsie</span> </a> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> and they lived at the bottom of a well. </p> <p class="story">...</p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# .children is an iterator, not a list
print(soup.p.children)  # get the child nodes; returns an iterator
for i in soup.p.children:
    print(i)
for i, child in enumerate(soup.p.children):
    print(i, child)
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story"> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"> <span>Elsie</span> </a> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> and they lived at the bottom of a well. </p> <p class="story">...</p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # get the descendant nodes; returns a generator
for i, child in enumerate(soup.p.descendants):
    print(i, child)
### Parent and ancestor nodes
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story"> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"> <span>Elsie</span> </a> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> and they lived at the bottom of a well. </p> <p class="story">...</p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # get the parent node
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story"> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"> <span>Elsie</span> </a> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> and they lived at the bottom of a well. </p> <p class="story">...</p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))  # get the ancestor nodes

### Sibling nodes
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story"> <span>abcqweasd</span> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"> <span>Elsie</span> </a> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> and they lived at the bottom of a well. </p> <p class="story">...</p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))  # all following siblings
print('---'*15)
print(list(enumerate(soup.a.previous_siblings)))  # all preceding siblings
## Practical: the standard selectors
### find_all( name , attrs , recursive , text , **kwargs )
Searches the document by tag name, attributes, or content.
#### Searching by tag name with find_all
html=''' <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo-2</li> <li class="element">Bar-2</li> </ul> </div> </div> '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))  # get all ul tags with their contents
print(soup.find_all('ul')[0])
ul = soup.find_all('ul')
print(ul)  # the whole list of ul tags and their contents
print('____________'*10)
for ul in soup.find_all('ul'):
    # print(ul)  # iterate over the ul tags
    for li in ul:
        # print(li)  # iterate over the li tags
        print(li.string)  # print the text content of every li tag
#### Getting the text values
for ul in soup.find_all('ul'):
    for i in ul.find_all("li"):
        print(i.string)
#### Searching by attributes
html=''' <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul> </div> </div> '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# the first way: pass a dict via attrs
# print(soup.find_all(attrs={'id': 'list-1'}))  # by the id attribute
print("-----"*10)
# print(soup.find_all(attrs={'name': 'elements'}))  # by the name attribute
for ul in soup.find_all(attrs={'name': 'elements'}):
    print(ul)
    print(ul.li.string)  # only the first li's text is returned
    # print('-----')
    for li in ul:
        # print(li)
        print(li.string)
#### Special attribute searches
html=''' <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul> </div> </div> '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# the second way: keyword arguments
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so bs4 uses class_ instead
# recommended: search li tags by their class attribute
print(soup.find_all('li', {'class': 'element'}))
#### Selecting by text value
html=''' <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul> </div> </div> '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # useful for counting occurrences of some content
print(soup.find_all(text='Bar'))
print(len(soup.find_all(text='Foo')))  # count them
### find( name , attrs , recursive , text , **kwargs )
find returns a single element; find_all returns all of them.
html=''' <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul> </div> </div> '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))  # only the first match is returned
print(soup.find('li'))
print(soup.find('page'))  # returns None if the tag does not exist
### find_parents() and find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.

### find_next_siblings() and find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

### find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

### find_all_next() and find_next()
find_all_next() returns all qualifying nodes after the current node; find_next() returns the first one.

### find_all_previous() and find_previous()
find_all_previous() returns all qualifying nodes before the current node; find_previous() returns the first one.
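A minimal sketch of a few of these navigation methods (the three-paragraph html is invented):

from bs4 import BeautifulSoup

html = '<div><p>intro</p><p id="mid">middle</p><p>end</p></div>'
soup = BeautifulSoup(html, 'lxml')
mid = soup.find('p', id='mid')
print(mid.find_parent().name)       # 'div' -- the direct parent
print(mid.find_next_sibling())      # <p>end</p> -- the first following sibling
print(mid.find_previous_sibling())  # <p>intro</p> -- the first preceding sibling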
## CSS selectors
With select() you can make selections directly using CSS selectors.
If you are already familiar with CSS selectors from HTML work, this approach is worth considering.

Note:
1. When writing CSS selectors, tag names are used undecorated; class names get a leading . and id names a leading #.
2. The method is soup.select(), and the return type is a list.
3. Multiple conditions are separated by spaces and filter layer by layer, from left to right.
html=''' <div class="pan">q321312321</div> <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul> </div> </div> '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# hierarchy: ul li
print(soup.select('ul li'))  # tag names are not decorated
print("----"*10)
print(soup.select('.panel .panel-heading'))  # class names get a leading .
print("----"*10)
print(soup.select('#list-1 .element'))  # id names get a leading #
print("----"*10)
html=''' <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul> </div> </div> '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    for i in ul.select('li'):
        print(i.string)
### Getting attributes
html=''' <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul> </div> </div> '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# ul['id'] indexes an attribute directly; ul.attrs['class'] goes through the attrs dict
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['class'])
### Getting content
#### get_text()
html=''' <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element2">Foo</li> <li class="element2">Bar</li> </ul> </div> </div> '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.string)
    print(li.get_text())  # get the content
* The lxml parsing library is recommended; fall back to html.parser if necessary
* Tag-based selection filtering is weak but fast
* find() and find_all() are recommended for matching single or multiple results
* If you are familiar with CSS selectors, select() is recommended
* Remember the common ways to get attribute values and text content
Data extraction with CSS selectors
CSS syntax summary
Anyone familiar with front-end development already knows CSS selectors well; jQuery, for example, drives its DOM operations with the various CSS selector syntaxes.

Data extraction performance comparison

Using CSS selectors in a crawler; code walkthrough:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
# returns a Response object
>>> r = session.get('https://python.org/')
# get all the links on the page
>>> r.html.links
{'/users/membership/', '/about/gettingstarted/'}
# use a CSS selector to get an element
>>> about = r.html.find('#about')[0]
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure