Use of parsel

2022-07-27 10:57:00 【W_ chuanqi】

About the author
Author: Hello everyone, I'm W_chuanqi, a programming enthusiast.
Personal home page: W_chaunqi
Support me: like, bookmark, and leave a comment.
A motto to share: "If you are stuck in the mire and your heart is stuck there too, everything you see is muddy; if you are stuck in the mire but your heart longs for the Kunpeng, you can see ninety thousand miles of heaven and earth."

Using parsel

1. Introduction

parsel is a library that can parse HTML and XML. It supports extracting and modifying content with XPath and CSS selectors, and it also integrates regular expression extraction. parsel is flexible and powerful, and it provides the underlying selector support for Scrapy, Python's most popular crawler framework.

2. Preparation

Before we start, make sure the parsel library is installed. If it is not, install it with pip (pip3 on some systems):

pip install parsel

Once it is installed, we can begin.

3. Initialization

First, declare the html variable as follows:

html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''

Next, we typically use the Selector class from the parsel library to create a Selector object, like this:

from parsel import Selector
selector = Selector(text=html)

This creates a Selector object. We pass the HTML string declared above as the text parameter and assign the resulting object to the selector variable.

Once we have a Selector object, we can pass a CSS selector to its css method or an XPath expression to its xpath method to extract content. For example, to extract the nodes whose class contains item-0, we can write:

items = selector.css('.item-0')
print(len(items), type(items), items)
items2 = selector.xpath('//li[contains(@class,"item-0")]')
print(len(items2), type(items2), items2)

First we use the css method to extract the nodes and print the length, type, and content of the result. The xpath method is used in the same way. Running the code prints both results.

Both results are SelectorList objects, which are iterable. len gives the length of each result, which is 3 in both cases. The nodes in the two results are also the same: the 1st, 3rd, and 5th li nodes. Each node is returned as a Selector object, and the data attribute of each Selector object contains the HTML code of the extracted node.

You may have a question here: the first extraction used the css method, so why does the Selector object in its output show an xpath attribute instead of a css attribute? The reason is that the CSS selector we pass to the css method is first converted to XPath, and it is the XPath that actually performs the node extraction. The conversion from CSS selector to XPath is implemented by the underlying cssselect library. For example, the CSS selector .item-0 is converted to the XPath descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' item-0 ')], which is why the output Selector objects have an xpath attribute. There is no need to worry, though: this has no effect on the extraction result; it is just a different representation.

4. Extract text

Since the result extracted above is an iterable SelectorList, getting the text of every extracted li node means iterating over the result, as follows:

from parsel import Selector
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
selector = Selector(text=html)
items = selector.css('.item-0')
for item in items:
    text = item.xpath('.//text()').get()
    print(text)

Here we iterate over items and bind each element to item. Each item is a Selector object, so we can call its css or xpath method to extract content. The XPath .//text() selects all the text inside the current node. If we called nothing else at this point, the return value would still be a SelectorList, an iterable made up of Selector objects. SelectorList has a get method that extracts the content of the Selector objects it contains.

The results are as follows:

first item
third item
fifth item

The get method takes the first Selector object in the SelectorList and returns its content.

Let's look at another example:

from parsel import Selector
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
selector = Selector(text=html)
result = selector.xpath('//li[contains(@class,"item-0")]//text()').get()
print(result)

The output is as follows:

first item

Here //li[contains(@class,"item-0")]//text() selects the text of every node whose class contains item-0. Strictly speaking, the returned SelectorList holds the text nodes of three li nodes, but get returns only the text of the first one, because it extracts the result of only the first Selector object.

Is there a method that extracts the content of every Selector? Yes: the getall method. To extract the text of all the matching li nodes, the code can be rewritten as follows:

result = selector.xpath('//li[contains(@class,"item-0")]//text()').getall()
print(result)

The output is as follows:

['first item', 'third item', 'fifth item']

This time the result is a list, and its elements correspond one-to-one to the Selector objects. So to extract the results held in a SelectorList, use get or getall: the former returns the content of the first Selector object, and the latter returns the content of every Selector object in order.

The same extraction can also be written with the css method instead of the xpath method:

result = selector.css('.item-0 *::text').getall()
print(result)

Here * selects all descendant nodes (including plain text nodes), and ::text must be appended to extract text. The final result is the same as above.

That covers the basics of extracting text.

5. Extract attributes

We have just seen how to extract text from HTML by appending //text() to the XPath. How do we extract attributes? The approach is similar: express the attribute directly in the XPath or CSS selector.

For example, to extract the href attribute of the a node inside the third li node, we can write:

from parsel import Selector
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
selector = Selector(text=html)
result = selector.css('.item-0.active a::attr(href)').get()
print(result)
result = selector.xpath(
    '//li[contains(@class,"item-0") and contains(@class,"active")]/a/@href').get()
print(result)

Here we implement two versions, one with the css method and one with the xpath method. In both, we select the third li node by requiring that its class contain both item-0 and active, and then select the a node inside it. With a CSS selector, append ::attr() and pass the attribute name to select that attribute; with XPath, append /@ followed by the attribute name. Finally, the get method extracts the result.

The results are as follows:

link3.html
link3.html

We can see that both methods correctly extract the href attribute.

6. Regular expression extraction

Besides the commonly used css and xpath methods, the Selector object also provides regular expression extraction. Let's look at an example:

from parsel import Selector
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
selector = Selector(text=html)
result = selector.css('.item-0').re('link.*')
print(result)

The results are as follows:

['link3.html"><span class="bold">third item</span></a></li>', 'link5.html">fifth item</a></li>']

As you can see, the re method iterates over all the extracted Selector objects, applies the regular expression to the source code of each node, and returns the matching parts as a list.

Of course, if the css call has already extracted something more specific, such as the text of the nodes, then re matches only against that text:

from parsel import Selector
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
selector = Selector(text=html)
result = selector.css('.item-0 *::text').re('.*item')
print(result)

The results are as follows:

['first item', 'third item', 'fifth item']

We can also use the re_first method to extract the first result that matches the pattern:

from parsel import Selector
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
selector = Selector(text=html)
result = selector.css('.item-0').re_first('<span class="bold">(.*?)</span>')
print(result)

Here we call the re_first method to extract the text inside the span tag. The part to extract is wrapped in parentheses to form a capture group, and the final output is the part matched inside the parentheses. The result is as follows:

third item

These examples show how to use regular expression matching: re returns multiple results and re_first returns a single result, so choose whichever suits the situation.

Copyright notice
This article was written by [W_ chuanqi]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/208/202207271027214665.html