Use of parsel
2022-07-27 10:57:00 【W_ chuanqi】
About the author
Author: Hello everyone, I am W_chuanqi, a programming enthusiast.
Personal home page: W_chaunqi
Support me: like, bookmark, and leave a comment.
A thought to share: "If you are in the mire and your heart is in the mire too, then everything you see is mud; if you are in the mire but your heart longs for the Kunpeng, then you can see ninety thousand miles of heaven and earth."

Using parsel
1. Introduction
parsel is a library that can parse HTML and XML and supports extracting and modifying content with XPath and CSS selectors; it also integrates regular expression extraction. parsel is flexible and powerful, and it is the underlying selector library of Scrapy, the most popular Python crawler framework.
2. Preparation
Before we start, make sure the parsel library is installed. If it is not, install it with pip3:
pip3 install parsel

Once it is installed, we can begin.
3. Initialization
First, declare the html variable as follows:
html = '''
<div>
  <ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
Next, we use the Selector class from parsel to declare a Selector object:
from parsel import Selector
selector = Selector(text=html)
This creates a Selector object: we pass the HTML string declared above as the text argument and assign the resulting object to the selector variable.
With a Selector object in hand, we can call its css and xpath methods, passing a CSS selector or an XPath expression respectively, to extract content. For example, to extract the nodes whose class contains item-0:
items = selector.css('.item-0')
print(len(items), type(items), items)
items2 = selector.xpath('//li[contains(@class,"item-0")]')
print(len(items2), type(items2), items2)
We first extract nodes with the css method and print the length, type, and content of the result; we then do the same with the xpath method.

Both results are SelectorList objects, which are iterable. len gives the length of each result, which is 3 in both cases. The two results also contain the same nodes, namely the 1st, 3rd, and 5th li nodes. Each node is still returned as a Selector object, and the data attribute of each Selector holds the HTML code of the extracted node.
You may wonder: the first extraction used the css method, so why does the Selector object in its result show an xpath attribute rather than a css one? The reason is that the css method first converts the CSS selector we pass in to XPath, and it is the XPath that actually performs the extraction. The conversion is done by the underlying cssselect library. For example, the CSS selector .item-0 is converted to the XPath descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' item-0 ')], which is why the printed Selector objects carry an xpath attribute. This has no effect on the extraction result; it is just a different representation.
4. Extract text
Since the result of the extraction above is an iterable SelectorList, we need to iterate over it to get the text content of each extracted li node:
from parsel import Selector
html = '''
<div>
  <ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
selector = Selector(text=html)
items = selector.css('.item-0')
for item in items:
    text = item.xpath('.//text()').get()
    print(text)
Here we iterate over items, binding each element to item, so item is a Selector object whose css or xpath method we can call to extract content. The XPath .//text() selects all text nodes under the current node. Without any further method call, the return value is still a SelectorList made up of Selector objects. SelectorList has a get method that extracts the content of the Selector objects it contains.
The output is as follows:
first item
third item
fifth item
The get method takes the first Selector object in the SelectorList and returns its content.
Let's take another example :
from parsel import Selector
html = '''
<div>
  <ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
selector = Selector(text=html)
result = selector.xpath('//li[contains(@class,"item-0")]//text()').get()
print(result)
The output is as follows:
first item
Here we use //li[contains(@class,"item-0")]//text() to select the text content of every node whose class contains item-0. Strictly speaking, the resulting SelectorList corresponds to three li nodes, but get returns the text of only the first one, because it extracts the result of the first Selector object only.
Is there a method that extracts the content of every Selector? Yes: getall. To extract the text of all the matching li nodes, the code can be rewritten as:
result = selector.xpath('//li[contains(@class,"item-0")]//text()').getall()
print(result)
The output is as follows:
['first item', 'third item', 'fifth item']
This time we get a list whose elements correspond one-to-one with the Selector objects. So to extract the results from a SelectorList, use get or getall: the former returns the content of the first Selector object, and the latter returns the content of every Selector object in order.
If we rewrite the xpath call above with the css method, it looks like this:
result = selector.css('.item-0 *::text').getall()
print(result)
Here * selects all descendant nodes (including plain text nodes), and extracting text requires appending ::text. The result is the same as above.
With that, we have a basic grasp of how to extract text.
5. Extract attributes
We have just seen how to extract text from HTML by appending //text() to the XPath. How do we extract attributes? The approach is similar: express the attribute directly in the XPath or CSS selector.
For example, to extract the href attribute of the a node inside the third li node:
from parsel import Selector
html = '''
<div>
  <ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
selector = Selector(text=html)
result = selector.css('.item-0.active a::attr(href)').get()
print(result)
result = selector.xpath(
    '//li[contains(@class,"item-0") and contains(@class,"active")]/a/@href').get()
print(result)
Here we write the extraction in two ways, once with the css method and once with the xpath method. Both first select the third li node, whose class contains both item-0 and active, and then the a node inside it. With a CSS selector, append ::attr() and pass the attribute name to select the attribute; with XPath, append /@ followed by the attribute name. Finally, get extracts the result.
The output is as follows:
link3.html
link3.html
Both approaches correctly extract the href attribute.
6. Regular expression extraction
Besides the common css and xpath methods, Selector objects also provide regular expression extraction. Let's look at an example:
from parsel import Selector
html = '''
<div>
  <ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
selector = Selector(text=html)
result = selector.css('.item-0').re('link.*')
print(result)
The output is as follows:
['link3.html"><span class="bold">third item</span></a></li>', 'link5.html">fifth item</a></li>']
As you can see, the re method iterates over all the extracted Selector objects, searches the source code of each node with the regular expression passed in, and returns the matches as a list.
If the css call has already narrowed the result further, for example to the text of the nodes, then re matches against that text only:
from parsel import Selector
html = '''
<div>
  <ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
selector = Selector(text=html)
result = selector.css('.item-0 *::text').re('.*item')
print(result)
The output is as follows:
['first item', 'third item', 'fifth item']
We can also use the re_first method to extract only the first match:
from parsel import Selector
html = '''
<div>
  <ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
selector = Selector(text=html)
result = selector.css('.item-0').re_first('<span class="bold">(.*?)</span>')
print(result)
Here we call re_first to extract the text inside the span tag. The part of the pattern to extract is wrapped in parentheses to form a capture group, and the output is the part matched by that group. The output is as follows:
third item
These examples show how to use regular expression matching: re returns multiple results and re_first returns a single result, so choose whichever fits the situation.