当前位置:网站首页>Use of parsel
Use of parsel
2022-07-27 10:57:00 【W_ chuanqi】
Personal profile
Author's brief introduction : Hello everyone , I am a W_chuanqi, A programming enthusiast
Personal home page :W_chaunqi
Stand by me : give the thumbs-up + Collection ️+ Leaving a message.
May you and I share :“ If you are in the mire , The heart is also in the mire , Then all eyes are muddy ; If you are in the mire , And I miss Kun Peng , Then you can see 90000 miles of heaven and earth .”

List of articles
parsel Use
1. brief introduction
parsel This library can parse HTML and XML, And support the use of XPath and CSS Selectors extract and modify content , At the same time, it also integrates the extraction function of regular expressions .parsel Flexible and powerful , It's also Python The most popular crawler framework Scrapy Underlying support for .
2. preparation
Before we start , Please make sure it's installed parsel library , If not already installed , Use pip3 Just install it :
pip install parsel

Once installed , We can start this section .
3. initialization
First , Statement html The variables are as follows :
html = ''' <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> '''
next , We usually use parsel In the database Selector This class declares a Selector object , It is written as follows :
from parsel import Selector
selector = Selector(text=html)
So we've created one Selector object , Pass it on text Parameters , The content is just stated HTML character string , Then assign the created object selector Variable .
With Selector After object , We can use css and xpath Methods are passed in CSS Selectors and XPath Do content extraction , For example, here we want to extract class contain item-0 The node of , It is written as follows :
items = selector.css('.item-0')
print(len(items), type(items), items)
items2 = selector.xpath('//li[contains(@class,"item-0")]')
print(len(items2), type(items), items2)
First, use css Method to extract nodes , Then the length and content of the extraction result are output .xpath The method is the same , The operation results are as follows :

You can see that both results are SelectorList object , This is actually an iteratable object . use len Method gets the length of the result , All are 3. in addition , The node represented by the extraction result is the same , Is the first 1、3、5 individual 1i node , Each node is still represented by Selector Object's form return , Each of them Selector Object's data The attribute contains the corresponding extraction node HTML Code .
Here you may have a question , The first time is not with css Method to extract nodes ? Why in the result Selector Object outputs xpath Property instead of css attribute ? This is because in the css Behind the method , We passed on CSS The selector is first converted to XPath, What is really used for node extraction is XPath. among CSS The selector is converted to XPath The process of is from the bottom csselect This library implements , for example .item-0 This CSS The selector is converted to XPath The result is that descendant-or-self:[@class and contains(concat(‘’, normalize-space(@class),‘’),‘item-0’)], So the output Selector The object has xpath attribute . But don't worry , This has no effect on the extraction results , It's just a different representation .
4. Extract text
Since the result of the extraction just now is an iteratable object Selectorlist, So to get all the extracted 1i The text content of the node , It's time to traverse the results , It is written as follows :
from parsel import Selector
html = ''' <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> '''
selector = Selector(text=html)
items = selector.css('.item-0')
for item in items:
text = item.xpath('.//text()').get()
print(text)
Here we traverse items Variable , And assignment item, So here item Become a Selector object , At this time, you can call its css or xpath Method for content extraction . Here we use .//text() This XPath The writing method extracts all the contents of the current node , At this point, if you do not call other methods , Then the return result should still be Selector Constitutes an iteratable object Selectorlist.Selectorlist There is one of them. get Method , Can be SelectorList Contains Selector Extract the content of the object .
The operation results are as follows :

get The purpose of the method is to Solectortuist Extract the first one Selector object , Then output the results in this .
Let's take another example :
from parsel import Selector
html = ''' <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> '''
selector = Selector(text=html)
result = selector.xpath('//li[contains(@class,"item-0")]//text()').get()
print(result)
The output is as follows :

Here we use //li[contains(@class,“item-0”)]//text() All selected class contain item-0 The text content of the node . To be precise , Return results SelectorList It should correspond to three li object , And here get Method only returns the first li The text content of the object . Because it only extracts the first Selector The result of the object .
Is it possible to extract all Selector The method of corresponding content ? Yes , That's it getall Method . So if you want to extract all the corresponding li The text content of the node , The writing method can be rewritten as follows :
result = selector.xpath('//li[contains(@class,"item-0")]//text()').getall()
print(result)
The output is as follows :

Now , What we get is the result of list type , Each of them and Selector The object is —— Corresponding . therefore , If you want to extract Selectorlist The corresponding result , have access to get or getall Method , The former will get the first Selector The contents of the object , The latter will get each in turn Selector The result corresponding to the object .
In addition, in the above case , If you put xpath The method is rewritten as css Method , That's how it works :
result = selector.css('.item-0 *::text').getall()
print(result)
here * Used to extract all child nodes ( Include plain text nodes ), Extracting text requires adding ::text, The final running result is the same as above .
Come here , We simply understand the method of extracting text .
5. Extract attributes
Just now we demonstrated HTML Text extraction in , Directly in XPath Add //text() that will do , How to extract attributes ? The way is similar , It's also directly in XPath perhaps CSS Just show it in the selector .
For example, we extract the third li Inside the node a Node href attribute , It is written as follows :
from parsel import Selector
html = ''' <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> '''
selector = Selector(text=html)
result = selector.css('.item-0.active a::attr(href)').get()
print(result)
result = selector.xpath(
'//li[contains(@class,"item-0") and contains(@class,"active")]/a/@href').get()
print(result)
Here we realize two ways of writing , Use them separately css and xpath Method realization . We also include item-0 and active Two class On the basis of , To choose the third li node , Then I further selected the inside a node . about CSS Selectors , You need to add ::attr(), Only when the attributes corresponding to the parallel transmission are called can they be selected ; about XPath, Direct use /@ Add the attribute name to select . At last, we can use get Methods extract results .
The operation results are as follows :
[ Failed to transfer the external chain picture , The origin station may have anti-theft chain mechanism , It is suggested to save the pictures and upload them directly (img-lgNFIv53-1658841983527)(https://s2.loli.net/2022/07/23/UrVHAl5iMGvIzNe.png)]
We can see that both methods correctly extract the corresponding href attribute .
6. Regular extraction
Except for the common css and xpath Method ,Selector Object also provides regular expression extraction methods , Let's use an example to understand :
from parsel import Selector
html = ''' <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> '''
selector = Selector(text=html)
result = selector.css('.item-0').re('link.*')
print(result)
The operation results are as follows :

You can see ,re Method traverses all the extracted Selector object , Then according to the regular expression passed in , Find the source code of the node that meets the rules and return it in the form of a list .
Of course , If you're calling css When the method is used , Further results have been extracted , For example, the node text value is extracted , that re Method will only extract the text value of the node :
from parsel import Selector
html = ''' <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> '''
selector = Selector(text=html)
result = selector.css('.item-0 *::text').re('.*item')
print(result)
The operation results are as follows :

We can also use it re_first Method to extract the first result that conforms to the rule :
from parsel import Selector
html = ''' <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> '''
selector = Selector(text=html)
result = selector.css('.item-0').re_first('<span class="bold">(.*?)</span>')
print(result)
Here we call re_ first Method , Extracted by Span The text value contained in the tag , The extraction result is enclosed in parentheses to represent an extraction group , The final output is the part surrounded by parentheses , The operation results are as follows :

Through these examples , We know some ways to use regular matching ,re Corresponding to multiple results ,re_first Corresponding to a single result , In different cases, you can choose the appropriate method to extract .
边栏推荐
- 基于Spark封装的二次开发工程edata-base,介绍
- 发动机悬置系统冲击仿真-瞬时模态动态分析与响应谱分析
- Application scenarios, key technologies and network architecture of communication perception integration
- 招聘顶尖人才!旷视科技“MegEagle创视者计划”正式启动
- Views, triggers and stored procedures in MySQL
- 全校软硬件基础设施一站式监控 ,苏州大学以时序数据库替换 PostgreSQL
- Detailed analysis of graphs of echats diagram les miserables (chord diagram)
- 文档智能多模态预训练模型LayoutLMv3:兼具通用性与优越性
- How to modify the strict mode under MySQL so that adding new users by inserting user table is successful
- 开源项目丨Taier1.2版本发布,新增工作流、租户绑定简化等多项功能
猜你喜欢

游戏玩家问题

基于Spark封装的二次开发工程edata-base,介绍

Views, triggers and stored procedures in MySQL

Alibaba mailbox web login turn processing

ECCV 2022 | 同时完成四项跟踪任务!Unicorn: 迈向目标跟踪的大统一

TDengine 商业生态合作伙伴招募开启

Webrtc realizes simple audio and video call function

异构计算技术分析

TDengine 助力西门子轻量级数字化解决方案 SIMICAS 简化数据处理流程

Recruit top talents! The "megeagle creator program" of Kuangshi technology was officially launched
随机推荐
MySQL 索引、事务与存储引擎
Using skills of word
Want to speed up the vit model with one click? Try this open source tool!
推导STO双中心动能积分的详细展开式
[brother hero July training] day 16: queue
Overview of data security in fog computing
WEB服务如何平滑的上下线
ECCV 2022 | complete four tracking tasks at the same time! Unicorn: towards the unification of target tracking
Iptables prevent nmap scanning and binlog explanation
搭建 Samba 服务
MySQL deadlock, pessimistic lock, optimistic lock
发动机悬置系统冲击仿真-瞬时模态动态分析与响应谱分析
Self optimization of wireless cell load balancing based on machine learning technology
flask_restful中的输出域(Resource、fields、marshal、marshal_with)
Server access speed
Beijing publicized the spot check of 8 batches of children's shoes, and qierte was listed as unqualified
Deep analysis: what is diffusion model?
Gamer questions
迭代次数和熵之间关系的一个验证试验
The largest square of leetcode (violence solving and dynamic programming solving)