[crawler] XPath for data extraction
2022-07-04 23:10:00 【Speech unrecognized】
Install
pip install lxml
Import
from lxml import etree
Usage
Convert an HTML string into an element object
# Convert the HTML string into an element object
from lxml import etree
element = etree.HTML(html_str)
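As a minimal sketch of the conversion step, the snippet below parses a small HTML fragment. The `html_str` value is a made-up example, not taken from a real page.

```python
from lxml import etree

# Hypothetical HTML fragment, used only for illustration.
html_str = "<div><p>hello</p></div>"

# etree.HTML uses a lenient HTML parser; a bare fragment is wrapped
# in <html><body> so the result is a full tree.
element = etree.HTML(html_str)

print(element.tag)
print(etree.tostring(element, encoding="unicode"))
```

The returned object is the root `html` element, and it is the object on which `.xpath()` is called in the rest of these notes.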
Content is then extracted by calling .xpath('matching rule') on the element object
Selecting tags
Use /
Selects from the root node and separates the steps of a path
/html/xx/xx/xxx
Use //
Selects across any number of levels, going straight to the desired tag or text
//xxx # selects all xxx tags
Use .
./ # the current node
Use ..
../ # the parent of the current node
Use .//
Selects descendants relative to the current node; use it when querying from an element rather than from the complete HTML document
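The path forms above can be compared side by side. This is a sketch on a made-up document; the ids and list items are assumptions chosen only to make the counts easy to check.

```python
from lxml import etree

# Hypothetical document with two lists, for illustration only.
html_str = """
<html><body>
  <div id="a"><ul><li>one</li><li>two</li></ul></div>
  <div id="b"><ul><li>three</li></ul></div>
</body></html>
"""
element = etree.HTML(html_str)

# Absolute path: every step from the root is spelled out.
abs_items = element.xpath("/html/body/div/ul/li")

# // skips intermediate levels and selects every li in the document.
all_items = element.xpath("//li")

# Take the first div and query relative to it.
first_div = element.xpath("//div")[0]
rel_items = first_div.xpath(".//li")          # only li inside this div
doc_items = first_div.xpath("//li")           # still searches the whole document
parent = rel_items[0].xpath("..")[0]          # parent of the first li

print(len(abs_items), len(all_items), len(rel_items), len(doc_items), parent.tag)
```

Note that `//li` called on a sub-element still starts from the document root, which is why `.//` is the form to use when extracting from a node that has already been selected.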
Getting attributes
@attribute-name
Returns the value of that attribute on the current tag
//img/@src # the src attribute of every img
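A short sketch of attribute extraction, assuming a made-up fragment with two images:

```python
from lxml import etree

# Hypothetical fragment; the file names are placeholders.
html_str = '<div><img src="a.png" alt="A"><img src="b.png"></div>'
element = etree.HTML(html_str)

# Each match is returned as a string, in document order.
srcs = element.xpath("//img/@src")

# Tags that lack the attribute simply contribute nothing.
alts = element.xpath("//img/@alt")

print(srcs, alts)
```

The result is a list of strings rather than elements, so it can be used directly without reading `.attrib`.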
Getting text
/text()
Returns the text content of the tag
//tag-name[contains(text(), 'text')]
Selects tags whose text contains the given string
//ol/li//span[contains(text(),'Playable')]
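The two text rules can be tried on a small list. The markup below is an assumed stand-in for the kind of page the example expression targets, not the real page.

```python
from lxml import etree

# Hypothetical list markup, for illustration only.
html_str = """
<ol>
  <li><span>Movie A</span><span>[Playable]</span></li>
  <li><span>Movie B</span></li>
</ol>
"""
element = etree.HTML(html_str)

# /text() returns the text nodes directly under each matched span.
texts = element.xpath("//ol/li/span/text()")

# contains() keeps only the spans whose text includes the substring.
playable = element.xpath("//ol/li//span[contains(text(),'Playable')]")

print(texts)
print([span.text for span in playable])
```

`/text()` yields strings, while the `contains()` query yields elements, so the text of each match is read through `.text`.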
Selecting tags by condition
//tag-name[@attribute-name=value]
Locates specific tags by attribute value
//span[@class='title'] # select by class name
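A sketch of the attribute predicate on made-up markup. It also shows that `=` compares the whole attribute value, so a tag with several class names is not matched; using `contains()` on the attribute is one looser alternative, shown here as an assumption about what a reader may want rather than something the notes prescribe.

```python
from lxml import etree

# Hypothetical fragment; class names are placeholders.
html_str = (
    '<div>'
    '<span class="title">A</span>'
    '<span class="title main">B</span>'
    '<span class="other">C</span>'
    '</div>'
)
element = etree.HTML(html_str)

# Exact comparison against the full attribute value.
exact = element.xpath("//span[@class='title']/text()")

# Substring comparison; looser, and would also match e.g. "subtitle".
loose = element.xpath("//span[contains(@class,'title')]/text()")

print(exact, loose)
```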
//tag-name[index]
Indexes start from 1
Counting from the front: //parent-tag/tag-name[position()>3]
Selects from the 4th element onward
Counting from the back: //parent-tag/tag-name[last()]
Selects the last element
//parent-tag/tag-name[last()-2]
Selects the third element from the end
Combined: //ol/li[position()>1][position()<last()-2]
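The index rules are easiest to check on a list of known length. The sketch below builds a hypothetical seven-item list and applies each expression to it.

```python
from lxml import etree

# Hypothetical list with items item1 .. item7.
html_str = "<ol>" + "".join(f"<li>item{i}</li>" for i in range(1, 8)) + "</ol>"
element = etree.HTML(html_str)

first = element.xpath("//ol/li[1]/text()")                     # indexes start at 1
after_third = element.xpath("//ol/li[position()>3]/text()")    # 4th onward
last = element.xpath("//ol/li[last()]/text()")                 # last item
third_from_last = element.xpath("//ol/li[last()-2]/text()")    # third from the end

# Chained predicates: the second one is evaluated against the
# nodes that survived the first one.
combined = element.xpath("//ol/li[position()>1][position()<last()-2]/text()")

print(first, after_third, last, third_from_last, combined)
```

With seven items, the combined expression drops the first item and the last three, leaving item2 through item4.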
//tag-name[text()='value']
Locates a tag by its exact text content; the whole string must match
//ol/li//span[text()='[Playable]'] # matches tags whose text is exactly [Playable]
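To illustrate the difference between an exact text match and `contains()`, here is a sketch on assumed markup in which one span has extra text after the label:

```python
from lxml import etree

# Hypothetical markup; the second span has trailing text.
html_str = (
    "<ol>"
    "<li><span>[Playable]</span></li>"
    "<li><span>[Playable] soon</span></li>"
    "</ol>"
)
element = etree.HTML(html_str)

# text()='...' requires the text node to equal the string exactly.
exact = element.xpath("//ol/li//span[text()='[Playable]']")

# contains() accepts any span whose text includes the string.
partial = element.xpath("//ol/li//span[contains(text(),'[Playable]')]")

print(len(exact), len(partial))
```

Surrounding whitespace also counts toward the comparison, so an exact match can fail on pages that pad the text with spaces or newlines.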