[crawler] XPath for data extraction
2022-07-04 23:10:00 【Speech unrecognized】
Install
pip install lxml
Import
from lxml import etree
Usage
Convert an HTML string to an element object
# Convert an HTML string to an element object
from lxml import etree
element = etree.HTML(html_str)
Content is then extracted by calling element.xpath('matching rule') on the element object.
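As a minimal sketch of the conversion and extraction steps described above: the markup in html_str is a made-up sample used only for illustration, and the variable names headings and paragraphs are likewise assumptions.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_str = """
<html>
  <body>
    <h1>Movies</h1>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

# Parse the HTML string into an element object.
element = etree.HTML(html_str)

# For node-selecting expressions like these, .xpath() returns a list,
# even when only one node matches.
headings = element.xpath("//h1/text()")  # list of text strings
paragraphs = element.xpath("//p")        # list of element objects

print(headings)
print(len(paragraphs))
```

Because the result is a list, a single value is usually taken with an index such as headings[0], after checking that the list is not empty.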
Get tags
Use /
Represents the root node, and separates one step of a path from the next
/html/xx/xx/xxx
Use //
Selects across levels, going directly to the desired tag or text at any depth
//xxx # get all xxx tags
Use .
./ # the current node
Use ..
../ # the parent of the current node
Use .//
Use this when working from a node rather than the complete HTML document; the path is resolved relative to the current node
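The path forms above can be compared side by side in a short sketch. The markup and variable names here are hypothetical; the point is only to show how /, //, .. and .// behave when called on different element objects.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_str = """
<html>
  <body>
    <div id="main">
      <ul>
        <li><a href="/a">A</a></li>
        <li><a href="/b">B</a></li>
      </ul>
    </div>
    <a href="/outside">Outside</a>
  </body>
</html>
"""
element = etree.HTML(html_str)

# Absolute path from the root, one step per level.
items = element.xpath("/html/body/div/ul/li")

# // matches at any depth, so this finds every <a> in the document.
all_links = element.xpath("//a")

# .// starts from the current node, so only links inside the <div> match.
div = element.xpath("//div")[0]
links_in_div = div.xpath(".//a")

# A path starting with // still searches the whole document,
# even when it is called on the <div> element.
links_from_root = div.xpath("//a")

# .. steps up to the parent of the current node.
parent_of_first_item = items[0].xpath("..")[0]

print(len(items), len(all_links), len(links_in_div), len(links_from_root))
print(parent_of_first_item.tag)
```

When looping over a list of matched elements, the .// form is usually the one wanted, since it keeps each query inside the element being processed.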
Get attributes
@attribute_name
Gets the value of that attribute on the current tag
//img/@src # the src attribute of every img tag
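A small sketch of attribute extraction, assuming a made-up fragment with three img tags, one of which has no src attribute. The names srcs, first_img and alt are illustrative only.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_str = """
<div>
  <img src="/static/a.png" alt="first">
  <img src="/static/b.png" alt="second">
  <img alt="no source">
</div>
"""
element = etree.HTML(html_str)

# @src returns the attribute values as strings; tags without
# the attribute contribute nothing to the result.
srcs = element.xpath("//img/@src")

# An attribute can also be read from an element object with .get().
first_img = element.xpath("//img")[0]
alt = first_img.get("alt")

print(srcs)
print(alt)
```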
Get text
/text()
Gets the text content of the tag
//tag_name[contains(text(), 'text')]
Gets the tags whose text contains the given text
//ol/li//span[contains(text(),'Playable')]
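The two text rules above might be used as follows. This is a sketch on a made-up list of movies; the markup and variable names are assumptions.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_str = """
<ol>
  <li><a>Movie One</a> <span>[Playable]</span></li>
  <li><a>Movie Two</a> <span>[Coming soon]</span></li>
</ol>
"""
element = etree.HTML(html_str)

# /text() returns the text directly inside each matched tag.
titles = element.xpath("//ol/li/a/text()")

# contains() keeps only the tags whose text includes the given substring.
playable_spans = element.xpath("//ol/li//span[contains(text(), 'Playable')]")

print(titles)
print([span.text for span in playable_spans])
```

Note that /text() yields strings, while the contains() query yields element objects, whose text is read through the .text attribute.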
Get tags that meet specific conditions
//tag_name[@attribute_name='value']
Locates specific tags by attribute value
//span[@class='title'] # select by class name
//tag_name[index]
The index starts at 1
From the front: //parent_tag/tag_name[position()>3]
Selects from the 4th tag onward
From the back: //parent_tag/tag_name[last()]
Selects the last tag
//parent_tag/tag_name[last()-2]
Selects the third tag from the end
Combined: //ol/li[position()>1][position()<last()-2]
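The index rules above can be checked on a list of six items. This is a sketch with generated, hypothetical markup; the variable names are illustrative.

```python
from lxml import etree

# Hypothetical list of six items: "item 1" ... "item 6".
html_str = "<ol>" + "".join(f"<li>item {i}</li>" for i in range(1, 7)) + "</ol>"
element = etree.HTML(html_str)

# The index starts at 1, not 0.
first = element.xpath("//ol/li[1]/text()")

# Everything from the 4th item onward.
after_third = element.xpath("//ol/li[position()>3]/text()")

# The last item, and the third item from the end.
last = element.xpath("//ol/li[last()]/text()")
third_from_last = element.xpath("//ol/li[last()-2]/text()")

# Chained predicates: the second predicate counts positions within
# the nodes that passed the first one.
middle = element.xpath("//ol/li[position()>1][position()<last()-2]/text()")

print(first, after_third, last, third_from_last, middle)
```

With six items, the first predicate in the combined rule keeps items 2 to 6, and the second keeps the first two of those five, so middle holds items 2 and 3.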
//tag_name[text()='value']
Locates specific tags by their exact text content; the whole text must match
//ol/li//span[text()='[Playable]'] # matches tags whose content is exactly [Playable]
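To illustrate the difference between an exact text match and a contains() match, together with selection by attribute value, here is a sketch on made-up markup. The class names title and state and the variable names are assumptions.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_str = """
<ol>
  <li><span class="title">Movie One</span><span class="state">[Playable]</span></li>
  <li><span class="title">Movie Two</span><span class="state">[Playable soon]</span></li>
</ol>
"""
element = etree.HTML(html_str)

# Select by attribute value: only spans whose class is exactly "title".
titles = element.xpath("//span[@class='title']/text()")

# text()='...' requires the whole text to be equal to the value.
exact = element.xpath("//ol/li//span[text()='[Playable]']")

# contains() also accepts longer text that includes the value.
partial = element.xpath("//ol/li//span[contains(text(), 'Playable')]")

print(titles)
print(len(exact), len(partial))
```

The attribute comparison is also an exact string comparison, so a tag written as class="title big" would not be matched by [@class='title'].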