[crawler] XPath for data extraction
2022-07-04 23:10:00 【Speech unrecognized】
Install
pip install lxml
Import
from lxml import etree
Usage
Convert the HTML string into an element object
# Convert the HTML string into an element object
from lxml import etree
element = etree.HTML(html_str)
From here on, content is extracted by calling .xpath('matching rule') on the element object
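A minimal sketch of this step. The HTML string is a made-up sample rather than the author's data. It shows that etree.HTML returns the root element and that .xpath() always returns a list.

```python
from lxml import etree

# Hypothetical HTML, used only for illustration.
html_str = """
<html>
  <body>
    <ol>
      <li><span class="title">First</span></li>
      <li><span class="title">Second</span></li>
    </ol>
  </body>
</html>
"""

# etree.HTML parses the string and returns the root <html> element.
element = etree.HTML(html_str)

# .xpath() returns a list: elements for tag queries, strings for text() queries.
spans = element.xpath("//span")
titles = element.xpath("//span/text()")

print(element.tag)  # html
print(len(spans))   # 2
print(titles)       # ['First', 'Second']
```

Because the result is a list even when only one node matches, index it (for example spans[0]) before reading a single value.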
Selecting tags
Use / to start from the root node and to separate the steps of a path
/html/xx/xx/xxx
Use // to select across levels and jump straight to the desired tag or text
//xxx # Get all xxx tags
Use . for the current node
./xxx # xxx children of the current node
Use .. for the parent node
../ # The parent of the current node
.// # Relative path; use it when querying from an element instead of the complete HTML document
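The path forms above can be compared side by side. This is a small sketch on invented HTML, and the ids a and b are placeholders. Note how //p called on an element still searches the whole document, while .//p stays inside that element.

```python
from lxml import etree

# Hypothetical HTML, used only for illustration.
html_str = """
<html><body>
  <div id="a"><p>one</p><p>two</p></div>
  <div id="b"><p>three</p></div>
</body></html>
"""
root = etree.HTML(html_str)

# Absolute path: every step is spelled out from the root.
abs_ps = root.xpath("/html/body/div/p")

# // skips levels and finds every <p> in the document.
all_ps = root.xpath("//p")

first_div = root.xpath("//div")[0]

# .// is relative: only <p> tags inside first_div.
inner_ps = first_div.xpath(".//p")

# // is absolute even when called on an element: it searches the whole document.
whole_doc_ps = first_div.xpath("//p")

# ./ selects direct children of the current node; .. selects its parent.
direct_ps = first_div.xpath("./p")
parent = first_div.xpath("..")[0]

print(len(abs_ps), len(all_ps))          # 3 3
print(len(inner_ps), len(whole_doc_ps))  # 2 3
print(len(direct_ps), parent.tag)        # 2 body
```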
Getting attributes
@attribute_name returns the value of that attribute on the current tag
//img/@src # The src attribute of every img
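A short sketch of attribute extraction, again on invented HTML. The file names and the href are placeholders. The second half uses Element.get(), which is lxml's per-element alternative and is not mentioned in the original text.

```python
from lxml import etree

# Hypothetical HTML, used only for illustration.
html_str = '<div><img src="a.png" alt="A"><img src="b.png"><a href="/x">x</a></div>'
root = etree.HTML(html_str)

# @src returns the attribute values as a list of strings.
srcs = root.xpath("//img/@src")
hrefs = root.xpath("//a/@href")

# Reading an attribute per element: .get() returns None when it is missing.
alts = [img.get("alt") for img in root.xpath("//img")]

print(srcs)   # ['a.png', 'b.png']
print(hrefs)  # ['/x']
print(alts)   # ['A', None]
```

Tags that lack the attribute are simply absent from the @src result, so the per-element form is handy when the values must stay aligned with the tags.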
Getting text
/text() returns the text content of a tag
//tag_name[contains(text(), 'some text')] selects tags whose text contains the given text
//ol/li//span[contains(text(),'Playable')]
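A sketch of both text queries on a made-up list. The class names title and playable and the movie names are assumptions chosen for illustration.

```python
from lxml import etree

# Hypothetical HTML, used only for illustration.
html_str = """
<ol>
  <li><span class="title">Movie A</span><span class="playable">[Playable]</span></li>
  <li><span class="title">Movie B</span></li>
  <li><span class="title">Movie C</span><span class="playable">[Playable]</span></li>
</ol>
"""
root = etree.HTML(html_str)

# /text() returns the text nodes as a list of strings.
titles = root.xpath("//ol/li/span[@class='title']/text()")

# contains() keeps the spans whose text includes the substring.
playable = root.xpath("//ol/li//span[contains(text(), 'Playable')]")

# From each matched span, step up with .. and read the title in the same <li>.
playable_titles = [s.xpath("../span[@class='title']/text()")[0] for s in playable]

print(titles)           # ['Movie A', 'Movie B', 'Movie C']
print(len(playable))    # 2
print(playable_titles)  # ['Movie A', 'Movie C']
```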
Selecting tags by condition
//tag_name[@attribute_name='value'] selects tags by attribute value
//span[@class='title'] # Select by class name
//tag_name[index] selects by position; indexes start at 1
Counting from the front: //parent_tag/tag_name[position()>3] selects from the 4th one onward
Counting from the back: //parent_tag/tag_name[last()] selects the last one; //parent_tag/tag_name[last()-2] selects the third from last
Combined: //ol/li[position()>1][position()<last()-2]
//tag_name[text()='value'] selects tags by their exact text content; the whole text must match
//ol/li//span[text()='[Playable]'] # Matches tags whose text is exactly [Playable]
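The position and exact-text predicates can be checked on a small generated list. This is a sketch only: the six items and the two spans are invented for the example.

```python
from lxml import etree

# Hypothetical HTML: six list items plus two spans with similar text.
items = "".join(f"<li>item{i}</li>" for i in range(1, 7))
html_str = f"<ol>{items}</ol><p><span>[Playable]</span><span>[Playable] now</span></p>"
root = etree.HTML(html_str)

# Indexes start at 1, not 0.
first = root.xpath("//ol/li[1]/text()")

# position()>3 keeps the 4th item onward.
after_third = root.xpath("//ol/li[position()>3]/text()")

# last() is the final item; last()-2 is the third from last.
last_item = root.xpath("//ol/li[last()]/text()")
third_from_last = root.xpath("//ol/li[last()-2]/text()")

# Chained predicates: drop the first item, then drop the tail.
combined = root.xpath("//ol/li[position()>1][position()<last()-2]/text()")

# text()='...' requires the whole text to match; contains() does not.
exact = root.xpath("//span[text()='[Playable]']")
partial = root.xpath("//span[contains(text(), 'Playable')]")

print(first)                     # ['item1']
print(after_third)               # ['item4', 'item5', 'item6']
print(last_item)                 # ['item6']
print(third_from_last)           # ['item4']
print(combined)                  # ['item2', 'item3']
print(len(exact), len(partial))  # 1 2
```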