[crawler] XPath for data extraction
2022-07-04 23:10:00 【Speech unrecognized】
Install
pip install lxml
Import
from lxml import etree
Usage
Convert an HTML string into an element object
# Convert the HTML string into an element object
from lxml import etree
element = etree.HTML(html_str)
From here on, content is extracted by calling element.xpath('matching rule') on the element object.
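As a minimal sketch of the two steps above, the snippet below parses a string and runs one query. The sample markup in html_str is made up for illustration and is not from the original post.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
html_str = """
<html>
  <body>
    <h1>Top films</h1>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

# etree.HTML parses the string with lxml's HTML parser and returns the root element.
element = etree.HTML(html_str)

# .xpath() returns a list; for a tag query it is a list of matching elements.
paragraphs = element.xpath("//p")

print(element.tag)         # html
print(len(paragraphs))     # 2
print(paragraphs[0].text)  # First paragraph
```

Because .xpath() always returns a list for node queries, an empty list (not None) means nothing matched.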
Get tags
Use / to start from the root node and to separate the steps of a path
/html/xx/xx/xxx
Use // to select across levels and jump straight to the desired tag or text
//xxx # all xxx tags
Use .
./ # the current node
Use ..
../ # the parent of the current node
.// # relative path; use it when querying from a sub-element rather than the whole HTML document
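The path forms above can be compared side by side. This is a small sketch on made-up markup (the div ids and list items are assumptions); it also shows why .// matters when the query starts from a sub-element.

```python
from lxml import etree

# Hypothetical markup with two similar blocks.
html_str = """
<html><body>
  <div id="a"><ul><li>one</li><li>two</li></ul></div>
  <div id="b"><ul><li>three</li></ul></div>
</body></html>
"""
element = etree.HTML(html_str)

# Absolute path: every step from the root is spelled out.
abs_items = element.xpath("/html/body/div/ul/li")

# // skips the intermediate levels.
all_items = element.xpath("//li")

first_div = element.xpath("//div")[0]

# .// searches only below first_div.
rel_items = first_div.xpath(".//li")

# //li still searches the whole document, even when called on first_div.
doc_items = first_div.xpath("//li")

# ./ selects direct children of the current node, .. selects its parent.
children = first_div.xpath("./ul")
parent = first_div.xpath("..")[0]

print(len(abs_items), len(all_items))  # 3 3
print(len(rel_items), len(doc_items))  # 2 3
print(len(children), parent.tag)       # 1 body
```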
Get attributes
@attribute_name returns the value of that attribute on the current tag
//img/@src # the src attribute of every img
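A short sketch of attribute extraction, assuming a made-up fragment with two images and one link. The result is a list of strings, one per matching attribute.

```python
from lxml import etree

# Hypothetical fragment; etree.HTML wraps it in <html><body> automatically.
html_str = '<div><img src="/a.png" alt="A"><img src="/b.png"><a href="/next">next</a></div>'
element = etree.HTML(html_str)

srcs = element.xpath("//img/@src")
hrefs = element.xpath("//a/@href")
alts = element.xpath("//img/@alt")  # tags without the attribute are simply skipped

print(srcs)   # ['/a.png', '/b.png']
print(hrefs)  # ['/next']
print(alts)   # ['A']
```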
Get text
/text() returns the text content of a tag
//tag_name[contains(text(), 'some text')] returns the tags whose text contains the given words
//ol/li//span[contains(text(),'Playable')]
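The two text queries can be tried on a small list. The markup below is an assumption made for illustration; only the XPath expressions come from the text above.

```python
from lxml import etree

# Hypothetical list: two of the three entries carry a "[Playable]" marker.
html_str = """
<ol>
  <li><a>Film A</a><span>[Playable]</span></li>
  <li><a>Film B</a></li>
  <li><a>Film C</a><span>[Playable]</span></li>
</ol>
"""
element = etree.HTML(html_str)

# /text() returns the text nodes as a list of strings.
titles = element.xpath("//ol/li/a/text()")

# contains() keeps the elements whose text includes the substring.
playable = element.xpath("//ol/li//span[contains(text(),'Playable')]")

print(titles)         # ['Film A', 'Film B', 'Film C']
print(len(playable))  # 2
```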
Get tags by specific conditions
//tag_name[@attribute_name='value'] locates specific tags by attribute value
//span[@class='title'] # e.g. select by class name
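One detail worth checking in practice: [@class='title'] compares the whole attribute value, so a tag with several classes does not match. The sketch below uses made-up markup; the contains(@class, ...) variant is an addition for comparison, not something the original post covers.

```python
from lxml import etree

# Hypothetical markup: the second span has two class names.
html_str = """
<div>
  <span class="title">Film A</span>
  <span class="title other">Film B</span>
  <span class="rating">9.0</span>
</div>
"""
element = etree.HTML(html_str)

# Exact comparison against the full attribute string.
exact = element.xpath("//span[@class='title']/text()")

# Substring comparison; looser, and may also match names like "subtitle".
loose = element.xpath("//span[contains(@class,'title')]/text()")

print(exact)  # ['Film A']
print(loose)  # ['Film A', 'Film B']
```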
//tag_name[index] the index starts from 1
Counting from the front: //parent_tag/tag_name[position()>3] starts from the 4th
Counting from the back: //parent_tag/tag_name[last()] gets the last one
//parent_tag/tag_name[last()-2] gets the third from last
Combined: //ol/li[position()>1][position()<last()-2]
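To make the index rules concrete, here is a sketch over a made-up list of seven items. The helper texts() is hypothetical and only shortens the calls.

```python
from lxml import etree

# Hypothetical list with items numbered 1 to 7.
html_str = "<ol>" + "".join(f"<li>item {i}</li>" for i in range(1, 8)) + "</ol>"
element = etree.HTML(html_str)


def texts(expr):
    """Hypothetical helper: return the text of every node matched by expr."""
    return element.xpath(expr + "/text()")


print(texts("//ol/li[1]"))             # ['item 1']
print(texts("//ol/li[position()>3]"))  # ['item 4', 'item 5', 'item 6', 'item 7']
print(texts("//ol/li[last()]"))        # ['item 7']
print(texts("//ol/li[last()-2]"))      # ['item 5']

# Chained predicates: the second one filters the result of the first.
print(texts("//ol/li[position()>1][position()<last()-2]"))
# ['item 2', 'item 3', 'item 4']
```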
//tag_name[text()='value'] locates tags by their exact text content; the whole text must match
//ol/li//span[text()='[Playable]'] # matches tags whose content is exactly [Playable]
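The difference between an exact text match and contains() shows up as soon as the text varies slightly. A minimal sketch on made-up markup:

```python
from lxml import etree

# Hypothetical spans with three slightly different texts.
html_str = """
<ol>
  <li><span>[Playable]</span></li>
  <li><span>Playable</span></li>
  <li><span>[Playable] now</span></li>
</ol>
"""
element = etree.HTML(html_str)

# text()='...' requires the whole text node to be equal.
exact = element.xpath("//ol/li//span[text()='[Playable]']")

# contains() accepts any text that includes the substring.
partial = element.xpath("//ol/li//span[contains(text(),'Playable')]")

print(len(exact))    # 1
print(len(partial))  # 3
```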