[crawler] XPath for data extraction
2022-07-04 23:10:00 【Speech unrecognized】
Installation
pip install lxml
Import
from lxml import etree
Usage
Convert an HTML string into an element object
# Convert an HTML string into an element object
from lxml import etree
element = etree.HTML(html_str)
Content is then extracted by calling element.xpath('matching rule') on that element object.
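As a minimal sketch of the two steps above, the snippet below parses a small HTML string and runs one query. The markup in html_str is a made-up sample, not taken from any real page.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration
html_str = """
<html>
  <body>
    <h1>Movies</h1>
    <p>First</p>
    <p>Second</p>
  </body>
</html>
"""

# Parse the string into an element object (the root <html> element)
element = etree.HTML(html_str)

# A path query returns a list of matches, even when there is only one
headings = element.xpath("//h1")
paragraphs = element.xpath("//p")

print(headings[0].tag)    # h1
print(len(paragraphs))    # 2
```

Because the result is a list, an empty list (not an exception) comes back when nothing matches, so it is worth checking the length before indexing.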
Selecting tags
Use / to start from the root node and to separate the steps of a path
/html/xx/xx/xxx
Use // to select across levels, jumping straight to the desired tag or text
//xxx # all xxx tags
Use .
./ # the current node
Use ..
../ # the parent of the current node
.// # use this when querying from an element rather than the whole HTML document; the path is relative to the current node
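The path forms above can be compared side by side. This is a sketch on invented markup; the variable names (absolute, anywhere, relative, leaked) are only labels chosen for the example.

```python
from lxml import etree

# Hypothetical sample markup: one list inside a div, one list outside it
html_str = """
<html>
  <body>
    <div id="list">
      <ul>
        <li>one</li>
        <li>two</li>
      </ul>
    </div>
    <ul>
      <li>three</li>
    </ul>
  </body>
</html>
"""
root = etree.HTML(html_str)

# / walks the tree step by step from the root
absolute = root.xpath("/html/body/div/ul/li")   # the 2 <li> inside the div

# // matches at any depth
anywhere = root.xpath("//li")                   # all 3 <li>

# Query again from an element that was already selected
div = root.xpath("//div")[0]
relative = div.xpath(".//li")   # only the 2 <li> below this div
leaked = div.xpath("//li")      # all 3: // without the dot searches the whole document

# .. steps up to the parent, . stays on the current node
parent = relative[0].xpath("..")[0]
same = div.xpath(".")[0]

print(len(absolute), len(anywhere), len(relative), len(leaked))  # 2 3 2 3
print(parent.tag, same.tag)                                      # ul div
```

The leaked case is the usual reason to prefer .// when looping over elements that were selected earlier.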
Getting attributes
@attribute_name returns the value of that attribute on the current tag
//img/@src # the src attribute of every img tag
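A short sketch of attribute extraction, assuming a made-up fragment with three img tags, one of which has no src attribute:

```python
from lxml import etree

# Hypothetical sample markup; the third <img> has no src attribute
html_str = '<div><img src="a.png" alt="A"><img src="b.png"><img alt="no source"></div>'
root = etree.HTML(html_str)

# @src returns the attribute values as strings; tags without src are skipped
srcs = root.xpath("//img/@src")

# The same value can also be read from an element with .get()
first_img = root.xpath("//img")[0]
first_src = first_img.get("src")

print(srcs)        # ['a.png', 'b.png']
print(first_src)   # a.png
```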
Getting text
/text() returns the text content of a tag
//tag_name[contains(text(), 'text')] returns the tags whose text contains the given string
//ol/li//span[contains(text(),'Playable')]
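The two text queries can be tried on a small list. The markup and the class names title and playable below are assumptions made up for this sketch.

```python
from lxml import etree

# Hypothetical sample markup: three entries, two of them marked as playable
html_str = """
<ol>
  <li><span class="title">Movie A</span><span class="playable">[Playable]</span></li>
  <li><span class="title">Movie B</span></li>
  <li><span class="title">Movie C</span><span class="playable">[Playable]</span></li>
</ol>
"""
root = etree.HTML(html_str)

# /text() returns the text nodes as a list of strings
titles = root.xpath("//ol/li/span[@class='title']/text()")

# contains() keeps only the spans whose text includes the substring
playable = root.xpath("//ol/li//span[contains(text(),'Playable')]")

# From each matching span, step up to the parent <li> and read its title
playable_titles = [s.xpath("../span[@class='title']/text()")[0] for s in playable]

print(titles)            # ['Movie A', 'Movie B', 'Movie C']
print(len(playable))     # 2
print(playable_titles)   # ['Movie A', 'Movie C']
```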
Selecting tags by condition
//tag_name[@attribute_name='value'] locates tags by the value of an attribute
//span[@class='title'] # select by class name
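One detail worth seeing in practice: @class='title' compares the whole attribute string, so a tag carrying several classes does not match. The sketch below uses invented markup, and the contains() variant is shown as one possible workaround, not as the author's method.

```python
from lxml import etree

# Hypothetical sample markup; the second span carries two classes
html_str = """
<div>
  <span class="title">Exact</span>
  <span class="title other">Two classes</span>
  <span class="rating">9.0</span>
</div>
"""
root = etree.HTML(html_str)

# Exact comparison against the full attribute value
exact = root.xpath("//span[@class='title']/text()")

# Substring comparison also catches class="title other"
loose = root.xpath("//span[contains(@class,'title')]/text()")

print(exact)   # ['Exact']
print(loose)   # ['Exact', 'Two classes']
```

The substring form would also match an unrelated class such as subtitle, so it trades precision for coverage.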
//tag_name[index] selects by position; indexes start at 1
Counting from the front: //parent_tag/tag_name[position()>3] selects from the 4th tag onward
Counting from the back: //parent_tag/tag_name[last()] selects the last tag
//parent_tag/tag_name[last() - 2] selects the third from last
Combined: //ol/li[position()>1][position()<last()-2]
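To see what each positional predicate returns, the sketch below builds a hypothetical list of eight items and runs the expressions above against it. The helper texts() is made up for the example.

```python
from lxml import etree

# Hypothetical sample markup: an <ol> with items "item 1" .. "item 8"
items = "".join(f"<li>item {i}</li>" for i in range(1, 9))
root = etree.HTML(f"<ol>{items}</ol>")

def texts(expr):
    """Return the text of every <li> matched by expr (illustrative helper)."""
    return root.xpath(expr + "/text()")

first = texts("//ol/li[1]")                      # indexes start at 1
from_fourth = texts("//ol/li[position()>3]")     # 4th item onward
last = texts("//ol/li[last()]")                  # last item
third_from_last = texts("//ol/li[last()-2]")     # third from last

# Chained predicates: the second one is evaluated on the result of the first
combined = texts("//ol/li[position()>1][position()<last()-2]")

print(first)             # ['item 1']
print(from_fourth)       # ['item 4', 'item 5', 'item 6', 'item 7', 'item 8']
print(last)              # ['item 8']
print(third_from_last)   # ['item 6']
print(combined)          # ['item 2', 'item 3', 'item 4', 'item 5']
```

In the combined query the first predicate leaves items 2 to 8, which are renumbered 1 to 7; the second predicate then keeps positions below 7 - 2 = 5, giving items 2 to 5.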
//tag_name[text()='value'] locates tags by their exact text content; the entire text must match
//ol/li//span[text()='[Playable]'] # matches tags whose content is exactly [Playable]
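The difference between an exact text match and contains() shows up as soon as the text varies slightly. A sketch on invented markup:

```python
from lxml import etree

# Hypothetical sample markup with four variations of the same label
html_str = """
<ol>
  <li><span>[Playable]</span></li>
  <li><span>[Playable] in HD</span></li>
  <li><span>Playable</span></li>
  <li><span> [Playable] </span></li>
</ol>
"""
root = etree.HTML(html_str)

# text()='...' requires the whole text node to be equal, including brackets and spaces
exact = root.xpath("//ol/li//span[text()='[Playable]']")

# contains() only needs the substring to appear somewhere in the text
partial = root.xpath("//ol/li//span[contains(text(),'Playable')]")

print(len(exact))     # 1
print(len(partial))   # 4
```

Only the first span matches exactly; extra words, missing brackets, or surrounding spaces all make the exact comparison fail.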