[crawler] XPath for data extraction
2022-07-04 23:10:00 【Speech unrecognized】
Install
pip install lxml
Import
from lxml import etree
Usage
Convert an HTML string into an element object
# Convert an HTML string into an element object
from lxml import etree
element = etree.HTML(html_str)
The sections below extract content by calling element.xpath('expression') on that element object.
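As a minimal sketch of the parse-then-query flow described above: the HTML snippet and the variable names other than `element` are made up for illustration and are not from the original post.

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
html_str = "<html><body><h1>Top movies</h1><p>First</p><p>Second</p></body></html>"

# etree.HTML parses the string and returns the root <html> element.
element = etree.HTML(html_str)

# .xpath() returns a list; here, a list of matching elements.
paragraphs = element.xpath("//p")
first_heading = element.xpath("//h1")[0]

print(element.tag)          # tag name of the root element
print(len(paragraphs))      # number of <p> elements found
print(first_heading.text)   # text inside the <h1>
```

Because `.xpath()` returns a list even for a single match, index into it (or check that it is non-empty) before using the result.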
Selecting tags
/ stands for the root node at the start of a path and separates the steps within a path
/html/xx/xx/xxx
// selects across levels, jumping straight to the desired tag or text regardless of depth
//xxx # all xxx tags
Use .
./ # the current node
Use ..
../ # the parent of the current node
.// # a path relative to the current node; use it when querying from an element rather than from the whole HTML document
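The path operators above can be compared side by side in a short sketch. The HTML structure and all variable names here are assumptions chosen for illustration.

```python
from lxml import etree

# Hypothetical document with two <div> blocks.
html_str = """
<html><body>
  <div id="a"><ul><li>one</li><li>two</li></ul></div>
  <div id="b"><ul><li>three</li></ul></div>
</body></html>
"""
element = etree.HTML(html_str)

# Absolute path from the root, one step per level.
absolute = element.xpath("/html/body/div/ul/li")

# // skips levels: every <li> anywhere in the document.
anywhere = element.xpath("//li")

# Take one <div> and query relative to it.
first_div = element.xpath("//div")[0]
inside_first = first_div.xpath(".//li")   # only the <li> under first_div
same_node = first_div.xpath(".")[0]       # the current node itself
parent = first_div.xpath("..")[0]         # the parent node (<body>)

# Without the leading dot, // still searches the whole document,
# even though it is called on first_div.
whole_doc = first_div.xpath("//li")

print([li.text for li in inside_first])
print(parent.tag, len(whole_doc))
```

The last query is the reason for the leading dot: when looping over elements and extracting fields from each one, `.//` keeps the search inside the current element.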
Getting attributes
@attribute_name returns the value of that attribute on the current tag
//img/@src # the src attribute of every img tag
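A small sketch of attribute extraction, assuming a made-up HTML fragment; the paths and URL below are placeholders, not values from the original post.

```python
from lxml import etree

# Hypothetical fragment; file paths and URL are placeholders.
html_str = """
<div>
  <img src="/img/a.png" alt="first">
  <img src="/img/b.png">
  <a href="https://example.com/page">link</a>
</div>
"""
element = etree.HTML(html_str)

# @name at the end of a path returns attribute values as strings.
srcs = element.xpath("//img/@src")

# Tags without the attribute are simply skipped.
alts = element.xpath("//img/@alt")

href = element.xpath("//a/@href")[0]

# For an element already in hand, Element.get reads one attribute.
first_img = element.xpath("//img")[0]
first_src = first_img.get("src")

print(srcs, alts, href, first_src)
```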
Getting text
/text() returns the text content inside a tag
//tag_name[contains(text(),'some text')] selects tags whose text contains the given string
//ol/li//span[contains(text(),'Playable')]
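The two text rules above might be used as follows. The list markup is an assumed stand-in for a movie ranking page, not the page the original post scraped.

```python
from lxml import etree

# Hypothetical ranking list, for illustration only.
html_str = """
<ol>
  <li><a>Movie A</a><span>[Playable]</span></li>
  <li><a>Movie B</a><span>[Coming soon]</span></li>
  <li><a>Movie C</a><span>[Playable]</span></li>
</ol>
"""
element = etree.HTML(html_str)

# /text() returns the text nodes directly, as a list of strings.
titles = element.xpath("//ol/li/a/text()")

# contains() keeps the <span> elements whose text includes the substring.
playable_spans = element.xpath("//ol/li//span[contains(text(),'Playable')]")
playable_texts = [span.text for span in playable_spans]

print(titles)
print(playable_texts)
```

Note the difference in return types: a path ending in `/text()` yields strings, while a path ending in a tag (with or without a predicate) yields elements.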
Selecting tags by condition
//tag_name[@attribute_name='value'] locates specific tags by attribute value
//span[@class='title'] # select by class name
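A sketch of selecting by attribute value, using assumed markup. It also shows that `@class='title'` compares the whole attribute string, so a tag carrying several class names is not matched; `contains(@class, ...)` is shown here as one looser alternative and is not part of the original post.

```python
from lxml import etree

# Hypothetical markup; class names are chosen for illustration.
html_str = """
<div>
  <span class="title">Movie A</span>
  <span class="rating">9.7</span>
  <span class="title">Movie B</span>
  <span class="rating">9.6</span>
  <span class="title other">Movie C</span>
</div>
"""
element = etree.HTML(html_str)

# Exact match on the full attribute value.
titles = element.xpath("//span[@class='title']/text()")
ratings = element.xpath("//span[@class='rating']/text()")

# Substring match on the attribute value (a looser alternative).
loose_titles = element.xpath("//span[contains(@class,'title')]/text()")

print(titles)
print(ratings)
print(loose_titles)
```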
//tag_name[index] selects by index; indexes start at 1
From the front: //parent_tag/tag_name[position()>3] selects from the 4th onward
From the back: //parent_tag/tag_name[last()] selects the last one; //parent_tag/tag_name[last()-2] selects the third from last
Combined: //ol/li[position()>1][position()<last()-2]
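The index rules above can be checked against a list of six items. The list contents and the helper `texts` are hypothetical, introduced only for this sketch.

```python
from lxml import etree

# Hypothetical list: six <li> items whose text is "1" through "6".
items = "".join(f"<li>{n}</li>" for n in range(1, 7))
element = etree.HTML(f"<ol>{items}</ol>")

def texts(expr):
    """Hypothetical helper: return the text of every node matched by expr."""
    return element.xpath(expr + "/text()")

second = texts("//ol/li[2]")                    # indexes start at 1
after_third = texts("//ol/li[position()>3]")    # 4th item onward
last = texts("//ol/li[last()]")                 # the last item
third_from_last = texts("//ol/li[last()-2]")    # third from last

# Predicates are applied left to right.
combined = texts("//ol/li[position()>1][position()<last()-2]")

print(second, after_third, last, third_from_last, combined)
```

With six items, the combined expression keeps items 2 and 3: it drops the first item and the last three.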
//tag_name[text()='value'] locates a tag by its exact text content; the whole text must match
//ol/li//span[text()='[Playable]'] # tags whose text is exactly [Playable]
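To illustrate how exact matching differs from substring matching, here is a sketch over assumed markup in which three spans all mention "Playable" but only one has exactly the text `[Playable]`.

```python
from lxml import etree

# Hypothetical markup with three near-identical labels.
html_str = """
<ol>
  <li><span>[Playable]</span></li>
  <li><span>[Playable soon]</span></li>
  <li><span>Playable</span></li>
</ol>
"""
element = etree.HTML(html_str)

# text()='...' requires the whole text to be equal.
exact = element.xpath("//ol/li//span[text()='[Playable]']")

# contains() accepts any text that includes the substring.
partial = element.xpath("//ol/li//span[contains(text(),'Playable')]")

print([span.text for span in exact])
print(len(partial))
```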