
[crawler] XPath for data extraction

2022-07-04 23:10:00 【Speech unrecognized】

pip install lxml

from lxml import etree

Convert an HTML string to an Element object

# Convert an HTML string (html_str) to an Element object
from lxml import etree
element = etree.HTML(html_str)

Everything below extracts content by calling element.xpath('matching rule') on the Element object
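A minimal sketch of the parse-then-query flow. The html_str value is a made-up sample, not taken from a real page; in a crawler it would typically be the response body.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_str = "<div><p>hello</p><p>world</p></div>"

# etree.HTML uses a lenient HTML parser and returns the root <html> element,
# wrapping fragments in <html> and <body> when they are missing.
element = etree.HTML(html_str)

root_tag = element.tag                    # 'html'
paragraphs = element.xpath("//p/text()")  # ['hello', 'world']
print(root_tag, paragraphs)
```

xpath() returns a list, so an expression that matches nothing yields an empty list rather than raising an error.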

Use / for the root node and as the separator between the steps of a path

/html/xx/xx/xxx

Use // to select across node levels and jump directly to the desired tag or text

//xxx   # get all xxx tags
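To compare the two forms, here is a small sketch on an invented document with two lists at different depths. The markup and variable names are assumptions for illustration.

```python
from lxml import etree

# Hypothetical sample: one <ul> nested in a <div>, one directly under <body>.
html_str = """
<html><body>
  <div><ul><li>a</li><li>b</li></ul></div>
  <ul><li>c</li></ul>
</body></html>
"""
element = etree.HTML(html_str)

# Absolute path: every step from the root has to be spelled out.
absolute = element.xpath("/html/body/div/ul/li/text()")  # ['a', 'b']

# // skips intermediate levels and matches <li> anywhere in the document.
anywhere = element.xpath("//li/text()")                  # ['a', 'b', 'c']
print(absolute, anywhere)
```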

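Use . for the current node

./   # the current node

Use .. for the parent node

../  # the parent of the current node

Use .// when you are working from an element rather than the complete HTML document, to get a path relative to that element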
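The difference between these prefixes shows up when xpath() is called on a sub-element. A short sketch on made-up markup (div_b and the other names are illustrative):

```python
from lxml import etree

# Hypothetical sample with two sibling <div> elements.
html_str = """
<div><span>one</span></div>
<div><p><span>two</span></p></div>
"""
element = etree.HTML(html_str)
div_b = element.xpath("//div")[1]  # the second <div>

# .// searches only below the current node.
inside = div_b.xpath(".//span/text()")     # ['two']
# // still starts from the document root, even on a sub-element.
everywhere = div_b.xpath("//span/text()")  # ['one', 'two']
# ./ looks at direct children only; the <span> here is a grandchild.
direct = div_b.xpath("./span/text()")      # []
# .. steps up to the parent node.
parent_tag = div_b.xpath("..")[0].tag      # 'body'
print(inside, everywhere, direct, parent_tag)
```

Forgetting the leading dot inside a loop over elements is a common source of duplicated results, since every iteration then queries the whole document.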

@attribute_name gets the value of that attribute on the current tag

//img/@src   # the src attribute of every img tag
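As a quick illustration, assuming two invented img tags where only one has an alt attribute:

```python
from lxml import etree

# Hypothetical sample markup.
html_str = '<div><img src="a.png" alt="A"><img src="b.png"></div>'
element = etree.HTML(html_str)

# Attribute values come back as a list of strings.
srcs = element.xpath("//img/@src")  # ['a.png', 'b.png']
# Tags that lack the attribute are skipped, not returned as None.
alts = element.xpath("//img/@alt")  # ['A']
print(srcs, alts)
```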

/text() gets the text content of a tag
//tagname[contains(text(), 'some text')] gets the tags whose text contains the given string

//ol/li//span[contains(text(),'Playable')]
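A small sketch of both forms on an invented list; the item texts are assumptions chosen so that two of the three spans contain the search string.

```python
from lxml import etree

# Hypothetical sample markup.
html_str = """
<ol>
  <li><span>[Playable]</span></li>
  <li><span>Playable now</span></li>
  <li><span>Unavailable</span></li>
</ol>
"""
element = etree.HTML(html_str)

# /text() returns the text nodes themselves.
texts = element.xpath("//ol/li/span/text()")
# ['[Playable]', 'Playable now', 'Unavailable']

# contains() keeps the elements whose text includes the substring.
matched = element.xpath("//ol/li//span[contains(text(), 'Playable')]")
matched_texts = [span.text for span in matched]
# ['[Playable]', 'Playable now']
print(texts, matched_texts)
```

The match is case-sensitive and whitespace-sensitive, so stray spaces inside the quotes would change what is found.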

//tagname[@attribute_name='value'] locates specific tags by the value of an attribute

//span[@class='title']   # select by class name
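One thing worth seeing in practice: @class='title' compares the whole attribute string. The sketch below uses made-up markup; the contains() variant is a plain substring test shown for contrast, not something the original text prescribes.

```python
from lxml import etree

# Hypothetical sample: the second span carries two class names.
html_str = (
    '<div>'
    '<span class="title">Main</span>'
    '<span class="title other">Alt</span>'
    '<span class="other">X</span>'
    '</div>'
)
element = etree.HTML(html_str)

# Exact comparison against the full attribute value.
exact = element.xpath("//span[@class='title']/text()")             # ['Main']
# Substring test on the attribute value (would also match e.g. "subtitle").
loose = element.xpath("//span[contains(@class, 'title')]/text()")  # ['Main', 'Alt']
print(exact, loose)
```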

//tagname[index] selects by position; indexes start at 1

Counting from the front
//parent/tagname[position()>3]   # from the 4th onward

Counting from the back
//parent/tagname[last()]       # the last one
//parent/tagname[last() - 2]   # the third from last

Combined
//ol/li[position()>1][position()<last()-2]
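The position rules can be checked on a list of seven items. This is a sketch on generated markup; the item count and variable names are assumptions.

```python
from lxml import etree

# Hypothetical sample: <ol> with items 1..7.
html_str = "<ol>" + "".join(f"<li>{i}</li>" for i in range(1, 8)) + "</ol>"
element = etree.HTML(html_str)

first = element.xpath("//ol/li[1]/text()")                   # ['1'], indexes start at 1
after_third = element.xpath("//ol/li[position()>3]/text()")  # ['4', '5', '6', '7']
last = element.xpath("//ol/li[last()]/text()")               # ['7']
third_from_last = element.xpath("//ol/li[last()-2]/text()")  # ['5']

# Chained predicates: the second one is evaluated against the list left by
# the first (items 2..7), where positions are renumbered from 1 again.
combined = element.xpath("//ol/li[position()>1][position()<last()-2]/text()")
# ['2', '3', '4']
print(first, after_third, last, third_from_last, combined)
```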

//tagname[text()='value'] locates a specific tag by its exact text content; every character has to match

//ol/li//span[text()='[Playable]']   # matches tags whose text is exactly [Playable]
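To show how strict the comparison is, here is a sketch on invented markup where only one span has exactly the searched text:

```python
from lxml import etree

# Hypothetical sample markup.
html_str = """
<ol>
  <li><span>[Playable]</span></li>
  <li><span>Playable now</span></li>
</ol>
"""
element = etree.HTML(html_str)

# The whole text must be equal, brackets included.
exact = element.xpath("//ol/li//span[text()='[Playable]']/text()")  # ['[Playable]']
# A partial string matches nothing with text()=.
partial = element.xpath("//ol/li//span[text()='Playable']")         # []
print(exact, partial)
```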

Copyright notice
This article was written by [Speech unrecognized]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/185/202207042246340531.html