XPath expressions (II)
-
Web scraping
Basic XPath patterns
| Pattern | Description | Examples |
|---|---|---|
| / | document root, or child step (parent/child) | /html/body/div |
| // | all descendants of a given type (parent//descendant) | /html/body//p |
| * | all elements | |
| | example: all children of body | /html/body/* |
| | example: all descendants of body | /html/body//* |
| node() | any child node of the current element (element, text or comment) | /html/body/node() |
| text() | matches text nodes | /html/body/div/h1/text() |
| comment() | matches comment nodes | /html/body/comment() |
| @attr | value of the attribute named attr | /html/body/div/@name |
| @* | all attributes of an element | |
| | example: all attributes of div elements | //div/@* |
| [expr] | filter condition (predicate) | |
| | example: any elements that have at least one attribute | //*[@*] |
| | example: any elements that have a class attribute | //*[@class] |
| | example: any elements with class = 'description' | //*[@class='description'] |
| | example: div elements with name = 'div_c' | //div[@name='div_c'] |
| < , > | numeric comparison | //img[@width > 10] |
| and | and condition | //div[@name="div_b3" and @class="description"] |
| or | or condition | //div[@name="div_b3" or @class="description"] |
| not | not condition | //p[not(@name="p1")] |
| .. | parent node | //h1[text()="First Heading"]/.. |
| . | current node | |
| | python/lxml example: selection relative to the current node | .//p |
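
The patterns in the table can be tried out with a short lxml script. This is only a sketch: the sample page below is hypothetical (the slides do not show the HTML they query), and names such as `div_a` or `a.png` are made up to fit the example expressions.

```python
from lxml import html

# Hypothetical sample page; the element names are illustrative assumptions.
PAGE = (
    '<html><body>'
    '<div name="div_a"><h1>First Heading</h1><p name="p1">Intro</p></div>'
    '<div name="div_b"><div name="div_b3" class="description"><p>Details</p></div></div>'
    '<div name="div_c"><img src="a.png" width="20"/><p>Last</p></div>'
    '</body></html>'
)
root = html.document_fromstring(PAGE)

# text() returns text nodes, @name returns attribute values (both as strings).
heading_text = root.xpath('/html/body/div/h1/text()')
top_div_names = root.xpath('/html/body/div/@name')

# Predicates filter the elements selected by the step they follow.
with_class = root.xpath('//*[@class]')
wide_images = root.xpath('//img[@width > 10]')
other_paragraphs = root.xpath('//p[not(@name="p1")]')

# ".." steps up to the parent of the matched h1.
heading_parent = root.xpath('//h1[text()="First Heading"]/..')

# Relative vs. absolute: ".//p" searches below div_c only,
# while "//p" always starts from the document root.
div_c = root.xpath('//div[@name="div_c"]')[0]
inner_paragraphs = div_c.xpath('.//p')
all_paragraphs = div_c.xpath('//p')

print(heading_text, top_div_names, len(inner_paragraphs), len(all_paragraphs))
```

The last two queries show why the leading `.` matters when an expression is evaluated on an element rather than on the whole document.
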
XPath axes (node-sets relative to the current node)
Definitions
ancestor
= its parent, its parent's parent, and so on up to the root element
descendant
= element's children, their children, and so on
sibling
= child elements of the same parent, in document order, excluding the element itself
...-or-self
= the same set, plus the current node itself
| Axis name | Result |
|---|---|
| ancestor | Selects all ancestors (parent, grandparent, etc.) of the current node |
| ancestor-or-self | Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself |
| attribute | Selects all attributes of the current node |
| child | Selects all children of the current node |
| descendant | Selects all descendants (children, grandchildren, etc.) of the current node |
| descendant-or-self | Selects all descendants (children, grandchildren, etc.) of the current node and the current node itself |
| following | Selects everything in the document after the closing tag of the current node |
| following-sibling | Selects all siblings after the current node |
| namespace | Selects all namespace nodes of the current node |
| parent | Selects the parent of the current node |
| preceding | Selects all nodes that appear before the current node in the document, except ancestors, attribute nodes and namespace nodes |
| preceding-sibling | Selects all siblings before the current node |
| self | Selects the current node |
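
To see what each axis returns, the following sketch evaluates several axes from a single context node. The page is a hypothetical stand-in (the original document is not shown), and the helper `labels` is introduced here purely for illustration.

```python
from lxml import html

# Hypothetical sample page; the structure is an assumption for illustration.
PAGE = (
    '<html><body>'
    '<div name="div_a"><h1>First Heading</h1></div>'
    '<div name="div_b"><div name="div_b3"><p>Details</p></div></div>'
    '<div name="div_c"><p>Last</p></div>'
    '</body></html>'
)
root = html.document_fromstring(PAGE)
div_b = root.xpath('//div[@name="div_b"]')[0]


def labels(expr):
    """Evaluate expr with div_b as the current node.

    Each result is labelled by its name attribute (or its tag if it has none).
    The labels are sorted alphabetically so the output does not depend on
    the order in which an axis is traversed.
    """
    return sorted(el.get('name') or el.tag for el in div_b.xpath(expr))


ancestors = labels('ancestor::*')
ancestors_or_self = labels('ancestor-or-self::*')
descendants = labels('descendant::*')
next_siblings = labels('following-sibling::*')
prev_siblings = labels('preceding-sibling::*')
following = labels('following::*')
preceding = labels('preceding::*')

for name, result in [
    ('ancestor', ancestors),
    ('ancestor-or-self', ancestors_or_self),
    ('descendant', descendants),
    ('following-sibling', next_siblings),
    ('preceding-sibling', prev_siblings),
    ('following', following),
    ('preceding', preceding),
]:
    print(f'{name:18} {result}')
```

Note that `preceding` leaves out `html` and `body`: they are ancestors of `div_b`, so they belong to the `ancestor` axis instead.
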

| Axis name | Abbreviation | Example | Abbreviated example |
|---|---|---|---|
| self:: | . | self::*//p | .//p |
| parent:: | .. | //div[@name="div_c"]/parent::* | //div[@name="div_c"]/.. |
| child:: | / | /html/body/child::div | /html/body/div |
| ancestor:: | | //div[@name="div_b"]/ancestor::* | |
| ancestor-or-self:: | | //div[@name="div_b"]/ancestor-or-self::* | |
| descendant:: | // | //div[@name="div_b"]/descendant::* | /html/body/div[@name="div_b"]//* |
| descendant-or-self:: | | //div[@name="div_b"]/descendant-or-self::* | |
| attribute:: | @ | //div[attribute::name="div_c"] | //div[@name='div_c'] |
| following-sibling:: | | //div[@name="div_b"]/following-sibling::* | |
| preceding-sibling:: | | //div[@name="div_b"]/preceding-sibling::* | |
| following:: | | //div[@name="div_b"]/following::* | |
| preceding:: | | //div[@name="div_b"]/preceding::* | |
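
The equivalence between the explicit axis syntax and its abbreviations can be checked directly. This is a minimal sketch against a hypothetical page (an assumption, since the original HTML is not shown); `paths` and `PAIRS` are names introduced only for this example.

```python
from lxml import html

# Hypothetical sample page; the structure is an assumption for illustration.
PAGE = (
    '<html><body>'
    '<div name="div_a"><h1>First Heading</h1></div>'
    '<div name="div_b"><div name="div_b3"><p>Details</p></div></div>'
    '<div name="div_c"><p>Last</p></div>'
    '</body></html>'
)
root = html.document_fromstring(PAGE)
tree = root.getroottree()


def paths(expr):
    """Return the absolute path of every element matched by expr."""
    return [tree.getpath(el) for el in root.xpath(expr)]


# (explicit axis form, abbreviated form), taken from the table above.
PAIRS = [
    ('self::*//p', './/p'),
    ('//div[@name="div_c"]/parent::*', '//div[@name="div_c"]/..'),
    ('/html/body/child::div', '/html/body/div'),
    ('//div[@name="div_b"]/descendant::*', '/html/body/div[@name="div_b"]//*'),
    ('//div[attribute::name="div_c"]', "//div[@name='div_c']"),
]

# Collect every pair whose two forms select different elements.
mismatches = [(full, short) for full, short in PAIRS if paths(full) != paths(short)]

for full, short in PAIRS:
    print(f'{full}  ==  {short}  ->  {paths(full)}')
print('mismatches:', mismatches)
```

Comparing element paths rather than the element objects themselves keeps the check independent of how lxml hands out its Python wrappers.
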