Lxml howto
-
Web scraping
-
Back-End
Processing an HTML string with the lxml Python library
The imports used are:
from lxml import html, etree
html_str = """
<html>
<head>
</head>
<body>
<div id="my_div" class="some_style">this is my div</div>
</body>
</html>
"""
First, we parse the HTML string into an Element tree instance. Parsing with lxml.etree is more restrictive than lxml.html, so for HTML we can parse the string with lxml.html:
# doc = etree.fromstring(html_str)
doc = html.fromstring(html_str)
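The difference matters for real-world markup. A small sketch, using a hypothetical fragment with a missing close tag, shows lxml.etree rejecting what lxml.html happily repairs:

```python
from lxml import etree, html

# hypothetical fragment: the <p> element is never closed
broken = "<html><body><p>unclosed paragraph</body></html>"

# etree.fromstring expects well-formed XML and raises XMLSyntaxError here
try:
    etree.fromstring(broken)
    parsed_by_etree = True
except etree.XMLSyntaxError:
    parsed_by_etree = False

# html.fromstring repairs the markup instead of rejecting it
doc = html.fromstring(broken)
print(parsed_by_etree)        # False
print(doc.find('.//p').text)  # unclosed paragraph
```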
The lxml library has dedicated search functions, but we can simply use XPath expressions to search for HTML sub-elements:
items = doc.xpath('//div[@id="my_div"]')
my_div = items[0]
An element has the attributes tag (name of the HTML element), text (text inside the element), and attrib (dictionary of the element's attributes):
>>> print(my_div.tag)
div
>>> print(my_div.text)
this is my div
>>> print(my_div.attrib)
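Besides full XPath, lxml also supports the simpler ElementPath syntax through find() and findall(), and individual attribute values can be read with get(). A small sketch on the same example document:

```python
from lxml import html

html_str = '<html><body><div id="my_div" class="some_style">this is my div</div></body></html>'
doc = html.fromstring(html_str)

# find() takes the simpler ElementPath syntax instead of full XPath
my_div = doc.find(".//div[@id='my_div']")

# get() reads a single attribute value from the attrib dictionary
print(my_div.get('class'))
print(my_div.text)
```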
In our example, to append a list after our div, we need access to the div's parent, the body:
items = doc.xpath('/html/body')
body = items[0]
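Alternatively, since every lxml element keeps a reference to its parent, we could navigate upward from the div with getparent() instead of running a second XPath query. A minimal sketch:

```python
from lxml import html

doc = html.fromstring('<html><body><div id="my_div">this is my div</div></body></html>')
my_div = doc.xpath('//div[@id="my_div"]')[0]

# getparent() walks one level up the tree, here from the div to the body
body = my_div.getparent()
print(body.tag)  # body
```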
Creating and appending new elements is possible with the etree.SubElement constructor (there is no html.SubElement defined):
ul = etree.SubElement(body, 'ul')
li = etree.SubElement(ul, 'li')
a = etree.SubElement(li, 'a', href='http://infohost.nmt.edu/tcc/help/pubs/pylxml/pylxml.pdf')
a.text = "Lxml doc I"
span = etree.SubElement(a, 'span')
span.set('class', 'caret')
ul = etree.SubElement(body, 'ul')
li = etree.SubElement(ul, 'li')
a = etree.SubElement(li, 'a', href='http://lxml.de/')
a.text = "Lxml doc II"
span = etree.SubElement(a, 'span')
span.set('class', 'caret')
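The same subtree can also be built declaratively with the ElementMaker (E) from lxml.builder. Attribute names that clash with Python keywords, such as class, are passed in a dict instead of a keyword argument. A sketch of the second list built this way:

```python
from lxml import html
from lxml.builder import E

body = html.fromstring('<html><body></body></html>').find('body')

# nested calls mirror the nested markup; the dict carries the 'class' attribute
ul = E.ul(E.li(E.a('Lxml doc II', E.span({'class': 'caret'}), href='http://lxml.de/')))
body.append(ul)
print(html.tostring(body))
```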
And, finally, retrieve the modified HTML source with:
print(html.tostring(doc))
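Note that tostring() returns bytes by default; passing encoding='unicode' yields a str, and pretty_print=True indents the output, which is handy for inspecting the modified tree:

```python
from lxml import html

doc = html.fromstring('<html><body><div id="my_div">this is my div</div></body></html>')

# encoding='unicode' returns str instead of bytes; pretty_print indents nested tags
text = html.tostring(doc, pretty_print=True, encoding='unicode')
print(text)
```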
Download test file: lxml_test.py