Lxml howto
-
Web scraping
-
Back-End
Processing an HTML string with the lxml Python library
The imports used are:
from lxml import html, etree
html_str = """
<html>
<head>
</head>
<body>
<div id="my_div" class="some_style">this is my div</div>
</body>
</html>
"""
First, we parse the HTML string into an Element tree instance. Parsing with lxml.etree is more restrictive than lxml.html, so for HTML we can parse the string with lxml.html:
# doc = etree.fromstring(html_str)
doc = html.fromstring(html_str)
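The difference matters for real-world markup. A small sketch, using a hypothetical fragment with a missing close tag, shows lxml.etree rejecting what lxml.html happily repairs:

```python
from lxml import etree, html

# hypothetical fragment: the <p> element is never closed
broken = "<html><body><p>unclosed paragraph</body></html>"

# etree.fromstring expects well-formed XML and raises XMLSyntaxError here
try:
    etree.fromstring(broken)
    parsed_by_etree = True
except etree.XMLSyntaxError:
    parsed_by_etree = False

# html.fromstring repairs the markup instead of rejecting it
doc = html.fromstring(broken)
print(parsed_by_etree)        # False
print(doc.find('.//p').text)  # unclosed paragraph
```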
The lxml library has dedicated search functions, but we can simply use XPath expressions to search for HTML sub-elements:
items = doc.xpath('//div[@id="my_div"]')
my_div = items[0]
An element has the attributes tag (name of the HTML element), text (text inside the element), and attrib (dictionary of the element's attributes):
>>> print(my_div.tag)
div
>>> print(my_div.text)
this is my div
>>> print(my_div.attrib)
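Besides full XPath, lxml also supports the simpler ElementPath syntax through find() and findall(), and individual attribute values can be read with get(). A small sketch on the same example document:

```python
from lxml import html

html_str = '<html><body><div id="my_div" class="some_style">this is my div</div></body></html>'
doc = html.fromstring(html_str)

# find() takes the simpler ElementPath syntax instead of full XPath
my_div = doc.find(".//div[@id='my_div']")

# get() reads a single attribute value from the attrib dictionary
print(my_div.get('class'))
print(my_div.text)
```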
In our example, to append a list after our div, we need access to the div's parent, the body:
items = doc.xpath('/html/body')
body = items[0]
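Alternatively, since every lxml element keeps a reference to its parent, we could navigate upward from the div with getparent() instead of running a second XPath query. A minimal sketch:

```python
from lxml import html

doc = html.fromstring('<html><body><div id="my_div">this is my div</div></body></html>')
my_div = doc.xpath('//div[@id="my_div"]')[0]

# getparent() walks one level up the tree, here from the div to the body
body = my_div.getparent()
print(body.tag)  # body
```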
Creating and appending new elements is possible with the etree.SubElement constructor (there is no html.SubElement defined):
ul = etree.SubElement(body, 'ul')
li = etree.SubElement(ul, 'li')
a = etree.SubElement(li, 'a', href='http://infohost.nmt.edu/tcc/help/pubs/pylxml/pylxml.pdf')
a.text = "Lxml doc I"
span = etree.SubElement(a, 'span')
span.set('class', 'caret')
ul = etree.SubElement(body, 'ul')
li = etree.SubElement(ul, 'li')
a = etree.SubElement(li, 'a', href='http://lxml.de/')
a.text = "Lxml doc II"
span = etree.SubElement(a, 'span')
span.set('class', 'caret')
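The same subtree can also be built declaratively with the ElementMaker (E) from lxml.builder. Attribute names that clash with Python keywords, such as class, are passed in a dict instead of a keyword argument. A sketch of the second list built this way:

```python
from lxml import html
from lxml.builder import E

body = html.fromstring('<html><body></body></html>').find('body')

# nested calls mirror the nested markup; the dict carries the 'class' attribute
ul = E.ul(E.li(E.a('Lxml doc II', E.span({'class': 'caret'}), href='http://lxml.de/')))
body.append(ul)
print(html.tostring(body))
```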
And, finally, retrieve the modified HTML source with:
print(html.tostring(doc))
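Note that tostring() returns bytes by default; passing encoding='unicode' yields a str, and pretty_print=True indents the output, which is handy for inspecting the modified tree:

```python
from lxml import html

doc = html.fromstring('<html><body><div id="my_div">this is my div</div></body></html>')

# encoding='unicode' returns str instead of bytes; pretty_print indents nested tags
text = html.tostring(doc, pretty_print=True, encoding='unicode')
print(text)
```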
Download test file: lxml_test.py