The lxml XML toolkit for Python

Overview

What is lxml?

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory friendly, just so you know.

For an introduction and further documentation, see doc/main.txt.

For installation information, see INSTALL.txt.

For the issue tracker, see https://bugs.launchpad.net/lxml

Support the project

lxml has been downloaded from the Python Package Index millions of times and is also available directly in many package distributions, e.g. for Linux or macOS.

Most people who use lxml do so because they like using it. You can show us that you like it by blogging about your experience with it and linking to the project website.

If you are using lxml for your work and feel like giving a bit of your own benefit back to support the project, consider sending us money through GitHub Sponsors, Tidelift or PayPal that we can use to buy us free time for the maintenance of this great library, to fix bugs in the software, review and integrate code contributions, to improve its features and documentation, or to just take a deep breath and have a cup of tea every once in a while. Please read the Legal Notice below, at the bottom of this page. Thank you for your support.

Support lxml through GitHub Sponsors

via a Tidelift subscription

or via PayPal:

Donate to the lxml project

Please contact Stefan Behnel for other ways to support the lxml project, as well as commercial consulting, customisations and trainings on lxml and fast Python XML processing.

Travis-CI and AppVeyor support the lxml project with their build and CI servers. Jetbrains supports the lxml project by donating free licenses of their PyCharm IDE. Another supporter of the lxml project is COLOGNE Webdesign.

Project income report

  • Total project income in 2019: EUR 717.52 (59.79 € / month)
    • Tidelift: EUR 360.30
    • Paypal: EUR 157.22
    • other: EUR 200.00

Legal Notice for Donations

Any donation that you make to the lxml project is voluntary and is not a fee for any services, goods, or advantages. By making a donation to the lxml project, you acknowledge that we have the right to use the money you donate in any lawful way and for any lawful purpose we see fit and we are not obligated to disclose the way and purpose to any party unless required by applicable law. Although lxml is free software, to the best of our knowledge the lxml project does not have any tax exempt status. The lxml project is neither a registered non-profit corporation nor a registered charity in any country. Your donation may or may not be tax-deductible; please consult your tax advisor in this matter. We will not publish or disclose your name and/or e-mail address without your consent, unless required by applicable law. Your donation is non-refundable.

Comments
  • Introduce a multi os travis build that builds OSX wheels

    Improvements could be made to build the manylinux wheels as well, but this is a big change on its own. When a tag is created, the wheel will be pushed as a GitHub release.

    So this is unlikely to be ready to merge in its current state, but I think it's at a point where some feedback would be helpful.

    Example build Example "release"

    Here I am sanity-testing the 2.7 wheel, similar to how the Makefile does it:

    $ pip install lxml-3.7.3-cp27-cp27m-macosx_10_11_x86_64.whl
    Processing ./lxml-3.7.3-cp27-cp27m-macosx_10_11_x86_64.whl
    Installing collected packages: lxml
    Successfully installed lxml-3.7.3
    $ python
    Python 2.7.13 (default, Apr  5 2017, 22:17:22)
    [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.38)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lxml.etree
    >>> import lxml.objectify
    >>>
    

    and 2.6

    $ pip install lxml-3.7.3-cp26-cp26m-macosx_10_11_x86_64.whl
    DEPRECATION: Python 2.6 is no longer supported by the Python core team, please upgrade your Python. A future version of pip will drop support for Python 2.6
    Processing ./lxml-3.7.3-cp26-cp26m-macosx_10_11_x86_64.whl
    Installing collected packages: lxml
    Successfully installed lxml-3.7.3
    $ python
    Python 2.6.9 (unknown, Apr 30 2017, 08:22:20)
    [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lxml.etree
    >>> import lxml.objectify
    >>>
    

    Some notes:

    • I don't bother building all the Python 3 builds for OSX, as they would all fail the same way as the allowed_failure build that I do run. I filed a bug here; I can help with that bug but am not sure what to do next on it.
    • The Linux wheels that get built are likely not useful, as they are not manylinux builds. I experimented with building manylinux wheels here, and while I think it's possible, I abandoned it (for now) when I hit an FTP error I had no idea how to deal with. Here is a failing build and the abandoned branch.
    opened by Bachmann1234 31
  • AppVeyor CI: Add Python 3.11 jobs

    AppVeyor deployed new Windows images with Python 3.11 support (https://github.com/appveyor/ci/issues/3844), which means we can use it to build Python 3.11 Windows wheels for lxml. This PR adds three Python 3.11 jobs to the matrix, for the x86, x86-64 and arm64 platforms

    Part of Bug #1977998. Partly replaces #355.

    I tested the jobs on my branch, and the workflow passes.

    After this PR is merged, I would suggest backporting it to the 4.9 maintenance branch and releasing a new 4.9.2 version that includes these Python 3.11 Windows wheels.

    opened by EwoutH 23
  • Add support for ucrt binaries on Windows

    Hi,

    This PR is a first big step towards resolving #1326096. I went through the pain to recompile libiconv, libxml2 and libxslt with Visual Studio 2015/ucrt to have binaries that can be used to build a Python 3.5 wheel.

    This PR makes sure that the ucrt binaries are downloaded when we are on Python 3.5. I documented the actual compilation of the binaries in a reproducible manner at https://github.com/mhils/libxml2-win-binaries. After merging this PR and installing the Visual C++ Build Tools (or Visual Studio), a Python 3.5 x86 Windows wheel can be built as follows:

    > "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat"
    > python3 setup.py bdist_wheel --static-deps
    

    I successfully tested the resulting wheel on a clean Win7 VM, where it worked fine.

    If I could ask for a favor, it would be great if you could upload a Python 3.5 Windows wheel to PyPI as soon as possible (feel free to take the wheel linked above or compile your own). We're currently migrating @mitmproxy to Python 3, and lxml is currently the only dependency that holds back a pip3 install mitmproxy.

    Thanks! Max

    opened by mhils 19
  • Fix inheritance in lxml.html

    As the old comment / FIXME from 8132c755adad4a75ba855d985dd257493bccc7fd notes, the mixin should come first for the inheritance to be correct (the left-most class is the first in the MRO, at least if no diamond inheritance is involved).

    Also fix the odd super in HtmlMixin likely stemming from the incorrect MRO.

    This fixes the inheritance order of all HTML* base classes, though it probably doesn't matter for classes other than HtmlElement.
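
To illustrate why the order of the base classes matters, here is a minimal sketch. The class names (`Base`, `Mixin`, `MixinLast`, `MixinFirst`) are hypothetical stand-ins, not lxml's actual classes; the example only shows how Python's method resolution order (MRO) treats a mixin that is listed first versus last.

```python
class Base:
    def describe(self):
        return "Base"


class Mixin:
    def describe(self):
        # Cooperative call: continues along the MRO of the concrete class.
        return "Mixin+" + super().describe()


class MixinLast(Base, Mixin):
    """Mixin listed last: Base.describe() wins, the mixin is never reached."""


class MixinFirst(Mixin, Base):
    """Mixin listed first: the mixin overrides Base and can delegate to it."""


print([cls.__name__ for cls in MixinFirst.__mro__])
# ['MixinFirst', 'Mixin', 'Base', 'object']
print(MixinLast().describe())   # Base
print(MixinFirst().describe())  # Mixin+Base
```

With the mixin on the left, a plain `super()` call inside the mixin resolves to the base class, which is presumably why the unusual `super` usage mentioned above is no longer needed once the order is corrected.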

    opened by xmo-odoo 14
  • Removed the PyPy special cases (for PyPy 4.1)

    PyPy trunk (and future PyPy 4.1) now contains https://bitbucket.org/pypy/pypy/commits/3144c72295ae which improves the cpyext compatibility. It removes the need for these few hacks (which never fully worked, as discussed on pypy-dev).

    opened by arigo 13
  • repair attribute mis-interpretation in ElementTreeContentHandler

    Regarding https://bugs.launchpad.net/lxml/+bug/1136509, this is a proposed fix for the issue.

    The first part of the fix just rewrites the attributes in startElement to have keys of the form (namespace, key).

    At first, I set the namespace to None, but I had a problem with that. It appears that even namespaced attributes, like the "xmlns:xsi" in my test document, are also passed to startElement, I suppose because the owning tag doesn't have a namespace. So in this case I'm splitting on the colon and passing the two tokens to startElementNS, but I'm not sure if this approach is correct. In any case, I added two tests; if you can show what should happen in the tests, that would at least make the correct behavior apparent here.
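
For reference, the (namespace, key) form of attribute names can be observed with the standard library's SAX parser. This is only a sketch of what a namespace-aware SAX event source passes to `startElementNS`; it does not use lxml's `ElementTreeContentHandler`, and the `Recorder` class and the sample document are made up for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class Recorder(ContentHandler):
    """Hypothetical handler that only records the attribute keys it receives."""

    def __init__(self):
        super().__init__()
        self.attr_keys = []

    def startElementNS(self, name, qname, attrs):
        # In namespace mode, attribute keys are (namespace URI, local name) pairs.
        self.attr_keys.extend(attrs.keys())


parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)  # deliver startElementNS events
recorder = Recorder()
parser.setContentHandler(recorder)
parser.parse(io.BytesIO(b'<doc xmlns:p="urn:p" plain="1" p:qualified="2"/>'))

print(recorder.attr_keys)
```

Attributes without a prefix are reported with `None` as their namespace, while prefixed attributes carry the resolved namespace URI rather than the prefix.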

    opened by zzzeek 13
  • Add Dependabot configuration for GitHub Actions updates

    Add a Dependabot configuration that checks once a week if the GitHub Actions are still using the latest version. If not, it opens a PR to update them.

    It will actually open very few PRs, since we only specify major versions (like v3), so it will open a PR only when a new major version such as v4 is released.

    This will basically automate the majority of PRs like #356.

    See Keeping your actions up to date with Dependabot.

    opened by EwoutH 11
  • GHA wheel CI: Update images, used actions and Python version

    A bit of maintenance on the GitHub Actions wheel CI:

    • Update the used Ubuntu and macOS images to the latest versions, and enable the Windows run
    • Update the used actions to their latest versions
    • Use Python 3.10 to build wheels
    • Add Python 3.11 run

    Part of Bug #1977998.

    opened by EwoutH 11
  • Do not blindly copy all of the namespaces when tostring():ing a subtree.

    When serializing a subtree of a document, do not simply copy all of the namespaces from all of the parents down. Only copy those that are actually used within the subtree, because copying all namespaces bloats the subtree with information it should not have.

    This might seem harmless in the average case, but it causes problems when serializing the XML, specifically with C14N serialization, which according to the specification retains all namespace declarations on the root-level element. So if tostring() inserts all parent namespace declarations into the new root element, we unnecessarily bloat the namespace declarations on this new top-level element.

    That said, I am not confident this is the best code for doing this; feel free to point me in the direction of better code.
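
The "only copy what is used" idea can be sketched in plain Python. The following is an illustration using the standard library's `xml.etree.ElementTree` (not lxml and not the code of this PR); `used_namespaces` is a hypothetical helper that collects the namespace URIs referenced by the tags and attribute names of a subtree.

```python
import xml.etree.ElementTree as ET


def used_namespaces(element):
    """Return the namespace URIs used by tags and attribute names in a subtree.

    Assumption: names are in Clark notation ("{uri}local"), as ElementTree
    and lxml report them. Prefixes that occur only inside text or attribute
    values (for example in xsi:type) are not detected by this sketch.
    """
    uris = set()
    for node in element.iter():
        for name in [node.tag, *node.attrib]:
            if isinstance(name, str) and name.startswith("{"):
                uris.add(name[1:].split("}", 1)[0])
    return uris


doc = ET.fromstring(
    '<root xmlns:a="urn:a" xmlns:b="urn:b" xmlns:c="urn:c">'
    '<a:child><a:leaf b:flag="1"/></a:child>'
    "<c:other/>"
    "</root>"
)
subtree = doc[0]  # the <a:child> element
print(sorted(used_namespaces(subtree)))  # ['urn:a', 'urn:b']
```

Here the declaration for `urn:c` is in scope for the subtree but unused by it, so a serializer following this approach would not need to emit it on the new root element.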

    opened by Pelleplutt 11
  • Improve detection of the libxml2 and libxslt libraries

    This patch improves detection of the libxml2 and libxslt libraries by cleaning up some of the overly-complex build system.

    The patch also improves support for using pkg-config if available.

    opened by hughmcmaster 10
  • Adds a `smart_prefix` option to XPath evaluations to overcome a counter-intuitive design flaw

    Namespaces are one honking great idea -- let's do more of those!

    Using XPath to locate elements is quite cumbersome when it comes to documents that have a default namespace:

    >>> root = etree.fromstring('<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:x="http://www.w3.org/2000/svg"><text><body><x:svg /></body></text></TEI>')
    >>> root.nsmap
    {'x': 'http://www.w3.org/2000/svg', None: 'http://www.tei-c.org/ns/1.0'}
    >>> root.xpath('./text/body')
    []
    >>> root.xpath('./text/body', namespaces=root.nsmap)
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
        root.xpath('./text/body', namespaces=root.nsmap)
      File "src/lxml/lxml.etree.pyx", line 1584, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:59349)
        evaluator = XPathElementEvaluator(self, namespaces=namespaces,
      File "src/lxml/xpath.pxi", line 261, in lxml.etree.XPathElementEvaluator.__init__ (src/lxml/lxml.etree.c:170589)
      File "src/lxml/xpath.pxi", line 133, in lxml.etree._XPathEvaluatorBase.__init__ (src/lxml/lxml.etree.c:168702)
      File "src/lxml/xpath.pxi", line 57, in lxml.etree._XPathContext.__init__ (src/lxml/lxml.etree.c:167658)
        _BaseContext.__init__(self, namespaces, extensions, error_log, enable_regexp,
      File "src/lxml/extensions.pxi", line 84, in lxml.etree._BaseContext.__init__ (src/lxml/lxml.etree.c:156529)
        if namespaces:
    TypeError: empty namespace prefix is not supported in XPath
    

    This is a well documented issue (also here) and is commonly solved by manipulating the namespace mapping with an ad-hoc prefix - which loses the information about what the default namespace was unless it is preserved - and adding that prefix to XPath expressions. (another hack, stdlib as well with some insights)
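
The ad-hoc prefix workaround looks roughly like this. To keep the sketch runnable without lxml, it uses the standard library's `xml.etree.ElementTree`, whose `findall()` accepts a prefix mapping in the same spirit; the `nsmap` dictionary is copied from the session above, and `with_adhoc_prefix` and the prefix `tei` are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:x="http://www.w3.org/2000/svg">'
    "<text><body><x:svg /></body></text></TEI>"
)
root = ET.fromstring(DOC)

# The mapping that lxml reports as root.nsmap in the session above.
nsmap = {None: "http://www.tei-c.org/ns/1.0", "x": "http://www.w3.org/2000/svg"}


def with_adhoc_prefix(nsmap, prefix):
    """Copy a namespace map, binding the default namespace to an explicit prefix."""
    fixed = {key: uri for key, uri in nsmap.items() if key is not None}
    if None in nsmap:
        fixed[prefix] = nsmap[None]
    return fixed


namespaces = with_adhoc_prefix(nsmap, "tei")

# Unprefixed names do not match elements in the default namespace ...
print(root.findall("./text/body"))
# ... so every step has to carry the ad-hoc prefix.
matches = root.findall("./tei:text/tei:body", namespaces=namespaces)
print([element.tag for element in matches])
```

Note that the returned mapping no longer records which URI was the default namespace, which is the loss of information mentioned above.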

    But this solution doesn't play well in generalised code like adapter classes, where it becomes tedious and error-prone because XPath expressions are not always identical (did I mention they are counter-intuitive to type?) and keeping track of namespace mappings across loosely coupled code elements introduces boilerplate.

    Ultimately, the interplay of document namespaces and XPath expressions is everything but pythonic and rather complicated than complex, though

    There should be one-- and preferably only one --obvious way to do it.

    The root of this issue is a flaw in the XPath 1.0 spec that libxml2 follows in its implementation:

    A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). It is an error if the QName has a prefix for which there is no namespace declaration in the expression context.

    While XML namespaces actually have a notion of an unaliased default namespace:

    If the attribute name matches DefaultAttName, then the namespace name in the attribute value is that of the default namespace in the scope of the element to which the declaration is attached.

    XPath 2.0 did eventually fix this:

    A QName in a name test is resolved into an expanded QName using the statically known namespaces in the expression context. It is a static error [err:XPST0081] if the QName has a prefix that does not correspond to any statically known namespace. An unprefixed QName, when used as a name test on an axis whose principal node kind is element, has the namespace URI of the default element/type namespace in the expression context; otherwise, it has no namespace URI.

    There's no XPath 2.0 implementation with Python bindings around (well, there is one for XQilla that returns raw strings and is far from lxml's capabilities), and it is very unlikely that one will be implemented, as the extension as a whole is large and probably needed by no one outside the XQuery/XSLT scene. libxml2 didn't intend to ten years ago, but hey, looking for a thesis to write?

    Thus I propose to backport that bug fix from XPath 2.0 to lxml's XPath interfaces with an opt-in smart_prefix option without considering the whole standard as

    practicality beats purity.

    Behind the scenes the ad-hoc prefix 'solution' described above is applied, but completely hidden from the client code.
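
A rough sketch of what "applied behind the scenes" could look like follows. This is not the implementation in this pull request: it uses the standard library's `xml.etree.ElementTree` so that it runs without lxml, it only handles simple location paths without predicates, and the helper names `add_default_prefix` and `find_with_default_ns` as well as the prefix `d` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# A step that is a bare element name: no prefix, axis, wildcard or function call.
_PLAIN_NAME = re.compile(r"^[A-Za-z_][\w.-]*$")


def add_default_prefix(path, prefix):
    """Prefix every unprefixed element name in a simple '/'-separated path."""
    steps = []
    for step in path.split("/"):
        if _PLAIN_NAME.match(step):
            step = f"{prefix}:{step}"
        steps.append(step)
    return "/".join(steps)


def find_with_default_ns(element, path, nsmap, prefix="d"):
    """Evaluate `path`, treating unprefixed names as being in nsmap[None].

    Assumption: `prefix` does not collide with a prefix already in `nsmap`.
    """
    namespaces = {key: uri for key, uri in nsmap.items() if key is not None}
    if nsmap.get(None) is not None:
        namespaces[prefix] = nsmap[None]
        path = add_default_prefix(path, prefix)
    return element.findall(path, namespaces=namespaces)


root = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:x="http://www.w3.org/2000/svg">'
    "<text><body><x:svg /></body></text></TEI>"
)
nsmap = {None: "http://www.tei-c.org/ns/1.0", "x": "http://www.w3.org/2000/svg"}

print(add_default_prefix("./text/body", "d"))  # ./d:text/d:body
print([e.tag for e in find_with_default_ns(root, "./text/body", nsmap)])
```

The caller keeps writing unprefixed paths and passes the namespace map unchanged, while the temporary prefix never leaves the helper.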

    This pull request demonstrates the design and isn't completed yet, at least these issues still need to be addressed:

    • documentation
    • predicates are handled in a rather hackish way and i have doubts that it works with more complex predicates
      • i'd appreciate test proposals for practical examples with such
      • support for predicates with the smart_prefix option could be dropped altogether, finer-grained selection is possible with Python and probably a common usage
    • should this even be the default behavior with opt-out? afaict it wouldn't break any code as supplying a namespace map with a default namespace (mapped to None) is currently invalid
      • i'd keep it out of XSLT anyway
    • should result elements from such queries have a property that stores the option? so later calls on .xpath() of these elements would behave the same if no smart_prefix option is provided
    • can regex.h be used directly from Cython? (that question is not specific to this pull request)

    btw, this is the first time i used Cython and my C usage was long ago, so i'm happy about any feedback for improvements.

    Now, let's have some fun:

    >>> root = etree.fromstring('<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:x="http://www.w3.org/2000/svg"><text><body><x:svg /></body></text></TEI>')
    >>> root.nsmap
    {None: 'http://www.tei-c.org/ns/1.0', 'x': 'http://www.w3.org/2000/svg'}
    >>> root.xpath('./text/body', namespaces=root.nsmap, smart_prefix=True)
    [<Element {http://www.tei-c.org/ns/1.0}body at 0x7f8e655587c8>]
    >>> root.xpath('./text/body', smart_prefix=True)
    [<Element {http://www.tei-c.org/ns/1.0}body at 0x7f8e655587c8>]
    

    (oh, the inplace build option fails on my local machine without a helpful message. does anyone have a hint on that?)

    invalid 
    opened by funkyfuture 10
  • AppVeyor CI: Update to Visual Studio 2022 image

    This PR updates the AppVeyor configuration to use the Visual Studio 2022 image.

    Update 2022-11-07: The Python 3.11 part of this PR has moved to a new PR, #360.

    opened by EwoutH 13
  • Xpath with namespace and position

    I noticed that it is not possible to use elem.find or elem.findall with an xpath that contains position indices if the method is called with the namespaces argument. This behavior has also been reported in Bug #1873886.

    It appears that during the tokenization of the xpath, the numbers are treated as tags, i.e. they are concatenated with the default namespace (when the function is called with namespaces). This results in a wrong path, in my opinion. For example:

    >>> from lxml import etree
    >>> doc = etree.XML("""
          <foo xmlns="http://example.com/foo">
            <bar>baz</bar>
          </foo>""")
    >>> path = "./bar[1]"
    >>> print(doc.find(path, namespaces={None:"http://example.com/foo"}))
    None
    

    The target element is not found here because the path that is used is effectively: ./{http://example.com/foo}bar[{http://example.com/foo}1]

    Changes:

    • I added a check during the tokenization of the xpath to determine whether the processed tag is a number to avoid concatenation with the namespace.
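
The kind of check described in this change can be sketched as follows. This is a simplified illustration and not lxml's actual tokenizer code; `qualify_name` is a hypothetical helper that decides whether a token from a path should be combined with the default namespace.

```python
def qualify_name(token, default_namespace):
    """Return `token` in Clark notation unless it must be left untouched.

    Position indices such as "1" are numbers, not tag names, so they must
    not be combined with the default namespace.
    """
    if not default_namespace:
        return token
    if token.isdigit():               # positional predicate, e.g. bar[1]
        return token
    if token in (".", "..", "*"):     # path operators and wildcard
        return token
    if token.startswith("{") or ":" in token:  # already qualified or prefixed
        return token
    return "{%s}%s" % (default_namespace, token)


NS = "http://example.com/foo"
print(qualify_name("bar", NS))  # {http://example.com/foo}bar
print(qualify_name("1", NS))    # 1
```

With such a check, the path from the example would be expanded to `./{http://example.com/foo}bar[1]`, with the index left as a plain number.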
    opened by knit-bee 1
  • Try and preserve the structure of the html during a diff

    There exists a bug in the current htmldiff code, whereby the generated diff changes the structure of the html (notice that the <div id="middle"> appears at the beginning instead of the middle):

    >>> from lxml.html import diff
    >>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>"
    >>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>"
    >>> diff.htmldiff(a, b)
    ('<div id="middle"> <div id="first"><ins>some old text</ins></div><ins>and new</ins> <del>some old</del> text</div><div id="last">more old text</div>')
    >>>
    
    

    This patchset is an attempt to fix that issue.

    opened by lonetwin 0
  • Validate that host_whitelist is not a string

    An attacker can use https:///evil.com to craft a malformed "hostless" URL that has netloc == '' -- and the empty string is contained in any string. Strings are not documented to be allowed in this config variable anyhow, so just raise a type error if someone passes in a string by accident.
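
The underlying problem can be reproduced with the standard library alone. The sketch below is not lxml's code; `validate_host_whitelist` is a hypothetical helper that shows the kind of type check described here.

```python
from urllib.parse import urlsplit


def validate_host_whitelist(host_whitelist):
    """Reject a bare string and return the hosts as a frozenset."""
    if isinstance(host_whitelist, (str, bytes)):
        raise TypeError("host_whitelist must be a collection of host names, not a string")
    return frozenset(host_whitelist)


# The malformed URL has an empty network location ...
netloc = urlsplit("https:///evil.com").netloc
print(repr(netloc))  # ''

# ... and the empty string is a substring of every string,
print(netloc in "example.com")  # True

# but it is not a member of a proper collection of host names.
print(netloc in validate_host_whitelist(["example.com"]))  # False
```

The membership test `in` means "substring" for a string but "element" for a list, tuple or set, which is why an accidental string value silently weakens the check.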

    (This is a breaking change for people who didn't follow the documented types, but shouldn't affect anyone else.)

    New test fails on current master.

    opened by timmc 1
  • Don't parse hostname from netloc manually; rely on urlsplit's result

    This manual parsing of netloc can be fooled by use of a userinfo component. SplitResult already has a hostname property.
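
A small standard-library example shows how a userinfo component can mislead manual parsing. The `naive_host` function is a hypothetical stand-in for "take everything before the first colon" and is not a copy of lxml's code; the host names are made up.

```python
from urllib.parse import urlsplit


def naive_host(url):
    """Hypothetical manual parsing: cut the netloc at the first ':'."""
    return urlsplit(url).netloc.lower().split(":", 1)[0]


def robust_host(url):
    """Rely on the hostname property of the SplitResult instead."""
    return urlsplit(url).hostname


# "trusted.example:pw" is the userinfo part; the real host follows the '@'.
url = "https://trusted.example:pw@evil.example/embed"

print(naive_host(url))   # trusted.example
print(robust_host(url))  # evil.example
```

Because `hostname` strips both the userinfo and the port, the whitelist comparison is made against the host that would actually be contacted.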

    New test test_host_whitelist_sneaky_userinfo fails on master.

    opened by timmc 1
Releases(lxml-4.9.2)
The awesome document factory

The Awesome Document Factory WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous s

Kozea 5.4k Jan 07, 2023
The lxml XML toolkit for Python

What is lxml? lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory

2.3k Jan 02, 2023
A library for converting HTML into PDFs using ReportLab

XHTML2PDF The current release of xhtml2pdf is xhtml2pdf 0.2.5. Release Notes can be found here: Release Notes As with all open-source software, its us

2k Dec 27, 2022
A jquery-like library for python

pyquery: a jquery-like library for python pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jq

Gael Pasgrimaud 2.2k Dec 29, 2022
Python binding to Modest engine (fast HTML5 parser with CSS selectors).

A fast HTML5 parser with CSS selectors using Modest engine. Installation From PyPI using pip: pip install selectolax Development version from github:

Artem Golubin 710 Jan 04, 2023
Pythonic HTML Parsing for Humans™

Requests-HTML: HTML Parsing for Humans™ This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. When us

Python Software Foundation 12.9k Jan 01, 2023
Generate HTML using python 3 with an API that follows the DOM standard specification.

Generate HTML using python 3 with an API that follows the DOM standard specification. A JavaScript API and tons of cool features. Can be used as a fast prototyping tool.

byteface 114 Dec 14, 2022
A python HTML builder library.

PyML A python HTML builder library. Goals Fully functional html builder similar to the javascript node manipulation. Implement an html parser that ret

Arjix 8 Jul 04, 2022
A HTML-code compiler-thing that lets you reuse HTML code.

RHTML RHTML stands for Reusable-Hyper-Text-Markup-Language, and is pronounced "Rech-tee-em-el" despite how its abbreviation is. As the name stands, RH

Duckie 4 Nov 15, 2021
Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

The Pallets Projects 514 Dec 31, 2022
Lektor-html-pretify - Lektor plugin to pretify the HTML DOM using Beautiful Soup

html-pretify Lektor plugin to pretify the HTML DOM using Beautiful Soup. How doe

Chaos Bodensee 2 Nov 08, 2022
Converts XML to Python objects

untangle Documentation Converts XML to a Python object. Siblings with similar names are grouped into a list. Children can be accessed with parent.chil

Christian Stefanescu 567 Nov 30, 2022
Python module that makes working with XML feel like you are working with JSON

xmltodict xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec": print(json.dumps(xmltod

Martín Blech 5k Jan 04, 2023
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

Bleach Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes. Bleach can also linkify text safely, appl

Mozilla 2.5k Dec 29, 2022
inscriptis -- HTML to text conversion library, command line client and Web service

inscriptis -- HTML to text conversion library, command line client and Web service A python based HTML to text conversion library, command line client

webLyzard technology 122 Jan 07, 2023
Dominate is a Python library for creating and manipulating HTML documents using an elegant DOM API

Dominate Dominate is a Python library for creating and manipulating HTML documents using an elegant DOM API. It allows you to write HTML pages in pure

Tom Flanagan 1.5k Jan 09, 2023
That project takes as input special TXT File, divides its content into list of HTML objects and then creates HTML file from them.

That project takes as input special TXT File, divides its content into list of HTML objects and then creates HTML file from them.

1 Jan 10, 2022
Modded MD conversion to HTML

MDPortal A module to convert a md-esque lang to html Basically I ruined md in an attempt to convert it to html Overview Here is a demo file from parse

Zeb 1 Nov 27, 2021
Standards-compliant library for parsing and serializing HTML documents and fragments in Python

html5lib html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all majo

1k Dec 27, 2022