inscriptis -- HTML to text conversion library, command line client and Web service

Overview

A python based HTML to text conversion library, command line client and Web service with support for nested tables, a subset of CSS and optional support for providing an annotated output.

Inscriptis is particularly well suited for applications that require high-performance, high-quality (i.e., layout-aware) text representations of HTML content, and will aid knowledge extraction and data science tasks conducted upon Web data.

Please take a look at the Rendering document for a demonstration of inscriptis' conversion quality.

A Java port of inscriptis 1.x is available here.

This document provides a short introduction to Inscriptis.

Statement of need - why inscriptis?

  1. Inscriptis provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements.

    Conversion quality becomes a factor once you need to move beyond simple HTML snippets. Non-specialized approaches and less sophisticated libraries do not correctly interpret HTML semantics and, therefore, fail to properly convert constructs such as itemizations, enumerations, and tables.

    Beautiful Soup's get_text() function, for example, converts the following HTML enumeration to the string firstsecond.

    <ul>
      <li>first</li>
      <li>second</li>
    </ul>

    Inscriptis, in contrast, not only returns the correct output

    * first
    * second
    

    but also supports much more complex constructs such as nested tables and also interprets a subset of HTML (e.g., align, valign) and CSS (e.g., display, white-space, margin-top, vertical-align, etc.) attributes that determine the text alignment. Any time the spatial alignment of text is relevant (e.g., for many knowledge extraction tasks, the computation of word embeddings and language models, and sentiment analysis) an accurate HTML to text conversion is essential.

  2. Inscriptis supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These rules might be used to

    • provide downstream knowledge extraction components with additional information that may be leveraged to improve their respective performance.
    • assist manual document annotation processes (e.g., for qualitative analysis or gold standard creation). Inscriptis supports multiple export formats such as XML, annotated HTML and the JSONL format that is used by the open source annotation tool doccano.
    • enable the use of Inscriptis for tasks such as content extraction (i.e., extracting task-specific relevant content from a Web page), which rely on information about the HTML document's structure.
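
The difference described in item 1 can be illustrated with a toy sketch that uses only the Python standard library. It is not how inscriptis or Beautiful Soup work internally; the class names `NaiveText` and `ListAwareText` are hypothetical and exist only to show why ignoring list semantics glues items together.

```python
from html.parser import HTMLParser


class NaiveText(HTMLParser):
    """Toy baseline: concatenates text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)

    def text(self):
        return "".join(self.parts)


class ListAwareText(NaiveText):
    """Toy renderer: starts a new bulleted line for every <li> element."""

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # The first item needs no leading line break.
            self.parts.append("\n* " if self.parts else "* ")


def render(parser_cls, html):
    """Feed the HTML to the given parser class and return its text."""
    parser = parser_cls()
    parser.feed(html)
    parser.close()
    return parser.text()


HTML = "<ul><li>first</li><li>second</li></ul>"
print(render(NaiveText, HTML))      # firstsecond
print(render(ListAwareText, HTML))  # "* first" and "* second" on separate lines
```

A real layout-aware converter additionally has to handle nesting, tables, whitespace rules and CSS, which is the part inscriptis takes care of.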

Installation

At the command line:

$ pip install inscriptis

Or, if you don't have pip installed:

$ easy_install inscriptis

If you want to install from the latest sources, you can do:

$ git clone https://github.com/weblyzard/inscriptis.git
$ cd inscriptis
$ python setup.py install

Python library

Embedding inscriptis into your code is easy, as outlined below:

import urllib.request
from inscriptis import get_text

url = "https://www.fhgr.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

Standalone command line client

The command line client converts HTML files or text retrieved from Web pages to the corresponding text representation.

Command line parameters

The inscript.py command line client supports the following parameters:

usage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a] [-r ANNOTATION_RULES] [-p POSTPROCESSOR] [--indentation INDENTATION]
                   [--table-cell-separator TABLE_CELL_SEPARATOR] [-v]
                   [input]

Convert the given HTML document to text.

positional arguments:
  input                 Html input either from a file or a URL (default:stdin).

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default:stdout).
  -e ENCODING, --encoding ENCODING
                        Input encoding to use (default:utf-8 for files; detected server encoding for Web URLs).
  -i, --display-image-captions
                        Display image captions (default:false).
  -d, --deduplicate-image-captions
                        Deduplicate image captions (default:false).
  -l, --display-link-targets
                        Display link targets (default:false).
  -a, --display-anchor-urls
                        Display anchor URLs (default:false).
  -r ANNOTATION_RULES, --annotation-rules ANNOTATION_RULES
                        Path to an optional JSON file containing rules for annotating the retrieved text.
  -p POSTPROCESSOR, --postprocessor POSTPROCESSOR
                        Optional component for postprocessing the result (html, surface, xml).
  --indentation INDENTATION
                        How to handle indentation (extended or strict; default: extended).
  --table-cell-separator TABLE_CELL_SEPARATOR
                        Separator to use between table cells (default: three spaces).
  -v, --version         display version information

HTML to text conversion

Convert the given page to text and output the result to the screen:

$ inscript.py https://www.fhgr.ch

Convert the file to text and save the output to fhgr.txt:

$ inscript.py fhgr.html -o fhgr.txt

Convert HTML provided via stdin and save the output to output.txt:

$ echo '<body><p>Make it so!</p></body>' | inscript.py -o output.txt

HTML to annotated text conversion

Convert and annotate HTML from a Web page using the provided annotation rules.

Download the example annotation-profile.json, save it to your working directory, and run:

$ inscript.py https://www.fhgr.ch -r annotation-profile.json

The annotation rules are specified in annotation-profile.json:

{
 "h1": ["heading", "h1"],
 "h2": ["heading", "h2"],
 "b": ["emphasis"],
 "div#class=toc": ["table-of-contents"],
 "#class=FactBox": ["fact-box"],
 "#cite": ["citation"]
}

The dictionary maps an HTML tag and/or attribute to the annotations inscriptis should provide for them. In the example above, for instance, the tag h1 yields the annotations heading and h1, a div tag with a class that contains the value toc results in the annotation table-of-contents, and all tags with a cite attribute are annotated with citation.
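
The rule syntax described above can be sketched with a few lines of standard-library Python. This is only an illustration of the key format, not the matching code used by inscriptis; the helper names `parse_rule_key`, `rule_matches` and `annotations_for` are hypothetical, and the sketch assumes that "contains the value" means the value is one of the attribute's whitespace-separated tokens.

```python
def parse_rule_key(key):
    """Split 'tag#attr=value' into (tag, attr, value); missing parts are None."""
    tag, _, attr_part = key.partition("#")
    attr, _, value = attr_part.partition("=")
    return tag or None, attr or None, value or None


def rule_matches(key, tag, attrs):
    """Return True if the rule key applies to an element with this tag and attributes."""
    rule_tag, rule_attr, rule_value = parse_rule_key(key)
    if rule_tag is not None and rule_tag != tag:
        return False
    if rule_attr is None:
        return True
    if rule_attr not in attrs:
        return False
    # Assumption: the value must be one of the whitespace-separated tokens.
    return rule_value is None or rule_value in attrs[rule_attr].split()


def annotations_for(rules, tag, attrs):
    """Collect the labels of all rules that match the element."""
    labels = []
    for key, rule_labels in rules.items():
        if rule_matches(key, tag, attrs):
            labels.extend(rule_labels)
    return labels


RULES = {
    "h1": ["heading", "h1"],
    "h2": ["heading", "h2"],
    "b": ["emphasis"],
    "div#class=toc": ["table-of-contents"],
    "#class=FactBox": ["fact-box"],
    "#cite": ["citation"],
}

print(annotations_for(RULES, "h1", {}))                              # ['heading', 'h1']
print(annotations_for(RULES, "div", {"class": "toc wide"}))          # ['table-of-contents']
print(annotations_for(RULES, "q", {"cite": "https://example.org"}))  # ['citation']
```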

Given these annotation rules the HTML file

<h1>Chur</h1>
<b>Chur</b> is the capital and largest town of the Swiss canton of the
Grisons and lies in the Grisonian Rhine Valley.

yields the following JSONL output

{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
          of the Grisons and lies in the Grisonian Rhine Valley.",
 "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}

The provided list of labels contains all annotated text elements with their start index, end index and the assigned label.
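
As a minimal sketch of how a downstream component might consume such a record, the snippet below loads the example line with the standard `json` module and slices the text with each label's offsets. It assumes, based on the example above, that the end index is exclusive (as in Python slices); the variable names are illustrative only.

```python
import json

# The JSONL record from the example above, written as a single line.
line = (
    r'{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton '
    r'of the Grisons and lies in the Grisonian Rhine Valley.", '
    r'"label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}'
)

record = json.loads(line)
text = record["text"]

# Pair every label with the text span it covers.
surface_forms = [(label, text[start:end]) for start, end, label in record["label"]]

for label, surface in surface_forms:
    print(f"{label}: {surface!r}")
```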

Annotation postprocessors

Annotation postprocessors convert annotations into formats that are suitable for your particular application. Postprocessors can be specified with the -p or --postprocessor command line argument:

$ inscript.py https://www.fhgr.ch \
        -r ./examples/annotation-profile.json \
        -p surface

Output:

{"text": "  Chur\n\n  Chur is the capital and largest town of the Swiss
          canton of the Grisons and lies in the Grisonian Rhine Valley.",
 "label": [[0, 6, "heading"], [8, 14, "emphasis"]],
 "tag": "<heading>Chur</heading>\n\n<emphasis>Chur</emphasis> is the
        capital and largest town of the Swiss canton of the Grisons and
        lies in the Grisonian Rhine Valley."}

Currently, inscriptis supports the following postprocessors:

  • surface: returns a list of mappings between the annotation's surface form and its label:

    [
       ['heading', 'Chur'],
       ['emphasis', 'Chur']
    ]
    
  • xml: returns an additional annotated text version:

    
    <heading>Chur</heading>

    <emphasis>Chur</emphasis> is the capital and largest town of the Swiss
    canton of the Grisons and lies in the Grisonian Rhine Valley.
    
  • html: creates an HTML file which contains the converted text and highlights all annotations as outlined below:

Annotations extracted from the Wikipedia entry for Chur with the `--postprocessor html` postprocessor.

Snippet of the rendered HTML file created with the following command line options and annotation rules:

inscript.py --annotation-rules ./wikipedia.json \
            --postprocessor html \
            https://en.wikipedia.org/wiki/Chur.html

Annotation rules encoded in the wikipedia.json file:

{
  "h1": ["heading"],
  "h2": ["heading"],
  "h3": ["subheading"],
  "h4": ["subheading"],
  "h5": ["subheading"],
  "i": ["emphasis"],
  "b": ["bold"],
  "table": ["table"],
  "th": ["tableheading"],
  "a": ["link"]
}

Web Service

The Flask Web Service translates HTML pages to the corresponding plain text.

Additional Requirements

  • python3-flask

Startup

Start the inscriptis Web service with the following command:

$ export FLASK_APP="inscriptis.service.web"
$ python3 -m flask run

Usage

The Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified in the Content-Type header (UTF-8 in the example below):

$ curl -X POST  -H "Content-Type: text/html; encoding=UTF8"  \
        --data-binary @test.html  http://localhost:5000/get_text

The service also supports a version call:

$ curl http://localhost:5000/version
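
The same request can be issued from Python. The sketch below uses only `urllib.request` and mirrors the curl call above; the function names `build_get_text_request` and `fetch_text` are hypothetical, the service is assumed to run at http://localhost:5000, and decoding the response as UTF-8 is an assumption.

```python
import urllib.request


def build_get_text_request(html_bytes, base_url="http://localhost:5000"):
    """Build the POST request for the /get_text endpoint without sending it."""
    return urllib.request.Request(
        url=f"{base_url}/get_text",
        data=html_bytes,
        # Same header as in the curl example above.
        headers={"Content-Type": "text/html; encoding=UTF8"},
        method="POST",
    )


def fetch_text(html_bytes, base_url="http://localhost:5000"):
    """Send the HTML to a running service and return the converted text."""
    request = build_get_text_request(html_bytes, base_url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the service answers with UTF-8 encoded text.
        return response.read().decode("utf-8")


# Usage, with the Web service running locally:
#   with open("test.html", "rb") as f:
#       print(fetch_text(f.read()))
```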

Example annotation profiles

The following section provides a number of example annotation profiles illustrating the use of Inscriptis' annotation support. The examples present the used annotation rules and an image that highlights a snippet with the annotated text on the converted web page, which has been created using the HTML postprocessor as outlined in Section annotation postprocessors.

Wikipedia tables and table metadata

The following annotation rules extract tables from Wikipedia pages, and annotate table headings that are typically used to indicate column or row headings.

{
   "table": ["table"],
   "th": ["tableheading"],
   "caption": ["caption"]
}

The figure below outlines an example table from Wikipedia that has been annotated using these rules.

Table and table metadata annotations extracted from the Wikipedia entry for Chur.

References to entities, missing entities and citations from Wikipedia

This profile extracts references to Wikipedia entities, missing entities and citations. Please note that the profile isn't perfect, since it also annotates [ edit ] links.

{
   "a#title": ["entity"],
   "a#class=new": ["missing"],
   "#class=reference": ["citation"]
}

The figure shows entities and citations that have been identified on a Wikipedia page using these rules.

Metadata on entries, missing entries and citations extracted from the Wikipedia entry for Chur.

Posts and post metadata from the XDA developer forum

The annotation rules below extract posts with metadata on the post's time, user and the user's job title from the XDA developer forum.

{
    "article#class=message-body": ["article"],
    "li#class=u-concealed": ["time"],
    "#itemprop=name": ["user-name"],
    "#itemprop=jobTitle": ["user-title"]
}

The figure illustrates the annotated metadata on posts from the XDA developer forum.

Posts and post metadata extracted from the XDA developer forum.

Code and metadata from Stackoverflow pages

The rules below extract code and metadata on users and comments from Stackoverflow pages.

{
   "code": ["code"],
   "#itemprop=dateCreated": ["creation-date"],
   "#class=user-details": ["user"],
   "#class=reputation-score": ["reputation"],
   "#class=comment-date": ["comment-date"],
   "#class=comment-copy": ["comment-comment"]
}

Applying these rules to a Stackoverflow page on text extraction from HTML yields the following snippet:

Code and metadata from Stackoverflow pages.

Advanced topics

Annotated text

Inscriptis can provide annotations alongside the extracted text which allows downstream components to draw upon semantics that have only been available in the original HTML file.

The extracted text and annotations can be exported in different formats, including the popular JSONL format which is used by doccano.

Example output:

{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
          of the Grisons and lies in the Grisonian Rhine Valley.",
 "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}

The output above is produced if inscriptis is run with the following annotation rules:

{
 "h1": ["heading", "h1"],
 "b": ["emphasis"]
}

The code below demonstrates how inscriptis' annotation capabilities can be used within a program:

import urllib.request
from inscriptis import get_annotated_text, ParserConfig

url = "https://www.fhgr.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

rules = {'h1': ['heading', 'h1'],
         'h2': ['heading', 'h2'],
         'b': ['emphasis'],
         'table': ['table']
        }

output = get_annotated_text(html, ParserConfig(annotation_rules=rules))
print("Text:", output['text'])
print("Annotations:", output['label'])

Fine tuning

The following options are available for fine tuning inscriptis' HTML rendering:

  1. More rigorous indentation: call inscriptis.get_text() with the parameter indentation='extended' to also use indentation for tags such as <div> and <span> that do not provide indentation in their standard definition. This strategy is the default in inscript.py and many other tools such as Lynx. If you do not want extended indentation you can use the parameter indentation='standard' instead.
  2. Overwriting the default CSS definition: inscriptis uses CSS definitions that are maintained in inscriptis.css.CSS for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:
    from lxml.html import fromstring
    from inscriptis.css_profiles import CSS_PROFILES, HtmlElement
    from inscriptis.html_engine import Inscriptis
    from inscriptis.html_properties import Display
    from inscriptis.model.config import ParserConfig

    # create a custom CSS based on the default style sheet and change the
    # rendering of `div` and `span` elements
    css = CSS_PROFILES['strict'].copy()
    css['div'] = HtmlElement(display=Display.block, padding=2)
    css['span'] = HtmlElement(prefix=' ', suffix=' ')

    html_tree = fromstring(html)
    # create a parser using a custom css
    config = ParserConfig(css=css)
    parser = Inscriptis(html_tree, config)
    text = parser.get_text()

Citation

There is a Journal of Open Source Software paper you can cite for Inscriptis:

@article{Weichselbraun2021,
  doi = {10.21105/joss.03557},
  url = {https://doi.org/10.21105/joss.03557},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {66},
  pages = {3557},
  author = {Albert Weichselbraun},
  title = {Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web},
  journal = {Journal of Open Source Software}
}

Changelog

A full list of changes can be found in the release notes.

Comments
  • Strange memory leak(?) consuming behaviour

    2.2.0

    Update: this looks like a Python lxml memory leak issue: https://medium.com/devopss-hole/python-lxml-memory-leak-b8d0b1000dc7

    For some background, I'm using your wonderful library in my Flask application, which means that the process does not get restarted. I've tried solving this by moving the inscriptis step to its own thread, but it still seems to make the whole app bleed memory.

    #!/usr/bin/python3
    
    import time
    
    def leak_memory():
      from inscriptis import get_text
      with open('leaky.html', 'r') as f:
        s = f.read()
      text_content = get_text(s)
    
    leak_memory()
    leak_memory()
    leak_memory()
    leak_memory()
    
    print ("Done, now look at memory usage")
    time.sleep(20)
    

    See the script and the test HTML here

    leaky.html.zip

    What I'm seeing is that on some more complex HTML, it will consume something like 150Mb on the first get_text(..) call, and then it will never let the process release that memory. That's the problem for me.

    • I've tried using gc.collect() after get_text() but it never releases the memory
    • I've tried del text_content etc., but that didn't help

    Any ideas? Is this a bug?

    Happy to throw a few dollars across for supporting your wonderful project!

    opened by dgtlmoon 5
  • Blank/empty HTML comments result in loss of information in that element

    version 2.2.0

    from inscriptis import get_text
    
    x = get_text('<html><body><span class="price-detailed__unit-price"><span>$<!-- -->90<!-- -->.<!-- -->74</span></span></body></html>')
    print (x)
    # $
    
    x = get_text('<html><body><span class="price-detailed__unit-price"><span>$90.74</span></span></body></html>')
    print (x)
    # 90.74
    

    I have no idea why this is the case :/ I noticed that some sites may use this to stop scrapers.

    opened by dgtlmoon 4
  • Converting tables to “running text”

    I have a table, for example:

    <div>
    <table cellspacing="0" style="-aw-border-insideh:0.5pt single #ffffff; -aw-border-insidev:0.5pt single #ffffff; border-collapse:collapse">
    	<tbody>
    		<tr>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
    			<p><strong>Product</strong></p>
    			</td>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
    			<p><strong>Size</strong></p>
    			</td>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
    			<p><strong>Price</strong></p>
    			</td>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
    			<p><strong>Location</strong></p>
    			</td>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.4pt">
    			<p><strong>Comment</strong></p>
    			</td>
    		</tr>
    		<tr>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
    			<p><strong>A</strong></p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>50X20cm</p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>55$</p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>IL</p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.4pt">
    			<p>Text text&hellip;</p>
    			</td>
    		</tr>
    		<tr>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
    			<p><strong>B</strong></p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>50X20cm</p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>55$</p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>BK</p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.4pt">
    			<p>Text text&hellip;</p>
    			</td>
    		</tr>
    		<tr>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
    			<p><strong>C</strong></p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>50X20cm</p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>55$</p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>LM</p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.4pt">
    			<p>Text text&hellip;</p>
    			</td>
    		</tr>
    		<tr>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
    			<p><strong>D</strong></p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>50X20cm</p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>55$</p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>LM</p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.4pt">
    			<p>Text text&hellip;</p>
    			</td>
    		</tr>
    		<tr>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
    			<p><strong>E</strong></p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>50X20cm</p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>55$</p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>PP</p>
    			</td>
    			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.4pt">
    			<p>Text text&hellip;</p>
    			</td>
    		</tr>
    		<tr>
    			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
    			<p><strong>f</strong></p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>50X20cm</p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>55$</p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
    			<p>RXS</p>
    			</td>
    			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.4pt">
    			<p>Text text&hellip;</p>
    			</td>
    		</tr>
    	</tbody>
    </table>
    
    <p>&nbsp;</p>
    </div>
    
    

    Inscriptis transforms this into:

        Product  Size     Price  Location  Comment   
                                                     
        A        50X20cm  55$    IL        Text text…
                                                     
        B        50X20cm  55$    BK        Text text…
                                                     
        C        50X20cm  55$    LM        Text text…
                                                     
        D        50X20cm  55$    LM        Text text…
                                                     
        E        50X20cm  55$    PP        Text text…
                                                     
        f        50X20cm  55$    RXS       Text text…
    

    Which is great. However, I want to convert it into running text:

    Product: A Size: 50X20cm Price: 55$ Location: IL Comment: Text text…                                      
    Product: B Size: 50X20cm Price: 55$ Location: BK Comment: Text text…           
    Product: C Size: 50X20cm Price: 55$ Location: LM Comment: Text text…           
    Product: D Size: 50X20cm Price: 55$ Location: LM Comment: Text text…           
    Product: E Size: 50X20cm Price: 55$ Location: PP Comment: Text text…           
    Product: f Size: 50X20cm Price: 55$ Location: RXS Comment: Text text…
    

    That means that, for each row, the column name is added before the value. Can this be done using inscriptis?
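
One possible workaround outside of inscriptis is sketched below. It uses only the standard library's `html.parser`, treats the first table row as the header, and does not handle nested tables or cells spanning several columns. The names `TableRows` and `table_to_running_text` are hypothetical and not part of the inscriptis API.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and normalize whitespace within the cell.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def table_to_running_text(html):
    """Return one 'Header: value' line per body row of the first-level table."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    return "\n".join(
        " ".join(f"{name}: {value}" for name, value in zip(header, row))
        for row in body
    )


SAMPLE = (
    "<table><tr><td><p><strong>Product</strong></p></td><td><p>Size</p></td></tr>"
    "<tr><td><p><strong>A</strong></p></td><td><p>50X20cm</p></td></tr>"
    "<tr><td><p><strong>B</strong></p></td><td><p>50X20cm</p></td></tr></table>"
)
print(table_to_running_text(SAMPLE))
```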

    opened by omri-suissa-clearmash 4
  • [discussion] compared to other tools (links)

    I wanted to add this as a discussion, but can't find the tab. I thought the comparison was interesting, although links is a different tool (command line only).

    Using links, it takes 0.13s, vs nearly 3s for pip's inscript.py (23 times faster/slower).

    links uses 16Mb vs inscript.py using ~178Mb.

    $ /usr/bin/time -v links -dump ./leaky.html > links.txt
            Command being timed: "links -dump ./leaky.html"
            User time (seconds): 0.08
            System time (seconds): 0.01
            Percent of CPU this job got: 100%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.10
            Average shared text size (kbytes): 0
            Average unshared data size (kbytes): 0
            Average stack size (kbytes): 0
            Average total size (kbytes): 0
            Maximum resident set size (kbytes): 15964
            Average resident set size (kbytes): 0
            Major (requiring I/O) page faults: 0
            Minor (reclaiming a frame) page faults: 4227
            Voluntary context switches: 1
            Involuntary context switches: 2
            Swaps: 0
            File system inputs: 0
            File system outputs: 0
            Socket messages sent: 0
            Socket messages received: 0
            Signals delivered: 0
            Page size (bytes): 4096
            Exit status: 0
    
    
    $ cat leaky.html |time -v inscript.py  > inscript.txt
            Command being timed: "inscript.py"
            User time (seconds): 2.99
            System time (seconds): 0.08
            Percent of CPU this job got: 99%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.08
            Average shared text size (kbytes): 0
            Average unshared data size (kbytes): 0
            Average stack size (kbytes): 0
            Average total size (kbytes): 0
            Maximum resident set size (kbytes): 178232
            Average resident set size (kbytes): 0
            Major (requiring I/O) page faults: 0
            Minor (reclaiming a frame) page faults: 54636
            Voluntary context switches: 55
            Involuntary context switches: 8
            Swaps: 0
            File system inputs: 0
            File system outputs: 1136
            Socket messages sent: 0
            Socket messages received: 0
            Signals delivered: 0
            Page size (bytes): 4096
            Exit status: 0
    
    

    inscriptis adds a lot of white space at the start before the content, but links seems to nail it in one.

    If I trim the whitespace, links shows the 'visible' representation, whilst inscript.py will attempt to show the field values as one line.

    opened by dgtlmoon 3
  • Exception handling

    This invalid segment of real-life HTML crashes the entire module (see below). A better behavior would be to swallow the error and issue a warning.

    <p style="margin:0;padding:0;margin: 0cm; margin-bottom: ..0001pt; -ms-word-wrap: break-word;"><span style="font-size: 10.0pt; font-family: \'Arial\',sans-serif; color: black;">
    
    File "/emails/venv/lib/python3.10/site-packages/inscriptis/__init__.py", line 104, in get_text
        return Inscriptis(html_tree, config).get_text() if html_tree is not None \
      File "/emails/venv/lib/python3.10/site-packages/inscriptis/html_engine.py", line 81, in __init__
        self._parse_html_tree(html_tree)
      File "/emails/venv/lib/python3.10/site-packages/inscriptis/html_engine.py", line 100, in _parse_html_tree
        self._parse_html_tree(node)
      File "/emails/venv/lib/python3.10/site-packages/inscriptis/html_engine.py", line 100, in _parse_html_tree
        self._parse_html_tree(node)
      File "/emails/venv/lib/python3.10/site-packages/inscriptis/html_engine.py", line 100, in _parse_html_tree
        self._parse_html_tree(node)
      [Previous line repeated 2 more times]
      File "/emails/venv/lib/python3.10/site-packages/inscriptis/html_engine.py", line 93, in _parse_html_tree
        self.handle_starttag(tree.tag, tree.attrib)
      File "/emails/venv/lib/python3.10/site-packages/inscriptis/html_engine.py", line 135, in handle_starttag
        self.apply_attributes(attrs, html_element=self.css.get(
      File "/emails/venv/lib/python3.10/site-packages/inscriptis/model/attribute.py", line 60, in apply_attributes
        self.attribute_mapping[attr_name](attr_value, html_element)
      File "/emails/venv/lib/python3.10/site-packages/inscriptis/model/css.py", line 43, in attr_style
        apply_style(value, html_element)
      File "/emails/venv/lib/python3.10/site-packages/inscriptis/model/css.py", line 101, in attr_margin_bottom
        html_element.margin_after = CssParse._get_em(value)
      File "/emails/venv/lib/python3.10/site-packages/inscriptis/model/css.py", line 61, in _get_em
        value = float(_m.group(1))
    ValueError: could not convert string to float: '..0001'
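
The behavior requested above (swallow the error and issue a warning) could look roughly like the following sketch. It is not the code in inscriptis' css.py; the function name `parse_css_number`, the regular expression and the fallback value of 0.0 are assumptions made for illustration.

```python
import re
import warnings

# Hypothetical pattern: a run of digits and dots, optionally signed.
_NUMBER = re.compile(r"(-?[\d.]+)")


def parse_css_number(value, default=0.0):
    """Return the numeric part of a CSS length, or `default` if it is invalid."""
    match = _NUMBER.search(value)
    if match is None:
        return default
    try:
        return float(match.group(1))
    except ValueError:
        # Malformed input such as '..0001pt' ends up here instead of crashing.
        warnings.warn(f"Ignoring invalid CSS length {value!r}", stacklevel=2)
        return default


print(parse_css_number("10.0pt"))    # 10.0
print(parse_css_number("..0001pt"))  # 0.0, plus a warning
```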
    
    opened by crtnx 3
  • Display links config

    Hi there,

    First, thanks a lot for the wonderful work. Second, treat the issue I am describing below more like an enhancement.

    My problem is related to the way of displaying links when it is configured so. There is no way to configure how I want to see them. When enabled, an output includes both label and link, but for my purposes I want to see links only. I've looked at the source code and it is hardcoded this way...

    def _start_a(self, attrs):
        self.link_target = ''
        if self.config.display_links:
            self.link_target = attrs.get('href', '')
        if self.config.display_anchors:
            self.link_target = self.link_target or attrs.get('name', '')

        if self.link_target:
            self.tags[-1].write('[')

    def _end_a(self):
        if self.link_target:
            self.tags[-1].write(']({0})'.format(self.link_target))
    

    Please provide a way to display links only, without labels, or a way to override the default behavior for the A element with a custom function. And keep up the good work!
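
One possible workaround is to post-process the rendered text. The sketch below assumes the `[label](target)` output format written by the `_start_a`/`_end_a` snippet above. `links_only` is a hypothetical helper, not part of inscriptis, and it does not handle labels containing `]` or targets containing `)`.

```python
import re

# Matches the "[label](target)" pattern emitted for links.
_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]*)\)")


def links_only(text: str) -> str:
    """Replace every "[label](target)" occurrence with its bare target."""
    return _LINK.sub(lambda m: m.group(2), text)
```

Text without links passes through unchanged, so the helper can be applied to any converted document.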

    opened by crtnx 3
  • Mixed text in extraction of table with span

    It seems that inscriptis does something strange when extracting text from the following document:

    GoldDocument: https://gitlab.semanticlab.net/careercoach/page-segmentation/-/blob/master/corpus/goldDocuments/www.computer-studio.ch_i283260313085922574995599451274802310613.json
    URL: https://www.computer-studio.ch/schulung/ecdl-zertifikat/ecdlbase/

    The keyword "Voraussetzung" and the corresponding text content are part of table cells. However, the extracted text is a mix of both:

    Current: "Modul \u00abPC-Einf\u00fchrung mit Windows\u00bb oder gleichwertige Kenntnisse \nVoraussetzung (Auf Teilnehmende, welche die erforderlichen Vorkenntnisse nicht besitzen, kann keine R\u00fccksicht genommen werden!)\n"

    Gold: "Voraussetzung Modul \u00abPC-Einf\u00fchrung mit Windows\u00bb oder gleichwertige Kenntnisse (Auf Teilnehmende, welche die erforderlichen Vorkenntnisse nicht besitzen, kann keine R\u00fccksicht genommen werden!)\n"

    opened by sudoale 2
  • Mixing content of columns in a forum

    For some websites, inscriptis mixes the content of columns. Example: https://bpdfamily.com/message_board/index.php?topic=343886.0

    The first post of the text displayed by inscriptis: I don't even know where/how to start so I apologize if this is a bit rambly before it makes a point but I feel the need to preface certain things. What is your sexual orientation: Straight I've never been good with women and because of that I have chronic self-esteem issues, I wouldn't say they are generally debilitating but I certainly have issues when it comes to courting women. Who in your life has "personality" issues: Romantic partner I've had 2 real relationships in my life:

    opened by rogerwaldvogel 2
  • Concatenated words after stripping HTML tags

    opened by sandrohoerler 2
  • Empty row bug

    Index out of bounds Error

    • https://github.com/weblyzard/inscriptis/blob/master/src/inscriptis/html_properties.py#L30
    • self.rows[-1].columns[-1] += text
    • Possible fix:
                if not self.rows:
                    self.add_row()
                if not self.rows[-1].columns:
                    self.add_column()    
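
The proposed guard can be sketched in isolation as follows. `Table` and `Row` are simplified stand-ins written for illustration; only the names `rows`, `columns`, `add_row` and `add_column` come from the snippet above, and `add_text` is an assumed name for the method containing the failing line.

```python
from typing import List


class Row:
    def __init__(self) -> None:
        self.columns: List[str] = []


class Table:
    """Simplified stand-in for the table model referenced in the report."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def add_row(self) -> None:
        self.rows.append(Row())

    def add_column(self) -> None:
        self.rows[-1].columns.append("")

    def add_text(self, text: str) -> None:
        # Guard: text may arrive before any row or cell has been opened.
        if not self.rows:
            self.add_row()
        if not self.rows[-1].columns:
            self.add_column()
        self.rows[-1].columns[-1] += text
```

Without the two checks, calling `add_text` on a table that has no rows or cells yet raises an `IndexError`.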
    

    Sample from dragnet sample

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
        <HTML>
         <HEAD>
          <TITLE>Drop cap initial letter</TITLE>
          <STYLE type="text/css">
           P              { font-size: 12pt; line-height: 1.2 }
           P:first-letter { font-size: 200%; font-style: italic;
                            font-weight: bold; float: left }
           SPAN           { text-transform: uppercase }
          </STYLE>
         </HEAD>
         <BODY>
         <h1>Some more significant words here</h1>
          <P><SPAN><b>The first<b></SPAN> few words of an article
            in The Economist.</p>
            <p>&nbsp;</p>
    
    
         </BODY>
    
        </HTML>
    

    Cheers, Lucas

    opened by Lucas-Gerrand 2
  • build(deps): bump python from 3.10.7-slim-bullseye to 3.11.0-slim-bullseye

    Bumps python from 3.10.7-slim-bullseye to 3.11.0-slim-bullseye.

    dependencies docker 
    opened by dependabot[bot] 1
  • detection of (almost) hidden text in html

    Hi! I'm developing spam filters and have to convert HTML emails to plain text for analysis. I've used html2text and later my own simplified implementation, but inscriptis looks even better!

    Is it possible to implement optional filtering/ignoring of hidden text parts? By hidden I mean text written in a very small font size, or in a font color equal (or close) to the background color. Sometimes this is defined in CSS/style tags, sometimes in the attributes of span tags. This technique is often used on webpages and in spam emails to fool search engines and spam filters with fake content that is not visible to human viewers.

    Here is a sample: http://thot.banki.hu/deepspam/poison.html
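
A rough sketch of such a filter, restricted to inline `style` attributes, is shown below. `looks_hidden` and its `min_size` threshold are hypothetical names chosen for illustration, not an inscriptis feature. The heuristic only covers tiny font sizes and explicit hiding; comparing the text color with the background color is not attempted.

```python
import re

_FONT_SIZE = re.compile(r"font-size\s*:\s*([0-9]*\.?[0-9]+)\s*(px|pt)", re.IGNORECASE)
_HIDDEN = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)


def looks_hidden(style: str, min_size: float = 4.0) -> bool:
    """Heuristically decide whether an inline style hides its text."""
    if _HIDDEN.search(style):
        return True
    match = _FONT_SIZE.search(style)
    # Treat fonts smaller than min_size (px or pt) as effectively invisible.
    return bool(match) and float(match.group(1)) < min_size
```

Elements whose style matches could then be skipped before the text conversion.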

    opened by arpitest 2
Releases(2.3.2)
  • 2.3.2(Dec 7, 2022)

  • 2.3.1(Aug 29, 2022)

  • 2.3.0(Aug 2, 2022)

    • fix: correctly handle HTML comments used to confuse HTML to text conversion (fixes #45).
    • fix: updated unittests to correctly work with lxml in Ubuntu 22.04.
    • add: updated and extended flake8 testing.
  • 2.2.0(Oct 22, 2021)

    • support custom HTML table separators (addresses #29).
    • extended documentation on the command line client and added a link to the JOSS paper on inscriptis.
    • officially support Python 3.10 and add it to the build pipeline.
    • fixed dependency resolution for tox builds.
  • 2.1.1(Oct 11, 2021)

    • improved documentation based on feedback provided by @reality, @rlskoeser and @sbenthall as part of the Journal of Open Source Software review process.
    • the Inscriptis web service is now included in the Python package and can be started with
       export FLASK_APP="inscriptis.service.web"
       python3 -m flask run
      
  • 2.1.0(Oct 11, 2021)

    • improved documentation based on feedback provided by @reality, @rlskoeser and @sbenthall as part of the Journal of Open Source Software review process.
    • the Inscriptis web service is now included in the Python package and can be started with
       export FLASK_APP="inscriptis.service.web"
       python3 -m flask run
      
  • 2.0.0(Jul 12, 2021)

    Changes

    HTML parsing:

    • new: improved model for handling text blocks and lines
    • chg: improved HTML parsing of tables, enumerations and margins; fixed borderline cases
    • chg: improved whitespace handling
    • add: cover more borderline cases with unit tests

    Inscriptis core:

    • new: annotation support
    • new: processing of annotation rules and annotation output
    • new: type hints
    • add: extended and improved documentation

    Inscript command line client:

    • new: added --annotation-rules option for annotation support.
    • new: added --post-processor option to export and visualize annotations (HTML, XML and surface form export)
    • chg: apply --encoding to Web URLs as well

    Misc:

    • chg: migrated to the semantic versioning schema described on https://semver.org/ for versioning.

    Note

    In terms of functionality, this release corresponds to Inscriptis 2.0rc2.

  • 2.0rc2(Jul 10, 2021)

    Please refer to https://github.com/weblyzard/inscriptis/releases/tag/2.0rc1 for a list of all new features. This release candidate fixes the following issues in rc1:

    • fixed annotations for some borderline cases
    • improved documentation compared to 2.0rc1
  • 2.0rc1(Jun 30, 2021)

    1. HTML parsing:

      • new: new model for handling blocks and lines
      • chg: improved HTML parsing of tables, enumerations and margins; fixed borderline cases
      • chg: improved whitespace handling
      • add: cover more borderline cases with unit tests
    2. Inscriptis core:

      • new: support for annotation rules and annotation output
      • new: annotation post-processors (html, xml, surface form)
      • new: type hints
      • chg: extended and improved documentation
    3. Inscript command line client:

      • chg: apply --encoding to Web URLs as well
  • 1.2(May 14, 2021)

    • tables: add support for vertical (valign, css: vertical-align) and horizontal (align) cell alignment (fixes: #33)
    • improved handling of HTML attributes and styles
    • code cleanup
    • migrated build from travis to github actions
  • 1.1.2(Jan 4, 2021)

    • ignore top margins at the beginning of a document.
    • more liberal licensing:
      • the license change has been triggered by another project that created a Java port of inscriptis.
      • to facilitate the free sharing of code and ideas between our two projects, we have (i) obtained the permission of all contributors for a license change, and (ii) changed the inscriptis license to the "Apache License 2.0".
  • 1.1.1(Dec 8, 2020)

    • minor performance improvements and code optimizations
    • added Python 3.9 test environment
    • improved test coverage
    • updated package metadata
    • improved tox configuration
  • 1.1(May 20, 2020)

    1. added support for rendering tags with the white-space: pre CSS attribute (e.g. <pre> which is often used for formatting code).
    2. API change: A ParserConfig object replaces the parameters display_images, deduplicate_captions, display_links and indentation in get_text() and for initializing the Inscriptis class.

          from lxml.html import fromstring
          from inscriptis.html_engine import Inscriptis
          from inscriptis.model.config import ParserConfig

          # html holds the document to convert (sample value for illustration)
          html = '<p>See <a href="https://example.org">this page</a>.</p>'
          html_tree = fromstring(html)
          # optional parser configuration fine tuning
          config = ParserConfig(display_links=True, display_anchors=True)
          parser = Inscriptis(html_tree, config)
          text = parser.get_text()

    3. command line client:
      • added option for displaying anchor links
      • --encoding now sets the HTML and output encoding
      • new --version option
    4. Web service
      • use the related CSS profile per default
      • added version call
    5. Documentation fixes and improvements
  • 1.0(Dec 20, 2019)

    • improved performance and code structure.
    • use metadata published in ./inscriptis/__init__.py for versioning and in setup.py.
    • improved test coverage
    • created sphinx API, usage and testing documentation which is published on https://inscriptis.readthedocs.org
    • requires Python 3.5+ (dropped support for Python 2.7)
  • 0.0.4.1.1(Sep 25, 2019)

  • 0.0.4.1(Sep 25, 2019)

    • improved indentation, if span and div tags are used
    • support for custom rendering styles
    • improved documentation
    • use travis for auto CI
    • requires Python 2.7+ or Python 3.5+ since lxml does not support Python 3 versions <3.5
  • 0.0.4.0(Feb 26, 2019)

    • Correctly handle nested tables and line breaks (e.g. due to enumerations, list or paragraph breaks) in tables.
    • Improved content stripping.

    Please take a look at the Rendering document for an overview of how Inscriptis renders different tables.

  • 0.0.3.8(Jan 31, 2019)

  • 0.0.3.7(Dec 21, 2018)

    • correctly parse negative margins in CSS definitions.
    • This fixes a bug that, for some pages, led to a large number (>1000) of newlines between content.
  • 0.0.3.5(Dec 11, 2018)

  • 0.0.3.4(Nov 15, 2018)

  • 0.0.3.3(Apr 17, 2018)

  • 0.0.3.2(Nov 24, 2017)

    Changelog

    1. optional flask web service for converting html to text
    2. bug fixes
      • allow infinitely nested lists
      • fix a css parsing bug
      • correctly handle empty documents
Owner
webLyzard technology
Converts XML to Python objects

untangle Documentation Converts XML to a Python object. Siblings with similar names are grouped into a list. Children can be accessed with parent.chil

Christian Stefanescu 567 Nov 30, 2022
Pythonic HTML Parsing for Humans™

Requests-HTML: HTML Parsing for Humans™ This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. When us

Python Software Foundation 12.9k Jan 01, 2023
Python module that makes working with XML feel like you are working with JSON

xmltodict xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec": print(json.dumps(xmltod

Martín Blech 5k Jan 04, 2023
This project takes a special TXT file as input, divides its content into a list of HTML objects, and then creates an HTML file from them.

This project takes a special TXT file as input, divides its content into a list of HTML objects, and then creates an HTML file from them.

1 Jan 10, 2022
A HTML-code compiler-thing that lets you reuse HTML code.

RHTML RHTML stands for Reusable-Hyper-Text-Markup-Language, and is pronounced "Rech-tee-em-el" despite how its abbreviation is. As the name stands, RH

Duckie 4 Nov 15, 2021
Lektor-html-pretify - Lektor plugin to pretify the HTML DOM using Beautiful Soup

html-pretify Lektor plugin to pretify the HTML DOM using Beautiful Soup. How doe

Chaos Bodensee 2 Nov 08, 2022
inscriptis -- HTML to text conversion library, command line client and Web service

inscriptis -- HTML to text conversion library, command line client and Web service A python based HTML to text conversion library, command line client

webLyzard technology 122 Jan 07, 2023
A library for converting HTML into PDFs using ReportLab

XHTML2PDF The current release of xhtml2pdf is xhtml2pdf 0.2.5. Release Notes can be found here: Release Notes As with all open-source software, its us

2k Dec 27, 2022
Python binding to Modest engine (fast HTML5 parser with CSS selectors).

A fast HTML5 parser with CSS selectors using Modest engine. Installation From PyPI using pip: pip install selectolax Development version from github:

Artem Golubin 710 Jan 04, 2023
The lxml XML toolkit for Python

What is lxml? lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory

2.3k Jan 02, 2023
Modded MD conversion to HTML

MDPortal A module to convert an md-esque lang to html Basically I ruined md in an attempt to convert it to html Overview Here is a demo file from parse

Zeb 1 Nov 27, 2021
A jquery-like library for python

pyquery: a jquery-like library for python pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jq

Gael Pasgrimaud 2.2k Dec 29, 2022
A python HTML builder library.

PyML A python HTML builder library. Goals Fully functional html builder similar to the javascript node manipulation. Implement an html parser that ret

Arjix 8 Jul 04, 2022
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

Bleach Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes. Bleach can also linkify text safely, appl

Mozilla 2.5k Dec 29, 2022
The awesome document factory

The Awesome Document Factory WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous s

Kozea 5.4k Jan 07, 2023
Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

The Pallets Projects 514 Dec 31, 2022
Standards-compliant library for parsing and serializing HTML documents and fragments in Python

html5lib html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all majo

1k Dec 27, 2022
Dominate is a Python library for creating and manipulating HTML documents using an elegant DOM API

Dominate Dominate is a Python library for creating and manipulating HTML documents using an elegant DOM API. It allows you to write HTML pages in pure

Tom Flanagan 1.5k Jan 09, 2023
Generate HTML using python 3 with an API that follows the DOM standard specification.

Generate HTML using python 3 with an API that follows the DOM standard specification. A JavaScript API and tons of cool features. Can be used as a fast prototyping tool.

byteface 114 Dec 14, 2022