Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

Overview

trafilatura: Web scraping tool for text discovery and retrieval



Description

Trafilatura is a Python package and command-line tool which seamlessly downloads, parses, and scrapes web page data: it can extract metadata, main body text and comments while preserving parts of the text formatting and page structure. The output can be converted to different formats.

Distinguishing between a whole page and the page's essential parts can help to alleviate many quality problems related to web text processing, by dealing with the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.).

The extractor aims to be precise enough not to miss texts or discard valid documents. It must also be robust and reasonably fast. With these objectives in mind, Trafilatura is designed to run in production on millions of web documents. It is based on lxml, with readability and jusText as fallbacks.

Features

  • Seamless parallelized online and offline processing:
    • Download and conversion utilities included
    • URLs, HTML files or parsed HTML trees as input
  • Robust and efficient extraction:
    • Main text and/or comments
    • Structural elements preserved: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
    • Extraction of metadata (title, author, date, site name, categories and tags)
  • Several output formats supported:
    • Plain text (minimal formatting)
    • CSV (with metadata, tab-separated values)
    • JSON (with metadata)
    • XML (for metadata and structure) and TEI-XML
  • Link discovery and URL lists:
    • Support for sitemaps and ATOM/RSS feeds
    • Efficient and polite processing of URL queues
    • Blacklisting
  • Optional language detection on extracted content

Evaluation and alternatives

For more detailed results see the evaluation page and evaluation script. To reproduce the tests just clone the repository, install all necessary packages and run the evaluation script with the data provided in the tests directory.

500 documents, 1487 text and 1496 boilerplate segments (2020-11-06)
Python Package                    Precision  Recall  Accuracy  F-Score  Diff.
justext 2.2.0 (tweaked)           0.870      0.584   0.749     0.699    6.1x
newspaper3k 0.2.8                 0.921      0.574   0.763     0.708    12.9x
goose3 3.1.6                      0.950      0.629   0.799     0.757    19.0x
boilerpy3 1.0.2 (article mode)    0.851      0.696   0.788     0.766    4.8x
baseline (text markup)            0.746      0.804   0.766     0.774    1x
dragnet 2.0.4                     0.906      0.689   0.810     0.783    3.1x
readability-lxml 0.8.1            0.917      0.716   0.826     0.804    5.9x
news-please 1.5.13                0.923      0.711   0.827     0.804    184x
trafilatura 0.6.0                 0.924      0.849   0.890     0.885    3.9x
trafilatura 0.6.0 (+ fallbacks)   0.933      0.877   0.907     0.904    8.4x
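
As a quick sanity check on the table, the sketch below recomputes the F-Score column from the precision and recall columns. It assumes the F-Score is the usual F1 measure (the harmonic mean of precision and recall); the table does not say so explicitly, but the published figures are consistent with that reading. The function name f_score is chosen here for illustration only.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the evaluation table
rows = {
    "baseline (text markup)": (0.746, 0.804),
    "trafilatura 0.6.0": (0.924, 0.849),
    "trafilatura 0.6.0 (+ fallbacks)": (0.933, 0.877),
}

for name, (p, r) in rows.items():
    # Inputs are already rounded to three digits, so small deviations are possible
    print(f"{name}: {f_score(p, r):.3f}")
```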

External evaluations:

Usage and documentation

For further information please refer to the documentation:

License

trafilatura is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What's in it for business?

Roadmap

  • [-] Duplicate detection at sentence, paragraph and document level using a least recently used (LRU) cache
  • [-] URL lists and document management
  • [-] Configuration and extraction parameters
  • [-] Graphical user interface
  • [ ] Interaction with web archives (notably WARC format)
  • [ ] Integration of natural language processing tools
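
The first roadmap item mentions duplicate detection backed by a least recently used (LRU) cache. The following is only a minimal sketch of that general idea using the standard library; the class name LRUSeenCache and its method are hypothetical and do not reflect how trafilatura implements the feature.

```python
from collections import OrderedDict


class LRUSeenCache:
    """Remember recently seen text segments (hypothetical helper)."""

    def __init__(self, maxsize: int = 1024) -> None:
        self.maxsize = maxsize
        self._counts = OrderedDict()

    def seen_before(self, segment: str) -> bool:
        """Record the segment and report whether it was already cached."""
        key = segment.strip()
        if key in self._counts:
            self._counts[key] += 1
            self._counts.move_to_end(key)  # mark as most recently used
            return True
        self._counts[key] = 1
        if len(self._counts) > self.maxsize:
            self._counts.popitem(last=False)  # evict the least recently used entry
        return False


cache = LRUSeenCache(maxsize=2)
paragraphs = ["Subscribe to our newsletter", "First paragraph", "Subscribe to our newsletter"]
unique = [p for p in paragraphs if not cache.seen_before(p)]
print(unique)
```

The same structure could be applied at sentence, paragraph or document level by changing what is passed in as the segment.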

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page. Thanks to the contributors who submitted features and bugfixes!

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.

You can contact me via my contact page or GitHub.

Going further

Online documentation: trafilatura.readthedocs.io.

Tutorials: overview.

Trafilatura: Italian word for wire drawing.

Corresponding posts on Bits of Language (blog).

Comments
  • Celery error with v1.2.1: ValueError: signal only works in main thread

    With version 1.2.1 it is not possible to run trafilatura extraction inside an asynchronous task runner such as Celery. https://github.com/adbar/trafilatura/blob/1bb5fee6a4812e53b6597053c25efde995174d79/trafilatura/core.py#L982 It would be better to expose HAS_SIGNAL as a configuration variable rather than a hardcoded value.

    celery_1      |     text = trafilatura.extract(
    celery_1      |   File "/usr/local/lib/python3.8/site-packages/trafilatura/core.py", line 982, in extract
    celery_1      |     signal(SIGALRM, timeout_handler)
    celery_1      |   File "/usr/local/lib/python3.8/signal.py", line 47, in signal
    celery_1      |     handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
    celery_1      | ValueError: signal only works in main thread
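
The traceback comes from Python itself: signal.signal() may only be called from the main thread. The sketch below shows one generic way to guard an alarm-based timeout so that it is skipped in worker threads. It is an illustration using only the standard library, not trafilatura's code; the names can_use_alarm and run_with_optional_timeout are made up for this example.

```python
import signal
import threading


def can_use_alarm() -> bool:
    """True only where installing a SIGALRM handler is possible."""
    return (
        hasattr(signal, "SIGALRM")  # SIGALRM does not exist on Windows
        and threading.current_thread() is threading.main_thread()
    )


def run_with_optional_timeout(func, seconds: int = 30):
    """Call func(), guarded by an alarm-based timeout when the platform allows it."""
    if not can_use_alarm():
        return func()  # e.g. inside a worker thread: no signal-based timeout

    def _handler(signum, frame):
        raise TimeoutError(f"call exceeded {seconds} seconds")

    previous = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return func()
    finally:
        signal.alarm(0)  # cancel the pending alarm
        # restore whatever handler was installed before
        signal.signal(signal.SIGALRM, previous if previous is not None else signal.SIG_DFL)
```

A runtime check like this avoids the ValueError without any configuration, at the cost of silently dropping the timeout in threads; a configuration switch, as the issue suggests, would make that trade-off explicit.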
    
    feedback 
    opened by alex-bender 16
  • No metadata extraction

    Hello,

    Thanks for your beautiful and powerful project. I am testing some websites with trafilatura 0.6.0 in Python 3.8.

    My test:

    import trafilatura
    from trafilatura.core import bare_extraction
    
    downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
    
    result = bare_extraction(downloaded, include_formatting=False, with_metadata=True)
    
    print(result)
    

    The results: ({'title': None, 'author': None, 'url': None, 'hostname': None, 'description': None, 'sitename': None, 'date': None, 'categories': None, 'tags': None, 'fingerprint': None, 'id': None}, 'Leader spotlight: Erin Spiceland Every March we recognize the women who have shaped history—and now, we’re taking a look forward. From driving software development in large companies to maintaining thriving open source communities, we’re spending Women’s History Month with women leaders who are making history every day in the tech community. Erin Spiceland is a Software Engineer for SpaceX. Born and raised in rural south Georgia, she is a Choctaw and Chickasaw mother of two now living in downtown Los Angeles. Erin didn’t finish college—she’s a predominantly self-taught software engineer. In her spare time, she makes handmade Native American beadwork and regalia and attends powwows. How would you summarize your career (so far) in a single sentence? My career has been a winding road through periods of stimulation and health as well as periods of personal misery. During it all, I’ve learned a variety of programming languages and technologies while working on a diverse array of products and services. I’m a domestic abuse survivor and a Choctaw bisexual polyamorous woman. I’m so proud of myself that I made it this far considering where I came from. What was your first job in tech like? In 2007, I had a three-year-old daughter and I was trying to finish my computer science degree one class at a time, all while keeping my house and family running smoothly. I found the math classes exciting and quickly finished my math minor, leaving only computer science classes. I was looking at about five years before I would graduate. Then, my husband at the time recommended me for an entry software developer position at a telecom and digital communications company. 
When faced with the choice between an expensive computer science degree and getting paid to do what I loved, I dropped out of college and accepted the job. I was hired to work on internal tooling, and eventually, products. I did a lot of development on product front-ends, embedded network devices, and a distributed platform-as-a-service. I learned Java/JSP, Python, JavaScript/CSS, Node.js, as well as MySQL, PostgreSQL, and distributed systems architecture. It was an intense experience that required a lot of self-teaching, asking others for help, and daycare, but it set me up for my later successes. What does leadership mean to you in your current role? “Leadership is about enabling those below, above, and around you to be at their healthiest and most effective so that all of you can accurately understand your surroundings, make effective plans and goals for the future, and achieve those goals.” I appreciate and admire technical, effective leaders who care for their reports as humans, not as lines on a burndown chart, and forego heavy-handed direction in favor of communication and mutual dialogue. I think it’s as important for a leader to concern herself with her coworkers’ personal well-being as it is for her to direct their performance. What’s the biggest career risk you’ve ever taken? What did you learn from that experience? Last year I took a pay cut to move from a safe, easy job where I had security to work in a language I hadn’t seen in years and with systems more complicated than anything I’d worked with before. I moved from a place where I had a huge four bedroom house to a studio apartment that was twice the price. I moved away from my children, of who I share custody with my ex-husband. We fly across the U.S. to see each other now. I miss my children every day. However, I get to be a wonderful role model for them. 
“I get to show my children that a Native woman who grew up in poverty, lost her mother and her culture, and who didn’t finish college can learn, grow, and build whatever career and life she wants.” What are you looking forward to next? I can’t wait to wake up every day with my partner who loves me so much. I’m looking forward to showing my children exactly how far they can go. I’m excited to keep exploring Los Angeles. “I expect to learn so much more about software and about life, and I want to experience everything.” Want to know more about Erin Spiceland? Follow them on GitHub or Twitter. Want to learn more about featured leaders for Women’s History Month? Read about: Laura Frank Tacho, Director of Engineering at CloudBees Rachel White, Developer Experience Lead at American Express Kathy Pham, Computer Scientist and Product Leader at Mozilla and Harvard Heidy Khlaaf, Research Consultant at Adelard LLP Check back in soon—we’ll be adding new interviews weekly throughout March.', <Element body at 0x10680a280>, <Element body at 0x1067af080>)

    So no metadata is returned.

    Also, I added an XPath expression to metaxpaths.py and rebuilt your code. I'm sure that //div[contains(@class, "post__categories")]//li//a matches a category on the page https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/. But no category is returned.

    categories_xpaths = [
        """//div[starts-with(@class, 'post-info') or starts-with(@class, 'postinfo') or
        starts-with(@class, 'post-meta') or starts-with(@class, 'postmeta') or
        starts-with(@class, 'meta') or starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-info') or
        starts-with(@class, 'entry-utility') or starts-with(@id, 'postpath')]//a""",
        "//p[starts-with(@class, 'postmeta') or starts-with(@class, 'entry-categories') or @class='postinfo' or @id='filedunder']//a",
        "//footer[starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-footer') or starts-with(@class, 'post-info')]//a",
        '//*[(self::li or self::span)][@class="post-category" or starts-with(@class, "post__categories") or @class="postcategory" or @class="entry-category"]//a',
        '//header[@class="entry-header"]//a',
        '//div[@class="row" or @class="tags"]//a',
        '//div[contains(@class, "post__categories")]//li//a',
    ]
    

    Another question: can I get the article content with its HTML formatting included (without the tags being cleaned from the content)?

    Please help me, thanks for your support!

    enhancement 
    opened by phongtnit 16
  • Issue with multiple authors and preference for meta information

    We shouldn't rely on the schema Person data.

    agenda Current: "author": "Sandy Cheu", Should be: "author": "Stephen Teulan; Nikita Weikhardt",

    aged Current: "author":"Consumers", Should be: "author": "Liz Alderslade",

    meta remove single names cath Current: "author": null, Should be: "author": "Rebecca",

    echo Current: "author": null, Should be: "author": "Katie",

    enhancement 
    opened by felipehertzer 15
  • Navigation bar filtering - some bug fixed

    The current repo should work well? I have removed several things that are unused and fixed a tiny bug that affects the accuracy. I have added to the git ignore so that the branch should now get quite clean as well XD

    opened by immortal-autumn 13
  • No Formatting in Plain Text Output

    When using include_formatting for plain text, I'm not seeing any formatting (bold, italics, etc.). The terminal I'm using supports this. Is this by design or a bug? I tried both the standalone version and using it as a library with trafilatura.extract(downloaded, include_formatting=True).

    enhancement question 
    opened by peterjschroeder 13
  • Performance enhancement

    I. Test file

    test2.py
    from time import time
    
    import requests
    from trafilatura import extract
    
    
    if __name__ == '__main__':
        urls = ["https://en.wikipedia.org/wiki/List_of_Hindi_songs_recorded_by_Asha_Bhosle",
                "https://en.wikipedia.org/wiki/2022_in_video_games",
                "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Kuwait",
                "https://en.wikipedia.org/wiki/Presidency_of_Rodrigo_Duterte",
                "https://en.wikipedia.org/wiki/List_of_2021%E2%80%9322_NBA_season_transactions",
                "https://en.wikipedia.org/wiki/2022_in_sports",
                "https://en.wikipedia.org/wiki/Firefox_version_history",
                "https://en.wikipedia.org/wiki/List_of_common_misconceptions",
                "https://en.wikipedia.org/wiki/Same-sex_union_legislation",
                "https://en.wikipedia.org/wiki/Presidency_of_Donald_Trump",]
    
        cum_time = 0
        for url in urls:        
            resp = requests.get(url)
            t0 = time()
            result = extract(resp.text)
            cum_time = cum_time + time() - t0
        print(cum_time)
    

    II. Test line_profiler (kernprof)

    kernprof -lv test2.py
    

    before

    Total time: 0.544693 s
    File: /trafilatura-master/trafilatura/utils.py
    Function: remove_control_characters at line 221
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       221                                           @profile
       222                                           def remove_control_characters(string):
       223                                               '''Prevent non-printable and XML invalid character errors'''
       224     25998     544693.0     21.0    100.0      return ''.join([c for c in string if c.isprintable() or c.isspace()])
    

    after

    Total time: 0.169241 s
    File: /trafilatura-master/trafilatura/utils.py
    Function: remove_control_characters at line 227
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       227                                           @profile
       228                                           def remove_control_characters(string):
       229                                               '''Prevent non-printable and XML invalid character errors'''
       230     25998     169241.0      6.5    100.0      return ''.join(filter(is_printable_or_space, string))
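
The two profiled variants can be compared in isolation with the sketch below. The helper is_printable_or_space is named in the "after" profile but its definition is not shown, so the version here is an assumption. Timings depend on the interpreter and the input, so the script only prints measurements and checks that both variants return the same string.

```python
from timeit import timeit


def remove_control_characters_listcomp(string: str) -> str:
    """Variant shown in the "before" profile."""
    return ''.join([c for c in string if c.isprintable() or c.isspace()])


def is_printable_or_space(char: str) -> bool:
    """Assumed definition of the helper named in the "after" profile."""
    return char.isprintable() or char.isspace()


def remove_control_characters_filter(string: str) -> str:
    """Variant shown in the "after" profile."""
    return ''.join(filter(is_printable_or_space, string))


sample = "text\x00 with\x07 control\tchars\n" * 1000

# Both variants must produce identical output before timings mean anything
assert remove_control_characters_listcomp(sample) == remove_control_characters_filter(sample)

for func in (remove_control_characters_listcomp, remove_control_characters_filter):
    print(func.__name__, timeit(lambda: func(sample), number=20))
```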
    

    III. Test vprof

    vprof -c -h test2.py
    

    before (image)

    after (image)

    feedback 
    opened by deedy5 10
  • Correction in the extraction of authors by tag and by json

    In this correction:

    • added 'submitted-by' and 'username' tags to xpath
    • the maximum size of the author's name has been increased.
    • regex has been added to remove emoji from author names often found on sites like buzzfeed
    • added a regex to minify json before running the other regex, was having trouble fetching authors when json formatted.
    • added a regex to remove json items like images and organization before searching the author
    • reorganized the extract_json function as it was overwriting meta tags with none when no json was found

    qsr Before this fix: "author": null After this fix: "author": "Kevin Santos"

    perthnow Before this fix: "author": "NCA NewsWire" After this fix: "author": "Finn McHugh"

    buzzfeed Before this fix: "author": "Hameda Nafiz BuzzFeed Staff" After this fix: "author": "Hameda Nafiz"

    buzzfeed Before this fix: "author": "Olivia ❤️" After this fix: "author": "Olivia Community Contributor"

    build Before this fix: "author": null After this fix: "author": "Thoams Lane"

    hunterandbligh Before this fix: none After this fix: "author": "REBECCA MAGRO"

    abc - 'data-component' Before this fix: "author": null After this fix: "author": "Charlotte Gore"

    proactiveinvestors Before this fix: "author": null After this fix: "author": "Calum Muirhead"

    banking Before this fix: "author": "Sarah Harman Jul" After this fix: "author": "Sarah Harman"

    hcamag Before this fix: "author": "Sarah Harman Jul" After this fix: "author": "Mark Rosanes"

    spacedaily and + 9 sites Before this fix: "author": null After this fix: "author": "Lucie Aubourg"

    first Before this fix:"author": "Nick Griffin", After this fix: "author": "Stan Shamu",

    racing Before this fix:"author": "Ben Sporle - @bensporle; Ben Sporle", After this fix: "author": "Ben Sporle",

    ajn Before this fix:"author": "RABBI GARY ROBUCK July", After this fix: "author": "RABBI GARY ROBUCK",

    ESPN (not totally fixed, but better) Before this fix: "author": "Andrew Mcglashandeputy Editor, Espncricinfo", After this fix: "author": "Andrew McGlashan Deputy editor; ESPNcricinfo",

    Probono (not totally fixed, but better) Before this fix: "author": null, After this fix: "author": "Luke Michael; Journalist; @Luke_Michael",

    opened by felipehertzer 10
  • Library is redirecting stderr to /dev/null upon every call

    If the readability fallback is activated, the Trafilatura library redirects stderr to /dev/null upon every call: https://github.com/adbar/trafilatura/blob/a56fb3e041175df38a32b1c5ef2e9c7888eeb7a6/trafilatura/external.py#L63

    Within programs involving other libraries, this causes a host of side effects. E.g., generating a chart with seaborn imports ipython (a dependency of seaborn), which checks stdin, stdout and stderr upon initialization and crashes because stderr is /dev/null. I have seen other side effects in other libraries as well, including disappearing logs (e.g. when log settings are modified after calls to Trafilatura).

    This redirection seems to have been necessary to prevent the readability library from printing messages to stderr. A cursory reading of the current version of readability seems to indicate that it no longer does that and only emits proper logs.

    Consequently, this redirect may be removed (to be tested).
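
If some output ever did need to be silenced, a scoped redirect that restores stderr afterwards would avoid the side effects described above. The sketch below illustrates that pattern with contextlib.redirect_stderr from the standard library; it is not what the library does, and call_quietly and noisy_parser are hypothetical names. Note that this only affects writes going through Python's sys.stderr, and that the swap is process-wide while the block runs.

```python
import contextlib
import io
import sys


def call_quietly(func, *args, **kwargs):
    """Run func with sys.stderr captured, restoring it afterwards."""
    buffer = io.StringIO()
    with contextlib.redirect_stderr(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy_parser(text):
    """Stand-in for a fallback library that writes to stderr (hypothetical)."""
    print("warning: something odd", file=sys.stderr)
    return text.upper()


original_stderr = sys.stderr
result, captured = call_quietly(noisy_parser, "content")

# stderr is back to its previous value once the call returns
assert sys.stderr is original_stderr
```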

    opened by dmoklaf 10
  • In parallel trafilatura is marginally slower than goose

    I'm not quite sure where to begin with this, it's a strange one. In a real world scenario I tried switching from Goose3 to Trafilatura. I'm processing html extractions in parallel with dask. After switching to trafilatura, I noticed a 30% slowdown. I ended up writing my own evaluation library to verify the results.

    Results from running in parallel: ┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃ ┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ goose3 │ 0.9678 │ 0.8561 │ 0.9547 │ 0.9027 │ 0.8343 │ 383.4737 │ │ trafilatura │ 0.9124 │ 0.9485 │ 0.908 │ 0.9278 │ 0.8567 │ 361.3232 │ └─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

    Results from running sequentially: ┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃ ┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ goose3 │ 0.9678 │ 0.8561 │ 0.9547 │ 0.9027 │ 0.8343 │ 9.7953 │ │ trafilatura │ 0.9124 │ 0.9485 │ 0.908 │ 0.9278 │ 0.8567 │ 23.0045 │ └─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

    Note: the dataset evaluated is from the scrapinghub/article-extraction-benchmark tool. The only portion of the code that runs in parallel for the benchmarks is the extraction. Only the extraction is timed for calculating items/sec.

    In summary: trafilatura is marginally slower than Goose3 in parallel. However, sequentially it is twice as fast as Goose3.

    I'm not sure where to begin with this. It can be difficult to profile parallel processing. It may be related to some of the memory leak issues reported with trafilatura, although it appears those have been resolved. Or it may be the caching; I haven't looked into how that functions.

    I will work on publishing my benchmarking tool this afternoon.

    question 
    opened by getorca 9
  • Handle pages where article is split into multiple sibling nodes

    This fixes #85 (and #159).

    It involved a bit of a refactor of the extract_content function, but the basic idea is that it looks through all of the children in the subtree returned from tree.xpath(expr), not just stopping at the first child like before. Beyond that, it pulls out the logic that checks whether the BODY_XPATH expression matched in the current loop iteration has found a useful subtree, to make it a little more readable, and only performs the final cleanup and look-elsewhere logic at the very end.

    So essentially, on finding a subtree whose first node is valid, we proceed to consider all of the remaining nodes in that subtree.

    This seems to work great, although I haven't run it through the automated tests. (I had trouble running the url tests.)

    Let me know what you think. Happy to talk through anything, and if/when this seems good to you, I'll clean it up (print statements, code style, etc.).

    Thanks!

    opened by naftalibeder 9
  • Broken parsing of images

    I'm not quite sure what's wrong with images but here is reproducer:

    $ curl https://en.wikipedia.org/wiki/Tribe > /tmp/tribe.html
    $ python
    Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> html_wiki_tribe = open('/tmp/tribe.html').read()
    ... text = trafilatura.extract(
    ...     html_wiki_tribe,
    ...     include_images=True
    ... )
    ~/anaconda3/lib/python3.7/site-packages/trafilatura/xml.py in xmltotxt(xmloutput, include_formatting, include_links)
        272             LOGGER.debug('unexpected element: %s', element.tag)
        273             returnlist.extend([textelement, ' '])
    --> 274     return sanitize(''.join(returnlist))
        275 
        276 
    
    TypeError: sequence item 6: expected str instance, NoneType found
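
The error itself is easy to reproduce outside the library: str.join() refuses any item that is not a string. The sketch below shows the failure and one defensive way to join such a list. It is not the fix adopted by the project, and join_text_parts is a hypothetical helper.

```python
def join_text_parts(parts):
    """Concatenate text fragments, skipping None entries."""
    return ''.join(part for part in parts if part is not None)


# A list shaped like the one in the traceback: one element has no text
returnlist = ["Tribe", " ", None, "image caption", " "]

try:
    ''.join(returnlist)
except TypeError as err:
    print("plain join fails:", err)

print(repr(join_text_parts(returnlist)))
```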
    
    

    UPD Looks like this could help: image

    bug 
    opened by alex-bender 9
  • Improve title extraction by removing sitename suffix

    Most sites add a suffix like:

    • My article title | My Site Name
    • My article title - My Site Name

    There is no need for the site name within the article title.

    Common separators are: - | – — • · ‹ › ⁄ « » < > : * ⋆ ~

    Some sites use html entities for this, like: &#8212;
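
A minimal sketch of the idea, assuming the site name is already known and that a separator only counts when it is surrounded by whitespace. The helper strip_sitename is hypothetical and only uses the standard library; HTML entities are decoded first so that a title containing &#8212; is handled like one with the literal character.

```python
import html
import re

# Separators listed in the issue (en dash and em dash written as escapes)
SEPARATORS = "-|\u2013\u2014•·‹›⁄«»<>:*⋆~"
SUFFIX_RE = re.compile(r"\s+[" + re.escape(SEPARATORS) + r"]\s+")


def strip_sitename(title: str, sitename: str) -> str:
    """Drop a trailing '<separator> <sitename>' from a title (hypothetical helper)."""
    title = html.unescape(title)  # decode entities such as &#8212;
    matches = list(SUFFIX_RE.finditer(title))
    if matches:
        last = matches[-1]  # only the final separator is considered
        if title[last.end():].strip().lower() == sitename.strip().lower():
            return title[:last.start()].strip()
    return title.strip()


print(strip_sitename("My article title | My Site Name", "My Site Name"))
print(strip_sitename("My article title &#8212; My Site Name", "My Site Name"))
```

Comparing the suffix against the known site name keeps titles that legitimately contain a separator, such as "Rock - Paper", untouched.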

    enhancement 
    opened by andremacola 5
  • Remove unwanted html elements with regex or xpaths

    Possibility to remove unnecessary html elements before starting the extraction process.

    There are often some elements within the extracted text that are not article content.

    Titles should by default not come inside the extracted text, or there should be an option to remove them (maybe this requires another issue)

    Something like:

    unwanted = [
      'iframe',
      'button',
      'figcaption',
      'caption',
      'form',
      'aside',
      'script',
      'style',
      'ins',
      'link',
      'header',
      'footer',
      '#comments',
      'nav',
      '.post-comments',
      '.post-tags',
      '.wp-block-embed',
      '.wp-caption-text',
      'svg',
      '[class^=ads]',
      '[class*=ads-]',
      '[style="display:none"]',
      '[style*="display:none"]',
      '[style*="display: none"]',
      '[itemprop*="description"]',
      '.push-web-notification',
      '.mc-column.entities',
      '.newsletter-component',
      '.post-subject',
      '.post-info',
      '.addthis_tool',
      '.pt-cv-wrapper'
    ]
    
    # unwanted_elements is the parameter proposed in this issue
    article = trafilatura.bare_extraction(document,
            unwanted_elements=unwanted,
            include_comments=False, include_tables=False,
            favor_precision=True, favor_recall=True,
            no_fallback=True, target_language=None,
            date_extraction_params={'extensive_search': True, 'original_date': True, 'outputformat': "%Y-%m-%dT%H:%M:%S%z"},
            config=config)
    
    question 
    opened by andremacola 4
  • feat: Add image urls to metadata

    Sometimes an image is not included in the text body, but it can be extracted from SEO meta tags.

    Issue: https://github.com/adbar/trafilatura/issues/281

    Unfortunately I didn't have time to create the tests

    opened by andremacola 2
  • Add image urls to metadata

    Sometimes an image is not included in the text body, but it can be extracted from SEO meta tags, as some article parsers do (https://github.com/extractus/article-extractor/blob/main/src/utils/extractMetaData.js)

    Here are some meta tags:

    'image'
    'og:image'
    'og:image:url'
    'og:image:secure_url'
    'twitter:image'
    'twitter:image:src'
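
To illustrate what reading these tags involves, here is a small standard-library sketch that scans meta elements for the keys listed above. It is not trafilatura's implementation; MetaImageParser and find_image are hypothetical names, and checking the keys in the order listed is an assumption made for this example.

```python
from html.parser import HTMLParser

# Meta names/properties listed in the issue, checked in the order listed
IMAGE_KEYS = ("image", "og:image", "og:image:url", "og:image:secure_url",
              "twitter:image", "twitter:image:src")


class MetaImageParser(HTMLParser):
    """Collect image URLs from <meta> tags (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.images = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph uses "property", Twitter cards usually use "name"
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key in IMAGE_KEYS and content:
            self.images.setdefault(key, content)  # keep the first value per key


def find_image(html_text: str):
    """Return the first image URL found according to IMAGE_KEYS, else None."""
    parser = MetaImageParser()
    parser.feed(html_text)
    for key in IMAGE_KEYS:
        if key in parser.images:
            return parser.images[key]
    return None


page = '<html><head><meta property="og:image" content="https://example.org/cover.jpg"></head></html>'
print(find_image(page))
```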
    
    enhancement 
    opened by andremacola 1
  • Extraction of Youtube iframes and img elements with links

    Not able to fetch image tags or iframe tags. Commands run from the command prompt on a Windows machine:

    trafilatura --sitemap "https://www.lyricspulp.com/" --list > linklist.txt
    trafilatura --sitemap homepage --list > linklist.txt
    trafilatura -i linklist.txt --xml -o outputfile.txt
    trafilatura -i linklist.txt --formatting --links --images --no-comments --xml -o outputfile.txt

    enhancement 
    opened by sampathmende 3
Releases(v1.4.0)
  • v1.4.0(Oct 18, 2022)

    Impact on extraction and output format:

    • better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
    • XML: preserve list type as attribute (#229)
    • XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
    • faster text cleaning and shorter code (#237 with @deedy5, #245)
    • metadata: add language when detector is activated (#224)
    • metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
    • TXT: change markdown formatting of headers by @LaundroMat (#257)

    Smaller changes in convenience functions:

    • add function to clear caches (#219)
    • CLI: change exit code if download fails (#223)
    • settings: use "\n" for multiple user agents by @k-sareen (#241)

    Updates:

    • docs updated (and #244 by @dsgibbons)
    • package dependencies updated

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.3.0...v1.4.0

  • v1.3.0(Jul 29, 2022)

    • fast and robust html2txt() function added (#221)
    • more robust parsing (#228)
    • fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
    • extraction about 10-20% faster, slightly better recall
    • partial fixes for memory leaks (#216)
    • docs extended and updated (#217, #225)
    • prepared deprecation of old process_record() function
    • more stable processing with updated dependencies

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.2...v1.3.0

  • v1.2.2(May 18, 2022)

    • more efficient rules for extraction
    • metadata: further attributes used (with @felipehertzer)
    • better baseline extraction
    • issues fixed: #202, #204, #205
    • evaluation updated

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.1...v1.2.2

  • v1.2.1(May 2, 2022)

    What's Changed

    • --precision and --recall arguments added to the CLI
    • better text cleaning: paywalls and comments
    • improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
    • further bugs fixed: #189, #192 (with @felipehertzer), #200
    • efficiency: faster module loading and improved RAM footprint

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.0...v1.2.1

  • v1.2.0(Mar 7, 2022)

    • efficiency: replaced module readability-lxml by trimmed fork
    • bugs fixed: (#179, #180, #183, #184)
    • improved baseline extraction
    • cleaner metadata (with @felipehertzer)

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.1.0...v1.2.0

  • v1.1.0(Feb 21, 2022)

    • encodings: better detection, output NFC-normalized Unicode
    • maintenance and performance: more efficient code
    • bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
    • prepare compatibility with upcoming Python 3.11
    • changed default settings
    • extended documentation

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.0.0...v1.1.0

  • v1.0.0(Nov 30, 2021)

    • compress HTML backup files & seamlessly open .gz files
    • support JSON web feeds
    • graphical user interface integrated into main package
    • faster downloads: reviewed backoff, compressed data
    • optional modules: downloads with pycurl, language identification with py3langid
    • bugs fixed (#111, #125, #132, #136, #140)
    • minor optimizations and fixes by @vbarbaresi in #124 & #130
    • fixed array with single or multiples entries on json extractor by @felipehertzer in #143
    • code base refactored with @sourcery-ai #121, improved and optimized for Python 3.6+
    • drop support for Python 3.5

    Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.3...v1.0.0

  • v0.9.3(Oct 21, 2021)

    • better, faster encoding detection: replaced chardet with charset_normalizer
    • faster execution: updated justext to 3.0
    • better extraction of sub-elements in tables (#78, #90)
    • more robust web feed parsing
    • further defined precision- and recall-oriented settings
    • license extraction in footers (#118)

    Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.2...v0.9.3

  • v0.9.2(Oct 6, 2021)

    • first precision- and recall-oriented presets defined
    • improvements in authorship extraction (thanks @felipehertzer)
    • requesting TXT output with formatting now results in Markdown format
    • bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
    • setting for cookies in request headers (thanks @muellermartin)
    • better date extraction thanks to htmldate update
  • v0.9.1(Aug 2, 2021)

    • improved author extraction (thanks @felipehertzer!)
    • bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
    • docs updated and extended
    • CLI: option names normalized (heed deprecation warnings), new option explore
  • v0.9.0(Jun 15, 2021)

    • focused crawling functions including politeness rules
    • more efficient multi-threaded downloads + use as Python functions
    • documentation extended
    • bugs fixed: extraction and URL handling
    • removed support for Python 3.4
  • v0.8.2(Apr 21, 2021)

    • better handling of formatting, links and images, title type as attribute in XML formats
    • more robust sitemaps and feeds processing
    • more accurate extraction
    • further consolidation: code simplified and bugs fixed
  • v0.8.1(Mar 11, 2021)

  • v0.8.0(Feb 19, 2021)

    • improved link discovery and handling
    • fixes in metadata extraction, feeds and sitemaps processing
    • breaking change: the extract function now reads target format from output_format argument only
    • new extraction option: preserve links, CLI options re-ordered
    • more opportunistic backup extraction
  • v0.7.0(Jan 4, 2021)

    • customizable configuration file to parametrize extraction and downloads
    • better handling of feeds and sitemaps
    • additional CLI options: cryptographic hash for file name, use Internet Archive as backup
    • more precise extraction
    • faster downloads: requests replaced with bare urllib3 and custom decoding
    • consolidation: bug fixes and improvements, many thanks to the issues reporters!
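
The v0.7.0 option of naming output files by a cryptographic hash can be sketched with `hashlib`. The release note does not say which hash function or which input Trafilatura uses, so SHA-256 over the document content, the truncation length and the helper name `hashed_filename` are all assumptions made for illustration.

```python
import hashlib


def hashed_filename(content: str, extension: str = ".txt", length: int = 16) -> str:
    """Derive a deterministic file name from the content's SHA-256 digest.

    The digest is truncated to `length` hex characters (an assumption made
    here to keep names short) and the extension is appended.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return digest[:length] + extension


print(hashed_filename("abc"))
```

Because the name depends only on the content, processing the same document twice yields the same file name, which makes duplicates easy to spot.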
  • v0.6.1(Dec 2, 2020)

    • added bare_extraction function returning Python variables
    • improved link discovery in feeds and sitemaps
    • option to preserve image info
    • fixes (many thanks to bug reporters!)
  • v0.6.0(Nov 6, 2020)

  • v0.5.2(Sep 22, 2020)

    • optional language detector changed: langid → pycld3
    • helper function bare_extraction()
    • optional deduplication off by default
    • better URL handling (courlan), more complete metadata
    • code consolidation (cleaner and shorter)
  • v0.5.1(Jul 15, 2020)

  • v0.5.0(Jun 2, 2020)

    • faster and more robust text and metadata extraction
    • more efficient batch processing (parallel processing, URL queues)
    • support for ATOM/RSS feeds
    • complete command-line tool with corresponding options
  • v0.4.1(Apr 24, 2020)

  • v0.1.0(Sep 25, 2019)

Owner
Adrien Barbaresi
Research scientist – web texts, computational linguistics and digital humanities
Web-scraping - A bot using Python with BeautifulSoup that scrapes the IRS website by form number and returns the results as JSON

Web-scraping - A bot using Python with BeautifulSoup that scrapes the IRS website (prior form publication) by form number and returns the results as JSON. It provides the option to download pdfs over a ra

1 Jan 04, 2022
This is a simple website crawler which asks for a website link from the user to crawl and find specific data from the given website address.

Faisal Ahmed 1 Jan 10, 2022
Minecraft Item Scraper

Minecraft Item Scraper To run, first ensure you have the BeautifulSoup module: pip install bs4 Then run, python minecraft_items.py folder-to-save-ima

Jaedan Calder 1 Dec 29, 2021
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

IST Research 1.1k Jan 06, 2023
This is a webscraper for a specific website

This is a webscraper for a specific website. It is tuned to extract the headlines of that website. With some little adjustments the webscraper is able to extract any part of the website.

Rahul Siyanwal 1 Dec 13, 2021
Displays market info for the LUNI token on the Terra Blockchain

LuniBot for Discord Displays market info for the LUNI/LUNA token on the Terra Blockchain (Webscrape method currently scraping CoinMarketCap). Will evo

0 Jan 22, 2022
A Python module to bypass Cloudflare's anti-bot page.

cloudscraper A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Requests.

VeNoMouS 2.6k Dec 31, 2022
Danbooru scraper with python

Danbooru Version: 0.0.1 License under: MIT License Dependencies Python: = 3.9.7 beautifulsoup4 cloudscraper Example of use Danbooru from danbooru imp

Sugarbell 2 Oct 27, 2022
A Spider for BiliBili comments with a simple API server.

BiliComment A spider for BiliBili comments. Spider Usage Put config.json into config directory, and then python . ./config/config.json. An example confi

Hao 3 Jul 05, 2021
Audio media crawler for lbry.

Audio media crawler for lbry. Requirements Python 3.8 Poetry 1.1.7 Elasticsearch 7.14.0 Lbry-sdk 0.99.0 Development This project uses poetry as a depe

Hound.fm 4 Dec 03, 2022
Footballmapies - Football mapies for learning webscraping and use of gmplot module in python

1 Jan 28, 2022
Extract gene TSS site from gencode/ensembl/gencode database GTF file and export bed format file.

GetTss Python package: extract gene TSS site from gencode/ensembl/gencode database GTF file and export bed format file. Install $ pip install GetTss Us

laojunjun 6 Nov 21, 2022
a Scrapy spider that utilizes Postgres as a DB, Squid as a proxy server, Redis for de-duplication and Splash to render JavaScript. All in a microservices architecture utilizing Docker and Docker Compose

This is George's Scraping Project To get started cd into the theZoo file and run: chmod +x script.sh then: ./script.sh This will spin up a Postgres co

George Reyes 7 Nov 27, 2022
A utility library to scrape data from TikTok, Instagram, Twitch, Youtube, Twitter or Reddit in one line!

Social Media Scraper A utility library to scrape data from TikTok, Instagram, Twitch, Youtube, Twitter or Reddit in one line! Go to the website » Vie

2 Aug 03, 2022
Simple Python tool for swapping Latin letters with Cyrillic ones and vice versa in txt, docx and pdf files in the Serbian language

Alpha Swap English This is a simple Python tool for swapping Latin letters with Cyrillic ones and vice versa, in txt, docx and pdf fil

Aleksandar Damnjanovic 3 May 31, 2022
Introduction to WebScraping Workshop - Semcomp 24 Beta

Extracting information from the internet in an automated way. There are several ways to do this; in this tutorial we will look at some of them, using Python libraries.

Luísa Moura 19 Sep 11, 2022
Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023
A web scraping API for MDL (My Drama List) for Python.

PyMDL An API for MyDramaList (MDL) based on web scraping for Python. Description An API for MDL to make your life easier in retrieving and working on dat

6 Dec 10, 2022
Google Maps crawler using Selenium

Google Maps Crawler using Selenium Built as part of the Antifragile Dev Project Selenium crawler that browses Google Maps as a regular user and stores

Guilherme Latrova 46 Dec 16, 2022
Explore scraping with BeautifulSoup!

beautifulsoup-scrape Explore scraping with BeautifulSoup! Part One: Start from Shakespeare As my professor is a poet (yes, and he teaches me data and

Chuqin 2 Oct 05, 2022