MechanicalSoup

A Python library for automating interaction with websites.

Home page

https://mechanicalsoup.readthedocs.io/

Overview

MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It doesn't do JavaScript.

MechanicalSoup was created by M Hickford, who was a fond user of the Mechanize library. Unfortunately, Mechanize was incompatible with Python 3 until 2019, and its development stalled for several years. MechanicalSoup provides a similar API, built on Python giants Requests (for HTTP sessions) and BeautifulSoup (for document navigation). Since 2017 it has been actively maintained by a small team including @hemberger and @moy.
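
As a quick illustration of that workflow, here is a minimal sketch; the URL and the "q" field name are placeholders for illustration, not part of the project's documentation:

import mechanicalsoup

# A StatefulBrowser stores cookies between requests and follows redirects.
browser = mechanicalsoup.StatefulBrowser()

# Open a page, select its form, fill in a field, and submit it.
# "https://example.com/search" and the "q" field are illustrative only.
browser.open("https://example.com/search")
browser.select_form('form')
browser["q"] = "MechanicalSoup"
response = browser.submit_selected()
print(response.url)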

Installation

PyPy3 is also supported (and tested against).

Download and install the latest released version from PyPI:

pip install MechanicalSoup

Download and install the development version from GitHub:

pip install git+https://github.com/MechanicalSoup/MechanicalSoup

Installing from source (installs the version in the current working directory):

python setup.py install

(In all cases, add --user to the install command to install in the current user's home directory.)

Documentation

The full documentation is available at https://mechanicalsoup.readthedocs.io/. You may want to jump directly to the automatically generated API documentation.

Example

From examples/expl_qwant.py, code to get the results from a Qwant search:

"""Example usage of MechanicalSoup to get the results from the Qwant
search engine.
"""

import re
import urllib.parse

import mechanicalsoup

# Connect to Qwant
browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')
browser.open("https://lite.qwant.com/")

# Fill in the search form
browser.select_form('#search-form')
browser["q"] = "MechanicalSoup"
browser.submit_selected()

# Display the results
for link in browser.page.select('.result a'):
    # Qwant shows redirection links, not the actual URL, so extract
    # the actual URL from the redirect link:
    href = link.attrs['href']
    m = re.match(r"^/redirect/[^/]*/(.*)$", href)
    if m:
        href = urllib.parse.unquote(m.group(1))
    print(link.text, '->', href)

More examples are available in examples/.

For an example with a more complex form (checkboxes, radio buttons and textareas), read tests/test_browser.py and tests/test_form.py.
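
As a hedged sketch of what filling such controls looks like (the URL and field names below are invented for illustration; the real examples live in the test files mentioned above):

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com/feedback")    # placeholder URL
browser.select_form('form')

# Field names are invented for illustration; see the test files for real cases.
browser["comment"] = "Great library!"           # <textarea name="comment">
browser["size"] = "large"                       # <input type="radio" name="size" value="large">
browser["topping"] = ("bacon", "cheese")        # checks these two checkboxes, unchecks the rest
response = browser.submit_selected()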

Development

Instructions for building, testing and contributing to MechanicalSoup: see CONTRIBUTING.rst.

Common problems

Read the FAQ.

Comments
  • Submit an empty file when leaving a file input blank

    This is in regards to issue #250

    For the tests, I followed @moy's train of thought:

    • they are basically a copy+paste without the creation of a temp file
    • assert value["doc"] == "" checks that the response contains an empty file

    I thought a different test definition was necessary; was I right to assume so?

    In browser.py, I changed the continue around line 179 to something similar to what has been done in test__request_file here.

    There are two "Add no file input submit test" commits: the second one simply cleans up some commented code. I'll avoid that next time!

    I was unable to run test_browser.py due to some weird import error on modules that are installed, so I'm kind of submitting this pull request blindly. Does it matter that I say I'm confident in the changes, though?

    opened by senabIsShort 27
  • MechanicalSoup logo

    In the Roadmap, some artwork is requested. I asked an artistic friend to try to interpret this request, and this is what they came up with. I would love to use this as our logo (in both the README, as per the roadmap, and perhaps also as our organization icon). Before I make a PR, I just wanted to see if this was what you were going for.

    Drawing
    opened by hemberger 20
  • Tests randomly hanging on Travis-CI

    Every couple of Travis builds, I see one of the sub-builds hang. It happens frequently enough that I feel like I have to babysit Travis, which is not a good situation to be in. From what I can tell, this occurs under two conditions:

    1. httpbin.org is under heavy load (this occurs infrequently, but can occur for extended periods of time)
    2. flake8 hangs for some unknown reason (seems arbitrary, and rerunning almost always fixes it)

    I really want to understand 2), because for 1) we could simply rely on httpbin.org a bit less if necessary.

    opened by hemberger 18
  • Remove `name` attribute from all unused buttons on form submit

    I ran into a site with forms including buttons of type "button" with name attributes. Because Form.choose_submit() was only removing name from buttons of type "submit", the values for the "button" buttons were being erroneously sent on POST, thereby breaking my submission. This patch fixes the issue, even when a submit button isn't explicitly chosen.

    Note that all buttons that aren't of type "button" or "reset" function as "submit" in all major browsers and should therefore be choosable.

    opened by blackwind 16
  • Do not submit disabled <input> elements

    https://www.w3.org/TR/html52/sec-forms.html#element-attrdef-disabledformelements-disabled

    The disabled attribute is used to make the control non-interactive and to prevent its value from being submitted.

    MechanicalSoup currently ignores the disabled attribute; this should be fixed.

    Some additional notes: (from https://www.wufoo.com/html5/disabled-attribute/)

    • If the disabled attribute is set on a <fieldset>, the descendent form controls are disabled.
    • A disabled field can't be modified, tabbed to, highlighted, or have its contents copied. Its value is also ignored when the form goes through constraint validation.
    • The disabled value is Boolean, and therefore doesn’t need a value. But, if you must, you can include disabled="disabled".
    • Setting the value of the disabled attribute to null does not remove the effects of the attribute. Instead use removeAttribute('disabled').
    • You can target elements that are disabled with the :disabled pseudo-class. Or, if you want to specifically target the presence of the attribute, you can use input[disabled]. Similarly, you can use :enabled and input:not([disabled]) to target elements that are not disabled.
    • You do not need to include aria-disabled="true" when including the disabled attribute because disabled is already well supported. However, if you are programmatically disabling an element that is not a form control and therefore the disabled attribute does not apply, include aria-disabled="true".
    • The disabled attribute is valid for all form controls including all <input> types, <textarea>, <button>, <select>, <fieldset>, and <keygen>.
    opened by 5j9 14
  • browser.follow_link() has no way to pass kwargs to requests

    As noted elsewhere, I've recently been debugging behind an SSL proxy, which requires telling requests to not verify SSL certificates. Generally I've done that with

        kwargs = { "verify": False }
        # ...
        r = br.submit_selected(**kwargs)
    

    which is fine. But it's not so fine when I need to follow a link, because browser.follow_link() uses its **kwargs for BS4's tag finding, but not for actually following the link.

    So instead of

        r = br.follow_link(text='Link anchor', **kwargs)
    

    I end up with

        link = br.find_link(text='Link anchor')
        r = br.open_relative(link['href'], **kwargs)
    

    I am not sure how to fix this. Some thoughts:

    1. If nothing changes, add some more clarity to browser.follow_link()'s documentation explaining how to work around this situation.
    2. Add kwargs-ish params to browser.follow_link(), one for BS4 and one for Requests. Of course, only one gets to be **kwargs, but at least one might be able to call browser.follow_link(text='Link anchor', requests_args=kwargs) or something.
    3. Send the same **kwargs parameter to both

    Maybe there's a better way. I guess in my case I could set this state in requests' Session object, ~which I think would be browser.session.merge_environment_settings(...)~ no, that's not right, I'm not sure how to accomplish it actually.
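
    Since the 1.1.0 release (see the release notes below), follow_link and download_link accept a requests_kwargs dictionary that is forwarded to requests, which covers the SSL-proxy case above; a minimal sketch:

        # requests_kwargs is forwarded to requests
        # (verify=False disables SSL certificate verification).
        r = br.follow_link(text='Link anchor', requests_kwargs={"verify": False})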

    opened by johnhawkinson 13
  • Replace httpbin.org with pytest-httpbin in tests

    The pytest-httpbin module provides pytest support for the httpbin module (which is the code that runs the remote server http://httpbin.org). This locally spins up an internal webserver when tests are run.

    With this change, MechanicalSoup tests can be run without an internet connection. As a result, the tests run much faster.

    You may need the python{,3}-dev package on your system to pip install the pytest-httpbin module.

    deferred 
    opened by hemberger 13
  • No parser was explicitly specified

    /usr/local/lib/python3.4/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

    To get rid of this warning, change this:

        BeautifulSoup([your markup])

    to this:

        BeautifulSoup([your markup], "lxml")

    Do I need to use the add_soup method, or what?

    opened by stdex 12
  • Add get_request_kwargs to check before requesting

    When we use mechanicalsoup, we sometimes want to verify a request before submitting it.

    If you merge this pull request, the package will provide a way for its users to review a request before it is sent.

    This is my first pull request for this project. Please let me know if I'm missing anything.

    opened by kumarstack55 11
  • Set up LGTM and fix warnings

    LGTM.com finds one issue in our code, and it seems legitimate to me (although I'm guilty of introducing it):

    https://lgtm.com/projects/g/hickford/MechanicalSoup/

    We should fix this, and configure lgtm so that it checks pull-requests.

    opened by moy 11
  • Problems calling "follow_link" with "url_regex"

    Dear Dan and Matthieu,

    first things first: Thanks for conceiving and maintaining this great library. We also switched one of our implementations over from "mechanize" as per https://github.com/ip-tools/ip-navigator/commit/a26c3a8a and it worked really well.

    When doing so, we encountered a minor problem when trying to call the follow_link method with the url_regex keyword argument like

    response = self.browser.follow_link(url_regex='register/PAT_.*VIEW=pdf', headers={'Referer': result.url})
    

    This raises the exception

    TypeError: links() got multiple values for keyword argument 'url_regex'
    

    I am currently a bit short on time, otherwise I would have submitted a pull request without further ado. Thanks a bunch for looking into this issue.

    With kind regards, Andreas.

    opened by amotl 11
  • browser.links() should return an empty list if self.page is None

    I was writing a fuzzer for a cybersecurity assignment, and it crashed when it tried to find the links on a PDF file. I think it would make more sense to return that there are no links, if the page fails to parse. This seems relatively straightforward to implement.
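
    Until the library changes, a caller-side guard avoids the crash; a minimal sketch (not code from the issue):

        # browser.page is None when the last response was not parseable HTML
        # (e.g. a PDF), so fall back to an empty list in that case.
        links = browser.links() if browser.page is not None else []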

    opened by npetrangelo 1
  • Typing annotations and typechecking with mypy or pyright?

    We already have basic static analysis with flake8 (and the underlying pyflakes), but using typing annotations and a static typechecker may 1) find more bugs, 2) help our users by providing completion and other smart features in their IDE.

    mypy is the historical typechecker, pyright is a more recent one which in my (very limited) experience works better (it's also the tool behind the new Python mode of VSCode). So I'd suggest pyright if we don't have arguments to choose mypy.

    For now, neither tool can typecheck the project without error, so a first step would be to add the necessary annotations to get an error-free pyright check.

    easy? 
    opened by moy 3
  • Can you build it without lxml?

    MechanicalSoup is a really nice package that I have used, but it still requires a C compiler to build lxml on *nix systems.

    It may be a problem to port to platforms without a C compiler, such as Android or some minimal Linux systems.

    Currently I use a script to build MechanicalSoup without lxml:

    #!/bin/sh
    
    # Remove lxml in requirements.txt
    sed -i '/lxml/d' requirements.txt
    
    # Use `html.parser` instead of `lxml`
    sed -i "s@{'features': 'lxml'}@{'features': 'html.parser'}@g" mechanicalsoup/*.py

    # Fix examples and tests
    sed -i "s@\(BeautifulSoup(.\{1,\}\)'lxml'\(.*)\)@\1'html.parser'\2@g" examples/*.py tests/*.py
    

    It works well, so I think it is not a big problem...
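
    For reference, the parser can also be chosen at runtime through the soup_config argument, which avoids patching the sources (a sketch, not part of the original script; the requirements change above is still needed to avoid installing lxml at all):

        # Ask BeautifulSoup to use the pure-Python html.parser instead of lxml.
        import mechanicalsoup
        browser = mechanicalsoup.StatefulBrowser(soup_config={'features': 'html.parser'})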

    opened by urain39 2
  • Selecting a form that only has a class attribute

    I'm trying to get a form, but it only has a class attribute, and I keep getting a "LinkNotFoundError". I've inspected the page and I know I have the correct class name, but it doesn't work at all, and I don't see any real reference to this type of issue in the docs. I could try to get the form with BS4, but then there wouldn't be a way to select it.

    I could attempt to get the form with BS4, add an id attribute to it, and then try selecting it by that id?

    I'd really appreciate any help, thank you!
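
    For what it's worth, select_form accepts any CSS selector, so a class-only form can usually be selected directly; a sketch with a made-up class name:

        # "login-form" is a hypothetical class name; use the real one from the page.
        browser.select_form('form.login-form')
        # If several forms match, the nr argument picks one by index.
        browser.select_form('form.login-form', nr=1)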

    question 
    opened by SilverStrings024 6
  • add_soup(): Don't match Content-type with `in`

    Don't use Python's in operator to match Content-Types, since that is a simple substring match.

    It's obviously not correct since a Content-Type string can be relatively complicated, like

    Content-Type: application/xhtml+xml; charset="utf-8"; boundary="This is not text/html"

    Although that's rather contrived, the prior test "text/html" in response.headers.get("Content-Type", "") would return True here, incorrectly.

    Also, the existence of subtypes with +'s means that using the prior test for "application/xhtml" would match the above example when it probably shouldn't.

    Instead, leverage requests's code, which comes from the Python Standard Library's cgi.py.

    Clarify that we don't implement MIME sniffing, nor X-Content-Type-Options: nosniff; instead we do our own thing.


    I was looking at this code because of #373.

    I've marked this as a draft, because I'm not quite sure this is the way to go, both because of the long discursive comment and the use of a _-prefixed function from requests (versus cgi.py's parse_header()).

    Also, I'm kind of perplexed what's going on here:

                http_encoding = (
                    response.encoding
                    if 'charset' in parameters
                    else None
                )
    

    Like…why does the presence of charset=utf-8 in the Content-Type header mean that we should trust requests's encoding field? Oh, I see, it's because sometimes requests does some sniffing-ish-stuff and sometimes it doesn't (in which case it parses the Content-Type) and we need to know which, and we're backing out a conclusion about its heuristics? Probably seems like maybe we should parse it ourselves if so. idk.

    Maybe we should be doing more formal mime sniffing. And maybe we should be honoring X-Content-Type-Options: nosniff. And… … …

    I'm also not sure what kind of test coverage is really appropriate here, if anything additional. Seems like the answer shouldn't be "zero," so…
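
    For comparison, the standard-library helper mentioned above splits the header into a type and parameters, which allows an exact comparison instead of a substring test; a sketch (note that the cgi module is deprecated in recent Python versions):

        import cgi

        header = 'application/xhtml+xml; charset="utf-8"; boundary="This is not text/html"'
        mime_type, params = cgi.parse_header(header)
        # mime_type is 'application/xhtml+xml', so an exact comparison with
        # 'text/html' is False here, unlike the substring test quoted above.
        print(mime_type == "text/html", params.get("charset"))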

    opened by johnhawkinson 2
Releases (v1.2.0)
  • v1.2.0 (Sep 17, 2022)

    Main changes

    • Added support for Python 3.10.

    • Added support for HTML form-associated elements (i.e. input elements that are associated with a form by a form attribute, but are not a child element of the form). [#380]

    Bug fixes

    • When uploading a file, only the filename is now submitted to the server. Previously, the full file path was being submitted, which exposed more local information than users may have been expecting. [#375]
  • v1.1.0 (May 29, 2021)

    Main changes

    • Dropped support for EOL Python versions: 2.7 and 3.5.

    • Increased minimum version requirement for requests from 2.0 to 2.22.0 and beautifulsoup4 from 4.4 to 4.7.

    • Use encoding from the HTTP request when no HTML encoding is specified. [#355]

    • Added the put method to the Browser class. This is a light wrapper around requests.Session.put. [#359]

    • Don't override Referer headers passed in by the user. [#364]

    • StatefulBrowser methods follow_link and download_link now support passing a dictionary of keyword arguments to requests, via requests_kwargs. For symmetry, they also support passing Beautiful Soup args in as bs4_kwargs, although any excess **kwargs are sent to Beautiful Soup as well, just as they were previously. [#368]

    Many thanks to the contributors who made this release possible!

  • v1.0.0 (Jan 5, 2021)

    This is the last release that will support Python 2.7. Thanks to the many contributors that made this release possible!

    Main changes:

    • Added support for Python 3.8 and 3.9.

    • StatefulBrowser has new properties page, form, and url, which can be used in place of the methods get_current_page, get_current_form and get_url respectively (e.g. the new x.page is equivalent to x.get_current_page()). These methods may be deprecated in a future release. [#175]

    • StatefulBrowser.form will raise an AttributeError instead of returning None if no form has been selected yet. Note that StatefulBrowser.get_current_form() still returns None for backward compatibility.

    Bug fixes

    • Decompose <select> elements with the same name when adding a new input element to a form. [#297]

    • The params and data kwargs passed to submit will now properly be forwarded to the underlying request for GET methods (whereas previously params was being overwritten by data). [#343]

  • v0.12.0 (Aug 27, 2019)

    Main changes:

    • Changes in official python version support: added 3.7 and dropped 3.4.

    • Added ability to submit a form without updating StatefulBrowser internal state: submit_selected(..., update_state=False). This means you get a response from the form submission, but your browser stays on the same page. Useful for handling forms that result in a file download or open a new tab.
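
    A brief sketch of that usage (the form selection is omitted and the file name is illustrative):

      # Submit the selected form but keep the browser on the current page,
      # e.g. when the form triggers a file download.
      response = browser.submit_selected(update_state=False)
      with open("report.pdf", "wb") as f:
          f.write(response.content)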

    Bug fixes

    • Improve handling of form enctype to behave like a real browser. [#242]

    • HTML type attributes are no longer required to be lowercase. [#245]

    • Form controls with the disabled attribute will no longer be submitted to improve compliance with the HTML standard. If you were relying on this bug to submit disabled elements, you can still achieve this by deleting the disabled attribute from the element in the Form object directly. [#248]

    • When a form containing a file input field is submitted without choosing a file, an empty filename & content will be sent just like in a real browser. [#250]

    • <option> tags without a value attribute will now use their text as the value. [#252]

    • The optional url_regex argument to follow_link and download_link was fixed so that it is no longer ignored. [#256]

    • Allow duplicate submit elements instead of raising a LinkNotFoundError. [#264]

    Our thanks to the many new contributors in this release!

  • v0.11.0 (Sep 11, 2018)

    This release focuses on fixing bugs related to uncommon HTTP/HTML scenarios and on improving the documentation.

    Bug fixes

    • Constructing a Form instance from a bs4.element.Tag whose tag name is not form will now emit a warning, and may be deprecated in the future. [#228]

    • Breaking Change: LinkNotFoundError now derives from Exception instead of BaseException. While this will bring the behavior in line with most people's expectations, it may affect the behavior of your code if you were heavily relying on this implementation detail in your exception handling. [#203]

    • Improve handling of button submit elements. Will now correctly ignore buttons of type button and reset during form submission, since they are not considered to be submit elements. [#199]

    • Do a better job of inferring the content type of a response if the Content-Type header is not provided. [#195]

    • Improve consistency of query string construction between MechanicalSoup and web browsers in edge cases where form elements have duplicate name attributes. This prevents errors in valid use cases, and also makes MechanicalSoup more tolerant of invalid HTML. [#158]

  • v0.10.0 (Feb 4, 2018)

    Main changes:

    • Added StatefulBrowser.refresh() to reload the current page with the same request. [#188]

    • StatefulBrowser.follow_link, StatefulBrowser.submit_selected() and the new StatefulBrowser.download_link now set the Referer: HTTP header to the page from which the link is followed. [#179]

    • Added method StatefulBrowser.download_link, which will download the contents of a link to a file without changing the state of the browser. [#170]

    • The selector argument of Browser.select_form can now be a bs4.element.Tag in addition to a CSS selector. [#169]

    • Browser.submit and StatefulBrowser.submit_selected accept a larger number of keyword arguments. Arguments are forwarded to requests.Session.request. [#166]

    Internal changes:

    • StatefulBrowser.choose_submit will now ignore input elements that are missing a name attribute instead of raising a KeyError. [#180]

    • Private methods Browser._build_request and Browser._prepare_request have been replaced by a single method Browser._request. [#166]

  • v0.9.0 (Nov 2, 2017)

    Main changes:

    • We do not rely on BeautifulSoup's default choice of HTML parser. Instead, we now specify lxml as default. As a consequence, the default setting requires lxml as a dependency.

    • Python 2.6 and 3.3 are no longer supported.

    • The GitHub URL moved from https://github.com/hickford/MechanicalSoup/ to https://github.com/MechanicalSoup/MechanicalSoup. @moy and @hemberger are now officially administrators of the project in addition to @hickford, the original author.

    • We now have a documentation site: https://mechanicalsoup.readthedocs.io/. The API is now fully documented, and we have included a tutorial, several more code examples, and a FAQ.

    • StatefulBrowser.select_form can now be called without argument, and defaults to "form" in this case. It also has a new argument, nr (defaults to 0), which can be used to specify the index of the form to select if multiple forms match the selection criteria.

    • We now use requirement files. You can install the dependencies of MechanicalSoup with, e.g.:

      pip install -r requirements.txt -r tests/requirements.txt

    • The Form class was restructured and has a new API. The behavior of existing code is unchanged, but a new collection of methods has been added for clarity and consistency with the set method:

      • set_input deprecates input
      • set_textarea deprecates textarea
      • set_select is new
      • set_checkbox and set_radio together deprecate check (checkboxes are handled differently by default)
    • A new Form.print_summary method allows you to write browser.get_current_form().print_summary() to get a summary of the fields you need to fill in (and which ones are already filled in).

    • The Form class now supports selecting multiple options in a <select multiple> element.

    Bug fixes

    • Checking checkboxes with browser["name"] = ("val1", "val2") now unchecks all checkboxes except the ones explicitly specified.

    • StatefulBrowser.submit_selected and StatefulBrowser.open now reset __current_page to None when the result is not an HTML page. This fixes a bug where __current_page was still the previous page.

    • We don't error out anymore when trying to uncheck a box which doesn't have a checkbox attribute.

    • Form.new_control now correctly overrides existing elements.

    Internal changes

    • The testsuite has been further improved and reached 100% coverage.

    • Tests are now run against the local version of MechanicalSoup, not against the installed version.

    • Browser.add_soup will now always attach a soup-attribute. If the response is not text/html, then soup is set to None.

    • Form.set(force=True) creates an <input type=text ...> element instead of an <input type=input ...>.

  • v0.8.0 (Oct 1, 2017)

    Main changes:

    • Browser and StatefulBrowser can now be configured to raise a LinkNotFoundError exception when encountering a 404 Not Found error. This is activated by passing raise_on_404=True to the constructor. It is disabled by default for backward compatibility, but is highly recommended.

    • Browser now has a __del__ method that closes the current session when the object is deleted.

    • A Link object can now be passed to follow_link.

    • The user agent can now be customized. The default includes MechanicalSoup and its version.

    • There is now a direct interface to the cookiejar in *Browser classes ((set|get)_cookiejar methods).

    • This is the last MechanicalSoup version supporting Python 2.6 and 3.3.

    Bug fixes:

    • We used to crash on forms without action="..." fields.

    • The choose_submit method has been fixed, and the btnName argument of StatefulBrowser.submit_selected is now a shortcut for using choose_submit.

    • Arguments to open_relative were not properly forwarded.

    Internal changes:

    • The testsuite has been greatly improved. It now uses the pytest API (not only the pytest launcher) for more concise code.

    • The coverage of the testsuite is now measured with codecov.io. The results can be viewed at https://codecov.io/gh/hickford/MechanicalSoup

    • We now have a requires.io badge to help us track issues with dependencies. The report can be viewed at https://requires.io/github/hickford/MechanicalSoup/requirements/

    • The version number now appears in a single place in the source code.

  • v0.7.0 (May 7, 2017)

    Summary of changes:

    • New class StatefulBrowser, that keeps track of the currently visited page to make the calling code more concise.

    • A new launch_browser method in Browser and StatefulBrowser, that allows launching a browser on the currently visited page for easier debugging.

    • Many bug fixes.

    Release on PyPI: https://pypi.python.org/pypi/MechanicalSoup/0.7.0

  • v0.4.0 (Nov 24, 2015)
