mlscraper: Scrape data from HTML pages automatically with Machine Learning

Overview


mlscraper allows you to extract structured data from HTML automatically with Machine Learning. You train it by providing a few examples of your desired output. It will then figure out the extraction rules for you, and afterwards you'll be able to extract data from any new page you provide.

(Diagram: how mlscraper works; see .github/how-it-works-wide.png)

Background Story

Many services for crawling and scraping automation allow you to select data in a browser and get JSON results in return. No need to specify CSS selectors or anything else.

I've been wondering for a long time why there's no open source solution that does something like this. So here's my attempt at creating a Python library to enable automatic scraping.

All you have to do is define some examples of scraped data. mlscraper will figure out everything else and return clean data.

Currently, this is a proof of concept with a simplistic solution.

How it works

After you've defined the data you want to scrape, mlscraper will:

  • find your samples inside the HTML DOM
  • determine which rules/methods to apply for extraction
  • extract the data for you and return it in a dictionary

import requests

from mlscraper import RuleBasedSingleItemScraper
from mlscraper.training import SingleItemPageSample

# the items found on the training page
targets = {
    "https://test.com/article/1": {"title": "One great result!", "description": "Some description"},
    "https://test.com/article/2": {"title": "Another great result!", "description": "Another description"},
    "https://test.com/article/3": {"title": "Result to be found", "description": "Description to crawl"},
}

# fetch html and create samples
samples = [SingleItemPageSample(requests.get(url).content, targets[url]) for url in targets]

# training the scraper with the items
scraper = RuleBasedSingleItemScraper.build(samples)

# apply the learned rules and extract a new item automatically
result = scraper.scrape(requests.get('https://test.com/article/4').content)

print(result)
# results in something like:
# {'title': 'Article four', 'description': 'Scraped automatically'}

You can find working scrapers, such as a Stack Overflow scraper and a quotes scraper, in the examples folder.

Getting started

Install the library via pip install mlscraper (for the 1.0 release candidates with the new API, use pip install --pre mlscraper). You can then import it via mlscraper and use it as shown in the examples.

Development

See CONTRIBUTING.rst

Related work

If you're interested in the underlying research, I can highly recommend these publications:

I originally called this project autoscraper, but while I was working on it, someone else released a library with exactly that name. Check it out here: autoscraper.

Comments
  • missing mlscraper.html

    Followed the readme and was testing the code after pip install --pre mlscraper

    But got a module not found error

    from mlscraper.html import Page
    ModuleNotFoundError: No module named 'mlscraper.html'

    Checking the installed library, only the following modules were present: ml.py, parser.py, training.py, util.py

    For people checking out the library, it would be convenient to list all required imports in the readme:

    from mlscraper.html import Page
    from mlscraper.samples import Sample, TrainingSet
    from mlscraper.training import train_scraper

    opened by appsec-airito 6
  • Find and fix issue with github profile pages

    • follower counts have no unique selector (need nth-child or something else; see the sketch below)
    • image width and height get matched when searching for 20 followers (as icons have manually set dimensions)
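
    A minimal illustration of the positional-selector fix, using BeautifulSoup (its soupsieve backend supports :nth-child); the HTML snippet is made up for the example:

    from bs4 import BeautifulSoup

    html = '<div class="counts"><a>20 followers</a><a>10 following</a></div>'
    soup = BeautifulSoup(html, "html.parser")
    # both links share the selector ".counts a"; only their position differs
    print(soup.select_one(".counts a:nth-child(1)").text)  # -> "20 followers"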
    opened by lorey 4
  • Stackoverflow example not working

    This is the code

    import logging
    
    import requests
    
    from mlscraper import SingleItemPageSample, RuleBasedSingleItemScraper
    
    
    items = {
        "https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array": {
            "title": "Why is processing a sorted array faster than processing an unsorted array?"
        },
        "https://stackoverflow.com/questions/927358/how-do-i-undo-the-most-recent-local-commits-in-git": {
            "title": "How do I undo the most recent local commits in Git?"
        },
        "https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do": {
            "title": "What does the “yield” keyword do?"
        },
    }
    
    results = {url: requests.get(url) for url in items.keys()}
    
    # train scraper
    samples = [
        SingleItemPageSample(results[url].content, items[url]) for url in items.keys()
    ]
    scraper = RuleBasedSingleItemScraper.build(samples)
    
    print("Scraping new question")
    html = requests.get(
        "https://stackoverflow.com/questions/2003505/how-do-i-delete-a-git-branch-locally-and-remotely"
    ).content
    result = scraper.scrape(html)
    
    print("Result: %s" % result)
    
    

    Output

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-11-9f646dab1fca> in <module>()
         24     SingleItemPageSample(results[url].content, items[url]) for url in items.keys()
         25 ]
    ---> 26 scraper = RuleBasedSingleItemScraper.build(samples)
         27 
         28 print("Scraping new question")
    
    4 frames
    /usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in build(samples)
         89                     matches_per_page_right = [
         90                         len(m) == 1 and m[0].get_text() == s.item[attr]
    ---> 91                         for m, s in zip(matches_per_page, samples)
         92                     ]
         93                     score = sum(matches_per_page_right) / len(samples)
    
    /usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in <listcomp>(.0)
         88                     matches_per_page = (s.page.select(selector) for s in samples)
         89                     matches_per_page_right = [
    ---> 90                         len(m) == 1 and m[0].get_text() == s.item[attr]
         91                         for m, s in zip(matches_per_page, samples)
         92                     ]
    
    /usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in <genexpr>(.0)
         86                 if selector not in selector_scoring:
         87                     logging.info("testing %s (%d/%d)", selector, i, len(selectors))
    ---> 88                     matches_per_page = (s.page.select(selector) for s in samples)
         89                     matches_per_page_right = [
         90                         len(m) == 1 and m[0].get_text() == s.item[attr]
    
    /usr/local/lib/python3.7/dist-packages/mlscraper/parser.py in select(self, css_selector)
         28     def select(self, css_selector):
         29         try:
    ---> 30             return [SoupNode(res) for res in self._soup.select(css_selector)]
         31         except NotImplementedError:
         32             logging.warning(
    
    /usr/local/lib/python3.7/dist-packages/bs4/element.py in select(self, selector, _candidate_generator, limit)
       1495                 if tag_name == '':
       1496                     raise ValueError(
    -> 1497                         "A pseudo-class must be prefixed with a tag name.")
       1498                 pseudo_attributes = re.match(r'([a-zA-Z\d-]+)\(([a-zA-Z\d]+)\)', pseudo)
       1499                 found = []
    
    ValueError: A pseudo-class must be prefixed with a tag name.
    
    opened by rish-hyun 2
  • Example from docs does not work

    Unfortunately, this example from the README does not work. Perhaps I'm doing something wrong.

    Example:

    import requests
    from mlscraper.html import Page
    from mlscraper.samples import Sample, TrainingSet
    from mlscraper.training import train_scraper
    
    # fetch the page to train
    einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
    resp = requests.get(einstein_url)
    assert resp.status_code == 200
    
    # create a sample for Albert Einstein
    training_set = TrainingSet()
    page = Page(resp.content)
    sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
    training_set.add_sample(sample)
    
    # train the scraper with the created training set
    scraper = train_scraper(training_set)
    
    # scrape another page
    resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
    result = scraper.get(Page(resp.content))
    print(result)
    

    Error:

    File ~/miniconda3/envs/colbert/lib/python3.8/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:133, in _make_cell_set_template_code()
        116     return types.CodeType(
        117         co.co_argcount,
        118         co.co_nlocals,
       (...)
        130         (),
        131     )
        132 else:
    --> 133     return types.CodeType(
        134         co.co_argcount,
        135         co.co_kwonlyargcount,
        136         co.co_nlocals,
        137         co.co_stacksize,
        138         co.co_flags,
        139         co.co_code,
        140         co.co_consts,
        141         co.co_names,
        142         co.co_varnames,
        143         co.co_filename,
        144         co.co_name,
        145         co.co_firstlineno,
        146         co.co_lnotab,
        147         co.co_cellvars,  # this is the trickery
        148         (),
        149     )
    
    TypeError: an integer is required (got type bytes)
    
    opened by creatorrr 1
  • Bump lxml from 4.5.1 to 4.6.5

    Bumps lxml from 4.5.1 to 4.6.5.

    Changelog

    Sourced from lxml's changelog.

    4.6.5 (2021-12-12)

    Bugs fixed

    • A vulnerability (GHSL-2021-1038) in the HTML cleaner allowed sneaking script content through SVG images.

    • A vulnerability (GHSL-2021-1037) in the HTML cleaner allowed sneaking script content through CSS imports and other crafted constructs.

    4.6.4 (2021-11-01)

    Features added

    • GH#317: A new property system_url was added to DTD entities. Patch by Thirdegree.

    • GH#314: The STATIC_* variables in setup.py can now be passed via env vars. Patch by Isaac Jurado.

    4.6.3 (2021-03-21)

    Bugs fixed

    • A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript to pass through. The cleaner now removes the HTML5 formaction attribute.

    4.6.2 (2020-11-26)

    Bugs fixed

    • A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.1 (2020-10-18)

    ... (truncated)

    Commits
    • a9611ba Fix a test in Py2.
    • a3eacbc Prepare release of 4.6.5.
    • b7ea687 Update changelog.
    • 69a7473 Cleaner: cover some more cases where scripts could sneak through in specially...
    • 54d2985 Fix condition in test decorator.
    • 4b220b5 Use the non-depcrecated TextTestResult instead of _TextTestResult (GH-333)
    • d85c6de Exclude a test when using the macOS system libraries because it fails with li...
    • cd4bec9 Add macOS-M1 as wheel build platform.
    • fd0d471 Install automake and libtool in macOS build to be able to install the latest ...
    • f233023 Cleaner: Remove SVG image data URLs since they can embed script content.
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 1
  • Bump pip from 19.2.3 to 21.1

    Bumps pip from 19.2.3 to 21.1.

    Changelog

    Sourced from pip's changelog.

    21.1 (2021-04-24)

    Process

    • Start installation scheme migration from distutils to sysconfig. A warning is implemented to detect differences between the two implementations to encourage user reports, so we can avoid breakages before they happen.

    Features

    • Add the ability for the new resolver to process URL constraints. (#8253)
    • Add a feature --use-feature=in-tree-build to build local projects in-place when installing. This is expected to become the default behavior in pip 21.3; see Installing from local packages <https://pip.pypa.io/en/stable/user_guide/#installing-from-local-packages> for more information. (#9091)
    • Bring back the "(from versions: ...)" message, that was shown on resolution failures. (#9139)
    • Add support for editable installs for project with only setup.cfg files. (#9547)
    • Improve performance when picking the best file from indexes during pip install. (#9748)
    • Warn instead of erroring out when doing a PEP 517 build in presence of --build-option. Warn when doing a PEP 517 build in presence of --global-option. (#9774)

    Bug Fixes

    • Fixed --target to work with --editable installs. (#4390)
    • Add a warning, discouraging the usage of pip as root, outside a virtual environment. (#6409)
    • Ignore .dist-info directories if the stem is not a valid Python distribution name, so they don't show up in e.g. pip freeze. (#7269)
    • Only query the keyring for URLs that actually trigger error 401. This prevents an unnecessary keyring unlock prompt on every pip install invocation (even with default index URL which is not password protected). (#8090)
    • Prevent packages already-installed alongside with pip to be injected into an isolated build environment during build-time dependency population. (#8214)
    • Fix pip freeze permission denied error in order to display an understandable error message and offer solutions. (#8418)
    • Correctly uninstall script files (from setuptools' scripts argument), when installed with --user. (#8733)
    • New resolver: When a requirement is requested both via a direct URL (req @ URL) and via version specifier with extras (req[extra]), the resolver will now be able to use the URL to correctly resolve the requirement with extras. (#8785)
    • New resolver: Show relevant entries from user-supplied constraint files in the error message to improve debuggability. (#9300)
    • Avoid parsing version to make the version check more robust against lousily debundled downstream distributions. (#9348)
    • --user is no longer suggested incorrectly when pip fails with a permission error in a virtual environment. (#9409)
    • Fix incorrect reporting on Requires-Python conflicts. (#9541)

    ... (truncated)

    Commits
    • 2b2a268 Bump for release
    • ea761a6 Update AUTHORS.txt
    • 2edd3fd Postpone a deprecation to 21.2
    • 3cccfbf Rename mislabeled news fragment
    • 21cd124 Fix NEWS.rst placeholder position
    • e46bdda Merge pull request #9827 from pradyunsg/fix-git-improper-tag-handling
    • 0e4938d :newspaper:
    • ca832b2 Don't split git references on unicode separators
    • 1320bac Merge pull request #9814 from pradyunsg/revamp-ci-apr-2021-v2
    • e9cc23f Skip checks on PRs only
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 1
  • Bump urllib3 from 1.25.9 to 1.26.5

    Bumps urllib3 from 1.25.9 to 1.26.5.

    Release notes

    Sourced from urllib3's releases.

    1.26.5

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed deprecation warnings emitted in Python 3.10.
    • Updated vendored six library to 1.16.0.
    • Improved performance of URL parser when splitting the authority component.

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    1.26.4

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    1.26.3

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed bytes and string comparison issue with headers (Pull #2141)

    • Changed ProxySchemeUnknown error message to be more actionable if the user supplies a proxy URL without a scheme (Pull #2107)

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    1.26.2

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed an issue where wrap_socket and CERT_REQUIRED wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)

    1.26.1

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed an issue where two User-Agent headers would be sent if a User-Agent header key is passed as bytes (Pull #2047)

    1.26.0

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)

    • Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning should opt-in explicitly by setting ssl_version=ssl.PROTOCOL_TLSv1_1 (Pull #2002) Starting in urllib3 v2.0: Connections that receive a DeprecationWarning will fail

    • Deprecated Retry options Retry.DEFAULT_METHOD_WHITELIST, Retry.DEFAULT_REDIRECT_HEADERS_BLACKLIST and Retry(method_whitelist=...) in favor of Retry.DEFAULT_ALLOWED_METHODS, Retry.DEFAULT_REMOVE_HEADERS_ON_REDIRECT, and Retry(allowed_methods=...) (Pull #2000) Starting in urllib3 v2.0: Deprecated options will be removed

    ... (truncated)

    Changelog

    Sourced from urllib3's changelog.

    1.26.5 (2021-05-26)

    • Fixed deprecation warnings emitted in Python 3.10.
    • Updated vendored six library to 1.16.0.
    • Improved performance of URL parser when splitting the authority component.

    1.26.4 (2021-03-15)

    • Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

    1.26.3 (2021-01-26)

    • Fixed bytes and string comparison issue with headers (Pull #2141)

    • Changed ProxySchemeUnknown error message to be more actionable if the user supplies a proxy URL without a scheme. (Pull #2107)

    1.26.2 (2020-11-12)

    • Fixed an issue where wrap_socket and CERT_REQUIRED wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)

    1.26.1 (2020-11-11)

    • Fixed an issue where two User-Agent headers would be sent if a User-Agent header key is passed as bytes (Pull #2047)

    1.26.0 (2020-11-10)

    • NOTE: urllib3 v2.0 will drop support for Python 2. Read more in the v2.0 Roadmap: https://urllib3.readthedocs.io/en/latest/v2-roadmap.html

    • Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)

    • Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning

    ... (truncated)

    Commits
    • d161647 Release 1.26.5
    • 2d4a3fe Improve performance of sub-authority splitting in URL
    • 2698537 Update vendored six to 1.16.0
    • 07bed79 Fix deprecation warnings for Python 3.10 ssl module
    • d725a9b Add Python 3.10 to GitHub Actions
    • 339ad34 Use pytest==6.2.4 on Python 3.10+
    • f271c9c Apply latest Black formatting
    • 1884878 [1.26] Properly proxy EOF on the SSLTransport test suite
    • a891304 Release 1.26.4
    • 8d65ea1 Merge pull request from GHSA-5phf-pp7p-vc2r
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 1
  • Bump py from 1.8.1 to 1.10.0

    Bumps py from 1.8.1 to 1.10.0.

    Changelog

    Sourced from py's changelog.

    1.10.0 (2020-12-12)

    • Fix a regular expression DoS vulnerability in the py.path.svnwc SVN blame functionality (CVE-2020-29651)
    • Update vendored apipkg: 1.4 => 1.5
    • Update vendored iniconfig: 1.0.0 => 1.1.1

    1.9.0 (2020-06-24)

    • Add type annotation stubs for the following modules:

      • py.error
      • py.iniconfig
      • py.path (not including SVN paths)
      • py.io
      • py.xml

      There are no plans to type other modules at this time.

      The type annotations are provided in external .pyi files, not inline in the code, and may therefore contain small errors or omissions. If you use py in conjunction with a type checker, and encounter any type errors you believe should be accepted, please report it in an issue.

    1.8.2 (2020-06-15)

    • On Windows, py.path.locals which differ only in case now have the same Python hash value. Previously, such paths were considered equal but had different hashes, which is not allowed and breaks the assumptions made by dicts, sets and other users of hashes.
    Commits
    • e5ff378 Update CHANGELOG for 1.10.0
    • 94cf44f Update vendored libs
    • 5e8ded5 testing: comment out an assert which fails on Python 3.9 for now
    • afdffcc Rename HOWTORELEASE.rst to RELEASING.rst
    • 2de53a6 Merge pull request #266 from nicoddemus/gh-actions
    • fa1b32e Merge pull request #264 from hugovk/patch-2
    • 887d6b8 Skip test_samefile_symlink on pypy3 on Windows
    • e94e670 Fix test_comments() in test_source
    • fef9a32 Adapt test
    • 4a694b0 Add GitHub Actions badge to README
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 1
  • Bump lxml from 4.5.1 to 4.6.3

    Bumps lxml from 4.5.1 to 4.6.3.

    Changelog

    Sourced from lxml's changelog.

    4.6.3 (2021-03-21)

    Bugs fixed

    • A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript to pass through. The cleaner now removes the HTML5 formaction attribute.

    4.6.2 (2020-11-26)

    Bugs fixed

    • A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.1 (2020-10-18)

    Bugs fixed

    • A vulnerability was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.0 (2020-10-17)

    Features added

    • GH#310: lxml.html.InputGetter supports __len__() to count the number of input fields. Patch by Aidan Woolley.

    • lxml.html.InputGetter has a new .items() method to ease processing all input fields.

    • lxml.html.InputGetter.keys() now returns the field names in document order.

    • GH-309: The API documentation is now generated using sphinx-apidoc. Patch by Chris Mayo.

    Bugs fixed

    ... (truncated)

    Commits
    • a5f9cb5 Prepare release of lxml 4.6.3.
    • 2d01a1b Add HTML-5 "formaction" attribute to "defs.link_attrs" (GH-316)
    • e986a9c Fix reference in docs.
    • 4cb5736 Work around Py2's lack of "re.ASCII".
    • c30106f Prepare release of 4.6.2.
    • a105ab8 Prevent combinations of <math/svg> and <style> to sneak JavaScript through th...
    • c053dc1 Add a recipe for a look-ahead generator to allow modifications during tree it...
    • b083124 lxml actually works in Py3.9.
    • 0f80590 lxml actually works in Py3.9.
    • fd8893c Add a doc note that the .find() methods are usually faster than one might exp...
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 1
  • Bump lxml from 4.5.1 to 4.6.2

    Bumps lxml from 4.5.1 to 4.6.2.

    Changelog

    Sourced from lxml's changelog.

    4.6.2 (2020-11-26)

    Bugs fixed

    • A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.1 (2020-10-18)

    Bugs fixed

    • A vulnerability was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.0 (2020-10-17)

    Features added

    • GH#310: lxml.html.InputGetter supports __len__() to count the number of input fields. Patch by Aidan Woolley.

    • lxml.html.InputGetter has a new .items() method to ease processing all input fields.

    • lxml.html.InputGetter.keys() now returns the field names in document order.

    • GH-309: The API documentation is now generated using sphinx-apidoc. Patch by Chris Mayo.

    Bugs fixed

    • LP#1869455: C14N 2.0 serialisation failed for unprefixed attributes when a default namespace was defined.

    • TreeBuilder.close() raised AssertionError in some error cases where it should have raised XMLSyntaxError. It now raises a combined exception to keep up backwards compatibility, while switching to XMLSyntaxError as an interface.

    4.5.2 (2020-07-09)

    ... (truncated)

    Commits
    • 4cb5736 Work around Py2's lack of "re.ASCII".
    • c30106f Prepare release of 4.6.2.
    • a105ab8 Prevent combinations of <math/svg> and <style> to sneak JavaScript through th...
    • c053dc1 Add a recipe for a look-ahead generator to allow modifications during tree it...
    • b083124 lxml actually works in Py3.9.
    • 0f80590 lxml actually works in Py3.9.
    • fd8893c Add a doc note that the .find() methods are usually faster than one might exp...
    • eb6df27 Update release version on homepage.
    • 69b5c9b Automate the build artefact downloading from github and appveyor.
    • 61432a8 Prepare release of lxml 4.6.1.
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 1
  • Include fetching in scrapers

    Scrapers should be able to deal with URLs, HTML, and parsed DOMs (maybe even requests response objects?) to enable flexible library usage.

    • scrapers accept HTML, DOM, and responses as input, e.g. via scrape_soup, scrape_html, scrape_url (see the sketch after this list)
    • examples can be created via static methods, e.g. via SingleItemPageSample.from_soup, etc.
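
    A rough sketch of how these entry points could layer on one another, assuming requests and BeautifulSoup; the scrape_* names come from this issue, while BaseScraper and the return types are illustrative only:

    import requests
    from bs4 import BeautifulSoup

    class BaseScraper:
        def scrape_soup(self, soup: BeautifulSoup) -> dict:
            raise NotImplementedError  # a trained scraper implements this

        def scrape_html(self, html: bytes) -> dict:
            return self.scrape_soup(BeautifulSoup(html, "html.parser"))

        def scrape_url(self, url: str) -> dict:
            return self.scrape_html(requests.get(url).content)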
    enhancement 
    opened by lorey 1
  • Bump wheel from 0.37.1 to 0.38.1 in /requirements

    Bumps wheel from 0.37.1 to 0.38.1.

    Changelog

    Sourced from wheel's changelog.

    Release Notes

    UNRELEASED

    • Updated vendored packaging to 22.0

    0.38.4 (2022-11-09)

    • Fixed PKG-INFO conversion in bdist_wheel mangling UTF-8 header values in METADATA (PR by Anderson Bravalheri)

    0.38.3 (2022-11-08)

    • Fixed install failure when used with --no-binary, reported on Ubuntu 20.04, by removing setup_requires from setup.cfg

    0.38.2 (2022-11-05)

    • Fixed regression introduced in v0.38.1 which broke parsing of wheel file names with multiple platform tags

    0.38.1 (2022-11-04)

    • Removed install dependency on setuptools
    • The future-proof fix in 0.36.0 for converting PyPy's SOABI into an abi tag was faulty. Fixed so that future changes in the SOABI will not change the tag.

    0.38.0 (2022-10-21)

    • Dropped support for Python < 3.7
    • Updated vendored packaging to 21.3
    • Replaced all uses of distutils with setuptools
    • The handling of license_files (including glob patterns and default values) is now delegated to setuptools>=57.0.0 (#466). The package dependencies were updated to reflect this change.
    • Fixed potential DoS attack via the WHEEL_INFO_RE regular expression
    • Fixed ValueError: ZIP does not support timestamps before 1980 when using SOURCE_DATE_EPOCH=0 or when on-disk timestamps are earlier than 1980-01-01. Such timestamps are now changed to the minimum value before packaging.

    0.37.1 (2021-12-22)

    • Fixed wheel pack duplicating the WHEEL contents when the build number has changed (#415)
    • Fixed parsing of file names containing commas in RECORD (PR by Hood Chatham)

    0.37.0 (2021-08-09)

    • Added official Python 3.10 support
    • Updated vendored packaging library to v20.9

    ... (truncated)

    Commits
    • 6f1608d Created a new release
    • cf8f5ef Moved news item from PR #484 to its proper place
    • 9ec2016 Removed install dependency on setuptools (#483)
    • 747e1f6 Fixed PyPy SOABI parsing (#484)
    • 7627548 [pre-commit.ci] pre-commit autoupdate (#480)
    • 7b9e8e1 Test on Python 3.11 final
    • a04dfef Updated the pypi-publish action
    • 94bb62c Fixed docs not building due to code style changes
    • d635664 Updated the codecov action to the latest version
    • fcb94cd Updated version to match the release
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 0
  • Bump certifi from 2022.6.15 to 2022.12.7 in /requirements

    Bumps certifi from 2022.6.15 to 2022.12.7.


    dependencies 
    opened by dependabot[bot] 0
  • Find better selectors

    Currently, we just use the next best selector we find, going from generic to specific. But overly generic selectors are bad, e.g. div most likely has no meaning; on the other hand, overly specific selectors like the full path will likely break.

    Maybe there's a heuristic for good selectors. An idea: what if we compute the selectivity of each selector, i.e. how unique it is on the whole page? This would prefer ids and unique classes and discourage generic selectors. We would then take the most selective but simplest selector (see the sketch below).
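
    A minimal sketch of this selectivity idea, assuming BeautifulSoup; the function names are hypothetical, not part of mlscraper's API:

    from bs4 import BeautifulSoup

    def selectivity(soup, selector):
        # 1.0 if the selector matches exactly one node, lower the more nodes it matches
        matches = soup.select(selector)
        return 1.0 / len(matches) if matches else 0.0

    def best_selector(soup, candidates):
        # prefer the most selective selector, break ties by simplicity (length)
        return max(candidates, key=lambda s: (selectivity(soup, s), -len(s)))

    html = '<div><a id="user" class="user-box">lorey</a><a>other</a></div>'
    soup = BeautifulSoup(html, "html.parser")
    print(best_selector(soup, ["a", "#user", "div a"]))  # -> "#user"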

    opened by lorey 0
  • Match substrings

    Often, users do not want to match the full attribute or text of a node, but a specific substring.

    Solutions:

    • generate extractors that use appropriate rules to transform node.text into the desired outcome (see the sketch below)
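
    A hedged sketch of such an extractor, learning a regex rule from a single example; the function name and approach are illustrative, not mlscraper API:

    import re

    def learn_substring_rule(full_text, wanted):
        # build a regex that extracts `wanted` from texts shaped like `full_text`
        start = full_text.index(wanted)
        prefix = re.escape(full_text[:start])
        suffix = re.escape(full_text[start + len(wanted):])
        return re.compile(f"^{prefix}(.+?){suffix}$")

    rule = learn_substring_rule("Born: March 14, 1879", "March 14, 1879")
    print(rule.match("Born: July 31, 1965").group(1))  # -> "July 31, 1965"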
    enhancement 
    opened by lorey 1
Releases (v1.0.0rc3)
  • v1.0.0rc3 (Jun 24, 2022)

    • improved training performance by 10x (again) by trying to generate scrapers for highly similar matches first
    • added first pseudo CSS selectors by implementing nth-child, e.g. div a:nth-child(1)
    • added child selector generation, e.g. .user-box > a
    • added attribute-based css selectors, e.g. a[itemprop="user"]
    • added automated tests for GitHub profile pages
    • added lazy hashing for node elements
    • extended text matching to also include parent elements that contain the same text
    • fixed a bug where searching for values resulted in image dimensions being matched
    • fixed a bug where text did not exactly match the sample provided but was selected anyway
  • v1.0.0rc2 (Jun 21, 2022)

  • v1.0.0rc1 (Jun 21, 2022)

    mlscraper has been rewritten from the ground up and is now easier to use, more flexible, and faster than ever. This is the first release candidate for the upcoming 1.0 version. Feel free to try it out with pip install --pre mlscraper; a condensed example follows the list below.

    • scrapers can extract arbitrary data structures (lists, dicts, lists of dicts and even lists of lists)
    • depending on the page, one example might be enough to train a scraper
    • the generation of CSS selectors has been overhauled and is now more efficient
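
    A condensed sketch of the new API, mirroring the quotes example from the comments above (the URLs and fields come from that example):

    import requests
    from mlscraper.html import Page
    from mlscraper.samples import Sample, TrainingSet
    from mlscraper.training import train_scraper

    page = Page(requests.get('http://quotes.toscrape.com/author/Albert-Einstein/').content)
    training_set = TrainingSet()
    training_set.add_sample(Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'}))
    scraper = train_scraper(training_set)
    print(scraper.get(Page(requests.get('http://quotes.toscrape.com/author/J-K-Rowling').content)))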
Owner
Karl Lorey
I build, therefore I am. ML, Crawling, Scraping. Using whatever gets the job done, mostly Python, SQL, HTML/CSS, JS.
Python framework to scrape Pastebin pastes and analyze them

pastepwn - Paste-Scraping Python Framework Pastebin is a very helpful tool to store or rather share ascii encoded data online. In the world of OSINT,

Rico 105 Dec 29, 2022
DaProfiler allows you to get emails, social media accounts, addresses, workplaces and more on your target using web scraping and Google dorking techniques

DaProfiler allows you to get emails, social media accounts, addresses, workplaces and more on your target using web scraping and Google dorking techniques, based in France only. The particularity of this program i

Dalunacrobate 347 Jan 07, 2023
tweet random sand cat pictures

sandcatbot setup pip3 install --user -r requirements.txt cp sandcatbot.example.conf sandcatbot.conf vim sandcatbot.conf running the first parameter i

jess 8 Aug 07, 2022
This is a web crawler that works on email data from gmane.org and visualizes it in different ways.

crawler_to_visual_gmane: analyzing an email archive from gmane and visualizing the data using the D3 JavaScript library. This is a set of tools that al

Saim Zafar 1 Dec 20, 2021
A Python package that scrapes Google News article data while remaining undetected by Google.

A Python package that scrapes Google News article data while remaining undetected by Google. Our scraper can scrape page data up until the last page and never trigger a CAPTCHA (download stats: https

Geminid Systems, Inc 6 Aug 10, 2022
🐞 Douban Movie / Douban Book Scarpy

Python3-based Douban Movie/Douban Book Scarpy crawler for cover downloading + data crawling + review entry.

Xingbo Jia 1 Dec 03, 2022
A spider for the Universal Online Judge (UOJ) system, converting problem pages to PDFs.

Universal Online Judge Spider Introduction This is a spider for Universal Online Judge (UOJ) system (https://uoj.ac/). It also works for all other Onl

TriNitroTofu 1 Dec 07, 2021
Python script that reads AliExpress offer URLs from a CSV file and posts them to a Telegram channel using a bot

Aliexpress to telegram post: Python script that reads AliExpress offer URLs from a CSV file and posts them to a Telegram channel using a b

Fernando 6 Dec 06, 2022
Screen scraping and web crawling framework

Pomp Pomp is a screen scraping and web crawling framework. Pomp is inspired by and similar to Scrapy, but has a simpler implementation that lacks the

Evgeniy Tatarkin 61 Jun 21, 2021
Pushes JD Cloud Wireless Treasure (京东云无线宝) points and supports viewing point usage across multiple devices

JDRouterPush project overview: this project calls the JD Cloud Wireless Treasure API to push daily point earnings on a schedule, helping you keep an eye on the key numbers. Changelog 2021-03-02: query the bound JD account, improved notification layout, script update checks, Server酱 Turbo support. 2021-02-25: implemented multi-device queries

雷疯 199 Dec 12, 2022
Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023
Open Crawl Vietnamese Text

Open Crawl Vietnamese Text This repo contains crawled Vietnamese text from multiple sources. This list of a topic-centric public data sources in high

QAI Research 4 Jan 05, 2022
FilmMikirAPI - A simple REST API used for scraping the Kincir website using the Python and Flask packages

FilmMikirAPI - A simple REST API used for scraping the Kincir website using the Python and Flask packages

UserGhost411 1 Nov 17, 2022
Scrapes proxies and saves them to a text file

Proxy Scraper Scrapes proxies from https://proxyscrape.com and saves them to a file. Also has a customizable theme system Made by nell and Lamp

nell 2 Dec 22, 2021
Github scraper app is used to scrape data for a specific user profile created using streamlit and BeautifulSoup python packages

Github Scraper Github scraper app is used to scrape data for a specific user profile. Github scraper app gets a github profile name and check whether

Siva Prakash 6 Apr 05, 2022
A low-code tool that generates Python crawler code based on curl or url

KKBA Introduction: a low-code tool that generates Python crawler code based on curl or url. Requirement: Python = 3.6. Install: pip install kkba. Usage: Co

8 Sep 20, 2021
A tool that can scrape products on AliExpress: title, price, and product URL.

Scrape-Product-Aliexpress: a tool that can scrape products on AliExpress: title, price, and product URL. Usage: 1. Install Python 3.8/3.9 even though the page ins

Rahul Joshua Damanik 1 Dec 30, 2021
Web scraper for getting item price quotes

WebScrapper: this web scraper is built with Python 3.10.0 to search the Cyberpuerta page for items in its catalog. The program t

Jordan Gaona 1 Oct 27, 2021
Crawls the day's announcements from major SRC platforms | a small tool that sends WeChat notifications | bug bounty tool

OnTimeHacker V1.0: OnTimeHacker is a small tool that crawls the day's announcements from major SRC (Security Response Center) platforms and sends notifications via WeChat. The current version is 1.0 and supports 24 SRCs, including 360, iQIYI, Alibaba, Baidu, Bilibili, Beike, Boss, 58, Cainiao, Didi, Douyu, Ele.me, Guazi, Hehe, Xiangdao, JD,

Bywalks 95 Jan 07, 2023
A scraper written in Python

Scrapper-en-python: scraping data means collecting it in order to process or analyze it. In Python, there are two main ways to scrape

Lun4rIum 1 Dec 05, 2021