A modern CSS selector implementation for BeautifulSoup


Soup Sieve

Overview

Soup Sieve is a CSS selector library designed to be used with Beautiful Soup 4. It aims to provide selecting, matching, and filtering using modern CSS selectors. Soup Sieve currently provides selectors from the CSS level 1 specifications up through the latest CSS level 4 drafts and beyond (though some are not yet implemented).

Soup Sieve was written with the intent of replacing Beautiful Soup's built-in select feature, and as of Beautiful Soup version 4.7.0, it has 🎊. Soup Sieve can also be imported directly in order to use its API for more controlled, specialized parsing.

Soup Sieve has implemented most of the CSS selectors up through the latest CSS draft specifications, though there are a number that don't make sense in a non-browser environment. Selectors that cannot provide meaningful functionality simply do not match anything. Some of the supported selectors are:

  • .classes
  • #ids
  • [attributes=value]
  • parent child
  • parent > child
  • sibling ~ sibling
  • sibling + sibling
  • :not(element.class, element2.class)
  • :is(element.class, element2.class)
  • parent:has(> child)
  • and many more
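
For example, a minimal sketch of the API against a small document (the markup here is illustrative):

```python
import soupsieve as sv
from bs4 import BeautifulSoup

html = """
<div id="main">
  <p class="intro">Hello</p>
  <p>World</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Select all <p> elements that are direct children of #main.
paras = sv.select("#main > p", soup)
print([p.get_text() for p in paras])  # ['Hello', 'World']

# Match a single element against a selector.
print(sv.match("p.intro", paras[0]))  # True
```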

Installation

You must have Beautiful Soup already installed:

pip install beautifulsoup4

In most cases, assuming you've installed Beautiful Soup version 4.7.0 or later, that should be all you need to do; but if you've installed via some alternative method and Soup Sieve was not automatically installed for you, you can install it directly:

pip install soupsieve

If you want to manually install it from source, navigate to the root of the project and run:

python setup.py build
python setup.py install

Documentation

Documentation is found here: https://facelessuser.github.io/soupsieve/.

License

MIT License

Copyright (c) 2018 - 2021 Isaac Muse [email protected]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Comments
  • Suggestion: adding soupsieve.strain

    Hello @facelessuser,

    I need to create BeautifulSoup strainers to optimize scrapers for my tool here. But as a convenience for my users, I am going to build them from simple CSS selectors such as li.item[align=left].

    I can do so by using your CSSParser class, which processes the pattern and returns Selector instances, and assessing whether the selector is simple enough to be suitable for a strainer. If so, I can build a function that will "apply" this selector to tell the strainer whether it should parse the current node etc.

    I will implement this in my tool, but I was wondering if you'd like me to contribute it to this lib instead by adding something like soupsieve.strain. It would return an argument (typically a function) you can give to bs4.SoupStrainer, and it should raise a custom error if the selector is found to be too complex for the task. If this is of any interest I can open a PR for it.
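
    Sketching the idea (the names `strain` and `SIMPLE` below are hypothetical, not an existing soupsieve API): a simple selector is decomposed into a tag name, a class, and attribute/value pairs, which map directly onto bs4.SoupStrainer arguments; anything more complex raises an error.

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Accepts only "tag", "tag.class", "tag[attr=value]" style selectors.
SIMPLE = re.compile(
    r"^(?P<tag>[a-zA-Z][\w-]*)?"
    r"(?P<cls>\.[\w-]+)?"
    r"(?P<attrs>(?:\[[\w-]+=[^\]]+\])*)$"
)

def strain(selector):
    """Build a bs4.SoupStrainer from a simple selector, or raise ValueError."""
    m = SIMPLE.match(selector)
    if m is None or not selector:
        raise ValueError(f"selector too complex for a strainer: {selector!r}")
    attrs = dict(re.findall(r"\[([\w-]+)=([^\]]+)\]", m.group("attrs")))
    if m.group("cls"):
        attrs["class"] = m.group("cls")[1:]
    return SoupStrainer(m.group("tag") or True, attrs=attrs)

html = '<ul><li class="item" align="left">a</li><li>b</li></ul>'
soup = BeautifulSoup(html, "html.parser", parse_only=strain("li.item[align=left]"))
print([li.get_text() for li in soup.find_all("li")])  # ['a']
```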

    Have a good day and thanks for your work,

    T: feature P: maybe 
    opened by Yomguithereal 29
  • CDATA handling in HTML changed in lxml parser with libxml2 2.9.12

    After upgrading the system libxml2 to 2.9.12 (or 2.9.11; 2.9.10 is the previous working version I have here), the two following tests fail with lxml built against the system library:

    FAILED tests/test_extra/test_soup_contains.py::TestSoupContains::test_contains_cdata_html - AssertionError: Lists differ: ['1', '2'] != ['1']
    FAILED tests/test_extra/test_soup_contains_own.py::TestSoupContainsOwn::test_contains_own_cdata_html - AssertionError: Lists differ: ['1', '2']...
    

    The cause seems to be a different representation of CDATA:

            soup       = <html><body><div id="1">Testing that <span id="2">&lt;![CDATA[that]]&gt;</span>contains works.</div></body>
    </html>
    

    (i.e. &lt;![CDATA[... instead of <!--[CDATA[...)

    Note that in order to reproduce you need to both upgrade libxml2 and build lxml against the new version. Binary wheels are statically linked to an old version of libxml2, so they do not reproduce the issue yet. For example, I have been able to reproduce it with tox after swapping the installed lxml version:

    . .tox/py39/bin/activate
    pip uninstall lxml
    pip install lxml --no-binary lxml
    

    I am also not sure whether this might be a bug in libxml2 or in lxml itself.

    S: more-info-needed S: triage 
    opened by mgorny 21
  • 2.2.1: pytest based test suite is failing

    IMO it would be good to fix pytest support as pytest has a bit shorter list of dependencies than tox.

    + PYTHONPATH=/home/tkloczko/rpmbuild/BUILDROOT/python-soupsieve-2.2.1-2.fc35.x86_64/usr/lib64/python3.8/site-packages:/home/tkloczko/rpmbuild/BUILDROOT/python-soupsieve-2.2.1-2.fc35.x86_64/usr/lib/python3.8/site-packages
    + /usr/bin/python3 -Bm pytest -ra
    =========================================================================== test session starts ============================================================================
    platform linux -- Python 3.8.8, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
    rootdir: /home/tkloczko/rpmbuild/BUILD/soupsieve-2.2.1, configfile: tox.ini
    plugins: flaky-3.6.1, forked-1.3.0, shutil-1.7.0, virtualenv-1.7.0, asyncio-0.14.0, expect-1.1.0, cov-2.11.1, mock-3.5.1, httpbin-1.0.0, xdist-2.2.1, flake8-1.0.7, timeout-1.4.2, betamax-0.8.1, hypothesis-6.8.1, pyfakefs-4.4.0, freezegun-0.4.2
    collected 360 items
    
    tests/test_api.py ..........................................                                                                                                         [ 11%]
    tests/test_bs4_cases.py .....                                                                                                                                        [ 13%]
    tests/test_versions.py ....                                                                                                                                          [ 14%]
    tests/test_extra/test_attribute.py ...                                                                                                                               [ 15%]
    tests/test_extra/test_custom.py ..........                                                                                                                           [ 17%]
    tests/test_extra/test_soup_contains.py ..F...............                                                                                                            [ 22%]
    tests/test_extra/test_soup_contains_own.py .F...                                                                                                                     [ 24%]
    tests/test_level1/test_active.py .                                                                                                                                   [ 24%]
    tests/test_level1/test_at_rule.py .                                                                                                                                  [ 24%]
    tests/test_level1/test_class.py ........                                                                                                                             [ 26%]
    tests/test_level1/test_comments.py ..                                                                                                                                [ 27%]
    tests/test_level1/test_descendant.py .                                                                                                                               [ 27%]
    tests/test_level1/test_escapes.py .                                                                                                                                  [ 28%]
    tests/test_level1/test_id.py ...                                                                                                                                     [ 28%]
    tests/test_level1/test_link.py ..                                                                                                                                    [ 29%]
    tests/test_level1/test_list.py ....                                                                                                                                  [ 30%]
    tests/test_level1/test_pseudo_class.py ..                                                                                                                            [ 31%]
    tests/test_level1/test_pseudo_element.py .                                                                                                                           [ 31%]
    tests/test_level1/test_type.py .....                                                                                                                                 [ 32%]
    tests/test_level1/test_visited.py .                                                                                                                                  [ 33%]
    tests/test_level2/test_attribute.py ..............................                                                                                                   [ 41%]
    tests/test_level2/test_child.py .....                                                                                                                                [ 42%]
    tests/test_level2/test_first_child.py .                                                                                                                              [ 43%]
    tests/test_level2/test_focus.py ..                                                                                                                                   [ 43%]
    tests/test_level2/test_hover.py .                                                                                                                                    [ 43%]
    tests/test_level2/test_lang.py ..                                                                                                                                    [ 44%]
    tests/test_level2/test_next_sibling.py ...                                                                                                                           [ 45%]
    tests/test_level2/test_universal_type.py .                                                                                                                           [ 45%]
    tests/test_level3/test_attribute.py ...                                                                                                                              [ 46%]
    tests/test_level3/test_checked.py .                                                                                                                                  [ 46%]
    tests/test_level3/test_disabled.py .......                                                                                                                           [ 48%]
    tests/test_level3/test_empty.py .                                                                                                                                    [ 48%]
    tests/test_level3/test_enabled.py ......                                                                                                                             [ 50%]
    tests/test_level3/test_first_of_type.py ...                                                                                                                          [ 51%]
    tests/test_level3/test_last_child.py ..                                                                                                                              [ 51%]
    tests/test_level3/test_last_of_type.py ...                                                                                                                           [ 52%]
    tests/test_level3/test_namespace.py ..............                                                                                                                   [ 56%]
    tests/test_level3/test_not.py ....                                                                                                                                   [ 57%]
    tests/test_level3/test_nth_child.py ......                                                                                                                           [ 59%]
    tests/test_level3/test_nth_last_child.py ..                                                                                                                          [ 60%]
    tests/test_level3/test_nth_last_of_type.py ..                                                                                                                        [ 60%]
    tests/test_level3/test_nth_of_type.py ..                                                                                                                             [ 61%]
    tests/test_level3/test_only_child.py .                                                                                                                               [ 61%]
    tests/test_level3/test_only_of_type.py .                                                                                                                             [ 61%]
    tests/test_level3/test_root.py ...........                                                                                                                           [ 64%]
    tests/test_level3/test_subsequent_sibling.py .                                                                                                                       [ 65%]
    tests/test_level3/test_target.py ..                                                                                                                                  [ 65%]
    tests/test_level4/test_any_link.py ....                                                                                                                              [ 66%]
    tests/test_level4/test_attribute.py .....                                                                                                                            [ 68%]
    tests/test_level4/test_current.py ....                                                                                                                               [ 69%]
    tests/test_level4/test_default.py .....                                                                                                                              [ 70%]
    tests/test_level4/test_defined.py ..                                                                                                                                 [ 71%]
    tests/test_level4/test_dir.py ...........                                                                                                                            [ 74%]
    tests/test_level4/test_focus_visible.py ..                                                                                                                           [ 74%]
    tests/test_level4/test_focus_within.py ..                                                                                                                            [ 75%]
    tests/test_level4/test_future.py ..                                                                                                                                  [ 75%]
    tests/test_level4/test_has.py ..............                                                                                                                         [ 79%]
    tests/test_level4/test_host.py ..                                                                                                                                    [ 80%]
    tests/test_level4/test_host_context.py .                                                                                                                             [ 80%]
    tests/test_level4/test_in_range.py .......                                                                                                                           [ 82%]
    tests/test_level4/test_indeterminate.py ..                                                                                                                           [ 83%]
    tests/test_level4/test_is.py ........                                                                                                                                [ 85%]
    tests/test_level4/test_lang.py ..................                                                                                                                    [ 90%]
    tests/test_level4/test_local_link.py ..                                                                                                                              [ 90%]
    tests/test_level4/test_matches.py ..                                                                                                                                 [ 91%]
    tests/test_level4/test_not.py .                                                                                                                                      [ 91%]
    tests/test_level4/test_nth_child.py ..                                                                                                                               [ 92%]
    tests/test_level4/test_optional.py ..                                                                                                                                [ 92%]
    tests/test_level4/test_out_of_range.py .......                                                                                                                       [ 94%]
    tests/test_level4/test_past.py ..                                                                                                                                    [ 95%]
    tests/test_level4/test_paused.py ..                                                                                                                                  [ 95%]
    tests/test_level4/test_placeholder_shown.py .                                                                                                                        [ 96%]
    tests/test_level4/test_playing.py ..                                                                                                                                 [ 96%]
    tests/test_level4/test_read_only.py .                                                                                                                                [ 96%]
    tests/test_level4/test_read_write.py .                                                                                                                               [ 97%]
    tests/test_level4/test_required.py ..                                                                                                                                [ 97%]
    tests/test_level4/test_scope.py ...                                                                                                                                  [ 98%]
    tests/test_level4/test_target_within.py ..                                                                                                                           [ 99%]
    tests/test_level4/test_user_invalid.py .                                                                                                                             [ 99%]
    tests/test_level4/test_where.py ..                                                                                                                                   [100%]
    
    ================================================================================= FAILURES =================================================================================
    ________________________________________________________________ TestSoupContains.test_contains_cdata_html _________________________________________________________________
    
    self = <tests.test_extra.test_soup_contains.TestSoupContains testMethod=test_contains_cdata_html>
    
        def test_contains_cdata_html(self):
            """Test contains CDATA in HTML5."""
    
            markup = """
            <body><div id="1">Testing that <span id="2"><![CDATA[that]]></span>contains works.</div></body>
            """
    
    >       self.assert_selector(
                markup,
                'body *:-soup-contains("that")',
                ['1'],
                flags=util.HTML
            )
    
    tests/test_extra/test_soup_contains.py:154:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    tests/util.py:122: in assert_selector
        self.assertEqual(sorted(ids), sorted(expected_ids))
    E   AssertionError: Lists differ: ['1', '2'] != ['1']
    E
    E   First list contains 1 additional elements.
    E   First extra element 1:
    E   '2'
    E
    E   - ['1', '2']
    E   + ['1']
    --------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------
    ----Running Selector Test----
    PATTERN:  body *:-soup-contains("that")
    ## PARSING: 'body *:-soup-contains("that")'
    TOKEN: 'tag' --> 'body' at position 0
    TOKEN: 'combine' --> ' ' at position 4
    TOKEN: 'tag' --> '*' at position 5
    TOKEN: 'pseudo_contains' --> ':-soup-contains("that")' at position 6
    ## END PARSING
    
    ====PARSER:  html5lib
    TAG:  div
    
    ====PARSER:  lxml
    TAG:  div
    TAG:  span
    _____________________________________________________________ TestSoupContainsOwn.test_contains_own_cdata_html _____________________________________________________________
    
    self = <tests.test_extra.test_soup_contains_own.TestSoupContainsOwn testMethod=test_contains_own_cdata_html>
    
        def test_contains_own_cdata_html(self):
            """Test contains CDATA in HTML5."""
    
            markup = """
            <body><div id="1">Testing that <span id="2"><![CDATA[that]]></span>contains works.</div></body>
            """
    
    >       self.assert_selector(
                markup,
                'body *:-soup-contains-own("that")',
                ['1'],
                flags=util.HTML
            )
    
    tests/test_extra/test_soup_contains_own.py:45:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    tests/util.py:122: in assert_selector
        self.assertEqual(sorted(ids), sorted(expected_ids))
    E   AssertionError: Lists differ: ['1', '2'] != ['1']
    E
    E   First list contains 1 additional elements.
    E   First extra element 1:
    E   '2'
    E
    E   - ['1', '2']
    E   + ['1']
    --------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------
    ----Running Selector Test----
    PATTERN:  body *:-soup-contains-own("that")
    ## PARSING: 'body *:-soup-contains-own("that")'
    TOKEN: 'tag' --> 'body' at position 0
    TOKEN: 'combine' --> ' ' at position 4
    TOKEN: 'tag' --> '*' at position 5
    TOKEN: 'pseudo_contains' --> ':-soup-contains-own("that")' at position 6
    ## END PARSING
    
    ====PARSER:  html5lib
    TAG:  div
    
    ====PARSER:  lxml
    TAG:  div
    TAG:  span
    ========================================================================= short test summary info ==========================================================================
    FAILED tests/test_extra/test_soup_contains.py::TestSoupContains::test_contains_cdata_html - AssertionError: Lists differ: ['1', '2'] != ['1']
    FAILED tests/test_extra/test_soup_contains_own.py::TestSoupContainsOwn::test_contains_own_cdata_html - AssertionError: Lists differ: ['1', '2'] != ['1']
    ====================================================================== 2 failed, 358 passed in 2.25s =======================================================================
    
    S: duplicate 
    opened by kloczek 21
  • perf: don't import bs4 in every `is_...` function

    In my app, this makes is_tag (the third hottest function according to profiling) about 24% faster:

    func | ncalls | time | owntime
    ---- | ------ | ---- | -------
    master/is_tag | 2054073 | 1566 | 1179
    perf-istag/is_tag | 2054073 | 1191 | 775

    I assume unpacking all of the classes from bs4 might make things a little faster still, but it's probably not worth the mess?
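
    The change is an instance of a general CPython pattern: a module-level import is resolved once, while an import statement inside a hot function re-runs the import machinery (a sys.modules lookup and local binding) on every call. A stdlib-only sketch of the difference (the timings are illustrative, not from this PR):

```python
import collections  # resolved once, at module load
import timeit

def check_local(obj):
    # Re-executes the import statement (sys.modules lookup) on every call.
    import collections
    return isinstance(obj, collections.OrderedDict)

def check_hoisted(obj):
    # Uses the module-level binding; no per-call import overhead.
    return isinstance(obj, collections.OrderedDict)

d = collections.OrderedDict()
t_local = timeit.timeit(lambda: check_local(d), number=100_000)
t_hoisted = timeit.timeit(lambda: check_hoisted(d), number=100_000)
print(f"local: {t_local:.3f}s  hoisted: {t_hoisted:.3f}s")
```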

    S: approved C: source C: css-matching 
    opened by akx 19
  • Improve CSS syntax error reporting

    This produces tracebacks like the following:

    Traceback (most recent call last):
      ...
      File "/home/mg/src/zopefoundation/zc.catalog/.tox/py37/lib/python3.7/site-packages/zope/testbrowser/browser.py", line 1370, in getControlLabels
        forlbls = html.select('label[for=%s]' % controlid)
      File "/home/mg/src/zopefoundation/zc.catalog/.tox/py37/lib/python3.7/site-packages/bs4/element.py", line 1376, in select
        return soupsieve.select(selector, self, namespaces, limit, **kwargs)
      File "/home/mg/src/soupsieve/soupsieve/__init__.py", line 108, in select
        return compile(select, namespaces, flags).select(tag, limit)
      File "/home/mg/src/soupsieve/soupsieve/__init__.py", line 59, in compile
        return cp._cached_css_compile(pattern, namespaces, flags)
      File "/home/mg/src/soupsieve/soupsieve/css_parser.py", line 192, in _cached_css_compile
        CSSParser(pattern, flags).process_selectors(),
      File "/home/mg/src/soupsieve/soupsieve/css_parser.py", line 930, in process_selectors
        return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
      File "/home/mg/src/soupsieve/soupsieve/css_parser.py", line 772, in parse_selectors
        key, m = next(iselector)
      File "/home/mg/src/soupsieve/soupsieve/css_parser.py", line 917, in selector_iter
        raise SelectorSyntaxError(msg, self.pattern)
      File "<string>", line 1
        label[for=BrowserAdd__zope.catalog.catalog.Catalog]
             ^
    soupsieve.css_parser.SelectorSyntaxError: Malformed attribute selector at position 5
    

    whereas before the traceback ended in

      File "/home/mg/src/zopefoundation/zc.catalog/.tox/py37/lib/python3.7/site-packages/soupsieve/css_parser.py", line 881, in selector_iter
        raise SyntaxError(msg)
      File "<string>", line None
    SyntaxError: Malformed attribute selector at position 5
    

    making it difficult to see what exactly was malformed about the selector.

    I've also chosen to introduce an exception subclass (SelectorSyntaxError), so that CSS parse errors could be distinguished from genuine Python syntax errors.
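
    Current Soup Sieve releases export this exception at the top level, so callers can catch CSS parse errors specifically without also catching genuine Python SyntaxError. A minimal sketch:

```python
import soupsieve as sv
from bs4 import BeautifulSoup

soup = BeautifulSoup('<label for="x">Name</label>', "html.parser")

try:
    # '.' is not valid in an unquoted attribute value, so this is malformed CSS.
    sv.select("label[for=BrowserAdd__zope.catalog.catalog.Catalog]", soup)
except sv.SelectorSyntaxError as e:
    print(f"bad selector: {e}")

# Quoting the attribute value makes the selector valid.
print(sv.select('label[for="x"]', soup))
```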

    S: rejected T: maintenance 
    opened by mgedmin 18
  • An easy way to set priority?

    https://developer.mozilla.org/en-US/docs/Web/CSS/Specificity says: "Using !important, however, is bad practice and should be avoided because it makes debugging more difficult by breaking the natural cascading in your stylesheets." But sometimes when I use complex selectors, they are hard for me to review. Can we use something like parentheses to set priority (e.g. 3*(3+4)=21)?

    S: more-info-needed 
    opened by yjqiang 18
  • The :not selector doesn't work as expected.

    the minimal code which can reproduce the bug lists below

    import bs4
    b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>") 
    b.body.a['foo'] = None  # str(b) ->  <html><body><a foo href="http://www.example.com"></a></body></html>
    b.select("a:not([foo])")  # -> [<a foo href="http://www.example.com"></a>]
    

    In this case, the a tag shouldn't be selected.

    T: bug 
    opened by jimages 16
  • Selectors '> tag', '+ tag', and '~ tag'

    Selectors with '>', '+', or '~' symbols at the beginning worked in Beautiful Soup 4.6.x, but in 4.7.x there is no support for such selectors.

    For example, the code below raises a soupsieve.util.SelectorSyntaxError exception.

    from bs4 import BeautifulSoup
    BeautifulSoup('<a>test<b>test2</b></a>').a.select('> b')
    

    Result:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "D:\Programs\Programming\Python-3\lib\site-packages\bs4\element.py", line 1376, in select
        return soupsieve.select(selector, self, namespaces, limit, **kwargs)
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\__init__.py", line 112, in select
        return compile(select, namespaces, flags, **kwargs).select(tag, limit)
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\__init__.py", line 63, in compile
        return cp._cached_css_compile(pattern, namespaces, custom, flags)
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 205, in _cached_css_compile
        CSSParser(pattern, custom=custom_selectors, flags=flags).process_selectors(),
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 1010, in process_selectors
        return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 888, in parse_selectors
        sel, m, has_selector, selectors, relations, is_pseudo, index
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 713, in parse_combinator
        index
    soupsieve.util.SelectorSyntaxError: The combinator '>' at postion 0, must have a selector before it
      line 1:
    > b
    ^
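
    This is invalid CSS (a selector cannot start with a bare combinator), which is why it was closed as wontfix. The standards-compliant equivalent is the :scope pseudo-class, which Soup Sieve supports and which refers to the element select() is called on:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>test<b>test2</b></a>", "html.parser")
# ':scope' refers to the <a> tag that select() is invoked on.
print(soup.a.select(":scope > b"))  # [<b>test2</b>]
```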
    
    S: wontfix 
    opened by unreal666 14
  • XML default namespace leads to TypeError: __init__() keywords must be strings

    This is a bug with handling valid XML namespaces; soupsieve assumes all namespaces have a prefix:

    <prefix:tag xmlns:prefix="...">
    

    but the prefix can be omitted to define a default namespace:

    <tag xmlns="...">
    

    meaning that any element without a prefix: prepended to the tag name is in that namespace. See section 6.2 of the XML namespaces 1.1 spec.

    During parsing, lxml passes in a default namespace under the None key, e.g. {None: "..."}, and unique keys are accumulated in the soup._namespaces dictionary. soupsieve assumes the dictionary only ever has string keys, so an XML document with a default namespace leads to an exception.

    Test case (using BeautifulSoup 4.7 for convenience):

    >>> from bs4 import BeautifulSoup, __version__
    >>> __version__
    '4.7.0'
    >>> sample = b'''\
    ... <?xml version="1.1"?>
    ... <!-- unprefixed element types are from "books" -->
    ... <book xmlns='urn:loc.gov:books'
    ...       xmlns:isbn='urn:ISBN:0-395-36341-6'>
    ...     <title>Cheaper by the Dozen</title>
    ...     <isbn:number>1568491379</isbn:number>
    ... </book>
    ... '''
    >>> soup = BeautifulSoup(sample, 'xml')
    >>> soup._namespaces
    {'xml': 'http://www.w3.org/XML/1998/namespace', None: 'urn:loc.gov:books', 'isbn': 'urn:ISBN:0-395-36341-6'}
    >>> soup.select_one('title')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/bs4/element.py", line 1345, in select_one
        value = self.select(selector, namespaces, 1, **kwargs)
      File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/bs4/element.py", line 1377, in select
        return soupsieve.select(selector, self, namespaces, limit, **kwargs)
      File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/soupsieve/__init__.py", line 108, in select
        return compile(select, namespaces, flags).select(tag, limit)
      File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/soupsieve/__init__.py", line 50, in compile
        namespaces = ct.Namespaces(**(namespaces))
    TypeError: __init__() keywords must be strings
    

    where <title>Cheaper by the Dozen</title> was expected.
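A possible workaround sketch (not official API advice, and the helper name is hypothetical): remap the `None` key that lxml uses for the default namespace to an explicit string prefix before handing the dictionary to Soup Sieve, so that its `Namespaces(**namespaces)` call only ever sees string keys.

```python
# Hypothetical helper: rename lxml's default-namespace key (None) to an
# explicit prefix so every key in the dict is a string, avoiding the
# TypeError from Namespaces(**namespaces).
def remap_default_namespace(namespaces, prefix="default"):
    return {(prefix if key is None else key): uri
            for key, uri in namespaces.items()}

ns = {
    "xml": "http://www.w3.org/XML/1998/namespace",
    None: "urn:loc.gov:books",
    "isbn": "urn:ISBN:0-395-36341-6",
}
print(remap_default_namespace(ns)["default"])  # urn:loc.gov:books
```

You could then select using the prefixed form (e.g. `default|title`) while passing the remapped dictionary as the `namespaces` argument.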

    T: feature C: API S: rejected 
    opened by mjpieters 13
  • Did I make a mistake?

    Did I make a mistake?

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.ptwxz.com/html/0/296/39948.html'
    
    cookie = ""
    
    user_agent = ('Mozilla/5.0 (iPhone; CPU iPhone OS 11_2_6 like'
                  'Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko)'
                  'CriOS/65.0.3325.152 Mobile/15D100 Safari/604.1')
    
    headers = {'User-Agent': user_agent, 'cookie': cookie}
    
    '''
    socks5 = 'socks5://127.0.0.1:10086'  # because this website is blocked by the government of China, I have to use a proxy
    rsp = requests.get(
        url,
        headers=headers,
        proxies={'http': socks5, 'https': socks5})
    '''
    rsp = requests.get(url, headers=headers)
    rsp.encoding = 'gbk'
    text = rsp.text
    
    soups = BeautifulSoup(text, 'html.parser')
    print(soups.prettify())
    print('_____________________')
    
    tag_center = soups.select_one('table[align="center"]')
    list_tag_center_next_siblings = list(tag_center.next_siblings)
    for i in list_tag_center_next_siblings[-4:]:
        print(type(i), i)
    
    tag_head = soups.select_one('head')
    list_tag_head_children = list(tag_head.children)
    for i in list_tag_head_children[-4:]:
        print(type(i), i, '|')
    
    for x, y in zip(reversed(list_tag_center_next_siblings), reversed(list_tag_head_children)):
        assert x is y
    
    
    tags_after_center = soups.select('table[align="center"] ~ *')
    print(tags_after_center)
    
    T: support 
    opened by yjqiang 12
  • Help: is soupsieve case-insensitive?

    Help: is soupsieve case-insensitive?

    In [122]: xml = """<Envelope><Header>...</Header></Envelope>"""
    
    In [123]: s = BeautifulSoup(xml, "xml")
    
    In [124]: s.select("header")
    Out[124]: [<Header>...</Header>]
    
    In [125]: s.select("Header")
    Out[125]: []
    

    Previously, BeautifulSoup accepted (and I think required) case-sensitive tag names in selectors.

    Now that BeautifulSoup uses soupsieve, it seems that only lower-case selectors are supported.

    I'm really not sure why or if I can change this behaviour.
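For context, the intended behavior (later confirmed by the 1.7.3 fix below, which made XHTML case-sensitive like XML) can be sketched roughly like this; this is an illustration of the matching rule, not Soup Sieve's actual code:

```python
# Sketch of the intended rule: XML/XHTML documents compare tag names
# case-sensitively, while HTML type selectors match case-insensitively.
def tag_name_matches(selector_name, tag_name, is_xml):
    if is_xml:
        return selector_name == tag_name
    return selector_name.lower() == tag_name.lower()

print(tag_name_matches("Header", "Header", is_xml=True))   # True
print(tag_name_matches("header", "Header", is_xml=True))   # False
print(tag_name_matches("header", "Header", is_xml=False))  # True
```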

    T: bug S: confirmed 
    opened by dimaqq 11
  • `:has()` is no longer forgiving

    `:has()` is no longer forgiving

    CSS has resolved that :has() should no longer be forgiving, in order to mitigate some jQuery issues. We have never really implemented true forgiveness, only forgiveness of trailing and leading commas and empty entries. We will need to drop such support for :has(). We can deprecate the behavior or just remove it. I have no idea if anyone relies on such behavior.
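The limited forgiveness being dropped amounts to silently tolerating empty slots in the selector list. A rough sketch of that slot-dropping behavior (real parsing must also ignore commas inside functional pseudo-classes, which this deliberately glosses over):

```python
# Sketch of "limited forgiveness": empty entries produced by leading,
# trailing, or consecutive commas are silently dropped from the list.
def split_selector_list(selectors):
    return [s.strip() for s in selectors.split(",") if s.strip()]

print(split_selector_list(", a.cls, , b#id,"))  # ['a.cls', 'b#id']
```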

    C: css-parsing skip-triage T: enhancement 
    opened by facelessuser 0
  • LXML does not currently generate wheels for Python 3.11 on Windows

    LXML does not currently generate wheels for Python 3.11 on Windows

    Due to this, SoupSieve currently ignores any testing on Python 3.11 that requires LXML. In time, once LXML properly generates wheels for Windows, we will once again enable testing of LXML for Windows on Python 3.11.

    Related LXML issue: https://bugs.launchpad.net/lxml/+bug/1977998

    T: maintenance skip-triage 
    opened by facelessuser 0
  • Interesting pseudo class to keep an eye on `:in()`

    Interesting pseudo class to keep an eye on `:in()`

    https://drafts.csswg.org/css-cascade-6/#in-scope-selector

    It would be way too early to expect that this gets implemented officially or that the spec wouldn't change right under us, but something to keep an eye on. It may be fun to play with to see how the code would actually look and how useful it is.

    If I'm feeling adventurous, maybe implement it under something like :--soup-in() for experimental purposes.

    T: feature C: css-custom P: maybe skip-triage 
    opened by facelessuser 8
  • Consider possibly deprecating [attr!=value]

    Consider possibly deprecating [attr!=value]

    There is no rush, but moving forward I think we will shy away from syntax that deviates from the CSS specification. We've started moving custom pseudo-classes over to prefixed names to avoid future conflicts, and it is possible that [attr!=value] could one day acquire a meaning in the CSS spec that differs from what we currently do.

    IIRC this syntax was borrowed from jQuery, but TBH it doesn't really add functionality, as you can do the same with :not([attr=value]).

    T: feature skip-triage 
    opened by facelessuser 0
  • Experimental: Language tag canonicalization

    Experimental: Language tag canonicalization

    There is talk about potentially having the CSS level 4 :lang() pseudo-class canonicalizing tags and ranges to better help in situations such as: :lang(yue, zh-yue, zh-HK). The idea is you could then just do something like: :lang(yue). For best matches, it is recommended to canonicalize both the range used in the pseudo-class and the tag it is comparing. Canonicalization would also output in the extlang form.

    Generally * are ignored in ranges except when at the start: *-yue. Things like en-*-US resolve to en-US, though implicit matching between tags will still match en-xxx-US with en-US.
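The wildcard handling described above can be sketched as follows; this is a simplification for illustration, not Soup Sieve's actual canonicalization code:

```python
# Simplified sketch: keep a leading '*' in a language range, but drop
# interior '*' subtags, so 'en-*-US' -> 'en-US' while '*-yue' is kept.
def strip_nonessential_wildcards(lang_range):
    first, *rest = lang_range.split("-")
    return "-".join([first] + [part for part in rest if part != "*"])

print(strip_nonessential_wildcards("en-*-US"))  # en-US
print(strip_nonessential_wildcards("*-yue"))    # *-yue
```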

    Currently, in this pull, we have canonicalization implemented according to RFC5646, but there are still some questions:

    1. Should we abandon canonicalization, like we are currently doing, when the tag is invalid? Or do we just canonicalize the valid parts and ignore the failing parts?

    2. As mentioned above, ranges can use *, so we strip out non-essential *s and then canonicalize the range. This seems like the only sane approach, but am I misunderstanding something?

    3. It is only suggested that we MAY order variants to improve matching. We decided to go ahead and do this. Should we though? We have also omitted any failures when the required prefixes for a given variant are not found in the tag. This helps ensure that the tag's variant order matches the range's variant order, as the specified range may not explicitly define all required subtags and may rely on implicit matching to grab those. This seems reasonable, but should we abort canonicalization if the prefixes are not found? It is not a MUST requirement in the spec, only a SHOULD.

    Anyways, some things to think about. Technically we could merge this as is and simply disable the canonicalization and it should behave exactly how it did before. We could also enable this functionality under an experimental flag if we wanted. Right now, we are simply waiting to see what is decided for the official level 4 CSS spec.

    C: docs S: work-in-progress C: infrastructure C: tests C: css-matching 
    opened by facelessuser 1
Releases(2.3.2.post1)
  • 2.3.2.post1(Apr 14, 2022)

  • 2.3.2(Apr 6, 2022)

  • 2.3.1(Nov 11, 2021)

  • 2.3(Nov 3, 2021)

    2.3

    • NEW: Officially support Python 3.10.
    • NEW: Add static typing.
    • NEW: :has(), :is(), and :where() now use a forgiving selector list. While not as forgiving as CSS might be, it will forgive such things as empty sets and empty slots due to multiple consecutive commas, leading commas, or trailing commas. Essentially, these pseudo-classes will match all non-empty selectors and ignore empty ones. As the scraping environment is different from a browser environment, it was chosen not to aggressively forgive bad syntax and invalid features, to ensure the user is alerted that their program may not perform as expected.
    • NEW: Add support to output a pretty print format of a compiled SelectorList for debug purposes.
    • FIX: Some small corner cases discovered with static typing.
    Source code(tar.gz)
    Source code(zip)
  • 2.2.1(Mar 19, 2021)

  • 2.2(Feb 9, 2021)

    2.2

    • NEW: :link and :any-link no longer include <link> due to a change in the level 4 selector specification. This actually yields more sane results.
    • FIX: BeautifulSoup, when using find, is quite forgiving of odd types that a user may place in an element's attribute value. Soup Sieve will now also be more forgiving and attempt to match these unexpected values in a sane manner by normalizing them before comparing. (#212)
    Source code(tar.gz)
    Source code(zip)
  • 2.1.0(Dec 10, 2020)

    2.1.0

    • NEW: Officially support Python 3.9.
    • NEW: Drop official support for Python 3.5.
    • NEW: In order to avoid conflicts with future CSS specification changes, non-standard pseudo classes will now start with the :-soup- prefix. As a consequence, :contains() will now be known as :-soup-contains(), though for a time the deprecated form of :contains() will still be allowed with a warning that users should migrate over to :-soup-contains().
    • NEW: Added new non-standard pseudo class :-soup-contains-own() which operates similar to :-soup-contains() except that it only looks at text nodes directly associated with the currently scoped element and not its descendants.
    • FIX: Import bs4 globally instead of in local functions, as it appears there are no adverse effects from circular imports: bs4 does not immediately reference soupsieve functions and soupsieve does not immediately reference bs4 functions. This should give a performance boost to functions that had previously imported bs4 locally.
    Source code(tar.gz)
    Source code(zip)
  • 2.0.1(May 16, 2020)

  • 1.9.6(May 16, 2020)

    1.9.6

    Note: Last version for Python 2.7

    • FIX: Prune dead code.
    • FIX: Corner case with splitting namespace and tag name that have an escaped |.
    Source code(tar.gz)
    Source code(zip)
  • 2.0.0(Feb 23, 2020)

    2.0.0

    • NEW: SelectorSyntaxError is derived from Exception not SyntaxError.
    • NEW: Remove deprecated comments and icomments from the API.
    • NEW: Drop support for EOL Python versions (Python 2 and Python < 3.5).
    • FIX: Corner case with splitting namespace and tag name that have an escaped |.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.5(Nov 2, 2019)

  • 1.9.4(Sep 26, 2019)

    1.9.4

    • FIX: :checked rule was too strict with option elements. The specification for :checked does not require an option element to be under a select element.
    • FIX: Fix level 4 :lang() wildcard match handling with singletons. Implicit wildcard matching should not match any singleton. Explicit wildcard matching (* in the language range: *-US) is allowed to match singletons.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.3(Aug 18, 2019)

    1.9.3

    • FIX: [attr!=value] pattern was mistakenly using :not([attr|=value]) logic instead of :not([attr=value]).
    • FIX: Remove undocumented _QUIRKS mode flag. Beautiful Soup was meant to use it to help with transition to Soup Sieve, but never released with it. Help with transition at this point is no longer needed.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.2(Jun 23, 2019)

    1.9.2

    • FIX: Shortcut last descendant calculation if possible for performance.
    • FIX: Fix issue where Doctype strings can be mistaken for a normal text node in some cases.
    • FIX: A top level tag is not a :root tag if it has sibling text nodes or tag nodes. This is an issue that mostly manifests when using html.parser as the parser will allow multiple root nodes.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.1(Apr 13, 2019)

    1.9.1

    • FIX: :root, :contains(), :default, :indeterminate, :lang(), and :dir() will properly account for HTML iframe elements in their logic when selecting or matching an element. Their logic will be restricted to the document for which the element under consideration applies.
    • FIX: HTML pseudo-classes will check that all key elements checked are in the XHTML namespace (HTML parsers that do not provide namespaces will assume the XHTML namespace).
    • FIX: Ensure that all pseudo-class names are case insensitive and allow CSS escapes.
    Source code(tar.gz)
    Source code(zip)
  • 1.9.0(Mar 26, 2019)

    1.9.0

    • NEW: Allow :contains() to accept a list of text to search for. (#115)
    • NEW: Add new escape function for escaping CSS identifiers. (#125)
    • NEW: Deprecate comments and icomments functions in the API to ensure Soup Sieve focuses only on CSS selectors. comments and icomments will most likely be removed in 2.0. (#130)
    • NEW: Add Python 3.8 support. (#133)
    • FIX: Don't install test files when installing the soupsieve package. (#111)
    • FIX: Improve efficiency of :contains() comparison.
    • FIX: Null characters should translate to the Unicode REPLACEMENT CHARACTER (U+FFFD) according to the specification. This applies to CSS escaped NULL characters as well. (#124)
    • FIX: Escaped EOF should translate to U+FFFD outside of CSS strings. In a string, they should just be ignored, but as there is no case where we could resolve such a string and still have a valid selector, string handling remains the same. (#128)
    Source code(tar.gz)
    Source code(zip)
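The NULL-replacement rule from the 1.9.0 fix above is simple to state in code; this is a sketch of the spec-mandated behavior, not the library's internals:

```python
# Per the CSS Syntax spec's input preprocessing, U+0000 (NULL) in the
# input is replaced with U+FFFD (REPLACEMENT CHARACTER) before parsing.
def preprocess(selector):
    return selector.replace("\x00", "\ufffd")

print(preprocess("a\x00b") == "a\ufffdb")  # True
```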
  • 1.8.0(Feb 17, 2019)

    1.8.0

    • NEW: Add custom selector support. (#92)(#108)
    • FIX: Small tweak to CSS identifier pattern to ensure it matches the CSS specification exactly. Specifically, you can't have an identifier of only -. (#107)
    • FIX: CSS string patterns should allow escaping newlines to span strings across multiple lines. (#107)
    • FIX: Newline regular expression for CSS newlines should treat \r\n as a single character, especially in cases such as string escapes: \\\r\n. (#107)
    • FIX: Allow -- as a valid identifier or identifier start. (#107)
    • FIX: Bad CSS syntax now raises a SelectorSyntaxError, which is still currently derived from SyntaxError, but will most likely be derived from Exception in the future.
    Source code(tar.gz)
    Source code(zip)
  • 1.7.3(Jan 23, 2019)

    1.7.3

    • FIX: Fix regression with tag names in regards to case sensitivity, and ensure there are tests to prevent breakage in the future.
    • FIX: XHTML should always be case sensitive like XML.
    Source code(tar.gz)
    Source code(zip)
  • 1.7.2(Jan 18, 2019)

    1.7.2

    • FIX: Fix HTML detection for type selector.
    • FIX: Fixes for :enabled and :disabled.
    • FIX: Provide a way for Beautiful Soup to parse selectors in a quirks mode to mimic some of the quirks of the old select method prior to Soup Sieve, but with warnings. This is to help old scripts to not break during the transitional period with newest Beautiful Soup. In the future, these quirks will raise an exception as Soup Sieve requires selectors to follow the CSS specification.
    Source code(tar.gz)
    Source code(zip)
  • 1.7.1(Jan 13, 2019)

    1.7.1

    • FIX: Fix issue with :has() selector where a leading combinator can only be provided in the first selector in a relative selector list.
    Source code(tar.gz)
    Source code(zip)
  • 1.7.0(Jan 10, 2019)

    1.7.0

    • NEW: Add support for :in-range and :out-of-range selectors. (#60)
    • NEW: Add support for :defined selector. (#76)
    • FIX: Fix pickling issue when compiled selector contains a NullSelector object. (#70)
    • FIX: Better exception messages in the CSS selector parser and fix a position reporting issue that can occur in some exceptions. (#72, #73)
    • FIX: Don't compare prefixes when evaluating attribute namespaces, compare the actual namespace. (#75)
    • FIX: Split whitespace attribute lists by all whitespace characters, not just space.
    • FIX: :nth-* patterns were converting numbers to base 16 when they should have been converting to base 10.
    Source code(tar.gz)
    Source code(zip)
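The whitespace-splitting fix in 1.7.0 above boils down to using Python's argument-less `str.split()`, which splits on any whitespace character rather than just spaces:

```python
# A whitespace-separated attribute list may use tabs, newlines, or form
# feeds, not just spaces; split() with no argument handles all of them.
value = "foo\tbar\nbaz\fqux quux"
print(value.split(' '))  # wrong: non-space whitespace left embedded
print(value.split())     # right: ['foo', 'bar', 'baz', 'qux', 'quux']
```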
  • 1.6.2(Jan 4, 2019)

    1.6.2

    • FIX: Fix pattern compile issues on Python < 2.7.4.
    • FIX: Don't use \d in Unicode Re patterns as they will contain characters outside the range of [0-9].
    Source code(tar.gz)
    Source code(zip)
  • 1.6.1(Jan 2, 2019)

  • 1.6.0(Dec 31, 2018)

  • 1.5.0(Dec 28, 2018)

    1.5.0

    • NEW: Add select_one method like Beautiful Soup has.
    • NEW: Add :dir() selector (HTML only).
    • FIX: Fix handling issues of HTML fragments (elements without a BeautifulSoup object as a parent).
    • FIX: Fix internal nth range check.
    Source code(tar.gz)
    Source code(zip)
  • 1.4.0(Dec 27, 2018)

    1.4.0

    • NEW: Throw NotImplementedError for at-rules: @page, etc.
    • NEW: Match nothing for :host, :host(), and :host-context().
    • NEW: Add support for :read-write and :read-only.
    • NEW: Selector patterns can be annotated with CSS comments.
    • FIX: \r, \n, and \f cannot be escaped with \ in CSS. You must use Unicode escapes.
    Source code(tar.gz)
    Source code(zip)
  • 1.3.1(Dec 24, 2018)

  • 1.3.0(Dec 22, 2018)

    1.3.0

    • NEW: Add support for :scope.
    • NEW: :user-invalid, :playing, :paused, and :local-link will not cause a failure, but all will match nothing as their use cases are not possible in an environment outside a web browser.
    • FIX: Fix [attr~=value] handling of whitespace. According to the spec, if the value contains whitespace, or is an empty string, it should not match anything.
    • FIX: Precompile internal patterns for pseudo-classes to prevent having to parse them again.
    Source code(tar.gz)
    Source code(zip)
  • 1.2.1(Dec 20, 2018)

    1.2.1

    • FIX: More descriptive exceptions. Exceptions will also now mention position in the pattern that is problematic.
    • FIX: filter ignores NavigableString objects in normal iterables and Tag iterables. Basically, it filters all Beautiful Soup document parts regardless of iterable type, whereas it used to only filter out a NavigableString in a Tag object. This is viewed as fixing an inconsistency.
    • FIX: DEBUG flag has been added to help with debugging CSS selector parsing. This is mainly for development.
    • FIX: If forced to search for language in meta tag, and no language is found, cache that there is no language in the meta tag to prevent searching again during the current select.
    • FIX: If a non BeautifulSoup/Tag object is given to the API to compare against, raise a TypeError.
    Source code(tar.gz)
    Source code(zip)
  • 1.2.0(Dec 19, 2018)
