Soccerdata - Efficiently scrape soccer data from various sources

Overview

SoccerData

Documentation: https://soccerdata.readthedocs.io/

SoccerData is a collection of wrappers over soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, SoFIFA and WhoScored. You get Pandas DataFrames with sensible, matching column names and identifiers across datasets. Data is downloaded when needed and cached locally.

import soccerdata as sd

# Create scraper class instance for the Premier League
five38 = sd.FiveThirtyEight('ENG-Premier League', '1819')

# Fetch dataframes
games = five38.read_games()

To learn how to install, configure and use SoccerData, see the Quickstart guide. For documentation on each of the supported data sources, see the API reference.

Disclaimer: As soccerdata relies on web scraping, any change to the scraped websites will break the package, so do not expect all code to work all the time. If you spot a bug, please fork the repository and open a pull request.

Comments
  • [FBref] 403 error when downloading data

    [FBref] 403 error when downloading data

    Which Python version are you using?

    Python 3.8.13

    Which version of soccerdata are you using?

    1.0.1

    What did you do?

    fbref = sd.FBref(leagues="NED-Eredivisie", seasons="2021-2022", proxy='tor')
    team_season_stats = fbref.read_schedule()

    What did you expect to see?

    Downloaded team stats

    What did you see instead?

    requests.exceptions.HTTPError: 403
    Client Error: Forbidden for url:
    https://fbref.com/en/comps/

    opened by koenklomps 9
  • [General] Selenium fails with SOCKS proxy (for tor) with `WebDriverException: Message: unknown error: net::ERR_PROXY_CONNECTION_FAILED`

    [General] Selenium fails with SOCKS proxy (for tor) with `WebDriverException: Message: unknown error: net::ERR_PROXY_CONNECTION_FAILED`

    • Which Python version are you using?
    • Which version of soccerdata are you using?
    import soccerdata as sd
    import sys
    print(sd.__version__)
    print(sys.version)
    
    0.0.2
    3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
    
    • What did you do?
    • What did you expect to see?
    • What did you see instead?

    I tried to set use_tor=True for downloading events for a match with tor running in the background, but read_events ended with an error indicating that the proxy connection failed.

    ws = sd.WhoScored(leagues="ENG-Premier League", seasons="20-21", use_tor=True)
    events = ws.read_events(match_id=1485185)
    
    [03/19/22 09:54:01] INFO     Saving cached data to C:\Users\antho\soccerdata\data\WhoScored           _common.py:59
    [03/19/22 09:54:04] INFO     Retrieving game schedule of ENG-Premier League - 2021 from the cache     whoscored.py:314
                        INFO     [2/1] Retrieving game with id=1485185                                    whoscored.py:499
                        INFO     Scraping https://www.whoscored.com/Matches/1485185/Live                  whoscored.py:577
    ---------------------------------------------------------------------------
    WebDriverException                        Traceback (most recent call last)
    ~\AppData\Local\Temp\ipykernel_27592\4024154899.py in <module>
          1 ws = sd.WhoScored(leagues="ENG-Premier League", seasons="20-21", use_tor=True, path_to_browser="c:/users/antho/downloads/chromedriver.exe")
    ----> 2 events = ws.read_events(match_id=1485185)
    
    ~\anaconda3\envs\soccerdata\lib\site-packages\soccerdata\whoscored.py in read_events(self, match_id, force_cache, live)
        507                 filepath,
        508                 var="requirejs.s.contexts._.config.config.params.args.matchCentreData",
    --> 509                 no_cache=live,
        510             )
        511             json_data = json.load(reader)
    
    ~\anaconda3\envs\soccerdata\lib\site-packages\soccerdata\whoscored.py in _download_and_save(self, url, filepath, max_age, no_cache, var)
        576         if cache_invalid or filepath is None or not filepath.exists():
        577             logger.info("Scraping %s", url)
    --> 578             self.driver.get(url)
        579             time.sleep(5 + random.random() * 5)
        580             if "Incapsula incident ID" in self.driver.page_source:
    
    ~\anaconda3\envs\soccerdata\lib\site-packages\undetected_chromedriver\__init__.py in get_wrapped(*args, **kwargs)
        495                     },
        496                 )
    --> 497             return orig_get(*args, **kwargs)
        498 
        499         self.get = get_wrapped
    
    ~\anaconda3\envs\soccerdata\lib\site-packages\undetected_chromedriver\__init__.py in get(self, url)
        533         if self._get_cdc_props():
        534             self._hook_remove_cdc_props()
    --> 535         return super().get(url)
        536 
        537     def add_cdp_listener(self, event_name, callback):
    
    ~\anaconda3\envs\soccerdata\lib\site-packages\selenium\webdriver\remote\webdriver.py in get(self, url)
        435         Loads a web page in the current browser session.
        436         """
    --> 437         self.execute(Command.GET, {'url': url})
        438 
        439     @property
    
    ~\anaconda3\envs\soccerdata\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
        423         response = self.command_executor.execute(driver_command, params)
        424         if response:
    --> 425             self.error_handler.check_response(response)
        426             response['value'] = self._unwrap_value(
        427                 response.get('value', None))
    
    ~\anaconda3\envs\soccerdata\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
        245                 alert_text = value['alert'].get('text')
        246             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
    --> 247         raise exception_class(message, screen, stacktrace)
        248 
        249     def _value_or_default(self, obj: Mapping[_KT, _VT], key: _KT, default: _VT) -> _VT:
    
    WebDriverException: Message: unknown error: net::ERR_PROXY_CONNECTION_FAILED
      (Session info: headless chrome=99.0.4844.74)
    Stacktrace:
    Backtrace:
    	Ordinal0 [0x00509943+2595139]
    	Ordinal0 [0x0049C9F1+2148849]
    	Ordinal0 [0x00394528+1066280]
    	Ordinal0 [0x00390DB4+1052084]
    	Ordinal0 [0x003863BD+1008573]
    	Ordinal0 [0x00386F7C+1011580]
    	Ordinal0 [0x003865CA+1009098]
    	Ordinal0 [0x00385BC6+1006534]
    	Ordinal0 [0x00384AD0+1002192]
    	Ordinal0 [0x00384FAD+1003437]
    	Ordinal0 [0x00395C4A+1072202]
    	Ordinal0 [0x003EC19D+1425821]
    	Ordinal0 [0x003DB9EC+1358316]
    	Ordinal0 [0x003EBAF2+1424114]
    	Ordinal0 [0x003DB806+1357830]
    	Ordinal0 [0x003B6086+1204358]
    	Ordinal0 [0x003B6F96+1208214]
    	GetHandleVerifier [0x006AB232+1658114]
    	GetHandleVerifier [0x0076312C+2411516]
    	GetHandleVerifier [0x0059F261+560433]
    	GetHandleVerifier [0x0059E366+556598]
    	Ordinal0 [0x004A286B+2173035]
    	Ordinal0 [0x004A75F8+2192888]
    	Ordinal0 [0x004A76E5+2193125]
    	Ordinal0 [0x004B11FC+2232828]
    	BaseThreadInitThunk [0x76106739+25]
    	RtlGetFullPathName_UEx [0x76FF8E7F+1215]
    	RtlGetFullPathName_UEx [0x76FF8E4D+1165]
    

    Here's what my terminal looks like with tor running (prior to calling read_events()):

    [email protected]:/c/Users/antho/soccerdata$ tor
    
    Mar 19 09:53:33.865 [notice] Tor 0.4.2.7 running on Linux with Libevent 2.1.11-stable, OpenSSL 1.1.1f, Zlib 1.2.11, Liblzma 5.2.4, and Libzstd 1.4.4.
    Mar 19 09:53:33.865 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
    Mar 19 09:53:33.865 [notice] Read configuration file "/etc/tor/torrc".
    Mar 19 09:53:33.866 [notice] Opening Socks listener on 127.0.0.1:9050
    Mar 19 09:53:33.866 [notice] Opened Socks listener on 127.0.0.1:9050
    Mar 19 09:53:33.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
    Mar 19 09:53:33.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
    Mar 19 09:53:34.000 [notice] Bootstrapped 0% (starting): Starting
    Mar 19 09:53:34.000 [notice] Starting with guard context "default"
    Mar 19 09:53:35.000 [notice] Bootstrapped 5% (conn): Connecting to a relay
    Mar 19 09:53:35.000 [notice] Bootstrapped 10% (conn_done): Connected to a relay
    Mar 19 09:53:35.000 [notice] Bootstrapped 14% (handshake): Handshaking with a relay
    Mar 19 09:53:35.000 [notice] Bootstrapped 15% (handshake_done): Handshake with a relay done
    Mar 19 09:53:35.000 [notice] Bootstrapped 75% (enough_dirinfo): Loaded enough directory info to build circuits        
    Mar 19 09:53:35.000 [notice] Bootstrapped 90% (ap_handshake_done): Handshake finished with a relay to build circuits  
    Mar 19 09:53:35.000 [notice] Bootstrapped 95% (circuit_create): Establishing a Tor circuit
    Mar 19 09:53:36.000 [notice] Bootstrapped 100% (done): Done
    

    I've opened my browser to the port to verify that something is running, although this is using an HTTP proxy, so the warning here is expected.


    opened by tonyelhabr 7
  • [FBref] Unable to scrape Men's World Cup stats

    [FBref] Unable to scrape Men's World Cup stats

    Hi @probberechts - this looks like a wonderful set of tools. Can't wait to get stuck deeper into it. Thank you!

    Objective: To be able to scrape FBRef stats for historic World Cups (and upcoming 2022 World Cup) from this page

    World Cup stats landing page -> https://fbref.com/en/comps/1/World-Cup-Stats
    Stats page for 2018 World Cup -> https://fbref.com/en/comps/1/2018/2018-FIFA-World-Cup-Stats

    1. Adding a new league - Working as expected

    Following the "Adding additional leagues" section of the documentation (https://soccerdata.readthedocs.io/en/latest/usage.html), I successfully added a new league called "INTL-WorldCup".

    Content of league_dict.json

    {
      "INTL-WorldCup": {
        "FBref": "World-Cup-Stats",
        "season_start": "Aug",
        "season_end": "May"
      }
    }
    

    Note: I had to remove a comma from just after the 2nd last curly bracket.
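Strict JSON forbids trailing commas, which is why that comma had to go; Python's json module (which presumably reads league_dict.json) rejects them outright. A minimal stdlib check:

```python
import json

# Same content as the league_dict.json above, with and without the stray comma.
valid = '{"INTL-WorldCup": {"FBref": "World-Cup-Stats", "season_start": "Aug", "season_end": "May"}}'
invalid = '{"INTL-WorldCup": {"FBref": "World-Cup-Stats",}}'  # trailing comma

config = json.loads(valid)  # parses fine
print(config["INTL-WorldCup"]["FBref"])

try:
    json.loads(invalid)
except json.JSONDecodeError as err:
    print("rejected:", err.msg)
```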

    Result: When I call sd.FBref.available_leagues(), it returns the expected result below:

    [
      'Big 5 European Leagues Combined',
      'ENG-Premier League',
      'ESP-La Liga',
      'FRA-Ligue 1',
      'GER-Bundesliga',
      'INTL-WorldCup',
      'ITA-Serie A'
    ]
    

    2. Can I pull back scraped data?

    This line ran without error: fbref = sd.FBref(leagues="INTL-WorldCup", seasons=2018)

    However, when I ran the 2 lines below

    team_season_stats = fbref.read_team_season_stats(stat_type="standard")
    team_season_stats.head()
    

    ...I got the error below. What am I doing wrong?


    ValueError                                Traceback (most recent call last)
    ~\AppData\Local\Temp/ipykernel_11984/128004415.py in <module>
    ----> 1 team_season_stats = fbref.read_team_season_stats(stat_type="standard")
          2 team_season_stats.head()
    
    soccerdata\fbref.py in read_team_season_stats(self, stat_type, opponent_stats)
        252 
        253         # get league IDs
    --> 254         seasons = self.read_seasons()
        255 
        256         # collect teams
    
    soccerdata\fbref.py in read_seasons(self)
        169             seasons.append(df_table)
        170 
    --> 171         df = pd.concat(seasons).pipe(standardize_colnames)
        172         # A competition name field is not inlcuded in the Big 5 European Leagues Combined
        173         if "competition_name" in df.columns:
    
    ~\Miniconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
        309                     stacklevel=stacklevel,
        310                 )
    --> 311             return func(*args, **kwargs)
        312 
        313         return wrapper
    
    ~\Miniconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
        302         verify_integrity=verify_integrity,
        303         copy=copy,
    --> 304         sort=sort,
        305     )
        306 
    
    ~\Miniconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
        349 
        350         if len(objs) == 0:
    --> 351             raise ValueError("No objects to concatenate")
        352 
        353         if keys is None:
    
    ValueError: No objects to concatenate
    
    enhancement 
    opened by philbywalsh 6
  • [FBref] Team against stats

    [FBref] Team against stats

    Hi! First of all, thank you for the code!

    I'd like to ask if it would be possible to also get stats from the "Opponent Stats" table. Thanks!

    enhancement 
    opened by RobiFera 5
  • Faster scraping of player season stats - FBref

    Faster scraping of player season stats - FBref

    I am having another go at this [previous attempt #69] because you have updated the FBref class to use the league pages. I have tried this against all stat_types for 2020-2021 and it seems to work.

    • Amended the FBRef scraper so it uses the Big 5 pages if all five leagues are requested.
    • Added some type checks for the stats_type argument.

    I am not able to run the tests locally, but I'll try to fix anything that doesn't work after.

    opened by andrewRowlinson 3
  • [General] Tor port is not up to date

    [General] Tor port is not up to date

    (seen on Windows 11)

    The Tor port specified in the code when initializing a scraper with the proxy="tor" option is 9050, whereas newer Tor versions seem to use port 9150.

    Something should be done to check whether port 9050 works in the first place and, if it doesn't, fall back to 9150.

    For anyone it doesn't work for right now, you can always do this:

    return_proxies = lambda: {
        "http": "socks5://127.0.0.1:9150",
        "https": "socks5://127.0.0.1:9150",
    }
    
    ws = sd.WhoScored(leagues="ENG-Premier League", seasons="20-21", proxy=return_proxies)
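The port check suggested above could look like the sketch below. Note that first_open_port is a hypothetical helper, not part of soccerdata, and the demo probes a listener it opens itself so it runs anywhere:

```python
import socket

def first_open_port(host="127.0.0.1", candidates=(9050, 9150)):
    """Return the first candidate port accepting TCP connections, or None."""
    for port in candidates:
        try:
            with socket.create_connection((host, port), timeout=1):
                return port
        except OSError:
            continue
    return None

# Self-contained demo: open a local listener and confirm it is detected.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # the OS picks a free port
server.listen(1)
port = server.getsockname()[1]
print(first_open_port(candidates=(port,)) == port)  # True
server.close()
```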
    
    documentation 
    opened by david-leconte 3
  • [FBref] Can't fetch schedule data

    [FBref] Can't fetch schedule data

    if you run:

    import soccerdata as sd
    fbref = sd.FBref(leagues="ENG-Premier League", seasons=2021)
    print(fbref.__doc__)
    
    epl_schedule = fbref.read_schedule()
    

    You will get an error

    frame.py 3832 _set_item value = self._sanitize_column(value)

    frame.py 4535 _sanitize_column com.require_length_match(value, self.index)

    common.py 557 require_length_match raise ValueError(

    ValueError: Length of values (0) does not match length of index (31)

    opened by BelkacemB 3
  • Update dependency Sphinx to v5

    Update dependency Sphinx to v5

    Mend Renovate

    This PR contains the following updates:

    | Package | Change | Age | Adoption | Passing | Confidence |
    |---|---|---|---|---|---|
    | Sphinx (source) | ^4.3.2 -> ^5.0.0 | age | adoption | passing | confidence |
    | sphinx (source) | ==4.5.0 -> ==5.0.2 | age | adoption | passing | confidence |


    Release Notes

    sphinx-doc/sphinx

    v5.0.2

    Compare Source

    =====================================

    Features added

    • #​10523: HTML Theme: Expose the Docutils's version info tuple as a template variable, docutils_version_info. Patch by Adam Turner.

    Bugs fixed

    • #​10538: autodoc: Inherited class attribute having docstring is documented even if :confval:autodoc_inherit_docstring is disabled
    • #​10509: autosummary: autosummary fails with a shared library
    • #​10497: py domain: Failed to resolve strings in Literal. Patch by Adam Turner.
    • #​10523: HTML Theme: Fix double brackets on citation references in Docutils 0.18+. Patch by Adam Turner.
    • #​10534: Missing CSS for nav.contents in Docutils 0.18+. Patch by Adam Turner.

    v5.0.1

    Compare Source

    =====================================

    Bugs fixed

    • #​10498: gettext: TypeError is raised when sorting warning messages if a node has no line number. Patch by Adam Turner.
    • #​10493: HTML Theme: :rst:dir:topic directive is rendered incorrectly with Docutils 0.18. Patch by Adam Turner.
    • #​10495: IndexError is raised for a :rst:role:kbd role having a separator. Patch by Adam Turner.

    v5.0.0

    Compare Source

    =====================================

    Dependencies

    5.0.0 b1

    • #​10164: Support Docutils 0.18_. Patch by Adam Turner.

    .. _Docutils 0.18: https://docutils.sourceforge.io/RELEASE-NOTES.html#release-0-18-2021-10-26

    Incompatible changes

    5.0.0 b1

    • #​10031: autosummary: sphinx.ext.autosummary.import_by_name() now raises ImportExceptionGroup instead of ImportError when it failed to import target object. Please handle the exception if your extension uses the function to import Python object. As a workaround, you can disable the behavior via grouped_exception=False keyword argument until v7.0.
    • #​9962: texinfo: Customizing styles of emphasized text via @definfoenclose command was not supported because the command was deprecated since texinfo 6.8
    • #​2068: :confval:intersphinx_disabled_reftypes has changed default value from an empty list to ['std:doc'] as avoid too surprising silent intersphinx resolutions. To migrate: either add an explicit inventory name to the references intersphinx should resolve, or explicitly set the value of this configuration variable to an empty list.
    • #​10197: html theme: Reduce body_min_width setting in basic theme to 360px
    • #​9999: LaTeX: separate terms from their definitions by a CR (refs: #​9985)
    • #​10062: Change the default language to 'en' if any language is not set in conf.py

    5.0.0 final

    • #​10474: :confval:language does not accept None as it value. The default value of language becomes to 'en' now. Patch by Adam Turner and Takeshi KOMIYA.

    Deprecated

    5.0.0 b1

    • #​10028: jQuery and underscore.js will no longer be automatically injected into themes from Sphinx 6.0. If you develop a theme or extension that uses the jQuery, $, or $u global objects, you need to update your JavaScript or use the mitigation below.

      To re-add jQuery and underscore.js, you will need to copy jquery.js and underscore.js from the Sphinx repository_ to your static directory, and add the following to your layout.html:

      .. _the Sphinx repository: https://github.com/sphinx-doc/sphinx/tree/v4.3.2/sphinx/themes/basic/static

      .. code-block:: html+jinja

         {%- block scripts %}
             <script src="{{ pathto('_static/jquery.js', resource=True) }}"></script>
             <script src="{{ pathto('_static/underscore.js', resource=True) }}"></script>
             {{ super() }}
         {%- endblock %}

      Patch by Adam Turner.

    • setuptools integration. The build_sphinx sub-command for setup.py is marked as deprecated to follow the policy of setuptools team.

    • The locale argument of sphinx.util.i18n:babel_format_date() becomes required

    • The language argument of sphinx.util.i18n:format_date() becomes required

    • sphinx.builders.html.html5_ready

    • sphinx.io.read_doc()

    • sphinx.util.docutils.__version_info__

    • sphinx.util.docutils.is_html5_writer_available()

    • sphinx.writers.latex.LaTeXWriter.docclasses

    Features added

    5.0.0 b1

    • #​9075: autodoc: The default value of :confval:autodoc_typehints_format is changed to 'smart'. It will suppress the leading module names of typehints (ex. io.StringIO -> StringIO).
    • #​8417: autodoc: :inherited-members: option now takes multiple classes. It allows to suppress inherited members of several classes on the module at once by specifying the option to :rst:dir:automodule directive
    • #​9792: autodoc: Add new option for autodoc_typehints_description_target to include undocumented return values but not undocumented parameters.
    • #​10285: autodoc: singledispatch functions having typehints are not documented
    • autodoc: :confval:autodoc_typehints_format now also applies to attributes, data, properties, and type variable bounds.
    • #​10258: autosummary: Recognize a documented attribute of a module as non-imported
    • #​10028: Removed internal usages of JavaScript frameworks (jQuery and underscore.js) and modernised doctools.js and searchtools.js to ECMAScript 2018. Patch by Adam Turner.
    • #​10302: C++, add support for conditional expressions (?:).
    • #​5157, #​10251: Inline code is able to be highlighted via :rst:dir:role directive
    • #​10337: Make sphinx-build faster by caching Publisher object during build. Patch by Adam Turner.

    Bugs fixed

    5.0.0 b1

    • #​10200: apidoc: Duplicated submodules are shown for modules having both .pyx and .so files. Patch by Adam Turner and Takeshi KOMIYA.
    • #​10279: autodoc: Default values for keyword only arguments in overloaded functions are rendered as a string literal
    • #​10280: autodoc: :confval:autodoc_docstring_signature unexpectedly generates return value typehint for constructors if docstring has multiple signatures
    • #​10266: autodoc: :confval:autodoc_preserve_defaults does not work for mixture of keyword only arguments with/without defaults
    • #​10310: autodoc: class methods are not documented when decorated with mocked function
    • #​10305: autodoc: Failed to extract optional forward-ref'ed typehints correctly via :confval:autodoc_type_aliases
    • #​10421: autodoc: :confval:autodoc_preserve_defaults doesn't work on class methods
    • #​10214: html: invalid language tag was generated if :confval:language contains a country code (ex. zh_CN)
    • #​9974: html: Updated jQuery version from 3.5.1 to 3.6.0
    • #​10236: html search: objects are duplicated in search result
    • #​9962: texinfo: Deprecation message for @definfoenclose command on bulding texinfo document
    • #​10000: LaTeX: glossary terms with common definition are rendered with too much vertical whitespace
    • #​10188: LaTeX: alternating multiply referred footnotes produce a ? in pdf output
    • #​10363: LaTeX: make 'howto' title page rule use \linewidth for compatibility with usage of a twocolumn class option
    • #​10318: :prepend: option of :rst:dir:literalinclude directive does not work with :dedent: option

    5.0.0 final

    • #​9575: autodoc: The annotation of return value should not be shown when autodoc_typehints="description"
    • #​9648: autodoc: *args and **kwargs entries are duplicated when autodoc_typehints="description"
    • #​8180: autodoc: Docstring metadata ignored for attributes
    • #​10443: epub: EPUB builder can't detect the mimetype of .webp file
    • #​10104: gettext: Duplicated locations are shown if 3rd party extension does not provide correct information
    • #​10456: py domain: :meta: fields are displayed if docstring contains two or more meta-field
    • #​9096: sphinx-build: the value of progress bar for parallel build is wrong
    • #​10110: sphinx-build: exit code is not changed when error is raised on builder-finished event

    Configuration

    📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

    🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

    Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

    🔕 Ignore: Close this PR and you won't be reminded about these updates again.


    • [ ] If you want to rebase/retry this PR, click this checkbox.

    This PR has been generated by Mend Renovate. View repository job log here.

    opened by renovate[bot] 3
  • [FBref] Seasons parameter does not work with read_player_season_stats method

    [FBref] Seasons parameter does not work with read_player_season_stats method

    The FBref read_player_season_stats method does not seem to consider the seasons parameter: no matter which season is passed, it just retrieves the most recent season's player stats (2021-2022).

    Code used is mentioned below

    import soccerdata as sd
    
    fbref = sd.FBref(no_cache=False, no_store=False, leagues="ENG-Premier League", seasons='11-12')
    pl_player_season_stats = fbref.read_player_season_stats(stat_type='standard')
    pl_player_season_stats.head()
    

    Seasons parameter works perfectly with the read_team_season_stats method though.
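A quick way to see which seasons a returned frame actually covers is to inspect the season level of its index. The frame below is a stand-in with the league/season/team/player index shape that read_player_season_stats returns, so the names and values are illustrative:

```python
import pandas as pd

# Stand-in for the frame read_player_season_stats() returns (assumed index names).
idx = pd.MultiIndex.from_tuples(
    [("ENG-Premier League", "1112", "Arsenal", "R. van Persie"),
     ("ENG-Premier League", "2122", "Liverpool", "M. Salah")],
    names=["league", "season", "team", "player"],
)
stats = pd.DataFrame({"goals": [30, 23]}, index=idx)

# Which seasons did we actually get back?
print(stats.index.get_level_values("season").unique().tolist())  # ['1112', '2122']
```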

    bug 
    opened by Suwadith 3
  • FBRef pulling back "out of date" statistics

    FBRef pulling back "out of date" statistics

    When using the API to scrape from FBRef I noticed something odd.

    The APIs for each category of stats (e.g. shooting, passing) don't seem to be working off the same baseline of minutes played.

    For example: take Emiliano Martínez (Argentina). The "90s" value should be consistent across each of these categories - but across the 5 categories below it varies extensively.

    fbref.read_player_season_stats(stat_type="standard") = 2.0
    fbref.read_player_season_stats(stat_type="shooting") = 2.0
    fbref.read_player_season_stats(stat_type="goal_shot_creation") = 3.0
    fbref.read_player_season_stats(stat_type="passing") = 3.0
    fbref.read_player_season_stats(stat_type="defense") = 5.3

    Note: This didn't seem to be an issue earlier in the tournament. Also, I only recently scraped "defense" statistics for the first time, so I wonder if somehow the results of old queries are being cached?

    Also, when I navigate to the specific pages in my browser, the data looks consistent and fully up to date (i.e. a value of 6.3 for the "90s" attribute for Emiliano Martínez):

    https://fbref.com/en/comps/1/passing/World-Cup-Stats
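If stale cached responses are the suspect, deleting the relevant cache files before re-querying rules that out. A stdlib sketch over a throwaway directory standing in for the real cache path (the actual location and file layout may differ):

```python
import tempfile
from pathlib import Path

# Stand-in for the real cache, e.g. a soccerdata/data/FBref directory (assumed layout).
cache_dir = Path(tempfile.mkdtemp()) / "FBref"
cache_dir.mkdir()
(cache_dir / "defense_stats.html").write_text("<stale>")

# Remove cached files so the next read_player_season_stats() call re-downloads.
for f in cache_dir.glob("*"):
    f.unlink()
print(list(cache_dir.iterdir()))  # []
```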

    opened by philbywalsh 2
  • [FBref] NaNs found in 'standard' and 'playing_time' stat_types

    [FBref] NaNs found in 'standard' and 'playing_time' stat_types

    Hello,

    I have found a small bug when pulling data from FBref.com: NaN values appear in the MP columns of the stat_types standard and playing_time for players who have played in the season.

    I found this problem after I wrote a function to obtain multiple stat_types for multiple seasons and converted the DataFrames from a multiindex to a standard pandas DataFrame. I found a large quantity of NaNs due to this transformation. 

    To troubleshoot, I did a single pull using the .read_player_season_stats(stat_type = 'standard') call on 2 seasons of data (1718 & 1819) and found NaN values in both the MP and Playing Time MP columns. Both players who played and players who did not play received NaN values in the aforementioned columns. Under the "Playing Time" section's MP column I found 890 NaN values, and in the standalone 'MP' column I found 380 NaN values. I am transitioning from R to Python and have always used the flattened-style DataFrame in the past.

    Attached is a csv file containing the aforementioned data.

    Call:

    fbref_test = sd.FBref(leagues=['ENG-Premier League'], seasons= ['1718', '1819'])
    
    hold = fbref_test.read_player_season_stats(stat_type = 'standard')
    hold.head()
    

    I greatly appreciate your assistance. fbref_nan_bug_df.csv
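For reference, the flattening step that surfaced the NaNs can be reproduced on a toy frame with FBref-style two-level columns (the column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking FBref's two-level columns, e.g. ("Playing Time", "MP").
df = pd.DataFrame(
    {("Playing Time", "MP"): [38, np.nan], ("Performance", "Gls"): [23, 7]},
)

# Flatten the MultiIndex columns into single strings, then count NaNs.
df.columns = ["_".join(level for level in col if level) for col in df.columns]
print(df.columns.tolist())                       # ['Playing Time_MP', 'Performance_Gls']
print(int(df["Playing Time_MP"].isna().sum()))   # 1
```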

    opened by spartanovo 2
  • [WhoScored] Issue in scraping if game has no goals

    [WhoScored] Issue in scraping if game has no goals

    Hello everyone,

    If you try to scrape a game with a 0-0 final score from WhoScored using the package, a KeyError occurs: at line 676 of whoscored.py, the code tries to access the feature 'is_goal', which does not seem to exist when the game ends 0-0.

    Example of a game where this issue happens: https://www.whoscored.com/Matches/1640849/Live/England-Premier-League-2022-2023-Newcastle-Leeds

    Python code :

    import soccerdata as sd

    ws = sd.WhoScored(leagues="ENG-Premier League", seasons="22-23")
    events = ws.read_events(match_id=1640849)

    Error :

    KeyError: ['is_goal'] not in index

    A solution could be to check whether all the necessary features are indeed part of the dataframe and, if not, add them with np.nan values. The Python code could be as follows, inserted before line 676:

    Python code :

    cols = ['event_id', 'expanded_minute', 'is_touch', 'minute', 'outcome_type', 'period', 'qualifiers', 'satisfied_events_types', 'second', 'team_id', 'type', 'x', 'y', 'end_x', 'end_y', 'player_id', 'blocked_x', 'blocked_y', 'goal_mouth_y', 'goal_mouth_z', 'is_shot', 'related_event_id', 'related_player_id', 'is_goal', 'card_type', '$idx', '$len', 'field', 'minute_info', 'satisfiers', 'text', 'game_id', 'player', 'team']

    for col in cols:
        if col not in df.columns:
            df[col] = np.nan
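The same guard can be written as a single pandas reindex, which fills any absent column with NaN; a minimal demonstration on a dummy events frame:

```python
import pandas as pd

# Abbreviated expected-column list; the full list is in the suggestion above.
cols = ["event_id", "minute", "is_goal", "card_type"]

# A 0-0 game with no cards: the goal/card columns never appear.
df = pd.DataFrame({"event_id": [1, 2], "minute": [10, 55]})

# Ensure every expected column exists; absent ones become NaN.
df = df.reindex(columns=cols)
print(df.columns.tolist())          # ['event_id', 'minute', 'is_goal', 'card_type']
print(bool(df["is_goal"].isna().all()))  # True
```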

    PS: A similar issue exists with the 'card_type' feature if no cards were given during the game.

    Thanks, Ben

    opened by BenSarfatiDS 0
  • [WhoScored] Date Format problem

    [WhoScored] Date Format problem

    Hello,

    I'm trying to pull the schedule from any league, but I keep getting an error about the date format. Even when I input the match ID, it still fails to read the data because of the date format. How can I solve it?

    ValueError: time data 'Jumatatu, Des 26 2022 12:30' does not match format '%A, %b %d %Y %H:%M'
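The string is localized Swahili ('Jumatatu' is Monday, 'Des' is December), which %A/%b cannot match under an English locale. One workaround is to translate the day and month names before calling strptime; the mapping below is deliberately partial and illustrative:

```python
from datetime import datetime

# Partial Swahili -> English translations; extend for all days and months.
SW_TO_EN = {"Jumatatu": "Monday", "Des": "Dec"}

raw = "Jumatatu, Des 26 2022 12:30"
for sw, en in SW_TO_EN.items():
    raw = raw.replace(sw, en)

parsed = datetime.strptime(raw, "%A, %b %d %Y %H:%M")
print(parsed.isoformat())  # 2022-12-26T12:30:00
```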

    opened by CBatatinha 2
  • Update dependency poetry to v1.3.1

    Update dependency poetry to v1.3.1

    Mend Renovate

    This PR contains the following updates:

    | Package | Change | Age | Adoption | Passing | Confidence |
    |---|---|---|---|---|---|
    | poetry (source, changelog) | ==1.2.2 -> ==1.3.1 | age | adoption | passing | confidence |


    Release Notes

    python-poetry/poetry

    v1.3.1

    Compare Source

    Fixed
    • Fix an issue where an explicit dependency on lockfile was missing, resulting in a broken Poetry in rare circumstances (7169).

    v1.3.0

    Compare Source

    Added
    • Mark the lock file with an @generated comment as used by common tooling (#​2773).
    • poetry check validates trove classifiers and warns for deprecations (#​2881).
    • Introduce a top level -C, --directory option to set the working path (#​6810).
    Changed
    • New lock file format (version 2.0) (#​6393).
    • Path dependency metadata is unconditionally re-locked (#​6843).
    • URL dependency hashes are locked (#​7121).
    • poetry update and poetry lock should now resolve dependencies more similarly (#​6477).
    • poetry publish will report more useful errors when a file does not exist (#​4417).
    • poetry add will check for duplicate entries using canonical names (#​6832).
    • Wheels are preferred to source distributions when gathering metadata (#​6547).
    • Git dependencies of extras are only fetched if the extra is requested (#​6615).
    • Invoke pip with --no-input to prevent hanging without feedback (#​6724, #​6966).
    • Invoke pip with --isolated to prevent the influence of user configuration (#​6531).
    • Interrogate environments with Python in isolated (-I) mode (#​6628).
    • Raise an informative error when multiple version constraints overlap and are incompatible (#​7098).
    Fixed
    • Fix an issue where concurrent instances of Poetry would corrupt the artifact cache (#​6186).
    • Fix an issue where Poetry can hang after being interrupted due to stale locking in cache (#​6471).
    • Fix an issue where the output of commands executed with --dry-run contained duplicate entries (#​4660).
    • Fix an issue where requests's pool size did not match the number of installer workers (#​6805).
    • Fix an issue where poetry show --outdated failed with a runtime error related to direct origin dependencies (#​6016).
    • Fix an issue where only the last command of an ApplicationPlugin is registered (#​6304).
    • Fix an issue where git dependencies were fetched unnecessarily when running poetry lock --no-update (#​6131).
    • Fix an issue where stdout was polluted with messages that should go to stderr (#​6429).
    • Fix an issue with poetry shell activation and zsh (#​5795).
    • Fix an issue where a url dependencies were shown as outdated (#​6396).
    • Fix an issue where the source field of a dependency with extras was ignored (#​6472).
    • Fix an issue where a package from the wrong source was installed for a multiple-constraints dependency with different sources (#​6747).
    • Fix an issue where dependencies from different sources where merged during dependency resolution (#​6679).
    • Fix an issue where experimental.system-git-client could not be used via environment variable (#​6783).
    • Fix an issue where Poetry fails with an AssertionError due to distribution.files being None (#​6788).
    • Fix an issue where poetry env info did not respect virtualenvs.prefer-active-python (#​6986).
    • Fix an issue where poetry env list does not list the in-project environment (#​6979).
    • Fix an issue where poetry env remove removed the wrong environment (#​6195).
    • Fix an issue where the return code of a script was not relayed as exit code (#​6824).
    • Fix an issue where the solver could silently swallow ValueError (#​6790).
    Docs
    • Improve documentation of package sources (#​5605).
    • Correct the default cache path on Windows (#​7012).
    poetry-core (1.4.0)
    • The PEP 517 metadata_directory is now respected as an input to the build_wheel hook (#​487).
    • ParseConstraintError is now raised on version and constraint parsing errors, and includes information on the package that caused the error (#​514).
    • Fix an issue where invalid PEP 508 requirements were generated due to a missing space before semicolons (#​510).
    • Fix an issue where relative paths were encoded into package requirements, instead of a file:// URL as required by PEP 508 (#​512).
    poetry-plugin-export (^1.2.0)
    • Ensure compatibility with Poetry 1.3.0. No functional changes.
    cleo (^2.0.0)
    • Fix an issue where shell completions had syntax errors (#​247).
    • Fix an issue where not reading all the output of a command resulted in a "Broken pipe" error (#​165).
    • Fix an issue where errors were not shown in non-verbose mode (#​166).

    Configuration

    📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

    🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

    Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

    🔕 Ignore: Close this PR and you won't be reminded about this update again.


    • [ ] If you want to rebase/retry this PR, check this box

    This PR has been generated by Mend Renovate. View repository job log here.

    opened by renovate[bot] 0
  • Duplicate data after updating to version 1.3.0


    After updating from version 1.2.0 to 1.3.0, data from FBref is repeated twice. The sample code:

    fb_data = fbref.read_player_match_stats(
        stat_type='summary', match_id=None, force_cache=False)
    fb_data.to_csv('./summary.csv')
    

    Log from version 1.3.0: [screenshot]

    Log from version 1.2.0: [screenshot]

    In version 1.3.0, the log line Retrieving game with id=**** appears twice for each game.
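Until this regression is fixed, one hedged stopgap is to drop the duplicated rows after reading; the frame below is a toy stand-in for the real FBref output:

```python
import pandas as pd

# Toy illustration (not real FBref output): if every game's rows are read
# twice, the exact duplicates can be dropped after the fact.
fb_data = pd.DataFrame(
    {"player": ["A", "B", "A", "B"], "goals": [1, 0, 1, 0]}
)
deduped = fb_data.drop_duplicates().reset_index(drop=True)
print(len(deduped))  # 2 rows remain
```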

    opened by DonBrowny 2
  • Update dependency flake8 to v6


    Mend Renovate

    This PR contains the following updates:

| Package | Change |
|---|---|
| flake8 (changelog) | ^5.0.4 -> ^6.0.0 |


    Release Notes

    pycqa/flake8

    v6.0.0

    Compare Source



    opened by renovate[bot] 2
  • In-depth tutorial on how to add new leagues?


    Hi,

    This is an amazing package! I think the docs are mostly very clear. However, is it possible to have a more in-depth tutorial on how to add new leagues to FBRef? I'm trying to add the English Championship, which is available on FB Ref, but wasn't able to. I added a league_dict.json file (with the correct config I assume) to the "SOCCERDATA_DIR/config/" file path, but it seems like the code is not picking up on it when I call fbref = sd.FBref(leagues="EFL Championship", seasons=2019). It gave me a ValueError noting "Invalid League". Thank you so much!
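    For reference, a sketch of writing such a config programmatically, assuming the league_dict.json format from the soccerdata docs: the top-level key must exactly match the name you later pass as leagues=, each source key ("FBref" here) maps to the league's name on that site, and the file must sit in SOCCERDATA_DIR/config/ (default: ~/soccerdata/config/).

    ```python
    import json
    import os
    from pathlib import Path

    # Hedged sketch of a league_dict.json entry for the English Championship.
    # The key names below follow the documented format; the league name on
    # fbref.com is assumed to be "Championship".
    custom_leagues = {
        "ENG-Championship": {
            "FBref": "Championship",
            "season_start": "Aug",
            "season_end": "May",
        }
    }

    config_dir = Path(os.environ.get("SOCCERDATA_DIR", Path.home() / "soccerdata")) / "config"
    config_dir.mkdir(parents=True, exist_ok=True)
    (config_dir / "league_dict.json").write_text(json.dumps(custom_leagues, indent=2))
    ```

    The config appears to be read when soccerdata is imported, so write the file (and set SOCCERDATA_DIR, if you use it) before importing the package, then create the reader with the exact key, e.g. sd.FBref(leagues="ENG-Championship", seasons=2019).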

    documentation 
    opened by cj0121 5
Releases(v1.3.0)
  • v1.3.0(Nov 26, 2022)

    New features

    Add support for scraping World Cup data

    The World Cup was added to the default available leagues for the WhoScored and FBref readers. Other tournaments can be added by modifying the league_dict.json config file.

    from soccerdata import WhoScored, FBref
    
    ws = WhoScored(leagues="INT-World Cup", seasons="2022")
    fb = FBref(leagues="INT-World Cup", seasons="2022")
    

    Changes

    • The WhoScored reader now uses the non-headless mode by default. Scraping in headless mode typically results in getting blocked quickly. The old behaviour can be recovered by initializing the reader as WhoScored(..., headless=True).

    Fixes

    • The WhoScored reader can now deal with an empty match schedule, which can occur before the start of a season or tournament round.
    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(Oct 23, 2022)

    New features

    Faster scraping of Big 5 leagues stats (by @andrewRowlinson)

    FBref has pages for the Big 5 European leagues that allow you to retrieve team and player data from multiple leagues more efficiently. This release adds a special "Big 5 European Leagues Combined" league option to get data from these pages.

    import soccerdata as sd
    fbref = sd.FBref(leagues="Big 5 European Leagues Combined", seasons="20-21")
    team_season_stats = fbref.read_team_season_stats(stat_type="standard")
    player_season_stats = fbref.read_player_season_stats(stat_type="standard")
    
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Sep 27, 2022)

    New features

    FBref

    Faster scraping of player season stats (#69)

    Previously, the fbref.read_player_season_stats method visited the page of each individual team in a league to obtain stats for the players in that league. FBref now has a single page per league/season listing stats for every player in the league (e.g., https://fbref.com/en/comps/9/stats/Premier-League-Stats). As a result, the fbref.read_player_season_stats(...) method now makes 15-20x fewer requests, leading to a large speed-up.

    Support retrieving "Opponent Stats" (#78)

    A "opponent_stats" flag was added to the fbref.read_season_stats(...) function, which enables retrieving the "Opponent Stats" table of a team.

    Always group "MP" under "Playing Time" (#79)

    FBref is inconsistent in how it displays the "MP" (Matches Played) column. For some seasons it is displayed as a separate category, while for others it is grouped under "Playing Time". This results in columns with NaN values when two seasons are merged. Therefore, the "MP" column is now always placed under "Playing Time".
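    The NaN symptom can be reproduced in a few lines of pandas — an illustration of the problem and the normalization, not soccerdata's internal code:

    ```python
    import pandas as pd

    # One season puts "MP" at the top level of the column MultiIndex, another
    # groups it under "Playing Time"; concatenating yields two half-empty columns.
    s1 = pd.DataFrame({("MP", ""): [38]})
    s2 = pd.DataFrame({("Playing Time", "MP"): [34]})
    naive = pd.concat([s1, s2], ignore_index=True)  # 2 columns, NaN in each

    # Normalizing the column location first gives a single clean column.
    s1.columns = pd.MultiIndex.from_tuples([("Playing Time", "MP")])
    clean = pd.concat([s1, s2], ignore_index=True)
    print(clean[("Playing Time", "MP")].tolist())  # [38, 34]
    ```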

    Docs

    Add docs for specifying custom proxy (#83)

    Not all Tor distributions use the same default port of 9050. The docs now describe how to configure a custom port.

    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Apr 23, 2022)

    Breaking Changes

    • Several columns in the output dataframes were renamed, added, and dropped to increase uniformity between data sources.

    New features

    WhoScored

    The WhoScored reader can now return event data in various output formats. The following formats are supported:

    • A dataframe with all events.
    • A dict with the original unformatted WhoScored JSON.
    • A dataframe with the SPADL representation of the original events.
    • A dataframe with the Atomic-SPADL representation of the original events.
    • A socceraction.data.opta.OptaLoader instance.
    • No data. This is useful for caching data.

    See https://soccerdata.readthedocs.io/en/latest/datasources/WhoScored.html for examples.

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Apr 22, 2022)

    Breaking Changes

    • The use_tor parameter was replaced by a use_proxy='tor' parameter in all readers.

    New features

    • You can specify a custom proxy using the use_proxy parameter for all readers.
    ws = soccerdata.WhoScored(use_proxy={'http': 'http://126.352.12.3:5471'})
    

    Fixes

    FBref

    • FBref has implemented a new rate-limiting policy allowing only one request every two seconds. The FBref reader is now configured to comply with this limit.
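    A client-side limiter of this kind can be sketched in a few lines — a minimal illustration, not soccerdata's actual implementation:

    ```python
    import time

    class RateLimiter:
        """Ensure at least `min_interval` seconds between successive requests."""

        def __init__(self, min_interval: float = 2.0):
            self.min_interval = min_interval
            self._last = float("-inf")  # first call never waits

        def wait(self) -> None:
            # Sleep just long enough to respect the interval, then record the time.
            elapsed = time.monotonic() - self._last
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
            self._last = time.monotonic()

    limiter = RateLimiter(min_interval=0.1)
    start = time.monotonic()
    for _ in range(3):
        limiter.wait()  # first call is free, the next two each wait ~0.1s
    total = time.monotonic() - start
    print(f"{total:.2f}s")
    ```

    Calling limiter.wait() immediately before each HTTP request is enough to stay under the one-request-per-two-seconds cap.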
    Source code(tar.gz)
    Source code(zip)
  • v0.0.3(Mar 20, 2022)

    Bugfixes

    WhoScored

    • The summary tab is now used as a backup for retrieving the schedule when the fixtures tab is empty. This often occurs for multi-stage tournaments. (#15)
    • Fixed incorrect resolver rules for the Tor proxy. (#23)

    MatchHistory

    • Football-data.co.uk switched from http to https only.

    Docs

    • Added example notebooks for reading data from each supported data source.
    Source code(tar.gz)
    Source code(zip)
  • v0.0.2(Feb 16, 2022)

    Bugfixes

    FBref

    • The FBref reader crashed while scraping match stats, lineups or shots for the current season as it did not handle future games correctly.

    Testing Improvements

    • Sets up CI using GitHub Actions
    • Sets up automatic dependency updates using the Renovate bot
    Source code(tar.gz)
    Source code(zip)
Owner
Pieter Robberechts
CS Engineer, PhD student in sports analytics, Data geek