Apache (Py)Spark type annotations (stub files).

Overview

PySpark Stubs

A collection of the Apache Spark stub files. These files were generated by stubgen and manually edited to include accurate type hints.

Tests and configuration files were originally adapted from the Typeshed project. Please refer to its contributors list and license for details.

Important

This project has been merged with the main Apache Spark repository (SPARK-32714). All further development for Spark 3.1 and onwards will be continued there.

For Spark 2.4 and 3.0, development of this package will continue until their official deprecation.

  • If your problem is specific to Spark 2.4 or 3.0, feel free to create an issue or open a pull request here.
  • Otherwise, please check the official Spark JIRA and contributing guidelines. If you create a JIRA ticket or Spark PR related to type hints, please ping me with [~zero323] or @zero323 respectively. Thanks in advance.

Motivation

  • Static error detection (see SPARK-20631); a minimal illustration follows this list.

    [Image: static error detection (SPARK-20631)]

  • Improved autocompletion.

    [Image: syntax completion]
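
As a minimal sketch of the first point (assuming the stubs are installed and the file is checked with mypy; the exact error wording may differ):

    from pyspark.sql import DataFrame


    def count_rows(df: DataFrame) -> int:
        # With the stubs in place, mypy knows DataFrame.count() returns an int.
        return df.count()


    def broken(df: DataFrame) -> None:
        # join() expects another DataFrame, so mypy flags this call statically,
        # roughly: Argument 1 to "join" of "DataFrame" has incompatible type "str"
        df.join("not_a_dataframe")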

Installation and usage

Please note that the guidelines for distributing type information are still a work in progress (PEP 561 - Distributing and Packaging Type Information). Currently, the installation script overlays existing Spark installations (pyi stub files are copied next to their py counterparts in the PySpark installation directory). If this approach is not acceptable, you can add the stub files to the search path manually.

According to PEP 484:

Third-party stub packages can use any location for stub storage. Type checkers should search for them using PYTHONPATH. A default fallback directory that is always checked is shared/typehints/python3.5/ (or 3.6, etc.)

Please check usage before proceeding.

The package is available on PyPI:

pip install pyspark-stubs

and conda-forge:

conda install -c conda-forge pyspark-stubs

Depending on your environment, you might also need a type checker, like Mypy or Pytype [1], and an autocompletion tool, like Jedi.

Editor                                     Type checking   Autocompletion   Notes
Atom                                       ✔ [2]           ✔ [3]            Through plugins.
IPython / Jupyter Notebook                 ✔ [4]
PyCharm                                    ✔               ✔
PyDev                                      ✔ [5]           ?
VIM / Neovim                               ✔ [6]           ✔ [7]            Through plugins.
Visual Studio Code                         ✔ [8]           ✔ [9]            Completion with plugin
Environment independent / other editors    ✔ [10]          ✔ [11]           Through Mypy and Jedi.

This package is tested against the MyPy development branch and, in rare cases (primarily to pick up important upstream bugfixes), may not be compatible with the preceding MyPy release.

PySpark Version Compatibility

Package versions follow PySpark versions, with the exception of maintenance releases - i.e. pyspark-stubs==2.3.0 should be compatible with pyspark>=2.3.0,<2.4.0. Maintenance releases (post1, post2, ..., postN) are reserved for internal annotation updates.

API Coverage:

As of release 2.4.0, most of the public API is covered. For details, please check the API coverage document.

See also

Disclaimer

Apache Spark, Spark, PySpark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This project is not owned, endorsed, or sponsored by The Apache Software Foundation.

Footnotes

[1] Not supported or tested.
[2] Requires atom-mypy or equivalent.
[3] Requires autocomplete-python-jedi or equivalent.
[4] It is possible to use magics to type check directly in the notebook. In general though, you'll have to export the whole notebook to a .py file and run the type checker on the result.
[5] Requires PyDev 7.0.3 or later.
[6] Using vim-mypy, syntastic, or Neomake.
[7] With jedi-vim.
[8] With Mypy linter.
[9] With Python extension for Visual Studio Code.
[10] Just use your favorite checker directly, optionally combined with a tool like entr.
[11] See Jedi editor plugins list.
Comments
  • Fix 2-argument math functions

    Fixes the binary math functions:

    • atan2 and hypot take two arguments, not one
    • pow supports taking a literal numeric value as its second argument in addition to a Column.
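
    For reference, a rough sketch of what the corrected signatures in pyspark/sql/functions.pyi could look like (the ColumnOrName alias is written out locally for illustration; the exact types accepted by the final stubs may differ):

    from typing import Union

    from pyspark.sql.column import Column

    ColumnOrName = Union[Column, str]  # local alias for illustration only

    # Both arguments accept a Column, a column name, or a literal float.
    def atan2(col1: Union[ColumnOrName, float], col2: Union[ColumnOrName, float]) -> Column: ...
    def hypot(col1: Union[ColumnOrName, float], col2: Union[ColumnOrName, float]) -> Column: ...
    def pow(col1: Union[ColumnOrName, float], col2: Union[ColumnOrName, float]) -> Column: ...
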
    bug 3.0 2.3 2.4 
    opened by harpaj 10
  • Jedi doesn't work with MLReaders

    It seems like there is some problem with Jedi compatibility. Some components seem to work pretty well. For example, DataFrame without stubs:

    In [1]: import jedi                                                                                                                                                                                                
    
    In [2]: from pyspark.sql import SparkSession                                                                                                                                                                       
    
    In [3]: jedi.Interpreter("SparkSession.builder.getOrCreate().createDataFrame([]).", [globals()]).completions()                                                                                                     
    ---------------------------------------------------------------------------
    AttributeError   
    ...
    AttributeError: 'ModuleContext' object has no attribute 'py__path__'
    

    and with stubs:

    In [1]: from pyspark.sql import SparkSession                                                                                                                                                                       
    
    In [2]: import jedi                                                                                                                                                                                                
    
    In [3]: jedi.Interpreter("SparkSession.builder.getOrCreate().createDataFrame([]).", [globals()]).completions()                                                                                                     
    Out[3]: 
    [<Completion: agg>,
     <Completion: alias>,
     <Completion: approxQuantile>,
     <Completion: cache>,
     <Completion: checkpoint>,
     <Completion: coalesce>,
     <Completion: collect>,
     <Completion: colRegex>,
     <Completion: columns>,
     <Completion: corr>,
     <Completion: count>,
     <Completion: cov>,
    ...
     <Completion: __str__>]
    

    So far so good. However, if we take for example LinearRegressionModel.load, things don't work so well. Without stubs it provides no suggestions:

    In [1]: import jedi                                                                                                                                                                                                
    
    In [2]: from pyspark.ml.regression import LinearRegressionModel                                                                                                                                                    
    
    In [3]: jedi.Interpreter("LinearRegressionModel.load('foo').", [globals()]).completions()                                                                                                                          
    Out[3]: []
    

    but the ones provided with stubs

    In [1]: import jedi                                                                                                                                                                                                
    
    In [2]: from pyspark.ml.regression import LinearRegressionModel                                                                                                                                                    
    
    In [3]: jedi.Interpreter("LinearRegressionModel.load('foo').", [globals()]).completions()                                                                                                                          
    Out[3]: 
    [<Completion: load>,
     <Completion: read>,
     <Completion: __annotations__>,
     <Completion: __class__>,
     <Completion: __delattr__>,
     <Completion: __dict__>,
     <Completion: __dir__>,
     <Completion: __doc__>,
     <Completion: __eq__>,
     <Completion: __format__>,
     <Completion: __getattribute__>,
     <Completion: __hash__>,
     <Completion: __init__>,
     <Completion: __init_subclass__>,
     <Completion: __module__>,
     <Completion: __ne__>,
     <Completion: __new__>,
     <Completion: __reduce__>,
     <Completion: __reduce_ex__>,
     <Completion: __repr__>,
     <Completion: __setattr__>,
     <Completion: __sizeof__>,
     <Completion: __slots__>,
    

    don't make much sense. If the model is fitted:

    In [4]: from pyspark.ml.regression import LinearRegression                                                                                                                                                         
    
    In [5]: jedi.Interpreter("LinearRegression().fit(...).", [globals()]).completions()                                                                                                                                
    Out[5]: 
    [<Completion: aggregationDepth>,
     <Completion: append>,
     <Completion: clear>,
     <Completion: coefficients>,
     <Completion: copy>,
     <Completion: count>,
    ....
     <Completion: __str__>]
    

    A model which is explicitly annotated works fine, so it seems like there is something in MLReader or one of its subclasses that causes the failure.

    We already have data tests for this (as well as some test cases from apache/spark examples), and mypy seems to be fine with this.

    Since LinearRegression.fit works fine (and some toy tests confirm that), Generics alone are not sufficient to reproduce the problem. So it seems like the type parameter is not processed correctly on the path:

    Tested with:

    • jedi==0.15.2 and jedi==0.16.0 (0c56aa4).
    • pyspark-stubs==3.0.0.dev5
    • pyspark==3.0.0.dev0 (afe70b3)
    opened by zero323 7
  • DataFrameReader.load parameters incorrectly expected all to be strings

    Using 2.4.0.post6

    spark.read.load(folders, inferSchema=True, header=False)
    

    mypy reports Expected type 'str', got 'bool' instead for both inferSchema and header.

    Looks like the issue is in third_party/3/pyspark/sql/readwriter.pyi Line 23, where in the definition for load() we have **options: str. For csv support this needs to be **options: Optional[Union[bool, str, int]], but to handle the general case it probably needs to be **options: Any.
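
    A hedged sketch of how the load() entry could be widened (imports and defaults are simplified here; the real stub has more parameters):

    from typing import Any, List, Optional, Union

    from pyspark.sql.dataframe import DataFrame
    from pyspark.sql.types import StructType

    class DataFrameReader:
        def load(
            self,
            path: Optional[Union[str, List[str]]] = ...,
            format: Optional[str] = ...,
            schema: Optional[StructType] = ...,
            # Was `**options: str`, which rejected options like inferSchema=True.
            **options: Any,
        ) -> DataFrame: ...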

    enhancement 
    opened by ghost 7
  • Added contains to Column

    The contains method is missing from the stubs, causing mypy to raise error: "Column" not callable.

    This PR adds the type hints to 2.4 specifically (the version we are using), but they should probably also be added to the other versions.
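
    A minimal, .pyi-style sketch of the entry this adds (the parameter name and type are illustrative):

    from typing import Any

    class Column:
        # Lets calls such as df.name.contains("foo") type-check instead of
        # being reported as '"Column" not callable'.
        def contains(self, other: Any) -> "Column": ...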

    opened by Braamling 6
  • #394: Use Union[List[Column], List[str]] for Select

    Passing a List[str] to select raises a mypy warning, and similarly for List[Column]. We change the type from List[Union[Column, str]] to Union[List[Column], List[str]].

    Fixes #394.
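
    The underlying reason is that List is invariant, so a List[str] value is not accepted where List[Union[Column, str]] is expected. A simplified, .pyi-style sketch of the resulting overloads (the real stub has additional variants):

    from typing import List, Union, overload

    from pyspark.sql.column import Column

    class DataFrame:
        @overload
        def select(self, *cols: Union[Column, str]) -> "DataFrame": ...
        @overload
        def select(self, cols: Union[List[Column], List[str]]) -> "DataFrame": ...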

    opened by jhereth 5
  • Update distinct() and repartition() definitions

    Update the repartition functions to allow a Column for the numPartitions parameter.

    Reference

    numPartitions – can be an int to specify the target number of partitions or a Column.
        If it is a Column, it will be used as the first partitioning column.
        If not specified, the default number of partitions is used.
    

    Also adds a stub for DataFrame#distinct().
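
    An illustrative, .pyi-style sketch of the updated entries (simplified):

    from typing import Union, overload

    from pyspark.sql.column import Column

    class DataFrame:
        @overload
        def repartition(self, numPartitions: int, *cols: Union[Column, str]) -> "DataFrame": ...
        # A Column passed first is used as the first partitioning column.
        @overload
        def repartition(self, *cols: Union[Column, str]) -> "DataFrame": ...
        def distinct(self) -> "DataFrame": ...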

    opened by zpencerq 5
  • Allow `Column` type for timezone argument in pyspark.sql.functions

    In the functions here: https://github.com/zero323/pyspark-stubs/blob/3c4684a224c1be4eea4577e475f8bb4d045edddd/third_party/3/pyspark/sql/functions.pyi#L100-L101 we currently have tz: str, but this can also be specified as a Column.

    Example:

    >>> from pyspark.sql import functions
    >>> df = spark.sql("SELECT CAST(0 AS TIMESTAMP) AS timestamp, 'Asia/Tokyo' AS tz")
    >>> df.select(functions.from_utc_timestamp(df.timestamp, df.tz)).collect()
    [Row(from_utc_timestamp(timestamp, tz)=datetime.datetime(1970, 1, 1, 18, 0))]
    

    I think this could be expanded to tz: ColumnOrName?
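
    A sketch of the proposed widening (only from_utc_timestamp is shown; ColumnOrName is written out locally for illustration):

    from typing import Union

    from pyspark.sql.column import Column

    ColumnOrName = Union[Column, str]  # local alias for illustration only

    # tz accepts a Column or column name in addition to a literal timezone string.
    def from_utc_timestamp(timestamp: ColumnOrName, tz: ColumnOrName) -> Column: ...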

    3.0 2.4 3.1 
    opened by charlietsai 4
  • Overload DataFrame.drop: sequences must be *str

    The method DataFrame.drop expects either a single Column, a single str, or an iterable of strings. This is only type-checked inside the function, though.

    Currently the type hints (and the actual API) allow passing multiple Columns, but doing so results in a runtime error. Personally, I'd like to have that caught earlier. But as this might be getting too close to the internals of the function, I'd like to hear your opinion on whether or not the type hints should "look inside" to aid development.

    opened by oliverw1 4
  • provide overloaded methods for sample

    The fraction is a required argument to the sample method. Anytime someone calls df.sample(.01), mypy responds with:

    Argument 1 to "sample" of "DataFrame" has incompatible type "float"; expected "Optional[bool]"

    In the PySpark API, the three arguments are in fact all optional and are handled later to ensure that fraction is given. This is probably done to stay consistent with the Scala API.

    By overloading the methods, the issue is resolved.
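
    A simplified, .pyi-style sketch of the overloads, so that df.sample(.01) matches the single-argument form:

    from typing import Optional, overload

    class DataFrame:
        @overload
        def sample(self, fraction: float, seed: Optional[int] = ...) -> "DataFrame": ...
        @overload
        def sample(
            self,
            withReplacement: Optional[bool],
            fraction: float,
            seed: Optional[int] = ...,
        ) -> "DataFrame": ...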

    opened by oliverw1 4
  • Allow non-string load/save parameters

    Resolves #273

    Additional parameters to DataFrameReader.load() and DataFrameWriter.save()/.saveAsTable() are passed to the file-type-specific reader or writer types. These parameters can be of any type.

    opened by mark-oppenheim 4
  • Fix return type for DataFrame.groupBy / cube / rollup

    2.3 has these data types and I was erroneously getting errors for them.

    Note this is a port of e2d225f06ff36fcbf79e2123f1c18f380e862728

    I tried a cherry-pick but it had some issues (not sure why).

    opened by dangercrow 4