Large-scale linear classification, regression and ranking in Python

Last update: Dec 31, 2022

Overview

lightning

lightning is a library for large-scale linear classification, regression and ranking in Python.

Highlights:

follows the scikit-learn API conventions
supports natively both dense and sparse data representations
computationally demanding parts implemented in Cython

Solvers supported:

primal coordinate descent
dual coordinate descent (SDCA, Prox-SDCA)
SGD, AdaGrad, SAG, SAGA, SVRG
FISTA
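
To make the first entry in the solver list more concrete, here is a minimal pure-Python sketch of primal coordinate descent for an L1-penalized least-squares problem. It illustrates the general technique under simple assumptions (dense lists, cyclic updates, a fixed number of sweeps) and is not lightning's Cython implementation; the names soft_threshold and cd_lasso are made up for this example.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def cd_lasso(X, y, alpha, n_sweeps=100):
    """Minimize 0.5 * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # residual = y - Xw, kept up to date incrementally
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(n_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot change the fit
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n))
            new_wj = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


# Tiny problem with orthogonal columns, so each coordinate has a closed form.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
w = cd_lasso(X, y, alpha=1.0)
print(w)
```

The second coefficient is set exactly to zero because its correlation with the target is below alpha, which is the sparsity-inducing effect of the L1 penalty.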

Example

The following example shows how to learn a multiclass classifier with a group lasso penalty on the News20 dataset (cf. Blondel et al. 2013):

from sklearn.datasets import fetch_20newsgroups_vectorized
from lightning.classification import CDClassifier

# Load News20 dataset from scikit-learn.
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
y = bunch.target

# Set classifier options.
clf = CDClassifier(penalty="l1/l2",
                   loss="squared_hinge",
                   multiclass=True,
                   max_iter=20,
                   alpha=1e-4,
                   C=1.0 / X.shape[0],
                   tol=1e-3)

# Train the model.
clf.fit(X, y)

# Accuracy
print(clf.score(X, y))

# Percentage of selected features
print(clf.n_nonzero(percentage=True))
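
The group lasso penalty ("l1/l2") in the example above ties together all the coefficients of one feature across classes, so a feature is either kept for every class or dropped entirely. The following pure-Python sketch shows the group soft-thresholding step that produces this behaviour. It is an illustration only: the helper names are hypothetical, and the matrix is laid out with one row per feature for readability, which is an assumption of this sketch rather than a statement about lightning's coef_ layout.

```python
import math


def group_soft_threshold(W, t):
    """Shrink the Euclidean norm of each row of W by t; rows with norm <= t become zero."""
    out = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        if norm <= t:
            out.append([0.0] * len(row))
        else:
            scale = 1.0 - t / norm
            out.append([scale * v for v in row])
    return out


def nonzero_fraction(W):
    """Fraction of rows (features) with at least one nonzero coefficient."""
    kept = sum(1 for row in W if any(v != 0.0 for v in row))
    return kept / len(W)


# Three features, two classes.
W = [[3.0, 4.0], [0.3, -0.4], [0.0, 2.0]]
W_new = group_soft_threshold(W, t=1.0)
print(W_new)
print(nonzero_fraction(W_new))
```

The second feature has a small norm across both classes, so it is removed for both at once; this is the kind of feature selection that the percentage printed at the end of the example reports.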

Dependencies

lightning requires Python >= 2.7, setuptools, NumPy >= 1.3, SciPy >= 0.7 and scikit-learn >= 0.15. Building from source also requires Cython and a working C/C++ compiler. To run the tests you will also need nose >= 0.10.

Installation

Precompiled binaries for the stable version of lightning are available for the main platforms and can be installed using pip:

pip install sklearn-contrib-lightning

or conda:

conda install -c conda-forge sklearn-contrib-lightning

The development version of lightning can be installed from its git repository. In this case it is assumed that you have the git version control system, a working C++ compiler, Cython and the numpy development libraries. In order to install the development version, type:

git clone https://github.com/scikit-learn-contrib/lightning.git
cd lightning
python setup.py build
sudo python setup.py install

Documentation

http://contrib.scikit-learn.org/lightning/

On Github

https://github.com/scikit-learn-contrib/lightning

Citing

If you use this software, please cite it. Here is a BibTeX snippet that you can use:

@misc{lightning_2016,
  author       = {Blondel, Mathieu and
                  Pedregosa, Fabian},
  title        = {{Lightning: large-scale linear classification,
                 regression and ranking in Python}},
  year         = 2016,
  doi          = {10.5281/zenodo.200504},
  url          = {https://doi.org/10.5281/zenodo.200504}
}

Other citation formats are available in its Zenodo entry.

Authors

Mathieu Blondel, 2012-present
Manoj Kumar, 2015-present
Arnaud Rachez, 2016-present
Fabian Pedregosa, 2016-present

Comments

[MRG] Parallelize OvR method in primal_cd

@mblondel I was trying to get some speed gains by parallelizing the OvR method. However, when I set n_jobs>1 it keeps failing with this error: TypeError: __cinit__() takes exactly 1 positional argument (0 given). Note that it works as expected for n_jobs=1.
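
One common cause of this kind of error, offered here as an assumption since the thread does not confirm it, is that parallel backends serialize objects before sending them to workers, and an object whose constructor needs an argument cannot be rebuilt unless it says how. The pure-Python sketch below mimics the situation with __new__ standing in for Cython's __cinit__; the class names are hypothetical, and rebuild simply replays the reconstruction recipe that pickle would store.

```python
class Loss:
    """Stand-in for an extension type whose constructor needs one argument."""

    def __new__(cls, n_classes):
        obj = super().__new__(cls)
        obj.n_classes = n_classes
        return obj


class PicklableLoss(Loss):
    """Same object, but it states how to rebuild itself."""

    def __reduce__(self):
        # (callable, arguments) pair used to recreate the object.
        return (type(self), (self.n_classes,))


def rebuild(obj):
    """Replay the reconstruction recipe that pickle would store for obj."""
    recipe = obj.__reduce_ex__(2)
    func, args = recipe[0], recipe[1]
    return func(*args)


def can_rebuild(obj):
    try:
        rebuild(obj)
        return True
    except TypeError:
        # The default recipe calls __new__ without the required argument.
        return False


plain_ok = can_rebuild(Loss(3))
fixed_ok = can_rebuild(PicklableLoss(3))
print(plain_ok, fixed_ok)
```

With a single job nothing is serialized, which would be consistent with the code working for n_jobs=1.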

opened by MechCoder 37

[WIP] Adding prox capability to SAGA.

Continuing #37 after discussing with @fabianp. Added prox capability in _sag_fit of file lightning/impl/sag_fast.pyx where @fabianp left room for it.

The proximity operator is currently specified when a classifier/regressor is built with the prox keyword (type ProxFunction mimicking LossFunction in lightning/impl/sgd_fast.pyx). Not sure this is the best way to specify it by default...

Note: the prox implementation breaks sparse updates and the code is excruciatingly slow on sklearn.datasets.fetch_20newsgroups_vectorized (cf. this gist)

[x] Draft of proximity operators.

[x] Need to add tests.

[x] Add sparsity in L1

opened by zermelozf 31

[MRG] Just in time SAGA.

A squashed version of #38 containing:

SAGA algorithm in cython.

Basic python version of SAG and SAGA for testing.

Support for proximity operators through the Penalty base class.

L1 proximity operator with just in time update for sparse data.
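
The idea behind a just-in-time update can be sketched in a simplified setting: if a coordinate is untouched for several iterations and each skipped iteration would only have applied the L1 proximity operator with the same threshold, all skipped steps collapse into a single soft-thresholding with the accumulated threshold. The code below is a hedged illustration of that identity only. It ignores the averaged-gradient term that a real SAGA update also applies to every coordinate, and the function names are invented for this example.

```python
def soft_threshold(z, t):
    """L1 proximity operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def eager_updates(value, t, n_steps):
    """Apply the proximity operator at every iteration."""
    for _ in range(n_steps):
        value = soft_threshold(value, t)
    return value


def lazy_catch_up(value, t, n_skipped):
    """Apply all skipped proximity steps in one shot, when the coordinate is next needed."""
    return soft_threshold(value, t * n_skipped)


# Both strategies agree, but the lazy one costs O(1) per touched coordinate.
print(eager_updates(1.0, 0.25, 3), lazy_catch_up(1.0, 0.25, 3))
print(eager_updates(-0.5, 0.25, 3), lazy_catch_up(-0.5, 0.25, 3))
```

For sparse data this matters because each sample touches only a few coordinates, so the per-iteration cost can depend on the number of nonzeros instead of the full dimension.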
opened by zermelozf 24

Documentation update

Hi @mblondel. Some of the recent additions (such as SAGA) don't show up on the webpage. Would you mind pushing a new version of the docs? (I wouldn't mind doing it myself if they were on GitHub Pages.)

opened by fabianp 18
FIX for SAG with sparse samples.

The problem was that when the solution was updated just in time the different scaling accumulated were not considered. They were treated as if they had been constant in the last iterations.

This should fix issue #33, although because of some Python 3 incompatibility I've not yet run the full test suite.

opened by fabianp 14

raise AttributeError if predict_proba is not available

In scikit-learn, when the predict_proba method is not available, AttributeError is raised instead of NotImplementedError. In this PR:

classifiers are changed to follow the same convention;

removed predict_log_proba mentions because lightning doesn't provide this method;

added more tests for predict_proba results.
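
The practical benefit of raising AttributeError is that hasattr(clf, "predict_proba") then returns False, so calling code can detect the capability without catching exceptions. One common way to get this behaviour is to expose the method through a property, as in the sketch below. The TinyClassifier class is purely hypothetical and is not how lightning's classifiers are written.

```python
import math


class TinyClassifier:
    """Toy one-feature classifier used only to illustrate the convention."""

    def __init__(self, loss="hinge"):
        self.loss = loss
        self.coef_ = 1.5

    def decision_function(self, X):
        return [self.coef_ * x for x in X]

    @property
    def predict_proba(self):
        # Raising AttributeError here makes hasattr() return False.
        if self.loss != "log":
            raise AttributeError("predict_proba is only available when loss='log'")
        return self._predict_proba

    def _predict_proba(self, X):
        # Logistic link applied to the decision values.
        return [1.0 / (1.0 + math.exp(-s)) for s in self.decision_function(X)]


print(hasattr(TinyClassifier(loss="hinge"), "predict_proba"))
print(hasattr(TinyClassifier(loss="log"), "predict_proba"))
print(TinyClassifier(loss="log").predict_proba([0.0]))
```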
opened by kmike 12

0.1 release

I'd like to do a 0.1 release and upload binary packages to PyPI and conda. TODO:

[x] Make binary conda packages for (at least) windows (appveyor).

[x] Update README with build instructions for binary packages.

[x] Update the website with the latest stable version.

[x] Create maintenance branch 0.1.X

[x] After release, upgrade version number to 0.2.dev0.

What do you think, @mblondel?

opened by fabianp 12

Release `0.6.2`

I believe the recently added Python 3.10 support (3afcb4a9967a0d9e3961acd967705e42a593e448) deserves a new release of the package. In the new release we'll upload wheels for Python 3.10, making users' lives easier.

opened by StrikerRUS 11
Build artifacts at GitHub Actions

Wheels for all platforms and source archive will be automatically uploaded to Releases tab with each tagged commit.

For example please refer to https://github.com/StrikerRUS/lightning/releases/tag/untagged-a19e7c8d925f0295f2b6.

Unfortunately, neither the manylinux2010 nor the manylinux1 containers can be used, due to the following restriction of Node.js: https://github.com/actions/runner/issues/337. But I think manylinux2014 is better than nothing. Moreover, CentOS 6 and CentOS 5, on which those containers are based, have already reached their EOL: https://github.com/pypa/manylinux

opened by StrikerRUS 11
Should the .pxd files be included with the distribution?

I'm working on a package that uses lightning cython code as a dependency via:

from lightning.impl.dataset_fast cimport ColumnDataset.

When installing lightning via conda or pip, generating the cython file fails, but if I distribute the generated cpp files the code runs fine.

Should the .pxd files be distributed with lightning to allow this use case?
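
For reference, setuptools can ship non-Python files that live inside a package through the package_data option. The sketch below shows what such a configuration might look like. The package names are placeholders, not lightning's actual setup script, and a source distribution may additionally need a matching MANIFEST.in entry.

```python
def build_setup_kwargs():
    """Keyword arguments for setuptools.setup() that include Cython declaration files."""
    return {
        "name": "example-package",  # placeholder project name
        "packages": ["example_package", "example_package.impl"],
        # Install every .pxd file found in example_package/impl next to the
        # compiled modules, so that downstream projects can cimport from them.
        "package_data": {"example_package.impl": ["*.pxd"]},
    }


# In a real setup.py this would be used as:
#     from setuptools import setup
#     setup(**build_setup_kwargs())
print(build_setup_kwargs()["package_data"])
```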

opened by vene 10

[HOTFIX] fix compatibility with new scikit-learn version

This PR allows lightning to be used with the latest version (0.23.0) of scikit-learn. Right now, if you upgrade scikit-learn, lightning fails with an error saying that it can import neither joblib nor six, because they no longer exist in sklearn.externals:

from lightning.classification import KernelSVC
../../../virtualenv/python3.6.7/lib/python3.6/site-packages/lightning/classification.py:1: in <module>
    from .impl.adagrad import AdaGradClassifier
../../../virtualenv/python3.6.7/lib/python3.6/site-packages/lightning/impl/adagrad.py:8: in <module>
    from sklearn.externals.six.moves import xrange

six was dropped along with Python 2 support. https://scikit-learn.org/stable/whats_new/v0.21.html#sklearn-externals

joblib is now a dependency: https://scikit-learn.org/stable/whats_new/v0.21.html#miscellaneous

This PR should be treated as a hotfix, and ideally lightning should drop the support of Python 2 with six dependency.
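
A compatibility shim for this kind of breakage usually tries the new import location first and falls back to the old one. The following is a generic sketch of that pattern, not the code from the PR; import_first is a hypothetical helper.

```python
import importlib


def import_first(*module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of %s could be imported" % ", ".join(module_names))


# On Python 3 the built-in range already behaves like six.moves.xrange.
xrange = range

# Intended use (needs joblib or an old scikit-learn to be installed):
#     joblib = import_first("joblib", "sklearn.externals.joblib")

# Demonstration with standard-library modules only.
module = import_first("module_that_does_not_exist_xyz", "json")
print(module.__name__)
```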
opened by StrikerRUS 9

Why not initialize SAG/SAGA memory with 0 and divide by seen indices so far as in sklearn?

Why don't you initialize the gradient memory with 0 and use the number of indices seen so far in the SAG algorithm, as suggested in the paper?

In the update of x in Algorithm 1, we normalize the direction d by the total number of data points n. When initializing with y_i = 0 we believe this leads to steps that are too small on early iterations of the algorithm where we have only seen a fraction of the data points, because many y_i variables contributing to d are set to the uninformative zero-vector. Following Blatt et al. [2007], the more logical normalization is to divide d by m, the number of data points that we have seen at least once

SAGA paper suggests a similar procedure

Our algorithm assumes that initial gradients are known for each f_i at the starting point x0. Instead, a heuristic may be used where during the first pass, data-points are introduced one-by-one, in a non-randomized order, with averages computed in terms of those data-points processed so far. This procedure has been successfully used with SAG [1].
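
The heuristic described in the two quoted passages can be sketched on a toy scalar least-squares problem. This is a minimal illustration of zero-initialized memory with normalization by the number of samples seen so far (m), not the implementation in lightning or scikit-learn. The function name, the data, and the conservative step size 1 / (16 L) are all choices made for this example.

```python
import random


def sag_least_squares(a, b, n_steps=3000, seed=0):
    """SAG on f(w) = (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 for a scalar w."""
    n = len(a)
    rng = random.Random(seed)
    step = 1.0 / (16.0 * max(ai * ai for ai in a))  # conservative 1 / (16 L)
    w = 0.0
    memory = [0.0] * n   # stored gradients y_i, initialized to zero
    seen = [False] * n
    n_seen = 0           # m: number of samples visited at least once
    grad_sum = 0.0       # d: running sum of the stored gradients
    for _ in range(n_steps):
        i = rng.randrange(n)
        if not seen[i]:
            seen[i] = True
            n_seen += 1
        g = a[i] * (a[i] * w - b[i])
        grad_sum += g - memory[i]  # swap the old stored gradient for the new one
        memory[i] = g
        w -= step * grad_sum / n_seen  # divide by m instead of n
    return w


a = [1.0, 1.2, 0.8, 1.1, 0.9]
b = [2.1, 2.3, 1.7, 2.2, 1.7]
w_hat = sag_least_squares(a, b)
w_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
print(w_hat, w_star)
```

Once every sample has been visited, m equals n and the update coincides with the standard SAG step, so the heuristic only changes the early iterations.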

opened by NikZak 0

DOC: sometimes the Lasso solution is the same as sklearn, sometimes not

Hi @mblondel @fabianp, I think this will be short to answer: why is the solution sometimes equal to that of sklearn, and sometimes not?

This should be quick to reproduce, look at 1st and 3rd result over 5 seeds:

import numpy as np
from numpy.linalg import norm
from lightning.regression import CDRegressor
from sklearn.linear_model import Lasso

np.random.seed(0)
X = np.random.randn(200, 500)
beta = np.ones(X.shape[1])
beta[20:] = 0
y = X @ beta + 0.3 * np.random.randn(X.shape[0])
alpha = norm(X.T @ y, ord=np.inf) / 10


def p_obj(X, y, alpha, w):
    return norm(y - X @ w) ** 2 / 2 + alpha * norm(w, ord=1)


for seed in range(5):
    print('-' * 80)
    clf = CDRegressor(C=0.5, alpha=alpha, penalty='l1',
                      tol=1e-30, random_state=seed)
    clf.fit(X, y)

    las = Lasso(fit_intercept=False, alpha=alpha/len(y), tol=1e-10).fit(X, y)
    print(norm(clf.coef_[0] - las.coef_))

    light_o = p_obj(X, y, alpha, clf.coef_[0])
    sklea_o = p_obj(X, y, alpha, las.coef_)

    print(light_o - sklea_o)
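
For readers following the snippet, the reason for passing alpha/len(y) to scikit-learn's Lasso is the way that estimator scales its objective. The derivation below covers only the scikit-learn side and the p_obj function defined above; it makes no claim about how CDRegressor scales its own objective.

```latex
% scikit-learn's Lasso minimizes, for n samples,
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha_{\mathrm{sk}} \lVert w \rVert_1 .
% Multiplying by n does not change the minimizer:
\min_{w} \; \frac{1}{2} \lVert y - Xw \rVert_2^2 + n \, \alpha_{\mathrm{sk}} \lVert w \rVert_1 .
% Choosing \alpha_{\mathrm{sk}} = \alpha / n therefore gives the objective
% evaluated by p_obj:
\frac{1}{2} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 .
```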

ping @qb3 @agramfort

opened by mathurinm 5

Do you have regression for sparse categorical big data after one-hot transformation?

Do you have regression for sparse categorical big data after a one-hot transformation?

In that case the data is sparse and contains only ones and zeros, with many zeros and few ones.

opened by Sandy4321 0

Releases(0.6.2.post0)

Owner

scikit-learn compatible projects

GitHub Repository http://contrib.scikit-learn.org/lightning/

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

imbalanced-learn imbalanced-learn is a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-cla

6.2k Jan 01, 2023

scikit-learn cross validators for iterative stratification of multilabel data

iterative-stratification iterative-stratification is a project that provides scikit-learn compatible cross validators with stratification for multilab

745 Jan 05, 2023

A library of sklearn compatible categorical variable encoders

Categorical Encoding Methods A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques

2.1k Jan 02, 2023

scikit-learn inspired API for CRFsuite

sklearn-crfsuite sklearn-crfsuite is a thin CRFsuite (python-crfsuite) wrapper which provides interface similar to scikit-learn. sklearn_crfsuite.CRF i

418 Jan 09, 2023

Multivariate imputation and matrix completion algorithms implemented in Python

A variety of matrix completion and imputation algorithms implemented in Python 3.6. To install: pip install fancyimpute Do not use conda. We don't sup

1.1k Dec 18, 2022

machine learning with logical rules in Python

skope-rules Skope-rules is a Python machine learning module built on top of scikit-learn and distributed under the 3-Clause BSD license. Skope-rules a

504 Dec 31, 2022

A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2021 Links Doc

4.2k Dec 28, 2022

Large-scale linear classification, regression and ranking in Python

lightning lightning is a library for large-scale linear classification, regression and ranking in Python. Highlights: follows the scikit-learn API con

1.6k Dec 31, 2022

Data Analysis Baseline Library

dabl The data analysis baseline library. "Mr Sanchez, are you a data scientist?" "I dabl, Mr president." Find more information on the website. State o

122 Dec 27, 2022

Fast solver for L1-type problems: Lasso, sparse Logistic regression, Group Lasso, weighted Lasso, Multitask Lasso, etc.

celer Fast algorithm to solve Lasso-like problems with dual extrapolation. Currently, the package handles the following problems: Lasso weighted Lasso

168 Dec 13, 2022

Scikit-learn compatible estimation of general graphical models

skggm : Gaussian graphical models using the scikit-learn API In the last decade, learning networks that encode conditional independence relationships

213 Jan 02, 2023

Topological Data Analysis for Python🐍

Scikit-TDA is a home for Topological Data Analysis Python libraries intended for non-topologists. This project aims to provide a curated library of TD

373 Dec 24, 2022

A Python library for dynamic classifier and ensemble selection

DESlib DESlib is an easy-to-use ensemble learning library focused on the implementation of the state-of-the-art techniques for dynamic classifier and

425 Dec 18, 2022

Extra blocks for scikit-learn pipelines.

scikit-lego We love scikit learn but very often we find ourselves writing custom transformers, metrics and models. The goal of this project is to atte

941 Dec 30, 2022

A scikit-learn based module for multi-label et. al. classification

scikit-multilearn scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Pyth

803 Jan 05, 2023

Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Optimization Algorithm,Immune Algorithm, Artificial Fish Swarm Algorithm, Differential Evolution and TSP(Traveling salesman)

scikit-opt Swarm Intelligence in Python (Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Algorithm, Immune Algorithm,A

3.7k Jan 01, 2023

(AAAI' 20) A Python Toolbox for Machine Learning Model Combination

combo: A Python Toolbox for Machine Learning Model Combination Deployment & Documentation & Stats Build Status & Coverage & Maintainability & License

606 Dec 21, 2022

Releases(0.6.2.post0)

0.6.2.post0(Jan 30, 2022)

0.6.2(Jan 29, 2022)

0.6.1(Jun 16, 2021)
