sklearn.feature_extraction: feature extraction
The sklearn.feature_extraction module deals with feature extraction from raw data. It currently includes methods to extract features from text and images.

User guide: For more information, see the Feature extraction section.

feature_extraction.DictVectorizer(*[, ...])

Transforms lists of feature-value mappings to vectors.


feature_extraction.FeatureHasher

Implements feature hashing, also known as the hashing trick.

From images
The sklearn.feature_extraction.image submodule gathers utilities to extract features from images.


feature_extraction.image.extract_patches_2d

Reshape a 2D image into a collection of patches.

feature_extraction.image.grid_to_graph(n_x, n_y)

Graph of the pixel-to-pixel connections.

feature_extraction.image.img_to_graph(img, *[, ...])

Graph of the pixel-to-pixel gradient connections.


feature_extraction.image.reconstruct_from_patches_2d

Reconstruct the image from all of its patches.

feature_extraction.image.PatchExtractor(*[, ...])

Extracts patches from a collection of images.
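As a rough illustration of the image utilities listed above, the sketch below extracts patches from a small synthetic array, rebuilds the array from them, and builds a pixel connectivity graph. The 4x4 input and the 2x2 patch size are assumptions chosen to keep the output small:

```python
import numpy as np
from sklearn.feature_extraction import image

# A synthetic 4x4 grayscale "image" (illustrative data only).
img = np.arange(16, dtype=float).reshape(4, 4)

# Extract every overlapping 2x2 patch: (4 - 2 + 1) ** 2 = 9 patches.
patches = image.extract_patches_2d(img, (2, 2))
print(patches.shape)

# Averaging the overlapping patches recovers the original image.
rebuilt = image.reconstruct_from_patches_2d(patches, (4, 4))
print(np.allclose(img, rebuilt))

# grid_to_graph returns a sparse adjacency matrix over the 16 pixels.
graph = image.grid_to_graph(4, 4)
print(graph.shape)
```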

From text
The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text documents.

feature_extraction.text.CountVectorizer(*[, ...])

Convert a collection of text documents to a matrix of token counts.


feature_extraction.text.HashingVectorizer

Convert a collection of text documents to a matrix of token occurrences.


feature_extraction.text.TfidfTransformer

Transform a count matrix to a normalized tf or tf-idf representation.

feature_extraction.text.TfidfVectorizer(*[, ...])

Convert a collection of raw documents to a matrix of TF-IDF features.
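Of the text utilities listed above, HashingVectorizer is the one that keeps no vocabulary. A minimal sketch, assuming a two-document corpus and a deliberately tiny n_features value (both are illustrative choices; real use would normally keep a much larger feature space to limit hash collisions):

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
]

# n_features=2**4 is an assumption for readability; collisions are likely
# at this size.
vectorizer = HashingVectorizer(n_features=2**4)

# The vectorizer holds no fitted vocabulary, so fit_transform only hashes
# the tokens of each document into the fixed number of columns.
X = vectorizer.fit_transform(corpus)
print(X.shape)
```

Because nothing is learned from the data, the output width depends only on n_features, not on the number of distinct tokens in the corpus.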

sklearn.feature_selection: feature selection
The sklearn.feature_selection module implements feature selection algorithms. It currently includes univariate filter selection methods and the recursive feature elimination algorithm.

User guide: For more information, see the Feature selection section.


feature_selection.GenericUnivariateSelect

Univariate feature selector with configurable strategy.


feature_selection.SelectPercentile

Select features according to a percentile of the highest scores.

feature_selection.SelectKBest([score_func, k])

Select features according to the k highest scores.

feature_selection.SelectFpr([score_func, alpha])

Filter: Select the p-values below alpha based on a FPR test.

feature_selection.SelectFdr([score_func, alpha])

Filter: Select the p-values for an estimated false discovery rate.

feature_selection.SelectFromModel(estimator, *[, ...])

Meta-transformer for selecting features based on importance weights.

feature_selection.SelectFwe([score_func, alpha])

Filter: Select the p-values corresponding to the family-wise error rate.


feature_selection.SequentialFeatureSelector

Transformer that performs sequential feature selection.

feature_selection.RFE(estimator, *[, ...])

Feature ranking with recursive feature elimination .

feature_selection.RFECV(estimator, *[, ...])

Recursive feature elimination with cross-validation to select the number of features.

feature_selection.VarianceThreshold([threshold])

Feature selector that removes all low-variance features.
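The selector classes above share the usual fit/transform interface. The sketch below applies three of them to the iris dataset; the dataset, the LogisticRegression estimator, the value k=2, and the 0.2 variance threshold are all assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Univariate filter: keep the 2 features with the highest ANOVA F-values.
X_kbest = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
print(X_kbest.shape)

# Wrapper method: refit the estimator and drop the weakest feature
# repeatedly until 2 features remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)

# Unsupervised filter: drop features whose variance does not exceed the
# threshold. It never looks at y.
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)
print(X_var.shape)
```

SelectKBest scores each feature independently, RFE relies on the coefficients of a fitted model, and VarianceThreshold ignores the target entirely, so the three can reasonably select different columns.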

feature_selection.chi2(X, y)

Compute chi-squared statistics between each non-negative feature and class.

feature_selection.f_classif(X, y)

Compute the ANOVA F-value.

feature_selection.f_regression(X, y, *[, ...])

Univariate linear regression tests returning F-statistics and p-values.

feature_selection.r_regression(X, y, *[, ...])

Compute Pearson's r for each feature and the target.

feature_selection.mutual_info_classif(X, y, *)

Estimate mutual information for a discrete target variable.

feature_selection.mutual_info_regression(X, y, *)

Estimate mutual information for a continuous target variable.
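The scoring functions above can also be called directly, outside of a selector. A minimal sketch on the iris dataset (the dataset and the random_state value are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

# Iris features are non-negative measurements, which chi2 requires.
X, y = load_iris(return_X_y=True)

# chi2 and f_classif each return a pair: (scores, p-values), one per feature.
chi2_scores, chi2_pvalues = chi2(X, y)
f_scores, f_pvalues = f_classif(X, y)

# mutual_info_classif returns scores only; its estimator involves
# randomness, so the seed is fixed for repeatability.
mi_scores = mutual_info_classif(X, y, random_state=0)

for name, scores in [
    ("chi2", chi2_scores),
    ("f_classif", f_scores),
    ("mutual_info_classif", mi_scores),
]:
    print(name, scores.round(2))
```

The scores are on different scales and are only meaningful for ranking features within one scoring function, not for comparison across functions.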

Example:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], ...)
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
>>> vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
>>> X2 = vectorizer2.fit_transform(corpus)
>>> vectorizer2.get_feature_names_out()
array(['and this', 'document is', 'first document', 'is the', 'is this',
       'second document', 'the first', 'the second', 'the third', 'third one',
       'this document', 'this is', 'this the'], ...)
>>> print(X2.toarray())
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]

Example:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], ...)
>>> print(X.shape)
(4, 9)
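TfidfVectorizer is documented as equivalent to CountVectorizer followed by TfidfTransformer. The sketch below checks that on the same corpus as the example above, using default parameters throughout (an assumption; the equivalence only holds when both routes are configured the same way):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# One step: tokenize, count, and weight in a single estimator.
X_direct = TfidfVectorizer().fit_transform(corpus)

# Two steps: build the count matrix, then reweight it.
counts = CountVectorizer().fit_transform(corpus)
X_two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(X_direct.toarray(), X_two_step.toarray())
print(same)

# With the default norm="l2", every row has unit Euclidean length.
row_norms = np.linalg.norm(X_direct.toarray(), axis=1)
print(row_norms)
```

The two-step form is useful when the raw counts are needed separately, for example to reuse one count matrix with several weighting schemes.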

