当前位置:网站首页>Feature of sklearn_ extraction. text. CountVectorizer / TfidVectorizer
Feature of sklearn_ extraction. text. CountVectorizer / TfidVectorizer
2022-07-06 12:00:00 【Want to be a kite】
sklearn.feature_extraction: feature extraction
The sklearn.feature_extraction Module processing extracts features from raw data . It currently includes methods of extracting features from text and images .
User guide : For more information , See the feature extraction section .
feature_extraction.DictVectorizer(*[, ...])
Convert the eigenvalue mapping list to a vector .
feature_extraction.FeatureHasher([...])
Implement feature hashing , That is, the hashing technique .
From image
The sklearn.feature_extraction.image The sub module collects utilities to extract features from images .
feature_extraction.image.extract_patches_2d(...)
take 2D The image is reshaped into a set of patches
feature_extraction.image.grid_to_graph(n_x, n_y)
Pixel to pixel connection diagram .
feature_extraction.image.img_to_graph( picture ,*)
Pixel to pixel gradient connection diagram .
feature_extraction.image.reconstruct_from_patches_2d(...)
Rebuild the image from all its patches .
feature_extraction.image.PatchExtractor(*[, ...])
Extract patches from image sets .
From text
The sklearn.feature_extraction.text The sub module collects utilities to build feature vectors from text documents .
feature_extraction.text.CountVectorizer(*[, ...])
Convert a collection of text documents into a token count matrix .
feature_extraction.text.HashingVectorizer(*)
Convert a collection of text documents into a matrix of mark occurrences .
feature_extraction.text.TfidfTransformer(*)
Convert the counting matrix into a standardized tf or tf-idf Express .
feature_extraction.text.TfidfVectorizer(*[, ...])
Convert the original document collection to TF-IDF Characteristic matrix .
sklearn.feature_selection: feature selection
The sklearn.feature_selection The module implements the feature selection algorithm . It currently includes univariate filter selection method and recursive feature elimination algorithm .
User guide : For more information , Please refer to the function selection section .
feature_selection.GenericUnivariateSelect([...])
Univariate feature selector with configurable policy .
feature_selection.SelectPercentile([...])
Select features according to the percentile of the highest score .
feature_selection.SelectKBest([score_func, k])
according to k Highest score selection feature .
feature_selection.SelectFpr([score_func, alpha])
filter : according to FPR Test selection below alpha Of pvalues.
feature_selection.SelectFdr([score_func, alpha])
filter : Choose for the estimated error detection rate p value .
feature_selection.SelectFromModel( It is estimated that ,*)
Meta converter for selecting features based on importance weight .
feature_selection.SelectFwe([score_func, alpha])
filter : Choice and Family-wise error rate Corresponding p value .
feature_selection.SequentialFeatureSelector(...)
A converter that performs sequential feature selection .
feature_selection.RFE( estimator ,*[,...])
Feature ranking with recursive feature elimination .
feature_selection.RFECV( estimator ,*[,...])
Use cross validation for recursive feature elimination to select the number of features .
feature_selection.VarianceThreshold([ critical point ])
Delete the feature selector for all low variance features .
feature_selection.chi2(X, y)
Calculate chi square statistics between each nonnegative feature and class .
feature_selection.f_classif(X, y)
Calculate ANOVA F value .
feature_selection.f_regression(X, y, *[, ...])
return F Statistics and p Univariate linear regression test of value .
feature_selection.r_regression(X, y, *[, ...])
Calculate for each feature and target Pearson Of r.
feature_selection.mutual_info_classif(X, y, *)
Estimate the mutual information of discrete target variables .
feature_selection.mutual_info_regression(X, y, *)
Estimate the mutual information of continuous target variables .
feature_extraction.text.TfidVectorizer
Example :
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
'this'], ...)
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
>>> vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
>>> X2 = vectorizer2.fit_transform(corpus)
>>> vectorizer2.get_feature_names_out()
array(['and this', 'document is', 'first document', 'is the', 'is this',
'second document', 'the first', 'the second', 'the third', 'third one',
'this document', 'this is', 'this the'], ...)
>>> print(X2.toarray())
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
[0 1 0 1 0 1 0 1 0 0 1 0 0]
[1 0 0 1 0 0 0 0 1 1 0 1 0]
[0 0 1 0 1 0 1 0 0 0 0 0 1]]
sklearn.feature_extraction.text.TfidfVectorizer
Example :
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
'this'], ...)
>>> print(X.shape)
(4, 9)
边栏推荐
- 树莓派 轻触开关 按键使用
- Reading notes of difficult career creation
- Stage 4 MySQL database
- Togglebutton realizes the effect of switching lights
- 列表的使用
- 互聯網協議詳解
- Detailed explanation of 5g working principle (explanation & illustration)
- [yarn] CDP cluster yarn configuration capacity scheduler batch allocation
- OSPF message details - LSA overview
- Raspberry pie tap switch button to use
猜你喜欢
高通&MTK&麒麟 手机平台USB3.0方案对比
Wangeditor rich text reference and table usage
Word排版(小計)
Gallery之图片浏览、组件学习
Unit test - unittest framework
I2C总线时序详解
FTP file upload file implementation, regularly scan folders to upload files in the specified format to the server, C language to realize FTP file upload details and code case implementation
sklearn之feature_extraction.text.CountVectorizer / TfidVectorizer
【ESP32学习-2】esp32地址映射
物联网系统框架学习
随机推荐
SQL time injection
[Presto] Presto parameter configuration optimization
Kaggle竞赛-Two Sigma Connect: Rental Listing Inquiries(XGBoost)
冒泡排序【C语言】
Distribute wxWidgets application
4、安装部署Spark(Spark on Yarn模式)
Come and walk into the JVM
E-commerce data analysis -- User Behavior Analysis
物联网系统框架学习
选择法排序与冒泡法排序【C语言】
arduino UNO R3的寄存器写法(1)-----引脚电平状态变化
共用体(union)详解【C语言】
OPPO VOOC快充电路和协议
Time slice polling scheduling of RT thread threads
STM32 如何定位导致发生 hard fault 的代码段
MP3mini播放模块arduino<DFRobotDFPlayerMini.h>函数详解
使用LinkedHashMap实现一个LRU算法的缓存
Pytorch-温度预测
Wangeditor rich text component - copy available
Kaggle竞赛-Two Sigma Connect: Rental Listing Inquiries