当前位置:网站首页>Feature of sklearn_ extraction. text. CountVectorizer / TfidVectorizer
Feature of sklearn_ extraction. text. CountVectorizer / TfidVectorizer
2022-07-06 12:00:00 【Want to be a kite】
sklearn.feature_extraction: feature extraction
The sklearn.feature_extraction Module processing extracts features from raw data . It currently includes methods of extracting features from text and images .
User guide : For more information , See the feature extraction section .
feature_extraction.DictVectorizer(*[, ...])
Convert the eigenvalue mapping list to a vector .
feature_extraction.FeatureHasher([...])
Implement feature hashing , That is, the hashing technique .
From image
The sklearn.feature_extraction.image The sub module collects utilities to extract features from images .
feature_extraction.image.extract_patches_2d(...)
take 2D The image is reshaped into a set of patches
feature_extraction.image.grid_to_graph(n_x, n_y)
Pixel to pixel connection diagram .
feature_extraction.image.img_to_graph( picture ,*)
Pixel to pixel gradient connection diagram .
feature_extraction.image.reconstruct_from_patches_2d(...)
Rebuild the image from all its patches .
feature_extraction.image.PatchExtractor(*[, ...])
Extract patches from image sets .
From text
The sklearn.feature_extraction.text The sub module collects utilities to build feature vectors from text documents .
feature_extraction.text.CountVectorizer(*[, ...])
Convert a collection of text documents into a token count matrix .
feature_extraction.text.HashingVectorizer(*)
Convert a collection of text documents into a matrix of mark occurrences .
feature_extraction.text.TfidfTransformer(*)
Convert the counting matrix into a standardized tf or tf-idf Express .
feature_extraction.text.TfidfVectorizer(*[, ...])
Convert the original document collection to TF-IDF Characteristic matrix .
sklearn.feature_selection: feature selection
The sklearn.feature_selection The module implements the feature selection algorithm . It currently includes univariate filter selection method and recursive feature elimination algorithm .
User guide : For more information , Please refer to the function selection section .
feature_selection.GenericUnivariateSelect([...])
Univariate feature selector with configurable policy .
feature_selection.SelectPercentile([...])
Select features according to the percentile of the highest score .
feature_selection.SelectKBest([score_func, k])
according to k Highest score selection feature .
feature_selection.SelectFpr([score_func, alpha])
filter : according to FPR Test selection below alpha Of pvalues.
feature_selection.SelectFdr([score_func, alpha])
filter : Choose for the estimated error detection rate p value .
feature_selection.SelectFromModel( It is estimated that ,*)
Meta converter for selecting features based on importance weight .
feature_selection.SelectFwe([score_func, alpha])
filter : Choice and Family-wise error rate Corresponding p value .
feature_selection.SequentialFeatureSelector(...)
A converter that performs sequential feature selection .
feature_selection.RFE( estimator ,*[,...])
Feature ranking with recursive feature elimination .
feature_selection.RFECV( estimator ,*[,...])
Use cross validation for recursive feature elimination to select the number of features .
feature_selection.VarianceThreshold([ critical point ])
Delete the feature selector for all low variance features .
feature_selection.chi2(X, y)
Calculate chi square statistics between each nonnegative feature and class .
feature_selection.f_classif(X, y)
Calculate ANOVA F value .
feature_selection.f_regression(X, y, *[, ...])
return F Statistics and p Univariate linear regression test of value .
feature_selection.r_regression(X, y, *[, ...])
Calculate for each feature and target Pearson Of r.
feature_selection.mutual_info_classif(X, y, *)
Estimate the mutual information of discrete target variables .
feature_selection.mutual_info_regression(X, y, *)
Estimate the mutual information of continuous target variables .
feature_extraction.text.TfidVectorizer
Example :
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
'this'], ...)
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
>>> vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
>>> X2 = vectorizer2.fit_transform(corpus)
>>> vectorizer2.get_feature_names_out()
array(['and this', 'document is', 'first document', 'is the', 'is this',
'second document', 'the first', 'the second', 'the third', 'third one',
'this document', 'this is', 'this the'], ...)
>>> print(X2.toarray())
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
[0 1 0 1 0 1 0 1 0 0 1 0 0]
[1 0 0 1 0 0 0 0 1 1 0 1 0]
[0 0 1 0 1 0 1 0 0 0 0 0 1]]
sklearn.feature_extraction.text.TfidfVectorizer
Example :
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
'this'], ...)
>>> print(X.shape)
(4, 9)
边栏推荐
- Contiki源码+原理+功能+编程+移植+驱动+网络(转)
- 电商数据分析--用户行为分析
- 机器学习--决策树(sklearn)
- FTP文件上传文件实现,定时扫描文件夹上传指定格式文件文件到服务器,C语言实现FTP文件上传详解及代码案例实现
- IOT system framework learning
- MySQL数据库面试题
- ESP8266通过arduino IED连接巴法云(TCP创客云)
- [CDH] cdh5.16 configuring the setting of yarn task centralized allocation does not take effect
- Unit test - unittest framework
- 荣耀Magic 3Pro 充电架构分析
猜你喜欢
机器学习--线性回归(sklearn)
Machine learning -- decision tree (sklearn)
PyTorch四种常用优化器测试
Word typesetting (subtotal)
Kaggle竞赛-Two Sigma Connect: Rental Listing Inquiries(XGBoost)
Kaggle竞赛-Two Sigma Connect: Rental Listing Inquiries
Detailed explanation of Union [C language]
A possible cause and solution of "stuck" main thread of RT thread
Password free login of distributed nodes
Unit test - unittest framework
随机推荐
关键字 inline (内联函数)用法解析【C语言】
XML文件详解:XML是什么、XML配置文件、XML数据文件、XML文件解析教程
C language, log print file name, function name, line number, date and time
Detailed explanation of Union [C language]
OPPO VOOC快充电路和协议
Word排版(小计)
R & D thinking 01 ----- classic of embedded intelligent product development process
A possible cause and solution of "stuck" main thread of RT thread
[mrctf2020] dolls
RT-Thread API参考手册
Nodejs connect mysql
The first simple case of GNN: Cora classification
Vert. x: A simple TCP client and server demo
Linux yum安装MySQL
[Flink] cdh/cdp Flink on Yan log configuration
Contiki source code + principle + function + programming + transplantation + drive + network (turn)
Word typesetting (subtotal)
Priority inversion and deadlock
There are three iPhone se 2022 models in the Eurasian Economic Commission database
Kaggle competition two Sigma connect: rental listing inquiries