Sklearn Feature Engineering (summary)
2022-06-28 05:51:00 【bingbangx】
1. Feature Engineering
Dictionary feature extraction
from sklearn.feature_extraction import DictVectorizer  # feature extraction from dicts
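The post only shows the import, so here is a minimal sketch of how DictVectorizer might be used. The records are made up for illustration: string values are one-hot encoded, numeric values are passed through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical sample records, invented for this example.
records = [
    {"city": "Beijing", "temperature": 30},
    {"city": "Shanghai", "temperature": 25},
    {"city": "Beijing", "temperature": 20},
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
vectorizer = DictVectorizer(sparse=False)
features = vectorizer.fit_transform(records)

# Maps each generated feature name (e.g. "city=Beijing") to its column index.
print(vectorizer.vocabulary_)
print(features)
```

Each distinct string value gets its own 0/1 column, so the categorical field does not impose an artificial ordering on the cities.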

Text feature extraction and jieba word segmentation
Text feature extraction is used in tasks such as document classification, spam filtering, and news classification. The text is represented by whether each word occurs and by how important (how probable) each word is.

To count words in Chinese text, the text must first be segmented into words, for example with jieba.
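TF-IDF text feature extraction
TF-IDF is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to a document: a word scores higher the more often it appears in that document and the fewer other documents contain it.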
from sklearn.feature_extraction.text import TfidfVectorizer
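A small sketch of TfidfVectorizer on three made-up English sentences, using the default settings. It compares two words that each appear once in the first document: "cat" occurs in no other document, while "sat" also occurs in the second.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

vocab = vectorizer.vocabulary_          # word -> column index
row0 = tfidf.toarray()[0]               # weights for the first document

# "cat" is rarer across the corpus than "sat", so it gets the larger weight.
print("cat:", row0[vocab["cat"]])
print("sat:", row0[vocab["sat"]])

# By default each row is scaled to unit Euclidean length.
print(np.linalg.norm(tfidf.toarray(), axis=1))
```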
Feature Engineering - Normalization
Normalization
X = (x - min) / (max - min)
where max and min are the maximum and minimum values of a column, and x is the value before normalization. Each column is therefore rescaled to the range [0, 1].
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data = [
    [180, 75, 35], [175, 80, 17], [159, 50, 46], [149, 79, 45]
]
result = scaler.fit_transform(data)
print(result)
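The normalization formula can be checked by hand. The sketch below applies X = (x - min) / (max - min) column by column with NumPy, on the same data, and compares the result with MinMaxScaler. The variable names are chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array(
    [[180, 75, 35], [175, 80, 17], [159, 50, 46], [149, 79, 45]],
    dtype=float,
)

# Column-wise minimum and maximum (axis=0 reduces over the rows).
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Apply the formula to every entry; NumPy broadcasts the per-column values.
manual = (data - col_min) / (col_max - col_min)

sklearn_result = MinMaxScaler().fit_transform(data)
print(np.allclose(manual, sklearn_result))
```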

Standardization
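Standardization rescales each column to zero mean and unit variance: X = (x - mean) / std, where mean and std are computed per column.
from sklearn.preprocessing import StandardScaler

# data is the same list used in the normalization example above
scaler = StandardScaler()
result = scaler.fit_transform(data)
print(result)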
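As a rough check of what StandardScaler computes, the sketch below standardizes the same data by hand, subtracting each column's mean and dividing by its standard deviation. It uses the population standard deviation (NumPy's default, ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array(
    [[180, 75, 35], [175, 80, 17], [159, 50, 46], [149, 79, 45]],
    dtype=float,
)

# Per-column mean and population standard deviation (ddof=0).
manual = (data - data.mean(axis=0)) / data.std(axis=0)

sklearn_result = StandardScaler().fit_transform(data)
print(np.allclose(manual, sklearn_result))

# After standardization every column has mean 0 and standard deviation 1.
print(manual.mean(axis=0), manual.std(axis=0))
```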

Feature Engineering - Dimensionality Reduction
Principal component analysis
Principal component analysis (PCA) is a statistical method. It applies an orthogonal transformation to turn a set of possibly correlated variables into a set of linearly uncorrelated variables, which are called the principal components.
Two things to remember about the principal components:
The covariance between any two features after dimensionality reduction is 0, meaning the features are linearly uncorrelated: no feature varies linearly with another.
The variance of each feature should be as large as possible.
from sklearn.decomposition import PCA

def pca_decomposition():
    # n_components accepts either:
    #   1. a float between 0 and 1: the fraction of variance to retain
    #   2. an int: the number of dimensions to keep,
    #      which must be at most min(n_samples, n_features)
    pca = PCA(n_components=2)
    result = pca.fit_transform([
        [4, 2, 76, 9],
        [1, 192, 1, 56],
        [34, 5, 20, 90],
    ])
    print(result)

pca_decomposition()
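The two properties of principal components listed above can be checked numerically. This sketch reuses the same 3x4 data, computes the covariance matrix of the reduced features, and prints how much variance each component explains. The variable names are chosen for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array(
    [[4, 2, 76, 9], [1, 192, 1, 56], [34, 5, 20, 90]],
    dtype=float,
)

pca = PCA(n_components=2)
result = pca.fit_transform(data)

# Covariance matrix of the two reduced features (columns are features).
cov = np.cov(result, rowvar=False)

# Off-diagonal entries are numerically zero: the components are uncorrelated.
print(cov)

# Components are ordered by the share of variance they explain, largest first.
print(pca.explained_variance_ratio_)
```

The diagonal of `cov` matches `pca.explained_variance_`, since both use the sample variance of the projected data.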
