Visualizing sentence similarity with t-SNE
2022-06-30 00:42:00 【这个利弗莫尔不太冷】
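Purpose of t-SNE: reduce high-dimensional data to a low-dimensional space so that it can be visualized.

t-SNE maps every data point to a probability distribution. In the high-dimensional space, distances are converted to probabilities with a Gaussian distribution; in the low-dimensional space, they are converted with a heavy-tailed distribution (Student's t). As a result, low-to-moderate distances in the high-dimensional space end up as comparatively larger distances after the mapping, which keeps the reduction from focusing too much on local features while ignoring global ones.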
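The effect of the two kernels can be illustrated numerically. The snippet below is only a sketch and is not part of the original post: it compares an unnormalized Gaussian similarity (with `sigma = 1`, an assumed value) against the Student-t kernel with one degree of freedom, and solves for the low-dimensional distance that yields the same similarity. Real t-SNE additionally normalizes both sides into probabilities and picks a per-point `sigma` from the perplexity, so the numbers here show the trend rather than what the algorithm actually computes. The function names are hypothetical.

```python
import numpy as np


def gaussian_similarity(d, sigma=1.0):
    """Unnormalized Gaussian similarity used on the high-dimensional side."""
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))


def student_t_similarity(d):
    """Unnormalized Student-t (1 degree of freedom) similarity, low-dimensional side."""
    return 1.0 / (1.0 + d ** 2)


def matching_low_dim_distance(d_high, sigma=1.0):
    """Low-dim distance whose Student-t similarity equals the Gaussian one at d_high."""
    p = gaussian_similarity(d_high, sigma)
    # Solve 1 / (1 + d_low^2) = p for d_low.
    return np.sqrt(1.0 / p - 1.0)


if __name__ == "__main__":
    for d_high in (0.5, 1.0, 2.0, 3.0):
        d_low = matching_low_dim_distance(d_high)
        print(f"d_high = {d_high:.1f} -> d_low = {d_low:.3f}")
```

Under these assumptions, very small distances stay small, while moderate distances (around 2 to 3 in units of `sigma`) are pushed noticeably further apart in the low-dimensional space, which is the behavior described above.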
import logging

import matplotlib.pyplot as plt
import numpy as np
from gensim.models.doc2vec import TaggedDocument
# nltk.download('punkt')  # run once so that word_tokenize has its tokenizer data
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN, KMeans
from sklearn.manifold import TSNE

from cope_dataset import get_unrepeat_txt  # the author's local helper module

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def tsne_plot(tokens, labels):
    """Project the embeddings to 2D with t-SNE and save an annotated scatter plot."""
    # perplexity must be smaller than the number of samples.
    # Newer scikit-learn releases call n_iter max_iter instead.
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)
    x, y = new_values[:, 0], new_values[:, 1]
    plt.figure(figsize=(32, 32))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    # Save before show(); once the window is closed the figure is empty.
    plt.savefig('embedding_map.png')
    plt.show()


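def kmeans_vis(sentence_embeddings):
    """Cluster the embeddings with KMeans and plot their first two dimensions."""
    plt.figure(figsize=(32, 32))
    # n_clusters=1000 needs at least 1000 sentences.
    clf = KMeans(n_clusters=1000)
    # Fit once; clf then holds labels_ and cluster_centers_.
    y_pred = clf.fit_predict(sentence_embeddings)
    # Only the first two embedding dimensions are drawn, not a 2D projection.
    plt.scatter(sentence_embeddings[:, 0], sentence_embeddings[:, 1], c=y_pred)
    plt.title("KMeans clusters of sentence embeddings")
    plt.savefig('kmeans.jpg')  # save before show(), otherwise the image is blank
    plt.show()
    # Cluster label of every sentence vector
    labels = clf.labels_
    print(clf.cluster_centers_)

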
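def debcans_vis(sentence_embeddings):
    """Cluster the embeddings with DBSCAN and plot their first two dimensions."""
    plt.figure(figsize=(16, 16))
    # Points labeled -1 are noise; eps and min_samples usually need tuning
    # for high-dimensional sentence embeddings.
    y_pred = DBSCAN(eps=0.5, min_samples=100).fit_predict(sentence_embeddings)
    plt.scatter(sentence_embeddings[:, 0], sentence_embeddings[:, 1], c=y_pred)
    plt.savefig('debacns.jpg')  # save before show(), otherwise the image is blank
    plt.show()

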
def train(str_en_text):
    """Embed every sentence with SBERT and visualize the result."""
    tokenized_sent = []
    for s in str_en_text:
        tokenized_sent.append(word_tokenize(s.lower()))
    # Tagged documents in Doc2Vec format; not used by the SBERT path below.
    tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]
    sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')
    sentence_embeddings = sbert_model.encode(str_en_text)  # one vector per sentence
    # kmeans_vis(sentence_embeddings)
    debcans_vis(sentence_embeddings)
    # tsne_plot(sentence_embeddings[:88], str_en_text[:88])
    # Compare the similarity between sentences
    # query = "Will the cat tower topple over easily?My cat is about 7-8 pounds."
    # query_vec = sbert_model.encode([query])[0]
    # for sent in str_en_text:
    #     sim = cosine(query_vec, sbert_model.encode([sent])[0])
    #     # print("Sentence = ", sent, "; similarity = ", sim)
    #     if float(sim) > 0.8:
    #         break


if __name__ == '__main__':
    json_path = '/cloud/cloud_disk/users/huh/dataset/nlp_dataset/question_dataset/ori/en_ch_cattree_personality.json'
    str_en_text = get_unrepeat_txt()
    train(str_en_text)
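
The commented-out query loop in `train` encodes every sentence again for each comparison. Since `sentence_embeddings` already holds one vector per sentence, the same comparison can be done in a single vectorized step. The following is a minimal sketch under that assumption; `cosine_similarities` and `most_similar` are hypothetical helper names, and the toy 2D vectors stand in for real SBERT embeddings.

```python
import numpy as np


def cosine_similarities(query_vec, embeddings):
    """Cosine similarity between one query vector and every row of embeddings."""
    query_vec = np.asarray(query_vec, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    # Assumes no vector has zero norm; otherwise this divides by zero.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    return embeddings @ query_vec / norms


def most_similar(query_vec, embeddings, sentences, threshold=0.8):
    """Return (sentence, similarity) pairs above threshold, most similar first."""
    sims = cosine_similarities(query_vec, embeddings)
    order = np.argsort(-sims)
    return [(sentences[i], float(sims[i])) for i in order if sims[i] > threshold]


if __name__ == "__main__":
    # Toy vectors in place of SBERT embeddings.
    toy_sentences = ["a", "b", "c"]
    toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    toy_query = np.array([1.0, 0.1])
    for sentence, sim in most_similar(toy_query, toy_embeddings, toy_sentences):
        print(f"Sentence = {sentence}; similarity = {sim:.3f}")
```

With real data, `query_vec` would come from `sbert_model.encode([query])[0]` and `embeddings` from the `sentence_embeddings` array computed in `train`; the 0.8 threshold is the one used in the commented-out loop.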