Visualizing Sentence Similarity with t-SNE
2022-06-30 00:42:00 【这个利弗莫尔不太冷】
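Purpose of t-SNE: reduce high-dimensional data to a low-dimensional space so that it can be visualized.

t-SNE maps each data point to a probability distribution. In the high-dimensional space it uses a Gaussian distribution to convert distances into probabilities; in the low-dimensional space it uses a heavy-tailed distribution (Student's t) instead. As a result, points at small-to-moderate distances in the high-dimensional space end up farther apart after the mapping. This keeps moderately distant points from being crowded together, so the reduction does not focus too heavily on local features at the expense of the global picture.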
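The effect of the two distributions can be seen in a small, simplified sketch. It is not the full algorithm: real t-SNE picks a per-point `sigma` from the perplexity, symmetrizes the high-dimensional probabilities, and normalizes the low-dimensional ones over all pairs. The function names, `sigma=1.0`, and the three distances below are assumptions chosen only for illustration.

```python
import math


def gaussian_affinities(dists, sigma=1.0):
    """High-dimensional side: Gaussian kernel on distances, normalized to sum to 1."""
    weights = [math.exp(-d ** 2 / (2 * sigma ** 2)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def student_t_affinities(dists):
    """Low-dimensional side: Student-t kernel (one degree of freedom), normalized."""
    weights = [1.0 / (1.0 + d ** 2) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical distances from one point to three neighbours: near, moderate, far.
dists = [1.0, 3.0, 6.0]
p = gaussian_affinities(dists)
q = student_t_affinities(dists)

for d, p_i, q_i in zip(dists, p, q):
    print(f"distance={d:.1f}  gaussian={p_i:.4f}  student_t={q_i:.4f}")
```

At the same distance, the heavy-tailed kernel assigns more probability to the moderate and far neighbours than the Gaussian does. To match a small Gaussian probability, the low-dimensional point therefore has to sit farther away, which is the stretching effect described above.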
import logging

import matplotlib.pyplot as plt
import nltk
import numpy as np
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN, KMeans
from sklearn.manifold import TSNE

from cope_dataset import get_unrepeat_txt  # local project module, not shown in this post

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# nltk.download('punkt')  # run once if the tokenizer data is missing

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def tsne_plot(tokens, labels):
    """Project the vectors to 2-D with t-SNE and draw a labelled scatter plot."""
    # perplexity must be smaller than the number of samples.
    # Newer scikit-learn versions call the n_iter parameter max_iter.
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)
    x, y = new_values[:, 0], new_values[:, 1]
    plt.figure(figsize=(32, 32))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    # Save before show(): with an interactive backend the saved figure may be empty afterwards.
    plt.savefig('embedding_map.png')
    plt.show()

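def kmeans_vis(sentence_embeddings):
    """Cluster the sentence embeddings with KMeans and plot the clusters."""
    # n_clusters must not exceed the number of sentences.
    clf = KMeans(n_clusters=1000)
    # Fit once; y_pred holds the cluster label of every sentence vector.
    y_pred = clf.fit_predict(sentence_embeddings)
    plt.figure(figsize=(32, 32))
    # Only the first two embedding dimensions are drawn, not a 2-D projection.
    plt.scatter(sentence_embeddings[:, 0], sentence_embeddings[:, 1], c=y_pred)
    plt.title("KMeans clusters of sentence embeddings")
    plt.savefig('kmeans.jpg')  # save before show()
    plt.show()
    print(clf.cluster_centers_)
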
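def debcans_vis(sentence_embeddings):
    """Cluster the sentence embeddings with DBSCAN and plot the clusters."""
    # eps and min_samples are the values used in this post; they may need tuning.
    # DBSCAN labels noise points as -1.
    y_pred = DBSCAN(eps=0.5, min_samples=100).fit_predict(sentence_embeddings)
    plt.figure(figsize=(16, 16))
    # Only the first two embedding dimensions are drawn, not a 2-D projection.
    plt.scatter(sentence_embeddings[:, 0], sentence_embeddings[:, 1], c=y_pred)
    plt.savefig('debacns.jpg')  # save before show()
    plt.show()
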
def train(str_en_text):
    # Tokenize and tag the sentences (Doc2Vec-style input; not used by the SBERT path below).
    tokenized_sent = [word_tokenize(s.lower()) for s in str_en_text]
    tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]
    sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')
    sentence_embeddings = sbert_model.encode(str_en_text)  # one vector per sentence
    # kmeans_vis(sentence_embeddings)
    debcans_vis(sentence_embeddings)
    # tsne_plot(sentence_embeddings[:88], str_en_text[:88])
    # Compare the similarity between sentences
    # query = "Will the cat tower topple over easily?My cat is about 7-8 pounds."
    # query_vec = sbert_model.encode([query])[0]
    # for sent in str_en_text:
    #     sim = cosine(query_vec, sbert_model.encode([sent])[0])
    #     # print("Sentence = ", sent, "; similarity = ", sim)
    #     if float(sim) > 0.8:
    #         break

if __name__ == '__main__':
    # Path to the source dataset; get_unrepeat_txt() is called without it here.
    json_path = '/cloud/cloud_disk/users/huh/dataset/nlp_dataset/question_dataset/ori/en_ch_cattree_personality.json'
    str_en_text = get_unrepeat_txt()
    train(str_en_text)
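
The commented-out similarity comparison in `train` re-encodes every sentence inside the loop, although `sentence_embeddings` already holds those vectors. A minimal sketch of the same idea that reuses precomputed vectors is shown here. It is written in plain Python so it runs without any model; `most_similar` and the toy 3-dimensional vectors are hypothetical stand-ins for the SBERT embeddings, and the 0.8 threshold is the one from the commented-out code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query_vec, sentences, embeddings, threshold=0.8):
    """Return (sentence, similarity) pairs above the threshold, best match first."""
    scored = [(s, cosine(query_vec, e)) for s, e in zip(sentences, embeddings)]
    hits = [(s, sim) for s, sim in scored if sim > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings (assumption for illustration).
sentences = ["sentence A", "sentence B", "sentence C"]
embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query_vec = [1.0, 0.05, 0.0]

for sentence, sim in most_similar(query_vec, sentences, embeddings):
    print(f"{sentence}: {sim:.3f}")
```

With real data, `query_vec` would come from `sbert_model.encode([query])[0]` and `embeddings` from the `sentence_embeddings` array computed in `train`, so each sentence is encoded only once.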