Using t-SNE to visualize the similarity of different sentences
2022-06-30 00:49:00 【This Livermore isn't too cold】
Purpose of t-SNE: dimensionality reduction and visualization of high-dimensional data.

t-SNE maps the data points to probability distributions. In the high-dimensional space, a Gaussian distribution converts the distances between points into probabilities; in the low-dimensional space, a long-tailed distribution (Student's t) does the same job. Because the long tail decays more slowly than the Gaussian, small-to-moderate distances in the high-dimensional space can end up as larger distances after mapping. This keeps the dimensionality reduction from paying too much attention to local features while ignoring global features.
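The effect of the long-tailed distribution can be illustrated numerically. The sketch below is a simplification and not part of the original script: it compares the unnormalized Gaussian kernel with the unnormalized Student-t kernel and ignores the normalization that t-SNE applies over all pairs. The function names and `sigma=1.0` are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kernel(d, sigma=1.0):
    """Unnormalized high-dimensional similarity: exp(-d^2 / (2 * sigma^2))."""
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))


def student_t_kernel(d):
    """Unnormalized low-dimensional similarity: 1 / (1 + d^2)."""
    return 1.0 / (1.0 + d ** 2)


def matching_low_dim_distance(d_high, sigma=1.0):
    """Low-dimensional distance whose Student-t similarity equals
    the Gaussian similarity at distance d_high."""
    s = gaussian_kernel(d_high, sigma)
    # Solve 1 / (1 + d_low^2) = s for d_low.
    return np.sqrt(1.0 / s - 1.0)


# Print each high-dimensional distance next to its matching low-dimensional one.
for d in (0.5, 1.0, 2.0, 3.0):
    print(d, matching_low_dim_distance(d))
```

With these toy kernels, very small distances shrink slightly while moderate and large distances are stretched, which is the behaviour the paragraph above describes.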
import logging

import matplotlib.pyplot as plt
import nltk
import numpy as np
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN, KMeans
from sklearn.manifold import TSNE

# Local helper module that returns the de-duplicated sentences
from cope_dataset import get_unrepeat_txt

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# nltk.download('punkt')  # run once; word_tokenize needs the punkt data
def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
def tsne_plot(tokens, labels):
    """Project the vectors to 2-D with t-SNE and draw a labelled scatter plot."""
    # perplexity must be smaller than the number of samples.
    # Newer scikit-learn releases rename n_iter to max_iter.
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)
    x = new_values[:, 0]
    y = new_values[:, 1]
    plt.figure(figsize=(32, 32))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    # Save before show(): calling savefig after show() typically writes a blank image.
    plt.savefig('embedding_map.png')
    plt.show()
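def kmeans_vis(sentence_embeddings):
    """Cluster the sentence embeddings with KMeans and plot the first two dimensions."""
    plt.figure(figsize=(32, 32))
    # n_clusters must not exceed the number of sentences.
    clf = KMeans(n_clusters=1000)
    # Fit once and reuse the model instead of fitting two separate KMeans objects.
    y_pred = clf.fit_predict(sentence_embeddings)
    plt.scatter(sentence_embeddings[:, 0], sentence_embeddings[:, 1], c=y_pred)
    plt.title("KMeans clusters of sentence embeddings")
    # Save before show(): calling savefig after show() typically writes a blank image.
    plt.savefig('kmeans.jpg')
    plt.show()
    # Get the cluster label of every sentence vector
    labels = clf.labels_
    print(clf.cluster_centers_)
    return labels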
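def debcans_vis(sentence_embeddings):
    """Cluster the sentence embeddings with DBSCAN and plot the first two dimensions."""
    plt.figure(figsize=(16, 16))
    # eps and min_samples depend on the scale of the embeddings and may need tuning.
    y_pred = DBSCAN(eps=0.5, min_samples=100).fit_predict(sentence_embeddings)
    plt.scatter(sentence_embeddings[:, 0], sentence_embeddings[:, 1], c=y_pred)
    # Save before show(): calling savefig after show() typically writes a blank image.
    plt.savefig('dbscan.jpg')
    plt.show()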
def train(str_en_text):
    # Tokenized / tagged form of the sentences (Doc2Vec-style input);
    # the SBERT model below encodes the raw strings directly.
    tokenized_sent = []
    for s in str_en_text:
        tokenized_sent.append(word_tokenize(s.lower()))
    tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]
    sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')
    sentence_embeddings = sbert_model.encode(str_en_text)  # one vector per sentence
    # kmeans_vis(sentence_embeddings)
    debcans_vis(sentence_embeddings)
    # tsne_plot(sentence_embeddings[:88], str_en_text[:88])
    # Compare the similarity between sentences
    # query = "Will the cat tower topple over easily?My cat is about 7-8 pounds."
    # query_vec = sbert_model.encode([query])[0]
    # for sent in str_en_text:
    #     sim = cosine(query_vec, sbert_model.encode([sent])[0])
    #     # print("Sentence = ", sent, "; similarity = ", sim)
    #     if float(sim) > 0.8:
    #         break
if __name__ == '__main__':
    json_path = '/cloud/cloud_disk/users/huh/dataset/nlp_dataset/question_dataset/ori/en_ch_cattree_personality.json'
    str_en_text = get_unrepeat_txt()
    train(str_en_text)
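The commented-out similarity check in train() encodes every sentence again inside the loop. A possible alternative, sketched here under assumptions, is to reuse the embeddings that were already computed and compare the query against all of them at once. The helper names `cosine_to_all` and `similar_sentences` are hypothetical, and the toy vectors only stand in for real `sbert_model.encode(...)` output.

```python
import numpy as np


def cosine_to_all(query_vec, embeddings):
    """Cosine similarity between one query vector and every row of embeddings.

    Assumes no vector has zero norm.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    return embeddings @ query_vec / norms


def similar_sentences(query_vec, embeddings, sentences, threshold=0.8):
    """Return (sentence, similarity) pairs above the threshold, best match first."""
    sims = cosine_to_all(query_vec, embeddings)
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [(sentences[i], float(sims[i])) for i in order if sims[i] > threshold]


# Toy 2-D vectors used purely for illustration.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_sentences = ["a", "b", "c"]
hits = similar_sentences(np.array([1.0, 0.0]), toy_embeddings, toy_sentences)
print(hits)
```

In the script above, `embeddings` would be `sentence_embeddings` and `query_vec` would be `sbert_model.encode([query])[0]`; the 0.8 threshold is taken from the commented-out code.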