Using tsne to visualize the similarity of different sentences
2022-06-30 00:49:00 【This Livermore isn't too cold】
Purpose of t-SNE: dimensionality reduction and visualization of high-dimensional data.

t-SNE maps each data point to a probability distribution. In the high-dimensional space it uses a Gaussian distribution to convert pairwise distances into probabilities; in the low-dimensional space it uses a long-tailed distribution (Student's t) instead. Because of the heavy tail, small and moderate distances in the high-dimensional space map to larger distances after embedding, which relieves crowding and lets the method emphasize local structure without entirely discarding global structure.
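As a minimal, self-contained sketch of this mapping (random vectors standing in for sentence embeddings; the shapes and parameters here are illustrative, not from the original script), t-SNE takes an (n, d) matrix and returns an (n, 2) one:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
points = rng.rand(50, 128)  # 50 toy "sentence embeddings" in 128-D

# perplexity must be smaller than the number of samples
embedded = TSNE(n_components=2, perplexity=10, init='pca',
                random_state=0).fit_transform(points)
print(embedded.shape)  # (50, 2)
```

The two output columns are the x/y coordinates that the plotting code below scatters and annotates.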
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
# import nltk
# nltk.download('punkt')  # needed once before word_tokenize will work
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import TaggedDocument
from sentence_transformers import SentenceTransformer
from cope_dataset import get_unrepeat_txt


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
def tsne_plot(tokens, labels):
    """Reduce the embeddings to 2-D with t-SNE and plot each point with its label."""
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = [value[0] for value in new_values]
    y = [value[1] for value in new_values]

    plt.figure(figsize=(32, 32))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig('embedding_map.png')  # save before show(): show() clears the figure
    plt.show()
def kmeans_vis(sentence_embeddings):
    """Cluster the embeddings with k-means and plot the first two dimensions."""
    plt.figure(figsize=(32, 32))
    clf = KMeans(n_clusters=1000)
    y_pred = clf.fit_predict(sentence_embeddings)  # fit once and reuse the fitted model
    plt.scatter(sentence_embeddings[:, 0], sentence_embeddings[:, 1], c=y_pred)
    plt.title("Anisotropically Distributed Blobs")
    plt.savefig('kmeans.jpg')
    plt.show()
    # Cluster assignment for every sentence vector
    labels = clf.labels_
    print(clf.cluster_centers_)
def dbscan_vis(sentence_embeddings):
    """Cluster the embeddings with DBSCAN and plot the first two dimensions."""
    plt.figure(figsize=(16, 16))
    y_pred = DBSCAN(eps=0.5, min_samples=100).fit_predict(sentence_embeddings)
    plt.scatter(sentence_embeddings[:, 0], sentence_embeddings[:, 1], c=y_pred)
    plt.savefig('dbscan.jpg')
    plt.show()


def train(str_en_text):
    tokenized_sent = [word_tokenize(s.lower()) for s in str_en_text]
    tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]

    sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')
    sentence_embeddings = sbert_model.encode(str_en_text)  # one vector per sentence

    # kmeans_vis(sentence_embeddings)
    dbscan_vis(sentence_embeddings)
    # tsne_plot(sentence_embeddings[:88], str_en_text[:88])

    # Compare the similarity between a query and every sentence
    # query = "Will the cat tower topple over easily?My cat is about 7-8 pounds."
    # query_vec = sbert_model.encode([query])[0]
    # for sent in str_en_text:
    #     sim = cosine(query_vec, sbert_model.encode([sent])[0])
    #     # print("Sentence = ", sent, "; similarity = ", sim)
    #     if float(sim) > 0.8:
    #         break
if __name__ == '__main__':
    json_path = '/cloud/cloud_disk/users/huh/dataset/nlp_dataset/question_dataset/ori/en_ch_cattree_personality.json'
    str_en_text = get_unrepeat_txt()
    train(str_en_text)
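The commented-out query block inside `train` relies on the `cosine` helper defined earlier. As a standalone sketch of what it computes (toy vectors instead of SBERT embeddings, so no model download is needed):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product normalized by the vector magnitudes.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

query_vec = np.array([1.0, 2.0, 3.0])
candidates = [np.array([1.0, 2.0, 3.0]),    # same direction
              np.array([-1.0, -2.0, -3.0]), # opposite direction
              np.array([3.0, -1.5, 0.0])]   # orthogonal

sims = [cosine(query_vec, c) for c in candidates]
print([round(s, 2) for s in sims])  # [1.0, -1.0, 0.0]
```

With real SBERT vectors, values near 1.0 mean near-duplicate sentences, which is why the script breaks out of the loop once the similarity exceeds 0.8.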