我已经写了一段代码,它将根据silhouette_score
的最大值给出最佳的集群数量。现在我想找出每个集群有多少个值。例如,我的结果是最佳聚类数是3,我想找出每个聚类有多少值,例如第一个聚类有1241个值,第二个是3134个值,第三个是351个值。有没有可能做这样的事情?
import pandas as pd
import matplotlib.pyplot as plt
import re
from sklearn.preprocessing import scale
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.cluster import KMeans, MiniBatchKMeans, AffinityPropagation
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics.cluster import adjusted_mutual_info_score
from sklearn.decomposition import PCA
df = pd.read_csv('CNN Comments.csv')
df = df.head(8000)
#print(df)
x = df['Text Data']
cv = TfidfVectorizer(analyzer = 'word',max_features = 10000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
#cv = CountVectorizer(analyzer = 'word', max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
my_list = []
list_of_clusters = []
for i in range(2,5):
kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
kmeans.fit(x)
my_list.append(kmeans.inertia_)
cluster_labels = kmeans.fit_predict(x)
silhouette_avg = silhouette_score(x, cluster_labels) * 100
print(round(silhouette_avg,2))
list_of_clusters.append(round(silhouette_avg, 1))
plt.plot(range(2,5),my_list)
plt.show()
number_of_clusters = max(list_of_clusters)
number_of_clusters = list_of_clusters.index(number_of_clusters)+2
print('Number of clusters: ', number_of_clusters)
发布于 2019-09-12 07:58:32
您可以使用分配给cluster_labels
的数组来获取集群分配的分布。我推荐使用集合模块中的Counter
。
from collections import Counter
...
cluster_labels = kmeans.fit_predict(x)
cluster_counts = Counter(cluster_labels)
发布于 2019-09-13 06:12:31
numpy的alternativ:
import numpy as np
...
unique, counts = np.unique(kmeans.fit_predict(x), return_counts=True)
print(dict(zip(unique, counts)))
https://stackoverflow.com/questions/57901993
复制相似问题