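Today I used a graph neural network and a fairly new library, StellarGraph, to build a classification model on top of TensorFlow.
First, here is a description of the Cora dataset found through Google: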
The Cora dataset consists of 2708 scientific publications classified into one of seven classes.
The citation network consists of 5429 links.
Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary.
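Put another way:
====
The Cora dataset consists of 2708 papers and the 5429 edges formed by the citation links between them. The papers are divided by topic into 7 classes: Neural Networks, Reinforcement Learning, Rule Learning, Probabilistic Methods, Genetic Algorithms, Theory, and Case Based. Each paper's features come from a bag-of-words model with 1433 dimensions; each dimension corresponds to one word, where 1 means the word appears in the paper and 0 means it does not.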
====
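To make the feature description concrete, here is a minimal sketch of a 0/1 bag-of-words vector. The three-word vocabulary and the sample text are made up for illustration; the real Cora dictionary has 1433 words, and the helper name `binary_bag_of_words` is hypothetical.

```python
# Hypothetical toy vocabulary; Cora's real dictionary has 1433 entries.
vocabulary = ["graph", "neural", "genetic"]


def binary_bag_of_words(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Made-up paper text: contains "graph" and "neural" but not "genetic".
paper_vector = binary_bag_of_words("A neural model for graph data", vocabulary)
print(paper_vector)
```

Each paper in Cora is one such vector, just much longer.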
The deep learning library we are using today is called:
StellarGraph
Official site:
Now on to today's code:
import pandas as pd
import os
import stellargraph as sg
from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import GCN
from tensorflow.keras import layers, optimizers, losses, metrics, Model
from sklearn import preprocessing, model_selection
from IPython.display import display, HTML
import matplotlib.pyplot as plt
%matplotlib inline
Those are the basic imports.
dataset = sg.datasets.Cora()
display(HTML(dataset.description))
G, node_subjects = dataset.load()
A summary of the dataset:
print(G.info())
Paper categories:
node_subjects.value_counts().to_frame()
Split the dataset:
train_subjects, test_subjects = model_selection.train_test_split(
    node_subjects, train_size=140, test_size=None, stratify=node_subjects
)
val_subjects, test_subjects = model_selection.train_test_split(
    test_subjects, train_size=500, test_size=None, stratify=test_subjects
)
Class distribution of the training set:
train_subjects.value_counts().to_frame()
Encode the classes:
target_encoding = preprocessing.LabelBinarizer()
train_targets = target_encoding.fit_transform(train_subjects)
val_targets = target_encoding.transform(val_subjects)
test_targets = target_encoding.transform(test_subjects)
# one-hot encoding
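As a rough picture of what the encoder does, the sketch below one-hot encodes string labels in plain Python and decodes a row by picking its largest entry. The helper names `one_hot_encode` and `one_hot_decode` are made up for this illustration and are not part of scikit-learn; `LabelBinarizer` also handles cases (such as two-class problems) that this sketch ignores.

```python
def one_hot_encode(labels, classes):
    """Map each label to a 0/1 row with a single 1 at the label's class index."""
    index = {label: i for i, label in enumerate(classes)}
    return [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]


def one_hot_decode(rows, classes):
    """Pick the class whose column holds the largest value in each row."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in rows]


# Toy labels (illustrative only); classes are kept in a fixed, sorted order.
labels = ["Theory", "Neural_Networks", "Theory"]
classes = sorted(set(labels))
encoded = one_hot_encode(labels, classes)
decoded = one_hot_decode([[0.2, 0.8], [0.9, 0.1]], classes)
```

Decoding by the largest entry is the same idea used later when softmax outputs are turned back into class names.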
generator = FullBatchNodeGenerator(G, method="gcn")
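Anyway, I haven't fully figured out this generator yet; it looks like some kind of node generator. With method="gcn", it supplies the whole graph (node features plus a normalized adjacency matrix) to the model as a single batch.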
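For intuition about what a full-batch GCN generator has to prepare, here is a minimal pure-Python sketch of the symmetric normalization from the GCN paper, D^{-1/2}(A + I)D^{-1/2}, on a made-up 3-node path graph. The function name `normalize_adjacency` is hypothetical, and this is an illustration under those assumptions, not StellarGraph's actual implementation.

```python
import math


def normalize_adjacency(adjacency):
    """Compute D^{-1/2} (A + I) D^{-1/2} for a dense, symmetric adjacency matrix."""
    n = len(adjacency)
    # Add self-loops so each node keeps its own features when neighbors are mixed in.
    with_loops = [
        [adjacency[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)
    ]
    # Degree of each node, counting the self-loop.
    degrees = [sum(row) for row in with_loops]
    return [
        [with_loops[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
        for i in range(n)
    ]


# Toy path graph 0 - 1 - 2 (hypothetical data, not Cora).
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
normalized = normalize_adjacency(adjacency)
for row in normalized:
    print([round(value, 3) for value in row])
```

Because the normalization involves every node and edge, the graph is processed as one batch rather than in shuffled mini-batches.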
train_gen = generator.flow(train_subjects.index, train_targets)
gcn = GCN(
    layer_sizes=[16, 16], activations=["relu", "relu"], generator=generator, dropout=0.5
)
Then look at the shape of the output tensor:
x_inp, x_out = gcn.in_out_tensors()
x_out
Then add the prediction layer:
predictions = layers.Dense(units=train_targets.shape[1], activation="softmax")(x_out)
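The softmax activation turns the layer's raw scores into class probabilities. A minimal plain-Python sketch, with made-up scores for three classes:

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the max does not change the result but avoids overflow in exp().
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores for one node.
probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

The predicted class is simply the index with the highest probability.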
Then build the model with tf:
model = Model(inputs=x_inp, outputs=predictions)
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.01),  # `lr` is a deprecated alias
    loss=losses.categorical_crossentropy,
    metrics=["acc"],
)
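To see what the categorical cross-entropy loss measures, here is a minimal single-sample sketch in plain Python. The `eps` clipping is an assumption added here to avoid log(0), and the probabilities are made up.

```python
import math


def categorical_crossentropy(one_hot_target, predicted_probs, eps=1e-7):
    """Cross-entropy for one sample: -sum(target * log(prediction))."""
    return -sum(
        target * math.log(max(prob, eps))
        for target, prob in zip(one_hot_target, predicted_probs)
    )


# The true class is index 1 in both cases (hypothetical predictions).
confident = categorical_crossentropy([0, 1, 0], [0.1, 0.8, 0.1])
unsure = categorical_crossentropy([0, 1, 0], [0.4, 0.3, 0.3])
print(confident, unsure)
```

The loss is small when the model puts high probability on the true class and grows as that probability drops.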
Validation data and early stopping:
val_gen = generator.flow(val_subjects.index, val_targets)
from tensorflow.keras.callbacks import EarlyStopping
es_callback = EarlyStopping(monitor="val_acc", patience=50, restore_best_weights=True)
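The idea behind patience-based early stopping can be sketched in a few lines. This is a simplified illustration with a made-up validation history and a hypothetical function name, not Keras's actual implementation.

```python
def early_stopping(val_scores, patience):
    """Return (stop_epoch, best_epoch) for a metric where higher is better."""
    best_score, best_epoch, waited = float("-inf"), -1, 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # Improvement: remember this epoch and reset the waiting counter.
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                # Stop here; the weights from best_epoch would be restored.
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch


# Made-up validation accuracy per epoch.
val_acc = [0.50, 0.70, 0.69, 0.68, 0.71, 0.70, 0.70, 0.70]
stop_epoch, best_epoch = early_stopping(val_acc, patience=3)
print(stop_epoch, best_epoch)
```

A large patience such as 50 tolerates long plateaus, and restoring the best weights means the final model comes from the best epoch rather than the last one.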
Training:
history = model.fit(
    train_gen,
    epochs=200,
    validation_data=val_gen,
    verbose=2,
    shuffle=False,  # this should be False, since shuffling data means shuffling the whole graph
    callbacks=[es_callback],
)
Plot the training history to take a look:
sg.utils.plot_history(history)
Evaluate on the test set:
test_gen = generator.flow(test_subjects.index, test_targets)
test_metrics = model.evaluate(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))
all_nodes = node_subjects.index
all_gen = generator.flow(all_nodes)
all_predictions = model.predict(all_gen)
node_predictions = target_encoding.inverse_transform(all_predictions.squeeze())
df = pd.DataFrame({"Predicted": node_predictions, "True": node_subjects})
df.head(20)
embedding_model = Model(inputs=x_inp, outputs=x_out)
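That builds the embedding model, which outputs the GCN layer activations instead of the class probabilities.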
emb = embedding_model.predict(all_gen)
emb.shape
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
transform = TSNE  # or PCA, for dimensionality reduction
X = emb.squeeze(0)
X.shape
trans = transform(n_components=2)
X_reduced = trans.fit_transform(X)
X_reduced.shape
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(
    X_reduced[:, 0],
    X_reduced[:, 1],
    c=node_subjects.astype("category").cat.codes,
    cmap="jet",
    alpha=0.7,
)
ax.set(
    aspect="equal",
    xlabel="$X_1$",
    ylabel="$X_2$",
    title=f"{transform.__name__} visualization of GCN embeddings for cora dataset",
)
Plot the embeddings to check whether our graph neural network's classification is accurate:
Before training: