NormanZyq
Published on 2021-09-13 / 1105 views

Deep Continuous Clustering: Reading Notes

Background

  • Clustering is a fundamental procedure in machine learning and data analysis.
  • Well-known approaches:
    • Center-based
    • Spectral methods
    • ...
  • Reliable clustering of noisy high-dimensional datasets remains an open problem.
  • Some methods cleanly separate the embedding stage from the unsupervised clustering stage: the embedding network is fully trained first, and clustering is run only afterwards.

Requirements for DCC

  1. Express the joint problem as optimization of a single continuous objective.
  2. This optimization should be amenable to scalable gradient-based solvers such as modern variants of SGD.
  3. The formulation should not require setting the number of clusters a priori, since this number is often not known in advance.

Existing methods achieve any one of these properties easily, but satisfying all of them at once is a challenge.

Training and Optimization

Training steps:

  1. Input points $X=\{x_1, \cdots, x_n\}$, where each $x_i$ is a $D$-dimensional vector; construct edges with kNN or mutual kNN, augmented by a minimum spanning tree, to form a graph with edge set $\mathcal{E}$.
  2. A mapping $F_\theta: \mathbb{R}^D \rightarrow \mathbb{R}^d$, $d \ll D$, performs feature extraction and dimensionality reduction: $Y = F_\theta(X)$.
  3. To constrain $F_\theta$ to produce a faithful embedding, introduce a reverse mapping $G_\omega: \mathbb{R}^d \rightarrow \mathbb{R}^D$ and minimize $\|X - G_\omega(Y)\|_F^2$ (Frobenius norm) over the autoencoder parameters. This is the reconstruction loss, denoted $L_1$.
  4. With the embedding $Y = F_\theta(X)$, introduce representatives $Z=\{z_1, \dots, z_n\}$, $z_i \in \mathbb{R}^d$, constrained by the objectives $L_2$ and $L_3$.
  5. Jointly optimize $L_1$, $L_2$, and $L_3$.
  6. From $Z$, construct a new graph $G=(V,F)$, where $F$ is the edge set; as an adjacency matrix, $F=(f_{i,j})$ is computed pairwise over $Z$: if $\|z_i - z_j\| < \epsilon$, then $f_{i,j}=1$. The connected components of this graph are the clusters.
  7. Stop when fewer than 0.1% of the edges change their cluster assignment in one round.
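The two graph steps above (1 and 6) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names `mutual_knn_edges` and `cluster_by_threshold` are my own, and the minimum-spanning-tree augmentation of step 1 is omitted for brevity.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def mutual_knn_edges(X, k):
    """Step 1 (sketch): keep edge (i, j) only if i and j appear in each
    other's k-nearest-neighbor lists."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbors
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), nn.ravel()] = True
    return A & A.T                       # mutual: both directions must exist

def cluster_by_threshold(Z, eps):
    """Step 6: f_{i,j} = 1 iff ||z_i - z_j|| < eps; clusters are the
    connected components, so the number of clusters is never preset."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    F = csr_matrix(d < eps)
    n_clusters, labels = connected_components(F, directed=False)
    return n_clusters, labels

# Toy usage: two well-separated groups of points in R^2.
Z = np.vstack([np.random.rand(5, 2), 10 + np.random.rand(5, 2)])
n_clusters, labels = cluster_by_threshold(Z, eps=2.0)
print(n_clusters)  # 2
```

Reading clusters off connected components is what lets the method avoid fixing the number of clusters a priori; only the threshold $\epsilon$ is chosen.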


Objective:

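The figures carrying the exact objective did not survive in this copy. Schematically, combining the $L_1$, $L_2$, $L_3$ terms described above, the joint objective has the following shape. This is a hedged reconstruction, not the paper's formula verbatim: the robust penalty $\rho$, the edge weights $w_{ij}$, and the balance coefficients $\lambda_1, \lambda_2$ follow the original paper.

```latex
\mathcal{L}(\theta, \omega, Z) =
\underbrace{\left\| X - G_\omega(F_\theta(X)) \right\|_F^2}_{L_1}
+ \lambda_1 \underbrace{\sum_{i} \rho\!\left( \left\| z_i - y_i \right\|_2 \right)}_{L_2}
+ \lambda_2 \underbrace{\sum_{(i,j) \in \mathcal{E}} w_{ij}\, \rho\!\left( \left\| z_i - z_j \right\|_2 \right)}_{L_3}
```

$L_2$ keeps each representative $z_i$ close to its embedding $y_i$, while $L_3$ pulls representatives of connected points together; all terms are continuous, so the whole objective is amenable to SGD.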

Evaluation

Metrics: Adjusted Mutual Information (AMI) and clustering accuracy (ACC).


MI(X, Y): Mutual Information, H(X): Entropy

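Both metrics are easy to compute in practice. A minimal sketch, assuming scikit-learn for AMI and SciPy's Hungarian algorithm for the label matching in ACC; `clustering_accuracy` is my own helper name, not an API from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one mapping of predicted
    cluster ids onto ground-truth labels (Hungarian matching)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_true.max(), y_pred.max()) + 1
    w = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1                        # contingency counts
    rows, cols = linear_sum_assignment(-w)  # maximize matched mass
    return w[rows, cols].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # identical partition, permuted ids
acc = clustering_accuracy(y_true, y_pred)
ami = adjusted_mutual_info_score(y_true, y_pred)
print(acc, ami)  # 1.0 1.0
```

Note that both scores ignore the naming of clusters: a permutation of the predicted ids leaves them unchanged, which is exactly what clustering evaluation needs.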

Experimental results

Overall results


On the choice of the embedding dimension d:


