Background
- Clustering is a fundamental procedure in machine learning and data analysis.
- Well-known approaches:
- Center-based
- Spectral methods
- ...
- Reliable clustering of noisy high-dimensional datasets remains an open problem.
- Some methods keep the embedding step and the unsupervised clustering step clearly separate (the embedding is trained to completion first, and clustering is run afterwards).
Requirements for DCC
- Express the joint problem as optimization of a single continuous objective.
- This optimization should be amenable to scalable gradient-based solvers such as modern variants of SGD.
- The formulation should not require setting the number of clusters a priori, since this number is often not known in advance.
Achieving any one of the above is easy for existing methods, but combining all of them is a challenge.
Training and Optimization
Training steps:
- Input points X=\{x_1, \cdots, x_n\}, each a D-dimensional vector; edges are constructed with kNN or mutual kNN, supplemented by a minimum spanning tree, forming the edge set \mathcal{E}.
- An embedding F_\theta : R^D \rightarrow R^d, d \ll D, performs feature extraction and dimensionality reduction: Y = F_\theta(X).
- To constrain F_\theta to produce faithful embeddings, introduce an inverse mapping G_\omega : R^d \rightarrow R^D and minimize over \omega the quantity ||X - G_\omega(Y)||_F^2 (Frobenius norm), called the reconstruction loss and denoted L1.
- Based on F_\theta, embed X to obtain Y = F_\theta(X); introduce representatives Z=\{z_1, \ldots, z_n\}, z_i \in R^d, constrained by the objectives L2 and L3.
- Jointly optimize L1, L2, and L3.
- Based on Z, construct a new graph G=(V, F), where F is the edge set; in adjacency-matrix form F=(f_{i,j}), computed over all pairs of elements of Z: if ||z_i - z_j|| < \epsilon, set f_{i,j}=1. The connected components of G give the clustering result.
- Stop the algorithm once, within one epoch, fewer than 0.1% of the edges change in a way that alters the clustering.
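The graph-construction step above (mutual kNN plus a minimum spanning tree to keep the graph connected) can be sketched with scipy/sklearn; the function name `build_edges` and the choice of `k` are illustrative, not from the paper:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from sklearn.neighbors import kneighbors_graph

def build_edges(X, k=3):
    """Mutual kNN edges, augmented with an MST so the graph stays connected."""
    knn = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    mutual = knn.multiply(knn.T)  # keep edge (i, j) only if it appears in both directions
    # MST over full pairwise distances guarantees a single connected component
    # (O(n^2) memory; fine for a small illustration)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    mst = minimum_spanning_tree(csr_matrix(d))
    return (mutual + mst + mst.T) > 0  # symmetric boolean adjacency

X = np.random.RandomState(0).randn(20, 4)
E = build_edges(X, k=3)
n_components, _ = connected_components(E, directed=False)  # MST makes this 1
```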
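A minimal sketch of the reconstruction loss L1, with linear maps standing in for F_\theta and G_\omega (the paper uses deep networks; the linear encoder/decoder, the hand-derived gradients, and the step size here are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 100, 10, 2
X = rng.normal(size=(n, D))

# Linear stand-ins for the encoder F_theta and decoder G_omega
W_enc = 0.1 * rng.normal(size=(D, d))  # F_theta: R^D -> R^d
W_dec = 0.1 * rng.normal(size=(d, D))  # G_omega: R^d -> R^D

def reconstruction_loss(X, W_enc, W_dec):
    Y = X @ W_enc        # embedding Y = F_theta(X)
    X_hat = Y @ W_dec    # reconstruction G_omega(Y)
    return np.linalg.norm(X - X_hat, ord="fro") ** 2  # L1 = ||X - G_omega(F_theta(X))||_F^2

L1_initial = reconstruction_loss(X, W_enc, W_dec)

# A few full-batch gradient steps (gradients derived by hand for the linear case)
lr = 1e-3
for _ in range(50):
    Y = X @ W_enc
    R = X - Y @ W_dec                      # residual
    W_dec += lr * 2 * Y.T @ R              # -grad wrt W_dec
    W_enc += lr * 2 * X.T @ R @ W_dec.T    # -grad wrt W_enc

L1_final = reconstruction_loss(X, W_enc, W_dec)
```

In the linear case this is just a toy autoencoder; the point is only that L1 is a single differentiable objective, amenable to the gradient-based solvers the notes mention.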
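The final cluster-extraction step (threshold pairwise distances of Z at \epsilon, then read off connected components) can be sketched as follows; the helper name `clusters_from_Z` and the toy data are illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def clusters_from_Z(Z, eps):
    """f_{i,j} = 1 iff ||z_i - z_j|| < eps; clusters = connected components."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    F = (d < eps) & ~np.eye(len(Z), dtype=bool)  # drop self-loops
    n_clusters, labels = connected_components(csr_matrix(F), directed=False)
    return n_clusters, labels

# Two well-separated groups on the real line
Z = np.array([[0.0], [0.1], [5.0], [5.05]])
n_clusters, labels = clusters_from_Z(Z, eps=0.5)
```

Note that the number of clusters falls out of the connected components rather than being fixed in advance, which is exactly the third DCC requirement above.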
Objective:
Evaluation
Adjusted Mutual Information (AMI) and clustering accuracy (ACC)
MI(X, Y): Mutual Information, H(X): Entropy
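AMI is available directly in scikit-learn; a small sanity check (the label vectors are made up for illustration):

```python
from sklearn.metrics import adjusted_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]  # same partition, different label names
ami = adjusted_mutual_info_score(true_labels, pred_labels)
```

AMI is invariant to label permutation, which matters for clustering: identical partitions score 1.0 regardless of which integer names the clusters receive.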