8.31 Outlier Resistant Unsupervised Deep Architectures for Attributed Network Embedding
The “outlier” is similar to the term “anomaly”
Source: WSDM 2020
Summary
- "Outlier" here is similar to "anomaly", with a slight difference in emphasis; within the scope of this paper, node anomaly ≈ outlier.
- The paper focuses on representation learning rather than any single downstream task; in the experiment section, the authors evaluate the proposed methods on several downstream tasks.
- Outliers degrade the expressiveness of node embeddings, which hurts downstream tasks. The paper aims to address this problem.
Motivations & Contributions
- Outliers harm existing embedding methods such as node2vec: random walks pass through outlier nodes and cross community boundaries, breaking homophily, so nodes from different communities end up with similar embeddings.
- Proposes DONE (Deep Outlier aware attributed Network Embedding), claimed to be the first unsupervised deep architecture for outlier-aware structural and attribute embedding.
- Proposes AdONE, an adversarial variant that applies adversarial learning to outlier-aware network embedding.
Problem definition
(a) Structural Outlier: The node is structurally connected to nodes from different communities, i.e., its structural neighborhood is inconsistent. (b) Attribute Outlier: The attributes of the node are similar to that of nodes from different communities, i.e., its attribute neighborhood is inconsistent. (c) Combined Outlier: Node belongs to a community structurally but it belongs to a different community in terms of attribute similarity.
Proposed methods
Network preprocessing
- Real-world networks are mostly highly sparse and many links between nodes are missing, so a row of the adjacency matrix only captures the observed links. Inspired by PageRank, the authors use random walk with restart to obtain a richer context, which lets the proposed methods preserve higher-order proximity.
- Random walk with restart. Notation: D is the (diagonal) degree matrix; $(P^{t}_i)_j$ is the probability that a walk starting from node i reaches node j after t steps, and these row vectors form the matrix $P^t$. The initial state is $(P^0_i)_i = 1$.
$$ P_i^t=rP_i^{t-1}[D^{-1}A]+(1-r)P_i^0 $$
Averaging over the walk lengths gives the new structural input X:
$$ X=\frac{1}{T}\sum\limits_{\boldsymbol{t}=1}^TP^t $$
- Notation so far: A is the adjacency matrix, D the degree matrix, $P^t$ the transition probability matrix, X the structural input built above, and C the node attribute matrix.
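A minimal numpy sketch of this preprocessing step, assuming a small undirected graph given by a dense adjacency matrix; the restart probability r and walk length T are hyperparameters and the values here are only illustrative:

```python
import numpy as np

def rwr_context(A, r=0.5, T=3):
    """Build the structural input X by averaging the t-step
    random-walk-with-restart distributions from the formulas above."""
    N = A.shape[0]
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0                  # guard against isolated nodes
    W = A / deg[:, None]                 # D^{-1} A, row-stochastic
    P0 = np.eye(N)                       # initial state: (P^0_i)_i = 1
    P = P0.copy()
    X = np.zeros((N, N))
    for _ in range(T):
        P = r * P @ W + (1 - r) * P0     # P^t = r P^{t-1} [D^{-1}A] + (1-r) P^0
        X += P
    return X / T                         # X = (1/T) * sum_t P^t

# toy usage: a 4-node path graph
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rwr_context(A)
```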
Architecture
First method: DONE (superscript s denotes structural, superscript a denotes attribute)
- Two parallel autoencoders process the structure and the node attributes respectively; the structural AE takes X as input and the attribute AE takes C.
- Each node therefore has two vectors, a structural embedding and an attribute embedding, distinguished by the superscripts s and a.
- Loss function: since the reconstruction loss is easily affected by outliers, the proximity loss is down-weighted for likely outliers:
$$ \mathcal{L}_{str}^{Prox}=\frac{1}{N}\sum_{i=1}^N \log\left(\frac{1}{o_i^s}\right)\|\mathbf{x}_i-\hat{\mathbf{x}}_i\|_2^2 $$
$o_i^s$ is the structural outlier score of node i
- Homophily: the next part of the loss is derived from homophily, i.e. nodes connected by an edge should behave similarly [here expressed on the structural embeddings]:
$$ \mathcal{L}_{str}^{Hom}=\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{1}{o_{i}^{s}}\right)\frac{1}{|\mathcal{N}(i)|}\sum_{j\in\mathcal{N}(i)}\|\mathbf{h}_{i}^{s}-\mathbf{h}_{j}^{s}\|_{2}^{2} $$
- The losses above use structural information; the attribute autoencoder has the analogous two terms [using C and the attribute embeddings]:
$$ \mathcal{L}_{attr}^{Prox}=\frac{1}{N}\sum_{i=1}^N \log\left(\frac{1}{o_i^a}\right)\|\mathbf{c}_i-\hat{\mathbf{c}}_i\|_2^2 $$
$$ \mathcal{L}_{attr}^{Hom}=\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{1}{o_{i}^{a}}\right)\frac{1}{|\mathcal{N}(i)|}\sum_{j\in\mathcal{N}(i)}\|\mathbf{h}_{i}^{a}-\mathbf{h}_{j}^{a}\|_{2}^{2} $$
- The final term combines structure and attributes by pulling each node's two embeddings together [using the structural and attribute embeddings]:
$$ \mathcal{L}^{Com}=\frac{1}{N}\sum_{i=1}^N \log\left(\frac{1}{o_i^{com}}\right)\|\mathbf{h}_i^s-\mathbf{h}_i^a\|_2^2 $$
- The overall loss to minimize combines all five terms (a small code sketch of the autoencoders and these terms follows the equation):
$$ \min_{\Theta,O}\mathcal{L}_{DONE}=\alpha_{1}\mathcal{L}_{str}^{Prox}+\alpha_{2}\mathcal{L}_{str}^{Hom}+\alpha_{3}\mathcal{L}_{attr}^{Prox}+\alpha_{4}\mathcal{L}_{attr}^{Hom}+\alpha_{5}\mathcal{L}^{Com} $$
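To make these terms concrete, here is a rough PyTorch sketch of the two parallel autoencoders and the weighted loss above. It is only a schematic under simplifying assumptions: layer sizes are illustrative, A is assumed to be a dense binary adjacency tensor, and the outlier scores o_s, o_a, o_com are passed in as fixed per-node tensors in (0, 1) (in the paper they are variables of the optimization):

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """One autoencoder; DONE runs two of these in parallel (inputs X and C)."""
    def __init__(self, in_dim, hid_dim=64, emb_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, emb_dim))
        self.dec = nn.Sequential(nn.Linear(emb_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, in_dim))

    def forward(self, x):
        h = self.enc(x)
        return h, self.dec(h)

def neighbor_sq_dist(h, A, deg):
    """(1/|N(i)|) * sum_{j in N(i)} ||h_i - h_j||^2, computed with matrix ops."""
    sq = (h ** 2).sum(1)                          # ||h_i||^2
    mean_nbr = A @ h / deg[:, None]               # mean of neighbor embeddings
    return sq - 2 * (h * mean_nbr).sum(1) + A @ sq / deg

def done_loss(X, C, A, ae_s, ae_a, o_s, o_a, o_com, alphas):
    """Weighted sum of the five DONE terms; o_* are per-node outlier scores."""
    h_s, X_hat = ae_s(X)
    h_a, C_hat = ae_a(C)
    w_s, w_a, w_c = torch.log(1 / o_s), torch.log(1 / o_a), torch.log(1 / o_com)
    deg = A.sum(1).clamp(min=1)

    L_str_prox  = (w_s * ((X - X_hat) ** 2).sum(1)).mean()      # structural proximity
    L_attr_prox = (w_a * ((C - C_hat) ** 2).sum(1)).mean()      # attribute proximity
    L_str_hom   = (w_s * neighbor_sq_dist(h_s, A, deg)).mean()  # structural homophily
    L_attr_hom  = (w_a * neighbor_sq_dist(h_a, A, deg)).mean()  # attribute homophily
    L_com       = (w_c * ((h_s - h_a) ** 2).sum(1)).mean()      # structure-attribute coupling

    a1, a2, a3, a4, a5 = alphas
    return a1 * L_str_prox + a2 * L_str_hom + a3 * L_attr_prox + a4 * L_attr_hom + a5 * L_com
```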
- The authors note that optimizing everything jointly with Adam directly is slow, so they use an alternating minimization technique to update the parameters and the outlier scores.
Supplement: alternating minimization technique
Alternating Minimization is an optimization algorithm used to solve optimization problems that can be decomposed into multiple subproblems, each of which can be optimized separately while holding the others fixed. It is commonly used in various fields, including machine learning, signal processing, and numerical optimization.
The basic idea behind the Alternating Minimization Algorithm is to iteratively update the variables or parameters of the problem in a way that reduces the objective function's value. The algorithm typically follows these steps:
- Initialize the variables or parameters of the problem.
- Iteratively update one subset of variables while keeping the others fixed. This update can be done using various optimization techniques, such as gradient descent, coordinate descent, or other specialized methods depending on the problem's nature.
- Repeat step 2 for each subset of variables until convergence criteria are met. Convergence can be determined based on changes in the objective function, a maximum number of iterations, or other stopping criteria.
- The algorithm stops when the convergence criteria are met, and the current values of the variables are considered an approximate solution to the optimization problem.
The "alternating" part of the algorithm comes from the fact that it alternates between optimizing different subsets of variables in each iteration. This alternating process can continue until a specified convergence threshold is reached or a maximum number of iterations is performed.
Alternating Minimization is particularly useful for solving problems with a natural decomposition into subproblems, where optimizing each subset of variables individually is easier or computationally more efficient than optimizing the entire problem at once. It is widely used in applications like matrix factorization, image processing, and optimization problems in statistics and machine learning.
Note that the result of alternating minimization is not guaranteed to be a global optimum.
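To illustrate what this could look like for DONE, the sketch below (reusing done_loss and the autoencoders from the sketch above) alternates between (a) a few gradient steps on the network parameters with the outlier scores held fixed and (b) refreshing the outlier scores from the current per-node errors. The score update shown here, making each score proportional to its node's error and normalizing the scores to sum to 1, is an illustrative stand-in for the paper's own closed-form updates:

```python
import torch

def update_outlier_scores(errors, eps=1e-12):
    """Illustrative update: score proportional to the node's current error,
    normalized so the scores sum to 1 and stay strictly positive."""
    scores = errors / (errors.sum() + eps)
    return scores.clamp(min=eps)            # keep log(1/o_i) finite

def train_done(X, C, A, ae_s, ae_a, alphas, epochs=50, inner_steps=5):
    N = X.shape[0]
    o_s = o_a = o_com = torch.full((N,), 1.0 / N)   # start from uniform scores
    opt = torch.optim.Adam(list(ae_s.parameters()) + list(ae_a.parameters()), lr=1e-3)

    for _ in range(epochs):
        # (a) update network parameters, outlier scores fixed
        for _ in range(inner_steps):
            opt.zero_grad()
            loss = done_loss(X, C, A, ae_s, ae_a, o_s, o_a, o_com, alphas)
            loss.backward()
            opt.step()

        # (b) update outlier scores, network parameters fixed
        with torch.no_grad():
            h_s, X_hat = ae_s(X)
            h_a, C_hat = ae_a(C)
            o_s   = update_outlier_scores(((X - X_hat) ** 2).sum(1))
            o_a   = update_outlier_scores(((C - C_hat) ** 2).sum(1))
            o_com = update_outlier_scores(((h_s - h_a) ** 2).sum(1))
    return o_s, o_a, o_com
```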
Second method: AdONE
- DONE has a drawback: the $\mathcal{L}^{Com}$ term directly minimizes an L2 distance, which forces a node's structural and attribute embeddings to be very close in Euclidean space. When structure and attributes carry inconsistent information, this constraint is too strong and can distort the learned embeddings.
- Key idea: use a discriminator to align the structural and attribute embeddings. Its task is to classify whether an embedding comes from the structural or the attribute autoencoder, while the two autoencoders try to fool it.
At the end of this game, the structural and attribute embeddings $\mathbf{h}_i^s$ and $\mathbf{h}_i^a$ may still not be very close to each other in Euclidean space, but they are aligned in the sense that, at equilibrium, both lie very close to the discriminator's decision boundary.
- The discriminator is a two-layer network; structural embeddings are treated as the positive class (label 1) and attribute embeddings as the negative class (label 0). The discriminator's objective is to maximize:
$$ \mathcal{L}^{Disc}(\Theta_D)=\frac{1}{N}\sum_{i=1}^N\left(\log(D(\mathbf{h}_i^s))+\log(1-D(\mathbf{h}_i^a))\right) $$
The two autoencoders should fool the discriminator, so they minimize:
$$ \mathcal{L}^{Alg}=\frac{1}{N}\sum_{i=1}^N \log\left(\frac{1}{o_i^{com}}\right)\left(\log(D(\mathbf{h}_i^s))+\log(1-D(\mathbf{h}_i^a))\right) $$
- In AdONE, the overall loss function is:
$$ \min_{\Theta,O}~\mathcal{L}_{AdONE}=\beta_{1}\mathcal{L}_{str}^{Prox}+\beta_{2}\mathcal{L}_{str}^{Hom}+\beta_{3}\mathcal{L}_{attr}^{Prox}+\beta_{4}\mathcal{L}_{attr}^{Hom}+\beta_{5}\mathcal{L}^{Alg} $$
The only difference from DONE's objective is the final term: AdONE replaces $\mathcal{L}^{Com}$ with the adversarial alignment loss $\mathcal{L}^{Alg}$, aligning the two embeddings without DONE's hard L2 constraint.
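A rough sketch of the adversarial component, assuming the same embedding dimension as the autoencoders above; the Discriminator class and the two loss functions are illustrative and mirror $\mathcal{L}^{Disc}$ (maximized by the discriminator, written here as a quantity to minimize) and $\mathcal{L}^{Alg}$ (minimized by the autoencoders):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Two-layer MLP that tries to tell structural embeddings (label 1)
    from attribute embeddings (label 0)."""
    def __init__(self, emb_dim=16, hid_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, 1), nn.Sigmoid())

    def forward(self, h):
        return self.net(h).squeeze(-1)

def disc_loss(D, h_s, h_a, eps=1e-12):
    """Negative of L^Disc: minimizing this maximizes log D(h^s) + log(1 - D(h^a)).
    Embeddings are detached so only the discriminator is updated."""
    return -(torch.log(D(h_s.detach()) + eps)
             + torch.log(1 - D(h_a.detach()) + eps)).mean()

def align_loss(D, h_s, h_a, o_com, eps=1e-12):
    """L^Alg: the autoencoders minimize this to fool the discriminator,
    with nodes down-weighted according to the combined outlier score o_com."""
    w = torch.log(1 / o_com)
    return (w * (torch.log(D(h_s) + eps) + torch.log(1 - D(h_a) + eps))).mean()
```

In training one would alternate: a step on disc_loss updating only the discriminator, then a step on the full AdONE objective (the four DONE terms plus align_loss in place of $\mathcal{L}^{Com}$) updating only the autoencoders.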
Experiments
The following methods are compared on outlier detection, node classification, and community detection. Since all of these approaches, like DONE and AdONE, produce node embeddings, many representation-learning downstream tasks can be built on top of them, not just anomaly detection (a small usage sketch follows the list below). DONE and AdONE are not always the best approaches across all tasks and datasets.
- node2vec
- LINE
- SDNE
- GraphSAGE
- DGI
- SEANO
- ONE
- DOMINANT
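Since the compared methods all output node embeddings, the downstream evaluations simply feed the learned vectors into standard models. A minimal scikit-learn sketch (concatenating the structural and attribute embeddings is one reasonable choice, not necessarily the paper's exact protocol; outlier detection would instead rank nodes by their outlier scores):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

def downstream_eval(h_s, h_a, labels, n_communities):
    """Use learned embeddings for node classification and community detection."""
    emb = np.concatenate([h_s, h_a], axis=1)     # one vector per node

    # node classification on the embeddings
    Z_tr, Z_te, y_tr, y_te = train_test_split(emb, labels, test_size=0.5, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    acc = clf.score(Z_te, y_te)

    # community detection by clustering the embeddings
    communities = KMeans(n_clusters=n_communities, n_init=10, random_state=0).fit_predict(emb)
    return acc, communities
```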