# $P2V$: Learning Representations for Academic Papers

## Implementation

You can download our implementation [here](./p2v.tar.gz). The code is written in Cython; we recommend the Intel Python distribution for better performance. You can compile the code with the following command: `python3 setup.py build_ext --inplace`

### Filelist

The provided implementation has five files:

1. `cora.ipynb` -- a Jupyter Notebook example using the Cora dataset.
2. `coraDataIterator.py` -- the data iterator that loads the data.
3. `evaluate_utils.py` -- evaluation utilities for running classification evaluation on the model.
4. `model.pyx` -- the P2V model in Cython.
5. `setup.py` -- the compilation configuration file for the Cython model.

## Dataset

More datasets can be found [here](http://zhang18f.myweb.cs.uwindsor.ca/datasets/).

| Dataset                            |   # Papers | Avg words per paper | Vocabulary Size | # Citations | Avg degree | # Classes |
|-----------------------------------:|-----------:|--------------------:|----------------:|------------:|-----------:|----------:|
| (Deprecated) [Cora](cora.tar.gz)   |      2,708 |               18.17 |           1,432 |       5,429 |       2.00 |         7 |
| [Cora-enriched](cora.tar.gz)       |      2,708 |              937.83 |          26,007 |       5,429 |       2.00 |         7 |
| [CS-title (Aminer)](aminer.tar.gz) |  2,150,158 |                7.23 |         181,006 |   4,191,677 |       1.95 |        10 |
| Health Data (available on s1)      | 42,294,432 |               69.19 |       4,660,525 | 272,694,431 |       6.45 |         - |

Note: the original Aminer data can be downloaded from [here](https://static.aminer.org/lab-datasets/citation/dblp.v8.tgz).

## $P2V$

$P2V$ learns from three components of an academic paper: word-word relations, paper-word relations, and paper-paper relations. Following the idea of the Skip-gram model, we obtain the embeddings by maximizing the log probability of the observed training pairs.

### Unweighted $P2V$

The most straightforward method is to maximize the average log probability of all observed training pairs from the word-word, paper-word, and paper-paper relations. The objective function is:

$$
\begin{align}
O =& \frac{1}{S}[\sum^{T}_{i=1} \sum_{-c_w\leq j \leq c_w,j\neq0} \log p(w_{i+j} | w_{i}) + \sum^{D}_{i=1} \sum_{w_j \in d_i} \log p(w_j | d_i) + \sum^{D}_{i=1} \sum_{d_j \in Sampling(d_i)} \log p(d_{j} | d_{i})],
\label{eq:objective-unweighted-nol2}
\end{align}
$$

where $S$ is the total number of observed training pairs, $T$ is the size of the corpus, and $D$ is the number of documents. $c_w$ is the dynamic window size for word-word relations. $Sampling(n)$ is the star-sampling, which returns a set of neighbors of node $n$. $\log p(\cdot \mid \cdot)$ is the log probability defined using negative sampling as follows:

$$
\begin{align}
\log \sigma(u^\top_{j} \cdot v_{i}) + \sum_{k=1}^{K} \mathbb{E}_{u_{k} \sim P_{n}(u)} [ \log \sigma(-u^{\top}_{k} \cdot v_{i}) ],
\end{align}
$$

where $P_n(\cdot)$ is the noise distribution.
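For concreteness, the negative-sampling estimate above can be computed as in the following NumPy sketch, where the expectation over $P_n$ is replaced by $K$ sampled negatives as in word2vec. This is an illustration only, not the Cython code in `model.pyx`; the function and argument names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_log_prob(v_i, u_j, u_negatives):
    """Negative-sampling estimate of log p(j | i) for one observed pair.

    v_i         -- input (embedding) vector of the center word/paper
    u_j         -- output (context) vector of the observed word/paper
    u_negatives -- K output vectors drawn from the noise distribution P_n
    """
    positive_term = np.log(sigmoid(u_j @ v_i))
    negative_term = np.sum(np.log(sigmoid(-(u_negatives @ v_i))))
    return positive_term + negative_term
```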
### Equal-weight $P2V$

However, because the word-word, word-document, and document-document training pairs follow different distributions, optimizing the unweighted $P2V$ puts more weight on the components that have more training samples. To balance the weight learned from each component, we can instead maximize the average log probability of each component separately. The objective function is:

$$
\begin{align}
O =& \frac{1}{S_w}\sum^{T}_{i=1} \sum_{-c_w\leq j \leq c_w,j\neq0} \log p(w_{i+j} | w_{i})\\
+& \frac{1}{S_d}\sum^{D}_{i=1} \sum_{w_j \in d_i} \log p(w_j | d_i)\\
+& \frac{1}{S_n}\sum^{D}_{i=1} \sum_{d_j \in Sampling(d_i)} \log p(d_{j} | d_{i}),
\label{eq:objective-equal-weight-nol2}
\end{align}
$$

where $S_w$ is the number of observed word-word pairs, $S_d$ is the number of observed word-document pairs, and $S_n$ is the number of observed document-document pairs.

### Weighted $P2V$

In some cases we may want to control the weight learned from each component, say $\alpha$ for word-word relations, $\beta$ for word-document relations, and $\gamma$ for document-document relations. We can then rewrite the objective function as:

$$
\begin{align}
O =& \frac{\alpha}{S_w}\sum^{T}_{i=1} \sum_{-c_w\leq j \leq c_w,j\neq0} \log p(w_{i+j} | w_{i})\\
+& \frac{\beta}{S_d}\sum^{D}_{i=1} \sum_{w_j \in d_i} \log p(w_j | d_i)\\
+& \frac{\gamma}{S_n}\sum^{D}_{i=1} \sum_{d_j \in Sampling(d_i)} \log p(d_{j} | d_{i}).
\end{align}
$$

### L2-Regularization

Since SGNS suffers from the norm-overfitting problem described [here](../w2v/index.html), we add L2-regularization on both the embedding vectors and the output (context) vectors:

$$
\begin{align}
O =& \frac{\alpha}{S_w}\sum^{T}_{i=1} \sum_{-c_w\leq j \leq c_w,j\neq0} \log p(w_{i+j} | w_{i})\\
+& \frac{\beta}{S_d}\sum^{D}_{i=1} \sum_{w_j \in d_i} \log p(w_j | d_i)\\
+& \frac{\gamma}{S_n}\sum^{D}_{i=1} \sum_{d_j \in Sampling(d_i)} \log p(d_{j} | d_{i})\\
-& \lambda\sum_{i=1}^V {\Vert v_{w_i}\Vert _2}^2 - \lambda \sum_{i=1}^V {\Vert u_{w_i}\Vert _2}^2 - \lambda\sum_{i=1}^D {\Vert v_{d_i}\Vert _2}^2 - \lambda \sum_{i=1}^D {\Vert u_{d_i}\Vert _2}^2,
\end{align}
$$

where $V$ is the vocabulary size and $\lambda$ controls the strength of the regularization.
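As a rough illustration of how this objective can be optimized pair by pair, here is a minimal NumPy sketch of one SGD step for a single training pair. It is not the shipped Cython implementation; the function name, the `weight` argument (intended as $\alpha/S_w$, $\beta/S_d$, or $\gamma/S_n$ depending on the component the pair comes from), and the per-update application of $\lambda$ are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(v_i, u_j, u_negatives, lr, weight, lam):
    """One SGD update for a single (i, j) training pair under the
    weighted, L2-regularized objective.

    v_i         -- input (embedding) vector of item i
    u_j         -- output (context) vector of the observed item j
    u_negatives -- K output vectors sampled from the noise distribution
    weight      -- alpha/S_w, beta/S_d, or gamma/S_n (assumed per-component weight)
    lam         -- the L2 coefficient lambda
    """
    pos = sigmoid(u_j @ v_i)          # sigma(u_j . v_i)
    neg = sigmoid(u_negatives @ v_i)  # sigma(u_k . v_i) for each negative k

    # Gradients of the negative-sampling log probability
    grad_v = (1.0 - pos) * u_j - (neg[:, None] * u_negatives).sum(axis=0)
    grad_uj = (1.0 - pos) * v_i
    grad_uneg = -neg[:, None] * v_i

    # Gradient ascent on the weighted log probability,
    # gradient descent on the L2 penalty
    v_i = v_i + lr * (weight * grad_v - 2.0 * lam * v_i)
    u_j = u_j + lr * (weight * grad_uj - 2.0 * lam * u_j)
    u_negatives = u_negatives + lr * (weight * grad_uneg - 2.0 * lam * u_negatives)
    return v_i, u_j, u_negatives
```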