Research:Convolutional Graph Embeddings for article recommendation in Wikipedia

Contact: Oleksii Moskalenko
article recommendation

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


In this thesis, we address the task of recommending articles to edit to Wikipedia contributors. Our system is built on top of article embeddings constructed by applying a Graph Convolutional Network (GCN) to the graph of Wikipedia articles. In an offline evaluation conducted on the history of users' previous edits, we outperformed embeddings generated from text (via the Doc2Vec model) by 47% in Recall and 32% in Mean Reciprocal Rank (MRR) for English Wikipedia, and by 62% in Recall and 41% in MRR for Ukrainian Wikipedia. With an additional ranking model, we achieved a total improvement of 68% in Recall and 41% in MRR on the English edition of Wikipedia.

Graph Neural Networks (GNNs) are deep-learning-based methods that aim to solve typical machine learning tasks such as classification, clustering, or link prediction for graph-structured data via a message-passing architecture. Following the explosive success of Convolutional Neural Networks (CNNs) in constructing highly expressive representations, similar ideas were recently carried over to GNNs. Graph Convolutional Networks are GNNs that, like CNNs, share the weights of convolutional filters across nodes in the graph. They have demonstrated especially good performance in representation learning via semi-supervised tasks such as the classification and link prediction mentioned above.
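The Recall and MRR figures quoted above are standard ranking metrics. The page does not state the exact cut-off or averaging used in the evaluation, so the following is only a minimal sketch of the common definitions; the function names and the cut-off parameter `k` are illustrative assumptions.

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & relevant) / len(relevant)


def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was recommended."""
    relevant = set(relevant)
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(all_recommended, all_relevant):
    """Average of the reciprocal ranks over all users."""
    pairs = list(zip(all_recommended, all_relevant))
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(rec, rel) for rec, rel in pairs) / len(pairs)
```

In an offline setting such as the one described, `relevant` would plausibly be the articles a user actually edited in a held-out part of their history.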


Offline Preparation

Our model consists of two steps: Candidate Generation and Ranking.

Our solution for Candidate Generation is based on a Graph Convolutional Network that learns to generate article embeddings, which are used to measure the similarity between articles. Given a user's history of previously edited articles, we calculate a user representation (we plan to compare several aggregation mechanisms; currently only averaging is available). This user representation then acts as the query for a Nearest Neighbors search, and the N articles most similar to the query are returned as candidates.
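The candidate generation step described above can be sketched with NumPy. This is an illustration under assumptions, not the project's implementation: it uses a brute-force cosine-similarity search (a production system would likely use an approximate nearest-neighbor index), assumes all vectors are non-zero, and the optional `exclude` argument for skipping already edited articles is hypothetical.

```python
import numpy as np


def user_representation(history_embeddings):
    """Average aggregator: mean of the embeddings of the edited articles."""
    return np.asarray(history_embeddings, dtype=float).mean(axis=0)


def top_n_candidates(query, article_embeddings, n, exclude=()):
    """Return the indices of the n articles most cosine-similar to the query."""
    emb = np.asarray(article_embeddings, dtype=float)
    # Normalize rows and query so that the dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)
    scores = emb @ q
    if exclude:
        # Hypothetical option: never return the excluded indices.
        scores[list(exclude)] = -np.inf
    return np.argsort(-scores)[:n].tolist()
```

For example, with `article_embeddings` as the matrix of all article vectors, `top_n_candidates(user_representation(article_embeddings[history]), article_embeddings, n)` would produce the candidate indices for a user whose history is the list of row indices `history`.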

During the offline preparation phase, we train our GCN (GraphSAGE) model on the task of link prediction (internal links between articles), using text representations obtained from the Doc2Vec model as the initial state of each node.
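To make the idea concrete, here is a minimal NumPy sketch of a single GraphSAGE-style layer with a mean aggregator, plus a dot-product link score. The architecture details of the actual model (number of layers, aggregator type, loss, negative sampling) are not given on this page, so everything below, including the function names and the choice of ReLU and L2 normalization, is an assumption for illustration. Training of the weights is omitted.

```python
import numpy as np


def sage_mean_layer(features, neighbors, w_self, w_neigh):
    """One GraphSAGE-style layer with a mean aggregator.

    features:  (num_nodes, d_in) float array, e.g. Doc2Vec vectors
    neighbors: list of neighbor-index lists, one per node (internal links)
    w_self, w_neigh: (d_in, d_out) weight matrices shared by all nodes
    """
    features = np.asarray(features, dtype=float)
    aggregated = np.zeros_like(features)
    for node, nbrs in enumerate(neighbors):
        if nbrs:  # nodes without links keep a zero neighborhood vector
            aggregated[node] = features[nbrs].mean(axis=0)
    hidden = np.maximum(features @ w_self + aggregated @ w_neigh, 0.0)  # ReLU
    norms = np.linalg.norm(hidden, axis=1, keepdims=True)
    return hidden / np.maximum(norms, 1e-12)  # L2-normalize each embedding


def link_score(embeddings, u, v):
    """Sigmoid of the dot product: a score in (0, 1) for the link u-v."""
    return 1.0 / (1.0 + np.exp(-(embeddings[u] @ embeddings[v])))
```

In a link prediction setup, the weights would be adjusted so that `link_score` is high for article pairs that are actually linked and low for sampled unlinked pairs.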

Prediction API

Our API aims to produce real-time recommendations based on the articles a user has previously edited. The input can be passed either as article IDs or as a user ID.

As the result, we return the IDs of the recommended articles.

Methods

POST /api/<Wikipedia Edition>/v1/recommend

Parameters

accepts application/json as input payload

  • user_id: integer, optional [one of user_id/user_history must be specified]. ID of the user in the Wikipedia database.
  • user_history: list[Article], optional [one of user_id/user_history must be specified]. List of article objects previously edited by the user (see the definition below).
  • top_n: integer, required. Number of recommendations that the API will return.
  • ranking: string, optional (default=deep-ranking). Ranking model used to sort the candidates. Available options:
    • deep-ranking: our ranking model, based on a pointwise ranking approach
    • cosine: sort by cosine distance between the user representation and the candidates
    • tf-ranking: another deep ranking model, groupwise in this case, based on the paper https://arxiv.org/pdf/1811.04415.pdf
  • aggregator: string, optional (default=average). Aggregator used to generate the user representation (currently only average is available).

Article Object

  • id: either id or title must be specified
  • title: either id or title must be specified
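The parameter rules above can be checked on the client side before a request is sent. The helper below is a hypothetical sketch (the name `build_recommend_payload` is not part of the API); it only encodes the constraints listed on this page. The page does not say whether user_id and user_history may be sent together, so the sketch rejects only the case where neither is given.

```python
RANKINGS = {"deep-ranking", "cosine", "tf-ranking"}
AGGREGATORS = {"average"}


def build_recommend_payload(top_n, user_id=None, user_history=None,
                            ranking="deep-ranking", aggregator="average"):
    """Build the JSON payload for the recommend method as a plain dict."""
    if user_id is None and user_history is None:
        raise ValueError("one of user_id/user_history must be specified")
    if ranking not in RANKINGS:
        raise ValueError(f"unknown ranking: {ranking!r}")
    if aggregator not in AGGREGATORS:
        raise ValueError(f"unknown aggregator: {aggregator!r}")
    payload = {"top_n": int(top_n), "ranking": ranking, "aggregator": aggregator}
    if user_id is not None:
        payload["user_id"] = int(user_id)
    if user_history is not None:
        for article in user_history:
            # Article object: either id or title must be specified.
            if "id" not in article and "title" not in article:
                raise ValueError("each article needs an id or a title")
        payload["user_history"] = list(user_history)
    return payload
```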

Response

returns application/json

  • recommendation
    • articles: list[Article]: list of recommendations

Example

curl -X POST \
    http://<host>/api/en/v1/recommend \
    -H 'Content-Type: application/json' \
    -d '{
        "user_history": [
            {"title": "Graph"},
            {"title": "Convolutional_Neural_Network"}
        ],
        "top_n": 2,
        "ranking": "cosine"
    }'
    
{
    "recommendation": {
        "articles": [
            {
                "title": "Deep_learning",
                "page_id": 32472154
            },
            {
                "title": "Artificial_neural_network",
                "page_id": 21523
            }
        ]
    }
}
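
The same call can be sketched in Python with only the standard library. The helper names below are hypothetical, the host is a placeholder exactly as in the curl example, and the response key `recommendation` is taken from the Response section; treat this as an illustration rather than an official client.

```python
import json
import urllib.request


def make_recommend_request(host, edition, payload):
    """Prepare (but do not send) a POST request to the recommend method."""
    url = f"http://{host}/api/{edition}/v1/recommend"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_recommendations(body):
    """Extract the list of recommended article objects from a JSON response body."""
    return json.loads(body)["recommendation"]["articles"]
```

To send the request, pass the result of `make_recommend_request` to `urllib.request.urlopen`, read the response body, and hand it to `parse_recommendations`.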