Research talk:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases/Work log/2019-09-11

Thursday, September 12, 2019

This work log documents my current progress in expanding the ORES drafttopic model, which predicts topics for a given English Wikipedia article, to other languages on Wikipedia. The goal is to map any given Wikipedia article to one or more human-interpretable labels that identify what high-level topics relate to that article. These topics can then be used to understand reader behavior by mapping page views from millions of articles to a much smaller set of topics. In particular, for the debiasing and analysis of the reader demographic surveys, I have well over one million unique article page views across more than one hundred languages that need to be evaluated.

Example

If someone were to read the article for the Storm King Art Center, an outdoor sculpture garden in the Hudson Valley, New York, USA, we would want to map this page view to topics such as Culture (it is an outdoor art museum) and Geography (it has a physical location). Furthermore, we may want more fine-grained labels within Culture such as Art, and within Geography such as North America. One approach to doing this is through WikiProjects -- many hundreds of WikiProjects exist in English Wikipedia, each associated with a specific topic, and each WikiProject has added its template to the articles it believes to be important to its topic. In this case, the talk page for Storm King has been tagged with templates from WikiProject Museums, WikiProject Visual arts, WikiProject Hudson Valley, and WikiProject Public Art. Based on the WikiProject Directory, a mapping can be built between these specific WikiProjects and the higher-level categories that they belong to -- in this case: "Culture.Arts", "Culture.Plastic arts", "Culture.Visual arts", and "Geography.Americas".
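As a concrete sketch of that lookup in Python, the data structures below mirror the Storm King example; note that the per-WikiProject topic assignments here are my own illustrative guesses rather than taken from the Directory:

topics_by_wikiproject = {
    'wikiproject museums': ['Culture.Arts'],  # illustrative assignments, not the actual Directory mapping
    'wikiproject visual arts': ['Culture.Visual arts'],
    'wikiproject hudson valley': ['Geography.Americas'],
    'wikiproject wikipedia saves public art': ['Culture.Plastic arts'],
}
storm_king_wikiprojects = ['wikiproject museums', 'wikiproject visual arts',
                           'wikiproject hudson valley', 'wikiproject wikipedia saves public art']
topics = sorted({t for wp in storm_king_wikiprojects for t in topics_by_wikiproject[wp]})
# -> ['Culture.Arts', 'Culture.Plastic arts', 'Culture.Visual arts', 'Geography.Americas']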

In practice, mapping an article to the WikiProjects that have tagged it, and then to a list of topics, is not straightforward. In some cases, the Directory is not well-formed and WikiProjects can be inadvertently left out (e.g., WikiProject Europe and several others appear outside of the sections for Geography.Europe) or assigned to odd categories (e.g., a broken link for Cities of the United States can assign all WikiProjects under Geography.Americas to Geography.Cities instead). A given WikiProject can also use many different templates. For WikiProject Public Art, the template used with Storm King is actually WikiProject Wikipedia Saves Public Art rather than Template:WikiProject Public Art. And finally, while most English Wikipedia articles have been tagged by at least one WikiProject, any given WikiProject has likely not tagged many articles that reasonably might be associated with its topic area.

Why WikiProjects?

No taxonomy of topics will be perfect, and mapping all of Wikipedia to ~50 topics is an incredibly reductive task. This work uses a taxonomy of topics based on WikiProjects from English Wikipedia. This naturally raises concerns about whether these topics are appropriate for other language editions, but I have not encountered a clearly superior taxonomy for Wikipedia articles and appreciate that the WikiProjects taxonomy is easily derived and modifiable. Further details can be found in the initial paper written about the drafttopic model[1].

Additional models based purely on Wikidata were also explored but abandoned because the Wikidata instance-of / subclass taxonomy does not map closely enough to more general topic taxonomies. Wikipedia categories are famously difficult to map to a coherent taxonomy and do not readily scale across languages either. Outside taxonomies such as DBpedia introduce additional data processing complexities. More details are contained within the See Also section below for those who are interested in exploring these alternatives.

Modeling

While looking up the WikiProjects that have tagged an article works well for long-standing articles on English Wikipedia, some method is needed for automatically inferring these topics when articles are new or outside of English Wikipedia. That is, a model needs to be built that can predict what topics should be applied to any given Wikipedia article.

Existing drafttopic model

The existing ORES drafttopic model takes the text of a page, represents each word via word embeddings, and makes predictions based on an average of these word embeddings. This allows it to capture the nuances of language while not requiring the article to already have much of the structure (Wikidata, links) that established articles on Wikipedia often have. This approach was taken so that the model would be applicable to drafts of new articles. It has the drawback, however, of being difficult to scale to other languages. Approaches such as multilingual word embeddings are still largely unproven and bring many other challenges: preprocessing, loading the rather large word embeddings into memory, and obtaining the text for each article (which is nontrivial when analyzing over one million articles across more than 100 language editions).
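For intuition, here is a minimal sketch (not the actual drafttopic code) of this bag-of-embeddings representation, using stand-in random vectors:

import numpy as np

def article_vector(tokens, embeddings, dim=50):
    # average the embeddings of all in-vocabulary words in the article text
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

embeddings = {'sculpture': np.random.rand(50), 'garden': np.random.rand(50)}  # stand-in vectors
features = article_vector(['sculpture', 'garden', 'outdoor'], embeddings)  # input to a classifier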

Wikidata Model

For my particular context of representing page views to existing Wikipedia articles, I am not restricted to just article text. To avoid building a separate language model for each Wikipedia, I choose to represent a given article not by its text but by the statements on its associated Wikidata item. This is naturally language-independent and, intuitively, many Wikidata statements map directly to topics (e.g., an item with the occupation property and physician value should probably fall under STEM.Medicine). I treat these Wikidata statements like a bag of words.
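For example, Douglas Adams (Q42) might be represented as follows (an illustrative sketch; the claim tuples match the format built by the Wikidata script further below):

claims = [('P31', 'Q5'), ('P106', 'Q36180'), ('P18',)]  # instance-of human; occupation writer; image (property only)
bag_of_words = ' '.join(' '.join(claim) for claim in claims)
# -> 'P31 Q5 P106 Q36180 P18'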

Gathering training data

I wrote a script that loops through the dump of the current version of English Wikipedia, checks pages in the article talk namespace (1), and retains any talk page that has templates whose names include "wp" or "wikiproject".

Get English Wikipedia articles tagged with WikiProject templates
import bz2
import json
import re

import mwxml
import mwparserfromhell

def norm_wp_name(wp):
    return re.sub(r"\s\s+", " ", wp.strip().lower().replace("wikipedia:", "").replace('_', ' '))

dump_fn = '/mnt/data/xmldatadumps/public/enwiki/20190701/enwiki-20190701-pages-meta-current.xml.bz2'
output_json = 'wp_templates_by_article.json'
articles_kept = 0
processed = 0

with open(output_json, 'w') as fout:
    dump = mwxml.Dump.from_file(bz2.open(dump_fn, 'rt'))
    for page in dump:
        # talk pages for existing articles
        if page.namespace == 1 and page.redirect is None:
            # get templates from most recent revision
            rev = next(page)
            wikitext = mwparserfromhell.parse(rev.text)
            templates = wikitext.filter_templates()
            # retain templates possibly related to WikiProjects
            possible_wp_tmps = []
            for t in templates:
                template_name = norm_wp_name(t.name.strip_code())
                if 'wp' in template_name or 'wikiproject' in template_name:
                    possible_wp_tmps.append(template_name)
            # output talk pages with at least one potential WikiProject template
            if possible_wp_tmps:
                page_json = {'talk_page_id': page.id, 'talk_page_title': page.title, 'rev_id':rev.id,
                             'templates':possible_wp_tmps}
                fout.write(json.dumps(page_json) + '\n')
                articles_kept += 1
                
            processed += 1
            if processed % 100000 == 0:
                print("{0} processed. {1} kept.".format(processed, articles_kept))
                
print("Finished: {0} processed. {1} kept.".format(processed, articles_kept))

These templates are mapped to topics via the existing drafttopic code (with a few adjustments to clean the directory).

Map WikiProject templates to topics
def get_wp_to_mlc(mid_level_categories_json):
    """Build map of WikiProject template names to mid-level categories.
    
    NOTE: mid_level_categories_json generated as outmid from: https://github.com/wikimedia/drafttopic

    Parameters:
        mid_level_categories_json: JSON file with structure like:
        {
            "wikiprojects": {
                "Culture.Music": [
                    "Wikipedia:WikiProject Music",
                    "Wikipedia:WikiProject Music terminology",
                    "Wikipedia:WikiProject Music theory",
                    ...
    Returns:
        dictionary of wikiproject template names mapped to their mid-level categories. For example:
        {'wikiproject music': ['Culture.Music'],
         'wikiproject cycling': ['History_And_Society.Transportation', 'Culture.Sports'],
         ...
        }
    """
    with open(mid_level_categories_json, 'r') as fin:
        mlc_dir = json.load(fin)

    wp_to_mlc = {}
    for mlc in mlc_dir['wikiprojects']:
        # for each WikiProject name, build standard set of template names
        for wp in mlc_dir['wikiprojects'][mlc]:  # e.g., "Wikipedia:WikiProject Trains "
            normed_name = norm_wp_name(wp)  # e.g., "wikiproject trains"
            short_name = normed_name.replace("wikiproject", "wp")  # e.g., "wp trains"
            shorter_name = short_name.replace(" ", "")  # e.g., "wptrains"
            flipped_name = normed_name.replace("wikiproject", "").replace(" ", "") + "wikiproject"  # e.g., "trainswikiproject"
            wp_to_mlc[normed_name] = wp_to_mlc.get(normed_name, []) + [mlc]
            wp_to_mlc[short_name] = wp_to_mlc.get(short_name, []) + [mlc]
            wp_to_mlc[shorter_name] = wp_to_mlc.get(shorter_name, []) + [mlc]
            wp_to_mlc[flipped_name] = wp_to_mlc.get(flipped_name, []) + [mlc]

    # common templates that do not fit standard patterns
    one_offs = {'wpmilhist':'wikiproject military history',
                'wikiproject elections and referendums':'wikiproject elections and referenda',
                'wpmed':'wikiproject medicine',
                'wikiproject mcb':'wikiproject molecular and cell biology',
                'wikiproject palaeontology':'wikiproject paleontology',
                'u.s. roads wikiproject':'wikiproject u.s. roads',
                'wpjournals':'wikiproject academic journals',
                'wpcoop':'wikiproject cooperatives',
                'wpukgeo':'wikiproject uk geography',
                'wikiproject finance':'wikiproject finance & investment',
                'wpphilippines':'wikiproject tambayan philippines',
                'wikiproject philippines':'wikiproject tambayan philippines',
                'wptr': 'wikiproject turkey',
                'wikiproject molecular and cellular biology':'wikiproject molecular and cell biology',
                'wp uk politics':'wikiproject politics of the united kingdom',
                'wikiproject nrhp':'wikiproject national register of historic places',
                'wikiproject crime':'wikiproject crime and criminal biography',
                'wpj':'wikiproject japan',
                'wikiproject awards':'wikiproject awards and prizes',
                'wikiprojectsongs':'wikiproject songs',
                'wpuk':'wikiproject united kingdom',
                'wpbio':'wikiproject biography'}

    for tmp_name, mapped_to in one_offs.items():
        wp_to_mlc[tmp_name] = wp_to_mlc[mapped_to]

    return wp_to_mlc
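Joining the two outputs above is then straightforward; here is a sketch (the directory file name is illustrative, and the Talk: prefix handling is an assumption about how mwxml reports titles):

def label_articles(templates_json='wp_templates_by_article.json',
                   mid_level_categories_json='mid_level_categories.json'):
    """Sketch: map each article to the union of topics for its WikiProject templates."""
    wp_to_mlc = get_wp_to_mlc(mid_level_categories_json)
    title_to_topics = {}
    with open(templates_json, 'r') as fin:
        for line in fin:
            page = json.loads(line)
            topics = set()
            for tmpl in page['templates']:
                topics.update(wp_to_mlc.get(tmpl, []))
            if topics:
                # strip the namespace prefix, if present, to recover the article title
                title = page['talk_page_title'].replace('Talk:', '', 1)
                title_to_topics[title] = sorted(topics)
    return title_to_topics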

Another script then loops through the Wikidata JSON dump and maps each talk page (and its topics) to a Wikidata item and its associated claims, joining on the title or, if available, the QID.

Get Wikidata properties for set of QIDs or articles
import bz2
import json

def get_wd_claims(qids=None, titles=None):
    """Get Wikidata properties for set of QIDs.

    NOTE: This process takes ~24 hours. It will return all QIDs that were found,
    which may be less than the total number searched for.

    All properties are retained for statements. Values are also retained if they are Wikidata items.
    In practice, this means that statements like 'instance-of' are both property and value while
    statements like 'coordinate location' or 'image' are just represented by their property.

    Args:
        qids: dictionary or set of QIDs to retain -- e.g., {'Q42', 'Q3107329', ...}
        titles: dictionary or set of English Wikipedia titles to retain -- e.g., {'Svetlana Alexievich', 'Kitten', ...}
    Returns:
        dictionary of QID to list of claims tuples and corresponding dictionary of QID to title. For example:

        {'Q42': [('P31', 'Q5'), ('P18', ), ('P21', 'Q6581097'), ...],
         'Q3107329': [('P31', 'Q47461344'), ...],
         ...
        },
        {'Q42': 'Douglas Adams', 'Q3107329': "The Hitchhiker's Guide to the Galaxy (novel)", ...}

    """
    dump_fn = '/mnt/data/xmldatadumps/public/wikidatawiki/entities/20190819/wikidata-20190819-all.json.bz2'
    items_found = 0
    qid_to_claims = {}
    qid_to_title = {}

    print("Building QID->properties map from {0}".format(dump_fn))
    with bz2.open(dump_fn, 'rt') as fin:
        next(fin)
        for idx, line in enumerate(fin, start=1):
            # load line as JSON minus newline and any trailing comma
            try:
                item_json = json.loads(line[:-2])
            except Exception:
                try:
                    item_json = json.loads(line)
                except Exception:
                    print("Error:", idx, line)
                    continue
            if idx % 100000 == 0:
                print("{0} lines processed. {1} kept.".format(
                    idx, items_found))

            qid = item_json.get('id', None)
            if not qid or (qids is not None and qid not in qids):
                continue
            en_title = item_json.get('sitelinks', {}).get('enwiki', {}).get('title', None)
            if titles is not None and en_title not in titles:
                continue

            claims = item_json.get('claims', {})
            claim_tuples = []
            # each property, such as P31 (instance-of); avoid shadowing the built-in `property`
            for prop in claims:
                included = False
                # each value under that property -- e.g., instance-of might have three different values
                for statement in claims[prop]:
                    try:
                        if statement['type'] == 'statement' and statement['mainsnak']['datatype'] == 'wikibase-item':
                            claim_tuples.append((prop, statement['mainsnak']['datavalue']['value']['id']))
                            included = True
                    except Exception:
                        continue
                if not included:
                    claim_tuples.append((prop, ))
            if not claim_tuples:
                claim_tuples = [('<NOCLAIM>', )]
            items_found += 1
            qid_to_claims[qid] = claim_tuples
            qid_to_title[qid] = en_title

    return qid_to_claims, qid_to_title
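From there, the labeled articles and their claims can be written to the JSON-lines file that the conversion script below expects. A sketch, where title_to_topics is the mapping from the earlier sketch and the output file name is illustrative:

def write_training_data(title_to_topics, out_fn='wikidata_training_data.json'):
    """Sketch: one JSON object per line with the QID, claims, and
    mid_level_categories columns consumed by to_dataframe() below."""
    qid_to_claims, qid_to_title = get_wd_claims(titles=set(title_to_topics))
    with open(out_fn, 'w') as fout:
        for qid, title in qid_to_title.items():
            record = {'QID': qid,
                      'claims': qid_to_claims[qid],
                      'mid_level_categories': title_to_topics[title]}
            fout.write(json.dumps(record) + '\n')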

Building supervised model

I use fastText to build a model that predicts a given Wikidata item's topics based on its claims. A more complete description of fastText and how to build this model is contained within this PAWS notebook; a minimal training sketch also appears after the conversion script below. Notably, there is some pre-processing to go from the JSON files output above to fastText-ready files.

JSON -> fastText format
import argparse
import os
from random import sample

import pandas as pd


def to_dataframe(data_fn):
    """Parses Wikidata claims for fastText processing"""
    print("Converting {0} -> fastText format.".format(data_fn))
    data = pd.read_json(data_fn, lines=True)
    data.set_index('QID', inplace=True)
    data = data.sample(frac=1, replace=False)
    return data

def wikidata_to_fasttext(data, fasttext_datafn, fasttext_readme):
    """Write xy-data and associated metadata to respective files."""
    qid_to_metadata = {}
    potential_metadata_cols = [c for c in data.columns if c not in ('claims', 'mid_level_categories')]
    if potential_metadata_cols:
        print("Metadata columns: {0}".format(potential_metadata_cols))
        for qid, row in data.iterrows():
            qid_to_metadata[qid] = {}
            for c in potential_metadata_cols:
                qid_to_metadata[qid][c] = row[c]

    if 'mid_level_categories' in data.columns:
        y_corpus = data["mid_level_categories"]
    else:
        y_corpus = None

    x_corpus = data["claims"].apply(
        lambda row: " ".join([' '.join(pair) for pair in sample(row, len(row))]))

    write_fasttext(x_corpus, fasttext_datafn, fasttext_readme, y_corpus, qid_to_metadata)


def write_fasttext(x_data, data_fn, readme_fn, y_data=None, qid_to_metadata=None):
    """Write data in fastText format."""
    written = 0
    skipped = 0
    no_claims = 0
    with open(readme_fn, 'w') as readme_fout:
        with open(data_fn, 'w') as data_fout:
            for qid, claims in x_data.iteritems():
                if not len(claims):
                    no_claims += 1
                    claims = '<NOCLAIM>'
                if y_data is not None:
                    lbls = y_data.loc[qid]
                    if len(lbls):
                        mlcs = ' '.join(['__label__{0}'.format(c.replace(" ", "_")) for c in lbls])
                        data_fout.write("{0} {1}\n".format(mlcs, claims))
                    else:
                        skipped += 1
                        continue
                else:
                    data_fout.write("{0}\n".format(claims))

                if qid_to_metadata:
                    readme_fout.write("{0}\t{1}\n".format(qid, qid_to_metadata.get(qid, {})))
                else:
                    readme_fout.write("{0}\n".format(qid))
                written += 1

    print("{0} data points written to {1} and {2}. {3} skipped and {4} w/o claims.".format(written, data_fn, readme_fn,
                                                                                               skipped, no_claims))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_fn")
    # label used in the output filenames below; the default here is an assumption
    parser.add_argument("--join_approach", default="claims")
    parser.add_argument("--train_prop", type=float, default=1.)
    parser.add_argument("--val_prop", type=float, default=0.)
    parser.add_argument("--test_prop", type=float, default=0.)
    args = parser.parse_args()

    if args.train_prop + args.val_prop + args.test_prop != 1:
        raise ValueError("Train/Val/Test proportions must sum to 1.")

    data = to_dataframe(args.data_fn)
    base_fn = os.path.splitext(args.data_fn)[0]

    if args.train_prop == 1:
        fasttext_datafn = '{0}_{1}_data.txt'.format(base_fn, args.join_approach)
        fasttext_readme = '{0}_{1}_qids.txt'.format(base_fn, args.join_approach)
        wikidata_to_fasttext(data, fasttext_datafn, fasttext_readme)

    else:
        train_idx = int(len(data) * args.train_prop)
        val_idx = train_idx + int(len(data) * args.val_prop)

        if train_idx > 0:
            train_data = data[:train_idx]
            print("{0} training datapoints.".format(len(train_data)))
            fasttext_datafn = '{0}_{1}_train_{2}_data.txt'.format(base_fn, args.join_approach, len(train_data))
            fasttext_readme = '{0}_{1}_train_{2}_qids.txt'.format(base_fn, args.join_approach, len(train_data))
            wikidata_to_fasttext(train_data, fasttext_datafn, fasttext_readme)

        if val_idx > train_idx:
            val_data = data[train_idx:val_idx]
            print("{0} validation datapoints.".format(len(val_data)))
            fasttext_datafn = '{0}_{1}_val_{2}_data.txt'.format(base_fn, args.join_approach, len(val_data))
            fasttext_readme = '{0}_{1}_val_{2}_qids.txt'.format(base_fn, args.join_approach, len(val_data))
            wikidata_to_fasttext(val_data, fasttext_datafn, fasttext_readme)

        if val_idx < len(data):
            test_data = data[val_idx:]
            print("{0} test datapoints.".format(len(test_data)))
            fasttext_datafn = '{0}_{1}_test_{2}_data.txt'.format(base_fn, args.join_approach, len(test_data))
            fasttext_readme = '{0}_{1}_test_{2}_qids.txt'.format(base_fn, args.join_approach, len(test_data))
            wikidata_to_fasttext(test_data, fasttext_datafn, fasttext_readme)
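With those files written, training and evaluating the model with the fastText Python bindings is short. A minimal sketch using the final hyperparameters reported below (the file names are illustrative):

import fasttext

# one-vs-all loss treats each topic as an independent binary decision (multilabel)
model = fasttext.train_supervised(input='wikidata_train_data.txt',
                                  lr=0.1, dim=50, minCount=3, epoch=30,
                                  ws=5,  # ws=5 is the fastText default; the chosen value isn't reported
                                  loss='ova')

# precision/recall at 1 on the held-out file
n, p_at_1, r_at_1 = model.test('wikidata_test_data.txt')

# predict all topics above a probability threshold for a single item's claims
labels, probs = model.predict('P31 Q5 P106 Q36180', k=-1, threshold=0.5)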

Performance

As a baseline, this Wikidata model is compared to the existing drafttopic model for English Wikipedia (more drafttopic statistics here). Notably, this is not an apples-to-apples comparison because the Wikidata model is trained on a much larger and different dataset that is much less balanced than the drafttopic dataset. For both models, the reported false negative rate is inflated for many classes due to the sparsity of WikiProject templating -- i.e., there are generally many articles that reasonably fit within a given WikiProject but are not labeled as such.

Grid search was used to determine the best choice of fastText hyperparameters. This demonstrated that model performance is largely robust to the specific choices, though higher embedding dimensionality, learning rates, and numbers of epochs led to greater overfitting.
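The search itself is just a loop over the candidate values. A sketch, assuming the training setup above; f1_scores is a hypothetical helper that predicts on a file and computes micro/macro F1 against the __label__ ground truth:

from itertools import product

grid = {'dim': [50, 100], 'epoch': [10, 20, 30], 'lr': [0.05, 0.1, 0.2],
        'minCount': [3, 5, 10], 'ws': [5, 10, 20]}

results = []
for dim, epoch, lr, min_count, ws in product(*grid.values()):
    model = fasttext.train_supervised(input='wikidata_train_data.txt',
                                      dim=dim, epoch=epoch, lr=lr,
                                      minCount=min_count, ws=ws, loss='ova')
    results.append(((dim, epoch, lr, min_count, ws),
                    f1_scores(model, 'wikidata_train_data.txt'),   # train micro/macro F1
                    f1_scores(model, 'wikidata_val_data.txt')))    # val micro/macro F1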

Results from grid search (see the fastText documentation for hyperparameter definitions); one row per configuration:
dim epoch lr minCount ws train_micro_f1 val_micro_f1 train_macro_f1 val_macro_f1
50 10 0.05 3 5 0.815 0.809 0.64 0.627
50 10 0.05 3 10 0.815 0.808 0.64 0.624
50 10 0.05 3 20 0.815 0.808 0.64 0.626
50 20 0.05 3 5 0.822 0.812 0.661 0.635
50 20 0.05 3 10 0.822 0.811 0.661 0.634
50 20 0.05 3 20 0.822 0.811 0.66 0.634
50 30 0.05 3 5 0.827 0.813 0.674 0.64
50 30 0.05 3 10 0.826 0.813 0.674 0.639
50 30 0.05 3 20 0.826 0.812 0.673 0.638
50 10 0.05 5 5 0.813 0.808 0.637 0.624
50 10 0.05 5 10 0.814 0.808 0.637 0.625
50 10 0.05 5 20 0.814 0.808 0.637 0.625
50 20 0.05 5 5 0.82 0.811 0.655 0.633
50 20 0.05 5 10 0.82 0.811 0.655 0.632
50 20 0.05 5 20 0.82 0.811 0.655 0.632
50 30 0.05 5 5 0.823 0.812 0.666 0.638
50 30 0.05 5 10 0.823 0.812 0.666 0.637
50 30 0.05 5 20 0.823 0.812 0.666 0.638
50 10 0.05 10 5 0.811 0.807 0.632 0.62
50 10 0.05 10 10 0.811 0.807 0.632 0.62
50 10 0.05 10 20 0.811 0.807 0.632 0.621
50 20 0.05 10 5 0.816 0.809 0.647 0.629
50 20 0.05 10 10 0.816 0.809 0.647 0.628
50 20 0.05 10 20 0.816 0.809 0.647 0.629
50 30 0.05 10 5 0.818 0.81 0.655 0.631
50 30 0.05 10 10 0.818 0.81 0.655 0.632
50 30 0.05 10 20 0.818 0.81 0.655 0.632
100 10 0.05 3 5 0.815 0.809 0.641 0.625
100 10 0.05 3 10 0.815 0.808 0.64 0.626
100 10 0.05 3 20 0.815 0.808 0.641 0.627
100 20 0.05 3 5 0.822 0.811 0.662 0.635
100 20 0.05 3 10 0.822 0.811 0.662 0.634
100 20 0.05 3 20 0.822 0.812 0.662 0.636
100 30 0.05 3 5 0.827 0.813 0.675 0.639
100 30 0.05 3 10 0.827 0.813 0.675 0.64
100 30 0.05 3 20 0.826 0.812 0.674 0.639
100 10 0.05 5 5 0.814 0.808 0.639 0.624
100 10 0.05 5 10 0.814 0.808 0.638 0.625
100 10 0.05 5 20 0.814 0.808 0.638 0.624
100 20 0.05 5 5 0.82 0.811 0.656 0.633
100 20 0.05 5 10 0.82 0.811 0.657 0.633
100 20 0.05 5 20 0.82 0.811 0.656 0.632
100 30 0.05 5 5 0.823 0.812 0.667 0.638
100 30 0.05 5 10 0.823 0.812 0.667 0.637
100 30 0.05 5 20 0.823 0.812 0.667 0.637
100 10 0.05 10 5 0.812 0.807 0.634 0.622
100 10 0.05 10 10 0.811 0.807 0.634 0.622
100 10 0.05 10 20 0.811 0.807 0.633 0.62
100 20 0.05 10 5 0.816 0.809 0.648 0.629
100 20 0.05 10 10 0.816 0.809 0.648 0.629
100 20 0.05 10 20 0.816 0.809 0.647 0.629
100 30 0.05 10 5 0.818 0.81 0.656 0.633
100 30 0.05 10 10 0.818 0.81 0.655 0.632
100 30 0.05 10 20 0.818 0.81 0.656 0.632
50 10 0.1 3 5 0.817 0.809 0.65 0.631
50 10 0.1 3 10 0.817 0.81 0.65 0.631
50 10 0.1 3 20 0.817 0.809 0.65 0.631
50 20 0.1 3 5 0.824 0.812 0.67 0.638
50 20 0.1 3 10 0.824 0.812 0.67 0.638
50 20 0.1 3 20 0.824 0.812 0.671 0.638
50 30 0.1 3 5 0.828 0.813 0.683 0.639
50 30 0.1 3 10 0.829 0.813 0.683 0.64
50 30 0.1 3 20 0.828 0.813 0.683 0.64
50 10 0.1 5 5 0.815 0.809 0.646 0.63
50 10 0.1 5 10 0.815 0.809 0.646 0.629
50 10 0.1 5 20 0.815 0.809 0.647 0.63
50 20 0.1 5 5 0.821 0.811 0.663 0.636
50 20 0.1 5 10 0.821 0.811 0.663 0.636
50 20 0.1 5 20 0.821 0.811 0.663 0.635
50 30 0.1 5 5 0.825 0.812 0.673 0.637
50 30 0.1 5 10 0.825 0.812 0.674 0.638
50 30 0.1 5 20 0.824 0.812 0.673 0.636
50 10 0.1 10 5 0.813 0.808 0.64 0.625
50 10 0.1 10 10 0.813 0.808 0.641 0.625
50 10 0.1 10 20 0.813 0.808 0.641 0.626
50 20 0.1 10 5 0.817 0.81 0.654 0.631
50 20 0.1 10 10 0.817 0.809 0.653 0.631
50 20 0.1 10 20 0.817 0.81 0.653 0.632
50 30 0.1 10 5 0.82 0.81 0.661 0.633
50 30 0.1 10 10 0.819 0.81 0.661 0.633
50 30 0.1 10 20 0.819 0.81 0.661 0.633
100 10 0.1 3 5 0.817 0.809 0.651 0.631
100 10 0.1 3 10 0.817 0.809 0.651 0.631
100 10 0.1 3 20 0.817 0.809 0.651 0.631
100 20 0.1 3 5 0.824 0.812 0.671 0.639
100 20 0.1 3 10 0.824 0.812 0.67 0.636
100 20 0.1 3 20 0.824 0.812 0.671 0.637
100 30 0.1 3 5 0.828 0.813 0.683 0.641
100 30 0.1 3 10 0.828 0.813 0.682 0.641
100 30 0.1 3 20 0.829 0.813 0.684 0.641
100 10 0.1 5 5 0.816 0.809 0.648 0.632
100 10 0.1 5 10 0.815 0.809 0.648 0.631
100 10 0.1 5 20 0.816 0.809 0.648 0.63
100 20 0.1 5 5 0.821 0.811 0.664 0.636
100 20 0.1 5 10 0.821 0.811 0.664 0.636
100 20 0.1 5 20 0.821 0.811 0.664 0.637
100 30 0.1 5 5 0.825 0.812 0.674 0.637
100 30 0.1 5 10 0.825 0.812 0.674 0.637
100 30 0.1 5 20 0.825 0.812 0.674 0.637
100 10 0.1 10 5 0.813 0.808 0.641 0.626
100 10 0.1 10 10 0.813 0.808 0.641 0.625
100 10 0.1 10 20 0.813 0.808 0.642 0.627
100 20 0.1 10 5 0.817 0.81 0.654 0.633
100 20 0.1 10 10 0.817 0.81 0.654 0.631
100 20 0.1 10 20 0.817 0.809 0.653 0.631
100 30 0.1 10 5 0.819 0.81 0.661 0.634
100 30 0.1 10 10 0.819 0.81 0.661 0.634
100 30 0.1 10 20 0.819 0.81 0.661 0.633
50 10 0.2 3 5 0.82 0.811 0.658 0.636
50 10 0.2 3 10 0.819 0.81 0.658 0.635
50 10 0.2 3 20 0.819 0.81 0.657 0.636
50 20 0.2 3 5 0.826 0.812 0.677 0.639
50 20 0.2 3 10 0.827 0.813 0.678 0.641
50 20 0.2 3 20 0.827 0.813 0.678 0.641
50 30 0.2 3 5 0.831 0.813 0.69 0.64
50 30 0.2 3 10 0.831 0.813 0.689 0.641
50 30 0.2 3 20 0.831 0.813 0.689 0.64
50 10 0.2 5 5 0.817 0.81 0.653 0.633
50 10 0.2 5 10 0.817 0.81 0.654 0.634
50 10 0.2 5 20 0.818 0.81 0.653 0.633
50 20 0.2 5 5 0.823 0.812 0.67 0.637
50 20 0.2 5 10 0.823 0.812 0.67 0.637
50 20 0.2 5 20 0.823 0.812 0.67 0.636
50 30 0.2 5 5 0.826 0.812 0.679 0.637
50 30 0.2 5 10 0.826 0.812 0.679 0.637
50 30 0.2 5 20 0.827 0.812 0.68 0.637
50 10 0.2 10 5 0.814 0.809 0.646 0.63
50 10 0.2 10 10 0.814 0.808 0.646 0.63
50 10 0.2 10 20 0.815 0.808 0.646 0.629
50 20 0.2 10 5 0.819 0.81 0.658 0.633
50 20 0.2 10 10 0.819 0.81 0.658 0.633
50 20 0.2 10 20 0.818 0.81 0.658 0.632
50 30 0.2 10 5 0.821 0.81 0.665 0.631
50 30 0.2 10 10 0.821 0.81 0.665 0.632
50 30 0.2 10 20 0.821 0.81 0.665 0.631
100 10 0.2 3 5 0.819 0.81 0.658 0.638
100 10 0.2 3 10 0.819 0.81 0.658 0.636
100 10 0.2 3 20 0.819 0.81 0.659 0.637
100 20 0.2 3 5 0.827 0.813 0.678 0.639
100 20 0.2 3 10 0.827 0.813 0.678 0.64
100 20 0.2 3 20 0.827 0.813 0.679 0.639
100 30 0.2 3 5 0.831 0.813 0.69 0.639
100 30 0.2 3 10 0.831 0.813 0.69 0.641
100 30 0.2 3 20 0.831 0.813 0.69 0.639
100 10 0.2 5 5 0.817 0.81 0.653 0.634
100 10 0.2 5 10 0.817 0.81 0.653 0.633
100 10 0.2 5 20 0.818 0.81 0.655 0.635
100 20 0.2 5 5 0.823 0.812 0.671 0.637
100 20 0.2 5 10 0.823 0.812 0.671 0.638
100 20 0.2 5 20 0.823 0.812 0.67 0.636
100 30 0.2 5 5 0.826 0.812 0.679 0.637
100 30 0.2 5 10 0.826 0.812 0.679 0.637
100 30 0.2 5 20 0.826 0.812 0.679 0.638
100 10 0.2 10 5 0.814 0.808 0.647 0.632
100 10 0.2 10 10 0.814 0.808 0.647 0.63
100 10 0.2 10 20 0.814 0.808 0.646 0.63
100 20 0.2 10 5 0.819 0.81 0.659 0.633
100 20 0.2 10 10 0.819 0.81 0.658 0.631
100 20 0.2 10 20 0.818 0.81 0.658 0.632
100 30 0.2 10 5 0.821 0.81 0.665 0.632
100 30 0.2 10 10 0.821 0.81 0.666 0.633
100 30 0.2 10 20 0.821 0.81 0.666 0.634

Based on the grid search, a model built with the following hyperparameters was evaluated on the test set: lr of 0.1, dim of 50, minCount of 3, and 30 epochs:

Model       Micro Precision  Macro Precision  Micro Recall  Macro Recall  Micro F1  Macro F1
drafttopic  0.826            0.811            0.576         0.554         0.668     0.643
Wikidata    0.881            0.809            0.762         0.560         0.811     0.643

Qualitative

To explore this Wikidata-based model, you can query it via a local API as described in this code repository: https://github.com/geohci/wikidata-topic-model

See Also

References

  1. Asthana, Sumit; Halfaker, Aaron (November 2018). "With Few Eyes, All Hoaxes Are Deep". Proc. ACM Hum.-Comput. Interact. 2 (CSCW): 21:1–21:18. ISSN 2573-0142. doi:10.1145/3274290. 
Return to "Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases/Work log/2019-09-11" page.