Research:Language-Agnostic Topic Classification/Wikidata model performance
This page provides details about the current (as of May 2020) model for Wikidata topic classification based on statements:
- Code: https://github.com/geohci/wikidata-topic-model
- Model architecture: multi-label fastText supervised model
- Training data: 4,911,836 Wikidata items that have English sitelinks (and corresponding topic labels)
- Epochs: 25
- Learning rate: 0.1
- Window size: 20
- Min count (QIDs that occur fewer times than this are not retained in the vocab): 3
- No pre-trained embeddings used
- Embeddings dimension: 50
- Total number of model params, excluding embeddings: 3,200 (50 x 64, i.e. embeddings dimension x number of topic labels)
- Vocab size: 429,206
- Total number of embeddings params: 21,460,300 (vocab size * embeddings dimension)
- Model size on disk: 89.1 MB
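
The parameter counts in the list above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes that the 64 in "50 x 64" is the number of topic labels (the label-specific table on this page has 64 rows) and that weights are stored as 32-bit floats, neither of which is stated explicitly on this page.

```python
# Figures copied from the list above.
EMBEDDING_DIM = 50
VOCAB_SIZE = 429_206

# Assumption: 64 is the number of topic labels (rows in the label-specific table).
NUM_LABELS = 64
# Assumption: each parameter is stored as a 32-bit (4-byte) float.
BYTES_PER_PARAM = 4

output_params = EMBEDDING_DIM * NUM_LABELS
embedding_params = VOCAB_SIZE * EMBEDDING_DIM
total_params = output_params + embedding_params
approx_weight_mb = total_params * BYTES_PER_PARAM / 1e6

print(f"output layer params: {output_params:,}")      # 3,200
print(f"embedding params:    {embedding_params:,}")   # 21,460,300
print(f"all params:          {total_params:,}")       # 21,463,500
print(f"approx. weight size: {approx_weight_mb:.1f} MB")  # 85.9 MB
```

Under these assumptions the weights alone come to a little less than the reported 89.1 MB. The file presumably also stores non-weight data such as the vocabulary, though this page does not break that down.
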
Test Results
The test results below have a significant limitation: there is no labeled data for Wikidata items that do not have English Wikipedia sitelinks. It is very possible that performance degrades for these items, particularly if they also have different types of properties and values.
Overall performance:
- Precision: 0.886 micro; 0.837 macro
- Recall: 0.764 micro; 0.629 macro
- F1: 0.817 micro; 0.712 macro
- Average precision: 0.872 micro; 0.746 macro
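
The gap between the micro and macro figures comes from how the per-label results are combined. The sketch below assumes the standard definitions (micro pools the counts across labels before dividing, macro averages the per-label scores), since this page does not state how the numbers were computed. It uses only two labels from the label-specific table on this page, so its output illustrates the effect and does not reproduce the overall figures.

```python
from typing import Dict, Tuple

# (TP, FP, FN) for one large and one small label, copied from the
# label-specific table on this page.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "Culture.Media.Films": (7409, 398, 751),
    "STEM.Mathematics": (74, 32, 240),
}


def micro_precision_recall(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Pool TP/FP/FN over all labels, then compute precision and recall once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return tp / (tp + fp), tp / (tp + fn)


def macro_precision_recall(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Compute precision and recall per label, then take the unweighted mean."""
    precisions = [tp / (tp + fp) for tp, fp, _ in counts.values()]
    recalls = [tp / (tp + fn) for tp, _, fn in counts.values()]
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


micro_p, micro_r = micro_precision_recall(COUNTS)
macro_p, macro_r = macro_precision_recall(COUNTS)
print(f"micro: precision={micro_p:.3f} recall={micro_r:.3f}")
print(f"macro: precision={macro_p:.3f} recall={macro_r:.3f}")
```

Because the small, harder label counts as much as the large one in the macro average, the macro scores come out lower, which is the same pattern seen in the overall results.
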
Label-specific:
topic | n | TP | FP | TN | FN | precision | recall | f1 | avg_pre |
---|---|---|---|---|---|---|---|---|---|
Culture.Biography.Biography* | 85659 | 83162 | 2455 | 175965 | 2497 | 0.971 | 0.971 | 0.971 | 0.990 |
Culture.Biography.Women | 10669 | 7693 | 2531 | 250879 | 2976 | 0.752 | 0.721 | 0.736 | 0.801 |
Culture.Food and drink | 1738 | 823 | 166 | 262175 | 915 | 0.832 | 0.474 | 0.604 | 0.586 |
Culture.Internet culture | 2326 | 1646 | 154 | 261599 | 680 | 0.914 | 0.708 | 0.798 | 0.816 |
Culture.Linguistics | 4705 | 3454 | 246 | 259128 | 1251 | 0.934 | 0.734 | 0.822 | 0.863 |
Culture.Literature | 8879 | 6314 | 840 | 254360 | 2565 | 0.883 | 0.711 | 0.788 | 0.840 |
Culture.Media.Books | 2924 | 2448 | 251 | 260904 | 476 | 0.907 | 0.837 | 0.871 | 0.892 |
Culture.Media.Entertainment | 2500 | 1207 | 279 | 261300 | 1293 | 0.812 | 0.483 | 0.606 | 0.632 |
Culture.Media.Films | 8160 | 7409 | 398 | 255521 | 751 | 0.949 | 0.908 | 0.928 | 0.948 |
Culture.Media.Media* | 35326 | 29419 | 3174 | 225579 | 5907 | 0.903 | 0.833 | 0.866 | 0.935 |
Culture.Media.Music | 13236 | 11143 | 1384 | 249459 | 2093 | 0.890 | 0.842 | 0.865 | 0.932 |
Culture.Media.Radio | 1434 | 1108 | 47 | 262598 | 326 | 0.959 | 0.773 | 0.856 | 0.841 |
Culture.Media.Software | 754 | 190 | 117 | 263208 | 564 | 0.619 | 0.252 | 0.358 | 0.390 |
Culture.Media.Television | 5156 | 3765 | 493 | 258430 | 1391 | 0.884 | 0.730 | 0.800 | 0.839 |
Culture.Media.Video games | 1800 | 1505 | 43 | 262236 | 295 | 0.972 | 0.836 | 0.899 | 0.896 |
Culture.Performing arts | 1923 | 1024 | 232 | 261924 | 899 | 0.815 | 0.533 | 0.644 | 0.660 |
Culture.Philosophy and religion | 7148 | 3128 | 1164 | 255767 | 4020 | 0.729 | 0.438 | 0.547 | 0.576 |
Culture.Sports | 42560 | 39322 | 2485 | 219034 | 3238 | 0.941 | 0.924 | 0.932 | 0.970 |
Culture.Visual arts.Architecture | 7527 | 5418 | 888 | 255664 | 2109 | 0.859 | 0.720 | 0.783 | 0.858 |
Culture.Visual arts.Comics and Anime | 1418 | 790 | 127 | 262534 | 628 | 0.862 | 0.557 | 0.677 | 0.724 |
Culture.Visual arts.Fashion | 609 | 278 | 108 | 263362 | 331 | 0.720 | 0.456 | 0.559 | 0.576 |
Culture.Visual arts.Visual arts* | 12362 | 8219 | 1754 | 249963 | 4143 | 0.824 | 0.665 | 0.736 | 0.821 |
Geography.Geographical | 14848 | 11192 | 1561 | 247670 | 3656 | 0.878 | 0.754 | 0.811 | 0.877 |
Geography.Regions.Africa.Africa* | 5943 | 3570 | 815 | 257321 | 2373 | 0.814 | 0.601 | 0.691 | 0.740 |
Geography.Regions.Africa.Central Africa | 0 | 0 | 2 | 264077 | 0 | 0.000 | 0 | 0 | nan |
Geography.Regions.Africa.Eastern Africa | 352 | 179 | 38 | 263689 | 173 | 0.825 | 0.509 | 0.629 | 0.597 |
Geography.Regions.Africa.Northern Africa | 950 | 514 | 165 | 262964 | 436 | 0.757 | 0.541 | 0.631 | 0.626 |
Geography.Regions.Africa.Southern Africa | 915 | 532 | 151 | 263013 | 383 | 0.779 | 0.581 | 0.666 | 0.677 |
Geography.Regions.Africa.Western Africa | 555 | 233 | 106 | 263418 | 322 | 0.687 | 0.420 | 0.521 | 0.565 |
Geography.Regions.Americas.Central America | 2382 | 1286 | 242 | 261455 | 1096 | 0.842 | 0.540 | 0.658 | 0.674 |
Geography.Regions.Americas.North America | 41312 | 29991 | 5616 | 217151 | 11321 | 0.842 | 0.726 | 0.780 | 0.876 |
Geography.Regions.Americas.South America | 4847 | 3518 | 730 | 258502 | 1329 | 0.828 | 0.726 | 0.774 | 0.831 |
Geography.Regions.Asia.Asia* | 33925 | 25211 | 3444 | 226710 | 8714 | 0.880 | 0.743 | 0.806 | 0.885 |
Geography.Regions.Asia.Central Asia | 546 | 287 | 70 | 263463 | 259 | 0.804 | 0.526 | 0.636 | 0.613 |
Geography.Regions.Asia.East Asia | 8549 | 6137 | 1008 | 254522 | 2412 | 0.859 | 0.718 | 0.782 | 0.821 |
Geography.Regions.Asia.North Asia | 761 | 245 | 168 | 263150 | 516 | 0.593 | 0.322 | 0.417 | 0.441 |
Geography.Regions.Asia.South Asia | 11637 | 8775 | 853 | 251589 | 2862 | 0.911 | 0.754 | 0.825 | 0.877 |
Geography.Regions.Asia.Southeast Asia | 4326 | 2714 | 441 | 259312 | 1612 | 0.860 | 0.627 | 0.726 | 0.738 |
Geography.Regions.Asia.West Asia | 8414 | 6248 | 628 | 255037 | 2166 | 0.909 | 0.743 | 0.817 | 0.869 |
Geography.Regions.Europe.Eastern Europe | 9697 | 7692 | 844 | 253538 | 2005 | 0.901 | 0.793 | 0.844 | 0.903 |
Geography.Regions.Europe.Europe* | 53802 | 40392 | 7230 | 203047 | 13410 | 0.848 | 0.751 | 0.796 | 0.891 |
Geography.Regions.Europe.Northern Europe | 20509 | 13631 | 2302 | 241268 | 6878 | 0.856 | 0.665 | 0.748 | 0.833 |
Geography.Regions.Europe.Southern Europe | 9731 | 6624 | 1340 | 253008 | 3107 | 0.832 | 0.681 | 0.749 | 0.823 |
Geography.Regions.Europe.Western Europe | 14305 | 10560 | 2201 | 247573 | 3745 | 0.828 | 0.738 | 0.780 | 0.858 |
Geography.Regions.Oceania | 11602 | 9044 | 614 | 251863 | 2558 | 0.936 | 0.780 | 0.851 | 0.885 |
History and Society.Business and economics | 6855 | 2908 | 1427 | 255797 | 3947 | 0.671 | 0.424 | 0.520 | 0.561 |
History and Society.Education | 5418 | 2947 | 855 | 257806 | 2471 | 0.775 | 0.544 | 0.639 | 0.663 |
History and Society.History | 7601 | 3247 | 1162 | 255316 | 4354 | 0.736 | 0.427 | 0.541 | 0.593 |
History and Society.Military and warfare | 10000 | 6068 | 1008 | 253071 | 3932 | 0.858 | 0.607 | 0.711 | 0.760 |
History and Society.Politics and government | 11151 | 6158 | 1136 | 251792 | 4993 | 0.844 | 0.552 | 0.668 | 0.738 |
History and Society.Society | 9341 | 4801 | 960 | 253778 | 4540 | 0.833 | 0.514 | 0.636 | 0.680 |
History and Society.Transportation | 10462 | 8078 | 648 | 252969 | 2384 | 0.926 | 0.772 | 0.842 | 0.881 |
STEM.Biology | 23603 | 21686 | 332 | 240144 | 1917 | 0.985 | 0.919 | 0.951 | 0.974 |
STEM.Chemistry | 978 | 498 | 151 | 262950 | 480 | 0.767 | 0.509 | 0.612 | 0.618 |
STEM.Computing | 1893 | 858 | 255 | 261931 | 1035 | 0.771 | 0.453 | 0.571 | 0.552 |
STEM.Earth and environment | 3239 | 2141 | 247 | 260593 | 1098 | 0.897 | 0.661 | 0.761 | 0.765 |
STEM.Engineering | 3870 | 2470 | 323 | 259886 | 1400 | 0.884 | 0.638 | 0.741 | 0.745 |
STEM.Libraries & Information | 496 | 225 | 56 | 263527 | 271 | 0.801 | 0.454 | 0.579 | 0.588 |
STEM.Mathematics | 314 | 74 | 32 | 263733 | 240 | 0.698 | 0.236 | 0.352 | 0.337 |
STEM.Medicine & Health | 3591 | 1745 | 337 | 260151 | 1846 | 0.838 | 0.486 | 0.615 | 0.650 |
STEM.Physics | 697 | 172 | 114 | 263268 | 525 | 0.601 | 0.247 | 0.350 | 0.357 |
STEM.STEM* | 41446 | 33119 | 1809 | 220824 | 8327 | 0.948 | 0.799 | 0.867 | 0.943 |
STEM.Space | 1625 | 1208 | 70 | 262384 | 417 | 0.945 | 0.743 | 0.832 | 0.835 |
STEM.Technology | 3399 | 1127 | 342 | 260338 | 2272 | 0.767 | 0.332 | 0.463 | 0.498 |
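
The precision, recall, and f1 columns can be re-derived from the TP, FP, and FN counts in each row. The sketch below assumes the standard formulas and, as an assumed convention, returns 0 when a ratio is undefined (zero denominator); that convention happens to match the Central Africa row, which has no positive test examples. The helper names are illustrative and are not taken from the linked repository.

```python
from typing import Dict, Tuple


def safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the ratio is undefined (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def label_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one label from its confusion counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * tp, 2 * tp + fp + fn)
    return precision, recall, f1


# (TP, FP, TN, FN) copied from three rows of the table above.
ROWS: Dict[str, Tuple[int, int, int, int]] = {
    "Culture.Biography.Biography*": (83162, 2455, 175965, 2497),
    "STEM.Mathematics": (74, 32, 263733, 240),
    "Geography.Regions.Africa.Central Africa": (0, 2, 264077, 0),
}

for name, (tp, fp, tn, fn) in ROWS.items():
    precision, recall, f1 = label_metrics(tp, fp, fn)
    # n is the number of positive test examples; the four counts sum to the test set size.
    print(
        f"{name}: n={tp + fn} test_items={tp + fp + tn + fn} "
        f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}"
    )
```

In every row the four counts sum to 264,079, the size of the test set, and n equals TP + FN.
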