Research talk:Measuring article importance/Work log/2014-10-25
Saturday, October 25, 2014
editWoo! Data. So, the query I set up for getting inlinks ended up getting killed, so I had to do it in two queries.
- proto_inlink_count
SELECT
pl_namespace AS page_namespace,
pl_title AS page_title,
COUNT(pl_from) AS inlinks
FROM pagelinks
GROUP BY pl_namespace, pl_title;
- inlink_count
SELECT
page.page_id,
IFNULL(inlink_count.inlinks, 0) AS inlinks,
redirect.page_id AS redirect_id
FROM page
LEFT JOIN staging.proto_inlink_count inlink_count USING
(page_namespace, page_title)
LEFT JOIN redirect redirectlink ON
page_is_redirect AND
redirectlink.rd_from = page_id
LEFT JOIN page redirect ON
redirect.page_title = redirectlink.rd_title AND
redirect.page_namespace = redirectlink.rd_namespace
WHERE page.page_namespace = 0;
- resolved_inlink_count
CREATE TABLE staging.resolved_inlink_count
SELECT
article.page_id,
article.inlinks + SUM(redirect.inlinks) AS inlinks,
article.inlinks AS direct_inlinks,
SUM(redirect.inlinks) AS inlinks_from_redirects,
COUNT(redirect.page_id) AS redirects
FROM staging.inlink_count article
LEFT JOIN staging.inlink_count redirect ON
redirect.redirect_id = article.page_id
WHERE article.redirect_id = 0
GROUP BY article.page_id;
Now, I get to play around a bit. Let's check out the most inlinked articles:
> select page_title, links.* from resolved_inlink_count links INNER JOIN enwiki.page USING (page_id) ORDER BY inlinks DESC LIMIT 10; +-----------------------------------------------+----------+---------+----------------+------------------------+-----------+ | page_title | page_id | inlinks | direct_inlinks | inlinks_from_redirects | redirects | +-----------------------------------------------+----------+---------+----------------+------------------------+-----------+ | IP_address | 14921 | 1434099 | 1432770 | 1329 | 59 | | United_States | 3434750 | 1064447 | 995446 | 69001 | 166 | | Geographic_coordinate_system | 48361 | 908460 | 905848 | 2612 | 31 | | Fair_use | 10772 | 539840 | 538743 | 1097 | 15 | | International_Standard_Book_Number | 14919 | 513159 | 510836 | 2323 | 24 | | Copyright_infringement | 18948365 | 507796 | 506533 | 1263 | 76 | | Copyright_law_of_the_United_States_of_America | 18938663 | 493085 | 37 | 493048 | 27 | | Wikipedia | 5043734 | 473676 | 472489 | 1187 | 194 | | Tilde | 212390 | 444556 | 442103 | 2453 | 22 | | Network_address_translation | 53036 | 405086 | 404836 | 250 | 41 | +-----------------------------------------------+----------+---------+----------------+------------------------+-----------+ 10 rows in set (1.62 sec)
Woah! en:IP_address? Really? I get that en:Geographic_coordinate_system is up there due to geo-tags. I bet that en:International_Standard_Book_Number comes from citations. The copyright links might be from references too. en:Tilde is another weird one. I don't event know what to think of that. I don't think that anything in this list is worth of top 10 Encyclopedic Importance.
How about we look at the articles with the most redirects:
> select page_title, links.* from resolved_inlink_count links INNER JOIN enwiki.page USING (page_id) ORDER BY redirects DESC LIMIT 10,10; +-------------------------------------------------+----------+---------+----------------+------------------------+-----------+ | page_title | page_id | inlinks | direct_inlinks | inlinks_from_redirects | redirects | +-------------------------------------------------+----------+---------+----------------+------------------------+-----------+ | List_of_minor_planets:_11001–12000 | 720156 | 5871 | 687 | 5184 | 640 | | List_of_minor_planets:_20001–21000 | 723102 | 4407 | 659 | 3748 | 637 | | List_of_Dragon_Ball_characters | 1001666 | 3706 | 1197 | 2509 | 620 | | List_of_Naruto_episodes | 2137845 | 1541 | 940 | 601 | 587 | | List_of_minor_planets:_4001–5000 | 719861 | 7410 | 654 | 6756 | 578 | | List_of_hundreds_of_Sweden | 7525873 | 1050 | 584 | 466 | 529 | | ANSI_escape_code | 65930 | 1458 | 1181 | 277 | 526 | | Afghan_detainees_at_Guantanamo_Bay | 19287734 | 5865 | 595 | 5270 | 508 | | List_of_unclassified_languages_of_South_America | 37445047 | 1172 | 496 | 676 | 479 | | List_of_bus_routes_in_London | 5179207 | 98671 | 972 | 97699 | 466 | +-------------------------------------------------+----------+---------+----------------+------------------------+-----------+ 10 rows in set (1.20 sec)
OK. What? Let's check out some examples. en:List_of_minor_planets:_11001–12000 (redirects). I smell a bot. It looks like a bit went through and created ~600 redirects -- one per "minor planet" on the page.
How about another? en:List_of_Naruto_episodes (redirects) Same deal. A redirect for every episode.
One more. This one has far more inlinks. en:List_of_bus_routes_in_London (redirects) Again. A redirect for every bus name. --Halfak (WMF) (talk) 02:39, 25 October 2014 (UTC)