An impending scalability problem

This page was previously called "An impending scalability problem", dates from late 2001 and is an early example of the concern for reviewing edits to articles. It is retained as part of the history of the project.

Monday, August 13, 2001, 11:04 AM -- After the Slashdotting of my Kuro5hin article, Wikipedia has enjoyed a huge influx of new writers who, like the old ones, are mainly doing a good job, writing reasonably good articles, avoiding topics of which they're mostly ignorant, and generally keeping each other honest. But the influx of all these solid, and very welcome, hands has raised a scalability problem in some people's minds--and the problem is, arguably, already here, or will be with the next major influx of Wikipedia people. It is perhaps not immediately obvious that it is even a problem at all--and perhaps we ought to conclude, after further consideration, that it will never really be a serious problem. But let's take a look, anyway.

The problem, or alleged problem, is difficult to state precisely; generally and vaguely, it is that it is increasingly difficult for "us" to monitor all the activity that's going on. This makes it all too easy for bad edits to creep in and pass quickly down the Recent Changes uncaught.

To clarify the problem, it's best to work with an example. Suppose we're doing screaming business on Wikipedia, with the number of edits, according to the Recent Changes page, clocking in at an average of 100 per hour. Suppose I write an article about epistemic circularity--a recherche topic in philosophy. Then, two months later, a philosophy student makes a bunch of really bad edits to the article--rendering it a morass of misinformation, worse than no article at all. Being an expert on epistemic circularity, I would be able to see that they're bad edits, but hardly anyone else could see that--if they're paying attention at all, which they probably aren't. But due to the rate of edits on Recent Changes (and maybe I'm spending time in Russia or Ireland, who knows?), I overlook the edits, and they remain.

Multiply this situation somehow--considering, anyway, the many recherche topics there are in existence--and the fact that edits are going by so fast that no one can catch all the problems, and we have a problem of creeping poor quality, not because we're incapable of dealing with each instance of poor quality by itself, but because we're incapable of dealing with all the instances.

There's an obvious reply to this, which I commend to you. If the number of edits increases, then so does the number of people looking at the edits. So where's the problem? Sure, mistakes get through--that's always been the case. How would increasing the rate of production increase the proportion of mistakes made? Sure, it increases the number of mistakes, but not (necessarily, anyway) the proportion.

This is an excellent point, and we must bear it in mind when we think about the scalability of Wikipedia: increased activity implies both increased productivity and increased amounts of bad work, but that by itself doesn't imply that there will be increased proportions of bad work.

Radically increased production does mean that it's impossible for any one person to monitor all the activity on Wikipedia. But what need is there for any one person to monitor all the activity on Wikipedia?

So if we're going to find and solve a scalability problem, we've got to identify it precisely. A more easily-delineated problem is aesthetic and ergonomic: suppose I'm interested in a few select topics in philosophy, music, literature, and geography, and I really don't care what people are writing about chemistry, gardening, art history, and sports. On the Recent Changes page, I have to scan the whole blasted thing--which, as we all know, is (delightfully enough) getting longer and longer on average as the months go by--in order to find edits that are relevant to my interests. It seems it would increase my productivity if I were able to identify topics somehow by general category. Of course, this raises a can of worms; but I think we can all agree that it would be at least somewhat useful, and some people might find it extremely useful.

So I'd like to put this question to you: am I somehow not understanding what scalability problems are lurking about here? Or am I right to suspect that greater amounts of activity will not mean any problems for quality? If not, please identify, precisely, what the problem is.

--Larry_Sanger


It would, of course, help the scanning of the Recent Changes page if people would (a) admit to themselves when a change is a minor edit and (b) write a summary note giving some idea about what they've just done. I am guilty of breaking both of those (the first by carelessness more often than not), but it would help.

--MichaelTinkler


Yes, some means of officially "registering" with the software that I want to monitor certain pages or groups of pages might be useful. I also agree that the "minor edit" box is useful--but I see it abused in both directions: people check the "minor edit" box after they have made an actual content change, thereby "hiding" their change from view. It would be good to get into the habit of change summaries--I am guilty of not doing that. It is possible that the problem will be solved by community standards, but I think some software changes will be needed as well. --LDC

Yep, I think this is a good idea, too, and I think it's a good idea to fill out that "summary" line--I do pretty often but not as often as I should. --LMS

It seems to me that adding a box to "notify of changes to this article," followed by an automatic email when the page is changed, would be simpler to implement than some sort of metadata system--but I might be wrong, as I have been many times before; and just because it might be simpler does not mean that it will be simple at all. And so I'm not sure if that would be the correct approach either.

And I'll admit to abusing the "minor edit" box myself; I check it now by habit and have unintentionally "hidden" many new entries that way, which then do not show up on "new topics," among other things. --KQ


I agree with LDC and KQ: there should be a list basically identical to Recent Changes, but listing only topics that you subscribe to; that would be the most effective in my opinion. Also, some education is needed on exactly when you need to check the "This change is a minor edit" box. I read the articles on how to edit a page and the FAQ, but didn't see it explained, IIRC. --Creaktop


oh, hell! I could have avoided the whole 'baroque' debate if I'd just deleted the 'intricate detail' and called it a minor edit! NOW I know!

The only thing I think I systematically elide with 'minor edit' is new revisions to the "biographical listings" page. --MichaelTinkler


1. A "personalised" Recent Changes page showing only changes in articles you subscribed to. 2. Email notification of changes to articles you subscribed to; can be done with or without the actual diff in email.

Both of these require maintaining a list of subscribed articles in one's Preferences which, however, is not that big a deal and can be implemented in the same script with a new action "editsubscr", to which there would be a big shiny link on the "editprefs" page.

3. The same as in 1. or 2., but applied not to subscribed articles but rather to any articles containing a keyword or keyphrase you registered as your interest.

This is more CPU-intensive, but still doable. It might be overkill; heck, even 1. or 2. might be overkill, but I'm just trying to lay out the possibilities.

4. Saving with "minor edit" set may trigger a warning if more than 10 lines are detected as changed in the article, or if the article is new.

What do y'all think? -- AV

P.S. I volunteer to help with the actual programming, should help be needed.
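
As a rough illustration of suggestion 4 above, the check could look something like the sketch below. It is written in Python for readability rather than in the wiki's own script; the function names, the use of `None` to mean "the article is new", and the way changed lines are counted (the larger side of each non-matching diff region) are all assumptions made for this example, not anything the software is known to do.

```python
import difflib
from typing import Optional


def count_changed_lines(old_text: str, new_text: str) -> int:
    """Count lines that differ between two revisions (hypothetical metric)."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced, inserted or deleted region counts as the larger
            # of its old-side and new-side line counts.
            changed += max(i2 - i1, j2 - j1)
    return changed


def should_warn_minor_edit(old_text: Optional[str], new_text: str,
                           minor: bool, threshold: int = 10) -> bool:
    """Return True if a save flagged "minor edit" deserves a warning.

    old_text is None when the article did not exist before (assumption).
    """
    if not minor:
        return False
    if old_text is None:
        return True  # a brand-new article is never a minor edit
    return count_changed_lines(old_text, new_text) > threshold
```

The threshold of 10 comes from the suggestion itself; whether a warning or an outright refusal is better is left open here.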

This all sounds great, but perhaps there are other similar ideas we should consider as well? See below for another idea. I don't know which is best. I do like the article subscription idea, but I think that doesn't do the same thing that would be done by allowing people to select categories into which edits can be placed. See below... --LMS

Something that might be desirable is a "review" system, where users who review a particular change can register their approval or disapproval of a particular change, and have the review status appear in the recent changes display. So you can see what has and hasn't had at least some level of review.--Belltower

As any seasoned Wikipedian knows, the simple ability to change what problems any potential reviewers would spot constitutes a sort of review. See Peer review and the Wikipedia process for elaboration. --LMS
Yes, but the original point is that no one may review a particular article. There's no way to register that I *like* a particular revision. Meanwhile, reviewing every change to every article I have some knowledge of is time-consuming and I'm liable to miss a few -- isn't that the crux of the problem? If recent changes also listed a "rating", say a 1.0-5.0 value with a number of votes, then I can see at a glance what hasn't had much review. I've also avoided doing one change because someone explicitly changed it (U-boat to U-boot), and while I could change it, they can just change it back again, and I could see potential Wiki "wars" in the future. If a change got a low rating, that might discourage the person from changing it to the less desired version. --Belltower

And now when the scalability issue emerges, recall the subject I raised some time ago about metadata, categories, changes to the software.
All of my opinions were either rejected outright, played down or it was said "too early for that" or "rules are a bad thing".
It's a pity; we should have known better.
Discussion and changes are needed soon, before Wikipedia collapses under its own weight and our pet project turns into a laughingstock.
Save our souls.
--Kpjas

You seem to be ignoring the content of my essay, Kpjas. :-) In the essay, I am not supporting your view; I am taking the skeptical position that there isn't a scalability problem as far as quality of content goes. Do you care to give an argument that there is one? I haven't seen one yet. Everyone seems to assume that there is one. There is one as far as ease of use goes--so that I could easily check changes to all philosophy articles, for instance--I'll agree with that. (Although I don't think that's a very serious problem yet.) Also, it is not the case that all of your ideas were played down or rejected--give me a break. After all, plenty of your ideas have been rather common ideas. --LMS, slightly exasperated

I think the crux of the scalability problem is that there's currently no way to scale the editorial process (informal as it is) because there's no method for collaboration.

Could you elaborate, please? I'm not sure what you mean here. I mean, of course there's a method for collaboration--the wiki format is brilliant at it. --LMS
Well, sometimes brevity is the soul of confusion, not wit. I guess I wasn't clear. The wiki format is brilliant at some kinds of collaboration. My claim is just that the editorial process is not one of those things. It would be better if there were clear ways for a group to collaborate as editors.
Some of this would require software changes. For example, there ought to be a way to mark pages so a group or individual can tell if they've been looked at. Preferably there would be multiple kinds of marks so a page could be marked wikified, peer validated, typographical errors, work needed, or even bad. I also think it's important to have a way to break down the recent changes page so that several people together could look at every new page today, or this week, or whatever.
Other changes likely will not require software updates. We could create the above pages, and collaborate to update them with links to the pages that need specific kinds of work. Teams of people could be formed to work on each of those pages. I'm not sure if it's possible to search for every page which does not contain a link to wikified, but if it is, we could just add the above links to pages, search for the pages which do not contain those links, and use that as a starting point for update work. --Mark Christensen

It would be nice to break up the recent changes into manageable chunks and get an informal group together to work together to look at all the changes, but there's no good way to do this. I think this is just as necessary as a simplified way to keep track of articles in your field. -- Mark Christensen

I think this is a sensible suggestion. One idea is for us to decide on some set of "top level categories" and make a drop-down box from which an area must be selected whenever an edit is made; the default area is whatever the last "editor" selected. This would be for the ease of sorting the Recent Changes page--not metadata about the article. Jimbo has talked about doing this for months. I think that if someone were to present some clean code that does this (this=a set of more specific requirements I could provide you with on request :-) ) to Jimbo and to CliffordAdams (it sounds to me like it would be rather easy...), they could upload the code--without uploading the page name crunch-inducing latest version of the software. --LMS

It just occurred to me that it'd be nice to have the Recent Changes remember what I've seen already, and perhaps put horizontal rules at each point in the list where I've visited the page. -J

Look at the hint I put on my personal page (hornlo) yesterday; this sort of gives you what you want -- just drag the List new changes starting from link from the Recent Changes page to your desktop or quick link toolbar each time you leave or close the RC page -- that gives you one-click-away access to just those changes made since you last looked at RC. --loh

Delete me anyone if I repeat a known idea (this page is too long to read it all). I would like to see "My Recent Changes" with the list of changes made to articles I have edited, sorted by the date of changes made by others with an option "Skip this article now and for ever". This way I could follow the trail of destruction, e.g. Larry's coup de grace on the "theory" of the role of sleep for learning :) -- Piotr Wozniak


I have implemented something on my underpopulated usemod wiki [1] that I think will satisfy many people. It is a recent changes page that only reports changes to pages that are linked to from a given page. So, for example, you can go to http://www.projectmosaic.org/wiki.pl?action=rc&page=ChronicFatigueSyndrome&days=90 and find recent changes for only those pages linked to from http://www.projectmosaic.org/wiki.pl?ChronicFatigueSyndrome. Or, Piotr, you could maintain a list of pages you want to review at Piotr Wozniak/Watch List, then click on Recent Changes for that page, and see all recent changes to those pages alone. --DanKeshet
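
For anyone curious what a link-restricted Recent Changes amounts to, here is a minimal sketch in Python. It is not DanKeshet's code (which lives in the usemod wiki script); the `Change` record, the function names, and the link pattern (double-bracket free links plus CamelCase words) are assumptions chosen only to illustrate the idea.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set

# Assumed link syntax: [[Free Link]] or a CamelCase word such as PageName.
LINK_PATTERN = re.compile(r"\[\[([^\]|]+)\]\]|\b((?:[A-Z][a-z]+){2,})\b")


@dataclass(frozen=True)
class Change:
    page: str
    timestamp: datetime
    author: str


def linked_pages(page_text: str) -> Set[str]:
    """Collect the names of all pages that page_text links to."""
    return {(m.group(1) or m.group(2)).strip()
            for m in LINK_PATTERN.finditer(page_text)}


def recent_changes_for(page_text: str, changes: Iterable[Change],
                       now: datetime, days: int) -> List[Change]:
    """Recent changes, newest first, limited to pages linked from page_text."""
    cutoff = now - timedelta(days=days)
    wanted = linked_pages(page_text)
    hits = [c for c in changes if c.page in wanted and c.timestamp >= cutoff]
    return sorted(hits, key=lambda c: c.timestamp, reverse=True)
```

A personal watch list page then works with no extra machinery: its text is simply passed in as `page_text`.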


This sounds like an excellent idea. I'd definitely use it to keep an eye on articles that interest me. I also think that the idea above of some type of categorization could be useful, but at the same time it would be constricting if it was only possible to give each page one categorization. --Pinkunicorn


There are at least two ways of categorising, it seems to me. One is to attach the categories to an article (like the see also: links). The other is to make pages for each of the categories and then link the articles to that page. Granted, this method may turn into a page o' links, but that would be the point.

To manage either method, being able to save the pages found by searching for a word phrase as a 'category' page would be useful. As would the ability to 'merge' two such pages...

Being able to make a 'recent changes' type page for a given page o' links would be useful too. (Is that what DanKeshet was offering?) If you could build one based on 2 links away or n links away, then you could have pages of categories of categories etc.


I don't think there will be a quality issue with more editors. The key is that over time articles will be shaped by many hands, resulting in an optimal balance of being mostly satisfactory for most people, while being truly objectionable to very few people, if anyone. There will be short-term fluctuations in quality, but the long-term trend will be a smooth curve towards improvement. How long 'over time' lasts will depend on the obscurity of topic. Larry's example of epistemic circularity might indeed be a morass for many months, until a qualified person finally shows up to fix it. An inappropriate edit to the Ku Klux Klan page will likely be fixed in less than an hour. It will take longer for an obscure topic to be improved, but if we measure in terms of page views rather than days, it will be about equal.

I had an idea about the other issue - filtering out certain topics of interest. Perhaps every page in Wikipedia could be embedded with a drop down menu that lists major topics. When someone chooses a topic, a javascript app shows a list of subtopics. So if you choose biology - botany, zoology, evolution, etc. appear in the new list. A reader can pick the relevant topic and subtopic, and pages will be categorized piecemeal by users. Then these topics could be used as the basis of creating a tree structure for the encyclopedia, as well as for the filtered recent changes pages.

TS


Thank you for the column, Larry.

Here is what I need:

  1. The ability to subscribe to an existing page.
  2. Recent Changes showing (on request) only:
    • pages I have subscribed to that have been edited in the past n days, or
    • pages created in the past n days.

That would let me monitor a limited set of pages that I decide to contribute to in a major way. Being presented with new pages as well would allow me to add to my pageset those that I find relevant.
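
The two filters requested above could be combined along these lines. This is only a sketch in Python under stated assumptions: the `Edit` record, its `is_new_page` flag, and the function name are hypothetical, and the real software would have to record page creation somewhere for this to work.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Edit:
    page: str
    timestamp: datetime
    is_new_page: bool  # assumed flag: True if this edit created the page


def my_recent_changes(edits: Iterable[Edit], subscribed: Set[str],
                      now: datetime, days: int) -> List[Edit]:
    """Edits from the past `days` days that either touch a subscribed page
    or created a new page."""
    cutoff = now - timedelta(days=days)
    return [e for e in edits
            if e.timestamp >= cutoff
            and (e.is_new_page or e.page in subscribed)]
```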

--- One thing people should note is that some browsers (Mozilla and Internet Explorer, I believe) have settings for automagically notifying you that a page has changed. So people might want to look into that as a monitoring method.--Belltower

Another option is the free Mind-it service. It's my "monitoring method" of choice for any page on the Web. <>< tbc

I just have one little caveat: I hope I didn't give anyone the impression that I think there is no scalability problem. Mainly, I wrote the above to get people thinking about what the scalability problem(s) is (are).

I think there are pretty obviously four scalability problems we will have to deal with:

  1. For the sake of aesthetics and ergonomics, we should change the Recent Changes page so that it is possible to view changes that are only in certain categories. I like the idea of allowing people to multi-select categories in which their changes will appear. This categorizes changes without categorizing whole articles, which is something we should avoid, I think. (Actually, I'd like to hear some good arguments that we should have some hard-wired category scheme, and even wrote a column on this subject, but I haven't seen any yet. I mean, I love category schemes myself, but I don't like hard-wiring them.)
  2. Edit lock problems might increase if traffic increases radically (I have no idea what to do about this--right now there's a once-per-five-minute cron job that's done, but I still encounter edit locks myself).
  3. The search feature has become essentially useless (I virtually never use it, because it takes forever), and if I'm not mistaken, it's a burden on the server as well. This obviously hasn't scaled well, and we'll be working on a solution to this soon.
  4. My understanding is that UseModWiki saves articles in text format, and past a certain point, this is unworkable. So eventually we're almost certainly going to have to make Wikipedia database-driven. Again, this is something we might be doing soon.

--LMS


Seems to me that making the source and data available as tarballs is a step in the right direction, towards a distributed web presence. Supporting various functions (browsing, searching) can be done with sufficiently up-to-date mirrors.

Imagine a simple mirror scheme, with one primary mirror being the definitive source of an article (the current www.wikipedia.com). Secondary mirrors would fetch Recent Changes and associated diffs on some regular basis and apply them to the local copy.

A client would begin a session by browsing a secondary mirror. Secondary mirrors could reasonably take on the bulk of the burden of serving most searches and browsing requests. When a request for an edit page comes into a secondary mirror, it gets redirected to the primary. That way, the text entry box will contain the latest text, minimizing conflicts.

This might require Recent Changes, or a derivative thereof under a new name, to take a timestamp argument and to return diffs relative only to that timestamp.

This assumes that the ratio of read-only hits to editpage hits is high enough for the offloading to pay off.
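
To make the scheme concrete, here is a toy in-memory sketch in Python. Everything in it is an assumption for illustration: the class and method names are made up, timestamps are plain integers, and the primary hands out the full revised text of each changed page where the proposal above speaks of diffs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Primary:
    """The definitive copy; keeps a chronological log of saved revisions."""
    pages: Dict[str, str] = field(default_factory=dict)
    log: List[Tuple[int, str, str]] = field(default_factory=list)

    def save(self, timestamp: int, page: str, text: str) -> None:
        self.pages[page] = text
        self.log.append((timestamp, page, text))

    def changes_since(self, timestamp: int) -> List[Tuple[int, str, str]]:
        # Stand-in for "Recent Changes with a timestamp argument".
        return [entry for entry in self.log if entry[0] > timestamp]


@dataclass
class SecondaryMirror:
    primary: Primary
    pages: Dict[str, str] = field(default_factory=dict)
    last_sync: int = 0

    def sync(self) -> int:
        """Fetch and apply everything changed since the last sync."""
        changes = self.primary.changes_since(self.last_sync)
        for timestamp, page, text in changes:
            self.pages[page] = text
            self.last_sync = max(self.last_sync, timestamp)
        return len(changes)

    def handle(self, action: str, page: str) -> Tuple[str, str]:
        """Serve reads locally; send edit requests to the primary."""
        if action == "edit":
            return ("redirect", page)
        return ("serve", self.pages.get(page, ""))
```

Between syncs the mirror serves slightly stale text, which is the price of the offloading; edits never see stale text because they always go to the primary.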

The other advantage of this doesn't have as much to do with scalability as it does with the maintenance of geographically (and politically) diverse, living archives (i.e., a separate copy of the Wikipedia content is more likely to be kept up to date and in good working, accessible order if it is actively being used by some invested population of users).

--dja