This page deals with a proposed semi-automatic solution to the copyright issues Wikipedia is suffering from:

(...)

I have done that a few times, sometimes deleting the offending paragraphs. But I think that this may just be the tip of the iceberg. Has anyone ever thought of automating this process a little bit? I think the answer might be a bot that picks an article at random, selects, say, three paragraphs, picks ten consecutive words from each paragraph (ten words being Google's query limit), and strips the phrases of their wiki/HTML markup. It then performs Google searches for these phrases and checks whether any of the first ten documents returned by Google contains the word "copyright". This approach might not find all offending paragraphs/articles and may also give some false positives, but it would be vastly more efficient than hoping that someone, someday, just out of curiosity, checks an article for copyright problems... --Martin
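In Perl, such a check might look roughly like the sketch below. This is only an illustration of the idea, not a finished bot: google_urls() is a placeholder for whatever search interface is actually available (a Google SOAP key, WWW::Search, ...), and the markup stripping is deliberately crude.

 #!/usr/bin/perl
 # Sketch only. google_urls() is a placeholder for whatever search
 # interface is available (Google SOAP key, WWW::Search, ...).
 use strict;
 use warnings;
 use List::Util qw(shuffle);
 use LWP::Simple qw(get);

 # Placeholder: should return the URLs of the first hits for a query.
 sub google_urls { return (); }

 # Strip the most common wiki/HTML markup from a paragraph.
 sub strip_markup {
     my ($text) = @_;
     $text =~ s/<[^>]+>//g;                            # HTML tags
     $text =~ s/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/$1/g;   # [[link|label]] -> label
     $text =~ s/'{2,}//g;                              # ''italics'', '''bold'''
     $text =~ s/\s+/ /g;
     return $text;
 }

 # Pick ten consecutive words (Google's query length limit) at random.
 sub pick_phrase {
     my @words = split ' ', strip_markup($_[0]);
     return undef if @words < 10;
     my $start = int(rand(@words - 9));
     return join ' ', @words[$start .. $start + 9];
 }

 # Check up to three random paragraphs of one article.
 sub check_article {
     my ($title, @paragraphs) = @_;
     for my $paragraph ((shuffle @paragraphs)[0 .. 2]) {
         next unless defined $paragraph;
         my $phrase = pick_phrase($paragraph) or next;
         my @urls = google_urls(qq{"$phrase"});
         for my $url (@urls[0 .. 9]) {                 # first ten hits only
             next unless defined $url;
             my $page = get($url) or next;
             print "$title: possible copy of $url\n" if $page =~ /copyright/i;
         }
     }
 }

Whether the paragraphs come from randomly picked articles or from the recent-changes list is then just a question of how the bot is fed.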

I have something like this in place. It does almost exactly what you're suggesting. Its input is the "new articles" list, and it checks for a) hits on Google, b) signs of graffiti (SHOUTING, "Put your new article here", no capital letters in the article, etc.), and c) signs of OCR scanning. It's not really "production level" coding, but it works for me. You can have a look at its output at http://jeluf.mine.nu/jf/wikicheck.html.gz . The source can be found at http://jeluf.mine.nu/jf/wikicheck.pl.gz . It expects the recent changes on stdin and produces output on stdout. You have to obtain a SOAP key from Google (it's free for 1000 queries/day). -- JeLuF
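For readers who don't want to unpack the script: the graffiti checks described above could look something like the following. These are rough guesses at the heuristics, not an excerpt from wikicheck.pl, which may well test different things.

 # Takes an article's text and returns a list of warnings.
 sub vandalism_signs {
     my ($text) = @_;
     my @warnings;

     # SHOUTING: far more upper-case than lower-case letters.
     my $upper = () = $text =~ /[A-Z]/g;
     my $lower = () = $text =~ /[a-z]/g;
     push @warnings, 'mostly upper case'
         if $upper + $lower > 40 && $upper > 2 * $lower;

     # Leftover boilerplate from the edit form.
     push @warnings, 'contains edit-form boilerplate'
         if $text =~ /put your new article here/i;

     # No capital letters at all in a non-trivial article.
     push @warnings, 'no capital letters'
         if length($text) > 100 && $text !~ /[A-Z]/;

     return @warnings;
 }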
Yes, very nice script. Good to see all is right with my submissions. ;-) A few enhancements to the script might be necessary, such as not reporting phrases with more than 20 hits or with none at all, not trying to check single words, putting the most likely offenders at the top of the list, etc. Maybe you should also query google.com instead of google.de, as google.de supposedly blocks some content. But apart from that it looks rather promising. The next logical step would be to check not just new submissions, but all articles in the database. I don't know whether the SOAP limit is a problem for doing that in a reasonable amount of time, but using WWW::Search or multiple SOAP keys could remedy that. Also, the URI of the results page should be made widely known on Wikipedia, because this approach does not work if nobody looks at the results. (I see no mention of it on your user page, for example.) --Martin
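The suggested filtering and ranking could be as simple as the fragment below; the field names (title, phrase, hits) are only illustrative and are not taken from wikicheck.pl's output format.

 my @results;   # one entry per checked phrase, filled by the checking run

 my @interesting =
     sort { $a->{hits} <=> $b->{hits} }           # fewest hits first: most suspicious
     grep { $_->{hits} > 0 && $_->{hits} <= 20 }  # drop unmatched and very common phrases
     grep { $_->{phrase} =~ /\s/ }                # skip single words
     @results;

 print "$_->{title}: $_->{hits} hits for \"$_->{phrase}\"\n" for @interesting;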


I wrote it mainly for myself. I'm doing a little gardening by calling the script and checking its output myself. That's why it produces such a big output (130 kbytes zipped) - I use it over my local area network. And the quality of the results is OK for me. It's not meant to be perfect, it's just for easing housekeeping. At first, it only looked for simple signs of vandalism. Copyright checking was added later on. The result I presented to you is from a run checking only anonymous edits. That's the way I usually call the script - I seem to trust logged-in users to a certain degree.
Using multiple SOAP keys would violate the usage agreement Google requires you to accept before receiving a key. Perhaps one could ask them for a free key if one really wanted to scan all articles.
But there is a general problem with scanning all articles: first, you would have to discard Wikipedia itself from the list of hits. Second, there are sites citing Wikipedia; they would trigger false alerts, too. A lot of manual work would have to be done after receiving the bot's results.
The advantage of having this kind of bot running on the Wikipedia server itself would be direct database access, which would be far faster than the current approach of crawling Wikipedia's pages. -- JeLuF
Eliminating from the Google search results all pages that contain the word "wikipedia" should do the trick. Unless those other sites plagiarize Wikipedia without crediting it, that is.
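In a checker like the one sketched above, that filter could be a small helper applied to every hit before it is reported (again only a sketch):

 # Skip any hit that mentions Wikipedia, either in its URL or in the
 # fetched page text.
 sub mentions_wikipedia {
     my ($url, $page_text) = @_;
     return 1 if $url =~ /wikipedia/i;
     return 1 if defined $page_text && $page_text =~ /wikipedia/i;
     return 0;
 }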
Assuming that all logged-in users will do the right thing and that only anonymous users insert bad content is dangerous. I submit all my work anonymously, so that others will at least have a look at it. Blindly trusting someone is never such a good idea...
I thought using the SOAP keys of multiple people from the same server was not a violation, or is it? Frankly, I think XML-RPC is overhyped anyway; there are modules for doing HTML searches all over CPAN.
I agree that the finished script should be hosted in the US, and since we two are on the "wrong" side of the Atlantic, maybe we should look for someone to host it once it is in a usable state. I don't know whether the main server has enough capacity or bandwidth to also run this bot all the time.
So, what do we do now? A script for your personal use is fine, but unless a more complete check is performed, we cannot be sure just how much of Wikipedia's content is illegal. --Martin

--- Some questions:

Is a copyright bot really the right solution, or is this just tech fetishism? A balanced assessment of these theoretical copyright problems might show that there is just as much of a problem in editorial practices; for instance, there are people (see 'edit wars') accusing pseudonymous or anonymous authors of 'being' certain other people - opening Wikipedia to lawsuits by those people, since they can easily claim that others have copied their work into the wiki. It seems these practices should be cleaned up before we worry about technology.

Is the US the right place to host this? US copyright law is considered ridiculous by most people familiar with global copyright conventions, all the more so after the Digital Millennium Copyright Act and RIAA lobbying. Accordingly, it seems inevitable that the Wikipedia host sites will be forced out of the USA - although it makes technical sense to locate the scripts near the search engines to support massive comparisons, it makes little legal sense to set up something that won't work if Wikipedia suddenly has to move to New Zealand or wherever...

What other 'bots' might be useful? A slander bot to keep an eye on accusations? A trademark bot to watch the usage of trademarked terms? A trade secret bot to watch for disgruntled employees revealing, say, the formula for Coca-Cola?

As I see it, the need for such a bot arises not so much from the infringements themselves as from human nature. People are more willing to just copy and paste if they know they can get away with it, as most other editors will be primarily interested in their own work. Checking for NPOV is far easier than checking for copyright violations, with the result that the latter will seldom be done.
Of course you are right: if, for instance, a Microsoft lawyer decides to upload massive amounts of Encarta content in order to sue the project later, the bot won't be much help in detecting that. If people upload text from copyrighted books they have scanned, again the bot is useless.
But if we can at least stop those people who, either out of ignorance or stupidity, submit content from other copyrighted sites, we should definitely do that.
As far as server moves are concerned, I presume both the main database and the bots can easily be moved to another server. If I had a flat-rate DSL account, I would host the bot myself, but those accounts are rare here in Germany outside the major cities.
I agree that this is just another technological solution to a human-caused problem, but given Wikipedia's rapid growth, human eyeballs alone will have difficulty doing all the work required... --Martin
Here's a recent reference I found on Internet legal issues and copyright: Internet Legal issues. I came across it while testing a context-based search facility, using "copyright" as one of the terms fed to the heuristics engine. Good luck... --Larry T
There is at least one quite good algorithm that uses stemming and correlation to test for similarities between articles on the web. I couldn't find the Perl script mentioned above, so I couldn't check whether it is better or worse.
The basic idea is to take a few words that occur multiple times, use them for a Google search, and check the articles found for correlation with pairs of words from the source article. --John
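Since the algorithm isn't named, the following Perl sketch is only a rough guess at how stemming and pair correlation might be combined: the crude suffix stemmer stands in for a proper stemming module from CPAN, and the frequency-based query words and the pair-overlap score are purely illustrative, not John's method.

 use strict;
 use warnings;

 # Very crude stemmer; a real implementation would use a proper
 # stemming module from CPAN instead.
 sub stem {
     my $word = lc shift;
     $word =~ s/(?:ing|ed|es|s)$//;
     return $word;
 }

 # Stems of all words longer than three letters.
 sub stems {
     my ($text) = @_;
     return map { stem($_) } grep { length $_ > 3 } split /\W+/, $text;
 }

 # Words that occur several times in the source make good query terms.
 sub query_words {
     my ($text) = @_;
     my %count;
     $count{$_}++ for stems($text);
     my @frequent = sort { $count{$b} <=> $count{$a} } keys %count;
     return grep { defined } @frequent[0 .. 4];
 }

 # Fraction of adjacent word pairs from the source whose stems both occur
 # in the candidate document; values near 1 suggest a copy or paraphrase.
 sub pair_correlation {
     my ($source, $candidate) = @_;
     my @src  = stems($source);
     my %cand = map { $_ => 1 } stems($candidate);
     my ($pairs, $shared) = (0, 0);
     for my $i (0 .. $#src - 1) {
         $pairs++;
         $shared++ if $cand{ $src[$i] } && $cand{ $src[$i + 1] };
     }
     return $pairs ? $shared / $pairs : 0;
 }

The query words would go to Google (again within the ten-word limit), and pair_correlation would then be run against each page returned.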