Talk:Data dumps/Archive 1

mwdumper memory error

I am running mwdumper on a Debian machine with Kaffe as the JRE. When I try to import any of the XML dumps (pages, articles, full) from the German Wikipedia into my MySQL 4.0.x database I get a 'java.lang.OutOfMemoryError' at around 50,000 inserted rows. It is a 2 GHz machine with kernel 2.6.8 and 2 GB of RAM. Is anyone else having these problems? Will changing to Sun's JRE help?

Thank you, --OpenHaus 09:33, 2 November 2005 (UTC)

For the OutOfMemoryError (while using Sun's JRE; Kaffe doesn't work for me), try increasing the Java heap with the -Xmx200M argument, like so:
/usr/java/jre1.5.0_05/bin/java -Xmx200M -server -jar mwdumper.jar --format=sql:1.4 20051020_pages_articles.xml.bz2 | mysql enwiki
Good luck -- psi29a Mindwerks.Net

1.5 table format

Are the 1.5 tables (page, revision etc.) defined somewhere? The SQL output of mwdumper doesn't contain a CREATE TABLE command like the old SQL dumps did. --Tgr 15:22, 16 October 2005 (UTC)

You can get the db schema from the MediaWiki 1.5 install. You don't need to install anything: just create a new database and import the tables.sql file from the maintenance directory. There might be an easier way, but that's my solution.
I guess you mean something like "cat /usr/share/mediawiki/maintenance/tables.sql | mysql db_name", which would be easier than what I did (I used SQL commands to delete data from all tables, then imported using mwdumper. However, I expect to have to restore at least the user accounts.) --Jrvz 02:54, 10 January 2006 (UTC)
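For what it's worth, a minimal sketch of that approach (the tables.sql path is the Debian package location mentioned above, and the database name is only an example; adjust both for your install):
mysqladmin -u root -p create wikidb
mysql -u root -p wikidb < /usr/share/mediawiki/maintenance/tables.sql
After that the empty schema is in place and mwdumper's output can be piped in as usual.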

1.4 old problems

When trying to read the old table of the 1.4 database format into MySQL, I got the following error:

ERROR 1062 (23000) at line 28: Duplicate entry '1' for key 1

(I tried with the 20051003 and 20051012 dumps of HuWiki, using mwdumper to convert them to SQL, and the CREATE TABLE commands from the old dump to create the old table; I got the same error message both times.)

Any ideas? --Tgr 15:32, 16 October 2005 (UTC)

Please check with the 20051020 dump. Some of the older ones contained leftover non-normalized Unicode titles. --brion 05:48, 24 October 2005 (UTC)

I get the same message. --Tgr 21:03, 26 October 2005 (UTC)

Did you ever figure out the problem? I also get this error message when I try to import the current (December 11) de.wikipedia dump into MediaWiki 1.5. The program then continues to print text to the console, but leaves the database empty when it exits. --84.153.247.50 23:08, 18 December 2005 (UTC)

I'm having the same problem: when I make an SQL file and try to update my database, it gives me duplicate-entry errors.

I even emptied the text table, and it still gives me duplicate errors...
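A possible workaround, assuming a MediaWiki 1.5 schema and default table names (back up anything you care about first, such as user accounts): empty the content tables before re-running the import, e.g.
mysql -u root -p wikidb
TRUNCATE page;
TRUNCATE revision;
TRUNCATE text;
For the pre-1.5 schema the equivalents would be the cur and old tables. This only clears out previously imported rows; if the duplicates come from within a single dump, it won't help.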

Importing Images

I have downloaded the content of the Italian Wikipedia and installed it on my computer, so I have the Italian Wikipedia working locally on my PC, but there are no images. I have downloaded the update.tar file (http://download.wikimedia.org/images/wikipedia/it/) (about 1 GB for Italian), but my question is:

  • Where should I untar this file?
  • Is it enough to untar files to see them locally on my computer?

Thanks

This is not an authoritative answer, just something found after some fiddling. If you know the correct/exact answer, please add it to the article. Also, some images are missing, like the icons of the main page categories; I don't know whether I'm doing something wrong or whether they should be grabbed elsewhere (Commons?).
  • un-tar the image archive in the mediawiki/images/ directory. It should create a lang/ (i.e. fr/ or similar) directory.
  • move all files and directories from that subdirectory one directory upwards, that is, directly into images/
  • check that you have a correct AdminSettings.php file
  • go into the maintenance/ directory and launch rebuildImages.php --missing
Hope it helps...
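A rough shell sketch of the steps above (directory, tar file name and language code are examples; adjust to your wiki):
cd /path/to/mediawiki/images
tar -xf /path/to/update.tar      # should create e.g. an it/ or fr/ subdirectory
mv it/* .                        # move its contents up into images/ itself
cd ../maintenance
php rebuildImages.php --missing  # needs a valid AdminSettings.php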

mwdumper doesn't work

I maintain a wiki on my PC (Windows XP) and tried to add data to it, but after I ran mwdumper the database was still the same. What's the problem? My command: c:\borland\jbuilder2006\jdk1.5\bin\java -server -jar mwdumper.jar --format=sql:1.5 20051105_pages_full.xml.bz2 | c:\wamp\mysql\bin\mysql -u root -p wikidb

BTW, importDump.php said I don't have permission.

Need more info

Are there any guides on importing dumps for n00bs? I'm completely lost, and there is a lot of stuff I would like to grab and send up to my site. I don't know much PHP, and I don't know much about MySQL. I only know enough to install websites, create databases and do minor editing. Phrases like "untar file here", "create this command" and "requires piping like so" are basically all Chinese to me.

Of course, it's nobody's fault but my own. I have downloaded some .xml dumps, and when I go to my wiki site, open Special pages > Import and try to import a file larger than 700 KB, the upload fails. I wanted to upload the whole 1.0 GB XML, but I know that would be a waste.

I installed mwdumper for WinXP and have no idea how to use it, let alone find it on my system. When I do find it, I try to open it and all I see are manuals.pdf and weird files. If I go to the "How to use" manual, it really doesn't tell someone who hasn't had a lot of education in this area how to actually use it. I basically need "Importing XML files to your wiki site for dummies." Does anything like this exist?

How does one use the Perl importing script? Do you save it as a .sql file and upload it via FTP to your server? See, these are the things I would LOVE to find out. If I find out how to do this, I'm writing a manual on this info, because not everyone knows PHP or their way around MySQL without extensive training. Thanks for any help, guys! bucklerchad@comcast.net is my email if you can help me out.

Sniggity 16:44, 24 December 2005 (UTC)
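For what it's worth, once MediaWiki itself is installed (so the database tables exist), the short version most people on this page seem to use is a single command line along these lines; the file and database names are only examples:
java -jar mwdumper.jar --format=sql:1.5 enwiki-pages-articles.xml.bz2 | mysql -u wikiuser -p wikidb
mwdumper just turns the XML into SQL INSERT statements, and the pipe feeds them straight into MySQL.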

IOException When Importing

I tried importing the 2005-12-13 en dump with

java -Xmx200M -Xss150000000 -server -jar mwdumper.jar --format=sql:1.4 /c/info/20051213_pages_articles.xml.bz2 | mysql -u wikiuser -p wikidb

However, I get this error after several hours:

1,480,000 pages (26.966/sec), 1,480,000 revs (26.966/sec)
Exception in thread "main" java.io.IOException: unsupported five or six byte UTF-8 sequence (character code: 0x93)
   at org.mediawiki.importer.XmlDumpReader.readDump() (Unknown Source)
   at org.mediawiki.dumper.Dumper.main(java.lang.String[]) (Unknown Source)
   at gnu.java.lang.MainThread.call_main() (/usr/lib/libgcj.so.6.0.0)
   at gnu.java.lang.MainThread.run() (/usr/lib/libgcj.so.6.0.0)

I assume this is from something in the dump rather than a programming error. Is this a problem for more recent dumps? Is there a workaround? --Jrvz 03:14, 10 January 2006 (UTC)

Sounds like a buggy version of Java. Details please. --brion 04:34, 23 January 2006 (UTC)

Also, it is probably wise to install (for example) JCharset. Furthermore, it seems as if 'java' thinks you are running a different OS (i.e. Linux, not Windows XP); it might also be wise to install Java from java.sun.com. Cola4ever 21:48, 24 January 2006 (UTC)

For the above, I was using the Debian linux package kaffe-1.1.5-3, which identifies itself like this:

$ java -version
Kaffe Virtual Machine

Copyright (c) 1996-2004 Kaffe.org project contributors (please see
  the source code for a full list of contributors).  All rights reserved.
Portions Copyright (c) 1996-2002 Transvirtual Technologies, Inc.

The Kaffe virtual machine is free software, licensed under the terms of
the GNU General Public License.  Kaffe.org is a an independent, free software
community project, not directly affiliated with Transvirtual Technologies,
Inc.  Kaffe is a Trademark of Transvirtual Technologies, Inc.  Kaffe comes
with ABSOLUTELY NO WARRANTY.

Engine: Just-in-time v3   Version: 1.1.5   Java Version: 1.1

I suppose it lacks some needed functionality, and I should install the "real" java from Sun. --Jrvz 03:45, 24 March 2006 (UTC)

A month without dump

I wanted to complain a little bit. Indeed, in the last month the French Wiktionary has more than doubled, but unfortunately we are still working from a rather incomplete base. Moreover, we are close to having 30,000 entries (basically, we will multiply the articles about French words by 4). This work is delayed by the identification of duplicates. Laurent Bouvier 11:52, 10 January 2006 (UTC)

I too would very much welcome it if a dump of projects other than the English Wikipedia would appear sometime soon. I have just created a bot for the Esperanto Wikipedia which will help identify all the biography articles we have and place them into relevant categories according to dates of birth and death. But to be able to identify the articles, I need a recent dump of the database - particularly since a lot of articles have been processed in this manner manually in recent days, but the old dump still makes the bot identify them as needing to be taken care of. Blahma 02:41, 13 January 2006 (UTC)

See [1] for answer. Erik Zachte 08:19, 13 January 2006 (UTC)
Thanks for the info. In spite of that, one more week has already passed since the announced time the job would be done, and still no new dumps out there. I would really welcome one, even if it would still be done using the old technology, if that is possible. Blahma 16:02, 21 January 2006 (UTC)
Something is going on already, finally. Thanks for the progress. Blahma 00:14, 28 January 2006 (UTC)

User namespace dump

Dunno where to post or who to talk to, so I'll just post here. Could we not leave the user pages out of the database dumps? While looking at Wikipedia's mirrors and forks section, many people found it annoying that the user pages were getting mirrored as well, and excluding them from the database dump would be the best way to prevent it. If mirrors really need them, they can get them off each user page. Most mirrors and forks copy only the articles, and I don't see any reason why we should be dumping user pages (or user talk pages) as well. Thanks. -- WB 09:50, 15 January 2006 (UTC)


importDump.php problem

I have downloaded several of the xml dumps at download.wikimedia.org onto my linux box. I have then used the command:

cd wiki
bzip2 -dc <dumpfile> | php maintenance/importDump.php

However, no matter which type of dump I use (full, current or articles), my import stops with a 'Killed' message after several megabytes of loading into my MySQL database. I am at a loss. I have increased my PHP memory limit to 50 MB and the timeout to 640 seconds, but to no avail. Does anyone have any ideas as to what is going wrong? My web host provider says it is a wiki problem and they can find nothing wrong with the server. Please help!? Thanks in advance!

--- It could be a MySQL problem; please check whether the files under /var/log/mysql/ are causing the filesystem to become 100% full.
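A quick way to check that suggestion (paths assume a typical Linux layout):
df -h                    # is any filesystem at or near 100%?
du -sh /var/log/mysql    # how large are the MySQL log files?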

Dump stalled on af, failed on ar.

<snip>

Regards, Rich Farmbrough 19:46, 22 January 2006 (UTC) P.S. if I can help, let me know on en: or by email. Rich Farmbrough 19:46, 22 January 2006 (UTC)

Not stalled, that was the dump running after a stop for debugging. :) It's way past that now. --brion 04:24, 23 January 2006 (UTC)

mwdumper: after (successful) import MySQL shows no change

I downloaded '20051020_pages_current.xml'. First it tells me mediawiki.text doesn't exist, after which everything goes fine (i.e. 'no errors'). Then, when opening MediaWiki again, nothing shows, and viewing the tables in phpMyAdmin shows that nothing was imported. The only reason I can think of is that all tables are prefixed with 'mw_' as 'advised'. If there are specific requirements for importing the dump, please update the page.

Updates:
1. It seems the corresponding table does exist, but with the 'mw_' prefix, which apparently it should not have. A possible solution would be an extra parameter for mwdumper which adds the prefix string (a sed-based workaround is sketched after this list).

2. I have found a possible solution at Wikitech-l

1. mysql -u root -p wikidb
2. truncate page;
3. truncate revision;
4. truncate text;

3. After re-installing MediaWiki with table names without 'mw_' prepended, the database is growing during the import, so that seems to have solved the problem.

4. Although at first it seemed to be successful, a great many pages which are present on the English Wikipedia seem to be missing from mine; also, the database is only 752 MB in size.

5. Since it seemed to have gone wrong, I re-imported the .xml file. I 'discovered' that MySQL uses ibdata1 to store database updates; this file is now 683 MB in size, and the database, which no longer seems to grow, is now 593 MB.

I think these problems are not (or at least should not be) normal, so a guide would be a great addition to the already available information. It seems unlikely that a 4 GB file should end up as just 1 GB.

6. Currently I am trying to import using importDump.php. This is very, very slow, and it does not seem to make a difference to the size of ibdata1 (currently 2.1 GB), so this seems related to MySQL. This does not solve my problem, though; perhaps someone has an idea how I can force MySQL to import the data into the database instead of into this file.

7. After a lot of searching, the terms 'purge', 'flush' and 'dump' seem to be important, yet it is hard to determine for sure what they mean here, so it is (I think) unsafe to simply 'do' those actions.

8. I just gave up (for now?); this seems hopeless. For some reason it is impossible to import the data properly, and since I am no DBA (yet), I have absolutely no idea where to go from here.
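(Sketch referred to from update 1 above.) As far as I know mwdumper has no table-prefix option, but a possible workaround is to rewrite the table names in its SQL output on the fly; the prefix, file and database names here are examples, and the sed patterns may need adjusting to the exact INSERT syntax mwdumper emits:
java -jar mwdumper.jar --format=sql:1.5 pages_current.xml.bz2 \
  | sed -e 's/INSERT INTO page /INSERT INTO mw_page /' \
        -e 's/INSERT INTO revision /INSERT INTO mw_revision /' \
        -e 's/INSERT INTO text /INSERT INTO mw_text /' \
  | mysql -u root -p wikidb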

Performance Suggestion

I see the dumps are XML, which is fluffy. One of the things that makes XML fluffy is the inclusion of whitespace in the data. When I look at the XML dump for the full enwiki (20051213_pages_full.xml), I find that there are spaces used to indent each level of tags.

If the spaces are removed, then we have a massive savings; I haven't measured, but I bet it would cut the file size by more than 30%. If the spaces are made into tabs, we'd get a smaller, but still very worthwhile savings.

Sure, the file is compressed; but the compression could be working against much less data. And compression doesn't eliminate the storage cost of this fluff; it just reduces it.

-- Mikeblas 07:05, 27 January 2006 (UTC)

Would you mind measuring it? :) Keep in mind that the vast majority of the file size is in the page text. --brion 00:24, 28 January 2006 (UTC)
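A crude way to measure it (filename is an example; the second command strips only leading spaces, which is what the indentation amounts to):
bzcat 20051213_pages_full.xml.bz2 | wc -c
bzcat 20051213_pages_full.xml.bz2 | sed 's/^ *//' | wc -c
Comparing the two counts gives the share of the uncompressed size taken up by indentation; piping each through bzip2 before wc would show the effect on the compressed size.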

Really Basic MWdumper help!

Hello,

I downloaded the DB dump from the website for the en Wikipedia, decompressed the bzip2 archive, and am ready to import; however, I have no clue how to use mwdumper.

I downloaded the .jar file, but when I double-click it, either nothing happens or I extract a couple of folders with .class files. What do I need to do to these files to RUN the Java application and use the commands to import the DB?

See the README file. A GUI version is in progress, but incomplete. --brion 01:51, 29 January 2006 (UTC)
Thanks, I've read the README, but as a beginner I don't even know how to get to the point where I can enter a command line!

Whenever I try to run mwdumper.jar, nothing happens. I've downloaded the latest Sun Java, but don't know what to do next.

In that case you probably should wait for the GUI version; I'll try to have something basic out next week. --brion 03:35, 29 January 2006 (UTC)


SWEET! GUI is good for newbies like moi!

Brion... any updates? BTW, does MWdumper actually upload the XML file to my SQL DB? If my SQL DB lives on a webhost, will MWdumper still do the job?
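On the second question: as far as I know MWdumper mainly converts the XML into SQL statements; getting those into the database is an ordinary mysql client connection, which can point at a remote host if your webhost allows it. A sketch, with the host, user and database names as placeholders:
java -jar mwdumper.jar --format=sql:1.5 pages_articles.xml.bz2 > dump.sql
mysql -h db.example.com -u dbuser -p wikidb < dump.sql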

once a week?

Hi, looking at the current dump creation, it looks to me that dumping all the wikis will take about 2 weeks. Is there a secret plan to still be able to make it weekly (as written in the page)? Maybe:

  • dumping several wikis at the same time
  • dumping only a diff version weekly, and dumping a full edition monthly or so.

Can this be done? Any other trick? Kipmaster 17:25, 29 January 2006 (UTC)

Once locking is in we may go back to using two or three machines instead of one to run the compression, or other optimization junk. --brion 01:31, 31 January 2006 (UTC)
Might be worth simply creating a pile of semaphored tasks; when they're all taken, a new pile is created. 213.48.182.7 19:40, 8 February 2006 (UTC)
Only the pages-articles.xml.bz2 once a week and the full /Compressed dump once a month. That would be great.--84.169.235.36 11:29, 19 February 2006 (UTC)

how long?

Could a time estimate be added to the download report? The en wiki dump has been running for 5 days now, but I have no way of knowing whether that's normal. Adding the last run's time would be great... Felagund 08:12, 30 January 2006 (UTC)

That's pretty much in line, yes. (It's done, by the way.) Since this is the first run on the new system, not all the bells and whistles are in place and there's no previous dump on the same system to compare it to automatically. --brion 01:30, 31 January 2006 (UTC)

older dumps

What's the reason for http://download.wikimedia.org/wikipedia/en/? Just wondering. Having that page just seems to take a few people away from the newer dumps. I realize it's automatically generated, but yeah. Perhaps a note that there are newer ones? etc. Thanks. -- WB 10:55, 1 February 2006 (UTC)

What's the reason for it what? It still exists because there's no obvious reason to delete the files yet. --brion 18:33, 2 February 2006 (UTC)
Somewhat confused about your comment. Anyway, I think there should be note about new ones. That's all. What would be a reason to have them deleted then? -- WB 07:14, 3 February 2006 (UTC)

A basic load balance

The total data dump process took approximately 10 days; the English WP itself took 5 days of it. FYI. --Dbl2010 00:42, 3 February 2006 (UTC)

Stalled?

2006-02-10 01:20:42 in-progress All pages with complete page edit history
2006-02-13 10:56:03: enwiki 843086 pages (2.870/sec), 21272800 revs (72.426/sec), ETA 2006-02-16 06:55:38 [max 39000750]

Since it is now 01:57, 15 February 2006 (UTC), it seems likely. If not, I can't understand why the message is dated the 13th. Rich Farmbrough 01:57, 15 February 2006 (UTC)

Perhaps the problem with Zwinger had something to do with it? Rich Farmbrough 01:57, 15 February 2006 (UTC)
Still the same message. If I can help, let me know on my en:talk or by email. Rich Farmbrough 13:31, 16 February 2006 (UTC)

Everything's been on hold since last week's crash, until things are fixed. --brion 22:16, 19 February 2006 (UTC)

Ok, fixed up the dump machines; they're now running. English Wikipedia and everything else are running on two machines, so it should hopefully finish faster this time too. --brion 02:53, 20 February 2006 (UTC)

Don't expect responses on this page

This page is for documentation, not discussion; if you have some immediate problem with the dumps please contact me by email directly at brion at wikimedia.org. --brion 22:17, 19 February 2006 (UTC)

download.wikimedia.org/images/

There is no link from http://download.wikimedia.org/ to http://download.wikimedia.org/images/. Why doesn't one exist? I had a bear of a time finding the link to the images directory (in fact, I found the link on this talk page). Thanks. Arichnad

Perhaps because they are about three months old. I don't know if any more up-to-date dump exists or is planned. Rich Farmbrough
The content page says "At the moment uploaded files are dealt with separately and somewhat less regularly, but we intend to make upload dumps more regularly again in the future." Rich Farmbrough

dump machines stopped?

Are the dump-machines stopped? --Balû 06:08, 16 March 2006 (UTC)

Add hash or signature of the dumps

Could you add MD5 or SHA-1 sums together with the description of each file on download.wikimedia.org? Example:

2006-02-25 04:39:16: frwiki 551495 pages (115.193/sec), 551495 revs (115.193/sec), ETA 2006-02-25 04:54:24 [max 656171]
MD5 : 4191385963ce76573bcdce894fc17cec - SHA-1 : cbfccd500131169d14a9befd458d4c4b77ed496a

Thanks. 83.77.79.20 19:03, 22 March 2006 (UTC)
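Seconded. For reference, sums published in that form would presumably be checked on the downloader's side with something like (filename is an example):
md5sum frwiki-20060225-pages-articles.xml.bz2
sha1sum frwiki-20060225-pages-articles.xml.bz2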

Please can we have a dump of Apache Access Log?

A dump of the access logs would allow us to calculate page hits. This is an incredibly important statistic for Wikibooks (books are not wildly edited like Wikipedia articles). See http://httpd.apache.org/docs/1.3/logs.html#accesslog . Even a dump of page URL accesses without the IP address would be of huge value. RobinH 15:11, 18 April 2006 (UTC)

(See http://en.wikibooks.org/wiki/Wikibooks:Staff_lounge#Hit_Counting for a discussion )

Idle?

I notice the current status of the dumps is "idle", and prior to that several of the larger dumps for the larger wikis were failing. Are we on hiatus while the proverbial hammer is fetched, or what? (Personally I'm waiting for a couple of the smaller dumps from the very largest wiki, which I'm hoping won't necessarily depend on a global solution to whatever the recent problems have been.) Alai 23:27, 30 April 2006 (UTC)

This seems to have been overtaken by events, so is rather of academic interest only now. Alai 01:31, 2 May 2006 (UTC)

Data format documentation, id

No links from this page seem to document the XML data format. Where can I find such documentation?

As an example, I had assumed that the ids for pages were fixed between dumps, which they don't seem to be.

Erl 11:16, 4 May 2006 (UTC)

I haven't looked at the XML data, so I'm assuming here that page ids are as in the "page" SQL table, which is dumped separately. I believe that these stay attached to a given page, though not necessarily to a given page title, if there's a page move. That is, if A is moved to B, B's page id is now what A's used to be, and the redirect at A gets a new id. Alai 23:15, 6 May 2006 (UTC)


The XML dump format for a page is like this:

<page>
    <title>Main Page</title>
    <id>1</id>
    <restrictions>...</restrictions>
    
    <revision>
       <id>1</id>
       <timestamp>2004-09-13T22:06:39Z</timestamp>
       <contributor>
          <username>...</username>
          <id>2</id>
       </contributor>
       <comment>...</comment>
       <text xml:space="preserve">...
       </text>

    </revision>

    <revision>
       <id>2</id>
       <timestamp>2004-09-14T08:55:27Z</timestamp>
       <contributor>
          <ip>...</ip>
       </contributor>
       <text xml:space="preserve">...
       </text>
      ....
</page>	

So I think the <id> directly under <page> identifies the page and is fixed; the revision <id>s increase with each new edit. --Wanchun
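Regarding where the format is documented: the opening <mediawiki ...> root element of any dump names the XML schema it follows (in its xmlns/schemaLocation attributes), and you can peek at it without decompressing the whole file; the filename below is an example:
bzcat itwiki-pages-articles.xml.bz2 | head -n 5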

enwiki dump cycle

Am I right in thinking that en.wikipedia is now effectively on a two week dump cycle, on the basis that the current dumps started on the second, and have an ETA on the 16th? Is there any chance of doing the very largest dumps less often, and the smaller ones moreso? Alai 18:42, 9 May 2006 (UTC)

Besides that, the dump process, although much improved during the last year, is not so robust yet. Most dumps for en (and every now and then for other large Wikipedias) are incomplete, without an archive dump, and hence unsuitable for e.g. wikistats. Erik Zachte 21:17, 9 May 2006 (UTC)
Indeed so, I've noticed that the complete-edit-history dump for en, which is what makes the whole cycle so slow, is in any event failing quite often. Might be all the more reason to use a modified cycle. Alai 22:46, 13 May 2006 (UTC)
Two weeks may have been unduly optimistic, of course: it's currently looking likely to be 18 days+. Alai 04:44, 16 May 2006 (UTC)

Using the bzip2 program.

I have downloaded the .bz2 file for the enwiki dump, but I have no idea how to work the bzip2 program. Could someone run me through how to decompress the file? Thanks a lot! -Jared|talk

There's documentation on the bzip2 program here. Alai 03:56, 15 May 2006 (UTC)
Thanks, but I have read all the documentation, and it seems that I am supposed to know how it works before I read it. If anyone has a clue on how to do it (actually give me a step by step for a Windows computer), please enlighten me. -Jared | talk 20:05, 15 May 2006 (UTC)
As there's really only one "step", to wit typing "bunzip2 <filename>", which is clearly stated in the documentation, and you don't say anything about what specific problem you may or may not be having (installation issues?), I'm somewhat at a loss. Perhaps you'd be better-off at a bzip-specific forum. Alai 04:42, 16 May 2006 (UTC)
It's just that I'm not sure what program to open first of all (the CMD prompt? is there another program that should have been installed, too?). Then, if it is the command prompt, what do I type, because bunzip2 is "not an internal or external command". I really do appreciate your help though. It's not too big of a deal. -Jared | talk 19:52, 16 May 2006 (UTC)
I think I may see your difficulty. You run it from a command prompt (or from a run dialog, bash shell, or other equivalent). Make sure it's in a location your cmd (etc) will pick up (or the current directory, or use a fully-qualified path). When you d/l the windows version, it has a filename such as "bzip2-102-x86-win32.exe"; you may wish to rename that to something a little more concise. It has a default action based on the name of the command used, so if you're using it as-is, you'd need to type something like "bzip2-102-x86-win32.exe -d <filename>" to indicate a decode action. Hope that helps. Alai 23:28, 16 May 2006 (UTC)
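To make that concrete, the whole operation from a Windows command prompt is just something like the following (the executable name is whatever you downloaded, and the dump filename is an example):
cd C:\downloads
bzip2-102-x86-win32.exe -d enwiki-20060518-pages-articles.xml.bz2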
Thank you so much! Sorry I was being so ambiguous before; I just had no Idea what I was doing. I got the cmd to run, but somehow I lost the download. Haha. Once I download it again, I should be golden! ☺ Thanks, again! -Jared | talk 01:02, 17 May 2006 (UTC)
Now I'm getting error messages saying that the file I'm trying to decompress is "not a bzip2 file". It obviously is. Maybe I'm just doing it wrong. Whatever. -Jared | talk 02:52, 17 May 2006 (UTC)

I've unzipped it; What next?

I've got everything dumped and unzipped. When I went to open the XML, only markup came up, and that's what I expected. What do I do to actually view the pages? Can I use IE or Firefox, or must I use some sort of MediaWiki program? Thanks. -en:User:JP06035
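You can't view the pages just by opening the XML in a browser; it needs to be imported into a local MediaWiki installation (PHP + MySQL), after which you browse your own wiki in IE or Firefox as usual. Once MediaWiki is installed, the basic import step is along the lines used elsewhere on this page (filename is an example):
cat enwiki-pages-articles.xml | php maintenance/importDump.php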

en dump in progress, or not?

For a while, the download page has had a line that looks something like this for enwiki:

  • 2006-06-25 17:41:00 enwiki Dump in progress

But there's no sign of an archive at /20060625 or '26. More problematico technicale? Alai 05:42, 26 June 2006 (UTC)

Never mind, I think I see what's been going on: there's a comment on the '0619 dump for "All pages with complete page edit history (.bz2)" that "This dump was run experimentally with a speedup after an earlier database glitch killed it." on the 25th, which I assume is what caused the updated message on the front page, though apparently it's not in fact currently in progress; the two status messages seem to have become inconsistent at some point. Alai 02:43, 27 June 2006 (UTC)

images problems

I have downloaded and installed locally the dump file itwiki-20060530-pages-articles.xml.bz2. Then I downloaded and installed MediaWiki (1.6.6), MySQL 5, etc. Seeing that I didn't have any images:

  1. I have downloaded and imported into my db itwiki-20060530-image.sql, itwiki-20060530-imagelinks.sql, and itwiki-20060530-oldimage.sql.
  2. I have modified LocalSettings.php to enable image uploads
  3. I have downloaded and installed ImageMagick

After these changes, when loading an article in the browser, some subdirectories of the images dir were created, but they were empty.

How can I get the articles' images?

Thanks
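One thing to note, in case it helps: the *-image.sql dumps only contain image metadata, not the files themselves; the actual files come from the image tarball (see the "Importing Images" section above). After unpacking that into images/, something like the following should register them:
php maintenance/rebuildImages.php --missing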

Dumps nlwiki in progress since 20060626

The dumps for the Dutch Wikipedia have been in progress for 10 days now, since June 26, so maybe the process is hanging ([2]). The dumps are partly done; however, the pages-articles.xml that is important to me is still waiting. JePe 11:05, 6 July 2006 (UTC)

That's exactly the same file I've been waiting for also JurgenG 06:25, 8 July 2006 (UTC)
They seem to have started again JurgenG 20:41, 9 July 2006 (UTC)
Yes, the dump from 20060709 is ready now. However, the dump from 20060626 is still in progress, maybe forever until someone manually kills the process. JePe 14:20, 10 July 2006 (UTC)

Duplicate dumps in different formats?

It appears that (for en. at least) two separate dumps are being produced for "All pages with complete edit history" -- one as a .bz2, and another as a .7z. If that's indeed the only difference, is this a transitional arrangement, or is it likely to be long-standing? Alai 19:30, 24 July 2006 (UTC)

Creation of tables

OK. I managed to convert the XML file to a SQL file. But how do I create the needed tables? Do you provide a script just to create the tables? Malafaya 16:45, 5 August 2006 (UTC)

Hmm... if you install MediaWiki that should create the tables for you. Or you could just fish the table definitions out of the MediaWiki code. -- Visviva 12:27, 11 August 2006 (UTC)
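If you'd rather not run the installer, a sketch of the 'fish them out' route: the definitions live in maintenance/tables.sql in the MediaWiki source tree, so (with the database name as an example):
mysql -u root -p wikidb < maintenance/tables.sql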

Categories

Argh! How is it possible that the categorylinks dump does not include links from Category: space? Is there any dump besides the main one that contains category-structure information? -- Visviva 12:27, 11 August 2006 (UTC)

If I'm understanding you correctly... The categorylinks dump includes only the category inclusion links i.e. in the style of [[Category:Blah]]. But it does include all of them, including category inclusion in other categories (I've used this to perform assorted offline manipulation of the complete category structure, and I can confirm that it's all there). Non-inclusion style links to or from categories (i.e. [[:Category:Blah]], or [[<ns>:Blah]] for namespaces other than Category and Template) work exactly like page-to-page links, and will be included in pagelinks.sql.gz (I infer, I can't confirm that directly). Alai 07:57, 21 August 2006 (UTC)
Thanks -- I guess I'll need to look again. My initial check -- based on counting the number of links for each category -- seemed to indicate that only links from article space were included. -- Visviva 08:46, 29 August 2006 (UTC)

Just to be clear about what I'm saying: for example, one row of categorylinks is:

+---------+------------------+------------+---------------------+
| cl_from | cl_to            | cl_sortkey | cl_timestamp        |
+---------+------------------+------------+---------------------+
| 2529663 | Albums_by_artist | !!! albums | 2005-08-24 18:09:12 |
+---------+------------------+------------+---------------------+

And if one looks up page_id 2529663 in the "page" table (just showing the first three columns, this time):

+---------+----------------+------------+
| page_id | page_namespace | page_title |
+---------+----------------+------------+
| 2529663 |             14 | !!!_albums |
+---------+----------------+------------+

That conform to what you're looking for? Alai 23:16, 29 August 2006 (UTC)
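As a concrete example, assuming both tables are loaded into MySQL, a query along these lines lists category-to-category inclusion links (namespace 14 being the Category namespace, as in the page row above):
mysql -u root -p wikidb -e "
  SELECT p.page_title AS child_category, cl.cl_to AS parent_category
  FROM categorylinks cl JOIN page p ON p.page_id = cl.cl_from
  WHERE p.page_namespace = 14
  LIMIT 10;"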

weird tags?

Hi! I've been trying to import enwiki-20060810-pages-articles into a fresh MediaWiki 1.7.1 install. I followed the steps described on this page and used MWDumper. The process took many hours and showed not a single error, but... many pages appear containing tags like "{{#if:|". The Special:Statistics page is terribly wrong (it shows I have 0 articles). Special:Mostlinked is empty, too. Should I run maintenance/rebuildall.php? Should I download pagelinks.sql.gz, categorylinks.sql.gz, templatelinks.sql.gz, or something like that (as the post above suggests)? Thanks!

  • Just to be clear, I've never worked with any of the XML dumps, so don't assume anything one way or the other on the basis of my comments on the SQL table-dump information, which I've only ever used on their own. (I need to get a much bigger machine: my desktop is really struggling to cope with just pagelinks, never mind the pagetext info, and as for the all-histories dump, shudder...) Be handy if there were more developers and the like hanging around this page: maybe try their mailing list, or the technical section of the 'pump. Alai 00:44, 2 September 2006 (UTC)

LOL, my desktop took around 20 hours importing the SQL dump into MySQL (and a couple of hours to convert XML => SQL)... Well, maybe you could just give me some direction... What does this mean?

||{{#if: 8 August 2006 | (8 August 2006) | {{#if: | {{#if: | ({{{month}}} {{{year}}}) | ({{{year}}}) }} }} }}

Looks like some kind of macro... I didn't even know that Wikipedia contained such bits of script! However, my Wikipedia "copy" shows a lot of articles full of such messy things. Another funny thing is the <ref> and </ref> tags; I never saw them on the official Wikipedia. So my guess is that my MediaWiki installation lacks something that transforms those tags into their human-readable form. Maybe the information is already contained in the dump but has to be "activated" somehow?! I'm pretty confused...

Yes, basically it's a built-in macro, AKA a ParserFunction (see that page for more info). They're usable directly in wiki pages (most especially in templates), but you may be seeing some added by the software, too. I'm not surprised to see added tags; that's rather the nature of storing them in XML... Alai 03:52, 3 September 2006 (UTC)

Thanks Alai! Maybe I should use importDump.php instead of MWDumper... I'll try that!
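On the rebuildall.php question: running it after a big import is probably worthwhile, since mwdumper only fills the page, revision and text tables, and the link tables (which Special:Mostlinked and friends rely on) have to be rebuilt separately. A sketch, run from the wiki's base directory:
php maintenance/rebuildall.php
The "{{#if:|" output is a separate issue: it means the ParserFunctions extension isn't installed (see the extensions discussion below).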

any Germans around?! :)

I was researching how to import the Wikipedia database... It seems that de.wikipedia.org/wiki/Wikipedia:Download contains a lot of interesting information (like hardware requirements and how to import links) that is lacking at en.wikipedia.org/wiki/Wikipedia:Download. Could anyone translate it, please? ;)

That sounds like a very good plan. I'll wait a while and see if anyone with decent German steps in, but if all else fails, I might have a go with my "de-1" (more like de-0.5), and copious use of machine translation... Alai 16:08, 4 September 2006 (UTC)

Dumps idle for a week?

Anyone know why the dumps are still idle, over a week since the previous one finished? Seems a little harsh given that the "weekly cycle" is more like three weeks these days (given the failures and restarts, and the seeming duplication of the full edit history dumps). Alai 04:05, 3 September 2006 (UTC)

extensions

What extensions do I need to install on my MediaWiki to get Wikipedia content displayed properly? So far, I have found that Wikipedia needs Cite.php & ParserFunctions.php... Anything else?

Incremental dumps, at least for enwiki?

It seems like the probability of a successful history dump of enwiki is pretty low. Perhaps it would be sufficient to do an incremental dump - i.e., a dump, but only of revisions that have happened since the last successful dump. This could, at least, be a data product that's created *before* attempting the full dump. Is this practical/possible? I'd be happy to help write the job, if it's just a matter of development. Bpendleton 18:01, 22 September 2006 (UTC)

That sounds sensible to me, if there's the apparatus in place to make it happen. How many people actually are depending on the full dumps, anyway? How often do they need updates? The trouble is that these dumps (or rather, failed attempts at dumps) are really throwing the rest of the dump schedule out of whack. Alai 17:58, 12 October 2006 (UTC)
I don't know about others, but I've been using the dumps for full-history trend analysis.... hard to do with just the periodic "most recent" dumps, but also frustrating when you have to base your work on months-old data because the newer stuff doesn't exist. Not sure why the full ones are failing, but it really should be possible to generate incrementals at a much smaller system load. And, I'd still be happy to help. Bpendleton 22:24, 15 October 2006 (UTC)

Possible Solution

I have written a simple page for getting the basics working on Ubuntu Linux (server edition). This worked for me, but there are issues with special pages and templates not showing. If anyone can help with this, it would be great. It's located at http://meta.wikimedia.org/wiki/Help:Running_MediaWiki_on_Ubuntu WizardFusion 19:52, 27 September 2006 (UTC)

Nice job! I've got the same problem with templates, I guess. The default MediaWiki install lacks some extensions heavily used by Wikipedia. The complete list is available at http://en.wikipedia.org/wiki/Special:Version. But my own experience shows that the Cite & ParserFunctions parser hook extensions are enough for most users.
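In case it saves someone a search, a sketch of enabling those two extensions on a MediaWiki of that era, after copying their files into extensions/ (the paths below are the conventional ones; older tarballs may lay Cite out differently, so check each extension's own instructions):
cat >> LocalSettings.php <<'EOF'
require_once( "$IP/extensions/Cite/Cite.php" );
require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );
EOF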
Return to "Data dumps/Archive 1" page.