Grants:Project/Traverseda one/Web-archive based offline releases

status: test
Web-archive based offline releases
summary: please add a 1-2 sentence summary
target: please add a target wiki
amount: please add the amount you are requesting (USD)
grantee: Traverseda one
created on: 22:46, 23 September 2019 (UTC)


Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.

There is currently no way to use a Wikipedia XML dump to build your own ZIM file for the offline wiki reader app Kiwix. To the best of my knowledge, creating a ZIM file relies on internal Wikipedia infrastructure and can't reasonably be done by a member of the public.


What is your solution to this problem?

For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.


Write software to convert Wikipedia XML dumps to HTML, and ultimately to write that HTML either to the Internet Archive WARC format or to plain files suitable for use with zimwriterfs.

Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.


Convert a Wikipedia multistream XML dump into multiple Internet Archive web archive (WARC) files. WARC files can be used as an alternative to Kiwix's ZIM files for offline wiki readers, with a number of advantages. The main advantage is the existing ecosystem of WARC-compatible tools. Another big advantage is that users can create, edit, and customize their own WARC files, which is very difficult to do with Kiwix's ZIM files.

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:

  1. During your project, what will you do to achieve this goal? (These are your outputs.)
  2. Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (i.e. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.


You will be able to install a command-line program with pip on any modern Linux distribution. The program will take a Wikipedia multistream XML dump and generate Internet Archive compliant WARC files.
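To make the target output concrete, here is a minimal sketch of what one WARC "response" record for a rendered article could look like, written with only the Python standard library. The function name `warc_response_record` and the example URLs are assumptions for illustration, not part of the project's code. A real tool would more likely use an existing WARC library such as warcio, and whether pywb accepts these exact records has not been tested here.

```python
import gzip
import uuid
from datetime import datetime, timezone


def warc_response_record(url, html, when=None):
    """Build one gzip-compressed WARC 'response' record for a rendered page.

    Hypothetical helper: hand-rolled only to show the record layout.
    """
    when = when or datetime.now(timezone.utc)
    body = html.encode("utf-8")
    # The record block is a complete HTTP response: status line, headers, body.
    http = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html; charset=utf-8\r\n"
        + f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
        + body
    )
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http)}",
    ]
    # Header lines, a blank line, the block, then two CRLFs end the record.
    record = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + http + b"\r\n\r\n"
    # Compress each record as its own gzip member so a reader can seek to it.
    return gzip.compress(record)


# Example URLs are assumptions; any stable per-article URL scheme would do.
records = [
    warc_response_record("https://en.wikipedia.org/wiki/Alpha", "<p>Alpha</p>"),
    warc_response_record("https://en.wikipedia.org/wiki/Beta", "<p>Beta</p>"),
]
archive = b"".join(records)  # the bytes of a small .warc.gz file
text = gzip.decompress(archive).decode("utf-8")
print(text.count("WARC/1.0"), "records written")
```

Compressing each record separately, rather than the whole file at once, is what lets an index point a viewer straight at a single article without decompressing everything before it.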

You will be able to browse an offline copy of Wikipedia using the pywb WARC viewer, with no significant rendering issues.

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable.


Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?


  • Parsing Wikipedia multistream XML dumps quickly (completed)
  • Parsing wikitext from the pages in the dump
  • Rendering wikitext to HTML (challenging due to template transclusion; this essentially needs to be an "online parser", and some infrastructure for this already exists)
  • Increasing the speed at which pywb builds indexes, so that large datasets like Wikipedia perform well
  • Stretch goal: cluster rendering, using multiple computers to convert wikitext to HTML
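As an illustration of the first activity, the sketch below shows one way to pull a single page out of a multistream dump without decompressing the whole file. It relies on two properties of the multistream format: the dump is a series of concatenated bz2 streams, and the companion index lists one `offset:page_id:title` line per page. The helper names (`find_offset`, `read_stream`, `pages_in_stream`, `build_demo`) and the tiny in-memory dump are assumptions made for the example; this is not the project's actual parser.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def find_offset(index_lines, wanted_title):
    """Return the byte offset of the stream holding `wanted_title`, or None."""
    for line in index_lines:
        # Titles may contain colons, so split at most twice.
        offset, _page_id, title = line.rstrip("\n").split(":", 2)
        if title == wanted_title:
            return int(offset)
    return None


def read_stream(dump, offset):
    """Decompress the single bz2 stream that starts at `offset`."""
    dump.seek(offset)
    decompressor = bz2.BZ2Decompressor()
    chunks = []
    while not decompressor.eof:
        data = dump.read(64 * 1024)
        if not data:
            break
        chunks.append(decompressor.decompress(data))
    return b"".join(chunks)


def pages_in_stream(xml_bytes):
    """Yield (title, wikitext) for every <page> in one decompressed stream."""
    # A stream holds bare <page> elements, so wrap them in a root element.
    root = ET.fromstring(b"<pages>" + xml_bytes + b"</pages>")
    for page in root.iter("page"):
        yield page.findtext("title"), page.findtext("revision/text") or ""


def build_demo():
    """Build a tiny synthetic dump and index (one page per stream)."""
    pages = [("Alpha", "''first''"), ("Beta: two", "[[Alpha]]")]
    blob, index = b"", []
    for page_id, (title, text) in enumerate(pages, start=1):
        index.append(f"{len(blob)}:{page_id}:{title}")
        xml = f"<page><title>{title}</title><revision><text>{text}</text></revision></page>"
        blob += bz2.compress(xml.encode("utf-8"))
    return io.BytesIO(blob), index


dump, index = build_demo()
offset = find_offset(index, "Beta: two")
result = list(pages_in_stream(read_stream(dump, offset)))
print(result)
```

Because each stream can be decompressed independently, the same offsets could also be handed out to separate worker processes or machines, which is the idea behind the cluster-rendering stretch goal.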

Budget

How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

$3600

It's hard to estimate the exact scope of work, especially for the HTML-rendering portion. I've budgeted 60 hours for this project at a rate of $60/hour, which gives the $3600 total. If it takes more than 60 hours, I intend to complete it regardless. I will track my time.

Community engagement

How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.


Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

I'm Alex Davies, a self-taught software developer from Nova Scotia. I mostly do contract development work. I've already started work on this project, but it has become clear that it will require more time than I can currently afford to volunteer. This grant will allow me to work on this project instead of taking on more paid contract work.

Community notification

You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. Need notification tips?

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).