Community Wishlist Survey 2017/Editing/Converter from Latex and/or MSword

Converter from Latex and/or MSword

  • Problem: Despite VisualEditor (VE), many people still end up writing content in Google Docs, MS Word or LaTeX and then wanting to copy it over (especially in editathons and WikiJournal article submissions). Additionally, many potential maths-focussed contributors would benefit from being able to copy over equations written in LaTeX.
  • Who would benefit: Complete novices. Those who have written content outside of Wikimedia but now wish to import it. Editathon organisers. WikiJournal editors receiving article submissions in formats other than wikimarkup. Mathsy types who want to paste in equations.
  • Proposed solution: It's pretty easy to convert between MS word, Googledoc, LaTeX, and PDF, so being able to convert to wikimarkup from any of these would be extremely helpful, even if the result needed manual tweaking afterwards to deal with references and with images that have to be imported to Commons.
  • More comments:
  • Phabricator tickets:

Discussion

The sentence "It's pretty easy to convert between MS word, Googledoc, LaTeX, and PDF" is just not true. In most cases PDF can't be converted to anything useful. (Tools like pdftotext or pdfimages just extract some parts of the file.)

There are tools like pandoc that are able to convert many formats (except PDF, of course) to MediaWiki wikitext. However, the output needs manual post-production. So imho it would be better to create a kind of centralized 'service' for those who have technical problems (of whatever kind). -- seth (talk) 11:04, 3 December 2017 (UTC)
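
As a rough illustration of the pandoc route mentioned above, here is a minimal Python sketch that shells out to pandoc and asks for MediaWiki output. The function names and the extension-to-reader table are assumptions made for this example, not part of any existing tool, and the table is deliberately incomplete. PDF is left out on purpose, in line with the point that pandoc does not read it.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed mapping from file extension to pandoc reader name.
# Illustrative only; pandoc supports more input formats than these.
READERS = {".tex": "latex", ".docx": "docx", ".odt": "odt", ".html": "html"}


def build_pandoc_command(source, target):
    """Return the pandoc argument list for converting `source` to wikitext."""
    suffix = Path(source).suffix.lower()
    if suffix not in READERS:
        # PDF ends up here: it is not an input format pandoc can read.
        raise ValueError(f"no pandoc reader configured for {suffix!r}")
    return [
        "pandoc",
        "--from", READERS[suffix],
        "--to", "mediawiki",
        "--output", str(target),
        str(source),
    ]


def convert_to_wikitext(source, target):
    """Run pandoc; the result still needs manual checking (captions, refs, images)."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(build_pandoc_command(source, target), check=True)
```

Calling `convert_to_wikitext("draft.tex", "draft.wiki")` would write the wikitext next to the source file. Anything pandoc handles poorly still has to be fixed by hand afterwards.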

You're right to say that there's no true conversion from PDF, even by MS Word itself. However, even just pulling text, images, and basic formatting like header levels would be good. Extracting references would, of course, be the most useful but also the most difficult. T.Shafee(Evo﹠Evo)talk 03:47, 6 December 2017 (UTC)

We have Wikipedia:Tools#Importing (converting) content to Wikipedia (MediaWiki) format. I grant that something more unified and consistently maintained would be nice, but how feasible is it? This is a seriously difficult task. seth is right; I've used pandoc, and it copes well with the bog-standard things, but if you have, say, image captions, you will be doing a lot of manual editing. It could be useful to set up a program to learn from how humans manually correct automated conversions. But this request might basically be an AI problem.
Export functions are a similar problem. Orgs like PLOS are already using MediaWiki as a publisher's tool, and need to import authors' copies and produce other formats at the end of the processing; they might already have something specialized. But a lot of publishers seriously use hired typists for format conversion.
For equations, what modifications do you want to what we have? LaTeX is currently being very slowly updated; version three should be out any decade now. Stand-alone HTML 5 would be a nice format to have, and presumably easier.
If I've understood you correctly, extracting refs is the easy bit. Grab the DOIs, if there are any, and look them up, or use more sophisticated scraping techniques if there are no DOIs. Zotero does this for me all the time, turning a downloaded PDF into a full citation database entry. It's open-source, and I think we already use bits of its code. HLHJ (talk) 04:39, 8 December 2017 (UTC)
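
To make the "grab the DOIs" step concrete, here is a minimal Python sketch that pulls DOI-like strings out of already-extracted text. The regular expression is an assumption that covers the common modern DOI shape and will miss unusual older ones. The function name is made up for this example, and this is not how Zotero does it. Looking the DOIs up afterwards is not shown.

```python
import re

# Assumed pattern for the common modern DOI shape: "10.", a 4-9 digit
# registrant prefix, a slash, then a suffix. It is not exhaustive.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(text):
    """Return the unique DOI-like strings in `text`, in order of first appearance."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        # This would wrongly trim a DOI that really ends in one of these characters.
        doi = match.group(0).rstrip(".,;:)")
        if doi not in found:
            found.append(doi)
    return found
```

For example, running it on text containing the made-up strings `doi:10.1234/abc.def.` and `https://doi.org/10.5555/12345678,` would yield `10.1234/abc.def` and `10.5555/12345678`. References without a DOI would still need the scraping approach described above.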
  • LaTeX, maybe, but Google Docs and M$ Word? No way. We should not endorse proprietary software. I imagine one prominent use case for this would be PR/marketing spammers preparing drafts offline. MER-C (talk) 05:03, 28 November 2017 (UTC)

Voting