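Community Tech/Ebook Export Improvement
The Ebook Export Improvement project aims to improve the experience of exporting books from Wikisource. It was the #1 wish in the 2020 Community Wishlist Survey. Users struggle with the current system, which has several problems with the reliability of results, print settings, formatting, and user experience. These problems complicate the workflow, make Wikisource harder to use, and lead some users to avoid Wikisource as much as they can.
This page documents a project that the Wikimedia Foundation Community Tech team has worked on or declined in the past. Technical work on this project is complete.
Discussion takes place on the talk page, and you are welcome to join.
Ebook export from Wikisource has considerable potential, but it needs further improvement to serve a wider audience. In this project, we will investigate and verify the main issues, work with the various Wikisource communities to implement solutions, and keep improving ebook export in a sustainable way. We welcome your comments and suggestions on the talk page.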
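Why export to ebooks?
In general, ebook export is at the core of the Wikisource experience, and users export content as ebooks for many reasons. First, it can be a safeguard against poor internet connectivity. With an offline book format, users can read material on a variety of devices even when their connection is slow, intermittent, or unreliable.
Second, some users want to read on a dedicated ebook device. A Kindle or Kobo, for example, lets them customize the interface, take notes, and look up words, and it has the advantage of a long battery life. Many users therefore export ebooks to a file format that is compatible with their e-reader.
Third, some users want to create an ebook to share with others in an easy-to-use format. In this case, they will choose the file format that suits the recipient. This is more flexible than simply passing along a Wikisource link.
Fourth, some users work with Wikisource professionally, for example at educational institutions, archives, or museums, and may want to keep offline copies as reference material for the public or for teaching. Exporting to an ebook lets them bring the content into their own materials and workflows, so they no longer need to open Wikisource each time.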
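How export works
There are four main ways to download ebooks on Wikisource: exporting directly with WSExport, using the buttons in the left sidebar of the desktop view, following links on the main page (for example, the featured book), or following links at the top of a text. There is in fact a fifth way ("Create a book"), which is technically possible but almost never used. Each method is described in turn below. Note that not every wiki offers every method; whether a wiki offers one option or several varies.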
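Method 1: WSExport
WSExport is the tool used to export ebooks from Wikisource. It was originally developed for French Wikisource by the user Tpt, and it offers a choice of download formats such as EPUB and PDF. To reach WSExport, go to "Download/print" in the left sidebar and then "Choose format". Alternatively, open ws-export.wmcloud.org directly.
To export with WSExport, users specify several settings: the language code, the page title, the file format, whether to include fonts (if available), and whether to include or omit images. The language code and page title are typed into text fields, and the file format and fonts are chosen from dropdown menus. When the user presses the "Export" button, the tool converts the page into the selected file format for download. For detailed documentation of the tool, see Wikisource:WSexport.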
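As a rough illustration of the settings listed above, the sketch below assembles a WSExport request URL from a language code, a page title, and a file format. The query parameter names (`lang`, `page`, `format`, `fonts`, `images`) and the helper `build_export_url` are assumptions made for this example; Wikisource:WSexport is the place to check what the tool actually accepts.

```python
from urllib.parse import urlencode

# Tool address as given in the text above.
BASE_URL = "https://ws-export.wmcloud.org/"


def build_export_url(lang, page, fmt="epub", fonts=None, images=True):
    """Build an export URL. Parameter names are assumed, not confirmed."""
    params = {"lang": lang, "page": page, "format": fmt}
    if fonts:
        # Only sent when the user picked a font or script.
        params["fonts"] = fonts
    if not images:
        # Hypothetical flag for omitting images from the export.
        params["images"] = "false"
    return BASE_URL + "?" + urlencode(params)


url = build_export_url("en", "The Jungle", fmt="pdf")
print(url)
```

Keeping the optional settings out of the URL unless they are set mirrors the form, where fonts and images are choices rather than required fields.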
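Note that the "Include fonts" section is used differently depending on the language. For languages written in Latin script, users typically select a font such as "FreeSerif". For Indic languages, however, users must select the actual script, which is essential for the ebook to export properly.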
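Method 2: Export via the left sidebar
The second method is to press "Download/print" in the left sidebar (in the desktop view) and export the ebook from there. The page then shows a choice of download formats. In the example screenshot, the user is downloading the page "The Jungle" and can choose between PDF, EPUB, and MOBI.
By default, the left sidebar offers only one option: ElectronPDF is used to download a single section as a PDF. To download a multi-section book, users need to enable the WSExport tool by going to "Preferences" > "Gadgets" > "Interface" > "Add a print/export link to download pages as EPUB files using the WSExport tool" and ticking the checkbox. EPUB and MOBI then appear as options in the navigation menu, and multi-section books can be downloaded directly. In the sample screenshot, the user has enabled multi-section book downloads and can download all of the page's content together.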
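Method 3: Export via links on the main page
The third method is to download an ebook from a link on the main page. In the sample screenshot, the "April’s Featured Article" has a heading that reads "Grab a download!". Four file formats are available, and hovering over each icon shows a description of it. In the sample, the user hovers over the WSExport icon; clicking it automatically launches the tool and starts the download.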
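Method 4: Export via links at the top of text
The fourth method is available when links are provided at the top of the displayed text. The screenshot shows Bengali Wikisource with the download options displayed. As with the sidebar links, a choice of download file formats is shown; in the sample, the WSExport tool is selected.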
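Method 5: Create a book
The fifth method generates a book in ODT or ZIM format. To start, click "Create a book" in the left sidebar, which opens Special:Book (also titled "Manage your book"). You can then specify by hand which pages to include, repeating the step until every page has been added. Note, however, that the feature was originally built for Wikipedia users who wanted to create source texts, so it is not convenient for Wikisource users and is rarely used for Wikisource purposes.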
Fundamental problems with export today
Ebook export currently has a number of problems, which fall into three broad categories: reliability, formatting and styles, and user experience.
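Reliability
The WSExport tool is not consistently reliable. Users frequently report downtime and timeouts that prevent them from exporting books. Some of these problems are documented as Phabricator tickets, such as T250614 and T219330#5060262. In the 2019 Community Wishlist Survey, this was voted the #4 wish. As a result, Community Tech launched the Ebook Export Reliability project to improve the export experience. By the end of that project, WSExport had reached an average monthly uptime of 99.42% (as of 20 June 2019), and total downtime had fallen from 941 minutes (1-15 May 2019) to 179 minutes (1-15 June 2019). For detailed data on reliability after this work, see T226136.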
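To put the two downtime totals on the same scale as an uptime percentage, the short calculation below converts minutes of downtime into percent uptime. It assumes each measurement window is 15 full days, which is our reading of the date ranges above; the 99.42% monthly average quoted in the text was measured separately and is not reproduced here.

```python
def uptime_percent(downtime_minutes, period_days):
    """Share of the period (in percent) during which the tool was up."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# Downtime totals from the text; the 15-day window length is an assumption.
may_uptime = uptime_percent(941, 15)
june_uptime = uptime_percent(179, 15)

print(f"1-15 May 2019:  {may_uptime:.2f}% uptime")
print(f"1-15 June 2019: {june_uptime:.2f}% uptime")
```

Under that assumption, the drop from 941 to 179 minutes corresponds to a move from roughly 95.6% to roughly 99.2% uptime within the window.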
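Even after these improvements, problems persisted in the Wikisource communities. Users continued to run into export problems; on French Wikisource, for example, Viticulum reported "Problems detected in epub generated with Wsexport". As another example, WSExport was down for roughly 12 hours on 30 September 2019. In total, 13 outages were reported, many of them lasting several hours. In addition, recent tests by Community Tech found intermittent failures when attempting to download MOBI files. At present, timeout error messages are shown only on the WSExport page (not when using download links). Taken together, this situation is painful for Wikisource users, and in many cases they do not know how to work around it.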
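Formatting and styles
Ebook exports often run into formatting and style problems. The details vary, but some issues are common: missing or garbled text, duplicated text, broken pagination, missing table captions, incorrect capitalization, incorrect border formatting, misplaced content, misplaced tables, and broken table formatting. In some cases, even words are changed. Users cannot tell what caused any of these errors, which is unsettling. The errors also go against Wikisource's core principle of presenting the original text faithfully. Some examples are described below. The list is not exhaustive; treat it as a reference for common errors.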
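Example 1: A single page split into two pages
In the screenshot, the content is split across two pages. In the source version on English Wikisource, however, it fits on a single page.
Example 2: Garbled fonts
The screenshots show garbled text on Tamil Wikisource (left) and Kannada Wikisource (right). Characters are replaced by white boxes (commonly called "tofu"). One cause is that the "Include fonts" field has no option for Kannada, and other problems caused by this field have been reported as well. Tamil, however, is available as an included font, and the problem occurs anyway.
Example 3: Conjunct consonants rendered incorrectly
In the sample, characters are displayed incorrectly. The source text on Bengali Wikisource correctly shows "প্রথম". When exported to an ebook (sample screenshot), however, it appears as "পরথম". The affected characters are conjunct consonants, which combine multiple consonants within a word. In these samples, a font-processing problem has broken the consonants apart. Users of many Indic languages face this issue.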
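The difference between the two spellings quoted above can be made visible by listing their Unicode code points. The sketch below is only an illustration of how the strings differ; it does not diagnose where in the export pipeline the problem arises, which the text attributes to font processing. The helper `describe` is a name invented for this example.

```python
import unicodedata

VIRAMA = "\u09cd"  # BENGALI SIGN VIRAMA, which joins consonants into a conjunct

source_word = "\u09aa\u09cd\u09b0\u09a5\u09ae"  # the correct form quoted above
broken_word = "\u09aa\u09b0\u09a5\u09ae"        # the form seen in the export


def describe(word):
    """Return one 'U+XXXX NAME' entry per code point in the word."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in word]


for label, word in (("source", source_word), ("export", broken_word)):
    print(label, word)
    for entry in describe(word):
        print("   ", entry)

# The exported spelling equals the source spelling with the virama removed.
same_without_virama = source_word.replace(VIRAMA, "") == broken_word
print("differs only by the virama:", same_without_virama)
```

Seen this way, the exported word reads as if the joining sign were missing, so the consonants appear side by side instead of as one combined glyph.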
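Example 4: Incorrect text wrapping around images
In the sample, the text wraps around the image. In the original Bengali version, however, the text does not wrap and is placed below the image.
Example 5: Incorrect content alignment
The sample is a PDF in which the content is aligned to the left. In the original version on Armenian Wikisource, however, the content is centered on the page.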
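Usability and user experience
The export process is not friendly to newcomers, because it cannot be used well without learning a fair amount of incidental technical detail. The WSExport tool is hard to find, and it lacks an intuitive user experience (UX). For example, some scripts are missing from the "Include fonts" section (such as Bengali), and some scripts are listed there but still produce garbled text when exported. Newcomers who turn to the sidebar buttons instead may find that the steps for downloading a multi-part ebook are not obvious. For Wikisource to grow, export must be intuitive enough for newcomers to use. For these reasons, UX improvements to ebook export are also in scope.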
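Questions we are asking
- Have we covered the main reasons why users export to ebooks?
- Have we covered the main export methods?
- Have we covered the main problems users encounter when exporting?
- In your experience, which formatting and style problems are the most common and the most frustrating?
- In your experience, which user experience problems are the most common and the most frustrating?
- Overall, which problems stand out as most in need of improvement, and why?
- Is there anything else you would like to suggest?
Please share your comments and suggestions on the talk page. We look forward to reading them! Thank you.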
Status Updates
March 31, 2021: Final update
Hello, everyone! We are now thrilled to announce that the Ebook Export Improvement project is complete. This has been a tremendously rewarding experience, and we have really enjoyed collaborating with the diverse and passionate Wikisource community. Below, we will share a summary of the work we did and the impact it has had. Thank you again for all of your collaboration!
Reliability
We knew that, in order for our work to be impactful, Wikisource Export needed to be a more reliable and efficient tool. For this reason, we focused on improving the overall health of the code that supports the tool. We also tackled multiple issues that were contributing to slow or error-laden downloads. The technical details of this work can be found in previous updates. Now, we’re excited to share the final results:
- The Uptime Robot Dashboard shows that WS Export is nearly always up (as of 31 March 2021):
- Last 24 hours: 99.139%
- Last 7 days: 99.768%
- Last 30 days: 99.834%
- Last 90 days: 99.896%
- The median export time for books is 3 seconds, based on one month of data recorded between 17 February and 17 March (source). We don’t know the median export time before our changes, since we only recently began logging this data.
- There are many more exports being successfully generated:
- Within a 30 day period, our data recorded 1,168,667 successful exports after our changes (source) vs. 71,613 successful exports before our changes (source). This shows 16 times as many successful exports after the changes.
- Within a 60 day period, our data recorded 1,513,486 successful exports after our changes (source) vs. 160,433 successful exports before our changes (source). This shows 9.4 times as many successful exports after the changes.
- Within a 90 day period, our data recorded 1,628,523 successful exports after our changes (source) vs. 249,924 successful exports before our changes (source). This shows 6.5 times as many successful exports after the changes.
- Note: It makes sense that the greatest growth happened within 30 days, since we have released the most changes within the last few months. However, it should be noted that some of the ebook exports (both before and after the changes) may be from bots, and we cannot easily determine which downloads are from bots.
- The error rate of ebook exports is significantly lower:
- We have decreased the likelihood of encountering an ebook error by approximately 12 times (within a 30 day period). Here is the breakdown of the data:
- Before our changes, we recorded a total of about 46,000 errors per 71,613 successful exports within a 30 day period. We estimated this figure from the 22,982 errors found in a 14-day period, which roughly doubles to about 46,000 errors in a 30 day period (source). This means that, for every 1.5 successful exports, there was one exception thrown.
- After our changes, we recorded a total of 63,921 exception errors per 1,168,667 successful exports within a 30 day period (source). This means that, for every 18 successful ebook exports, there was 1 exception thrown. This is a 12 times improvement over the previous ratio. This is based on the data recorded in a 30 day period after our changes.
- Note: There isn’t a 1:1 relationship between exports and exceptions. In other words, one unsuccessful export can generate multiple exception errors.
- We expect that the error rate will drop even lower very soon:
- In our new data, the most common error type is DriverException, which accounts for 46% of all exception errors. This error type does not appear in our old data, because it was introduced by our replicas work.
- There is already a fix for DriverException errors that will be deployed soon. Once this fix is released, the total error rate should be even lower.
- We have added a new option for problematic exports: download without credits.
- We determined that one common cause of slow or problematic book downloads was the credits displayed in the “About” section. The credits can take a lot of time to generate and may cause ebook export issues. For this reason, you can now choose to “Exclude editor credits” in the Wikisource Export page.
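The ratios quoted in the list above can be re-derived from the raw counts. The calculation below is a sketch that uses only the figures stated in this update; doubling the 14-day error count to approximate 30 days follows the text's own estimate, and the variable names are ours.

```python
# Successful exports (after, before) per window length in days, from the update.
exports = {
    30: (1_168_667, 71_613),
    60: (1_513_486, 160_433),
    90: (1_628_523, 249_924),
}

# Growth factor in successful exports for each window.
growth = {days: after / before for days, (after, before) in exports.items()}
for days, factor in growth.items():
    print(f"{days}-day window: {factor:.1f}x as many successful exports")

# Exception counts. The "before" figure doubles a 14-day count, as in the text.
errors_before = 22_982 * 2
errors_after = 63_921

exports_per_error_before = exports[30][1] / errors_before
exports_per_error_after = exports[30][0] / errors_after
improvement = exports_per_error_after / exports_per_error_before

print(f"before: one exception per {exports_per_error_before:.1f} exports")
print(f"after:  one exception per {exports_per_error_after:.1f} exports")
print(f"improvement: about {improvement:.0f}x")
```

Because one failed export can throw several exceptions, these are ratios of exceptions to successful exports rather than true failure rates.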
Language Support
The original wish focused on WS Export reliability. However, as we consulted with Wikisource communities, we learned about serious language support issues. We made it a project priority to fix these issues. We believe in the power of Wikisource as a global, equitable tool of free knowledge, so it’s absolutely crucial that it’s inclusive of all languages. Here are the results:
- We upgraded font support in ebook exports: We fixed the issue that previously displayed boxes rather than characters in ebook exports, especially for non-Latin script languages. Now, everyone can download books in their chosen language!
- The Wikisource Export page has been internationalized: The page used to be in English only. We internationalized it so that it would be available for translation, and it has already been translated into many languages.
- Communities can pick their default fonts for the download button: When we added the new download button, we heard from community members that they wanted the ability to choose the default font that was most suitable for them. We understood the importance of this, so we made this possible.
User Experience
As we wrote above, the original wish focused on reliability issues. However, as we consulted with community members and conducted usability tests, we learned that Wikisource Export needed user experience improvements as well. This way, it could be used (and enjoyed!) by a larger, more diverse group of people. For this reason, we issued a series of improvements. Here are the results:
- There is now a user-friendly “Download” button: We simplified the user experience by providing a user-friendly “Download” button. We drew from community feedback and usability testing to develop the feature. We also incorporated updates to the tool after sharing the first iteration with community members, such as adding a link to “Other formats” on the WS Export page (thanks for the tip, InductiveLoad!).
- We simplified the language code process: In the old WS Export page, users needed to manually enter in the language code, which was confusing for newcomers. With our changes, no language code knowledge is required! We provide all of the available languages in the dropdown.
- We improved support when errors occur: We have implemented greater support for users who encounter ebook export errors. Now, when users encounter errors, they will see suggestions for next steps displayed on the Wikisource Export page. We also worked to ensure that the error messaging is appropriate for the type of error that has occurred.
- The Wikisource Export page has been revamped: The older version of the Wikisource Export page was in need of a visual make-over. For this reason, our designer provided a series of recommendations to improve readability, usability, and mobile accessibility. The end result is a much cleaner design, and we hope you love it as much as we do!
Overall, this project is now done. We thank you all for being such fantastic ambassadors and advocates for Wikisource, and we learned so much from all of you! If you have any questions or comments, we invite you to share them on the Talk page. We sincerely hope that we helped improve Wikisource Export, which is at the heart of the Wikisource experience. Up next, we’ll be working on another Wikisource project: OCR Improvements (which we encourage you to check out, if you haven’t already). Thank you everyone, and we look forward to further fruitful collaboration!
March 8, 2021: Updates on recent work
Hello, everyone! We are excited to share with you another update on our recent work on the project. Also, apologies for the delay since our last update! We have been very busy over the past few months with the holiday season, the 2021 Community Wishlist Survey, launching the Wikisource OCR Improvements project, and additional team work. Please check out our update below and share your feedback on the project talk page. Thank you in advance!
Recently completed work
- Implement a new download button: Now when you go to a book on Wikisource, you will see a blue button, labeled “Download.” When you click on the button, a pop-up will appear that enables you to download books in PDF, EPUB, and MOBI formats. You will see information below the file format name, which lets you know which file format is appropriate for your device. We decided to do this work after collecting feedback from the user talk page and on usertesting.com, which confirmed that many people found the current download experience not very user-friendly. This new download button is meant to increase the accessibility of ebook exports on Wikisource.
- Note: We implemented a simplified version of the initial mockups. Some parts of the original proposal turned out to be too big and complex, so we opted for a trimmed down version, which you can now see on the wikis.
- Internationalize the WS Export page: The Wikisource Export page used to be in English only, which was obviously a problem. With this change, the page can now be translated into many different languages. Translations have already begun. If you don’t see it translated into your wiki’s language, we invite you to help the translation effort on translatewiki!
- Replace ElectronPDF with WSExport PDF support: Some wikis previously used ElectronPDF for PDF downloads in their sidebar links. ElectronPDF does not work very well, especially for Wikisource, which led to many people having frustrating download experiences. We replaced ElectronPDF links with WSExport links, so now all wikis can download PDFs via the sidebar with a more reliable service.
- Migrate WSExport Gadget to the Wikisource Extension: In the process of consulting with Wikisource users on the project talk page, we learned that some users did not have the ability to click download links (such as "Download as MOBI") via the side panel and other places. This was the case on Czech Wikisource, for example, because these wikis did not have the WSExport gadget enabled. To fix this issue, we migrated the gadget over to the Wikisource extension, so that all users can now have the same download links available.
- Cache all API Requests: With this work, we have added another way to improve efficiency of ebook exports. Now, if someone downloads Book A and then someone else wants to download Book A shortly afterward, the generated ebook will already be cached. This means it doesn’t need to be newly regenerated for each request, and the download time will be much quicker the second time around.
- Migrate API to Parsoid API: This work was recommended by Tpt, and we agreed with the recommendation after conducting our own investigation. With this work, we now have better long-term support of formatted text on Wikisource exports. Also, this work was inevitable, since the Parsoid API will eventually replace MediaWiki's current native parser. Now that this work is done, we can identify bugs and issues early on, which we can direct to the Parser team. Please note that we are aware that some bugs came out of this work, but since the migration was scheduled to occur either way, we think it’s best to identify the bugs early on.
- Reinstate OPDS Support: With OPDS support now reinstated, you can access up-to-date catalogs of Wikisource books that are suitable for download on your device. It should be noted that OPDS export can be set up to work with any category on a wiki. It produces a daily updated file that can be used in software like Calibre, FBReader, and some e-readers to more quickly and easily browse and download available books. The existing categories are just the first ones; others can be made, even multiple per wiki (e.g. "Literature for export" or something).
- Investigation on how to improve error messaging: We know that many people feel frustrated when there are ebook export errors, since there is typically minimal information or support. The purpose of this investigation was to determine what potential improvements we could implement from a technical perspective. Now that this work is complete, we are looking into implementing improved support (see next section for details).
- Various bug fixes: In software development, whenever new work is done, new bugs arise. As a team, we have analyzed and fixed many bugs, such as RTL language support issues, internal links issues, and other bugs. However, as mentioned above, some bugs may be handled by other teams in the future, such as Parsoid-related bugs (which may go to the Parser team).
What's next
We are in the last stages of this project. Here’s the remaining work:
- Improve language selection on the Wikisource Export page:
- STATUS: In development. Engineers already have work they are reviewing.
- SUMMARY: Right now, users need to know the language code of the book (such as “en” for English or “fr” for French) in order to export it on the Wikisource Export page. This is not user-friendly. For this reason, we are implementing a selector, so you can pick the appropriate language from a dropdown.
- Allow communities to choose appropriate fonts for download:
- STATUS: In development. Engineers already have work they are reviewing.
- SUMMARY: Wikisource users want a way to select which font is best for their language when downloading books. So, for example, if I am on Hindi Wikisource and I click the blue download button and then choose to download a book, the book should be in the font that the community decided is best for the Hindi language. We are now working to make this possible. This work is already in development.
- Remove dependency on credit generation:
- STATUS: In development. Engineers already have work they are reviewing.
- SUMMARY: When this work is complete, wikis will be able to download books without editor credits, if they want. We decided to do this work because credit generation is often a very intensive process. It can significantly slow down ebook exports. For this reason, if a user is experiencing difficulties downloading a book, they may want to try downloading the book without the credits. We wanted to implement the technical support to make this possible, either for users now or in the future.
- Create option to disable credits:
- STATUS: Ready for development and in engineer's backlog. Will be worked on very soon.
- SUMMARY: Once we have removed the dependency on credit generation, we can create the option to allow ebook exports without credits on the Wikisource Export page. This could be useful in cases when books are particularly difficult to download due to lag time and errors. We could also create a separate page with a list of the book credits, so that even users who choose to download books without the credits can view them within Wikisource Export.
- Improve messaging about errors:
- STATUS: In planning stages.
- SUMMARY: Our investigation found that it would be too much work to implement specific messaging about errors within the pop-up. For this reason, we have decided to improve the error messaging on the Wikisource Export page. The basic idea is that, if you encounter an error via the download button, you will be redirected to the Wikisource Export page. There, you will be told that there was an error and be given proposed next steps. While this may not generate a successful ebook export for all use cases, it will provide more helpful messaging and support for users than the current default behavior.
Open Questions
- What do you think of the new "Download" pop-up? Do you find it easy to use?
- What do you think of our additional work to improve reliability? Are you still seeing improvements in the speed and performance of WS Export?
- What do you think of our remaining work to disable credits as an option?
- What do you think of our remaining work to improve messaging and support when there are download errors?
- Is there anything else you would like to add?
Please share your feedback on the project talk page!
November 18, 2020: Updates & Request for Feedback
Hello, Wikisourcers! We are pleased to share our November update. In this update, we’ll focus on the work we have done so far to improve the ebook export experience. We’ll also share our plans for what we hope to do next and how we plan to do it. We would like to thank everyone for the feedback so far, and we look forward to collecting feedback on this next stage of the project!
Language support improvements
You can now read Wikisource ebook exports that were previously unreadable. For example, issues such as this and this are now solved, due to our changes! Here’s how we did this: After investigating the issue, we realized that WS-Export only supported 4 main fonts. This meant that fonts required to properly render many scripts were not available.
For this reason, we upgraded font support, so that all fonts available in Debian would be available for WS-Export. The end result is that scripts in ebook exports are now largely supported, due to this change. One request: Now that we have made this change, please test out this issue for us. Do you think it is largely resolved? Do you still see boxes (instead of text) popping up anywhere? Please let us know, since we really want to fix this issue. Thanks in advance!
Reliability improvements
#1. Investigation findings
We have conducted four separate investigations to identify how we can improve ebook export reliability. These investigations were as follows: Parsoid HTML for WS-Export, Cache generation for ebooks, Preventing automated book downloads, and Implementing a job queue system. In each case, an engineer was assigned to deeply explore the proposal in question, determine if and how we could make such changes, and what level of improvement we could expect from such changes.
Then, we reviewed the findings as a team to determine next steps. From these investigations, we decided that two proposals should be focused on first: Parsoid HTML and Caching ebooks. In both cases, we felt that the proposed changes would improve reliability and that they were within scope for the team. Work on both of these projects has been launched, which we’ll explain below.
#2. Caching API requests
This work came out of one of our investigations, as described above. We have made great progress, and it is almost complete. When the work has been deployed, we will cache all API requests when exporting Wikisource ebooks. This means that, if someone downloads Book A and then someone else wants to download Book A soon afterward, the ebook will be generated much faster.
This could be helpful in a variety of cases, such as when many people download featured books of the month, when many people download books listed on wiki pages (such as those featured on the homepage on Bengali Wikisource), or when people sequentially download different formats of the same book. We believe that this work can help improve reliability, but we’ll need to conduct some analysis to determine the impact. We’ll share more on the release and impact in our next update.
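The idea of caching API requests can be sketched as a small wrapper that remembers each response for a while and reuses it. This is a minimal illustration under stated assumptions, not WS-Export's implementation: the class `CachedApiClient`, the in-memory dictionary, the one-hour expiry, and the example.org URL are all invented for the example, since the update does not say how or for how long responses are stored.

```python
import time


class CachedApiClient:
    """Wraps a fetch function and reuses recent responses per URL."""

    def __init__(self, fetch, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch        # callable that performs the real request
        self._ttl = ttl_seconds    # assumed expiry; the source gives no figure
        self._clock = clock
        self._store = {}           # url -> (time stored, response)

    def get(self, url):
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # cache hit: skip the request entirely
        response = self._fetch(url)
        self._store[url] = (now, response)
        return response


fetch_log = []


def fake_fetch(url):
    """Stand-in for a real HTTP call, so the sketch runs offline."""
    fetch_log.append(url)
    return f"<html>content of {url}</html>"


client = CachedApiClient(fake_fetch)


def export_book(page, fmt):
    # Both formats need the same page content, so they share one request.
    html = client.get(f"https://example.org/api?page={page}")
    return f"{fmt}:{len(html)} bytes of HTML converted"


epub = export_book("Book_A", "epub")
mobi = export_book("Book_A", "mobi")
print(len(fetch_log), "request(s) made for two exports")
```

Caching at the request level is what lets sequential downloads of different formats of the same book benefit, since the page content only has to be fetched once.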
#3. Integrate Symfony 5 with WS-Export
This work is now complete. Due to this work, we are now able to implement improvements to WS-Export, such as the caching work, on a much faster timeline. Furthermore, Symfony 5 is the same framework we use for other tools, so ongoing maintenance will be easier and quicker in the future. We can also now make use of Symfony components that are battle-tested and require very little work to enable, such as ErrorHandler, Cache, Console, DependencyInjection, DotEnv, and more. Overall, we are modernizing the app and making it easier on ourselves and volunteers in the long run.
#4. Migrate API to Parsoid API
This work is in progress and almost complete. Once complete, this work will help us support proper formatting, such as in footnotes and mathematical equations. It was first recommended to us by Tpt, and our investigations concluded that we should do the work. Here are the technical details: WS-Export was using the MediaWiki parser HTML output using ?action=render to generate its ePubs.
However, Parsoid HTML became available, and it provides much richer data. Additionally, the Parsoid API will eventually replace MediaWiki's current native parser. We want WS-Export to be up-to-date with the latest parser sooner rather than later. Once this work is complete, HTML output will hopefully be simplified, which will make it easier for us to support formatting of text.
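To make the two HTML sources concrete, the sketch below builds a request URL for each. The `action=render` form is the one named in the text. The REST path used for Parsoid HTML is one way such output has been exposed on Wikimedia wikis, and whether WS-Export calls this exact endpoint is an assumption; both helper names are invented for the example.

```python
from urllib.parse import quote, urlencode


def legacy_render_url(lang, title):
    """URL for the older parser output requested with action=render."""
    query = urlencode({"title": title.replace(" ", "_"), "action": "render"})
    return f"https://{lang}.wikisource.org/w/index.php?{query}"


def parsoid_html_url(lang, title):
    """URL for Parsoid HTML via a REST-style path (assumed endpoint)."""
    encoded_title = quote(title.replace(" ", "_"), safe="")
    return f"https://{lang}.wikisource.org/api/rest_v1/page/html/{encoded_title}"


old_url = legacy_render_url("en", "The Jungle")
new_url = parsoid_html_url("en", "The Jungle")
print(old_url)
print(new_url)
```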
User experience improvements: feedback requested
As we shared in our last status update, we repeatedly heard from people that the download experience is confusing, inconsistent, and hard to understand. We agree, and we want to improve the user experience. For this reason, we conducted an investigation to determine how we could improve the user experience, which included team discussions and usability tests. From this investigation, we came up with the mockups (as shown below) and ideas for how we can improve the experience. We encourage you to check them out and share your feedback on the talk page.
Proposed changes
General proposal: We propose to add or replace (depending on the wiki) the top download links on the page with a simple “Download” button. When the user clicks on the download button, a window will open up, which will list the download options for the book (such as, EPUB, MOBI, and PDF) with download icon links. The user will also see information about which file format is recommended for different device types (since we found in our user research that many users don’t know which file format to pick). Once they pick a file format, the download process will begin, which will be indicated by some sort of “in progress” indicator. When the download is completed, the user will see a status indicator to display that it is complete. If there is an error in downloading the book, the user will see an error status indicator.
Potential enhancement for the future -- Auto-download: Once we have implemented the basic behavior, we may consider creating an auto-download function. Under this scenario, if the user has never downloaded a book before, they will manually pick the file format. After that, when the user clicks “Download,” the download will be automatically triggered in the file format that the user last selected. You can see an example of the user flow below.
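The auto-download behavior described above amounts to remembering the last format a user picked. The sketch below models that flow in a few lines; the class `DownloadPreferences` and its method names are hypothetical, and where the preference would really be stored (for example, in the browser) is not specified in the proposal.

```python
class DownloadPreferences:
    """Remembers the file format the user chose most recently."""

    def __init__(self):
        self.last_format = None  # nothing chosen before the first download

    def choose_format(self, picked=None):
        """Return the format to download, updating the remembered choice."""
        if picked is not None:
            self.last_format = picked   # a manual pick always wins
            return picked
        if self.last_format is None:
            # First-time users have no stored choice and must pick manually.
            raise ValueError("no previous format; ask the user to pick one")
        return self.last_format         # auto-download in the last format


prefs = DownloadPreferences()
first = prefs.choose_format("epub")   # first download: manual pick
second = prefs.choose_format()        # later click: reuses "epub"
third = prefs.choose_format("pdf")    # user switches format
fourth = prefs.choose_format()        # now reuses "pdf"
print(first, second, third, fourth)
```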
Usability tests
After we developed the mockups, we conducted some usability tests (with readers who did not know about Wikisource) to see how people responded to the mockups. We'll be updating the mockups based on our findings from the tests, as well as the feedback that all of you provided on the talk page.
Open questions
- What do you think of our recent font support work? Does the issue of boxes appearing instead of text seem to be largely resolved?
- What do you think of our recent and upcoming reliability work? Do you have any thoughts, concerns, or suggestions to share?
- What do you think of our proposed improvement to the download user experience overall? Do you like the general idea and user flow?
- Do you usually download the same file format (e.g., PDF, MOBI, EPUB, etc) every time you download a book, or do you often pick a different format?
- Is there anything else you would like to share?
Please share your feedback on the project talk page!
October 23, 2020: Engineering work has begun
The engineers have begun taking on work related to improved reliability of the WS-Export tool and improved font support. This work is based off the proposals that we shared on August 11th, as well as the feedback from that update. We are also wrapping up our first stage of research related to improving the user experience, which we'll be sharing soon.
August 11, 2020: Early findings
Hello, everyone! We are very excited to share our first update on the project. We want to thank everyone who has shared feedback on the project talk page, GitHub, or Phabricator so far. Your insights have been critical, and we deeply appreciate them. Once you have read our August update, we invite you to share your feedback on the project talk page.
Lessons from the consultation (so far)
On the project talk page, we discussed with some people how to think about the project overall–in other words, what sort of principles should we be internalizing and following as we begin this work? Here’s what we have found so far:
- We should be continually mindful of both contributors and visitors. These groups may have some overlapping needs, but there may also be some important differences for us to investigate, explore, and address.
- We should think about user experience improvements (rather than only technical improvements). The process to export ebooks is currently not intuitive for many users. If we can improve this experience, we can potentially retain and nurture a larger base of readers and editors.
- We should try to investigate, document, and share best practices for Wikisource. As we have consulted with people, we have learned that some perceived errors are actually not technically “bugs.” Instead, they are rooted in confusion over formatting and best practices. As such, they can potentially be fixed by the community. This is great news, but more people need visibility into these best practices. For this reason, we plan to investigate best practices, and we’ll share our findings with the community.
- We won’t be able to fix everything (unfortunately!), but we’ll document all the issues we encounter. For this project, we’ll be focusing on improving core accessibility (i.e., WSExport functionality & reliability) and core readability (i.e., people’s ability to read exported books in their language of choice). If there are issues that don’t fall into these two categories, we’ll still document them in Phabricator, but we may not be able to fix them. It is our hope that other people may be able to fix the issues that we can’t get to.
Work we have done so far
We have just launched the project, so we haven’t done much work yet. However, we have done some work, which we would like to share now:
- We moved WSExport to a Virtual Private Server (VPS): Before this work, the memory- and CPU-intensive processes of WSExport were too much for the resources on Toolforge. For this reason, we moved the tool to a VPS instance, so that we could improve overall performance.
- We upgraded Calibre on WSExport VPSs: The VPS was on Calibre 3.48.0, but the latest version was 4.10.1. For this reason, we performed an upgrade, which helped fix various bugs related to generating PDFs. We will also continue with upgrades when new versions are available.
- We conducted a technical analysis and font rendering investigation. Both of these produced lots of interesting proposals (some of which we’ll share in the section below). In addition, we were able to identify four common sources of formatting issues in ebook exports: 1) the original HTML of the wiki, 2) the process of converting the wiki HTML into an ePUB XHTML file, 3) the secondary output formats (such as PDFs, which are generated by Calibre), and 4) the rendering of ePUB files by an eReader. While we can’t do much about the last two (since they involve external software/tools), we can try to improve the situation with the first two.
- We are currently collecting baseline data on issues related to uptime and errors associated with the WSExport tool, among other data points. This will enable us to compare the original data (i.e., at the beginning of the project) with the data later on (i.e., after changes have been made at the end of the project). This way, we can measure our impact in a meaningful way.
- We have begun to organize existing Wikisource tickets. We’re currently in the process of migrating tickets from GitHub to Phabricator, so they can be consolidated in one place. Once the migration is complete, we propose grouping these tickets into categories, so we have a sense of the general “buckets” of work that we can do. In the future, we ask that all new issues be reported on Phabricator. If you would like help with this, you can ask on the project talk page.
- We have begun to investigate how we can improve the user experience for all Wikisource users. This investigation is currently in development, and it will be the primary focus of our next status update (which is tentatively scheduled for September).
Potential next steps
General reliability
- Investigate caching generated ebooks: With this work, we propose to cache the files that are produced, so they do not need to be freshly generated whenever someone wants to download them. This would potentially speed up the process of downloading books. We would need to investigate this proposal first, so we could determine the general impact and scope of the work.
- Investigate a job queue for more efficient ebook generation: With this work, we propose implementing a queue-running system. This would potentially speed up the current download process and give users more information on the download status of a book. One way it could work is that a user would submit a request for an ebook, which would add it to the queue. The queue would first generate the ePUB, which would then be immediately available for download, and then it would generate the derivative formats (such as PDF) and make those available when done. We would need to investigate this proposal first, so we could determine the general impact and scope of the work.
- Investigate how to prevent incomplete book downloads: The purpose of this work would be, if possible, to reduce the problem of downloading only a section of a book (when the user expects to download the whole book). We could possibly do this by using subpages from all pages (rather than just from ws-summary) when downloading books. We could also potentially follow subpage redirects, when they exist. We would need to investigate this proposal first, so we could determine the general impact and scope of the work.
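As a rough illustration of how the caching and job queue proposals could fit together, here is a minimal Python sketch. It is not the WSExport implementation: the class `ExportQueue`, the helper functions, the list of derivative formats, and the idea of keying the cache on a page revision number are all assumptions made for this example, and the "converters" only return labelled bytes.

```python
from collections import deque


# Hypothetical stand-ins for the real converters (assumption: they only
# return labelled bytes so the example can run without any tooling).
def build_epub(title: str) -> bytes:
    return f"EPUB:{title}".encode()


def convert_from_epub(epub: bytes, fmt: str) -> bytes:
    return fmt.upper().encode() + b"<-" + epub


class ExportQueue:
    """Toy job queue with a result cache (illustrative only)."""

    def __init__(self, derivative_formats=("pdf", "mobi")):
        self.derivative_formats = derivative_formats
        self.jobs = deque()  # pending (title, revision, fmt) jobs, FIFO
        self.cache = {}      # (title, revision, fmt) -> file bytes
        self.generated = 0   # number of files actually built

    def request(self, title: str, revision: int) -> None:
        """Queue the ePUB first, then its derivatives, skipping cached files."""
        for fmt in ("epub", *self.derivative_formats):
            key = (title, revision, fmt)
            if key not in self.cache and key not in self.jobs:
                self.jobs.append(key)

    def run_next(self) -> None:
        """Process one job; the finished file is available right away."""
        key = self.jobs.popleft()
        title, revision, fmt = key
        if fmt == "epub":
            data = build_epub(title)
        else:
            # Derivatives are built from the already cached ePUB.
            data = convert_from_epub(self.cache[(title, revision, "epub")], fmt)
        self.cache[key] = data
        self.generated += 1

    def status(self, title: str, revision: int) -> dict:
        """Report which formats are ready for download."""
        return {fmt: (title, revision, fmt) in self.cache
                for fmt in ("epub", *self.derivative_formats)}


queue = ExportQueue()
queue.request("Example Book", revision=1)
queue.run_next()                          # only the ePUB has been built so far
print(queue.status("Example Book", 1))
while queue.jobs:                         # finish the derivative formats
    queue.run_next()
queue.request("Example Book", revision=1) # everything is cached: nothing queued
print(len(queue.jobs), queue.generated)
```

Keying the cache on a revision is one possible way to avoid serving a stale file after a page is edited; whether that is the right invalidation rule is exactly the kind of question the investigation would need to answer.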
Content rendering
- Through consulting on the talk page, we have learned that we should also be looking into improving support for mathematical equations. While we covered font rendering issues in our original investigation, we didn’t talk about numbers and equations. We’ll look into mathematical equations, as well.
- We have learned through the font rendering investigation that we can switch to a new system of fonts. This would enable us to use fonts from a host system (Debian), which has a much larger library of fonts than our own library. For example, this work would give us access to multiple Kannada fonts (while our current font set offers no specific Kannada support).
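To make the host-font idea more concrete, here is a small Python sketch of how a tool might ask the host system which installed font families cover a given language. It assumes the host uses fontconfig and that its `fc-list` command is available; the function names are hypothetical, and the sample text is illustrative rather than captured from a real server.

```python
import subprocess


def parse_families(fc_list_output: str) -> list[str]:
    """Return sorted, unique family names from `fc-list ... family` output."""
    families = set()
    for line in fc_list_output.splitlines():
        # fc-list may print several comma-separated names for one font file.
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name)
    return sorted(families)


def fonts_for_language(lang_code: str) -> list[str]:
    """Ask fontconfig which installed families cover a language (e.g. 'kn')."""
    result = subprocess.run(
        ["fc-list", f":lang={lang_code}", "family"],
        capture_output=True, text=True, check=True,
    )
    return parse_families(result.stdout)


# Illustrative sample text only (assumption, not real server output).
sample = "Noto Sans Kannada\nLohit Kannada\nNoto Sans Kannada,Noto Sans Kannada Bold\n"
print(parse_families(sample))
```

On a host with fontconfig, `fonts_for_language("kn")` would return whatever Kannada-capable families are installed there; the result depends entirely on which font packages the system has.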
Open Questions
We would now love to read your feedback, especially in response to the questions below. Remember that we’ll have a few more rounds of collecting feedback, so we’ll be engaging with you again in the future as well. You can add your feedback on the project talk page. Thank you in advance!
- What are your general thoughts about the guiding principles that we have learned from the consultation so far (i.e., “Lessons from the consultation”)? Is there anything that you think we should add or change?
- Is there anything you would like to share about the work we have done so far (i.e., VPS work, Calibre upgrade, various investigations, and the consolidation of tickets)? We’re open to any thoughts or suggestions!
- What do you think of the proposal to investigate caching generated ebooks? Would this be useful and high-priority, in your view? Do you have any concerns?
- What do you think of the proposal to investigate a job queue for more efficient ebook generation? Would this be useful and high-priority, in your view? Do you have any concerns?
- What do you think of the proposal to investigate how to prevent incomplete book downloads? Would this be useful and high-priority, in your view? Do you have any concerns?
- What do you think of the proposal to switch to a new system of fonts? Would this be useful and high-priority, in your view? Do you have any concerns?
- What work or investigations would you like to see that are *not* being addressed, or are being addressed in a different way than you would expect? In other words, what do you think we’re overlooking, if anything?
- Anything else you would like to add?