Community Tech/Ebook Export Improvement
The Ebook Export Improvement project aims to improve the experience of exporting books from Wikisource. Under the current system, users struggle with a variety of issues, including reliability, formatting, styles, and user experience. These issues add complexity and frustration to the Wikisource process, and they discourage some users from deeply engaging with Wikisource. This project was the #1 request from the 2020 Community Wishlist Survey.
Overall, Wikisource ebook exports have tremendous potential, but they must be improved in order to serve a wider audience. In the course of this project, we’ll aim to investigate and identify the key issues, collaborate with various Wikisource communities, and implement solutions that further sustain and improve ebook exports. We look forward to community feedback on the Talk page.
Why Export Ebooks
Generally speaking, ebook exports are a core part of the Wikisource experience, and users export ebooks for a variety of reasons. First, they may export ebooks to avoid issues with internet accessibility. With offline books, users can easily read materials on a variety of devices, no matter if their internet connection is slow, intermittent, or unreliable.
Second, users may want to read the book on a device that is optimized for ebooks. For example, devices such as the Kindle or Kobo have a long battery life and allow users to customize the interface, add notes, and look up words. Consequently, many users choose to export an ebook into a compatible file format, which can then be transferred to an eReader device.
Third, users may want to share an ebook with someone in an easily accessible format. For this reason, they may choose to export the ebook in a preferred format, which they can share with the recipient. This process is generally more flexible than sharing a web link to Wikisource.
Fourth, the user may work in a professional setting, such as an educational institution, archive, or museum, and they want to store an offline copy for general reference or educational purposes. The export of an ebook allows them to integrate the content into their own files and workflow, rather than having to consistently access Wikisource.
How Ebook Exports Work
In Wikisource, there are four primary methods to download an ebook: directly via WSExport, via the left side panel, via links on the main page (such as a featured book), or via the links above texts. There is also a fifth method (create a book), which is technically available, but very uncommonly used. We will discuss all of the methods below. However, it is important to understand that not all wikis have the same download options. Some have only one download option, while others have many options.
First, WSExport is the primary tool for exporting ebooks on Wikisource. This tool, originally developed by user Tpt for French Wikisource, permits downloads in a variety of formats, such as EPUB and PDF. To access WSExport, one can navigate to the left side-panel and click on “Choose format,” under the “Download/print” section. Alternatively, users can directly visit the URL: https://wsexport.wmflabs.org/tool/book.php
When using WSExport, the user must specify certain things to generate an ebook export. These include the language code, the title of the page, fonts included (if any), and whether images should be included. The user must manually type in the language code and page name, but they can select the file format and fonts via drop-down. Once the user clicks “Export,” the tool will provide a downloadable version of the page in the user’s specified format. Further documentation on the tool can be found on Wikisource:WSexport.
It should be noted that the “Include fonts” section has different use cases, depending on the language selection. For Latin-scripted languages, the user is generally selecting a font, such as “Free Serif.” However, for Indic languages, the user must often specify an actual script, which is required to properly export the ebook.
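The form fields described above map naturally onto URL query parameters. The following is a minimal sketch of how such a download link could be assembled. The parameter names (`lang`, `page`, `format`, `fonts`, `images`) and the helper `build_export_url` are assumptions made for illustration; consult Wikisource:WSexport for the parameters the tool actually accepts.

```python
from urllib.parse import urlencode

# Base URL quoted in the text above.
WSEXPORT_URL = "https://wsexport.wmflabs.org/tool/book.php"


def build_export_url(lang, page, fmt="epub", fonts=None, images=True):
    """Build a WSExport download URL (parameter names are assumptions)."""
    params = {"lang": lang, "page": page, "format": fmt}
    if fonts:
        # Only sent when the user picked a font or script.
        params["fonts"] = fonts
    params["images"] = "true" if images else "false"
    return WSEXPORT_URL + "?" + urlencode(params)


url = build_export_url("en", "The Jungle", fmt="epub", fonts="Free Serif")
print(url)
```

Building the query string with `urlencode` rather than by hand keeps page titles with spaces or non-Latin characters correctly escaped.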
#2: Export via the left side panel
Second, users can access ebook exports via the left side panel (“Download/print”). With this method, the user can see all file formats available for download. For example, in the screenshot below, the user has the option to download a PDF, EPUB, or MOBI file of The Jungle.
When using the side panel, the default option available is typically only single-part PDFs, which are downloaded via ElectronPDF. If you want to download a multi-part book in the side panel, you need to enable WSExport in Preferences > Gadgets > Interface > “Add a print/export link to download pages as EPUB files using the WSExport tool.” This will add the EPUB and MOBI option to the navigation menu, so that you can directly download multi-part books. In the example below, you can see that the user can download the entire contents, since they have enabled multi-part book downloads.
Third, users can download books via links on the main page. For example, in the screenshot below, you will see that “April’s Featured Article” has a section called “Grab a download!” One can choose from four main file formats, which are specified if the user hovers over each icon. The user can click an icon, which automatically triggers a download that uses the WSExport tool.
Fourth, users can sometimes access download links at the top of text. For example, in the screenshot below, download options are presented for a text in Bengali Wikisource. Like the side panel, the user sees the file formats that are available for download, and the downloads are conducted with the WSExport tool.
#5: “Create a book”
Fifth, you can create a book as an ODT and ZIM file. To do this, go to the “Create a book” link in the side panel, which will redirect you to Special:Book, also known as “Manage your book.” On this page, you can manually specify each page that you want in your book, which you will need to do repeatedly until you have specified all the pages. It should be noted that this process was designed for Wikipedia users who want to create source texts. It is not convenient for Wikisource users, so it is rarely used for Wikisource-related purposes.
The Primary Issues with Current Ebook Exports
There are many issues with the current ebook export process, which we will divide into three categories: reliability, formatting and styles, and user experience.
The WSExport tool is not consistently reliable. Users report frequent downtime and timeout issues, which prevent them from exporting books. Some of these issues have been documented in Phabricator tickets, such as T250614 and T219330#5060262. Improving reliability was the #4 wish in the 2019 Community Wishlist Survey. As a result, the Community Tech team launched the Ebook Export Reliability project, which aimed to improve the export experience. By the completion of the project, WSExport had 99.42% average monthly uptime (recorded on June 20, 2019), and downtime went from 941 minutes total (between May 1-15, 2019) to 179 minutes (between June 1-15, 2019). Further data on Wikisource reliability after the team’s changes can be found in T226136.
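To put the downtime figures above in perspective, they can be converted into uptime percentages. This is a rough sketch that assumes each measurement window covers 15 full days (21,600 minutes); the exact windows used by the team are not stated, and the 99.42% monthly figure was measured differently, so the numbers are not expected to match it.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes, days):
    """Share of the window, in percent, during which the tool was up."""
    total = days * MINUTES_PER_DAY
    return 100.0 * (total - downtime_minutes) / total


before = uptime_percent(941, 15)        # May 1-15, 2019
after = uptime_percent(179, 15)         # June 1-15, 2019
reduction = 100.0 * (941 - 179) / 941   # relative drop in downtime

print(f"{before:.2f}% -> {after:.2f}% (downtime down {reduction:.0f}%)")
# prints "95.64% -> 99.17% (downtime down 81%)"
```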
Despite these improvements, issues persisted for the Wikisource community. Users continued to experience problems, such as those detailed in “Problems detected in epub generated with Wsexport,” compiled by Viticulum on French Wikisource. For example, on September 30, 2019, WSExport was inoperable for about 12 hours. In total, 13 outages were reported, many of them lasting for hours. Furthermore, recent tests conducted by the Community Tech team have identified intermittent problems when trying to download MOBI files. In addition, the timeout message currently appears only on the WSExport page (not when downloading via download links). Overall, this situation can be very frustrating for Wikisource users, who often don’t know how to respond to such issues.
Formatting & Styles
Ebook exports often have formatting and style issues. These issues vary, but they may include: missing or altered text, duplicated text, poor pagination, missing table titles, incorrect capitalization, incorrect border styles, incorrect content alignment, incorrect table alignment, and incorrect table styles. In some cases, the words themselves are altered. These errors can be confusing and concerning to users. They also go against the Wikisource policy of mirroring the source text. Below, we have provided some examples of the issues we’re seeing. This isn’t an exhaustive list, but it can give some idea of common errors.
Example #1: Page split between 2 pages
In the screenshot below, you will see that the content is divided into two pages. However, in the original version in English Wikisource, it was displayed on one page.
Example #2: Fonts not rendered
In the screenshots below, you will see that the files are not properly exported from Tamil Wikisource (bottom left) or Kannada Wikisource (bottom right). Rather, the text displays as rectangles. This is due to the fact that Kannada is not included in the “Include fonts” section, which creates various issues, such as the one below. Meanwhile, Tamil is included in “Include Fonts,” but there are still issues.
Example #3: Consonant conjuncts incorrectly rendered
In the example below, you will see that the text is incorrectly displayed. In the original version in Bengali Wikisource, the user will see “প্রথম.” However, in the ebook export, the word is changed to “পরথম.” This error involves conjunct consonants, in which two or more consonants are combined into a single cluster within a word. In this example, the consonants are separated due to issues with font rendering. This issue occurs for users in many Indic languages.
Example #4: Incorrect text wrap
In the example below, you will see that the text is wrapped around the image. However, in the original version on Bengali Wikisource, the text is displayed below the image with no text-wrap.
Example #5: Content alignment altered
In this example below, you will see that content is aligned to the left in the PDF. However, in the original text from Armenian Wikisource, the content is centered on the page.
Accessibility & User Experience
The ebook export process is not very inviting to newcomers. There are many quirks and exceptions that one must learn. The WSExport tool is not easily discoverable, and it doesn't provide an intuitive user experience. For example, it doesn’t include all scripts in the “Include fonts” section (such as Bengali). Even if a language script is included in “Include fonts,” the export may still experience language errors. Meanwhile, in the sidebar, it’s confusing for new users to determine how to download multi-part ebooks, among other issues. If we want Wikisource to expand, we need the experience to be intuitive and accessible to newcomers. For this reason, the UX considerations involved in the ebook export process may be investigated for improvement as well.
- Have we covered the main reasons why people export ebooks?
- Have we covered the main methods to export ebooks?
- Have we covered the main problems experienced when exporting ebooks?
- Which formatting and style issues are the most common and frustrating, in your opinion?
- Which user experience issues are the most common and frustrating, in your opinion?
- Which problems, overall, do you find the most critical to fix, and why?
- Anything else you would like to add?
We look forward to reading your feedback on the Talk page! Thank you!
November 18, 2020: Updates & Request for Feedback
Hello, Wikisourcers! We are pleased to share our November update. In this update, we’ll focus on the work we have done so far to improve the ebook export experience. We’ll also share our plans for what we hope to do next and how we plan to do it. We would like to thank everyone for the feedback so far, and we look forward to collecting feedback on this next stage of the project!
Language support improvements
You can now read Wikisource ebook exports that were previously unreadable. For example, issues such as this and this are now solved, due to our changes! Here’s how we did this: After investigating the issue, we realized that WS-Export only supported 4 main fonts. This meant that fonts required to properly render many scripts were not available. For this reason, we upgraded font support, so that all fonts available in Debian would be available for WS-Export. The end result is that scripts in ebook exports are now largely supported, due to this change. One request: Now that we have made this change, please test out this issue for us. Do you think it is largely resolved? Do you still see boxes (instead of text) popping up anywhere? Please let us know, since we really want to fix this issue. Thanks in advance!
#1. Investigation findings
We have conducted four separate investigations to identify how we can improve ebook export reliability. These investigations were as follows: Parsoid HTML for WS-Export, Cache generation for ebooks, Preventing automated book downloads, and Implementing a job queue system. In each case, an engineer was assigned to deeply explore the proposal in question, determine if and how we could make such changes, and what level of improvement we could expect from such changes. Then, we reviewed the findings as a team to determine next steps. From these investigations, we decided that two proposals should be focused on first: Parsoid HTML and Caching ebooks. In both cases, we felt that the proposed changes would improve reliability and that they were within scope for the team. Work on both of these projects has been launched, which we’ll explain below.
#2. Caching API requests
This work came out of one of our investigations, as described above. We have made great progress, and it is almost complete. When the work has been deployed, we will cache all API requests when exporting Wikisource ebooks. This means that, if someone downloads Book A and then someone else wants to download Book A soon afterward, the ebook will be generated much faster. This could be helpful in a variety of cases, such as when many people download featured books of the month, when many people download books listed on wiki pages (such as those featured on the homepage on Bengali Wikisource), or when people sequentially download different formats of the same book. We believe that this work can help improve reliability, but we’ll need to conduct some analysis to determine the impact. We’ll share more on the release and impact in our next update.
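The idea behind this caching can be illustrated with a small sketch. It is not the WS-Export implementation; the class `TTLCache`, the stand-in function `fake_api_request`, and the one-hour lifetime are all hypothetical choices made for illustration. The first request for a key pays for the slow call, and later requests within the lifetime reuse the stored result.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (time stored, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]              # fresh entry: skip the slow call
        value = fetch(key)             # miss or stale entry: do the slow call
        self._store[key] = (now, value)
        return value


calls = []


def fake_api_request(key):
    # Hypothetical stand-in for a slow wiki API request.
    calls.append(key)
    return f"content of {key}"


cache = TTLCache(ttl_seconds=3600)
first = cache.get_or_fetch("en:The Jungle", fake_api_request)
second = cache.get_or_fetch("en:The Jungle", fake_api_request)
print(len(calls))  # prints 1: the second request was served from the cache
```

An expiry time matters here because wiki pages change; a cached response that never expired would keep serving outdated text.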
#3. Integrate Symfony 5 with WS-Export
This work is now complete. Due to this work, we are now able to implement improvements to WS-Export, such as the caching work, on a much faster timeline. Furthermore, Symfony 5 is the same framework we use for other tools, so ongoing maintenance will be easier and quicker in the future. We can also now make use of Symfony components that are battle-tested and require very little work to enable, such as ErrorHandler, Cache, Console, DependencyInjection, DotEnv, and more. Overall, we are modernizing the app and making it easier on ourselves and volunteers in the long run.
#4. Migrate API to Parsoid API
This work is in progress and almost complete. Once complete, this work will help us support proper formatting, such as in footnotes and mathematical equations. It was first recommended to us by Tpt, and our investigations concluded that we should do the work. Here are the technical details: WS-Export was generating its ePubs from the MediaWiki parser's HTML output, obtained using ?action=render. However, Parsoid HTML became available, and it provides much richer data. Additionally, the Parsoid API will eventually replace MediaWiki's current native parser. We want WS-Export to be up-to-date with the latest parser sooner rather than later. Once this work is complete, HTML output will hopefully be simplified, which will make it easier for us to support formatting of text.
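The difference between the two HTML sources can be seen in the URLs involved. The sketch below builds both for a given page: the `?action=render` form of `index.php`, and the `page/html` route of the Wikimedia REST API, which serves Parsoid HTML. The helper names are hypothetical, and whether WS-Export calls exactly these endpoints is an assumption; the text above only states that it moves from `?action=render` output to Parsoid HTML.

```python
from urllib.parse import quote, urlencode


def legacy_render_url(lang, title):
    """URL for HTML from the native MediaWiki parser via ?action=render."""
    query = urlencode({"title": title.replace(" ", "_"), "action": "render"})
    return f"https://{lang}.wikisource.org/w/index.php?{query}"


def parsoid_html_url(lang, title):
    """URL for Parsoid HTML via the REST API's page/html route."""
    # safe="" also escapes "/", so subpage titles stay one path segment.
    name = quote(title.replace(" ", "_"), safe="")
    return f"https://{lang}.wikisource.org/api/rest_v1/page/html/{name}"


print(legacy_render_url("en", "The Jungle"))
print(parsoid_html_url("en", "The Jungle"))
```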
User experience improvements: feedback requested
As we shared in our last status update, we repeatedly heard from people that the download experience is confusing, inconsistent, and hard to understand. We agree, and we want to improve the user experience. For this reason, we conducted an investigation to determine how we could improve the user experience, which included team discussions and usability tests. From this investigation, we came up with the mockups (as shown below) and ideas for how we can improve the experience. We encourage you to check them out and share your feedback on the talk page.
General proposal: We propose to add or replace (depending on the wiki) the top download links on the page with a simple “Download” button. When the user clicks on the download button, a window will open up, which will list the download options for the book (such as, EPUB, MOBI, and PDF) with download icon links. The user will also see information about which file format is recommended for different device types (since we found in our user research that many users don’t know which file format to pick). Once they pick a file format, the download process will begin, which will be indicated by some sort of “in progress” indicator. When the download is completed, the user will see a status indicator to display that it is complete. If there is an error in downloading the book, the user will see an error status indicator.
Potential enhancement for the future -- Auto-download: Once we have implemented the basic behavior, we may consider creating an auto-download function. Under this scenario, if the user has never downloaded a book before, they will manually pick the file format. After that, when the user clicks “Download,” the download will be automatically triggered in the file format that the user last selected. You can see an example of the user flow below.
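The auto-download behavior described above amounts to remembering one preference per user. The following sketch models that flow under stated assumptions: the class `DownloadButton`, its `click` method, and the format list are hypothetical names chosen for illustration, not part of any mockup or implementation.

```python
FORMATS = ("epub", "mobi", "pdf")


class DownloadButton:
    """Remembers the last format the reader picked (hypothetical model)."""

    def __init__(self):
        self.last_format = None  # nothing downloaded yet

    def click(self, chosen=None):
        """Return the format to download, or None if the picker must open."""
        if chosen is not None:
            if chosen not in FORMATS:
                raise ValueError(f"unknown format: {chosen}")
            self.last_format = chosen  # an explicit choice updates the default
        return self.last_format


button = DownloadButton()
print(button.click())        # None: first visit, so show the format picker
print(button.click("epub"))  # epub: the reader picks a format manually
print(button.click())        # epub: later clicks reuse the last choice
```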
After we developed the mockups, we conducted some usability tests (with readers who did not know about Wikisource) to see how people responded to the mockups. We'll be updating the mockups based on our findings from the tests, as well as the feedback that all of you provide on the talk page.
- What do you think of our recent font support work? Does the issue of boxes appearing rather than text seem to be largely resolved?
- What do you think of our recent and upcoming reliability work? Do you have any thoughts, concerns, or suggestions to share?
- What do you think of our proposed improvement to the download user experience overall? Do you like the general idea and user flow?
- Do you usually download the same file format (e.g., PDF, MOBI, EPUB, etc) every time you download a book, or do you often pick a different format?
- Is there anything else you would like to share?
October 23, 2020: Engineering work has begun
The engineers have begun taking on work related to improved reliability of the WS-Export tool and improved font support. This work is based off the proposals that we shared on August 11th, as well as the feedback from that update. We are also wrapping up our first stage of research related to improving the user experience, which we'll be sharing soon.
August 11, 2020: Early findings
Hello, everyone! We are very excited to share our first update on the project. We want to thank everyone who has shared feedback on the project talk page, GitHub, or Phabricator so far. Your insights have been critical, and we deeply appreciate them. Once you have read our August update, we invite you to share your feedback on the project talk page.
Lessons from the consultation (so far)
On the project talk page, we discussed with some people how to think about the project overall–in other words, what sort of principles should we be internalizing and following as we begin this work? Here’s what we have found so far:
- We should be continually mindful of both contributors and visitors. These groups may have some overlapping needs, but there may also be some important differences for us to investigate, explore, and address.
- We should think about user experience improvements (rather than only technical improvements). The process to export ebooks is currently not intuitive for many users. If we can improve this experience, we can potentially retain and nurture a larger base of readers and editors.
- We should try to investigate, document, and share best practices for Wikisource. As we have consulted with people, we have learned that some perceived errors are actually not technically “bugs.” Instead, they are rooted in confusion over formatting and best practices. As such, they can potentially be fixed by the community. This is great news, but more people need visibility into these best practices. For this reason, we plan to investigate best practices, and we’ll share our findings with the community.
- We won’t be able to fix everything (unfortunately!), but we’ll document all the issues we encounter. For this project, we’ll be focusing on improving core accessibility (i.e., WSExport functionality & reliability) and core readability (i.e., people’s ability to read exported books in their language of choice). If there are issues that don’t fall into these two categories, we’ll still document the issues in Phabricator–but we may not be able to fix them. It is our hope that other people may be able to fix the issues that we can’t get to in the future.
Work we have done so far
We have just launched the project, so we haven’t done much work yet. However, we have done some work, which we would like to share now:
- We moved WSExport to a Virtual Private Server (VPS): Before this work, the memory and CPU intensive processes of WSExport were too much for the resources on Toolforge. For this reason, we moved the tool to a VPS instance, so that we could improve overall performance.
- We upgraded Calibre on WSExport VPSs: The VPS was on Calibre 3.48.0, but the latest version was 4.10.1. For this reason, we performed an upgrade, which helped fix various bugs related to generating PDFs. We will also continue with upgrades when new versions are available.
- We conducted a technical analysis and font rendering investigation. Both of these produced lots of interesting proposals (some of which we’ll share in the section below). In addition, we were able to identify four common sources of formatting issues for ebook exports: 1) the original HTML of the wiki, 2) the process of converting the wiki HTML into an ePUB XHTML file, 3) the secondary output formats (such as PDFs, which are generated by Calibre), and 4) the rendering of ePUB files by an eReader. While we can’t do much for the last two (since they involve external software/tools), we can try to improve the situation with the first two.
- We are currently collecting baseline data on issues related to uptime and errors associated with the WSExport tool, among other data points. This will enable us to compare the original data (i.e., at the beginning of the project) with the data later on (i.e., after changes have been made at the end of the project). This way, we can measure our impact in a meaningful way.
- We have begun to organize existing Wikisource tickets. We’re currently in the process of migrating tickets from GitHub to Phabricator, so they can be consolidated in one place. Once the migration is complete, we propose that we group these tickets into categories, so we have a sense of the general “buckets” of work that we can do. In the future, we ask that all new issues are reported on Phabricator. If you want assistance in how to do this, you can reach out and ask on the project talk page.
- We have begun to investigate how we can improve the user experience for all Wikisource users. This investigation is currently in development, and it will be the primary focus of our next status update (which is tentatively scheduled for September).
Potential next steps
- Investigate cache generated ebooks: With this work, we propose to cache files that are produced, so they do not need to be freshly generated whenever someone wants to download them. This would potentially speed up the process of downloading books. We would need to investigate this proposal first, so we could determine the general impact and scope of the work.
- Investigate job queue for more efficient ebook generation: With this work, we propose implementing a queue-running system. This process would potentially speed up the current download process and give users more information on the download status of a book. One way it could be done is that a user would submit a request for an ebook, which would add it to the queue. The queue would first generate the ePUB, which would then be immediately available for download, and then it would generate the derivative forms (such as, PDF) and make those available when done. We would need to investigate this proposal first, so we could determine the general impact and scope of the work.
- Investigate how to prevent incomplete book downloads: The purpose of this work would be, if possible, to reduce the problem of downloading only a section of a book (when the user expects to download the whole book). We could possibly do this by using subpages from all pages (rather than just from ws-summary) when downloading books. We could also potentially follow subpage redirects, when they exist. We would need to investigate this proposal first, so we could determine the general impact and scope of the work.
- Through consulting on the talk page, we have learned that we should also be looking into improving support for mathematical equations. While we covered font rendering issues in our original investigation, we didn’t talk about numbers and equations. We’ll look into mathematical equations, as well.
- We have learned through the font rendering investigation that we can switch to a new system of fonts. This would enable us to use fonts from a host system (Debian), which has a much larger library of fonts than our own library. For example, this work would give us access to multiple Kannada fonts (while our current font set offers no specific Kannada support).
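The ordering in the job queue proposal above (ePUB first, derivative formats afterward) can be sketched in a few lines. This is a single-process illustration only, assuming hypothetical stand-ins `generate_epub` and `convert` for the real generation and conversion steps; an actual queue-running system would use separate workers and persistent storage.

```python
from collections import deque


def generate_epub(title):
    # Hypothetical stand-in for the real ePUB build.
    return f"{title}.epub"


def convert(epub_file, fmt):
    # Hypothetical stand-in for a converter such as Calibre.
    return epub_file.rsplit(".", 1)[0] + "." + fmt


def run_queue(requests, derivative_formats=("pdf", "mobi")):
    """Process requests in order; yield each file as soon as it is ready."""
    queue = deque(requests)
    while queue:
        title = queue.popleft()
        epub = generate_epub(title)
        yield title, epub                    # the ePUB is available first
        for fmt in derivative_formats:
            yield title, convert(epub, fmt)  # derivative formats follow


ready = list(run_queue(["Book A", "Book B"]))
print(ready[0])  # ('Book A', 'Book A.epub')
```

Because each file is yielded as soon as it exists, a status display could report progress per format instead of making the user wait for every conversion to finish.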
We would now love to read your feedback, especially in response to the questions below. Remember that we’ll also have a few rounds of collecting feedback, so we’ll be engaging you again in the future as well. You can add your feedback on the project talk page. Thank you in advance!
- What are your general thoughts about the guiding principles that we have learned from the consultation so far (i.e., “Lessons from the consultation”)? Is there anything that you think we should add or change?
- Is there anything you would like to share about the work we have done so far (i.e., VPS work, Calibre upgrade, various investigations, and the consolidation of tickets)? We’re open to any thoughts or suggestions!
- What do you think of the proposal to investigate cache generated ebooks? Would this be useful and high-priority, in your view? Do you have any concerns?
- What do you think of the proposal to investigate job queue for more efficient ebook generation? Would this be useful and high-priority, in your view? Do you have any concerns?
- What do you think of the proposal to investigate how to prevent incomplete book downloads? Would this be useful and high-priority, in your view? Do you have any concerns?
- What do you think of the proposal to switch to a new system of fonts? Would this be useful and high-priority, in your view? Do you have any concerns?
- What work or investigations would you like to see that is *not* being addressed or is being addressed in a different way than you would expect? In other words, what do you think we’re overlooking, if anything?
- Anything else you would like to add?