Grants:APG/Proposals/2017-2018 round 2/Wikimedia Indonesia/Proposal form/Content Creation Program

Wikimedia projects still lack Indonesian local content, mainly because Indonesia has more than 300 ethnic groups with more than 700 living languages, yet only 10 of those languages have their own Wikipedia projects: Acehnese, Minangkabau, Malay, Indonesian, Javanese, Javanese Banyumasan, Sundanese, Banjar, Buginese, and Tetum. Since last year, while still focusing on growing the Javanese and Sundanese projects, we have also started to support the Minangkabau and Cirebon projects. This is part of content creation and diversification. Since 2016, we have also been improving Wikidata content with more than 90,000 administrative divisions of Indonesia. The data is important because it can serve as a basis for linking data across multiple sources. Wikimedia Indonesia also plans to collaborate with educational institutions to bridge these local content gaps. In the Content Creation Program we focus on three Wikimedia projects: Wikidata, Wikimedia Commons, and Wikisource.

Bridging the Semantic Gap between Wikidata and data.go.id

data.go.id (also called One Data Indonesia) is an official Indonesian open data portal. Statistically, data.go.id hosts around 2,600 datasets, while Wikidata describes over 45 million entities. Even though Wikidata and data.go.id share the common goal of publishing data, there is still a wide gap in how the data provided by the two parties can be linked to each other. This interoperability issue hinders not only data consumption but also data publishing itself. The project is supported by Adila A. Krisnadhi, Ph.D. and Fariz Darari, Ph.D., lecturers at the Faculty of Computer Science, Universitas Indonesia, who are also Wikidata contributors and supporters of open knowledge and open data.

The main challenge of the project is to provide a semantic bridge between Wikidata and data.go.id, linking the two data publishers. Concretely, the project task is two-fold:

  1. Data Import: identify what data in data.go.id can potentially be imported to Wikidata, create a framework for such imports, and implement imports as showcases.
  2. Semantic Enrichment: analyze what data in data.go.id can potentially be enriched by linking with Wikidata entities, create a framework for such semantic enrichment, and, as showcases, implement the proposed semantic enrichment over data.go.id datasets and demonstrate its added value (e.g., through the use of an API or SPARQL with Web queries over data.go.id). More precisely, named entity linking is performed between tables (i.e., cell values and column names) in data.go.id and entities (i.e., items and properties) in Wikidata; a small sketch of this linking step follows this list.
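
To make the linking step concrete, the following is a minimal sketch in Python of how a cell value from a data.go.id table could be matched against candidate Wikidata items through the public wbsearchentities API. The CSV file name and the "kabupaten" column are hypothetical examples; the project's actual framework would add ranking and disambiguation on top of this.

    import csv
    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def candidate_items(text, language="id"):
        """Return candidate Wikidata items (QID, label) for one cell value."""
        params = {
            "action": "wbsearchentities",
            "search": text,
            "language": language,
            "format": "json",
        }
        response = requests.get(WIKIDATA_API, params=params, timeout=30)
        response.raise_for_status()
        return [(hit["id"], hit.get("label", "")) for hit in response.json()["search"]]

    # Hypothetical data.go.id CSV export with a "kabupaten" (regency) column.
    with open("madrasah_aliyah.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            print(row["kabupaten"], "->", candidate_items(row["kabupaten"])[:3])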
Users and uses

The beneficiaries of the project outputs are mainly Wikidata and data.go.id users themselves (ranging from application developers to government employees), as well as researchers and academics. Other parties interested in conducting similar projects (e.g., linking data.gov.uk to Wikidata, linking data.gov to Wikidata) may adopt our approaches.

Impact

The most tangible impact would be the creation of links between data in Wikidata and data.go.id through data imports and semantic enrichment: (i) Data publishing: data.go.id can be an excellent data import source for Wikidata, that is, it can increase the amount of (quality) data in Wikidata; and (ii) Data consumption: data.go.id may benefit from the interlinked nature of Wikidata for more fine-grained data search and discovery. This project thus provides value for both Wikidata and data.go.id. Below we describe use cases enabled by the linking results.

Use Case 1: Islamic High Schools in West Java Province

From the dataset of Islamic high schools in data.go.id, the province of each school is not directly known. For example, the screenshot below shows a row in the dataset, which is about As-Solehhiyah. While the address (in ID: alamat) is provided, there is no direct information that Bandung Regency (in ID: Kabupaten Bandung) is located in West Java Province. Consequently, a query asking for all Islamic high schools in West Java Province will miss this row in its result.

As-Solehhiyah Islamic High School in the dataset of Locations of Islamic High Schools in Indonesia in data.go.id

Our approach enables adding background knowledge in the following way: the entities in the address value are extracted, giving (among others) the Wikidata entity of Bandung Regency (Q10332). Now that the value is linked to Wikidata, we can rely on the knowledge already existing in Wikidata: that Bandung Regency is located in West Java Province (Q3724). As a consequence, the query now includes As-Solehhiyah in its result.

Bandung Regency in Wikidata
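
A minimal sketch of the query side, assuming only the link to Q10332 discovered above: the property P131 ("located in the administrative territorial entity") is followed transitively in Wikidata to check that Bandung Regency lies in West Java Province (Q3724), which is exactly the background knowledge the original row lacks.

    import requests

    SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

    # Does Bandung Regency (Q10332) lie, directly or transitively via P131,
    # inside West Java Province (Q3724)?
    query = """
    ASK {
      wd:Q10332 wdt:P131* wd:Q3724 .
    }
    """

    answer = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "data-go-id-linking-sketch/0.1"},
        timeout=60,
    ).json()

    # True means the As-Solehhiyah row can now be kept in the query result.
    print(answer["boolean"])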

Use Case 2: What percentage of a company's mineral production is exported?

Content of the Mineral Exports dataset (ID-EN translations: nama_perusahaan = company name, komoditas = commodity, satuan = unit, tahun = year, nilai = value)

Consider two datasets in data.go.id: first, the Mineral Exports dataset (shown above), and second, the Mineral Production dataset (shown below).

Content of the Mineral Production dataset (ID-EN translations: nama_perusahaan = company name, komoditas = commodity, satuan = unit, tahun = year, nilai = value)

Now consider the question: what percentage of the Aneka Tambang company's nickel ore production is exported? Answering this question requires much manual work: we first have to find which datasets contain Aneka Tambang's mineral production and export figures, and then compute the export percentage. With our approach of enriching texts in data.go.id with Wikidata annotations and lifting the structure of the texts, we obtain a unified view of the data. Aneka Tambang will be annotated with the Wikidata entity Q12472074. Moreover, Aneka Tambang's commodities can be aligned with commodity entities in Wikidata such as nickel (Q744) and gold (Q897). The columns, such as komoditas (in EN: commodity), will be annotated with the Wikidata property P1056. The annotations are then put back into Wikidata. Looking for inter-dataset information like the question above can then be performed with just a single SPARQL with Web query over Wikidata. Having a more structured and automated way of retrieving information also eases the creation of similar queries, say, when we want to know the export percentage of companies other than Aneka Tambang.
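
As an illustration of that unified view, here is a minimal sketch in Python with hypothetical row structures and made-up placeholder figures (not real production data); in the project itself, the equivalent join and division would be expressed as a single query over the annotated data.

    # Rows from the two datasets after annotation: every row carries the Wikidata
    # IDs for the company (Q12472074, Aneka Tambang) and the commodity (Q744,
    # nickel), so the join key is shared. The figures are placeholders only.
    production = [
        {"company": "Q12472074", "commodity": "Q744", "year": 2014, "value": 100_000},
    ]
    exports = [
        {"company": "Q12472074", "commodity": "Q744", "year": 2014, "value": 40_000},
    ]

    def export_percentage(company, commodity, year):
        """Join production and export rows on (company, commodity, year)."""
        key = (company, commodity, year)
        produced = sum(r["value"] for r in production
                       if (r["company"], r["commodity"], r["year"]) == key)
        exported = sum(r["value"] for r in exports
                       if (r["company"], r["commodity"], r["year"]) == key)
        return 100.0 * exported / produced

    print(export_percentage("Q12472074", "Q744", 2014))  # 40.0 with these placeholders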

Another issue that can be solved by leveraging Wikidata is the aliasing issue, that is, when an entity goes by several names. For example, Aneka Tambang is sometimes called Antam. Wikidata stores alias information for entities, as shown in the figure below. Thus, by using Wikidata as an aliasing mediator, whenever the text “Antam” occurs in a dataset in data.go.id, we can infer that it refers to the same entity that also goes by the name “Aneka Tambang”. Now, when looking for datasets about “Aneka Tambang” in data.go.id, the results will also include datasets about “Antam” and Aneka Tambang's other aliases. Hence, the discoverability of data.go.id's datasets is improved, thanks to Wikidata.

Aliases of Aneka Tambang in Wikidata
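
A small sketch of using Wikidata as the aliasing mediator: the labels and aliases of Q12472074 are fetched once from the Wikidata Query Service, and any dataset mention such as “Antam” can then be resolved to the same entity.

    import requests

    # Fetch the Indonesian and English labels and aliases of Aneka Tambang (Q12472074).
    query = """
    SELECT DISTINCT ?name WHERE {
      wd:Q12472074 rdfs:label|skos:altLabel ?name .
      FILTER(LANG(?name) IN ("id", "en"))
    }
    """

    bindings = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "data-go-id-linking-sketch/0.1"},
        timeout=60,
    ).json()["results"]["bindings"]

    names = {b["name"]["value"].lower() for b in bindings}
    # A mention of "Antam" in a data.go.id dataset should resolve to the same entity.
    print("antam" in names)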
Target

We envision that by the end of the project, we will achieve the following outputs:

  • 1 bachelor thesis at the Faculty of Computer Science, Universitas Indonesia (which also serves as the project documentation)
  • 2 international conference/workshop publications
  • 4 instances of data imports from data.go.id into Wikidata, which should create or improve around 20,000 statements (or items)
  • 50 Wikidata-annotated data.go.id datasets, which should create around 1,000 links between data.go.id and Wikidata
  • 1 live system for demonstrating uses (e.g., data visualization, data discovery) of the created links (from data imports and semantic enrichment) between Wikidata and data.go.id
  • 2 Wikidata tutorial (or workshop) sessions for brainstorming and result dissemination (to be aligned with WikiLatih, a Wikimedia Indonesia project that provides training on Wikimedia products), with around 40 expected participants per session, which should create or improve around 100 Wikidata statements or items
  • Intangible output: collaboration between the Faculty of Computer Science, Universitas Indonesia and Wikimedia Indonesia, opening paths for further collaboration in the future. The faculty's research results will be published openly, and publications (or their preprints) will be freely accessible.
Technology Stack

To realize our goal of linking between data.go.id and Wikidata (and making use of the linking results), we rely on Linked Data technologies, standardized by W3C (World Wide Web Consortium), as well as NLP (Natural Language Processing) and text mining. The W3C Linked Data technology stack is composed of: (1) RDF (Resource Description Framework) for data publishing and linking; (2) OWL (Web Ontology Language) for building vocabularies for data; and (3) SPARQL (SPARQL Protocol and RDF Query Language) for querying data. We use these technologies so that our results will be conformant to the existing open standards of linking data on the (Semantic) Web.
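
For instance, a discovered link can be expressed with these standards as follows. This is a minimal sketch using the rdflib library; the URI for the data.go.id row is a hypothetical identifier, not an existing one.

    from rdflib import Graph, Namespace, URIRef

    WD = Namespace("http://www.wikidata.org/entity/")
    WDT = Namespace("http://www.wikidata.org/prop/direct/")

    g = Graph()
    g.bind("wd", WD)
    g.bind("wdt", WDT)

    # Hypothetical URI for the As-Solehhiyah row of the data.go.id dataset.
    school = URIRef("https://data.go.id/dataset/madrasah-aliyah#as-solehhiyah")

    # Link the row to Bandung Regency (Q10332) via P131, so that Wikidata's
    # knowledge that Q10332 lies in West Java Province (Q3724) becomes usable.
    g.add((school, WDT.P131, WD.Q10332))

    print(g.serialize(format="turtle"))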

NLP and text mining techniques are used for processing the content of data.go.id, which, despite its tabular representation, is only semi-structured, since table columns and values are still plain text. In particular, these techniques extract entities and relations that exist (though only implicitly) in the datasets. The extracted entities and relations will then be represented using the Linked Data technologies above and aligned to Wikidata entities and relations.
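
As an illustrative sketch of the extraction step (with a hypothetical address string and a tiny hand-written gazetteer; the project would derive the gazetteer from Wikidata labels and aliases and use proper NLP tooling):

    # Tiny gazetteer mapping lower-cased mention strings to Wikidata QIDs; in the
    # project this would be built from Wikidata labels and aliases themselves.
    gazetteer = {
        "kabupaten bandung": "Q10332",  # Bandung Regency, from Use Case 1
    }

    def extract_entities(text):
        """Return (mention, QID) pairs found in a free-text cell value."""
        lowered = text.lower()
        return [(mention, qid) for mention, qid in gazetteer.items() if mention in lowered]

    # Hypothetical "alamat" (address) cell value.
    alamat = "Jalan Contoh No. 1, Kabupaten Bandung, Jawa Barat"
    print(extract_entities(alamat))  # [('kabupaten bandung', 'Q10332')]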


Wiki Cinta Budaya (Wiki Loves Culture)

Legong, a traditional dance of Bali: an example of the cultural subjects we would like to see documented more in Wikimedia Commons

There are more than 300 ethnic groups in Indonesia with diverse cultures, yet free image documentation of them in Wikimedia Commons is still lacking. Last year, we successfully held Wiki Cinta Alam (Wiki Loves Earth), which collected more than 3,000 pictures from 25 provinces, 400 of which are now used in various Wikimedia projects. Winners from WLE Indonesia also placed 4th and 7th in the international WLE competition.

This project aims to document Indonesian culture through the first edition of the Wiki Loves Culture photo contest. The photo contest will also be a good platform for promoting the Creative Commons licenses.

Users and uses
  • Wikimedia Commons: as the main platform where the pictures will be uploaded
  • Wikipedia Bahasa Indonesia: eligible pictures from the photo contest will be used to illustrate relevant Wikipedia articles
  • Wikipedia and Wikivoyage projects in other languages that have articles about Indonesian culture
Impact
  • Increasing reach: inviting new contributors to Wikimedia Commons
  • Increasing quality: high-quality pictures that were previously unavailable under a free license will enrich the archive of Indonesian culture for wider use.
Target
  • 200 users participating in the contest
  • 100 newly registered users
  • 3,000 images uploaded
  • Photo submissions of Indonesian culture from at least 20 provinces.

Javanese Character Recognition for Preserving Historical Manuscripts

Cover of Serat Babad Surakarta Volume 1

The transition from mechanical printing to electronic information dissemination has triggered a digital renaissance, an era marked by the rebirth of primary sources and historical documents in digital form. The main motivation for digitizing such documents is to preserve their content as well as their existence. In addition, such documents can be passed on to the next generation as a source of reference on the cultural development, traditions, and identity of a nation in a specific period of time.

The process of digitizing historical documents and manuscripts does not stop when they have been scanned and saved in an image format (JPG or PNG). The disadvantages of storing document images lie in their size, which requires a large amount of storage, and in their inflexible access. This problem can be addressed by Optical Character Recognition (OCR), which turns a character image into searchable character text. For this reason, this project focuses on building an OCR engine for recognizing Javanese characters in scanned Javanese manuscripts.

The short-term objective of this project is to develop a software prototype capable of recognizing Javanese characters in optical form and mapping them to their Unicode symbols. In the long term, the project aims to transliterate Javanese characters into the Latin alphabet and to improve the prototype's recognition rate by refining the segmentation and pre-processing stages and finding a model for the post-correction process. This project is supported by Dr. phil. Lucia D. Krisnawati and Aditya W. Mahastama, S.Kom, M.Cs., lecturers in the Informatics Department of Duta Wacana Christian University.
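
A rough sketch (not the project's actual engine) of the intended pipeline, assuming OpenCV for pre-processing and a placeholder glyph classifier; each predicted class is mapped into the Javanese Unicode block (U+A980–U+A9DF).

    import cv2

    def recognize_page(image_path, classify):
        """Return the recognized text of one scanned manuscript page."""
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        # Otsu binarization separates ink from the paper background.
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # Each connected component is treated as one candidate glyph; the real
        # engine would need proper line and character segmentation.
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        glyphs = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: (b[1], b[0]))
        text = []
        for x, y, w, h in glyphs:
            class_index = classify(binary[y:y + h, x:x + w])  # placeholder classifier
            # Map the class index into the Javanese block; the actual mapping
            # would be defined per recognized character.
            text.append(chr(0xA980 + class_index))
        return "".join(text)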

Users and uses

The beneficiaries of the project outputs are mainly the Wikisource projects and their volunteers, as well as researchers and academics. Wikisource contributors will be able to input Javanese characters from the sources effectively and efficiently. Researchers, academics, and Javanese-script enthusiasts will be able to find their target information by entering Javanese characters in Unicode as search input, instead of reading all of the raw material.

Impact

The concrete impact of this project would be a web-based OCR software prototype for recognizing Javanese characters. The software would be integrated with the wiki use case so that its outputs are publicly accessible. Such a use case would open up essential Indonesian, and especially Javanese, historical sources and enable historians and humanities experts to conduct original research based on publicly available primary sources. As the first project on Javanese character recognition, this research would also stimulate further research on the digitization of historical documents in Indonesia. Lastly, this project contributes to increasing the number of accessible Javanese sources, in the form of searchable character text, on Wikimedia projects.

Target

The deliverables include the following:

  • A software prototype of the OCR engine for Javanese characters
  • The embedding of the OCR software prototype into the wiki user interface
  • 2 volumes of OCRed manuscripts under CC licenses, ca. 400 pages, as training data (e.g., Serat Babad Surakarta Volumes 1-2)
  • 2 conference articles disseminated at international conferences
  • 1 bachelor internship program, which also serves as documentation of this project
  • 1 bachelor thesis on this topic in the Informatics Department of Duta Wacana Christian University
  • 2 tutorials on prototype usage, with 10 participants each

Wiki Culture Writing Competition

The Ganesha project team. Ganesha was a social science writing competition on Indonesian Wikipedia, held in collaboration with the Goethe-Institut Indonesien.

As part of our commitment to creating more local content on Indonesian Wikipedia, we will organize a writing competition focusing on Indonesian tangible and intangible cultural heritage.

Users and uses

This project will spearhead the written documentation of Indonesia's many cultural practices, representations, expressions, knowledge, customs, and skills. It will benefit Indonesian Wikipedia's readers, as there is still a lack of content on local knowledge. The competition will make use of the images from Wiki Loves Culture to illustrate the encyclopedic content created by the participants.

Impact
  • Outreach: increasing new contributors to Indonesian Wikipedia
  • Increasing the amount of quality content in Indonesian Wikipedia
  • Increasing the amount of local content in Indonesian Wikipedia
  • Increasing awareness of Wikipedia
Target
  • 200 users registered for the contest
  • 150 newly registered Wikipedians
  • 500 new high-quality articles written