This is a proposal for a new Wikimedia sister project.
Wikianswers
Status of the proposal
Status: under discussion
Details of the proposal
Project description: Wikianswers would be a large-scale, human-editable, multimodal Q&A system which utilizes one or more large language models and which tightly integrates with Wikipedia, Wikidata, and Commons.
Is it a multilingual wiki? Many language versions
Potential number of languages: Many languages
Technical requirements
New features to require: See below

Introduction

Wikianswers would be a large-scale, human-editable, multimodal Q&A system which utilizes one or more large language models (artificial intelligence) and which tightly integrates with Wikipedia, Wikidata, and Commons.

Integrations

The Wikianswers system would answer questions by drawing content from Wikipedia, Wikidata, and Commons. More advanced techniques include self-ask prompting and self-ask with search, in which large-language-model-based systems decompose complex questions into smaller follow-up questions. In this way, the Wikianswers system would be able to invoke itself recursively. For example, to answer the complex question "Who lived longer, Aristotle, Plato, or Socrates?", the system would ask itself sub-questions to obtain each philosopher's age at death and then integrate the resulting sub-answers into its context to generate its response.
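The recursive decomposition described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `decompose` stands in for an LLM call that proposes follow-up questions, `KNOWN_ANSWERS` stands in for a knowledge source such as Wikidata (the ages are approximate), and the final comparison is hard-coded where a real system would let the model integrate the sub-answers.

```python
from typing import Dict, List

# Approximate ages at death, hard-coded to stand in for a knowledge source.
KNOWN_ANSWERS: Dict[str, int] = {
    "How old was Aristotle when he died?": 62,
    "How old was Plato when he died?": 80,
    "How old was Socrates when he died?": 71,
}


def decompose(question: str) -> List[str]:
    """Hypothetical stand-in for an LLM call that proposes follow-up questions."""
    if question == "Who lived longer, Aristotle, Plato, or Socrates?":
        names = ("Aristotle", "Plato", "Socrates")
        return [f"How old was {name} when he died?" for name in names]
    return []  # simple questions need no decomposition


def answer(question: str) -> str:
    """Answer a question, recursively invoking itself on any sub-questions."""
    sub_questions = decompose(question)
    if not sub_questions:
        return str(KNOWN_ANSWERS[question])
    # Gather the sub-answers into a context, as self-ask prompting would.
    context = {sq: int(answer(sq)) for sq in sub_questions}
    longest = max(context, key=context.get)
    # Extract the name from "How old was <name> when he died?".
    name = longest.split(" ")[3]
    return f"{name} lived longest, to about {context[longest]}."
```

Calling `answer("Who lived longer, Aristotle, Plato, or Socrates?")` triggers three recursive calls, one per philosopher, before the comparison is made.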

Wikipedia

With respect to Wikipedia integration, relevant artificial intelligence topics include retrieval-augmented generation, retrieval-augmented generation with guardrails, and agent-based approaches. In retrieval-augmented generation, questions are encoded into embedding vectors which are utilized to retrieve relevant chunks or excerpts of content using semantic-similarity-based search. The questions and these retrieved context data are then composed together into prompts. The resultant prompts are utilized to obtain answers from LLM systems.
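As a rough sketch of the retrieval-augmented generation steps just described, the following uses a toy bag-of-words vector in place of a learned embedding model and stops at prompt composition rather than calling an LLM. The `TEMPLATE` layout, the function names, and the sample chunks are assumptions for illustration only; the system role text is the example wording used later in this section.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system would use a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical prompt template with slots for the other parts.
TEMPLATE = "{system}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
SYSTEM = ("You are a helpful system which will answer the user's question "
          "using the following information.")


def build_prompt(question: str, chunks: List[str], k: int = 2) -> str:
    """Compose the question and retrieved context into a prompt for an LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return TEMPLATE.format(system=SYSTEM, context=context, question=question)


CHUNKS = [
    "Plato was an ancient Greek philosopher who founded the Academy in Athens.",
    "The Super Bowl is the annual championship game of the National Football League.",
    "Aristotle was a student of Plato and tutor of Alexander the Great.",
]
```

With these sample chunks, `build_prompt("Who was Plato?", CHUNKS)` places the two philosophy chunks in the context and leaves the unrelated one out.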

As presently considered, parts of question-and-answer data which could be human-editable include: (1) templates of the prompts, (2) system roles or tasks, (3) retrieved context data, (4) questions, and (5) answers.

A template is the overall structure of prompts provided to LLMs. It includes natural language and slots where the other parts will be placed. The template should be locked so as to be editable only by administrators. Editing this would invalidate every cached and unlocked answer, meaning that every unlocked answer could be updated, refreshed, or regenerated.

A system role or task might resemble "You are a helpful system which will answer the user's question using the following information". The system role or task should be locked so as to be editable only by administrators. Editing this would invalidate every dependent cached and unlocked answer, meaning that every unlocked answer could be updated, refreshed, or regenerated.

Retrieved context data are chunks or excerpts, e.g., of Wikipedia articles, which help to answer a particular question. Users could edit these chunks or excerpts, and doing so would cascade invalidations to other cached and unlocked answers that depend on the same chunks or excerpts. In terms of user experience, editors could click on a displayed chunk or excerpt to navigate to its content as it occurs in the source page and then edit it there. Updates to an underlying page would update that page's chunks or excerpts and, in turn, the dependent cached and unlocked answers.

Questions would be unusual to edit, except in cases of correcting typographical errors.

Answers result from LLMs processing the prompts. Answers could be edited by users. As described above, answers could subsequently be updated, refreshed, or regenerated by the system. Editors might want to edit an answer and then lock it, or otherwise prevent subsequent revisions by AI subsystems, effectively overriding the artificial intelligence manually.

In conclusion, users would ordinarily want to edit: (1) those chunks or excerpts of content drawn from Wikipedia pages utilized in retrieval-augmented generation, and (2) those answers provided by systems.
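The invalidation and locking behaviour described in this section could be sketched as follows. The class and field names are hypothetical, and marking an answer as `stale` is only one possible way to represent "could be updated, refreshed, or regenerated".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Answer:
    text: str
    chunk_ids: Set[str]    # chunks or excerpts this answer was generated from
    locked: bool = False   # a locked answer is a manual override by an editor
    stale: bool = False    # stale answers are candidates for regeneration


@dataclass
class AnswerCache:
    answers: Dict[str, Answer] = field(default_factory=dict)

    def invalidate_chunk(self, chunk_id: str) -> List[str]:
        """An edited chunk invalidates every unlocked answer that depends on it."""
        hit = []
        for question_id, ans in self.answers.items():
            if chunk_id in ans.chunk_ids and not ans.locked:
                ans.stale = True
                hit.append(question_id)
        return hit

    def invalidate_all(self) -> List[str]:
        """A template or system-role edit invalidates every unlocked answer."""
        hit = []
        for question_id, ans in self.answers.items():
            if not ans.locked:
                ans.stale = True
                hit.append(question_id)
        return hit
```

Locked answers are skipped in both paths, which is what lets an editor's manual revision survive later regeneration.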

Wikidata

With respect to Wikidata integration, relevant artificial intelligence topics include text-to-SQL, text-to-SPARQL, and source-code generation. Natural-language questions can be processed into a form with which to retrieve human-editable procedures useful for answering those questions. These procedures could take the form of structured queries, source code, "wiki functions", or diagrams (e.g., extensible workflow diagrams). When no procedure can be found with which to answer a question, artificial intelligence (e.g., LLMs) could generate a new one. Thereafter, editors with adequate privileges could modify that stored procedure.
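A minimal sketch of the lookup-or-generate flow for stored procedures is given below, assuming procedures are SPARQL strings keyed by a normalized form of the question. `ProcedureStore`, `normalize`, and `fake_generate` are hypothetical names; `fake_generate` stands in for a text-to-SPARQL model and simply fills a fixed template. The template uses the Wikidata properties P569 (date of birth) and P570 (date of death), and the item Q42 (Douglas Adams) is used only as a sample.

```python
import re
from typing import Callable, Dict

# SPARQL template for an item's dates of birth (P569) and death (P570).
SPARQL_LIFESPAN = """
SELECT ?born ?died WHERE {{
  wd:{qid} wdt:P569 ?born ;
           wdt:P570 ?died .
}}
"""


def normalize(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace to form a lookup key."""
    return " ".join(re.sub(r"[^a-z0-9 ]", "", question.lower()).split())


class ProcedureStore:
    """Stores human-editable procedures; generates one only when none is found."""

    def __init__(self) -> None:
        self._procedures: Dict[str, str] = {}
        self.generated = 0  # how many procedures had to be generated

    def get_or_generate(self, question: str, generate: Callable[[str], str]) -> str:
        key = normalize(question)
        if key not in self._procedures:
            self._procedures[key] = generate(question)
            self.generated += 1
        return self._procedures[key]

    def edit(self, question: str, new_procedure: str) -> None:
        """Editors with adequate privileges could replace a stored procedure."""
        self._procedures[normalize(question)] = new_procedure


def fake_generate(question: str) -> str:
    """Hypothetical stand-in for an LLM that writes SPARQL for the question."""
    return SPARQL_LIFESPAN.format(qid="Q42")
```

Keying on exact normalized text is the simplest possible retrieval; a real system would more likely match questions to procedures by semantic similarity.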

Commons

With respect to Commons integration, Wikianswers could create, store, search for, retrieve, and style multimedia resources (e.g., 3D models, animations, audio, charts, diagrams, figures, graphs, images, infographics, maps, mathematics, photographs, tables, and video). Wikianswers would search for existing multimedia resources when generating multimodal responses. If none were found, Wikianswers would create new multimedia resources and store them, with accompanying metadata, in Commons for subsequent reuse.

Large language models can generate source code, e.g., JavaScript and Python, which can be processed to produce multimedia resources. Such source code could accompany resources in Commons as metadata.

With respect to representing multimedia resources, a separation of style from structure, resembling the separation of CSS from HTML, would enhance the customizability and themeability of Wikipedia and other projects utilizing multimedia resources in Commons.

Services

Question-answering

The primary service provided to end-users involves providing AI-generated multimodal answers to natural-language questions, with contents initially drawn from Wikipedia, Wikidata, and Commons, such that end-users could collaboratively modify, correct, and otherwise update these answers.

Training data

User-edited procedures and responses could be subsequently made available to AI systems as training data.

Change propagation

Subscribers might want real-time event streams of changes as platform contents are created, modified, corrected, or otherwise updated by AI systems and end-users.

Enhanced consistency

In theory, as users utilize and curate Wikianswers and as to-be-developed tools process the backing data, inconsistencies across Wikipedia articles could be detected. When an inconsistency is detected, users with any involved page on their watchlist could be notified by alert, notice, or email message.

Administration

Content moderation

Content moderation and anti-vandalism technologies could provide means of placing guardrails on questions and responses.

Content protection

When Wikipedia articles are protected, would responses to natural-language questions involving those entities or topics be similarly protected?

Usage quotas

Architectures could be considered which would allow end-users to ask only a limited number of questions per interval of time.
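One simple way to limit questions per interval of time is a fixed-window counter, sketched below. The class name, the per-user keying, and the injectable `clock` parameter (which makes the behaviour easy to test) are assumptions for illustration; other schemes such as token buckets or sliding windows would serve equally well.

```python
import time
from typing import Callable, Dict, Tuple


class QuestionQuota:
    """Fixed-window quota: at most `limit` questions per user per window."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # user -> (index of the window last seen, questions asked in that window)
        self._state: Dict[str, Tuple[int, int]] = {}

    def allow(self, user: str) -> bool:
        """Return True and count the question if the user is within quota."""
        index = int(self.clock() // self.window)
        last_index, count = self._state.get(user, (index, 0))
        if last_index != index:
            count = 0  # a new window has started, so the count resets
        if count >= self.limit:
            self._state[user] = (index, count)
            return False
        self._state[user] = (index, count + 1)
        return True
```

For example, `QuestionQuota(limit=2, window_seconds=60)` would allow each user two questions per minute and refuse a third until the next window begins.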

Usage data

With usage data and related analytics, including information about which questions and responses were new, popular, and/or trending, editors could better prioritize which content to review.

Recommender systems

In theory, recommender systems could recommend content for individual editors to review, e.g., based on their interests.

Alignment

Drawing from the current draft of the 2023-2024 Wikimedia Foundation Annual Plan's product and technology objectives, this proposal intends to:

  • "Support the growth of high-quality and relevant content"
  • Encourage "the satisfaction of and support given to moderators, patrollers, and functionaries" by "acting on promising hypotheses as we engage in deeper research, including ML/AI-enabled approaches to improving workflows"
  • Enhance the exploration of "ML-enabled natural-language search experience"
  • Enhance the exploration of "the future of media-related workflows and media-rich content, starting with Commons"
  • Contribute to producing "an effective and efficient knowledge production platform"

Technical discussion

Database schemas

Wikianswers database schemas would include one or more tables with vector columns for embedding vectors. A project goal, then, would be to efficiently combine into a database schema the existing concepts of revision tables, page tables, and text tables with the newer concepts of embedding vectors and vector databases. Relevant tools include pgvector, a database extension which provides open-source vector-similarity search to PostgreSQL.
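To make the combination of page, revision, and embedding data concrete, here is a small runnable sketch using SQLite from the Python standard library. The table and column names are simplified assumptions and not the actual MediaWiki schema. Embeddings are stored as JSON text and compared in Python; under PostgreSQL with pgvector, the `embedding` column would instead be declared with the `vector(n)` type and ordered with a distance operator such as `<=>` (cosine distance).

```python
import json
import math
import sqlite3
from typing import List, Tuple

# Simplified, hypothetical schema: each chunk belongs to a revision of a page.
SCHEMA = """
CREATE TABLE page     (page_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY,
                       page_id INTEGER NOT NULL REFERENCES page(page_id),
                       rev_timestamp TEXT NOT NULL);
CREATE TABLE chunk    (chunk_id INTEGER PRIMARY KEY,
                       rev_id INTEGER NOT NULL REFERENCES revision(rev_id),
                       content TEXT NOT NULL,
                       embedding TEXT NOT NULL);  -- JSON array; vector(n) under pgvector
"""


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; treats a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def nearest_chunks(db: sqlite3.Connection, query_vec: List[float],
                   k: int = 1) -> List[Tuple[int, str]]:
    """Brute-force nearest-neighbour search over all stored chunks."""
    rows = db.execute("SELECT chunk_id, content, embedding FROM chunk").fetchall()
    rows.sort(key=lambda r: cosine_distance(query_vec, json.loads(r[2])))
    return [(r[0], r[1]) for r in rows[:k]]


def demo_db() -> sqlite3.Connection:
    """Build an in-memory database with one page, one revision, and two chunks."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO page VALUES (1, 'Plato')")
    db.execute("INSERT INTO revision VALUES (1, 1, '2023-09-21T21:29:00Z')")
    db.executemany("INSERT INTO chunk VALUES (?, 1, ?, ?)", [
        (1, "chunk about philosophy", json.dumps([1.0, 0.0, 0.0])),
        (2, "chunk about football", json.dumps([0.0, 1.0, 0.0])),
    ])
    return db
```

Tying chunks to revisions rather than directly to pages is one design option: it would let an answer record exactly which revision's text it was generated from.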

URL-addressability

Instead of requiring a new domain, e.g., https://en.wikianswers.org/, Wikianswers features could be integrated into the search systems of Wikipedia, Wikidata, and Commons. In this case, human-editable responses could still be URL-addressable, e.g.: https://en.wikipedia.org/qa/2b106ea8-4d1b-441f-9dc8-4555a9999ae9.

Datetime encoding

Some questions have impermanent answers and others are volatile, meaning that their answers could vary each time that the question was asked. In these regards, date and time data could be encoded into URLs in a human-readable manner, e.g., https://en.wikipedia.org/qa/2023/09/21/21/29/00/2b106ea8-4d1b-441f-9dc8-4555a9999ae9. Some questions and answers might involve different granularities of time. For example, a natural-language question "Which teams are in the Super Bowl?" might have a number of URLs, one for each year, e.g., https://en.wikipedia.org/qa/2022/40a7338d-fe75-4897-aee6-ec87141020a6 and https://en.wikipedia.org/qa/2021/40a7338d-fe75-4897-aee6-ec87141020a6.
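The URL patterns above could be produced by a helper along these lines. The function name, the `granularity` parameter, and the set of supported granularities are assumptions; only the `/qa/` path layout and the sample identifiers come from the examples in this section.

```python
from datetime import datetime
from typing import Optional
from uuid import UUID

BASE = "https://en.wikipedia.org/qa"  # path prefix taken from the examples above

# Hypothetical granularities mapped to strftime patterns.
GRANULARITY_FORMATS = {
    "year": "%Y",
    "day": "%Y/%m/%d",
    "second": "%Y/%m/%d/%H/%M/%S",
}


def qa_url(question_id: UUID, when: Optional[datetime] = None,
           granularity: str = "second") -> str:
    """Build a URL for a question, optionally qualified by a date and time."""
    if when is None:
        return f"{BASE}/{question_id}"
    stamp = when.strftime(GRANULARITY_FORMATS[granularity])
    return f"{BASE}/{stamp}/{question_id}"
```

For instance, `qa_url(UUID("40a7338d-fe75-4897-aee6-ec87141020a6"), datetime(2022, 1, 1), "year")` yields the per-year form used in the Super Bowl example.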

Dialogue

In the event of vague questions, LLM dialogue systems could respond by asking questions of users to obtain more information with which to subsequently route users to answers for their clarified questions.

User experience

In the approach where Wikianswers features are integrated into Wikipedia, Wikidata, and Commons search, user experiences could utilize the existing text search boxes atop pages. Perhaps the "magnifying glass" icon in those search boxes could be accompanied by a "question mark" icon. End-users would select, or activate, one of these two icons, and the active icon would determine whether the search box uses the existing keyword-based content search or the described Wikianswers human-editable question-answering subsystem. Still under consideration is whether and how end-users could indicate that their current page, or a selection within it, should serve as the focus when the system responds to their question.

Timeline

TBD

Background

There have been a number of related initiatives within and outside of Wikimedia, showing both the interest in this proposal and its challenges.

Wiki projects

Wiki proposals

Question-centric knowledge websites

Virtual assistants

  • Alexa is crowdsourcing answers

Proposed by

Alternative names

  • Wikiquestions
  • Wikiqna
  • Wikiqa

Domain names

TBD

People interested

Comments

  • As envisioned, the described wiki platform has a homepage where users type their questions. If their question has already been asked, then the user goes to an existing wiki page. Otherwise, the question is transformed and categorized, and, based on its categories or domains, one or more question-answering AI systems are delegated to provide content. This machine-generated content is one or more answers for the question, each answer explained and argued for. As each answer page is a wiki page, users can edit it to correct the answers and to train the AI question-answering systems. -- AdamSobieski (talk) 18:45, 16 December 2021 (UTC)
  • I think the "train the AI question-answering systems" should be firmly grounded as a tool to assist contributors in providing source and rationale in their elaboration of an answer and not a reader-facing system -- SebastienDery (talk) 17:49, 3 January 2021 (UTC)
  • The wiki aspect comes from these being human-curated and continually revised answers. Putting aside what AI can do, whereas Wikipedia shares knowledge with an "entity centric" lens, this would be from the entry point of a specific interrogation -- SebastienDery (talk) 17:49, 3 January 2021 (UTC)
Really? a) It's humans, and b) if the question isn't an article, you can a) make the article with the answers or b) press a button to show you related questions (using the same code that autocorrect uses). Is what you're asking a genuine question? Username142857 (talk) 11:34, 1 March 2022 (UTC)
  • What a coincidence I was also thinking about bringing q&a to wiki :D It seems like as it is the proposal is taking a strong AI centric view on what the product and process should be; wdyt of "firmly grounding AI as a tool to assist contributor in providing source and rationale in their elaboration of an answer and not a reader facing system"? If it's okay with you I would make some edits to the proposal and make it more "human centric". -- SebastienDery (talk) 18:28, 3 January 2021 (UTC)
  • Thank you. Yes. Enhancing the human-centricity of the project proposal sounds good. Please take care with the existing document outline as the content contains some intradocument hyperlinks. Do you mean by "not a reader-facing system" that you are envisioning that AI systems and argument technology would be tools for computer-aided document authoring instead of producing initial content in response to users' questions?
  • I'll make sure to suggest incremental changes; if you feel it's eventually taking too sharp of a turn we can always create an alternate project. -- SebastienDery (talk) 12:02, 3 January 2021 (PST)
  • I'm envisioning that the power of having this be a wiki (as opposed to a tech-owned virtual assistant or a for-profit website) is that it's written by humans, revised by humans, for humans. I think a simple Wikipedia-like format would do the trick, where each page is its own question and an answer will evolve there. Codified argumentation is often too strict and cumbersome to scale, and I would be concerned if we tried to build new tech on that front; the simple edit mechanism of Wiki should be enough to record the evolution of an answer and its counterarguments. If we assume this simple premise, what are the tools we can build that would accelerate contributors? I can think of a few and this is where I see AI shine! -- SebastienDery (talk) 12:05, 3 January 2021 (PST)
  • Similar to how Wikipedia and Wikidata try to maintain some parity I think a version of this project would benefit from contributing to Wikidata. Think triplets like "X is_a_paraphrase_of Y". -- SebastienDery (talk) 12:07, 3 January 2021 (PST)
  • I like that the original pitch tried to be very thorough and cover a lot of topics/ideas. As it is I would suggest that we boil it down to the essentials as one can easily get lost in the details of the original idea. What do you folks think? -- SebastienDery (talk) 15:07, 3 January 2021 (PST)
  • We should add to the Talk page Talk:Wikianswers
  • What do you think of moving the bulk of the technical ideas to the discussion page? I think it'll help us focus on the essential concepts of this proposal --SebastienDery (talk)
  • There is also the option of creating a new subpage, e.g., https://meta.wikimedia.org/wiki/Wikianswers/Technical_discussion , with its own wiki content and discussion area. My initial thoughts are that attention to technical detail distinguishes this proposal from previous wiki Q&A proposals. Also, I think that this proposal would benefit from reviewing the most recent successful project proposal, Abstract Wikipedia, in terms of its structure and content. Perhaps we could keep the technical content on the main proposal page for now and, at some point in the future, move it to a subpage hyperlinked to from the main page? What do you think? AdamSobieski (talk) 22:28, 4 January 2022 (UTC)
  • 100% agree on the subpage! great idea. I would be keen to move it sooner than later mostly for the reason that there's a lot of material and it feels overwhelming at first. The strategy I would adopt is "Convince me in the first 30 sec of my reading through and then I'll click and poke around if interested" -- SebastienDery (talk) 14:37, 4 January 2021 (PST)
  • Ok. I moved the technical discussion to a new subpage. What do you think about moving the user-experience discussion to its own subpage? What do you think about moving this comments section to the discussion page?
  • 100% agree on taking a leaf from Abstract Wikipedia -- SebastienDery (talk) 14:37, 4 January 2021 (PST)
  • I agree there is a need to distill what is the essence of the proposal and how/why Wiki should take it on. My intuition is that this battle will not be fought on the technology itself but rather how does it position Wiki in the knowledge space going forward. -- SebastienDery (talk) 14:37, 4 January 2021 (PST)
The project: My first answer was no. But now, yes. ✍️ Dušan Kreheľ (talk) 17:51, 7 January 2022 (UTC)
Yes! I would LOVE a Wikianswers! Username142857 (talk) 11:34, 1 March 2022 (UTC)
  •   Strong oppose as Reference Desk on Wikipedia already exists. --QuickQuokka [talk · contribs] 19:43, 20 March 2022 (UTC)
    Don't know (yet?) what my opinion is. But just wanted to say that this answer doesn't fit well, because there are wikis with reference desk and wikis without. As I can guess, it exists not on "Wikipedia", a wiki project that does not even exist, but on "English Wikipedia", one of hundreds. IKhitron (talk) 01:13, 17 May 2023 (UTC)
    That's something entirely different. Seems like you misunderstood what this is about. Prototyperspective (talk) 23:08, 5 December 2024 (UTC)
  • I think LLMs are pretty unfit for anything where accuracy and truth matter (they are designed only so it sounds plausible) and using a bot doesn't necessarily have an advantage over just typing your question in normal query format into a search engine with wiki at the end or into the Wikipedia search and navigating to the relevant place manually. Nevertheless, it could be interesting/useful and maybe developed further. There is already some activity regarding that I think. For example DuckDuckGo seems to have had this implemented for a while. I think there was a newer one...there also is a bot for wikidata like this. I think different technology may be better and it would probably not be as useful as people think it is. The complex question example of who lived longer seems to be what Wikifunctions is about. I don't see much of a need or use for this but it could nevertheless be or turn into something that is but maybe that doesn't need a new project and if it would benefit from that then this concept here may still need to be overhauled or substantially altered & extended. Few people wonder about who of several people lived longer (for other question types other sites work better) and when they do they can also quickly find out by doing a Web search of each person's age one after another. For answers where info is in and taken from Wikipedia, LLMs, again, are not designed to create accurate reasonable outputs but plausible-sounding ones. --Prototyperspective (talk) 23:24, 5 December 2024 (UTC)
    Here's the Wikidata bot / tool I meant: d:User:SpinachBot This bot is an auto-responding SPARQL generation and question answering bot created by User:HTriedman (WMF) and Stanford's Open Virtual Assistant Lab (OVAL) […] When tagged and prompted by you, SpinachBot will try to write a SPARQL query based on your input request and send it to Wikidata Query Service. It will then post its response. Discussion about it is here. Prototyperspective (talk) 12:01, 11 December 2024 (UTC)