Community Wishlist Survey 2022/Larger suggestions/Prompt users to replace article text with Wikidata-derived data

Prompt users to replace article text with Wikidata-derived data

  • Problem: Data in articles is stored as raw text and is often imprecisely referenced. The same data is, or could be, stored and cited on Wikidata and referenced directly from there.
  • Proposed solution:
    • When editing an article, a system automatically recognizes textual data in articles that could be represented on Wikidata and shows small prompt dialogs by the statements asking the user to convert the text to a template that derives data from Wikidata, such as Template:Wikidata (Q8478926). If the statement does not yet exist on the article's item or a related item, the system will prompt the user to create the statement and will provide a UI in the article to do so. If references appear near the statement, the system will also be able to ask the user whether they want to use one of those references for the statement.
    • Example usage: A user is editing Barack Obama. A small dialog box appears above the text/wikitext He graduated with a [[Bachelor of Arts]] degree in 1983 and a 3.7 [[Grading (education)#United States|GPA]]. stating, "This phrase can be converted to be sourced from Wikidata." If the user chooses to do so, the phrase will reference this statement on Obama's item and the text/wikitext will be replaced with: He graduated with a {{Wikidata|qualifier|linked|P69|Q49088|P512}} degree in {{Wikidata|qualifier|P69|Q49088|P582}} and a 3.7 [[Grading (education)#United States|GPA]].
  • Who would benefit: All projects. When statements with references are added to Wikidata, anyone on any wiki can use that data.
  • More comments:
  • Phabricator tickets:
  • Proposer: Lectrician1 (talk) 21:01, 11 January 2022 (UTC)
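
The replacement step in the example usage above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `wikidata_template` and `apply_suggestions` are hypothetical, the list of suggestions is written by hand, and the recognition step (finding which phrases correspond to Wikidata statements) is not implemented here at all. The template arguments are copied from the example in the proposal.

```python
def wikidata_template(*args: str) -> str:
    """Build a {{Wikidata|...}} call from positional arguments (hypothetical helper)."""
    return "{{Wikidata|" + "|".join(args) + "}}"


def apply_suggestions(wikitext: str, suggestions: list[tuple[str, str]]) -> str:
    """Replace the first occurrence of each literal phrase with its template call.

    In the proposed tool each replacement would be confirmed by the editor;
    here every suggestion is simply accepted.
    """
    for phrase, template in suggestions:
        wikitext = wikitext.replace(phrase, template, 1)
    return wikitext


# Wikitext from the proposal's Barack Obama example.
before = (
    "He graduated with a [[Bachelor of Arts]] degree in 1983 "
    "and a 3.7 [[Grading (education)#United States|GPA]]."
)

# Hand-written suggestions standing in for the recognition step.
suggestions = [
    ("[[Bachelor of Arts]]",
     wikidata_template("qualifier", "linked", "P69", "Q49088", "P512")),
    ("1983",
     wikidata_template("qualifier", "P69", "Q49088", "P582")),
]

after = apply_suggestions(before, suggestions)
print(after)
```

Literal string replacement is the simplest possible choice; a real tool would need to work on parsed wikitext so that it never rewrites text inside links, references, or other templates.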

Discussion

  • Identifying Wikidata statements from natural language might be a huge project. I think it is better to create a tool that makes referencing statements from Wikidata easier. Integrating SPARQL or the Wikidata Query Service into VisualEditor or WikiEditor may also be a good idea. --Steven Sun (talk) 01:31, 12 January 2022 (UTC)
    Yes, that is a good first-step solution. Lectrician1 (talk) 03:00, 12 January 2022 (UTC)
    @Steven Sun I have now created another proposal for exactly that: Tool to add Wikidata to an article Lectrician1 (talk) 22:04, 12 January 2022 (UTC)
  • Your example will probably work for English, but the text would sound unnatural in other languages. The goal of the Abstract Wikipedia project, which is in development, is a multilingual solution to this problem. --Matěj Suchánek (talk) 09:23, 12 January 2022 (UTC)
    @Matěj Suchánek Yes, I know that the system would have to be optimized for other languages and their grammatical structure.
    Also, I don't see Abstract Wikipedia as replacing Wikipedia anytime soon or maybe even ever (I still love it though :) ). Human-edited Wikipedia articles are still going to exist. Having Wikidata-sourced data wherever possible will benefit them because that way we know the data mentioned in the article and between articles of different languages is consistent and has a source. Lectrician1 (talk) 19:34, 12 January 2022 (UTC)
  • This would not really be accepted in all projects. In the majority of cases where the lack of Wikidata use has come up on Wikidata's own Project chat, the issue has been a lack of action against vandalism. --Snævar (talk) 11:43, 12 January 2022 (UTC)
    @Snævar That's why I proposed this :) Community Wishlist Survey 2022/Wikidata/Automated page protector Lectrician1 (talk) 16:33, 12 January 2022 (UTC)
  • (Edit conflict.) I'm struggling to see the benefit of this. From an editing perspective the original text is explicitly clear - say that you needed to change 1983 to 1984, even if this is someone's first time ever editing a wiki it's fairly obvious what they need to change. The replacement is a complete mess with templates plonked in the middle of sentences which make the text really hard to read, and doing something as simple as altering a date requires you to understand the property and item structure of Wikidata and to go to an entirely different project. The data you've given as an example here is completely static and is never going to change, so I don't see the need to load it from a database rather than just hardcoding it. You also state that this would improve referencing, but in the original article this statement is properly sourced to his profile on a university website (at the end of the sentence, right next to the text you quote) while Wikidata is sourced to "Imported from the English Wikipedia". This seems to be about normal for Wikidata entries; even in an entry like Barack Obama most of the data is unsourced. This also opens up another vector for vandalism to creep into pages, as information in the article is now dependent on another project. Per above, if you want to write articles pulling all their information from Wikidata, Abstract Wikipedia is the project to look at. As a final point, the use of Wikidata in articles is highly contentious, at least on the English project, something that the proposer should be well aware of given that they have spent the last year trying to get Wikidata integrated into Wikipedia only for their proposals to be repeatedly rejected (e.g. [1] [2] [3] and others). This tool would almost certainly not be allowed to run on the English Wikipedia (other projects may think differently) and the text of this proposal (mentioning adding Wikidata info to articles on the English Wikipedia) does seem to be an attempt to bypass the existing consensus of the English Wikipedia with an official WMF tool. 86.23.109.101 12:08, 12 January 2022 (UTC)

    I'm struggling to see the benefit of this. From an editing perspective the original text is explicitly clear - say that you needed to change 1983 to 1984, even if this is someone's first time ever editing a wiki it's fairly obvious what they need to change.

  • I would also agree that it is very obvious and simple to change the value. But what if that value is used on other language Wikipedias or is mentioned multiple times in an article? Their values won't be changed... Or, what if someone wants to add another reference for a value? Nobody can reasonably do all of those things on every language Wikipedia. But with Wikidata, you can!

    The replacement is a complete mess with templates plonked in the middle of sentences which make the text really hard to read, and doing something as simple as altering a date requires you to understand the property and item structure of Wikidata and to go to an entirely different project.

  • The current state of Wikitext on articles is already an unreadable mess of templates, links, and references. But, we have the Visual Editor for a reason! I believe that the increased usage of templates will benefit the quality of articles much more than the tradeoff of the Wikitext being a bit more of a mess than what it already is.

    The data you've given as an example here is completely static and is never going to change, so I don't see the need to load it from a database rather than just hardcoding it.

  • This example is beneficial because we can directly derive the reference for the data. With the current state of references often being placed at the end of sentences, we can't easily do this. Also, if someone wanted to add another reference for the data in the future, they could do this and all other wikis that fetch the data would have it. Finally, on a less useful note, finding out where and how Wikidata data values are referenced on Wikimedia projects could be useful for data analysis in the future.

    You also state that this would improve referencing, but in the original article this statement is properly sourced to his profile on a university website (at the end of the sentence, right next to the text you quote) while Wikidata is sourced to "Imported from the English Wikipedia". This seems to be about normal for Wikidata entries; even in an entry like Barack Obama most of the data is unsourced. This also opens up another vector for vandalism to creep into pages, as information in the article is now dependent on another project.

  • That's why we would require fetched values to have actual references. Also, as explained in the proposal, the system will be built to look for references near the data and ask the user which of them they would like to add to Wikidata. That should help clean up those "Imported from the English Wikipedia" references.

    Per above, if you want to write articles pulling all their information from Wikidata, Abstract Wikipedia is the project to look at.

  • See my response to that above.

    As a final point, the use of Wikidata in articles is highly contentious, at least on the English project, something that the proposer should be well aware of given that they have spent the last year trying to get Wikidata integrated into Wikipedia only for their proposals to be repeatedly rejected (e.g. [1] [2] [3] and others). This tool would almost certainly not be allowed to run on the English Wikipedia (other projects may think differently) and the text of this proposal (mentioning adding Wikidata info to articles on the English Wikipedia) does seem to be an attempt to bypass the existing consensus of the English Wikipedia with an official WMF tool.

  • Hopefully the tool will be easy to use and recognizably useful enough for editors to realize its benefits. My goal is not to bypass any consensus. I just want to offer ideas that will improve the state of all wikis, not just the English Wikipedia.
    Finally, I'd recommend you look at my Automated page protector wishlist proposal, which aims to address the primary concern about using Wikidata on the English Wikipedia: vandalism. Lectrician1 (talk) 21:01, 12 January 2022 (UTC)
    At most a value is going to be used three times in an article - the body, the lead and the infobox, updating all three together is not a massive deal and if you're changing figures you will often need to adjust other things to make sure the text still makes sense. You could update articles in other projects, but personally I don't think it would be a good idea for me to be changing sentences in languages I don't speak and am therefore unable to check.
    I don't use the visual editor regularly because I find it slower than editing wikitext, but my experience is that editing templates in it is a chore. If we were to implement your proposal, then when you try to edit an article, seemingly random bits of sentence won't be editable as text and instead will require you to navigate multiple layers of pop-up menus to fill in "P" and "Q" values, or to go to another project. It's still a much worse experience and would be completely inaccessible to a large number of users.
    The reason that references are placed at the end of sentences is that the entire sentence structure needs to be sourced, not just the random data points within it - the way you present data can completely change its meaning. References in articles are going to stay, so I think that Wikidata references would simply duplicate what is already there.
    Vandalism is only one small part of the reason why Wikipedias are unwilling to use Wikidata; an equally big issue is that most of the contents of Wikidata are unsourced and are generated by bots, user scripts and tools using a combination of guesswork and pattern matching. Last time I was there I saw bots doing all manner of slightly iffy things, e.g. guessing people's gender based on their first name. Wikipediocracy currently has a post on its front page about an actress who had fake porn accounts added to her Wikidata item. These weren't added by a vandal; they were added by a user with hundreds of thousands of edits because they were using a semi-automatic tool to match up Pornhub account names with people.
    It's also trivial to come up with examples where it is not possible to map information in sentences 1:1 to Wikidata statements. Consider the sentence "In 1945 John Doe moved to Exampleville, where he founded his first company, Widgets Inc". What do you link the "1945" to, the date when he moved or the founding date of his company? What do you link "Exampleville" to, his place of residence or the location of the company? This kind of stuff is going to show up all the time when dealing with actual articles.
    In summary:
    • Projects are not willing to use Wikidata in its current state for a variety of reasons including vandalism, the large amount of unsourced or poorly sourced data, its poor implementation/enforcement of policies regarding things like protection of living people and privacy etc.
    • I don't think that the "after" wikitext is an improvement. I think it would be slower and more difficult to write and edit, and such editing would be inaccessible to a large number of people who are not familiar with Wikidata. The contents of the text are also now dependent upon another project.
    • Wikidata usage in articles is highly controversial in some projects - in my estimation the probability that this tool would be allowed to be used on the English Wikipedia is approximately 0%. I imagine that the situation would be similar on at least a few of the other WMF projects.
    • I have serious doubts that it would even be feasible to build this tool. 86.23.109.101 00:08, 14 January 2022 (UTC)

    At most a value is going to be used three times in an article - the body, the lead and the infobox, updating all three together is not a massive deal and if you're changing figures you will often need to adjust other things to make sure the text still makes sense. You could update articles in other projects, but personally I don't think it would be a good idea for me to be changing sentences in languages I don't speak and am therefore unable to check.

  • You're not changing sentences in other articles. You're changing a single data value that is comprehensible in any language. Didn't you see my example? All Wikidata replaces is "Bachelor of Arts" and "1983".

    I don't use the visual editor regularly because I find it slower than editing wikitext, but my experience is that editing templates in it is a chore. If we were to implement your proposal, then when you try to edit an article, seemingly random bits of sentence won't be editable as text and instead will require you to navigate multiple layers of pop-up menus to fill in "P" and "Q" values, or to go to another project. It's still a much worse experience and would be completely inaccessible to a large number of users.

  • That's why I'm proposing for this to be implemented first: Community Wishlist Survey 2022/Editing/Tool to add Wikidata to an article.

    The reason that references are placed at the end of sentences is that the entire sentence structure needs to be sourced, not just the random data points within it - the way you present data can completely change its meaning. References in articles are going to stay, so I think that Wikidata references would simply duplicate what is already there.

  • I don't think placing a reference directly next to a data value is bad if that's what the Wikidata template was going to do (or it could do something else!). It tells users exactly where a fact came from. You can still maintain the references at the end of a sentence if you want. In general, I think references need to be restructured to somehow link to or "surround" the text they directly support. Arbitrarily placing references at the ends of sentences and paragraphs confuses machines and readers alike.

    Vandalism is only one small part of the reason why Wikipedias are unwilling to use Wikidata; an equally big issue is that most of the contents of Wikidata are unsourced and are generated by bots, user scripts and tools using a combination of guesswork and pattern matching. Last time I was there I saw bots doing all manner of slightly iffy things, e.g. guessing people's gender based on their first name. Wikipediocracy currently has a post on its front page about an actress who had fake porn accounts added to her Wikidata item. These weren't added by a vandal; they were added by a user with hundreds of thousands of edits because they were using a semi-automatic tool to match up Pornhub account names with people.

  • Um, machines are not going to be adding data from Wikidata to Wikipedia. Instead, humans are because they can actually check the data and its references and appropriately use it. All machines do in this proposal is prompt users to use Wikidata where it could be used. It doesn't execute any edit on its own. Also, as I've said before, all data fetched and inserted from Wikidata to Wikipedia must be sourced and the source reviewed by a human. Despite Wikidata containing bad data (which I completely recognize), none of it should enter Wikipedia. And if it does, watchers would see the change and revert it.

    It's also trivial to come up with examples where it is not possible to map information in sentences 1:1 to Wikidata statements. Consider the sentence "In 1945 John Doe moved to Exampleville, where he founded his first company, Widgets Inc". What do you link the "1945" to, the date when he moved or the founding date of his company? What do you link "Exampleville" to, his place of residence or the location of the company? This kind of stuff is going to show up all the time when dealing with actual articles.

  • This is why we have separate, linked items that describe each of these things in their associated contexts. Also, this system does not have to analyze every place where Wikidata could be used the day it launches. Complex sentences like that would require a lot more neural network training and configuration. There are potentially billions of simple single-subject statements on millions of articles across the English Wikipedia that the system would be able to recognize and correctly replace with Wikidata. That kind of power delivered by such a tool, even for simple sentences in articles, would be extremely beneficial for the wiki as a whole.

    I have serious doubts that it would even be feasible to build this tool.

It's completely feasible. If there are neural networks out there that can create original creative works or drive a car, we can totally make one that can recognize data in simple sentences. Lectrician1 (talk) 02:07, 14 January 2022 (UTC)

  • I think Wikidata is a large store of data, and we can do this if we need to. However, it requires complex analysis to map natural language to code and statement trees, and to the items and properties used in Wikidata. Still, I think we should use this idea in other languages, so that information does not have to be updated in every project, only in Wikidata. Thingofme (talk) 07:47, 14 January 2022 (UTC)
Please don't underestimate how many people have been working on Tesla's neural net for over 10 years now. The problem is not technical feasibility, it's how much work would be required to achieve that technical level. —TheDJ (talk • contribs) 15:02, 16 January 2022 (UTC)
  • @Lectrician1: It looks like there are two parts to this proposal: a way to add Wikidata values to Wikipedias, and an automated way to highlight particular sentences into which the data might be added. The first can be done with existing functionality (as you point out) but is controversial on some wikis; this is a matter of wiki policy, and it's not something that the CommTech team can do anything about. The second is not currently a feature, and is technically very challenging, probably more than a year of work; it is, however, a valid proposal, so I'll move this to larger suggestions. Hope you understand and aren't too disappointed that CommTech won't be able to work on it. — SWilson (WMF) (talk) 07:00, 26 January 2022 (UTC)

Voting