|Status of the proposal|
|Details of the proposal|
|Project description||This proposal presents a paraphrase graph database extension to Wikidata. The intent is to create a directed graph of paraphrase nodes with context-retaining edges, built initially from Wikipedia content but ultimately Internet-wide. Focusing at first on unstructured data (text), the vision is media-agnostic: harvesting sentence-equivalent information wherever it appears. This foundational construct is intended to support a wide range of uses, from indexing to misinformation detection to knowledge representation to enterprise work management.|
|Is it a multilingual wiki?||Ultimately this construct would be multilingual, covering the same languages as Wikipedia.|
|Potential number of languages||Ultimately this construct would be multilingual, covering the same languages as Wikipedia.|
|New features to require||Please see the Overview for details|
This proposal recommends the development of a semantic graph of paraphrases by the Wikimedia Foundation. With recent research illustrating that even phrases adhere to Zipf's Law, a paraphrase semantic graph is the most powerful foundational construct in data curation. By normalizing unstructured data to the full concepts (phrases, sentences, sentence fragments, etc.) we use to communicate with each other, the foundational data store is created. Everything else is metadata. With additional effort, a linked data architecture can unite knowledge domains to the graph, creating an extremely powerful knowledge representation. A capability this powerful must remain under the control of the people; thus the Foundation is the logical repository and shepherd.
I came at this problem from a work management perspective, but as the aha moments accumulated along the way, I realized this capability is much too powerful for a profit-incentivized business. During the discussion of this proposal, I would expect debates on using the graph for natural language programming, misinformation detection, and even AI ethics, among many other uses.
Only recently has creating this construct become feasible. Machine learning tools such as BERT have increased NLP semantic-resolution accuracy to acceptable levels. Further, supporting tools and techniques such as graph databases and linked data make at-scale creation of an unabridged (near-real-time, complete) paraphrase graph realistic.
Dictionaries have been around for almost 4,500 years, thesauri for over 1,500 years. WordNet is 25 years old! For many reasons, annotated corpora of sentences have been very limited in scope, and almost all paraphrase work has focused simply on detection. In fact, massive-scale corpora annotated at the phrase level or higher top out with the Google 5-gram datasets, which are counts only. The Microsoft Research Paraphrase Corpus tops out at 5,801 pairs and is focused on detection.
The paraphrase graph proposed here is a complete collection of paraphrases, retaining the relative position of each sentence (fragment, phrase, etc.) in situ within some container. The containers are envisioned as web pages, tweets, and documents, and later the audio portions of videos, podcasts, and other human communication containers where we can devise a way to understand the semantics and conduct a paraphrase comparison. Thus, each node is composed of N sentences which are all paraphrases of each other and represent one concept. Nodes are connected via directed edges from individual node members to the preceding and following members of other paraphrase nodes.
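The node-and-edge model described above can be sketched in code. This is a minimal illustrative sketch, not a specification: the class and field names (`Mention`, `ParaphraseNode`, `ParaphraseGraph`, etc.) and the example URLs are my own assumptions for demonstration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Mention:
    container: str   # e.g. the URL of the web page, tweet, or document
    position: int    # relative position of the sentence within that container
    text: str        # the sentence (fragment, phrase, etc.) as it appeared in situ

@dataclass
class ParaphraseNode:
    node_id: str
    # The N sentences that are all paraphrases of each other (one concept)
    mentions: list = field(default_factory=list)

class ParaphraseGraph:
    def __init__(self):
        self.nodes = {}     # node_id -> ParaphraseNode
        self.edges = set()  # (from_node_id, to_node_id, container): directed edge

    def add_mention(self, node_id, mention):
        node = self.nodes.setdefault(node_id, ParaphraseNode(node_id))
        node.mentions.append(mention)

    def link_successive(self, prev_id, next_id, container):
        # Within `container`, the concept `prev_id` immediately precedes `next_id`,
        # retaining the relative positions of sentences in their source.
        self.edges.add((prev_id, next_id, container))

g = ParaphraseGraph()
g.add_mention("c1", Mention("https://example.org/page", 0, "The cat sat on the mat."))
g.add_mention("c1", Mention("https://example.org/other", 3, "A cat was sitting on the mat."))
g.add_mention("c2", Mention("https://example.org/page", 1, "Then it fell asleep."))
g.link_successive("c1", "c2", "https://example.org/page")
```

Here node `c1` holds two paraphrases of one concept drawn from two containers, and the edge records that, within one container, `c1`'s concept directly precedes `c2`'s.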
The following is a high-level example of one way to build the graph. Modern machine learning tools can assign vectors to words and sentences based on semantic similarity. Even more sophisticated tools can perform paraphrase detection outright: deciding whether one given sentence is a paraphrase of another. Sadly, using the more robust approach alone means comparing every sentence on the Internet to every other sentence on the Internet (more than 500 billion sentences, so an intractable number of pairs). What can be done instead is to use the vector approach plus a machine learning clustering algorithm, and then apply paraphrase detection only to low-confidence clusters. The goal of the early approaches will be to work out the scale issues, with later work focused on edge-member fidelity. JSON-LD could be used for resource descriptions, and everything would be stored in a graph database. All of the technologies described to this point have well-supported, open-source software tools with generous licenses.
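The staged pipeline above (embed, cluster, then verify only borderline cases) can be sketched as follows. This is a toy illustration under stated assumptions: the hand-made 3-dimensional vectors stand in for real sentence embeddings (which would come from a model such as BERT), and the greedy clustering routine and 0.95 threshold are illustrative choices, not the proposal's actual algorithm.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster(embeddings, threshold=0.95):
    """Greedy single-pass clustering: each sentence joins the first
    existing cluster whose representative vector is within the cosine
    threshold; otherwise it starts a new cluster."""
    clusters = []  # list of (representative_vector, [sentence indices])
    for i, vec in enumerate(embeddings):
        for rep, members in clusters:
            if cosine(vec, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

sentences = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "The stock market fell sharply today.",
]
# Hand-made stand-in embeddings: the first two are deliberately close
embeddings = [
    (0.90, 0.10, 0.00),
    (0.88, 0.12, 0.01),
    (0.00, 0.20, 0.95),
]
clusters = cluster(embeddings)
# Low-confidence clusters (similarities near the threshold) would then be
# re-checked with a full paraphrase-detection model before becoming nodes.
```

With these vectors, the two cat sentences land in one cluster (a candidate paraphrase node) and the market sentence in another; in the proposed pipeline, only clusters with near-threshold similarities would incur the expensive pairwise paraphrase-detection step.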
With an initial operating capability, Wikipedia would have the data store to support the most complex search capability yet created. No longer will search be transactional. Each search string will allow machine learning models to ascertain all types of "next" and "other" information depending on which node the string resolves to. Additionally, the Foundation can develop APIs, as well as Foundation-built software services powered by the graph. While the Foundation is donation-only today, there are several financial models for business use of the APIs.
I chose Pragmatica due to the contextual-retention aspect of the graph's edges. Marketing is not my forte, so I would expect someone with that particular set of skills to suggest a better project name.
At the risk of giving attention to a pot not needing a stir, this proposal in many ways is similar to the goals of the Knowledge Engine.
It makes sense that this capability would initially be a suite of API services under the Wikidata domain.
- Support as initial proposer.--DougClark55 (talk) 22:52, 7 January 2021 (UTC)
- Comment: You may be (very) interested in the Abstract Wikipedia project, which if I understand correctly has the same goal but with a different approach, and is currently in long-term development. I hope you'll take a look through the many subpages there, and then join us in the various discussion channels. Quiddity (WMF) (talk) 19:38, 8 January 2021 (UTC)
- Comment: Indeed I am very interested in the Abstract project. As a data curation, Pragmatica will re-index Wikipedia to paraphrases in a graph construct. This graph database can potentially replace your current lexeme work. We would use TensorFlow and BERT to give us the higher order lexemes of sentences and sentence fragments. Typically lexemes are objects that are completely dependent on context provided by possibly a large number of other objects. Making paraphrases our basic units of meaning creates a much simpler traversal for the functions. There are ~500 billion sentences out in the Internet versus 100 trillion words. Pragmatica is the WordNet for sentences! I look forward to reading and understanding more about Abstract. Thank you for reaching out.--DougClark55 (talk) 19:29, 11 January 2021 (UTC)
- I think this is something that should not replace Lexemes in Wikidata. I like the Lexeme data in Wikidata and think it is easier to query than this proposal would be. I also think it would take a lot of energy to find all the words and the relations between them by scraping websites across the Internet.--Hogü-456 (talk) 19:00, 17 January 2021 (UTC)
- My proposal does not target the word level; it targets the sentence level. If you wanted something as in your description, WordNet has built a massive community around its lexical database. WordNet provides a complete morphology of words, from their senses to relationships ranging from synonym to meronym to hypernym. It makes little sense and would be a great waste of energy to build a new lexical database. WordNet has a friendly license and has many extensions. It is important to note that WordNet was also developed by cognitive scientists. It is arguably the finest lexical database ever developed. However, WordNet and Lexeme data both suffer from a complete lack of context or syntactic clues. For the translation function to work efficiently and with high accuracy, or for any higher-order text manipulation, we need something more than word relationships. Google continues to struggle with their knowledge representation as they too lack context, and they have historically topped out at five-grams for nearest-neighbor types of processing. GPT-3, the state of the art in text processing, is a massive language model with ~175B parameters trained on 45 terabytes of text. However, all it does is predict the next set of words given some input. It's a capability, not a tool. I have developed software using WordNet. It was great for its time. However, machine-learning-driven natural language processing easily beats humans in both quantity and quality. My proposal is a data curation. Yes, it uses machine learning, but the models are basically off the shelf, so fairly simple by today's machine learning standards. You can actually read the output. What is so powerful about this data curation is that it not only provides the context for what is next, it also provides contextual pathways. In essence, we would build the paraphrase equivalent of WordNet, with context. Further, the curation is of full concepts, not just the word building blocks.
So the energy use of people would be fairly low, as we automate the paraphrase detection, and it would also be low from a resource perspective, as we leave data in place using a modern linked data architecture. The first and most powerful outcome of this approach is that wikis will be re-indexed to concepts. Translation will be massively simplified, as we are translating larger contextual blocks than with lexemes. If anyone would like to do a Zoom/Teams/etc., I would love to discuss how this proposal can support Abstract. — Preceding unsigned comment added by 2601:643:381:74d0:1811:dfa0:45d3:8999 (talk) 19:02, 25 January 2021 (UTC)
- Comment: You might find interesting the Wikifact proposal which links to this proposal: https://meta.wikimedia.org/wiki/Wikifact#URL-addressable_statements_and_clusters_of_paraphrases . —The preceding unsigned comment was added by AdamSobieski (talk) 10:34, 17 February 2021
- Jake Ryland Williams, Paul R. Lessard, Suma Desu, Eric Clark, James P. Bagrow, Christopher M. Danforth, Peter Sheridan Dodds. 2015. Zipf's law holds for phrases, not words (V2). arXiv:1406.5181. https://arxiv.org/abs/1406.5181