WikiCite 2016/Report/Group 3/Notes

Workgroup session notes


Goal

  • Model and store citation instances

Topics

  • Use cases
  • What are we working on?
  • Discussion of semantic typing
  • Data about the citation instance
  • Where is the citation
  • Are citations separable?
    • Yes-- question of cost, maintenance

Tasks

  • Other things to do
  • Useful metadata it needs
  • What page the citation is on?
  • if citations are separable items, what can we store?
  • New feature coming -- attachments to wiki pages. (eg, categories)
    • Versioning updates
    • Changes to attachments change the page
  • Wiki text social constraints
  • javascript bot for: "wikitext unique identifier -> human readable text"
  • Using Zotero data for consistency checks
  • Need for querying, doable system for storing citations that can be queried
  • Revisit: are citations, and can they be, fully separable items?
    • Point: fully separable citations are useful for analysis
    • BUT separable citations are much more brittle, and have problems in some encodings
    • How to anchor; repetitive elements
  • Middle ground: citations can exist as separable, but still include placement mark in wiki page.
  • Cross wiki things, cross pages things
  • Possibly very simple 3-element model
  1. Source cited
  2. Cited bib data
  3. Anchor at this location
  • Discussion of
    • the standard (or "type") to which the bib data conforms
    • -> normalized into bib data
  • citation styles -- how the citation instance is rendered in human-readable form on the Wikipedia page.
    • probably on a per page basis, or a user preferences
    • discussion on rendering time vs user choice on citation styles
    • is citation style per citation? template
  • optional additions
    • fragment, page/chapter/anchor ?
    • sentiment ??
    • timestamp created_at modified_at
  • versions
    • citations point to the source; they do not reference a specific version of the bibliographic data
    • multiple copies of citation instances in a single page
      • related closely to...
    • do citation instances have a global ID
      • hash (sensitive ID system) or UUID (very insensitive)
    • "are page versions mutable?"
      • perhaps just hash the snippet text around the citation, and hash that into the unique identifier
      • ... answers "does the citation identifier change?"
    • spectrum:
      • always changes on every version ... never changes UUID -> both ends are near useless
      • good ids are some combination of data elements and hashing in between
  • suggestion:
    • surrogate-ID(UUID) __ smart hash of slowly changing stuff __ full hash that changes each time
  • middle point "smart hash" is difficult - determine when a citation changes
  • point: UUIDs are really useful, we want them.
    • required for diffs on two citations
  • citation ranges - lorem ipsum: [4][5][6]
    • co-citations?
    • they would have the same anchor
      • but then they wouldn't preserve order. (hard)
    • POINT: very valuable for understanding how knowledge is connected: human-curated connections by snippet
      • moves the analysis away from re-processing pages
  • discussion: how does an anchor work?
    • some meaningful expression of where the citation lives
    • discussion of use cases of co-citation analysis
    • result: anchors need to be smart, and will assert that citations point to the same location
  • what does the "source cited" thing look like?
    • just a URI?
  • discussion: not all bib data in wikidata
    • community norms for notability apply to data in wikidata (potential problem)
    • organic growth of use of wikidata - all entries manually curated
    • library base as alternative, but no curation
  • what do we allow if it's not a URI?
    • free text allowed?
  • fallback to current wikitext? consensus: no, problematic. costly.
    • citoid fields?
    • use bibliographic fields from wikidata
    • consensus:
      • use a URI, but if not
      • use a unique external identifier, if not
      • use the same bibliographic fields as defined in the wikidata models
  • for legacy citations:
  1. do nothing, or
  2. keep the wiki text
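
The consensus order for specifying the cited source (URI first, then a unique external identifier, then Wikidata-style bibliographic fields) could be sketched roughly as below. This is only an illustration: the `CitationTarget` class, its field names, and `resolve_target` are hypothetical and were not defined in the session.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CitationTarget:
    """Hypothetical container for the three ways a cited source can be specified."""
    uri: Optional[str] = None           # preferred: a URI for the source
    external_id: Optional[str] = None   # fallback: a unique external identifier
    bib_fields: dict = field(default_factory=dict)  # last resort: bibliographic fields


def resolve_target(target: CitationTarget) -> tuple:
    """Return (kind, value) following the order: URI, external ID, bib fields."""
    if target.uri:
        return ("uri", target.uri)
    if target.external_id:
        return ("external_id", target.external_id)
    if target.bib_fields:
        return ("bib_fields", target.bib_fields)
    # Free text and legacy wikitext are deliberately not accepted here.
    raise ValueError("citation target has no URI, identifier, or bibliographic fields")
```

A target that carries both a URI and an identifier resolves to the URI; a target with none of the three is rejected rather than falling back to free text.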

Still to be discussed

  • references, notes, further reading categories covered.


Day 2


Data Structure

UUID #identifies the citation object. Generated on demand. Do we need this in wikitext?
Citation target (URI) #bibliographic database object /or/ wikitext (with templates/parameters)
Target Anchor #chapter, page, etc. This is optional, but can only come from the wikitext
Citation origin (URI) #Wikipedia page specific revision
Origin anchor #point or range. some kind of offset
  • We discussed storing a structured representation of the citation target locally, as an "attachment" (MCR). This seems complicated and confusing though.
  • Anchors: preferably machine-actionable; possibly non-actionable text; fails gracefully
  • Hypothesis: Robust anchors for electronic documents: https://hypothes.is/about/
  • much work done on how to mark an anchor location in a web page
  • coordinated with Open Annotation

Implementation mechanism: citation object -> template -> wiki text -> html

Storage


Citation instances are defined and maintained as part of the wikitext. The complexity of this is minimized by maintaining the bibliographic data separately. A machine readable representation of the citation instances is extracted, and made available along with the article text (perhaps using Multi-Content-Revisions).

Editing

  • How will this be edited? How does migration happen?
  • You will have three or four out of five in Wikitext:
  • UUID (magic word _UUID_ substituted by PST. Optional?)
  • citation target (URI. Wikidata item? Free-form wikitext? This needs to be explicit)
  • target anchor (chapter, page, etc. This is optional, but can only come from the wikitext)
  • Display style - different templates as mechanism for how to do "citation styles"
  • discussion:
  • Not part of "actual" citation instance
  • Specified in wikitext for rendering
  • Could be per-page, or per-user, instead of per-citation.
  • Models / use case
  • how the existing could be migrated
  • new citations being created
  • prevention of re-entering bibliographic data
  • form based editing
  • what we want to use stand-alone citations for
  • citation recommendation: you cited A, maybe you want B,C,D
  • You read A and B, you want to read C
  • show us all the citations from Oxford in 2005-2015 that cover physics pages
  • name someone in an article, list the most cited works for that person
  • most frequently cited works from the topic of this page
  • publishers - article level metrics, how are the articles being used on different pages
  • which wikipedia users are creating the same citations repeatedly
  • dependency tree for knowledge: fact X in article is supported by page Y, Z.. .etc. network analysis
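
The recommendation use case ("you cited A, maybe you want B, C, D") could be approximated by counting co-citations once citation instances are available as data. The sketch below assumes the input is a plain list of (page, target) pairs; the function names and input shape are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count how often two targets are cited on the same page.

    citations: iterable of (page, target) pairs (an assumed input shape).
    """
    by_page = {}
    for page, target in citations:
        by_page.setdefault(page, set()).add(target)
    counts = Counter()
    for targets in by_page.values():
        # Sort so each unordered pair is always stored under the same key.
        for a, b in combinations(sorted(targets), 2):
            counts[(a, b)] += 1
    return counts


def recommend(cited, citations, limit=3):
    """Suggest targets most often co-cited with the given target."""
    scores = Counter()
    for (a, b), n in cocitation_counts(citations).items():
        if a == cited:
            scores[b] += n
        elif b == cited:
            scores[a] += n
    return [target for target, _ in scores.most_common(limit)]
```

For example, if pages P1 and P2 both cite A and B, and P2 also cites C, then `recommend("A", ...)` ranks B ahead of C.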

Issues

  • display style / human readability
  • template / json / function
  • use cases
  • relationship to wikitext
  • attachments
  • roll-out community / partnerships
  • make documentation & elucidate motivation/rationale
  • Relationship to Recent Changes changes
  • UUIDs and versioning
  • Consistent editing interface

Use cases

  • Migrating existing data
  • create new enhanced citations
  • what you use standalone citations for
    • bibliographic use

resources


Identity of Citations

  • use case: unchanging ID
  • use case: IDs that change when citation changes
    • where does the UUID come from, how does it persist? when a page changes, we don't want the UUID to change
    • easier if it's a hash
    • if UUID doesn't change then research is easier
  • If origin anchor changes, it's the same citation
  • Citation target stays the same
  • What defines a citation, use and generation of UUID
  • if an origin anchor changes, is it the same citation? (depends on use case)
  • if a citation target changes, is it the same citation? (no, it's different - not same uuid)
  • UUID can be injected in the wikitext, so the "same" citation can be tracked across revisions:
{{cite:SomeId:123495093802}} becomes {{cite:SomeId:123495093802 | _uuid_ }}
_uuid_ gets replaced in the text during the pre-save transform (PST), similar to how ~~~~ is replaced by the user's signature
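
A minimal sketch of that substitution step, assuming the placeholder is the literal parameter `_uuid_` inside a template call. This is not MediaWiki's actual pre-save transform code; the regular expression and function name are illustrative only.

```python
import re
import uuid

# Matches the hypothetical "_uuid_" placeholder when it is a whole template parameter.
PLACEHOLDER = re.compile(r"\|\s*_uuid_\s*(?=\||\}\})")


def substitute_uuids(wikitext: str) -> str:
    """Replace each _uuid_ placeholder with a freshly generated UUID.

    Parameters that already hold a UUID do not match and are left untouched,
    so saving the page again keeps existing identifiers stable.
    """
    return PLACEHOLDER.sub(lambda m: "|" + str(uuid.uuid4()), wikitext)
```

Running the function a second time on its own output changes nothing, which is the property needed for tracking the "same" citation across revisions.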

How to store and reference bibliographic data

  • Referencing in Wikipedia via reference rather than item in Wikidata solves both scalability and notability concerns in Wikidata.
  • Split between citation target and citation target anchor is arbitrary.

Two cases:

  • citations to sources that exist as Wikidata items (e.g. books, prominent articles)
  • citations to sources that won't become Wikidata items (e.g. article x, from newspaper y, date z)

Idea: wikidata statements have sources that could be referenced. A wikipedia author might express:

"this text expresses a notion that is modeled by that wikidata statement, so show the source references that wikidata has to support this statement".
{{cite-item:Q5678|page=15-17|style=Harvard}}    <--- citing a source that is described by an item
{{cite-statement:Q42|A47C4EA228B|style=Harvard}}   <--- citing a wikidata statement, e.g. "Water / Melting-Point / 0°C", means "recycling" the references for this statement.
{{cite-statement}}

is equivalent to the way references should be shown along with values pulled from wikidata.

  • Long discussion on how this could introduce the idea of citing 'facts' (e.g. item: Helium, statement: atomic number, value: 2, sources: list of publications... )
  • Difficult to model (e.g. statements changing over time, versioning statements instead of full items) and prone to loops (e.g. sources that become items)
  • This mechanism adds a level of indirection: text -> source becomes text -> statement -> source.
  • We might allow citing a specific revision of the statement:
{{cite-statement:Q42|A47C4EA228B|rev=346528745|style=Harvard}}
  • We might allow re-using only a specific reference (by id):
{{cite-statement:Q42|A47C4EA228B|ref-id=6A4EE82334C|style=Harvard}}
  • We might allow re-using citations of all statements about a property:
{{cite-statement:Q42|P31|style=Harvard}}
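
To make the proposed syntax concrete, the calls above can be split into their parts with a small parser. The `cite-item` and `cite-statement` forms are only proposals from this session, not existing parser functions, and the output dictionary layout below is an assumption.

```python
import re

# kind, item ID, then zero or more |-separated parameters.
CITE = re.compile(r"\{\{(cite-item|cite-statement):([^|}]+)((?:\|[^|}]*)*)\}\}")


def parse_cite(wikitext: str) -> dict:
    """Split a proposed cite-item / cite-statement call into its components."""
    m = CITE.fullmatch(wikitext.strip())
    if m is None:
        raise ValueError("not a cite-item or cite-statement call")
    kind, item, rest = m.groups()
    positional, named = [], {}
    # rest starts with "|" when parameters are present, so skip the empty first piece.
    for part in rest.split("|")[1:]:
        key, sep, value = part.partition("=")
        if sep:
            named[key.strip()] = value.strip()
        else:
            positional.append(part.strip())
    return {"kind": kind, "item": item.strip(), "positional": positional, "named": named}
```

With this layout, the statement or property ID ends up in `positional`, while `page`, `rev`, `ref-id`, and `style` end up in `named`.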

  • Citation instances are a relationship between a citing resource (the citation origin) and the cited resource (the citation target)
  • Citation instances can be modeled as follows:
  • ID, origin + anchor, target (bib data record) + anchor
  • target points to the current version, not a specific version, so updates to the bib data are reflected in any new rendering
  • it would be nice if the origin anchor would specify what section of text is covered by a citation, but this is uncommon, and should not be required.
  • it would be nice for the target anchor to be as specific as possible
  • Citation references are managed as part of wikitext (for now). The complexity is greatly reduced by being able to reference bibliographical data, instead of specifying it inline.
  • The citation style is defined locally in wikitext, e.g. by specifying a template name or parameter.
  • A machine-readable representation (e.g. as JSON) of citation instances is made available via a data API for every version of every page. This is achieved via a parser function or Lua library.
  • ...use cases...
  • Citations can reference bib data as:
    • wikidata items (e.g. a book like The Origin of Species). Chapter, page, etc can be supplied as local anchor information.
    • "recycling" the references attached to a wikidata statement (e.g. "water boils at 100°C" is supported by [1][2][3]...)
    • Alternatively, citations may contain bibliographical data directly, as part of the wikitext; we continue to use existing mechanisms like templates to structure them
    • Modelling all cited sources as separate wikidata items is impractical (maintenance overhead, community capacity, database scaling)

So:

  • three ways to specify a citation: inline, item, or statement.
  • citations managed in wikitext, available as JSON
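
One possible shape for the JSON made available per page revision is sketched below. The key names and the `kind` values are assumptions; display style is left out because the notes treat it as a rendering concern rather than part of the citation instance.

```python
import json

# The three ways to specify a citation named in the summary above.
KINDS = ("inline", "item", "statement")


def citation_to_json(uuid, origin, kind, target, target_anchor=None):
    """Serialize one citation instance to JSON (hypothetical layout)."""
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {KINDS}")
    record = {
        "uuid": uuid,
        "origin": origin,                            # page revision doing the citing
        "target": {"kind": kind, "value": target},   # what is being cited
    }
    if target_anchor is not None:
        record["target"]["anchor"] = target_anchor   # e.g. page or chapter
    return json.dumps(record, sort_keys=True)
```

An item citation would carry the Wikidata item ID as its value, a statement citation the statement ID, and an inline citation the raw template wikitext.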

Several details remain undecided:

  • How to track citation identity across page revisions (inject a UUID into the wikitext?)
  • How to specify anchors that are robust against editing?
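
For the identity question, the three-tier suggestion from day 1 (surrogate UUID, smart hash of slowly changing data, full hash that changes each time) could look roughly like this. What counts as "slowly changing" was not settled; treating the target and target anchor as the slow part, and the surrounding snippet as the fast part, is purely an assumption for this sketch.

```python
import hashlib
import uuid


def _digest(*parts: str) -> str:
    """Short SHA-256 digest over the given parts, with a separator between them."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()[:16]


def citation_ids(target, target_anchor, snippet, surrogate=None):
    """Return the three identifiers for one citation instance."""
    return {
        # Never changes once assigned; reused if passed in.
        "surrogate": surrogate or str(uuid.uuid4()),
        # Changes only when the cited source or its anchor changes (assumed rule).
        "smart_hash": _digest(target, target_anchor),
        # Changes whenever the surrounding snippet text changes as well.
        "full_hash": _digest(target, target_anchor, snippet),
    }
```

Editing the text around a citation then alters only the full hash, while changing the cited source alters both hashes and leaves the surrogate ID to be reassigned by policy.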

Resources:


https://etherpad.wikimedia.org/p/WikiCiteCWGReportDraft