Proliferation of URIs, Managing Coreference

Title: Proliferation of URIs, Managing Coreference

Description: General Issue: There are some negative consequences to the current proliferation of new URIs being minted for the same things. The issue is how to avoid or manage this.


About

Users	MichaelUschold
Domains	General

Additional information

Concise Summary

Issue: How to avoid or manage two negative consequences of the current proliferation of new URIs being minted for the same things. Specifically:

It is hard to find out when two URIs should be treated as referring to the same thing.
Even if you can find the links, prolific use of owl:sameAs will create computational problems.

Source: This discussion is taken from a thread called Managing Co-reference (Was: A Semantic Elephant?) on the W3C Semantic Web Discussion List. A vigorous discussion initially took place off the list, and then was moved to the list for the record.

Related Discussions: URIs and Unique IDs

Related Modeling Issues:

Example(s)

WordNet issued new URIs for all of its terms for the new version, even when they referred to exactly the same thing.
Yago and DBpedia minted different URIs for the exact same Yago classes.

Conclusions:

To some extent, URI proliferation is inevitable in an open web. It is probably the only way to allow people to independently publish web data.
We will have to rely on a combination of manual effort and sophisticated automated methods to detect clashes and attempt to resolve them. Opinions differ on how good a job automated techniques can do.
Large-scale duplication of URIs, as in the case of YAGO and DBpedia, is unfortunate, but has now largely been rectified. That it happened at all is due to all of this being very new.
It was argued that this does not matter much because no one is going to load all the triples into a single triple store. A counterargument is that you do not have to load everything: once a lot of triples are loaded into one store, the inefficiency of sameAs reasoning can already become a problem.
That these issues are arising at all can be seen as a good thing, insofar as it reflects that linked data is being made available and put to real use. Tim Berners-Lee notes that: "Life is, on balance, good."

Background

In May 2008, Michael Uschold kicked off a discussion with the subject "A Semantic Elephant" describing the unnecessary and costly proliferation of URIs and owl:sameAs links. This discussion evolved to be mostly about managing co-reference. Initially private, this discussion was moved to the W3C Semantic Web Discussion List at Tim Berners-Lee's request. See: Managing Co-reference (Was: A Semantic Elephant?).

The original discussion evolved into a discussion about owl:sameAs per se, which has been a recurring topic on various lists over the years. Aldo Gangemi provided a concise summary on May 16, 2008.

In October 2008, Uschold started a closely related discussion focused more on challenges with Versioning and URIs. It was under the subject: URIs and Unique IDs.

From all this, Uschold teased out three distinct but closely related modeling issues that are in this ODP Wiki:

Overloading OWL sameAs: owl:sameAs is being used in the linked data community in a way that is inconsistent with its semantics.
Proliferation of URIs, Managing Coreference: How to avoid or manage two negative consequences of the current proliferation of new URIs being minted for the same things. Specifically:
1. It is hard to find out when two URIs should be treated as referring to the same thing.
2. Even if you can find the links, prolific use of owl:sameAs will create computational problems.
Versioning and URIs: When and whether to make new URIs for different versions of things.

Original Post

Here is the original Post by Michael Uschold in May 2008:

This message is about URI proliferation on the Semantic Web, illustrated by examples from DBpedia, YAGO and Wordnet. Consider this:

The Yago team ontologized a portion of the Wikipedia category hierarchy and produced a set of YagoClasses, each corresponding to exactly one WordNet synset. Here is a sample URI:
http://www.mpii.de/yago/resource/wordnet_calculator_102938886
The DBpedians saw that it was good, so they added the Yago classes to their datasets. BUT: Different URIs were used.
http://dbpedia.org/class/yago/Calculator102938886
I found several URI conventions for WordNet on the web; here are 3 (of how many?):

The synset IDs in the first two cases (hardwired into the URIs) are from version 3.0, but one has to poke around to discover that. There might be version problems too, but that's another elephant.

There are various issues here:

What is the true relationship between all these URIs? E.g., are they identical in meaning? In general, there is no good way to determine what this relationship is.
Why are there so many URIs for ostensibly the same exact thing?
Even if you do find out the exact semantic relationship, what is a practical way to resolve it?
1. Using owl:sameAs or owl:equivalentClass may be adequate logically, but IMHO, is wholly inadequate practically. Adding a large number of redundant triples and requiring inferencing will burden the already limited capabilities of knowledge stores at large scale.
2. Pre-processing all the files with a script that changes the URIs may be possible, but has its own issues:
  1. Which URI should be the one chosen? Yet a new one? That exacerbates the problem.
  2. A simple namespace prefix would be easy to deal with, but in the example above the names are also different, e.g. wordnet_calculator_102938886 vs. Calculator102938886.
    Writing scripts that rely on naming conventions is dangerous. The conventions may not be consistently applied, and there may be 100s or 1000s or millions of names to check.
  3. The scripts have to be re-checked, revised and re-run each time a new version comes online.
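
As an editorial illustration (not part of the original post), the pre-processing option can be sketched in Python. The function name `yago_to_dbpedia` and the regular expression are assumptions: the pattern only encodes the naming convention visible in the two sample URIs above, and nothing guarantees that every Yago class follows it.

```python
import re

YAGO_PREFIX = "http://www.mpii.de/yago/resource/"
DBPEDIA_PREFIX = "http://dbpedia.org/class/yago/"

# Assumed convention, inferred from the single example pair in the post:
#   wordnet_<word>_<synset id>  ->  <Word><synset id>
YAGO_LOCAL_NAME = re.compile(r"^wordnet_([a-z]+)_(\d+)$")


def yago_to_dbpedia(uri):
    """Return the guessed DBpedia URI, or None if the convention does not apply."""
    if not uri.startswith(YAGO_PREFIX):
        return None
    match = YAGO_LOCAL_NAME.match(uri[len(YAGO_PREFIX):])
    if match is None:
        return None  # the name does not follow the assumed pattern
    word, synset_id = match.groups()
    return DBPEDIA_PREFIX + word.capitalize() + synset_id


# The example from the post is rewritten as expected.
rewritten = yago_to_dbpedia(
    "http://www.mpii.de/yago/resource/wordnet_calculator_102938886"
)

# A hypothetical multi-word name (invented for this sketch) breaks the pattern.
unmatched = yago_to_dbpedia(
    "http://www.mpii.de/yago/resource/wordnet_pocket_calculator_000000000"
)
```

Any local name that deviates from the assumed pattern falls through to `None` and has to be resolved by hand, which is the fragility the post warns about.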

Specific Questions/Recommendations to the DBpedia and Yago teams:

Would it be possible to agree on a single URI for the Yago classes, and then use some other mechanism to go to the right URL?
What is the proper logical relationship between the Yago classes and the WordNet synsets?
Should they be taken as logically equivalent or merely as 'corresponding' to each other in some manner?
Might that correspondence be expressed with a meta-level property (say correspondingSynset) with domain: YagoClass and range: WordNetSynset?
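
As an editorial illustration (not part of the original post), the suggested meta-level property can be sketched with triples held as plain Python tuples. The `ex:` names and the placeholder synset identifier are assumptions made for this sketch; the post does not give a WordNet URI for the calculator synset. The point is that the Yago class and the synset stay distinct resources, linked by a correspondence, with no owl:sameAs involved.

```python
# Prefixed names are written as plain strings to keep the sketch dependency-free.
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"

# Hypothetical vocabulary; the ex: namespace is an assumption.
CORRESPONDING_SYNSET = "ex:correspondingSynset"

schema = {
    (CORRESPONDING_SYNSET, RDFS_DOMAIN, "ex:YagoClass"),
    (CORRESPONDING_SYNSET, RDFS_RANGE, "ex:WordNetSynset"),
}

data = {
    (
        "http://www.mpii.de/yago/resource/wordnet_calculator_102938886",
        CORRESPONDING_SYNSET,
        "ex:someWordNetSynset",  # placeholder, not a real WordNet URI
    ),
}


def infer_types(schema, data):
    """Apply rdfs:domain and rdfs:range to derive rdf:type triples."""
    domains = {p: c for (p, rel, c) in schema if rel == RDFS_DOMAIN}
    ranges = {p: c for (p, rel, c) in schema if rel == RDFS_RANGE}
    inferred = set()
    for s, p, o in data:
        if p in domains:
            inferred.add((s, RDF_TYPE, domains[p]))
        if p in ranges:
            inferred.add((o, RDF_TYPE, ranges[p]))
    return inferred


inferred = infer_types(schema, data)
```

Here the only entailments are two typing triples, one for each end of the correspondence; nothing forces a reasoner to merge the two resources.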

In summary:

In Uschold's original post to selected individuals, it was noted that a proliferation of different URIs for the same resource was occurring, and that it was causing two specific problems:

It is hard to find out when two URIs should be treated as referring to the same thing.
Even if you can find the links, prolific use of owl:sameAs will create computational problems.

Summary of Responses

Chris Bizer:

Problem 1 is not really so bad, because there is much matching technology out there that can be used, although there will be some limits on precision. Problem 2 is not a problem either, because no one is going to load everything into a single store.

Frank van Harmelen:

Problem 1 is very real, but has only recently become a problem, with the recent surge of semantic web data coming online. Frank disagrees with Chris Bizer's optimism. Also, matching at the schema/class level is handled differently from matching instances. Frank refers to some good work going on in addressing these issues, not by matching after the fact, but by eliminating the proliferation at the source.

Chris Bizer:

My optimism was more about instance level identity links than at the class level. Within the LOD effort we repeatedly run into situations where it is really easy to generate owl:sameAs links based on some simple domain-dependent rules.
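
To give a feel for what such a simple domain-dependent rule might look like, here is a minimal Python sketch. It is an assumption-laden illustration, not a description of how the LOD effort actually generates its links: the two datasets, their URIs and the choice of ISBN as the identifying key are all invented for this example.

```python
def same_as_links(left, right, key):
    """Link records from two datasets that share the same value for `key`."""
    # Index the right-hand dataset by the identifying value.
    index = {}
    for uri, record in right.items():
        value = record.get(key)
        if value is not None:
            index.setdefault(value, []).append(uri)

    # Emit one owl:sameAs triple per pair of records that agree on the key.
    links = []
    for uri, record in left.items():
        for other in index.get(record.get(key), []):
            links.append((uri, "owl:sameAs", other))
    return links


# Hypothetical datasets describing books; all URIs and values are invented.
dataset_a = {
    "http://a.example/book/1": {"isbn": "0000000001", "title": "First Book"},
    "http://a.example/book/2": {"isbn": "0000000002", "title": "Second Book"},
}
dataset_b = {
    "http://b.example/item/x": {"isbn": "0000000001"},
    "http://b.example/item/y": {"title": "No ISBN recorded"},
}

links = same_as_links(dataset_a, dataset_b, "isbn")
```

The rule is only as good as the key: it works where a domain has a reliable shared identifier, and it says nothing about class-level matching, which is the distinction drawn in the response above.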

Kingsley Idehen:

The URL problems are being addressed, e.g. in the UMBEL project. Wikipedia, OpenCyc, WordNet and Yago identifiers are being rationalized. See: http://www.umbel.org/announcement.xhtml

Fred Giasson:

There are edge cases when it is not immediately clear, even for a human, to decide what deserves a unique URI.

Jim Hendler:

"So what you are really saying is scaling is a technology/research challenge now that there's much more out there. We need to go beyond just triple stores and get some fast inferencing at Web scales. Makes sense to me."

Michael Uschold:

The computational issue of owl:sameAs proliferation is a major problem, even if no one is going to load all the semantic web data into a single store. For today's triple stores that do limited inference, owl:sameAs "has a significant run time" according to the developers of OpenLink's Virtuoso triplestore. It can easily double query times.

Chris Bizer's remark that there is no need to worry because no one is going to load all the data misses two important facts. First, companies that build and deliver software products using public data will have to bring the data they are using in house to control it. Do you really think that, for example, Powerset relies on the data sitting on the DBpedia servers? Second, you don't have to load all the data before computational issues arise. Proliferation of URIs on a large scale will cause performance issues and should be avoided where possible.
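
One way to see where extra query-time work can come from is the following minimal Python sketch. It is an illustration under stated assumptions, not how Virtuoso or any other store implements owl:sameAs: the function names, the prefixed identifiers and the sample triples are invented. The sketch groups URIs into equivalence classes with a union-find structure and then answers a lookup by scanning once per alias of the subject.

```python
def alias_sets(same_as_pairs):
    """Map each URI to its full equivalence class under owl:sameAs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)  # merge the two classes

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return {uri: members for members in groups.values() for uri in members}


def objects_of(triples, subject, predicate, aliases):
    """Answer a lookup under sameAs semantics: one scan per alias."""
    results = set()
    for alias in aliases.get(subject, {subject}):
        results |= {o for s, p, o in triples if s == alias and p == predicate}
    return results


# Hypothetical identifiers standing in for three URIs of the same class.
pairs = [("yago:Calc", "dbp:Calc"), ("dbp:Calc", "wn:Calc")]
triples = [
    ("yago:Calc", "rdfs:label", "calculator"),
    ("wn:Calc", "ex:gloss", "a machine for calculating"),
]

aliases = alias_sets(pairs)
gloss = objects_of(triples, "dbp:Calc", "ex:gloss", aliases)
```

With three aliases, a lookup that would otherwise touch one subject now has to consider three, and the factor grows with every additional URI minted for the same thing.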

Soren Auer:

Even with such proliferation, people will be able to build useful applications. Once certain information sources are established (and for that, PageRank-inspired data rank algorithms could be developed), people will automatically tend to reuse established identifiers, and this will counteract the proliferation.

Tim Berners-Lee:

So multiple URIs for the same thing is life, a constant tradeoff, but life is, on balance, good.


Community:Proliferation of URIs, Managing Coreference. Revision as of 22:02, 13 April 2010 by MichaelUschold.