
Fruitful RDF vocabularies are like cherries not bananas

from Sharing and learning

The SEMIC style guide for semantic engineers is a very interesting document that is currently in review. It is part of a set of actions with the aim:

to promote semantic interoperability amongst the EU Member States, with the objective of fostering the use of standards by, for example, offering guidelines and expert advice on semantic interoperability for public administrations.

I am interested in the re-usability of RDF vocabularies: what is it that makes it easy or hard to take terms from an existing vocabulary and use them when creating a schema for describing something. That seems to me to be important for interoperability in a world where we know that there is going to be more than one standard for any domain and where we also know that no domain is entirely isolated from its neighbours which will have their own native standards. We need data at an atomic level that can persist through expressions conformant with different standards, and that is easiest if the different standards share terms where possible.

This idea of reusing terms from vocabularies is core to the idea of application profiles, and to the way that we conceive Dublin Core and LRMI terms being used, and indeed the way they are being used in the IEEE P2881 Learning Metadata work. One well-known example of an application profile that reuses terms from Dublin Core is the Data Catalog Vocabulary (DCAT), which uses terms from Dublin Core, ODRL, FOAF, PROV, and a few terms created specifically for DCAT to describe entities and relationships according to its own conceptual domain model.

The SEMIC guidelines have lots of interesting things to say about reuse, including rules on what you can and cannot do when taking terms from existing vocabularies in various ways, including reuse “as-is”, reuse with “terminological adaptations” and reuse with “semantic adaptations”. I read these with special interest as I have written similar guidelines for Credential Engine’s Borrowed Terms policy. I am pleased to say we came to the same conclusion (phew). In discussion with the SEMIC authors that conclusion was described as “don’t cherry-pick”, that is: when you borrow or reuse a term from a vocabulary you must comply with everything that the–let’s use the O word here–ontology in which it was defined says about it. That’s not just the textual definition of the term but all that is entailed by statements about domain, range, relationships with other properties and so on. If the original ontology defines, directly or indirectly, “familyName” as being the name of a real person, then don’t use it for fictional characters.

Oof.

“The problem with object-oriented languages is they’ve got all this implicit environment that they carry around with them. You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.”

Joe Armstrong, Coders at Work

I want to cherry-pick. I am looking for cherries, not bananas! I am looking for “lightweight” ontologies, so if you want your terms to be widely reused please create them as such (and this is also in the SEMIC guidelines). Define terms in such a way that they are free from too much baggage: you can always add that later, if you need it, but others can be left to add their own. One example of this is the way in which many Dublin Core Terms are defined without an rdfs:domain declaration. That is what allowed DCAT and others to use them with any class that fits their own domain model. Where ranges are declared for Dublin Core Terms they are deliberately broadly defined. The approach taken by schema.org in some ways goes further by not using rdfs:domain and rdfs:range but instead defining domainIncludes and rangeIncludes where the value is “a class that constitutes (one of) the expected type(s)” (my emphasis): by being non-exclusive domainIncludes and rangeIncludes provide a hint at what is expected without saying that you cannot use anything else.
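To make the contrast concrete, here is a minimal Turtle sketch of the two styles of definition (the ex: terms are hypothetical; schema:familyName and schema:domainIncludes are real schema.org terms):

```turtle
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix ex:     <http://example.org/terms/> .

# Heavyweight style: rdfs:domain entails that anything bearing
# ex:familyName IS an ex:RealPerson -- every reuser inherits that claim.
ex:familyName rdfs:domain ex:RealPerson .

# schema.org style: Person is merely an expected type; using the
# property with some other class is a surprise, not a contradiction.
schema:familyName schema:domainIncludes schema:Person .
```

The first form carries its entailments wherever the term goes; the second travels light.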

If you’re interested in this line of thinking, then read the special section of Volume 41, Issue 4 of the Bulletin of the Association for Information Science and Technology on Linked Data and the Charm of Weak Semantics edited by Tom Baker and Stuart Sutton (2015).


The post Fruitful RDF vocabularies are like cherries not bananas appeared first on Sharing and learning.

SHACL, when two wrongs make a right


I have been working with SHACL for a few months in connexion with validating RDF instance data against the requirements of application profiles. There’s a great validation tool created as part of the JoinUp Interoperability Test Bed that lets you upload your SHACL rules and a data instance and tests the latter against the former. But be aware: some errors can lead to the instance data successfully passing the tests; this isn’t an error with the tool, just a case of blind logic: the program doing what you tell it to regardless of whether that’s what you want it to do.

The rules

Here’s a really minimal set of SHACL rules for describing a book:

@base <http://example.org/shapes#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <https://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

<Book_Shape> a sh:NodeShape ;
    sh:property
        <bookTitle_Shape> ;
    sh:targetClass sdo:Book .

<bookTitle_Shape> a sh:PropertyShape ;
    sh:path dct:title ;
    sh:datatype rdf:langString ;
    sh:nodeKind sh:Literal ;
    sh:minCount 1 ;
    sh:severity sh:Violation .

Essentially it says that the description of anything typed as a schema.org Book should have at least one title, provided as a langString using the Dublin Core title property. Here’s some instance data:

@prefix : <http://example.org/books#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <https://schema.org/> .

:42 a sdo:Book;
  dct:title "Life the Universe and Everything."@en .

Throw those into the SHACL validator and you get:

Result: SUCCESS
Errors: 0
Warnings: 0
Messages: 0

Which (I think) is what you should get.

So wrong it’s right

But what about this instance data:

@prefix : <http://example.org/books#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <http://schema.org/> .

:42 a sdo:Book;
  dct:title "Life the Universe and Everything." .

For this the validator also returns

Result: SUCCESS
Errors: 0
Warnings: 0
Messages: 0

Did you spot the difference? Well, for one thing the title has no language attribute; it’s not a langString.

How about:

@prefix : <http://example.org/books#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <http://schema.org/> .

:42 a sdo:Book .

which has no title at all, but still you’ll get the “success, no errors” result. That cannot be right, can it?

Well, yes, it is right. You see there is another error in those last two examples. In the SHACL and the first example, the URI for Schema.org is given as

@prefix sdo: <https://schema.org/> .

In the second and third examples it is

@prefix sdo: <http://schema.org/> .

That’s the sort of subtle and common mistake I would like to pick up, but in fact it stops the validator from picking up any mistakes. That’s because the SHACL rules apply to anything typed as a https://schema.org/Book and in the second and third examples (where the prefix is http, not https) there isn’t anything typed as such. No rules apply, no rules are broken: zero errors — success!
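One cheap guard against this failure mode is to check, before validating, that the shape’s target actually matches something in the instance data, for example with a SPARQL ASK query:

```sparql
# Does anything in the data match the shape's target class?
ASK { ?s a <https://schema.org/Book> }
```

If this returns false, then “zero violations” from the validator means nothing was tested, not that everything passed.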

What to do?

I’m not really sure. I see this type of problem quite a lot (especially if you generalize “this type of problem” to mean the test not having quite the logic I thought it did). I suppose lesson one is always to test shapes with invalid data to make sure they fail as expected. That’s a lot of tests.

Arising from that is to write SHACL rules to check for everything: the errors above would have been picked up if I had checked that there is always at least one entity of the expected type for books; there’s a recipe for this on the SHACL wiki.
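As a hedged sketch of that idea (my own variation, not the wiki’s exact recipe), a SPARQL-based constraint can report a violation when nothing of the expected type exists at all:

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix sdo: <https://schema.org/> .

<AtLeastOneBook_Shape> a sh:NodeShape ;
    # target an arbitrary fixed node so the shape always has a focus node
    sh:targetNode sdo:Book ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "No entity of type schema.org Book found in the data." ;
        sh:select """
            SELECT $this
            WHERE { FILTER NOT EXISTS { ?book a <https://schema.org/Book> } }
            """ ;
    ] .
```

Because $this is pre-bound to the target node, the query returns a result (and so a violation) exactly when the FILTER NOT EXISTS finds no sdo:Book in the data.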

Generalizing from the idea that simple typos can mean data not being tested because a term identifier doesn’t match any term in the schema you’re using, it’s worth checking that all term identifiers in the instance data and in the SHACL are actually in the schema. This will pick up when sdo:Organisation is used instead of sdo:Organization. SHACL won’t do this, but it’s easy enough to write a Python script that does.
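Here is a sketch of such a script (pure standard library, working over N-Triples rather than Turtle to keep the parsing trivial; the function names and example vocabulary are mine):

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def terms_used(ntriples: str) -> set:
    """Collect IRIs used as predicates, plus classes used as objects of rdf:type."""
    terms = set()
    for line in ntriples.splitlines():
        # naive extraction: every <...> on the line, in order of appearance
        iris = re.findall(r"<([^>]+)>", line)
        if len(iris) >= 2:             # at least a subject and a predicate
            terms.add(iris[1])         # the predicate
            if iris[1] == RDF_TYPE and len(iris) >= 3:
                terms.add(iris[2])     # the class
    return terms

def unknown_terms(ntriples: str, vocabulary: set) -> set:
    """Terms used in the data that the vocabulary does not define."""
    return terms_used(ntriples) - vocabulary - {RDF_TYPE}

# A hypothetical vocabulary, and data with the classic misspelt class:
vocab = {"https://schema.org/Organization", "https://schema.org/name"}
data = (
    '<http://example.org/o1> '
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<https://schema.org/Organisation> .\n'
    '<http://example.org/o1> <https://schema.org/name> "ACME" .'
)
print(unknown_terms(data, vocab))   # -> {'https://schema.org/Organisation'}
```

A real script would parse properly (e.g. with rdflib) rather than regex-scan lines, but even this crude version catches Organisation-for-Organization slips.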


When RDF breaks records


In talking to people about modelling metadata I’ve picked up on a distinction mentioned by Stuart Sutton between entity-based modelling, typified by RDF and graphs, and record-based structures, typified by XML; however, I don’t think making this distinction alone is sufficient to explain the difference, let alone why it matters. I don’t want to get into the pros and cons of either approach here, just give a couple of examples of where something that works in a monolithic, hierarchical record falls apart when the properties and relationships for each entity are described separately and those descriptions are put into a graph. These examples are especially relevant when people familiar with XML or JSON start using JSON-LD. One of the great things about JSON-LD is that you can use instance data as if it were plain JSON, without paying much regard to the “LD” part; that is not true when designing specs, because design choices that would be fine in a JSON record will not work in a linked data graph.

1. Qualified Instances

It’s very common in a record-oriented approach when making statements about something that many people may have done, such as attending a specific school, earning a qualification/credential, learning a skill etc, to have a JSON record that looks something like:

{ "studentID": "Person1",
  "studentName": "Betty Rizzo",
  "schoolsAttended": [
    { "schoolID": "School1",
      "schoolName": "Rydell High School",
      "schoolAddress" : {...},
      "startDate": "1954",
      "endDate": "1959"
    }
  ]
}

It’s tempting to put a @context on the top of this to map the property keys to an RDF vocabulary and call it linked data. That’s sub-optimal. To see why, consider two students: Betty, as above, and Sandy, who joined the school for her final academic year, 1958–59. Representing her data and Betty’s as RDF graphs we would get something like:

[Figure: Two RDF graphs for two people attending the same school at different dates]

The upper graph is a representation of what you might get for the record about Rizzo shown above, if you choose a suitable @context. The lower is similar data about Sandy. When this data is loaded into an RDF triple store, the statements will be stored separately and duplicates removed. We can show that data as a single merged graph:

[Figure: RDF graph showing two people attending the same school, with start dates and end dates as properties of the school]

Whereas in a record the hierarchy preserves the scope for statements like startDate and endDate so that we know who they refer to, in the RDF graph statements from the JSON object describing the school attended are taken as being about the school itself. The problem arises because the information about the school is treated as data that can be linked to by anything that relates to the school, not just the entity in whose record it was found, which makes sense in terms of data management.
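In Turtle, and assuming a naive @context that maps the hypothetical keys above straight onto properties (the ex: terms are mine), the troublesome part of the merged graph looks something like:

```turtle
@prefix ex: <http://example.org/terms/> .
@prefix :   <http://example.org/entities/> .

# Both attendance statements now hang directly off the school:
:School1 ex:schoolName "Rydell High School" ;
    ex:startDate "1954" ;   # came from Betty's record
    ex:startDate "1958" .   # came from Sandy's record -- nothing says whose is whose
```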

There are options for fixing this: one is not to merge the graphs about Betty and Sandy, but that means repeating all the data about the school in every record that mentions it; another possible solution is to use the property-graph or RDF-star approach of annotating the schoolAttended property directly with startDate and endDate; but often the answer lies in better RDF modelling. In this case we could create an entity to model the attendance of a person at a school:

[Figure: Separate RDF graphs showing the attendance of two individuals at a school]

and when these are merged:

[Figure: Single RDF graph showing the attendance of two individuals at a school]

which keeps the advantage of not duplicating information about the school while maintaining the information about who attended which school when. In JSON-LD this combined graph would look something like:

{ "@context": {...},
  "@graph": [
    { "@id": "Person1",
      "name": "Betty Rizzo",
      "schoolAttended": { 
        "startDate": "1954",
        "endDate": "1959",
        "at": {"@id": "School1"}
      }
    },{
      "@id": "Person2",
      "name": "Sandy Olsson",
      "schoolAttended": {
        "startDate": "1958",
        "endDate": "1959",
        "at": {"@id": "School1"}
      }
   },{
     "@id": "School1",
     "name": "Rydell High",
     "address": {
       "@type": "PostalAddress",
       "...": "..."
     }
   }]
}


Finally, those who just want a JSON record for an individual student that could easily be converted to LD could use something like:

{ "studentID": "Person1",
  "schoolsAttended": [
    { "startDate": "1954",
      "endDate": "1959",
      "at": {
          "schoolID": "School1",
          "schoolName": "Rydell High School",
          "schoolAddress" : {...}
      }
    }
  ]
}

You might think that the “attendance” object sitting between a person and the school is a bit artificial and unintuitive, which it is, but it’s no worse than the join tables that relational database systems need for many-to-many relationships.
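For comparison, here is a minimal sketch of that relational analogue (table and column names are mine): the relationship gets its own table precisely so that its dates belong to the person/school pair rather than to either party.

```sql
-- A join table: start/end dates belong to the relationship itself,
-- not to the person or to the school.
CREATE TABLE attendance (
  person_id  INTEGER NOT NULL REFERENCES person(id),
  school_id  INTEGER NOT NULL REFERENCES school(id),
  start_date TEXT,
  end_date   TEXT
);
```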

2. Lists

Another pattern that comes up a lot is when logically separate resources may be ordered in different ways for different reasons. These may be people in a queue, journal articles in a volume, or learning resources in a larger learning opportunity; anywhere you might want to say “this” comes before “that”. Say we have an educational program with a number of courses that should be taken in sequential order. JSON lists are ordered, so as a record this seems to work:

{
  "name": "My Program",
  "hasCourse": [
    {"name": "This"},
    {"name": "That"},
    {"name": "The other"}
  ]
}

So we sprinkle on some syntactic sugar for JSON-LD:

{
 "@context": {"@vocab": "http://schema.org/", 
               "@base": "http://example.org/resources/"},
 "@type": "EducationalOccupationalProgram",
 "name": "My Program",
 "hasCourse": [
   {"@type": "Course",
    "@id": "Course1",
    "name": "This"},
   {"@type": "Course",
    "@id": "Course2",
    "name": "That"},
   {"@type": "Course",
    "@id": "Course3",
    "name": "The other"}
 ]
}

But there is no RDF statement in there about ordering, and the ordering of JSON’s arrays is not preserved in other RDF syntaxes (unless there is something in the @context to say the value of hasCourse is an @list; and it wouldn’t be appropriate to declare that every value of a generic property like hasPart is an ordered list, because not every list of parts is ordered). So if we convert the JSON-LD into triples and store them, nothing says how to order the results returned by a query.
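For reference, this is roughly what such a context declaration would look like; whether it is appropriate depends entirely on the property:

```json
{ "@context": { "hasCourse": { "@container": "@list" } } }
```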

The simple solution would be to have a property to state the position of the course in an ordered list (schema.org/position is exactly this)—but don’t be too hasty: if these courses are taken in more than one program, is Course 2 always going to be second in the sequence? Probably not. In general, when resources are reused in different contexts they will probably be used in different orders: “this” may not always come before “that”. That’s why the ordering is best specified at one remove from the resources themselves. For example, one of the suggestions for ordering content in K12-OCX is to create a table of contents as an ordered list of items that point to the content, something like:

{
  "@context": {
    "@vocab": "http://schema.org/",
    "ocx": "http://example.org/ocx/",
    "@base": "http://example.org/resources/",
    "item": {"@type": "@id"}
  },
  "@type": "EducationalOccupationalProgram",
  "name": "My Program",
  "ocx:hasToC": {
    "@type": "ItemList",
    "name": "Table of Contents",
    "itemListOrder": "ItemListOrderAscending",
    "numberOfItems": 3,
    "itemListElement": [
      { "@type": "ListItem",
        "item": "Course1",
        "position": 1},
      { "@type": "ListItem",
        "item": "Course2",
        "position": 2 },
      { "@type": "ListItem",
        "item": "Course3",
        "position": 3 }
    ]
  },
 "hasCourse": [
   {"@type": "Course",
    "@id": "Course1",
    "name": "This"},
   {"@type": "Course",
    "@id": "Course2",
    "name": "That"},
   {"@type": "Course",
    "@id": "Course3",
    "name": "The other"}
 ]
}

or if you prefer to use built-in RDF constructs there is that @list option:

{ "@context": {
    "@vocab": "http://schema.org/", 
    "ocx": "http://example.org/ocx/", 
    "@base": "http://example.org/resources/", 
    "ocx:hasToC": {"@container": "@list", "@type": "@id"}
  },
  "@type": "EducationalOccupationalProgram",
  "@id": "Program",
  "name": "My Program",
  "ocx:hasToC": ["Course1", "Course2", "Course3"],
  "hasCourse": [
  { "@id": "Course1",
    "@type": "Course",
    "name": "this"
  },{
    "@id": "Course2",
    "@type": "Course",
    "name": "that"
  },{
    "@id": "Course3",
    "@type": "Course",
    "name": "the other"
  }]
}

When this is processed by something like the JSON-LD Playground you will see that the list of values for hasToC is replaced by a set of statements about blank nodes which together say that “this” comes before the others:

# (prefixed names used for readability)
:Program ocx:hasToC _:b0 .
_:b0 rdf:first :Course1 .
_:b0 rdf:rest _:b1 .
_:b1 rdf:first :Course2 .
_:b1 rdf:rest _:b2 .
_:b2 rdf:first :Course3 .
_:b2 rdf:rest rdf:nil .

Conclusion

If you’ve made it this far you deserve the short summary advice. The title for this post was meant literally. Representing a record in RDF will break the record down into separate statements, each about one thing, each saying one thing, with the assumption that those statements are each valid on their own. In modelling for JSON-LD you need to make sure that everything you say about an object is true even when that object is separated from the rest of the record.
