Tag Archives: Application Profiles

Fruitful RDF vocabularies are like cherries not bananas

The SEMIC style guide for semantic engineers is a very interesting document that is currently in review. It is part of a set of actions with the aim:

to promote semantic interoperability amongst the EU Member States, with the objective of fostering the use of standards by, for example, offering guidelines and expert advice on semantic interoperability for public administrations.

I am interested in the re-usability of RDF vocabularies: what is it that makes it easy or hard to take terms from an existing vocabulary and use them when creating a schema for describing something. That seems to me to be important for interoperability in a world where we know that there is going to be more than one standard for any domain and where we also know that no domain is entirely isolated from its neighbours which will have their own native standards. We need data at an atomic level that can persist through expressions conformant with different standards, and that is easiest if the different standards share terms where possible.

This idea of reusing terms from vocabularies is core to the idea of application profiles, and to the way that we conceive Dublin Core and LRMI terms being used, and indeed the way they are being used in the IEEE P2881 Learning Metadata work. One well known example of an application profile that reuses terms from Dublin Core is the Data Catalog Vocabulary (DCAT), which uses terms from Dublin Core, ODRL, FOAF, PROV, and a few terms created specifically for DCAT to describe entities and relationships according to its own conceptual domain model.

The SEMIC guidelines have lots of interesting things to say about reuse, including rules on what you can and cannot do when taking terms from existing vocabularies in various ways including reuse “as-is”, reuse with “terminological adaptations” and reuse with “semantic adaptations”. I read these with special interest as I have written similar guidelines for Credential Engine’s Borrowed Terms policy. I am pleased to say we came to the same conclusion (phew). In discussion with the SEMIC authors that conclusion was described as “don’t cherry-pick”, that is: when you borrow or reuse a term from a vocabulary you must comply with everything that the–let’s use the O word here–ontology in which it was defined says about it. That’s not just the textual definition of the term but all that is entailed by statements about domain, range, relationships with other properties and so on. If the original ontology defines, directly or indirectly, “familyName” as being the name of a real person, then don’t use it for fictional characters.

Oof.

“The problem with object-oriented languages is they’ve got all this implicit environment that they carry around with them. You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.”

Joe Armstrong, Coders at Work

I want to cherry-pick. I am looking for cherries, not bananas! I am looking for “lightweight” ontologies, so if you want your terms to be widely reused please create them as such (and this is also in the SEMIC guidelines). Define terms in such a way that they are free from too much baggage: you can always add that later, if you need it, but others can be left to add their own. One example of this is the way in which many Dublin Core Terms are defined without an rdfs:domain declaration. That is what allowed DCAT and others to use them with any class that fits their own domain model. Where ranges are declared for Dublin Core Terms they are deliberately broadly defined. The approach taken by schema.org in some ways goes further by not using rdfs:domain and rdfs:range but instead defining domainIncludes and rangeIncludes, where the value is “a class that constitutes (one of) the expected type(s)” (my emphasis): by being non-exclusive, domainIncludes and rangeIncludes provide a hint at what is expected without saying that you cannot use anything else.
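If you want to see this in the raw vocabulary data, here’s a minimal rdflib sketch (my own, not from the SEMIC guidelines; the Turtle dump URL is an assumption and may change) that checks how schema.org declares expected types for a property:

# Sketch only: inspect schema.org's declarations for one property.
# The vocabulary dump URL below is an assumption, not taken from this post.
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

SDO = Namespace("https://schema.org/")
g = Graph()
g.parse("https://schema.org/version/latest/schemaorg-current-https.ttl", format="turtle")

prop = SDO.name
print("rdfs:domain:", list(g.objects(prop, RDFS.domain)))            # expect: none
print("domainIncludes:", list(g.objects(prop, SDO.domainIncludes)))  # expect: non-exclusive hints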

If you’re interested in this line of thinking, then read the special section of Volume 41, Issue 4 of the Bulletin of the Association for Information Science and Technology on Linked Data and the Charm of Weak Semantics edited by Tom Baker and Stuart Sutton (2015).


What am I doing here? 3. Tabular Application Profiles

The third thing (of four) that I want to include in my review of what projects I am working on is the Dublin Core standard for Tabular Application Profiles. It’s another of my favourites, a volunteer effort under the DCMI Application Profiles Working Group.

Application profiles are a powerful concept, central to how RDF can be used to create new data models without creating entirely new standards. The DCTAP draft says this about them:

An application profile defines metadata usage for a specific application. Profiles are often created as texts that are intended for a human audience. These texts generally employ tables to list the elements of the profile and related rules for metadata creation and validation. Such a document is particularly useful in helping a community reach agreement on its needs and desired solutions. To be usable for a specific function these decisions then need to be translated to computer code, which may not be a straightforward task.

About the work

We have defined an (extensible) set of twelve elements that can be used as headers in a table defining an application profile, and a primer showing how they can be used; along the way we also wrote a framework for talking about metadata and application profiles. We have also worked on various implementations and are set to create a cookbook showing how DC TAP can be used in real world applications. The primer is the best starting point for understanding the output as a whole.

The framework for talking about metadata came about because we were struggling to be clear when we used terms like property or entity. Does “property” refer to something in the application profile, or in the base standard, or as used in some metadata instance, or does it refer to a property of some thing in the world? In short we decided that the things being described have characteristics and relationships to each other which are asserted in RDF metadata using statements that have predicates in them; those predicates reference properties that are part of a pre-defined vocabulary, and an application profile defines templates for how the property is used in statements to create descriptions. There is a similar string of suggestions for talking about entities, classes and shapes, as well as some comments on what we found too confusing and so avoided talking about. With a little care you can use terms that are both familiar in context and not ambiguous.

About my role

This really is a team effort, expertly led by Karen Coyle, and I just try to help. I will put my hand up as the literal minded pedant who needed a framework to make sure we all understood each other. Otherwise I have been treating this as a side project that gives me an excuse to do some python programming: I have documented my TAP2SHACL and related scripts on this blog, which focus on taking a DCMI TAP and expressing it as SHACL that can be used to validate data instances. I have been using this on some other projects that I am involved in, notably the work with PESC looking at how they might move to JSON-LD.


Graphical Application Profiles?

In this post I outline how a graphical representation of an application profile can be converted to SHACL that can be used for data validation.

My last few posts have been about work I have been doing with the Dublin Core Application Profiles Interest Group on Tabular Application Profiles (TAPs). In introducing TAPs I described them as “a human-friendly approach that also lends itself to machine processing”. The human readability comes from the tabular format, and the use of a defined CSV structure makes this machine processable. I’ve illustrated the machine processability through a python program, tap2shacl.py, that will convert a TAP into SHACL that can be used to validate instance data against the application profile, and I’ve shown that this works with a simple application profile and a real-world application profile based on DCAT. Once you get to these larger application profiles the tabular view is useful but a graphical representation is also great for providing an overview. For example here’s the graphic of the DCAT AP:

Source: join-up DCAT AP

Mind the GAP

I’ve long wondered whether it would be possible to convert the source for a graphical representation of an application profile (let’s call it a GAP) into one of the machine readable RDF formats. That boils down to processing the native format of the diagram file or any export from the graphics package used to create it. So I’ve routinely been looking for any chance of that whenever I come across a new diagramming tool. The breakthrough came when I noticed that lucid chart allows CSV export. After some exploration this is what I came up with.

As diagramming software, what Lucid chart does is quite familiar from Visio, yEd, diagrams.net and the like: it allows you to produce diagrams like the one below, of the (very) simple book application profile that we use in the DC Application Profiles Interest Group for testing:

[Figure: two boxes, one representing data about a book, the other data about a person, joined by an arrow representing the author relationship. Further detail about the book and author data is given in the boxes, as discussed in the text below.]

One distinctive feature of Lucid chart is that as well as just entering text directly into fields in the diagram, you can enter it into a data form associated with any object in the diagram, as shown below, first for the page and then for the shape representing the Author:

[Screenshot: the Lucid Chart software showing the page and the page data.]

[Screenshot: the Lucid Chart software showing the Author shape and the data associated with it.]

In the latter shot especially you can see the placeholder brackets [] in the AuthorShape object into which the values from the custom data form are put for display. Custom data can be associated with the document as a whole, any page in it and any shape (boxes, arrows etc) on the page;  you can create templates for shapes so that all shapes from a given template have the same custom data fields.

I chose a template to represent Node Shapes (in the SHACL/ShEx sense, which become actual shapes in the diagram) that had the following data:

  • name and expected RDF type in the top section;
  • information about the node shape, such as label, target, closure, severity in the middle section; and,
  • a list of the properties that have the range Literal, entered directly in the lower section (i.e. these don’t come from the custom data form).

Properties that have a range of BNode or URI are represented as arrows.

By using a structured string for Literal valued properties, and by adding information about the application profile and namespace prefixes and their URIs into the sheet custom data, I was able to enter most of the data needed for a simple application profile. The main shortcomings are that the format for Literal valued properties is limited, and that complex constraints such as alternatives (such as: use this Literal valued property or that URI property depending on …) cannot be dealt with.

The key to the magic is that on export as CSV, each page, shape and arrow gets a row, and there is a column for the default text areas and for the custom data (whether or not the latter is displayed). It’s an ugly, sparsely populated table (you can see a copy in github), but I can read it into a python Dict structure using python’s standard CSV module.
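For what it’s worth, the reading step needs nothing more than the standard library; here’s a small sketch (the file name and the “Name” column heading are my placeholders — the real headings come from whatever Lucid Chart puts in its export):

# Sketch only: file name and the "Name" column heading are placeholders;
# real column headings come from the Lucid Chart CSV export.
import csv

with open("lucidchart-export.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Each page, shape and arrow is now a dict keyed by column heading;
# custom data fields appear as extra (often empty) columns.
node_shapes = [r for r in rows if "Shape" in r.get("Name", "")]
print(f"{len(rows)} rows, of which {len(node_shapes)} look like node shapes")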

GAP2SHACL

When I created the TAP2SHACL program I aimed to do so in a very modular way: there is one module for the central application profile python classes, another to read csv files and convert them into those python classes, and another to convert the python classes into SHACL and output them; so tap2shacl.py is just a wrapper that provides a user interface to those classes. That approach paid off here because, having read the CSV file exported from Lucid Chart, all I had to do was create a module to convert it into the python AP classes and then I could use AP2SHACL to get the output. That conversion was fairly straightforward, mostly just tedious if ... else statements to parse the values from the data export. I did this in a Jupyter Notebook so that I could interact more easily with the data; that notebook is in github.

Here’s the SHACL generated from the graphic for the simple book ap, above:

# SHACL generated by python AP to shacl converter
@base <http://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <https://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<BookShape> a sh:NodeShape ;
    sh:class sdo:Book ;
    sh:closed true ;
    sh:description "Shape for describing books"@en ;
    sh:name "Book"@en ;
    sh:property <bookshapeAuthor>,
        <bookshapeISBN>,
        <bookshapeTitle> ;
    sh:targetClass sdo:Book .

<AuthorShape> a sh:NodeShape ;
    sh:class foaf:Person ;
    sh:closed false ;
    sh:description "Shape for describing authors"@en ;
    sh:name "Author"@en ;
    sh:property <authorshapeFamilyname>,
        <authorshapeGivenname> ;
    sh:targetObjectsOf dct:creator .

<authorshapeFamilyname> a sh:PropertyShape ;
    sh:datatype xsd:string ;
    sh:maxCount 1 ;
    sh:minCount 1 ;
    sh:name "Family name"@en ;
    sh:nodeKind sh:Literal ;
    sh:path foaf:familyName .

<authorshapeGivenname> a sh:PropertyShape ;
    sh:datatype xsd:string ;
    sh:maxCount 1 ;
    sh:minCount 1 ;
    sh:name "Given name"@en ;
    sh:nodeKind sh:Literal ;
    sh:path foaf:givenName .

<bookshapeAuthor> a sh:PropertyShape ;
    sh:minCount 1 ;
    sh:name "author"@en ;
    sh:node <AuthorShape> ;
    sh:nodeKind sh:IRI ;
    sh:path dct:creator .

<bookshapeISBN> a sh:PropertyShape ;
    sh:datatype xsd:string ;
    sh:name "ISBN"@en ;
    sh:nodeKind sh:Literal ;
    sh:path sdo:isbn .

<bookshapeTitle> a sh:PropertyShape ;
    sh:datatype rdf:langString ;
    sh:maxCount 1 ;
    sh:minCount 1 ;
    sh:name "Title"@en ;
    sh:nodeKind sh:Literal ;
    sh:path dct:title .

I haven’t tested this as thoroughly as the work on TAPs. The SHACL is valid, and as far as I can see it works as expected on the test instances I have for the simple book ap (though slight variations in the rules represented somehow crept in). I’m sure there will be ways of triggering exceptions in the code, or getting it to generate invalid SHACL, but for now, as a proof of concept, I think it’s pretty cool.

What next?

Well, I’m still using TAPs for some complex application profile / standards work. As it stands I don’t think I could express all the conditions that often arise in an application profile in an easily managed graphical form. Perhaps there is a way forward by generating a tap from a diagram and then adding further rules, but then I would worry about version management if one was altered and not the other. I’m also concerned about tying this work to one commercial diagramming tool, over which I have no real control. I’m pretty sure that there is something in the GAP+TAP approach, but it would need tighter integration between the graphical and tabular representations.

I also want to explore generating outputs other than SHACL from TAPs (and graphical representations). I see a need to generate JSON-LD context files for application profiles, we should try getting ShEx from TAPs, and I have already done a little experimenting with generating RDF Schema from Lucid Chart diagrams.
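To give an idea of how small the JSON-LD context step could be, here’s a rough sketch (entirely illustrative: the prefixes and property list are made up, and would really come from the TAP’s prefix table and statement constraints):

# Illustrative only: prefixes and properties are made-up stand-ins for what
# would come from a TAP's prefix table and statement constraints.
import json

prefixes = {
    "dct": "http://purl.org/dc/terms/",
    "sdo": "https://schema.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}
properties = ["dct:title", "dct:creator", "sdo:isbn", "foaf:givenName"]

context = dict(prefixes)  # namespace declarations
for curie in properties:
    _, local = curie.split(":")
    context[local] = {"@id": curie}  # map plain JSON keys to full RDF terms

print(json.dumps({"@context": context}, indent=2))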


DCAT AP DC TAP: a grown up example of TAP to SHACL

I’ve described a couple of short “toy” examples as proof of concept of turning a Dublin Core Application Profile (DC TAP) into SHACL in order to validate instance data: the SHACL Person Example and a Simple Book Example; now it is time to see how the approach fares against a real world example. I chose the EU joinup Data Catalog Application Profile (DCAT AP) because Karen Coyle had an interest in DCAT, it is well documented (pdf) with a github repo that has SHACL files, there is an Interoperability Test Bed validator for it (albeit a version late) and I found a few test instances with known errors (again a little dated). I also found the acronym soup of DCAT AP DC TAP irresistible.

You can see the extended TAP as Google sheets. The tap tab is the DCTAP; the other tabs are information about the application profile, namespace prefixes and shapes that are needed by tap2shacl. You can also see the csv export files, test instances and generated SHACL for DCATAPDCTAP in my TAPExamples github repo. The actual TAP is maybe a little ragged around the edges. Partly I got bored, partly I wasn’t sure how far to go: for example, should every reference to a Catalog be a description that conforms to all the DCAT AP rules for a catalog, or is it sufficient to just have an IRI in the instance data? At the other extreme, what to do about properties where the only requirement was that an entity of a certain type be referenced — should the SHACL demand the type be explicitly declared or is the intent that an IRI in the instance data is enough and the type may be inferred?

Reflections

I had to add a little functionality where I wanted a shape to be used against two targets: objects of a property and instances of a class. This actually prompted a fairly significant rewrite of how shape information is handled, and I have thought of further extensions.

It works in that valid SHACL is produced. The SHACL is more verbose than the hand-crafted SHACL produced by the DCAT AP project, but I think that is to be expected from a general purpose conversion script. It also works in that when I run the test instances through the itb SHACL validator with my SHACL and through the DCAT AP specific validator they both flag the same errors. My SHACL actually raises more error messages, but that is a result of it being more verbose: sometimes it gives two errors (one that the value of a predicate does not match the expected shape, the second about how the shape is not matched). The important thing is that the same fixes work on both.

I’m quite happy with this. There may be some requirements in the DCAT AP that I haven’t checked for, but this will do for now. Next I want to work out how best to require a specific skos concept scheme be used.

The only niggle that I have is that the TAP sheet in Google docs isn’t as easily human readable as the corresponding tables in the DCAT AP documentation. There’s a balance between keeping the necessary precision and technical correctness on the one hand and human friendliness on the other, and it is hard to strike.


TAP to SHACL example

Last week I posted Application Profile to Validation with TAP to SHACL about converting a DCMI Tabular Application Profile (DC TAP) to SHACL in order to validate instance data. I ended by saying that I needed more examples in order to test that it worked: that is, not only check that the SHACL is valid, but also that it validates / raises errors as expected when used with instance data.

SHACL Person Example

Here’s a simple example of a person and their employer, drawn from the example used in the SHACL spec. The TAP for this only has three lines:

Line 1: Person data may have zero or one ex:ssn property for Social Security Number, which must be a string matching the regular expression for nnn-nn-nnnn (n = any digit)

Line 2: Person data may have zero or more ex:worksFor properties with an IRI referencing a node that matches the CompanyShape.

Line 3: Company data must have one and only one occurrence of the rdf:type property with the value ex:Company (or equivalent IRI).

And the extended TAP Shapes csv has two:

Line 1: nodes of class ex:Person must match the PersonShape, which is closed, though any occurrence of rdf:type is ignored; any failure to match is a Violation of the application profile.

Line 2: object nodes of the ex:worksFor property must match the CompanyShape, which is open; any failure to match is a Violation of the application profile.

The SHACL produced is

# SHACL generated by python AP to shacl converter
@base <http://example.org/> .
@prefix ex: <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<PersonShape> a sh:NodeShape ;
    sh:closed true ;
    sh:description "Node shape for person class"@en ;
    sh:ignoredProperties ( rdf:type ) ;
    sh:name "Person Shape"@en ;
    sh:property <personshapeEmployer>,
        <personshapeSocialSecurityNumber> ;
    sh:targetClass <Person> .

<CompanyShape> a sh:NodeShape ;
    sh:class <Company> ;
    sh:closed false ;
    sh:description "Shape for employers"@en ;
    sh:name "Employer"@en ;
    sh:targetObjectsOf <worksFor> .

<personshapeEmployer> a sh:PropertyShape ;
    sh:name "Employer"@en-US ;
    sh:node <CompanyShape> ;
    sh:nodeKind sh:IRI ;
    sh:path <worksFor> .

<personshapeSocialSecurityNumber> a sh:PropertyShape ;
    sh:datatype xsd:string ;
    sh:maxCount 1 ;
    sh:name "Social Security Number"@en-US ;
    sh:nodeKind sh:Literal ;
    sh:path <ssn> ;
    sh:pattern "^\\d{3}-\\d{2}-\\d{4}$" .

Results

This isn’t the same SHACL as in the spec, but when used with instance data (sample data in github: three invalid examples taken from the spec, and one valid) in the Join-up itb SHACL validator it gives the results expected:

[Screenshot: validation results for RDF data showing two errors.]

I had to make one change to the program to get this working, adding an “ignore props” column to the shape information sheet, which is then processed into sh:ignoredProperties.

I guess it is worth noting that the SHACL generated from tap2shacl is always more verbose than hand written examples: some defaults are always explicit, and property shapes are always identified objects in their own block rather than bnodes in the NodeShape. That can make the output less human readable. Also, as I currently re-use property shapes that recur on different node shapes, there can be unnecessary repetition. There is a mechanism in TAP for setting “defaults” which might be useful when, say, all instances of dct:description must follow the same rules whatever class they are used on.

My next example will (hopefully) be based on a well known real-world application profile; a lot longer, and introducing some more complex rules. Watch this space.


Application Profile to Validation with TAP to SHACL

Over the past couple of years or so I have been part of the Dublin Core Application Profile Interest Group creating the DC Tabular Application Profile (DC-TAP) specification. I described DC-TAP in a post about a year ago as a “human-friendly approach that also lends itself to machine processing”; in this post I’ll explore a little about how it lends itself to machine processing.

A metadata application profile represents an implementor’s or a community’s view on how a metadata description should be formed: which terms should be used from which vocabularies? are they required or optional? may they be repeated? what range of values do we expect each of them to have? and so on [see Heery and Patel and the Singapore Framework for more details]. Having created an application profile it is fairly natural to want to ask whether specific instance data conforms with it — does it have the data required in the form required by the implementor or community’s application(s)? I have been using the EU-funded joinup Interoperability Test Bed’s SHACL Validator for this, and so I would like to generate SHACL for my validation.

The simple book application profile.

There are several current and potential application profiles that I am interested in, including the Credential Engine minimum data requirements for using CTDL to describe the various types of resource in the Credential Registry. I also have a long standing interest in application profiles of schema.org and LRMI for describing learning resources, which is currently finding an outlet in the IEEE P2881 Standard for Learning Metadata working group. But these and any other real application profile tend to be long and can be complex. A really short, simple application profile is more useful for illustrative purposes, and we in the DC-TAP group have been using various iterations around a profile describing books (even for such a simple application profile the table doesn’t display well with my blog theme, so please take a look at it on github). Here’s a summary of what it is intended to encode:

  • Line 1: Book instance data must have one and only one dct:title of type rdf:langString.
  • Line 2: Book instance data may have zero or more dct:creator described as an RDF instance with a URI or a BNODE, matching the #Author shape.
  • Line 3: Book instance data may have zero or one sdo:isbn with Literal value being an xsd:string composed of 13 digits only.
  • Line 4: Book instance data must have rdf:type of sdo:Book.
  • Line 5: Author instance data may have zero or more foaf:givenName with Literal value type xsd:string.
  • Line 6: Author instance data may have zero or more foaf:familyName with Literal value type xsd:string
  • Line 7: Author instance data must have rdf:type of foaf:Person

(Let’s leave aside any questions of whether those are sensible choices, OK?)

A SHACL view of the TAP

Looking at the TAP through the lens of SHACL, we can draw a parallel between what in TAP we call statement constraints, i.e. the rows in the TAP, which each identify a property and any rules that constrain its use, and what SHACL calls property shapes. Likewise what  in TAP we call shapes  (a group of statement constraints) aligns with what SHACL calls a Node Shape. The elements of a statement constraint map more or less directly to various SHACL properties that can be used with Node and Property Shapes. So:

Construct in TAP        Rough equivalent in SHACL
Statement Constraint    Property Shape
Shape                   Node Shape
propertyID              sh:path of a sh:PropertyShape
propertyLabel           sh:name on a sh:PropertyShape
mandatory = TRUE        sh:minCount = 1
repeatable = FALSE      sh:maxCount = 1
valueNodeType           sh:nodeKind
valueDataType           sh:datatype
valueShape              sh:node
valueConstraint         depends on valueConstraintType and valueNodeType

Processing valueConstraints can get more complex than other elements. If there is no valueConstraintType and a single entry, then it is used as the value for sh:hasValue, either as a Literal or IRI depending on valueNodeType. If the valueConstraintType is “pickList”, or if there is more than one entry in valueConstraint, then you need to create a list to use with the sh:or property. If the valueConstraintType is “pattern” then the mapping to sh:pattern is straightforward, and that should only apply to Literal values. Some other SHACL constraint components are not in the DCTAP core set of suggested entries for valueConstraintType, but it seems obvious to add “minLength”, “maxLength” and so on to correspond to sh:minLength, sh:maxLength etc.; lengthRange works too if you specify the range in a single entry (I use the format “n..m”). I do not expect DC TAP will cover all of what SHACL covers, so don’t expect to find sh:qualifiedValueShape or other complex constraint components.
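Roughly, the branching looks like this (a sketch of the decision only, with made-up function and return conventions — not the actual tap2shacl code):

# Sketch of the branching described above; illustrative, not the real
# tap2shacl implementation. Returns (SHACL property, value) pairs.
def constraint_to_shacl(value_constraint, constraint_type):
    entries = [e.strip() for e in value_constraint.split(",") if e.strip()]
    if not entries:
        return []
    if constraint_type == "pattern":                 # Literal values only
        return [("sh:pattern", entries[0])]
    if constraint_type == "minLength":
        return [("sh:minLength", int(entries[0]))]
    if constraint_type == "maxLength":
        return [("sh:maxLength", int(entries[0]))]
    if constraint_type == "lengthRange":             # single entry like "1..13"
        low, high = entries[0].split("..")
        return [("sh:minLength", int(low)), ("sh:maxLength", int(high))]
    if constraint_type == "pickList" or len(entries) > 1:
        return [("sh:in", entries)]                  # list of allowed values (sh:or is another way to encode alternatives)
    return [("sh:hasValue", entries[0])]             # Literal or IRI, per valueNodeType

print(constraint_to_shacl("^\\d{13}$", "pattern"))
print(constraint_to_shacl("red, green, blue", "pickList"))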

As alluded to above, TAP allows for multiple entries in a table cell to provide alternative ways of fulfilling a constraint. These lists of entries need to be processed in different ways depending on which element of the statement constraint they relate to. Often they need turning into a list to be used with sh:or or sh:in; lists of alternative valueNodeType need turning into the corresponding value for sh:nodeKind, for example IRI BNODE becomes sh:BlankNodeOrIRI.
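The valueNodeType lists collapse into a single sh:nodeKind value along these lines (again just a sketch, not the actual code):

# Sketch: collapse a list of valueNodeType entries into one sh:nodeKind value.
NODE_KINDS = {
    frozenset(["iri"]): "sh:IRI",
    frozenset(["bnode"]): "sh:BlankNode",
    frozenset(["literal"]): "sh:Literal",
    frozenset(["iri", "bnode"]): "sh:BlankNodeOrIRI",
    frozenset(["iri", "literal"]): "sh:IRIOrLiteral",
    frozenset(["bnode", "literal"]): "sh:BlankNodeOrLiteral",
}

def node_kind(value_node_types):
    """value_node_types is a list like ["IRI", "BNODE"] from one TAP cell."""
    key = frozenset(t.strip().lower() for t in value_node_types)
    return NODE_KINDS.get(key)  # None (no constraint) if all three, or unrecognised

print(node_kind(["IRI", "BNODE"]))  # sh:BlankNodeOrIRI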

Extended TAPs to provide other SHACL terms

Some things needed by SHACL (and other uses of application profiles) are not in DC TAP. We know that TAP only covers a core and we expect different solutions to providing other information to emerge depending on context. For some people providing some of the additional information as, say, a YAML file will work; for other people or other data, further tables may be preferable. So while we know that metadata about the application profile and a listing of the IRI stems for any prefixes used for compact URI encodings in the TAP need to be provided, we don’t specify how. I chose to use additional tables for all this data.

I’ve already touched on how additional constraints useful in SHACL, like minLength, are easily provided if we extend the range of allowed valueConstraintType values. Another useful SHACL property is sh:severity; for this I added an extra column to the TAP table.

However the biggest omission from the TAP of data useful in SHACL is data about sh:NodeShapes. From the “Shape” column we know which shapes the properties relate to, but we have no way of providing descriptions for these shapes, or, most crucially, specifying what entities in the instance data should conform to which shapes. I use a table to provide this data as well. Currently(*), it has columns for shapeID, label, comment, target, targetType (these last two can be used to set values of sh:targetObjectsOf, sh:targetClass, sh:targetNode, sh:targetSubjectsOf as appropriate), closed (true or false, to set sh:closed), mandatory, severity and note. (*This list of columns is somewhat in flux, and the program described below doesn’t process all of them.)
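To make that concrete, here’s a rough sketch of how one row of such a shapes table might be turned into SHACL targeting statements (the column names follow the list above, but the code is illustrative rather than the actual tap2shacl implementation):

# Illustrative sketch: turn one row of the shapes table into SHACL targeting
# statements. Column names follow the list above; not the real tap2shacl code.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SH = Namespace("http://www.w3.org/ns/shacl#")
TARGET_PREDICATES = {
    "class": SH.targetClass,
    "objectsof": SH.targetObjectsOf,
    "subjectsof": SH.targetSubjectsOf,
    "node": SH.targetNode,
}

def add_node_shape(g, row, base="http://example.org/"):
    shape = URIRef(base + row["shapeID"].lstrip("#"))
    g.add((shape, RDF.type, SH.NodeShape))
    if row.get("label"):
        g.add((shape, SH.name, Literal(row["label"], lang="en")))
    if row.get("closed", "").lower() == "true":
        g.add((shape, SH.closed, Literal(True)))
    target_predicate = TARGET_PREDICATES.get(row.get("targetType", "").lower())
    if target_predicate and row.get("target"):
        g.add((shape, target_predicate, URIRef(row["target"])))
    return shape

g = Graph()
add_node_shape(g, {"shapeID": "#BookShape", "label": "Book", "closed": "true",
                   "target": "https://schema.org/Book", "targetType": "class"})
print(g.serialize(format="turtle"))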

A Google Sheets template for extended TAPs

So, for my application profile I have four tables: the DC TAP, plus tables for metadata about the application profile; prefixes/namespaces used; and information about TAP Shapes / sh:NodeShapes. All these tables can be generated as spreadsheet tabs in a single workbook. The details are still a little fluid, but I have created a template for such in Google Sheets, which also includes some helpful additions like data validation to make sure that the values in cells make sense. You are welcome to make a copy of this template and try it out; it includes a sheet with further information and details of how to leave feedback.

Processing TAP to SHACL

I have also written a set of python classes to convert CSV files exported from the extended TAP into SHACL. These are packaged together as tap2shacl, available on Github — it is very much an alpha-release proof of concept, that doesn’t cover all the things discussed above let alone some of the things I have glossed over, but feel free to give it a try.

The architecture of the program is as follows:

  • I have data classes for the application profile as a whole and individual property statements. This includes methods to read in Metadata, shape information and prefix/namespace information from CSV files as exported from Google Sheets (or other editor with the same column headings)
  • I use Tom Baker’s dctap-python program to read the CSV of the TAP. This does lots of useful validity checking and normalization as well as handling a fair few config options, and generally handles the variation in CSVs better than the methods I wrote for other tables. TAP2AP handles the conversion from Tom’s format to my python AP classes.
  • The AP2SHACL module contains a class and methods to convert the python AP classes into a SHACL RDF Graph and serialize these for output (leaning heavily on rdf-lib).
  • Finally the TAP2SHACL package pulls these together and provides a command line interface.

If that seems like more modules and github repos than necessary, you may be right, but I wanted to be hyper-granular because I have use cases where the input isn’t a TAP CSV and the output isn’t SHACL. For example, the Credential Engine minimum data requirements are encoded in JSON, for other sources YAML is another possibility, and I have ideas about converting Application Profile diagrams; and I can see sense in outputting ShEx and JSON-Schema as well as SHACL. I also want to keep the number of imported third-party modules down to a minimum: why should someone wanting to create JSON-Schema have to import the rdf-lib classes needed for SHACL?

Does it work?

Well, it wouldn’t have been fair to let you read this far if the answer wasn’t broadly “yes” 🙂 Checking that it works is another matter.

There are plenty of unit tests in the code for me to be confident that it can read some files and output some RDF, albeit with the caveat that it is alpha-release software so it’s not difficult to create a file that it cannot read, often because the format is not quite right.

There are even some integration tests so that I know that the RDF output from some TAPs matches the valid SHACL that I expect, at least for simple test cases. Again, it is not difficult to generate invalid SHACL or not produce the terms you would expect if there happens to be something in the TAP that I haven’t yet implemented. TAP is quite open, and the software is still developing, so I’ll not attempt to list the potential mismatches here, but I’ll be working on documenting them in github issues.

But then there’s the question of whether the SHACL that I expect correctly encodes the rules in the application profile. That takes testing itself, so for each application profile I work on I need test cases of instance data that either matches or doesn’t match the expectations in mind when creating the application profile. I have a suite of 16 test cases for the simple book profile. These can be used in the SHACL Validator with the SHACL file generated from the TAP; and yes, mostly they work.
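The ITB validator is a web service; for what it’s worth the same checks can also be run locally with the pySHACL library, something like this (an alternative to what I describe above, with placeholder file names):

# Local validation sketch using pySHACL, as an alternative to the online ITB
# validator; the file names are placeholders.
from pyshacl import validate
from rdflib import Graph

shapes = Graph().parse("book-ap.ttl", format="turtle")     # SHACL generated from the TAP
data = Graph().parse("test-case-01.ttl", format="turtle")  # one instance data test case

conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
print("conforms:", conforms)
if not conforms:
    print(report_text)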

I have got to admit that I find the implications of needing 16 tests for such a simple profile somewhat daunting when thinking about real-world profiles that are an order of magnitude or more larger, but I hope that confidence and experience built with simple profiles will reduce the need for so many test cases. So my next steps will be to slowly build up the range of constraints and complexity of examples. Watch this space for more details, or contact me if you have suggestions.


SHACL, when two wrongs make a right

I have been working with SHACL for a few months in connexion with validating RDF instance data against the requirements of application profiles. There’s a great validation tool created as part of the JoinUp Interoperability Test Bed that lets you upload your SHACL rules and a data instance and tests the latter against the former. But be aware: some errors can lead to the instance data successfully passing the tests; this isn’t an error with the tool, just a case of blind logic: the program doing what you tell it to regardless of whether that’s what you want it to do.

The rules

Here’s a really minimal set of SHACL rules for describing a book:

@base <http://example.org/shapes#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <https://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

<Book_Shape> a sh:NodeShape ;
    sh:property
        <bookTitle_Shape> ;
    sh:targetClass sdo:Book .

<bookTitle_Shape> a sh:PropertyShape ;
    sh:path dct:title ;
    sh:datatype rdf:langString ;
    sh:nodeKind sh:Literal ;
    sh:minCount 1 ;
    sh:severity sh:Violation .

Essentially it says that the description of anything typed as a schema.org:Book should have a title provided as a langString using the Dublin Core title property. Here’s some instance data

@prefix : <http://example.org/books#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <https://schema.org/> .

:42 a sdo:Book;
  dct:title "Life the Universe and Everything."@en .

Throw those into the SHACL validator and you get:

Result: SUCCESS
Errors: 0
Warnings: 0
Messages: 0

Which (I think) is what you should get.

So wrong it’s right

But what about this instance data:

@prefix : <http://example.org/books#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <http://schema.org/> .

:42 a sdo:Book;
  dct:title "Life the Universe and Everything." .

For this the validator also returns

Result: SUCCESS
Errors: 0
Warnings: 0
Messages: 0

Did you spot the difference? Well, for one thing the title has no language attribute: it’s not a langString.

How about:

@prefix : <http://example.org/books#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <http://schema.org/> .

:42 a sdo:Book .

which has no title at all, but still you’ll get the “success, no errors” result. That cannot be right, can it?

Well, yes, it is right. You see there is another error in those last two examples. In the SHACL and the first example, the URI for Schema.org is given as

@prefix sdo: <https://schema.org/> .

In the second and third examples it is

@prefix sdo: <http://schema.org/> .

That’s the sort of subtle and common mistake I would like to pick up, but in fact it stops the validator from picking up any mistakes. That’s because the SHACL rules apply to anything typed as a https://schema.org/Book and in the second and third examples (where the prefix is http, not https) there isn’t anything typed as such. No rules apply, no rules are broken: zero errors — success!

What to do?

I’m not really sure. I see this type of problem quite a lot (especially if you generalize “this type of problem” to mean the test doesn’t have quite the logic I thought it did). I suppose lesson one was always to test shapes with invalid data to make sure they work as expected. That’s a lot of tests.

Arising from that is to write SHACL rules to check for everything: the errors above would be picked up if I had checked to make sure there is always one entity with the expected type for Books; there’s a recipe for this on the SHACL wiki.

Generalizing on the idea that simple typos can mean data not being tested because the term identifier doesn’t match any term in the schema you’re using, it’s worth checking that all term identifiers in the instance data and the SHACL are actually in the schema. This will pick up when sdo:Organisation is used instead of sdo:Organization. SHACL won’t do this, but it’s easy enough to write a python script that does.
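Something along these lines would do it for the instance data (a sketch with placeholder file names; for the SHACL graph the equivalent check would look at the objects of sh:path, sh:targetClass, sh:class and so on):

# Sketch: flag any predicate or class used in the instance data that the
# loaded vocabularies do not define. File names are placeholders.
from rdflib import Graph
from rdflib.namespace import RDF

vocab = Graph().parse("schemaorg.ttl", format="turtle")  # plus any other vocabularies used
known = set(vocab.subjects())

data = Graph().parse("book-data.ttl", format="turtle")
used = set(data.predicates()) | set(data.objects(None, RDF.type))
for term in used:
    if term != RDF.type and term not in known:
        print(f"not defined in the loaded vocabularies: {term}")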


Harmonizing Learning Resource Metadata (even LOM)

The best interoperability is interoperability between standards. I mean it’s one thing for you and me to agree to use the same standard in the same way to do the same thing, but what if we are doing slightly different things? What if Dublin Core is right for you, but schema.org is right for me–does that mean we can’t exchange data? That would be a shame, as one standard to rule them all isn’t a viable solution. This has been exercising me through a couple of projects that I have worked on recently, and what I’ll show here is a demo based on the ideas from one of these (The T3 Data Ecosystem Mapping Project) applied to another where learning resource metadata is available in many formats and desired in others. In this post I focus on metadata available as IEEE Learning Object Metadata (LOM) but wanted in either schema.org or DCAT.

The Problem

Interoperability in spite of diverse standards being used seems an acute problem when dealing with metadata about learning resources. It makes sense to use existing (diverse) schema for describing books, videos, audio, software etc., supplemented with just enough metadata about learning to describe those things when they are learning resources (textbooks, instructional videos etc.). This is the approach taken by LRMI. Add to this the neighbouring domains with which learning resource metadata needs to connect, e.g. research outputs, course management, learner records, curriculum and competency frameworks, job adverts…, all of which have their own standards ecosystems, and perhaps you see why interoperability across standards is desirable.

(Aside: it often also makes sense to use a large all-encompassing standard like schema.org, often as well as more specialized standards, which is why LRMI terms are in schema.org.)

This problem of interoperability in an ecosystem of many standards was addressed by Mikael Nilsson in his PhD thesis “From Interoperability to Harmonization in Metadata Standardization”, where he argued that syntax wasn’t too important; what mattered more was the abstract model. Specifically he argued that interoperability (or harmonization) was possible between specs that used the RDF entity-based metamodel but less easy between specs that used a record-like metamodel. IEEE Learning Object Metadata is largely record-like: a whole range of different things are described in one record, and the meanings of many elements depend on the context of the element in the record, and sometimes the values of other elements in the same record. Fortunately it is possible to identify LOM elements that are independent characteristics of an identified entity, which means it is possible to represent some LOM metadata in RDF. Then it is possible to map that RDF representation to terms from other vocabularies.

Step 1: RML to Map LOM XML to a local RDF vocabulary

RML is the RDF Mapping Language, “a language for expressing customized mappings from heterogeneous data structures and serializations to the RDF data model … and to RDF datasets”. It does so through a set of RDF statements in turtle syntax that describe the mapping from (in my case, here) XML fragments specified as XPath strings to subjects, predicates and objects. There is a parser called RMLMapper that will then execute the mapping to transform the data.

My LOM data came from the Lifewatch training catalogue which has a nice set of APIs allowing access to sets of the metadata. Unfortunately the LOM XML provided deviates from the LOM XML Schema in many ways, such  as element names with underscore separations rather than camel case (so element_name, not elementName) and some nesting errors, so the RML I produced won’t work on other LOM XML instances.

Here’s a fragment of the RML, to give a flavour. The whole file is on github, along with other files mentioned here:

 
<#Mapping> a rr:TriplesMap ;
  rml:logicalSource [
    rml:source "lifewatch-lom.xml" ;
    rml:referenceFormulation ql:XPath ;
    rml:iterator "/lom/*"
  ];
  rr:subjectMap [
    rr:template "http://pjjk.local/resources/{external_id}" ;
    rr:class lom:LearningObject
  ] ;
  rr:predicateObjectMap [
    rr:predicate lom:title;
    rr:objectMap [
      rml:reference "general/title/langstring";
      rr:termType rr:Literal;
    ]
  ] ;
 #etc...
 .

I have skipped the prefix declarations and jumped straight to the part of the mapping that specifies the source file for the data, and the XPath of the element to iterate over in creating new entities. The subjectMap generates an entity identifier using a non-standard element in the LOM record appended to a local URI, and assigns this a class. After that a series of predicateObjectMaps specify predicates and where in the XML to find the values to use as objects. Running this through the mapper generates RDF descriptions, such as:

<http://pjjk.local/resources/DSEdW5uVa2> a lom:LearningObject;
  lom:title "Research Game";
#etc...

Again I have omitted the namespaces; the full file, all statements for all resources, is on github.

Step 2: Describe the mappings in RDF

You’ll notice the lom: namespace in the mapping and generated instance data. That’s not a standard RDF representation of the IEEE LOM; it’s a local schema that defines the relationships of some of the terms mapped from IEEE LOM to more standard schema. The full file is on github, but again, here’s a snippet:

lom:LearningObject a rdfs:Class ;
  owl:equivalentClass sdo:LearningResource , lrmi:LearningResource ;
  rdfs:subClassOf dcat:Resource .

lom:title a rdfs:Property ;
  rdfs:subPropertyOf sdo:name ;
  owl:equivalentProperty dcterms:title .

This is where the magic happens. This is the information that later allows us to use the metadata extracted from LOM records as if it is schema.org or LRMI, Dublin Core or DCAT. Because this schema is used locally only I haven’t bothered to put in much information about the terms other than their mapping to other more recognized terms. The idea here isn’t to be able to work with LOM in RDF; the idea is to take the data from LOM records and work with it as if it were from well defined RDF metadata schema. I also haven’t worried too much about follow-on consequences that may derive from the mappings that I have made, i.e. implied statements about relationships between terms in other schema, such as the implication that if lom:title is equivalent to dcterms:title, and also a subProperty of schema.org/name, then I am saying that dcterms:title is a subProperty of schema.org/name. This mapping is for local use: I’ll assert what is locally useful; if you disagree that’s fine, because you won’t be affected by my local assertions.

Just to complete the schema picture, I also have RDF schema definitions files for Dublin Core Terms, LRMI, DCAT and schema.org.

(Aside: I also created some SKOS Concept Schemes for controlled vocabularies used in the LOM records, but they’re not properly working yet.)

Step 3: Build a Knowledge Graph

(Actually I just put all the schema definitions and the RDF representation of the LOM metadata into a triple store, but calling it a knowledge graph gets people’s attention.) I use a local install of Ontotext GraphDB (free version). It’s important when initializing the repository to choose a ruleset that allows lots of inferencing: I use the OWL-MAX option. Also, it’s important when querying the data to select the option to include inferred results.

[Screenshot: the GraphDB SPARQL interface showing the option to include inferred data.]

Step 4: Results!

The data can now be queried with SPARQL. For example, a simple query to check what’s there:

PREFIX lom: <http://ns.pjjk.local/lom/terms/>

SELECT ?r ?t { 
    ?r a lom:LearningObject ;
    lom:title ?t .  
}

This produces a list of URIs & titles for the resources:

r,t
http://pjjk.local/resources/DSEdW5uVa2,Research Game
http://pjjk.local/resources/FdW84TkcrZ,Alien and Invasive Species showcase
http://pjjk.local/resources/RcwrBMYavY,EcoLogicaCup
http://pjjk.local/resources/SOFHCa8sIf,ENVRI gaming
http://pjjk.local/resources/Ytb7016Ijs,INTERNATIONAL SUMMER SCHOOL Data FAIRness in Environmental & Earth Science Infrastructures: theory and practice
http://pjjk.local/resources/_OhX8O6YwP,MEDCIS game
http://pjjk.local/resources/kHhx9jiEZn,PHYTO VRE guidelines
http://pjjk.local/resources/wABVJnQQy4,Save the eel
http://pjjk.local/resources/xBFS53Iesg,ECOPOTENTIAL 4SCHOOLS 

Nothing here other than what I put in that was converted from the LOM XML records.

More interestingly, this produces the same:

PREFIX sdo: <http://schema.org/> 

Select ?r ?n { 
    ?r a sdo:LearningResource ;
    sdo:name ?n ;
}

This is more interesting because it’s showing a query using schema.org terms yielding results from metadata that came from LOM records.
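If you don’t have a GraphDB install to hand, the same effect can be reproduced locally; here’s a hedged sketch using rdflib with the owlrl package (my alternative to the triple store, not what I actually used above, and with placeholder file names):

# Alternative to the GraphDB setup above: load the local lom: mapping schema
# and the converted instance data, materialize OWL-RL inferences, then query
# with schema.org terms. File names are placeholders.
import owlrl
from rdflib import Graph

g = Graph()
g.parse("lom-mapping-schema.ttl", format="turtle")   # lom:title owl:equivalentProperty dcterms:title, etc.
g.parse("lifewatch-instances.ttl", format="turtle")  # output of the RML mapping

owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)  # add the inferred triples

# Use whichever schema.org namespace form (http or https) the mapping schema maps to.
q = """
PREFIX sdo: <http://schema.org/>
SELECT ?r ?n WHERE { ?r a sdo:LearningResource ; sdo:name ?n . }
"""
for row in g.query(q):
    print(row.r, row.n)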

If you prefer your metadata in DCAT, with a little added LRMI to describe the educational characteristics, this:

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX lrmi: <http://purl.org/dcx/lrmi-terms/>

CONSTRUCT {
  ?r a dcat:Resource ;
  dcterms:title ?t ;
  dcat:keywords ?k ;
  lrmi:educationalLevel ?l .
} WHERE {
  ?r a dcat:Resource ;
  dcterms:title ?t ;
  dcat:keywords ?k ;
  lrmi:educationalLevel ?l .
}

will return a graph such as this:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix lrmi: <http://purl.org/dcx/lrmi-terms/> .

<http://pjjk.local/resources/DSEdW5uVa2> a dcat:Resource ;
	dcterms:title "Research Game" ;
	dcat:keywords "competition" ;
	lrmi:educationalLevel <lomCon:/difficulty/Medium> ;
	dcat:keywords "european" , "gaming" , "research game" , "schools" .

<http://pjjk.local/resources/FdW84TkcrZ> a dcat:Resource ;
	dcterms:title "Alien and Invasive Species showcase" ;
	dcat:keywords "EUNIS habitat" ;
	lrmi:educationalLevel <lomCon:/difficulty/Medium> ;
	dcat:keywords "alien species" , "invasive species" .

<http://pjjk.local/resources/RcwrBMYavY> a dcat:Resource ;
	dcterms:title "EcoLogicaCup" ;
	dcat:keywords "game" ;
	lrmi:educationalLevel <lomCon:/difficulty/Medium> ;
	dcat:keywords "Italian" , "competition " , "ecology" , "school" .

That’s what I call interoperability in spite of multiple standards. Harmonization of metadata so that, even though the data started off as LOM XML records, we can create a database that can be queried and exported as if the metadata were schema.org, Dublin Core, LRMI, DCAT…

Acknowledgements:

This brings together work from two projects that I have been involved in. The harmonization of metadata is from the DESM project, funded by the US Chamber of Commerce Foundation, and the ideas coming from Stuart Sutton. The application to LOM, schema.org, DCAT+LRMI came about from a small piece of work I did for DCC at the University of Edinburgh as input to the FAIRsFAIR project.


JSON Schema for JSON-LD

I’ve been working recently on defining RDF application profiles, defining specifications in JSON Schema, and converting specifications from a JSON Schema to an RDF representation. This has led to me thinking about, and having conversations with people about, whether JSON Schema can be used to define and validate JSON-LD. I think the answer is a qualified “yes”. Here’s a proof of concept; do me a favour and let me know if you think it is wrong.

Terminology might get confusing: I’m discussing JSON, RDF as JSON-LD, JSON Schema, RDF Schema and schema.org, which are all different things (go and look them up if you’re not sure of the differences).

Why JSON LD + JSON Schema + schema.org?

To my mind one of the factors in the big increase in visibility of linked data over the last few years has been the acceptability of JSON-LD to programmers familiar with JSON. Along with schema.org, this means that many people are now producing RDF based linked data, often without knowing or caring that that is what they are doing. One of the things that seems to make their life easier is JSON Schema (once they figure it out). Take a look at the replies to this question from @apiEvangelist for some hints at why and how.

Also, one specification organization I am working with publishes its specs as JSON Schema. We’re working with them on curating a specification that was created as RDF and is defined in RDF Schema, and often serialized in JSON-LD. Hence the thinking about what happens when you convert a specification from RDF Schema to JSON Schema —  can you still have instances that are linked data? can you mandate instances that are linked data? if so, what’s the cost in terms of flexibility against the original schema and against what RDF allows you to do?

Another piece of work that I’m involved in is the DCMI Application Profile Interest Group, which is looking at a simple way of defining application profiles — i.e. selecting which terms from RDF vocabularies are to be used, and defining any additional constraints, to meet the requirements of some application. There already exist some not-so-simple ways of doing this, geared to validating instance data, and native to the W3C Semantic Web family of specifications: ShEx and SHACL. Through this work I also got wondering about JSON Schema. Sure, wanting to use JSON Schema to define an RDF application profile may seem odd to anyone well versed in RDF and W3C Semantic Web recommendations, but I think it might be useful to developers who are familiar with JSON but not Linked Data.

Can JSON Schema define valid JSON-LD?

I’ve heard some organizations have struggled with this, but it seems to me (until someone points out what I’ve missed) that the answer is a qualified “yes”. Qualifications first:

  • JSON Schema doesn’t define the semantics of RDF terms. RDF Schema defines RDF terms, and the JSON-LD context can map keys in JSON instances to these RDF terms, and hence to their definitions.
  • Given definitions of RDF terms, it is possible to create a JSON Schema such that any JSON instance that validates against it is a valid JSON-LD instance conforming to the RDF specification.
  • Not all valid JSON-LD representations of the RDF will validate against the JSON Schema. In other words the JSON Schema will describe one possible serialization of the RDF in JSON-LD, not all possible serializations. In particular, links between entities in an @graph array are difficult to validate.
  • If you don’t have an RDF model for your data to start with, it’s going to be more difficult to get to RDF.
  • If the spec you want to model is very flexible, you’ll have difficulty making sure instances don’t flex it beyond breaking point.

But, given the limited ambition of the exercise, that is “can I create a JSON Schema so that any data it passes as valid is valid RDF in JSON-LD?”, those qualifications don’t put me off.

Proof of concept examples

My first hint that this seems possible came when I was looking for a tool to use when working with JSON Schema and found this online JSON Schema Validator.  If you look at the “select schema” drop down and scroll a long way, you’ll find a group of JSON Schema for schema.org. After trying a few examples of my own, I have a JSON Schema that will (I think) only validate JSON instances that are valid JSON-LD based on notional requirements for describing a book (switch branches in github for other examples).
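If you want to run the same kind of check locally rather than through the online validator, the Python jsonschema package will do it; here’s a sketch (file names are placeholders, and resolving the $ref sub-schemas across files is glossed over — that needs a resolver or registry pointing at the schema directory):

# Sketch: validate a JSON-LD instance against a JSON Schema locally.
# File names are placeholders; multi-file $refs need a resolver, glossed over here.
import json
import jsonschema

with open("book_schema.json") as f:
    schema = json.load(f)
with open("book_instance.json") as f:   # a JSON-LD description of a book
    instance = json.load(f)

try:
    jsonschema.validate(instance=instance, schema=schema)
    print("valid: the instance conforms, so it should also be good JSON-LD")
except jsonschema.ValidationError as err:
    print("invalid:", err.message)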

Here are the rules I made up and how they are instantiated in JSON Schema.

First, the “@context” sets the default vocabulary to schema.org and allows nothing else:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "context.json",
  "name": "JSON Schema for @context using schema.org as base",
  "description": "schema.org is @base namespace, but others are allowed",
  "type": "object",
  "additionalProperties": false,
  "required": [ "@vocab" ],
  "properties": {
    "@vocab": {
      "type": "string",
      "format": "regex",
      "pattern": "http://schema.org/",
      "description": "required: schema.org is base ns"
    }
  }
}

This is super-strict: it allows no variations on "@context": {"@vocab": "http://schema.org/"}, which obviously precludes doing a lot of things that RDF is good at, notably using more than one namespace. It’s not difficult to create looser rules, for example mandate schema.org as the default vocabulary but allow some or any others. Eventually you create enough slack to allow invalid linked data (e.g. using namespaces that don’t exist; using terms from the wrong namespace), and I promised you only valid linked data would be allowed. In real life, there would be a balance between permissiveness and reliability.

Rule 2: the book ids must come from wikidata:

{
 "$schema": "http://json-schema.org/draft-07/schema#",
 "$id": "wd_uri_schema.json",
 "name": "Wikidata URIs",
 "description": "regexp for Wikidata URIs, useful for @id of entities",
 "type": "string",
 "format": "regex",
 "pattern": "^https://www.wikidata.org/entity/Q[0-9]+" 
}

Again, this could be less strict, e.g. to allow ids to be any http or https URI.

Rule 3: the resource described is a schema.org/Book, for which the following fragment serves:

    "@type": {
      "name": "The resource type",
      "description": "required and must be Book",
      "type": "string",
      "format": "regex",
      "pattern": "^Book$"
    }

You could allow other options, and you could allow multiple types, maybe with one type mandatory (I have an example schema for Learning Resources which requires an array of types that must include LearningResource).
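
Something along those lines (a sketch, not taken from that Learning Resources schema) would be a fragment requiring @type to be an array of strings that includes Book, while allowing other types alongside it:

    "@type": {
      "name": "The resource types",
      "description": "an array of types that must include Book",
      "type": "array",
      "minItems": 1,
      "uniqueItems": true,
      "contains": { "const": "Book" },
      "items": { "type": "string" }
    }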

Rules 4 & 5: the book’s name and description are strings:

    "name": {
      "name": "title of the book",
      "type": "string"
    },
    "description": {
      "name": "description of the book",
      "type": "string"
    },

Rule 6, the URL for the book (i.e. a link to a webpage for the book) must be an http[s] URI:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http_uri_schema.json",
  "name": "URI @ids",
  "description": "required: @id or url is a http or https URI",
  "type": "string",
  "format": "regex",
  "pattern": "^http[s]?://.+"
}

Rule 7, for the author we describe a schema.org/Person, with a wikidata id, a familyName and a givenName (which are strings), and optionally with a name and description, and with no other properties allowed:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "person_schema.json",
  "name": "Person Schema",
  "description": "required and allowed properties for a Person",
  "type": "object",
  "additionalProperties": false,
  "required": ["@id", "@type", "familyName", "givenName"],
  "properties": {
    "@id": {
      "description": "required: @id is a wikidata entity URI",
      "$ref": "wd_uri_schema.json"
    },
    "@type": {
      "description": "required: @type is Person",
      "type": "string",
      "format": "regex",
      "pattern": "Person"
    },
    "familyName": {
      "type": "string"
    },
    "givenName": {
      "type": "string"
    },
    "name": {
      "type": "string"
    },
    "description": {
      "type": "string"
    }
  }
}

The restriction on other properties is, again, simply to make sure no one puts in any properties that don’t exist or aren’t appropriate for a Person.

The subject of the book (the about property) must be provided as wikidata URIs, with optional @type, name, description and url; there may be more than one subject for the book, so this is an array:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "about_thing_schema.json",
  "name": "About Thing Schema",
  "description": "Required and allowed properties for a Thing being used to say what something is about.",
  "type": "array",
  "minItems": 1,
  "items": {
    "type": "object",
    "additionalProperties": false,
    "required": ["@id"],
    "properties": {
      "@id": {
        "description": "required: @id is a wikidata entity URI",
        "$ref": "wd_uri_schema.json"
      },
      "@type": {
        "description": "required: @type is from top two tiers in schema.org type hierarchy",
        "type": "array",
        "minItems": 1,
        "items": {
          "type": "string",
          "uniqueItems": true,
          "enum": [
            "Thing",
            "Person",
            "Event",
            "Intangible",
            "CreativeWork",
            "Organization",
            "Product",
            "Place"
          ]
        }
      },
      "name": {
        "type": "string"
      },
      "description": {
        "type": "string"
      },
      "url": {
        "$ref": "http_uri_schema.json"
      }
    }
  }
}

Finally, all the rules are brought together in the book schema, which makes the @context, @id, @type, name and author properties mandatory; about, description and url are optional; no others are allowed:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "book_schema.json",
  "name": "JSON Schema for schema.org Book",
  "description": "Attempt at a JSON Schema to create valid JSON-LD descriptions. Limited to using a few schema.org properties.",
  "type": "object",
  "required": [
    "@context",
    "@id",
    "@type",
    "name",
    "author"
  ],
  "additionalProperties": false,
  "properties": {
    "@context": {
      "name": "JSON Schema for @context using schema.org as base",
      "$ref": "./context.json"
    },
    "@id": {
      "name": "wikidata URIs",
      "description": "required: @id is from wikidata",
      "$ref": "./wd_uri_schema.json"
    },
    "@type": {
      "name": "The resource type",
      "description": "required and must be Book",
      "type": "string",
      "format": "regex",
      "pattern": "^Book$"
    },
    "name": {
      "name": "title of the book",
      "type": "string"
    },
    "description": {
      "name": "description of the book",
      "type": "string"
    },
    "url": {
      "name":"The URL for information about the book",
      "$ref": "./http_uri_schema.json"
    },
    "about": {
      "name":"The subject or topic of the book",
      "oneOf": [
        {"$ref": "./about_thing_schema.json"},
        {"$ref": "./wd_uri_schema.json"}
      ]
    },
    "author": {
      "name":"The author of the book",
      "$ref": "./person_schema.json"
    }
  }
}

I’ve allowed the subject (about) to be given as an array of wikidata entity links/descriptions (as described above) or a single link to a wikidata entity, which hints at how similar flexibility could be built in for other properties.
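
For example (a sketch, not part of the book schema on github), the author property could likewise accept either a full Person description or a bare wikidata URI:

    "author": {
      "name": "The author of the book",
      "oneOf": [
        {"$ref": "./person_schema.json"},
        {"$ref": "./wd_uri_schema.json"}
      ]
    }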

Testing the schema

I wrote a Python script (running in a Jupyter Notebook) to test that this works:

from jsonschema import validate, ValidationError, SchemaError, RefResolver
import json
from os.path import abspath

# file names for the JSON Schema and the valid / invalid test instances
schema_fn = "book_schema.json"
valid_json_fn = "book_valid.json"
invalid_json_fn = "book_invalid.json"

# the $refs in book_schema.json are relative, so resolve them against
# the directory the notebook is running in
base_uri = 'file://' + abspath('') + '/'
with open(schema_fn, 'r') as schema_f:
    schema = json.loads(schema_f.read())
with open(valid_json_fn, 'r') as valid_json_f:
    valid_json = json.loads(valid_json_f.read())
resolver = RefResolver(referrer=schema, base_uri=base_uri)

# validate the instance; silence means it passed
try:
    validate(valid_json, schema, resolver=resolver)
except SchemaError as e:
    print("there was a schema error")
    print(e.message)
except ValidationError as e:
    print("there was a validation error")
    print(e.message)

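The invalid instance can be checked in the same way (a sketch continuing the script above, assuming book_invalid.json deliberately breaks at least one of the rules, so a ValidationError is expected):

with open(invalid_json_fn, 'r') as invalid_json_f:
    invalid_json = json.loads(invalid_json_f.read())

# this should fail: the instance breaks at least one rule in the schema
try:
    validate(invalid_json, schema, resolver=resolver)
    print("unexpectedly valid")
except ValidationError as e:
    print("validation failed, as expected:")
    print(e.message)
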
Or more conveniently for the web (and sometimes with better messages about what failed), there’s the JSON Schema Validator I mentioned above. Put this in the schema box on the left to pull in the JSON Schema for Books from my github:

{
  "$ref": "https://raw.githubusercontent.com/philbarker/lr_schema/book/book_schema.json"
}

And here’s a valid instance:

{
  "@context": {
    "@vocab": "http://schema.org/"
  },
  "@id": "https://www.wikidata.org/entity/Q3107329",
  "@type": "Book",
  "name": "Hitchhikers Guide to the Galaxy",
  "url": "http://example.org/hhgttg",
  "author": {
    "@type": "Person",
    "@id": "https://www.wikidata.org/entity/Q42",
    "familyName": "Adams",
    "givenName": "Douglas"
  },
  "description": "...",
  "about": [
    {"@id": "https://www.wikidata.org/entity/Q3"},
    {"@id": "https://www.wikidata.org/entity/Q1"},
    {"@id": "https://www.wikidata.org/entity/Q2165236"}
  ]
}
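
As a final check (a sketch, assuming rdflib version 6 or later, which parses JSON-LD natively), you can confirm that an instance which passes the JSON Schema really does yield RDF triples:

from rdflib import Graph

# parse the validated JSON instance as JSON-LD and list the triples
g = Graph()
g.parse("book_valid.json", format="json-ld")
print(f"{len(g)} triples")
for s, p, o in g:
    print(s, p, o)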

Have a play, see what you can break; let me know if you can get anything that isn’t valid JSON-LD to validate.
