The last of my January reviews of the main projects that I am working on is about Learning Resource Metadata, which is the theme that I have been working on for the longest. One of my first jobs in learning technology was maintaining and publishing a catalogue of computer-based resources that were useful for teaching Physics at university. That was back in 1994, and some of my work was still around when the Internet Archive started indexing. Since then I have been through several more online learning resource guides, and got involved in standards for learning resource metadata through the CETIS Metadata Special Interest Group, IEEE LOM / IMS Learning Resource Metadata, DCMI Education Community, and LRMI. LRMI is still going and that is what this blog post is about.
One of the most exciting recent developments has been that through the IEEE P2881 Learning Metadata Working Group LRMI—and more importantly the approach it embodies, as described below—is being incorporated in formal standards.
About the work
LRMI exists as an RDF vocabulary of terms (properties, classes and concept schemes) that can be used with terms from other vocabularies to describe learning resources. The LRMI terms are declared as equivalent properties and classes in a Dublin Core namespace and in Schema.org (mostly under LearningResource). There are classes, properties and, in the Dublin Core namespace, concept schemes providing broad, top-level controlled vocabularies / value enumerations for some of the properties.
While we have done some technical work over the last year or so (notably creating a concept scheme for Learning Resource Type, and adding German translations to the published concept schemes), the main thrust of the current work is promotion and dissemination. A major part of our effort currently underway is setting up to gather information about who is using LRMI and how they are using it, so that we can create documentation based on current, real-life examples.
Application Profiles are key to understanding how the LRMI vocabulary may be used. Application Profiles allow the “mixing and matching” of terms from different base vocabularies, and the overlay of additional rules or other information for how the terms are to be used in a specific application—see my previous posts on DCMI Tabular Application Profiles for more about how we do that. The LRMI terms focus on the educational aspects of any form of learning resource (text book, instructional video, web-based or face to face course, V/AR, podcast…), and are designed to supplement other vocabularies designed to describe any general resource of a similar form. So the idea is that you would select the terms from whatever vocabularies best describe the form that your learning resource takes, be it book, video, web site, event etc., and then use the LRMI properties to describe the educational characteristics.
This approach is at the root of that taken by the IEEE P2881 Working Group (described here). That work is not aiming to replace IEEE 1484.12.1 Standard for Learning Object Metadata (LOM), but to provide an RDF-based alternative for uses where the flexibility of RDF is beneficial. The current draft in progress is an application profile based mostly on LRMI terms (from the DCMI namespace) and Dublin Core Terms.
About my role
About twelve years ago I was mostly done with metadata. I was more interested in Open Educational Resources (OERs) and how discoverability was supported by having the full text on open view. But then along came schema.org. Andy Powell asked what it would be necessary to add to the metadata understood by Google in order to adequately describe an OER. That approach was very much in line with the DCMI Education and UKOER metadata line of thinking, so I wrote a blog post in reply. Then, when the LRMI project got funding and was asking for input, I used that blog post to lever my way into the advisory group. And my involvement kind of grew from there. I now chair the Dublin Core LRMI Working Group, which is tasked with maintaining and advocating for the use of the LRMI vocabulary, and I represent the work in other fora, such as IEEE P2881.
It took a while, but we now have in LRMI a Learning Resource Type concept scheme that defines a controlled vocabulary of terms that you might use to describe the type or nature of a learning resource.
Why it took a while: what is a learning resource, and what is a type?
Aside from everything in metadata being harder than you first think, and our having less time than we would like, the main reason it took so long (and it took, like, a few years) comes down to two questions: what is a learning resource? and what is a type? Some folk maintain that there is no well-defined class of “learning resources”: anything can be used for learning and teaching, and trying to describe different sub-types of “anything” is going to be a fruitless task. Pedagogically, I have no argument with the statement that anything can be used for learning and teaching, but for information systems that is not a useful starting point. I have seen repositories crash and burn because they took that as their collection policy. Telling people who are looking for resources to help them learn maths that they can use anything, just be imaginative in how you use it, is not helpful.
By way of analogy, pretty much anything can be used as a hammer. Some things will be better than others (the ones that are the right weight, hard and not brittle), but I’ve used stones, shoes, lumps of wood, monkey wrenches and so on as hammers with some success. That doesn’t mean that “hammer” doesn’t exist as a category, nor does it mean that it isn’t useful to distinguish a peen hammer from a sledgehammer from a copper-headed mallet. Not that I am easily distracted, but I have found plenty of shops that not only sell hammers as a distinct type of tool but have a fascinating array of different types of specialist hammers.
In LRMI we settled on the following definition of a learning resource:
“A persistent resource that has one or more physical or digital representations, and that explicitly involves, specifies or entails a learning activity or learning experience.”
So, not just anything that can be used for learning and teaching, but something that is meant to be used for learning and teaching.
Intuitively it seems clear that there are different types of learning resource: lesson plans are different to textbooks, video lectures are different to assessment items. But how to encapsulate that? Is something an assessment because it is used for assessment, or is there something inherent in some resources that makes them assessments? Likewise, is a video lecture a different type of thing from a lecture, or just a different format? The answer in each case is sometimes yes to both. The use of something may be strongly correlated to what it is, but use and type are still distinct. That is fine: we have in LRMI a property of educationalUse, which can be assessment, and now learningResourceType, which can also be assessment. Likewise, the format of something may be correlated to what it is: textbooks tend to include text; a video recording of a lecture will be a video. Again that is fine, we have MIME types and other ways of encoding format to convey that information, but they won’t tell you whether something is a textbook or a children’s picture book, and not all recordings of lectures will be videos. So learning resource type may be correlated to educational use and format without being the same as either.
Principles adopted for the LRMI Learning Resource Type vocabulary
As with all our work in LRMI, we adopted a couple of principles. First, the vocabulary should focus solely on what is relevant to learning, education and training: other vocabularies deal well with other domains and with generic terms. Second, create a small vocabulary of broad, high-level terms to which other people can map their special cases and similar terms: those special cases are often so context dependent that they frequently don’t travel well. Related to both of these, we mapped our vocabulary to terms in two others: CEDS and the Library of Congress Genre/Form Terms. The links to CEDS terms are useful because CEDS is well established in the US system, and it provided pre-existing terms, many of which we adopted. The link to the LoC terms is useful because it ties our terms into a comprehensive list of generic terms. The LoC terms are an example of a vocabulary that you might want to use if you are describing things like data as learning resources: we don’t cover that case because data as a learning resource is not distinct from data in general, but we are all linked data here, and when providing resource descriptions you can mix terms from our scheme with those from others.
Using the LRMI Learning Resource Type vocabulary
The vocabulary is expressed in SKOS, and so is ready for linked data use.
If you manage your own list of learning resource types using SKOS, we invite you to create links to the LRMI concepts and thus improve interoperability of learning resource descriptions. We would be interested in hearing from you if you are in this situation. Perhaps you have suggestions for further concepts; you can raise an issue about the concept scheme if that is the case.
If you create learning resource descriptions you may reference this vocabulary in several ways, for example in JSON-LD you may have:
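Something like this minimal sketch, in which the resource URI is invented and the concept URI is illustrative (take the exact concept URIs from the published concept scheme):

{
  "@context": {
    "lrmi": "http://purl.org/dcx/lrmi-terms/"
  },
  "@id": "http://example.org/resources/1",
  "lrmi:learningResourceType": {
    "@id": "http://purl.org/dcx/lrmi-vocabs/learningResourceType/textbook"
  }
}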
In schema.org, you may use the labels defined as simple string values, but you could include a link to our full definition (and hence provide access to the links to other schemes that we define — after all this is linked data), by using a defined term as the value for learning resource type:
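For example (again a sketch; the resource name is invented):

{
  "@context": "https://schema.org/",
  "@type": "LearningResource",
  "name": "An example textbook",
  "learningResourceType": {
    "@type": "DefinedTerm",
    "name": "textbook",
    "inDefinedTermSet": "http://purl.org/dcx/lrmi-vocabs/learningResourceType/"
  }
}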
Over the last year or so I have been contributing to a Dublin Core Metadata Initiative Interest Group, led by Karen Coyle, that has been drafting recommendations on how to create and document application profiles in a simple tabular format. At the end of last year we had a webinar presenting the work so far, which is now available on YouTube. For me the really useful thing about this approach, which we are calling DC TAP, is that it is a human-friendly approach that also lends itself to machine processing—at the end of the webinar recording Tom Baker demos a python script that will take a tabular application profile exported in CSV format and turn it into ShEx that can be used to validate data.
Application Profiles
“data elements drawn from one or more namespaces, combined together by implementors, and optimised for a particular local application” –Heery and Patel
Application profiles are a powerful concept, central to how RDF can be used to create new schemas without creating entirely new standards. The idea was presented in “Application Profiles: Mixing and Matching Metadata Schemas” by Rachel Heery and Manjula Patel, where it was defined as: “schemas which consist of data elements drawn from one or more namespaces, combined together by implementors, and optimised for a particular local application”. Twenty years on and the only part of that definition that I would quibble with is that “local” underplays how widely useful an application profile can be when shared. A couple of examples might illustrate this.
The Data Catalog Vocabulary (DCAT) is a “vocabulary designed to facilitate interoperability between data catalogs published on the Web”. It draws on terms from external vocabularies like Dublin Core and FOAF; where these aren’t sufficient it mints new terms, and it mixes these terms according to a domain model relevant to data sets. Thus DCAT can be seen as an application profile of several pre-existing vocabularies, supplemented with a few new terms and a model of how to use those terms. Recently, when I was asked about describing learning resources about data management, I suggested adding LRMI into this mix to create a schema using terms that are mostly already familiar in that domain: no need for the implementors to move to a new specification. (Hence the use of DCAT at the end of my previous post on harmonizing LOM metadata with RDF.)
Bioschemas “aim to improve the Findability on the Web of life sciences resources”. They take a similar approach, but with the starting point of schema.org. Unlike Dublin Core and FOAF, schema.org is a huge and ever-growing vocabulary, and the main issue in using it is one of selection, i.e. cutting it down to size, and explanation, i.e. explaining how you want to use what has been selected. On their website you can see their draft recommendation for how they would describe Courses; note the tabular format, how they tweak the term descriptions to be understandable for their community, how they add requirements such as which properties must be used and whether a property can be repeated, and how they place constraints on the values allowable for some properties, e.g. that the value should be drawn from a specific concept scheme. This general approach will also be familiar to anyone who has looked at Google’s guidelines on use of schema.org for the various search products it offers, for example recipes.
Tabular Application Profiles
Select, Explain, Constrain, Share
One of my colleagues in the DC TAP interest group, Nishad, talks about “profiling”, “constraining” and “explaining” as the key aspects of an Application Profile. Application Profiles allow you to select which terms you want to use, explain how to use them, place constraints on their use (i.e. narrow down what is allowed in the definition of the vocabulary) and share them with others. Existing standards, notably ShEx and SHACL, overlap with this approach, but the aim of the DC TAP group has been to create a simple format congruent with the approach that is often taken by the people who build application profiles. Hence the tabular approach rather than an RDF specification, which I feel will provide a format that fosters collaboration around these profiles. The simplest allowable TAP is a single-column list of the properties that have been selected. The most complex allows you to specify which properties are used to describe which types of entity, how they are related, and what formats should be used for values; this gets more technical, but the tabular approach to presenting it helps keep it manageable. Importantly, the output is actionable: e.g. it can be transformed to ShEx or SHACL to allow data selection and validation services, or parsed to automatically create data entry forms, etc.
There’s plenty of further explanation in the video (embedded above) and in the draft specification and draft primer, but I’ll end with screenshots of a simple example: a schema for describing a course. The vocabulary used is schema.org, so the central entities are CourseInstances (offerings of a course) with specific start and end dates, at a virtual or physical location, about some concept, and with a person as the instructor. First the circles-and-arrows diagram, then the corresponding TAP.
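A TAP along those lines might look something like the following in CSV form (a sketch using a subset of the draft specification’s column headers; the shape names and constraints are illustrative, not the exact profile from the screenshots):

shapeID,propertyID,propertyLabel,mandatory,repeatable,valueNodeType,valueShape
courseShape,sdo:name,Course name,true,false,literal,
courseShape,sdo:hasCourseInstance,Offering of the course,false,true,IRI,courseInstanceShape
courseInstanceShape,sdo:startDate,Start date,true,false,literal,
courseInstanceShape,sdo:endDate,End date,true,false,literal,
courseInstanceShape,sdo:location,Venue or virtual location,false,true,IRI,
courseInstanceShape,sdo:about,Subject of the course,false,true,IRI,
courseInstanceShape,sdo:instructor,Instructor,false,true,IRI,personShape
personShape,sdo:name,Instructor's name,true,false,literal,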
When Clint Lalonde came up with the crazy and wonderful idea of recording a serialised audio version of Martin Weller’s 25 Years of Ed Tech book and asked if I was interested in getting involved, I knew right away which chapter I wanted to read. And I also knew it was one of the chapters no one else was likely to volunteer for – E-Learning Standards. That’s not to disrespect Martin’s writing, it’s more a reflection of the fact that the standards that he highlights in this chapter were never particularly successful. The reason I was so keen to read this chapter though is that I spent 15 years of my life working on various learning technology standards projects, mostly focused on resource description, metadata, and controlled vocabularies, including some that Martin critiques in the chapter. On pretty much all of these projects I worked with Phil Barker, so it seemed only natural to invite Phil to join me for the recording, which he kindly agreed to do, despite claiming not to enjoy audio recording :}
Another reason I thought it would be fun to record this chapter is that people used to find the way I pronounce “metadata” absolutely hilarious. To this day I have no idea why, but I used to get frequent requests from American colleagues at ed tech standards meetings to “say the word metadata.” Admittedly standards meetings could be pretty dull, so we had to make our own entertainment.
We used Zencastr to record the chapter, which worked pretty well; however, for reasons of acoustics I ended up having to record my audio in my bedroom rather than the open plan office downstairs where I usually work. Sitting on the bedroom floor reciting the elements of the Dublin Core was definitely one of the more surreal moments of lockdown.
Recording this chapter also put Phil and me in the interesting position of reading Martin dissing several standards that we had been responsible for developing, particularly the UK LOM Core. Although we both felt that some of Martin’s criticism of Dublin Core really applied to the IEEE LOM, the cause of this confusion highlighted exactly what was wrong with many e-learning standards: too many of them were overly complex, and educators should never have been expected to get to grips with them in the first place. Phil and I had an opportunity to discuss these and other issues in an entertaining Between the Chapters discussion with Laura Pasquini. Phil has already written a blog post about this discussion here, Reading one of 25 years of EdTech, which I can highly recommend reading, particularly if you want to revisit the dawn of EduProg.
From my perspective, I think one of the most important points Martin raises in this chapter, and which we discussed in the podcast, is that although many e-learning standards didn’t really work, we learned from our mistakes, and went on to lay the foundations for the emerging OER movement. The UKOER programme is a perfect case in point. When the programme was launched in 2009, Phil and I had already contributed to the technical strategies for a number of JISC development programmes, and we knew from painful experience that expecting educators to understand baroque metadata standards and provide meaningful descriptions for fields such as semantic density (memorably described as “the poster child for useless metadata elements”) just didn’t work. Application profiles such as the UK LOM Core were an attempt to simplify these complex standards, and make them more human friendly and interoperable, but even these profiles were woefully complex.

So when the UKOER Programme came along, we decided on a completely different approach. (You can read the original technical requirements on my old Cetis blog here: OER Programme Technical Requirements.) Rather than mandating the use of a specific metadata standard and application profile, we simply told people what information to record (title, author, date, URL, basic technical info); we didn’t tell them how to record the information, we left that up to them. The only metadata item that we actually mandated was the programme hashtag, #UKOER. There was a lot of discussion about the wisdom of this approach; web 2.0 was still in its infancy, and people were only just beginning to get to grips with new-fangled approaches to resource description such as community generated metadata and folksonomies. Some people also complained that using the hashtag for resources as well as information about the projects would muddy the water and make it hard to find content.

Another radical approach we recommended for the UKOER programme was that projects could share their resources using “any system or application as long as it is capable of delivering content freely on the open web”, on the proviso that they also uploaded their OERs to Jisc’s national learning resource repository, Jorum, for safe keeping. I’m not going to go into the reasons why central learning resource repositories are a bad idea, that’s a whole other blog post. Suffice to say that Jorum has long since been consigned to the annals of ed tech history.
The technical strategy we developed for the UKOER programme was really our first attempt at using the open web as technical infrastructure and, by and large, it actually worked. Almost ten years after the end of the programme, if you search for UKOER, you can still find some of the project outputs on the open web and on sites like Flickr and YouTube. And what is even more remarkable is that, despite there being some heated discussion at the end of the programme about whether the UKOER hashtag should be retired, people are still using it on Twitter to this day. All of which backs up Martin’s closing point that:
“not only did some of the ideas from learning objects and standards later evolve into the work on open educational resources (OER) but many of the same personnel were involved; for example, Stephen Downes, David Wiley, Lorna Campbell, Brian Lamb, and Sheila MacNeill all contributed to this early field and then became significant voices in open education. This demonstrates that while some approaches do not achieve the success envisaged for them, the ideas and people involved develop the key ideas into more successful versions.”
Huge thanks to Martin for chronicling the history of ed tech with his brilliant book, to Clint for making this amazing project happen, Laura for being such an engaging podcast host, and last but not least, Phil for putting up with me for 15 years of metadata projects!
IEEE are standing up a working group for Learning Metadata, to build on IEEE Std 1484.12.1 Standard for Learning Object Metadata while exploring new paradigms and technology practices in education. I submitted the following as a suggestion for a starting point. It’s heavily based on the presentation I did at the start of the LRMI Metadata in Use panel session for DCMI Virtual. I think it is a reasonable introduction to, and reflection on, the strengths of LRMI (we can do the weaknesses some other day). Many thanks to colleagues in the DCMI LRMI Task Group who commented on it, and especially Stuart Sutton for the 10 principles enumerated at the end.
The Learning Resource Metadata Initiative (LRMI) was begun in the spring of 2011 with the initial goal of developing a set of RDF properties and classes to augment Schema.org for description of learning resources. The first terms from LRMI were added to schema.org in 2013, and in 2014 curation and further development was taken on by the Dublin Core Metadata Initiative LRMI Task Group. Within schema.org, many (but not all) of the terms that originate from LRMI are now collected under the LearningResource type. The LRMI Task Group also maintains the terms in a DCMI namespace as a separate vocabulary aligned to their counterparts in schema.org, along with some concept schemes that may be used to provide controlled value spaces for some of the properties. The terms in the DCMI namespace are more stable than those in schema.org and somewhat less bound to the overall modelling approach taken by schema.org; they are also currently under revision to incorporate the latest additions made by LRMI to schema.org.
Currently LRMI terms include the following properties and classes (concept schemes exist to provide controlled vocabularies for those marked with an asterisk *)
Properties associated directly with Learning Resources
From the outset, the intention was that LRMI metadata would be a set of terms defined using RDF that could be used with other vocabularies to build a complete description of a resource used for learning, education or training. It was recognised that other vocabularies existed that could describe the basic characteristics of various forms of learning resources—books, videos, simulations, web pages etc.—and that what was needed was a vocabulary that could describe their educational characteristics. So while LRMI does not include terms for some obvious generic characteristics, such as “name”, “description” and “author”, nor terms for characteristics that are important for specialized formats, such as “video encoding”, “scale” or “technical requirements”, in practice LRMI terms will always be used alongside a metadata schema that does provide these. This “mixing-and-matching” is especially easy with RDF vocabularies. However, the inclusion of LRMI properties in the very broad vocabulary of schema.org means that, in many cases, schema.org is sufficient to meet most requirements; where schema.org is not sufficient it can be augmented with classes and properties from other RDF-based vocabularies. For the sake of simplicity the examples below use LRMI properties within the schema.org vocabulary.
Examples of educational descriptions with LRMI
The two examples below show schema.org / LRMI metadata in JSON-LD format, with graphical representations of the metadata in each.
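A description along the lines of example 1 might look like this sketch (the resource name, age range and framework URIs are invented; Q405 is the Wikidata item for the Moon):

{
  "@context": "https://schema.org/",
  "@type": "LearningResource",
  "name": "Observing the Moon: teachers' notes",
  "about": { "@id": "https://www.wikidata.org/entity/Q405" },
  "audience": {
    "@type": "EducationalAudience",
    "educationalRole": "teacher"
  },
  "typicalAgeRange": "10-12",
  "educationalLevel": {
    "@type": "DefinedTerm",
    "name": "Second level",
    "inDefinedTermSet": "http://example.org/curricula/scottish-cfe"
  },
  "teaches": {
    "@type": "DefinedTerm",
    "name": "The moon and its phases",
    "inDefinedTermSet": "http://example.org/curricula/scottish-cfe"
  }
}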
The name property is used to provide the title of the resource.
The about property is used to provide the subject of the resource in example 1 as a reference to a wikidata item, which is a common practice in linked data. The URI of the wikidata item could be resolved to provide more information. An alternative would be to embed a description of the subject using the name and description properties.
The audience property can be used to state whether the resource is intended for teachers, learners, parents/carers or other stakeholders. A controlled vocabulary for these exists as an LRMI concept scheme though it is not referenced in these examples.
typicalAgeRange can be used to show the age of learners for which the content is suitable; in example 1 this is not the age of the audience.
educationalLevel is used in example 1 to state the stage in the curriculum for which the resource is appropriate, in this case by reference to the Scottish national curriculum. Simple terms like “beginner”, “intermediate”, “advanced” are also allowed.
The teaches and assesses properties can be used to show alignments to competency frameworks or shared curricula; the educationalAlignment property allows other types of alignment to be shown.
Ways Forward
Both LRMI and schema.org are available under Creative Commons licenses (CC:BY for LRMI, CC:BY-SA for schema.org), so you could just use and adapt the terms, though you may want to discuss other rights and assurances with the relevant organizations (not me: I am not affiliated with and do not represent schema.org).
The LRMI Task Group liaises with standards bodies, and is happy to discuss how to use LRMI terms to meet specific use cases. They may be able to help provide concept schemes and additional terms should that prove necessary.
I would suggest that a profile of schema.org (or similar broadly scoped, widely used, RDF-based vocabularies) with LRMI terms and additional concept schemes could meet most use cases. Such a profile could provide better interoperability than the specifications on which it is based by adding constraints to limit redundant options in how metadata can be expressed, and that make data easier to validate.
Aside from the specifics of LRMI, the following points encapsulate the successful features of the approach of LRMI that we would recommend for IEEE P2881:
An entity-centric as opposed to record-centric model;
Modeled as semantic metadata through commitment to RDF/RDFS (which includes emerging property-graph related specifications notably RDF*);
Syntax independence with non-normative examples in RDF/XML, Turtle, JSON-LD etc.;
Following the Dublin Core 1-to-1 rule—i.e., a description describes one entity and one entity only (related to #1);
Commitment to supporting interoperability through the reuse of existing properties where appropriate and coining new properties only where necessary;
Supporting semantic local extensions to express local or community needs (i.e., things not “said” in the base P2881 standard can be added to local profiles of P2881 with clear semantics for downstream understanding and use);
No defined constraints on properties and classes in the specification—leave such constraints to downstream profiles of the spec (e.g. no “mandatory”, “repeatable” or record-like notions of field size limitations etc.);
No composite attributes (structured values); e.g., VCARD records. Attributes of structured values are decomposed into individual properties (also related to #1);
Defining and maintaining the needed value spaces—enumerations, classifications, and concept schemes as separate RDF entities using W3C’s SKOS (Simple Knowledge Organization System) Recommendation, and expressing them in instance data using an entity like sdo:DefinedTerm / sdo:DefinedTermSet;
Allowing referencing of any appropriate value spaces defined by others (in other words, defined outside the P2881 spec) that are identified by URI, including competency frameworks.
I am somewhat late in writing this, but back in May some new properties developed by LRMI were added to schema.org that simplify and expand how schema.org can be used to describe learning resources and educational events.
The new properties are:
teaches: The item being described is intended to help a person learn the competency or learning outcome defined by the referenced term.
assesses: The item being described is intended to assess the competency or learning outcome defined by the referenced term.
These are added as properties of CreativeWork and EducationEvent, and can both be used with either a DefinedTerm or text as the value (or URL, which is an allowed value for any schema.org property).
In a related change, the domain of educationalLevel (“the level in terms of progression through an educational or training context”), which was added last year for EducationalOccupationalCredentials, was expanded so that it can also be used with CreativeWork and EducationEvent. It also can have a DefinedTerm, text or URL as a value.
As well as providing a way of describing the educational characteristics of EducationEvents, which was previously not possible, these changes simplify how learning resources can be described. Previously, in order to describe these characteristics one had to use the educationalAlignment property and an AlignmentObject with an alignmentType property of “teaches”, “assesses” or “educationalLevel”. Not only was this a somewhat complex, indirect way of saying something simple, but we think that the use of the AlignmentObject had the effect of hiding the availability of these important properties.
Having DefinedTerm as a value means that one can still describe the value as being drawn from an educational framework, the framework being modeled as a DefinedTermSet, just as one could when using various properties of the AlignmentObject. In fact there is the improvement that you can now provide a URL for the framework being used, not just the URL of the target term and the name of the framework. Here are a couple of examples:
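A couple of sketches of the new pattern (the resource names, framework name and URLs are invented for illustration). First, assesses with a full DefinedTerm and DefinedTermSet:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Forces and motion practice questions",
  "assesses": {
    "@type": "DefinedTerm",
    "name": "Forces and motion",
    "url": "http://example.org/framework/forces-and-motion",
    "inDefinedTermSet": {
      "@type": "DefinedTermSet",
      "name": "Example Curriculum Framework",
      "url": "http://example.org/framework"
    }
  }
}

Second, teaches and educationalLevel with simple text values:

{
  "@context": "https://schema.org/",
  "@type": "VideoObject",
  "name": "Introduction to forces",
  "teaches": "Newton's laws of motion",
  "educationalLevel": "beginner"
}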
The educationalAlignment and AlignmentObject are still valid terms, and indeed are still the only means of describing some other educational characteristics of learning resources. However, going forward, we suggest you use the new teaches, assesses, educationalLevel properties in preference.
We will add these new terms to the DCMI LRMI namespace shortly, I hope, giving them a stable existence independent of schema.org.
Next steps?
If you’re interested in what may be coming next, here’s some of what has been discussed.
A property to express “the skill, knowledge or competence a learner needs before using a learning resource” is on our list of simple properties, which would complete the trio of requires, teaches, assesses. However, the name “requires” has led to confusion over what is required (a pen and paper?) and by whom/what.
Also, take a look at the schema.org github issue 1401 and you will see plans and discussion around creating a class for learning resources. I hope this will be limited to CreativeWorks and will sit alongside EducationEvent, acting as a convenient place to find the properties relevant to describing educational characteristics, and also allowing such properties to be used when describing Videos, Books, Products, etc. as learning resources (i.e. by declaring LearningResource as an additional type) without needing to add education-specific properties higher up the schema.org hierarchy.
Anyone who has ever worked in developing or implementing metadata specifications will know that they’re not exactly a bundle of laughs, so metadata jokes are a rare and precious thing (cf Winston Jingo and Patrice Roux-the-Day.) So when I came across this image in 2016, I had to retweet it for the poor souls who suffered through the LOM and DC years with me.
Being mindful of copyright, I attributed it with a link back to the original blog it was posted on, BigSnarf. Turns out that BigSnarf was not the creator of the image, and three years later the real creator, Michael J. Swart, turned up in my mentions to claim his rightful ownership. Michael was very gracious, and when I offered to delete my original tweet said he wasn’t too bothered; he was more interested to see how people were using and attributing his images.
Thanks. I don’t really care too much. It’s interesting to see who’s using my drawings and how. It’s sometimes with attribution. This is just the first time with a mistaken credit.
So instead I offered to write a blog post, which means I can re-post Michael’s image with the correct attribution, and I can also use this as a teachable moment. Even if you do your best to attribute images correctly, sometimes you will get it wrong. If you do get it wrong, then immediately offer to make it right by apologising, offering to remove the misattributed image, and correcting the attribution. Oh, and Michael has also made lots of other very cool illustrations, which you can see here: Articles by Illustration.
I’ve been experimenting with ways of putting JSON-LD schema.org metadata into HTML created by MkDocs. The result is a python-markdown plugin that will (hopefully) find blocks of YAML in markdown and insert them into the HTML that is generated. You can find the plugin on github, and you can read more about the development of it in some pages generated by MkDocs (that incidentally use the plugin).
What’s it do?
Markdown is a widely used simple text format that allows formatting of text using inline markup; it’s a bit like the markup used on mediawiki/wikipedia. MkDocs is a python program that will build HTML pages out of markdown and templates. It’s geared towards the production of software/spec documentation, and we have been using it for documenting the metadata spec we’re creating for educational materials in the K12OCX project. (You’ll see the OCX part, Open Content Exchange, made it through to the plugin name.) Steve Midgley suggested that we might go further and use markdown to create the learning resources themselves, somehow generating the metadata along with the HTML. MkDocs can be extended in a number of ways that would facilitate this, but most relevantly it uses the python markdown module, which has an API allowing for extensions.
YAML seemed like the obvious way of putting metadata into markdown. It’s another simple text format, for expressing key-value pairs, where the values can be lists or sets of other key-value pairs. It’s already used by MkDocs for specifying the site structure, and a number of extensions to python markdown already use it.
So the ocxmd extension for python markdown will look for blocks of YAML in a markdown document and replace them with blocks of JSON-LD in the HTML that is generated. It also provides the metadata as a python dict in the markdown object. Feel free to try it out (cautiously) and let me know how it goes wrong / what it doesn’t do that it should / what it does that it shouldn’t …
How’s it work?
Quite simple really. I took inspiration from an existing YAML in markdown processor written by Nikita Sivakov (who’s probably a better programmer than me, so if his plugin does what you want, use it, not mine). The YAML is separated before and after by a triple dash ('---') on a line by itself. The plugin extends the python markdown Preprocessor so that it goes through the markdown document line-by-line looking for a triple dash. Until it finds one, the text is copied into what will be returned as the ‘pure’ markdown document for further processing. When it finds a triple dash, it instead copies the lines into what will be processed as YAML (along with a few lines that will become the JSON-LD context). When it finds the closing triple dash, it processes the YAML using a python library, and then copies it into the markdown to be returned as JSON-LD between a couple of <script> tags. It also stores the python dict generated by the YAML processor as a new property in the markdown object. Then it goes back to reading line-by-line, copying the text straight into what will be returned as markdown until it meets the end of file or the next triple dash.
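In outline, a preprocessor along those lines might look like this minimal sketch (not the actual plugin code: the CONTEXT dict, class names and registration name are assumptions made for illustration):

import json
import yaml
from markdown.extensions import Extension
from markdown.preprocessors import Preprocessor

# a simplified stand-in for the lines the plugin adds as the JSON-LD context
CONTEXT = {"@context": "http://schema.org/"}

class YamlToJsonLdPreprocessor(Preprocessor):
    def run(self, lines):
        out, buf, in_yaml = [], [], False
        self.md.meta = {}  # the metadata as a python dict, kept on the markdown object
        for line in lines:
            if line.strip() == '---':
                if in_yaml:  # closing triple dash: process the gathered YAML
                    data = yaml.safe_load('\n'.join(buf)) or {}
                    self.md.meta.update(data)
                    out.append('<script type="application/ld+json">')
                    out.append(json.dumps({**CONTEXT, **data}, indent=2))
                    out.append('</script>')
                    buf = []
                in_yaml = not in_yaml
            elif in_yaml:
                buf.append(line)   # lines to be parsed as YAML
            else:
                out.append(line)   # 'pure' markdown passes straight through
        return out

class OcxMdSketch(Extension):
    def extendMarkdown(self, md):
        md.preprocessors.register(YamlToJsonLdPreprocessor(md), 'ocxmd_sketch', 28)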
If installed in a suitable python environment you can add it to the extensions available to MkDocs with an entry in the mkdocs.yml markdown_extensions block.
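Assuming the extension registers itself under the name ocxmd, that entry would be something like:

markdown_extensions:
  - ocxmd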
What does JSON-LD in YAML look like?
The metadata that we use for OCX is a profile of schema.org / LRMI, OERSchema and a few bits that we have added because we couldn’t find them elsewhere. Here’s what (mostly) schema.org metadata looks like in YAML:
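For instance, an illustrative block along these lines (the names and values are invented):

---
"@id": "#lesson1"
"@type": CreativeWork
name: "Lesson 1: Introduction to forces"
learningResourceType: lesson plan
timeRequired: PT50M
author:
  "@type": Person
  name: A. N. Author
---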
I have posted details of my publications on this site. More interestingly, I did so with a plugin and theme extension which builds on previous work on presenting publications in WordPress and using WordPress as a semantic platform. These are useful steps towards WordPress as a lightweight repository.
Publications as custom post types and metadata
The approach rests heavily on custom post types, one for each publication type. And for each post type there are custom metadata boxes that correspond to the schema.org properties that are important for that type of publication. As an added factor, some of the schema.org properties expect values which are entities rather than literals (i.e. Things not Strings). So there are custom post types for these entity types, namely People (the authors) and Organizations (publishers).
Editing details of a chapter; the dropdown options for author come from the People custom posts.
Output and theme extension
For each custom post type and associated metadata there are helper functions which can be called in the theme to output the data wrapped in suitable HTML. This output includes the schema.org properties embedded as RDFa. So, if you look at one of the publication pages through an RDFa enabled tool (such as Google’s Structured Data Testing Tool) you will extract all the metadata for that chapter. (Note: in this case you’ll also get information about this blog put there by SEO plugins I use).
I also created a single archive for all the publication custom post types. This I did pretty much as outlined before. As noted, that code would give all the custom post types with an archive the same archive. That’s not what I want, so I registered the publication custom post types with a tag of my own:
'pubwp_type'=>'publication'
and added this to the query to build the loop for the publications archive:
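Something along these lines (a sketch; the actual code in the plugin may differ):

// gather all post types registered with the custom
// 'pubwp_type' => 'publication' argument, and query across them
$publication_types = get_post_types( array( 'pubwp_type' => 'publication' ) );
$query = new WP_Query( array(
    'post_type'      => $publication_types,
    'posts_per_page' => 10,
) );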
I would like to make the display of the archive more flexible, so it can be filtered and ordered by any parameter that makes sense (date, author, type, topic). Also I should integrate it better with the rest of the blog, so that publication types show up in general archives and searches.
Beyond my own publication list, I can see this approach being useful for disseminating information about different types of resource, e.g. OERs, so I would like to make it easier to create new post types, with associated properties and helper functions.
XCRI-CAP (eXchanging Course Related Information, Course Advertising Profile) is the UK standard for course marketing information in Higher Education. It is compatible with the European Standard Metadata for Learning Opportunities. The W3C schema course extension community group has developed terms for describing educational courses that are now part of schema.org. Here I look at translating the data from an XCRI-CAP XML feed to schema.org JSON-LD.
Why?
The aim here is to illustrate the extent to which the two specifications are interoperable. Also, mapping a functioning specification for advertising courses to schema.org terms will give an indication of what might be lacking from the latter, or help define a subset of schema.org that could be used as an application profile for course advertising. Finally, there is a python script that might be the start of a useful tool for people who have XCRI-CAP data and want to use schema.org to describe those courses.
Before going any further I should clear up one potential point of misunderstanding. In the UK ‘course’ is often used to describe a programme of study at University or College level lasting from one to five years, leading to an award such as a Diploma, Degree, Masters etc. These ‘courses’ are also called programmes, and roughly translate to what in the US can be called a Course of Study. They typically comprise several modules, also often called courses (sorry, we made up this language as we went along). XCRI-CAP is primarily used to describe these long courses/programs of study, because in the UK that is what institutions typically advertise to potential students. However, XCRI-CAP can also be used to describe short courses. My sense from the development of the schema course extension is that many people had short courses in mind (e.g. MOOCs); however, it is also applicable to long courses / programs of study. So, in short, for this discussion, if it is a “sequence of events and/or creative works that aims to build the knowledge, competence or ability of learners” then I’ll call it a course, however long or short it is.
The anatomy of an XCRI-CAP XML feed in schema.org terms
In XCRI, a Catalog contains Providers, who offer Courses. Courses specify Presentations, i.e. instances of the course, at a venue which is a Provider. Providers have locations. Courses lead to Qualifications and/or Credit that can be used towards a qualification. (Image after BS 8581-1:2012 p1.)
<?xml version="1.0" encoding="UTF-8"?>
<!--
Author: Alan Paull, APS Ltd, alan@alanpaull.co.uk
Created: 25 June 2014; modified: 21 May 2015
This is a generic XCRI-CAP 1.2 example file produced to illustrate the postgraduate format adopted by Prospects, including material that would be expected to be relevant for other aggregators.
It uses the coursedataprogramme.xsd schema.
It uses revisions to the schemas to include specific refinements for postgraduate data vocabularies.
Modified by Phil Barker <http://people.pjjk.net/phil> to show just starting tags and hints to content
-->
<catalog xmlns="http://xcri.org/profiles/1.2/catalog"
xmlns:xcriTerms="http://xcri.org/profiles/1.2/catalog/terms"
xmlns:credit="http://purl.org/net/cm"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:mlo="http://purl.org/net/mlo"
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:courseDataProgramme="http://xcri.co.uk">
<dc:contributor>
<dc:description>
<provider>
<!-- list of all the university's departments that can 'own' courses. -->
<mlo:hasPart>
<mlo:hasPart>
<dc:description>
<dc:identifier>
<dc:identifier xsi:type="courseDataProgramme:ukprn"><!-- numerical id -->
<dc:title>Poppleton University</dc:title>
<mlo:url>
<course>
<!-- isPartOf must match exactly with a hasPart entry -->
<mlo:isPartOf>
<!-- Note XHTML markup (concise markup version) -->
<dc:description>
<xhtml:div>
</dc:description>
<!-- 'specialFeature' is in the XCRI-CAP 1.2 Terms schema (verbose markup version) -->
<dc:description xsi:type="xcriTerms:specialFeature">
<div xmlns="http://www.w3.org/1999/xhtml">
</dc:description>
<dc:identifier><!--url-->
<dc:identifier xsi:type="courseDataProgramme:internalID"><!--alpha-numeric id-->
<dc:subject xsi:type="courseDataProgramme:JACS3" identifier="N200"><!--name of subject-->
<dc:subject><!--name of subject-->
<dc:title>
<!-- Course type codes specific to PG(T) -->
<dc:type xsi:type="courseDataProgramme:courseTypeGeneral" courseDataProgramme:identifier="PG"><!--label for code-->
<dc:type xsi:type="mlo:RTCourseTypeFlag" mlo:RT-identifier="T"><!--label for code-->
<mlo:url>
<abstract>
<applicationProcedure href="http://www.poppleton.ac.uk/postgraduate/courses/how-to-apply/"/>
<mlo:assessment>
<learningOutcome>
<mlo:objective>
<mlo:prerequisite>
<regulations href="www.poppleton.ac.uk/regulations"/>
<mlo:qualification>
<dc:identifier><!--alpha-numeric id-->
<dc:title>
<abbr>
<dc:description>
<dcterms:educationLevel>
<mlo:url>
<awardedBy>
</mlo:qualification>
<mlo:credit>
<credit:scheme>
<credit:level><!--code-->
<credit:value><!--value-->
</mlo:credit>
<presentation>
<dc:identifier><!--url-->
<dc:title>
<mlo:start dtf="2015-09-01"><!--text equiv-->
<end dtf="2017-07-01"><!--text equiv-->
<mlo:duration interval="P2Y"><!--text equiv-->
<applyFrom dtf="2014-09"><!--text equiv-->
<applyUntil dtf="2015-09"><!--text equiv-->
<applyTo><!--url-->
<studyMode identifier="PT"><!--label-->
<!-- Note: in the absence of attendanceMode, consumers can assume that it is Campus, so the attendanceMode can be omitted -->
<attendanceMode identifier="CM"><!--label-->
<attendancePattern identifier="DT"><!--label-->
<mlo:languageOfInstruction><!--iso 639-2 code-->
<languageOfAssessment><!--iso 639-2 code-->
<mlo:places><!--number of places-->
<mlo:cost><!--free text description-->
<!-- Note: in the absence of venue, consumers can assume that it is as per the main provider element, so the venue can be omitted -->
<venue>
<provider>
<dc:identifier><!--label-->
<dc:title>
<mlo:location>
<mlo:town>
<mlo:postcode>
<mlo:address>
<mlo:address>
<mlo:phone>
<mlo:email>
</mlo:location>
</provider>
</venue>
</presentation>
</course>
<mlo:location>
<mlo:town>
<mlo:postcode>
<mlo:address>
<mlo:address>
<!-- international convention also acceptable: +44 (0) 800 666 9999 -->
<mlo:phone>
<mlo:fax>
<mlo:email>
</mlo:location>
</provider>
</catalog>
Working through this from the top (root) down (up?):
catalog
Top levels of XCRI and corresponding schema.org types and properties.
In XCRI, catalog is the root element for a list of courses; the Google Developer guidance for describing course lists suggests schema.org/ItemList is a good equivalent. The catalog element has a @generated attribute, which is the date on which the catalog content was generated; it also has sub-elements of description and contributor. I haven’t implemented this yet, but they could be translated if the schema.org course list is double-typed as an ItemList and a CreativeWork. In schema.org the relationship between the course list / catalog and the Courses is provided by the itemListElement property of the ItemList. This expects a value which is a ListItem, and so we need to double-type the course entities in schema.org as a ListItem and Course.
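For example, a sketch of that double-typing (the course name is invented; dateCreated carries the @generated date):

{
  "@context": "https://schema.org/",
  "@type": ["ItemList", "CreativeWork"],
  "dateCreated": "2015-05-21",
  "itemListElement": [
    {
      "@type": ["ListItem", "Course"],
      "position": 1,
      "name": "MSc Advanced Widget Studies"
    }
  ]
}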
provider
XCRI provider elements and the equivalent schema.org graph.
In XCRI XML the relationship between courses and the organizations that provide them is expressed by nesting a course element inside a provider element. In schema.org we use the provider property that Course inherits from CreativeWork. The information about the provider, i.e. description, title, parts, location (as a postal address), identifier and url, mostly has obvious counterparts in schema.org/Organization (i.e. description, name, subOrganization (not implemented), address (as PostalAddress) and url). Identifiers take a bit of thought, see below.
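A sketch of the resulting shape, using the Poppleton University provider from the example file (the course name and address details are invented):

{
  "@context": "https://schema.org/",
  "@type": "Course",
  "name": "MSc Advanced Widget Studies",
  "provider": {
    "@type": "Organization",
    "name": "Poppleton University",
    "url": "http://www.poppleton.ac.uk/",
    "address": {
      "@type": "PostalAddress",
      "addressLocality": "Poppleton"
    }
  }
}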
course
XCRI Course and the equivalent schema.org graph.
XCRI makes the distinction between Course as a thing which may be offered at different times and places, and Presentation as an offering or instantiation of a course. This is the same as the distinction between schema.org/Course and schema.org/CourseInstance. So the course elements in XCRI map directly to schema.org/Course entities.
The subelements of xcri:course that map clearly to schema.org properties of Course are: title (maps to name), url, abstract (maps to description), subject (maps to about, and, when an identifier from a suitable framework is specified, an educational subject alignment) and mlo:prerequisite (maps to coursePrerequisites). Identifiers take a bit of thought, see below, but if the identifier was not an http URI I took a punt at it being the schema.org/courseCode (I am especially sure of this if it had the internalID attribute).
As well as abstract there were other descriptions in the XCRI feed, formatted in XHTML giving marketing information. These I passed over, but (stripped of the formatting) they could be used as descriptions, especially if the abstract is absent.
XCRI elements that I haven’t mapped yet are: isPartOf, dc:type, applicationProcedure, mlo:assessment, mlo:objective, regulations, mlo:qualification, and mlo:credit. The last two of these are known gaps in schema.org’s ability to describe courses; there might be mappings to some LRMI properties for some of the others in some circumstances. For example, if the dc:type is PG: Postgraduate, then this could be an alignment to some educational level.
Additionally, we use the provider property of schema.org/Course to link to the provider, a relationship that is conveyed in XCRI XML by nesting, as mentioned above.
presentation
XML to schema.org mapping.
The xcri:presentation maps to schema.org/CourseInstance, which is linked to from the Course by the hasCourseInstance property.
Elements of presentation which map directly to properties of CourseInstance are: title (maps to name), mlo:start (maps to startDate), end (maps to endDate), mlo:duration (maps to duration), studyMode, attendanceMode and attendancePattern (all mapped to courseMode). The venue element maps to CourseInstance’s location property, though the provider’s identifier turns up here in a way which requires a bit of thought, see below.
A number of other elements (namely cost, applyFrom, applyUntil and applyTo) can all be mapped to properties of a schema.org/Offer. mlo:cost maps to a description of a PriceSpecification (the costs for UK HE degrees are usually more complex than can be given with a single number/currency pair); the others map to availabilityStarts, availabilityEnds and availableAtOrFrom. This Offer is linked to from the CourseInstance’s offers property.
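Putting that together for the presentation in the example file, a sketch (the courseMode labels and price description are illustrative; the dates come from the example XML above):

{
  "@context": "https://schema.org/",
  "@type": "CourseInstance",
  "startDate": "2015-09-01",
  "endDate": "2017-07-01",
  "duration": "P2Y",
  "courseMode": ["part-time", "campus", "daytime"],
  "offers": {
    "@type": "Offer",
    "priceSpecification": {
      "@type": "PriceSpecification",
      "description": "Fees vary with residency status; see the provider's website."
    },
    "availabilityStarts": "2014-09",
    "availabilityEnds": "2015-09"
  }
}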
Identifiers
There are multiple identifiers in various formats in the XCRI XML input, and various required identifiers in the schema.org graph of the course information. As discussed above, some of the dc:identifiers provided were short alphanumeric codes, and were used as, for example, the value for schema.org/courseCode, or to identify an educational subject in an Alignment Object. There is also the mlo:url element, which I used for the schema.org/url property.
What I skipped over several times is that, as well as the mlo:url, similar (or identical) http URIs were used as values for dc:identifier. Also, as well as the schema.org url property, for linked data we need an identifier for the entities we are describing (the @id tag in JSON-LD), preferably an http URI. So, I decided to experiment with using the dc:identifiers in XCRI XML as @id identifiers for the JSON-LD. This has an advantage over just using an arbitrary random identifier in that for larger data sets there is a chance of reducing repetition in the serialization of the graph. For example, with luck many courses will share the same location, and so this could appear as a properly identified entity in the graph to which many Course Instance location properties link. I have experimented with different orders of preference for what to use for the @id (e.g. 1. dc:identifier beginning with http, 2. mlo:url, 3. dc:identifier with text value). I immediately hit a snag with this, because the same http URI was being used for different things in the XCRI example, e.g. for course and presentation, or for institution and venue, and it troubled me that the URI I was using was actually the identifier for an institutional web page. So to disambiguate I appended #<SchemaType> to the URI, e.g. http://example.org/course1#Course, http://example.org/course1#CourseInstance.
This is probably going to take more thinking about in the future.
Implementation
Most of the above mapping is implemented (or, with luck, soon will be) in a python script using xml.etree and rdflib. You’re welcome to take a look at it on github, but please bear in mind that it is pretty much untested on any input other than the example file given above. It is certainly not production level code, so don’t use it as such.
The output in JSON-LD is pretty unreadable, so perhaps the most interesting way to view it is through the Google Structured Data Testing Tool. Ignore the errors and warnings, they arise from requirements for various Google products, not problems with the schema.org data.
Conclusions
Broadly speaking, it works. With some exceptions that didn’t surprise me, the XCRI-CAP data about a course can be represented as schema.org linked data.
Credit and Qualifications seem to me to be the biggest gaps, relating to existing use cases from the schema course extension community group.
Likewise there is a gap around how to represent aims and objectives in schema.org, which might be related to work on competencies.
In several places there was coded information in the XCRI (e.g. UK Provider ID, course mode codes) which isn’t easy to represent in schema.org. But this issue is being worked on.
Next steps
I’ll tidy up the code a bit, and I also want to test it more extensively. I’m also pondering putting the resulting JSON-LD into a graph database to test how well it can be queried. This would be a great test of whether the schema course extension project really did meet its use cases.
Do drop me a line if you have any ideas, or if you have any XCRI feeds (or similar data in another format) I could play with.