Spring 2013: Instance Hub 2

June 2, 2013

My work this semester has focused on finishing the port of the old TW instance hub to a new system based on Alvaro Graves’s LODSPeaKr framework for linked data applications. The new instance hub can currently be found at http://logd.tw.rpi.edu/ih2/id. The LODSPeaKr framework has allowed for the creation of a site that is more powerful and more intuitively usable for both users of the service and developers on the backend.

The idea of an instance hub is to provide those using the Semantic Web with a way to express authoritative references to entities, and to provide basic identification data about those entities. Authoritative referencing is important because many things may have similar names, and different names may be used to describe one thing. Take, for example, US states, which are referred to by many different naming schemes – full names, abbreviations, FIPS codes, and variations on these (all caps, all lowercase, first letter capitalized). An instance hub can serve as a disambiguation service by hosting all of these variations, making it easier to link data that might have entirely different ways of referring to the same entity. Instance hubs can also provide basic additional data about the entities that they store – links out to DBPedia URIs about the same entity (or really any owl:sameAs URI), datasets related to the entity, descriptive texts, pictorial representations, and more. Developers can then leverage the data found in the instance hub in the creation of linked data applications.
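To make the disambiguation idea concrete, here is a minimal Python sketch of a lookup that maps several naming variants of a US state to one canonical URI. It is only an illustration of the concept, not how the instance hub is implemented: the `example.org` URIs, the `VARIANTS` table, and the function names are all assumptions made up for this example.

```python
# Hypothetical canonical URIs; the real instance hub URIs may differ.
STATE_URIS = {
    "New York": "http://example.org/id/us/state/New_York",
    "Vermont": "http://example.org/id/us/state/Vermont",
}

# Naming variants per state: full name, two-letter abbreviation, FIPS code.
VARIANTS = {
    "New York": ["New York", "NY", "36"],
    "Vermont": ["Vermont", "VT", "50"],
}


def normalize(label: str) -> str:
    """Collapse whitespace and ignore case so 'NEW  YORK' matches 'New York'."""
    return " ".join(label.split()).casefold()


def build_index(variants: dict, uris: dict) -> dict:
    """Map every normalized variant to the canonical URI of its entity."""
    index = {}
    for canonical, labels in variants.items():
        for label in labels:
            index[normalize(label)] = uris[canonical]
    return index


INDEX = build_index(VARIANTS, STATE_URIS)


def resolve(label: str):
    """Return the canonical URI for a label, or None if it is unknown."""
    return INDEX.get(normalize(label))
```

Calling `resolve("ny")`, `resolve("NEW YORK")`, and `resolve("36")` all lead to the same URI, which is the property that lets differently labeled datasets be linked.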

Attached to this post is a poster about my research that I presented at Rensselaer’s fourth annual undergraduate research symposium – URGS Poster.

While it is not currently publicly accessible, I have already put the new instance hub to use in the development of a linked data application for Professor Jim Hendler’s Web Science class this past semester. My Web Science group created a visualization of asthma attacks per state versus smoking prevalence per state. We used the instance hub as a way of linking two datasets with different ways of referring to US states, and as a way to get data about individual states if users were interested in learning more about them. In the first case, we had one dataset that referred to states by their full names and another that used only two-letter abbreviations, with no other fields that could be used to establish commonality between them. The instance hub allowed me to write a single query that grabbed data from one dataset, got the abbreviation for each state in that dataset using the instance hub, and then grabbed data from the other dataset for each state on the basis of those abbreviations. In the second case, we used the instance hub to present data to users when they clicked on a state in a Google Map in the visualization. When a state was clicked, we ran a query against the LOGD SPARQL endpoint to retrieve data on the state, including its DBPedia URI, and this data, in conjunction with data found in DBPedia, was then presented to the user.
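The linking step described above can be sketched in a few lines of Python. In the actual application this was done in a single SPARQL query; the version below only mirrors the logic with in-memory dictionaries. The `ABBREVIATIONS` table stands in for the lookup against the instance hub, and the numbers are placeholders, not real asthma or smoking statistics.

```python
# Stand-in for the instance hub lookup (full state name -> abbreviation).
ABBREVIATIONS = {"New York": "NY", "Vermont": "VT"}

# Placeholder values only; these are not real statistics.
asthma_by_name = {"New York": 1.0, "Vermont": 2.0, "Atlantis": 3.0}
smoking_by_abbrev = {"NY": 10.0, "VT": 20.0}


def join_by_state(by_name: dict, by_abbrev: dict, name_to_abbrev: dict) -> dict:
    """Pair values from two datasets that label states differently.

    States that cannot be mapped, or that are missing from the second
    dataset, are skipped rather than guessed at.
    """
    joined = {}
    for name, left_value in by_name.items():
        abbrev = name_to_abbrev.get(name)
        if abbrev is None or abbrev not in by_abbrev:
            continue
        joined[name] = (left_value, by_abbrev[abbrev])
    return joined


combined = join_by_state(asthma_by_name, smoking_by_abbrev, ABBREVIATIONS)
```

Without the mapping in the middle, the two dictionaries share no keys at all, which is the situation the two original datasets were in.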

The new instance hub is an improvement upon the old one in several ways:

  • Hierarchical navigation of URIs – previously, instances were presented at URIs that were authoritative and descriptive of the thing found at the URI (e.g., NOAA is found at http://logd.tw.rpi.edu/id/us/fed/agency_page/Department_of_Commerce/National_Oceanic_and_Atmospheric_Administration, and the fact that it is a US federal agency under the Department of Commerce is encoded in its URI), but the URIs themselves were not navigable in the way one might expect them to be. Going to a partial URI such as http://logd.tw.rpi.edu/id/us/fed or http://logd.tw.rpi.edu/id/us/fed_page would give an error rather than presenting information about the US federal government related concepts and entities found in the instance hub. With LODSPeaKr, I was able to easily define “services” which fire off one or more SPARQL queries and then generate a page listing the entities that fall under the category defined by the requested URI fragment (e.g., countries at “/id/country”, US states at “/id/us/state”, US federal entities at “/id/us/fed”). Examples of this can be found by clicking on the various headings found at http://logd.tw.rpi.edu/ih2/id, which then lead to more specific category listing pages.
  • Typing on instances – in the previous instance hub, entities did not have an rdf:type, which is a necessary prerequisite for using LODSPeaKr services. I had to use the csv2rdf4lod tool to reconvert many of the instance hub datasets so that they would be typed for LODSPeaKr. Although we decided that a local vocabulary was the best way to handle this typing, the types still provide users with more data about the entities they find in the instance hub, and they allow for easier retrieval of data via the TW SPARQL endpoint, as users can query on the basis of type.
  • Flexibility and ease of extension – the previous instance hub was built from PHP scripts that handled data retrieval and presentation. With the move to LODSPeaKr, development is much easier, as the framework handles all the backend logic of querying and parsing retrieved data, leaving the developer to simply write the queries they need for the page, specify the SPARQL endpoints that they should go to, and then use the Haanga templating engine built into LODSPeaKr to define the front end presentation style of pages.
  • Move away from “_page” convention – the previous PHP-based instance hub used the convention of appending “_page” to any entity that was being presented with a human-readable HTML page, as opposed to a content-negotiated RDF dump. With LODSPeaKr, an extension appended to the end of the URI defines the type of information that is presented, with “.html” as the default if the user doesn’t specifically ask for something different. Users exploring the data in an HTML-based web browser who then wish to retrieve an RDF representation of the data they are viewing can get it by simply replacing “.html” with an RDF format file extension – “.rdf”, “.ttl”, etc.
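Two of the points above, hierarchical navigation and extension-based formats, come down to simple URI manipulation, sketched below in Python. This is an illustration from a client's point of view, not LODSPeaKr code: the function names and the `example.org` URL are assumptions, and the list of extensions contains only the ones mentioned in this post.

```python
from urllib.parse import urlsplit, urlunsplit

# Only the extensions mentioned in this post; a real deployment may support more.
KNOWN_EXTENSIONS = (".html", ".rdf", ".ttl")


def with_extension(url: str, ext: str) -> str:
    """Return the same URL asking for another representation, e.g. 'ttl'."""
    parts = urlsplit(url)
    path = parts.path
    # Strip a known extension if present; leave other dots in names alone.
    for known in KNOWN_EXTENSIONS:
        if path.endswith(known):
            path = path[: -len(known)]
            break
    return urlunsplit(parts._replace(path=path + "." + ext.lstrip(".")))


def parent_paths(path: str) -> list:
    """List the partial paths above a given path, shortest first.

    Under hierarchical navigation, each of these should resolve to a
    listing page instead of an error.
    """
    segments = [segment for segment in path.split("/") if segment]
    return ["/" + "/".join(segments[:i]) for i in range(1, len(segments))]
```

For example, `parent_paths("/id/us/fed/agency")` yields “/id”, “/id/us”, and “/id/us/fed”, which are exactly the kind of partial URIs that used to give an error in the old instance hub.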

There are also a number of improvements I would like to make to the LODSPeaKr instance hub:

  • Better listing pages – I would like to make pages that list out entities more visually and interactively appealing. Having a JavaScript button to toggle expanding and contracting each category list would be one easy way to make these pages more manageable.
  • Google Maps Integration – I want to make each page describing a geographic entity (country, US States, US counties) feature an embedded Google Map area showing it.
  • More categories – the old instance hub features a few categories not in the new one, such as toxic chemicals, and these need to be ported over to the new one.
  • Abstraction of queries – currently each “service” listing page (e.g., “/id”, “/id/us/fed”, “/id/country”) has its own set of queries that only it makes use of, when in fact many of these queries are reused across multiple pages (a query for all US Federal Agencies is needed at “/id”, “/id/us”, “/id/us/fed” and “/id/us/fed/agency”). If a single copy of these reused queries were available to any page in the instance hub, it would be much easier to manage and modify the listings presented.
  • Use of scaffolding services – while I was developing the new instance hub, Alvaro Graves added support for “scaffolding services” into LODSPeaKr. Essentially, scaffolding services allow for the presentation of pages selected through regular expression matching of the URI which is requested. The current instance hub does not have support for county listing pages (e.g., “/id/us/state/STATE/county”), but I could make this possible with the use of a scaffolding service which looks for that pattern and then presents a county listing page for each state’s counties if it is requested by a user.
  • Data set integration – pages should also feature a listing of datasets related to the entity that they describe.
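Two of the planned improvements, shared queries and pattern-matched county pages, can be sketched in Python to show the intent. This is not LODSPeaKr's configuration format or API; the registry layout, the `example.org` vocabulary in the query string, and the function names are all hypothetical.

```python
import re

# Each reused query is defined once, by name (hypothetical vocabulary).
SHARED_QUERIES = {
    "federal_agencies": (
        "SELECT ?agency WHERE { "
        "?agency a <http://example.org/vocab/FederalAgency> }"
    ),
}

# Listing pages refer to shared queries by name instead of keeping copies.
PAGE_QUERIES = {
    "/id": ["federal_agencies"],
    "/id/us": ["federal_agencies"],
    "/id/us/fed": ["federal_agencies"],
    "/id/us/fed/agency": ["federal_agencies"],
}


def queries_for(path: str) -> list:
    """Return the query strings a listing page needs, resolved by name."""
    return [SHARED_QUERIES[name] for name in PAGE_QUERIES.get(path, [])]


# Pattern in the spirit of a scaffolding service: one rule for every state.
COUNTY_LISTING = re.compile(r"^/id/us/state/([^/]+)/county/?$")


def match_county_listing(path: str):
    """Return the STATE segment if the path is a county listing, else None."""
    match = COUNTY_LISTING.match(path)
    return match.group(1) if match else None
```

With this layout, changing the federal agencies query in one place would update every page that lists agencies, and a single pattern would cover the county listing for every state.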

 
