We’re just halfway through the fourth week of the semester here at RPI, but I’ve already made a number of improvements to the TW Instance Hub, with more to come.
- Country and US state pages now present a Google map and flag at the top of the page. The location data is derived from DBPedia, which returns the lat/long of the country’s or state’s capital, so on some of these pages the sizing is a bit awkward and the borders of the entire entity don’t fit in the map’s window, but it’s still better than nothing.
- Toxic chemicals commonly found in US EPA reports are now part of the instance hub. They can be found at: /id/us/fed/agency/Environmental_Protection_Agency/chemical/
- A selection of IOGDS (International Open Government Dataset Search) datasets are now presented at the bottom of country pages for those countries that have datasets in our catalog. As there are over a million of these datasets, currently a selection of the top 100 is served up for each country, but I would like to figure out a better way of integrating Instance Hub and IOGDS. Check out the United Kingdom’s datasets at the bottom of their page here.
- Better backend management – I recently reorganized the backend storage of queries to allow for easier management. Rather than house a copy of each necessary query in the /queries directory of each service, I made a folder to house all my service queries and then just symlinked to the necessary queries from the /queries directory in each service. Abstracting out these queries has made the site much easier to manage, and will definitely be important as we seek to further expand it.
- The listings on intermediate “service” pages (ones that are formed from partial URI paths, but are not URIs themselves, ie: http://logd.tw.rpi.edu/ih2/id or http://logd.tw.rpi.edu/ih2/id/country) now list entities in alphabetical order, making navigation and exploration easier. It’s a minor tweak, but it took a decent amount of work to get this done, as in the process I discovered some bugs in our csv2rdf4lod tool conversion parameters for the instance hub data, and those had to be fixed.
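The shared-query layout described in the list above can be sketched roughly as follows. This is only an illustration, not the actual instance hub setup: the directory names, the `agencies.query` file name, and the query text are all hypothetical, and the demo runs in a temporary directory so it touches nothing real.

```python
import os
import tempfile


def link_shared_queries(shared_dir, service_queries_dir, names):
    """Symlink each named query from shared_dir into one service's queries directory."""
    os.makedirs(service_queries_dir, exist_ok=True)
    for name in names:
        target = os.path.join(shared_dir, name)
        link = os.path.join(service_queries_dir, name)
        # lexists() is true even for a dangling link, so reruns are safe.
        if not os.path.lexists(link):
            os.symlink(target, link)


# Demo in a throwaway directory (all names below are hypothetical).
root = tempfile.mkdtemp()
shared = os.path.join(root, "shared-queries")
os.makedirs(shared)
with open(os.path.join(shared, "agencies.query"), "w") as f:
    f.write("SELECT ?agency WHERE { ?agency a ?type }")

# Two services reuse the single shared copy of the query.
for service in ("id", "id.us.fed"):
    link_shared_queries(
        shared,
        os.path.join(root, "services", service, "queries"),
        ["agencies.query"],
    )
```

Editing the one file in the shared folder then changes the query for every service that links to it.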
I am working on some more exciting improvements and hope to blog again soon about the latest evolution of the TW instance hub.
My work this semester has focused on finishing up porting the old TW instance hub over to a new system based on Alvaro Graves’ LODSPeaKr framework for linked data applications. The new instance hub can currently be found at http://logd.tw.rpi.edu/ih2/id. The LODSPeaKr framework has allowed for the creation of a site that is more powerful and intuitively usable for both users of the service and developers on the backend.
The idea of an instance hub is to provide those using the Semantic Web with a way to express authoritative references to entities, and to provide basic identification data about those entities. Authoritative referencing is important because many things may have similar names, or different names may be used to describe one thing. Take, for example, US states, which are referred to by many different naming schemes – full names, abbreviations, FIPS codes, and variations on these (all caps, all lowercase, first letter capitalized). An instance hub can serve as a disambiguation service by hosting all of these variations, making it easier to link data that might have entirely different ways of referring to the same entity. Instance hubs can also provide basic additional data about the entities that they store – links out to DBPedia URIs about the same entity (or really any owl:sameAs URI), datasets related to the entity, descriptive texts, pictorial representations, and more. Developers can then leverage the data found in the instance hub in the creation of linked data applications.
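To make the disambiguation idea concrete, here is a minimal in-memory sketch. It assumes a hand-written table of name variants and uses `example.org` URIs as stand-ins; the real instance hub stores this information as RDF and is queried via SPARQL, so the structure below is purely illustrative.

```python
# Hypothetical variants table: one authoritative URI per entity, with the
# full name, postal abbreviation, and FIPS code that may refer to it.
STATE_VARIANTS = {
    "http://example.org/id/us/state/New_York": ["New York", "NY", "36"],
    "http://example.org/id/us/state/Vermont": ["Vermont", "VT", "50"],
}


def build_index(variants):
    """Map every case-folded variant to its authoritative URI."""
    index = {}
    for uri, labels in variants.items():
        for label in labels:
            index[label.strip().casefold()] = uri
    return index


INDEX = build_index(STATE_VARIANTS)


def resolve(label):
    """Return the authoritative URI for a label, or None if it is unknown."""
    return INDEX.get(label.strip().casefold())


print(resolve("ny"))
print(resolve("NEW YORK"))
print(resolve("50"))
```

Case folding covers the all-caps and all-lowercase variations, so "ny", "NY", and "New York" all land on the same URI.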
Attached to this post is a poster about my research that I presented at Rensselaer’s fourth annual undergraduate research symposium – URGS Poster.
While it is not currently publicly accessible, I have already put the new instance hub to use in the development of a linked data application for Professor Jim Hendler’s Web Science class this past semester. My Web Science group created a visualization of asthma attacks per state versus smoking prevalence per state. We used the instance hub as a way of linking two datasets with different ways of referring to US states, and as a way to get data about individual states if users were interested in learning more about them. In the first case, we had one dataset that referred to states by their full names and another that used only two-letter abbreviations, with absolutely no other fields that could be used to establish commonality between them. The instance hub allowed me to write a single query which grabbed data from one dataset, then got the abbreviation for each state in that dataset using the instance hub, and then grabbed data from the other dataset for each state on the basis of the abbreviations. In the second case, we used the instance hub to present data to users when they clicked on a state in a Google Map in the visualization. When the state was clicked, we ran a query against the LOGD SPARQL endpoint to retrieve data on the state, including its DBPedia URI, and this data, in conjunction with data found in DBPedia, was then presented to the user.
The new instance hub is an improvement upon the old one in several ways:
- Hierarchical navigation of URIs – previously, while instances were presented at URIs that were authoritative and descriptive of the thing found at the URI (ie: NOAA is found at http://logd.tw.rpi.edu/id/us/fed/agency_page/Department_of_Commerce/National_Oceanic_and_Atmospheric_Administration, and the fact it is a US federal agency under the Department of Commerce is encoded in its URI), the URIs themselves were not navigable in the way one might expect them to be – going to a partial URI such as http://logd.tw.rpi.edu/id/us/fed or http://logd.tw.rpi.edu/id/us/fed_page would give an error rather than presenting information about US federal government related concepts and entities found in the instance hub. With LODSPeaKr, I was able to easily define “services” which fire off one or more SPARQL queries and then generate a page presenting a listing of entities which fall under the category defined by the URI fragment which is requested (ie: countries at “/id/country”, US states at “/id/us/state”, US federal entities at “/id/us/fed”). Examples of this can be found by clicking on the various headings found at http://logd.tw.rpi.edu/ih2/id, which then lead to more specific categorical listing pages.
- Typing on instances – in the previous instance hub, entities did not have an rdf:type, which is a necessary prerequisite for using LODSPeaKr services. I had to use the csv2rdf4lod tool to reconvert many of the instance hub data sets so that they would be typed for LODSPeaKr. While we decided that using a local vocabulary was the best way to handle this typing, it still provides users with more data about the entities they find in the instance hub, and allows for easier retrieval of data via the TW SPARQL endpoint, as users can query on the basis of type.
- Flexibility and ease of extension – the previous instance hub was built from PHP scripts that handled data retrieval and presentation. With the move to LODSPeaKr, development is much easier, as the framework handles all the backend logic of querying and parsing retrieved data, leaving the developer to simply write the queries they need for the page, specify the SPARQL endpoints that they should go to, and then use the Haanga templating engine built into LODSPeaKr to define the front end presentation style of pages.
- Move away from “_page” convention – the previous PHP based instance hub used the convention of appending “_page” to any entity that was being presented with a human readable HTML page, as opposed to a content-negotiated RDF dump. With LODSPeaKr, an extension appended to the end of the URI defines the type of information that is presented, with “.html” as a default if the user doesn’t specifically ask for something different. Users exploring the data in an HTML based web browser who then wish to retrieve an RDF representation of the data they are viewing can easily get it by simply replacing “.html” with an RDF format file extension – “.rdf”, “.ttl”, etc.
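The extension swap described in the last item above is simple enough to sketch in a few lines. This is a generic illustration rather than anything LODSPeaKr itself provides; the `with_extension` helper and the set of known extensions are assumptions, as is the exact `.html` address used in the demo.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Extensions treated as format suffixes (".rdf" and ".ttl" come from the
# post; treating exactly this set as "known" is an assumption).
KNOWN_EXTENSIONS = {".html", ".rdf", ".ttl"}


def with_extension(uri, ext):
    """Return uri with its format extension replaced (or appended) by ext."""
    parts = urlsplit(uri)
    base, current = posixpath.splitext(parts.path)
    if current not in KNOWN_EXTENSIONS:
        # No recognised format suffix: keep the whole path and append,
        # so names that merely contain a dot are left intact.
        base = parts.path
    return urlunsplit(parts._replace(path=base + ext))


html_page = "http://logd.tw.rpi.edu/ih2/id/country.html"
print(with_extension(html_page, ".ttl"))
print(with_extension("http://logd.tw.rpi.edu/ih2/id/country", ".rdf"))
```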
There are also a number of improvements I would like to make to the LODSPeaKr instance hub:
- Google Maps Integration – I want to make each page describing a geographic entity (country, US States, US counties) feature an embedded Google Map area showing it.
- More categories – the old instance hub features a few categories not in the new one, such as toxic chemicals, and these need to be ported over to the new one.
- Abstraction of queries – currently each “service” listing page (ie: “/id”, “/id/us/fed”, “/id/country”) has its own set of queries that only it makes use of, when in fact many of these queries are reused across multiple pages (a query for all US Federal Agencies is needed at “/id”, “/id/us”, “/id/us/fed” and “/id/us/fed/agency”). If a single copy of these reused queries were available to any page in the instance hub, it would be much easier to manage and modify the listings presented.
- Use of scaffolding services – while I was developing the new instance hub, Alvaro Graves added support for “scaffolding services” into LODSPeaKr. Essentially, scaffolding services allow for the presentation of pages selected through regular expression matching of the URI which is requested. The current instance hub does not have support for county listing pages (ie: “/id/us/state/STATE/county”), but I could make this possible with the use of a scaffolding service which looks for that pattern and then presents a county listing page for each state’s counties if it is requested by a user.
- Data set integration – pages should also feature a listing of datasets related to the entity that they describe.
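The pattern matching behind the county listing idea in the list above might look something like this. It is a plain regular-expression sketch, not LODSPeaKr’s actual scaffolding configuration (whose syntax is not shown here); the pattern and the `match_county_listing` helper are assumptions made for illustration.

```python
import re

# Matches paths of the form /id/us/state/STATE/county, capturing STATE.
COUNTY_LISTING = re.compile(r"^/id/us/state/(?P<state>[^/]+)/county/?$")


def match_county_listing(path):
    """Return the state segment if path is a county listing page, else None."""
    match = COUNTY_LISTING.match(path)
    return match.group("state") if match else None


print(match_county_listing("/id/us/state/New_York/county"))
print(match_county_listing("/id/us/state/New_York"))
```

The captured state name is what a county listing query would then be parameterized with.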
This semester at TWC I’ve been working on improving the instance hub project. Over the summer I worked at Data.gov on building an instance hub, and I’ve been trying to apply my knowledge to work here at the lab.
I spent this past summer in Washington, DC, as an intern at Data.gov, part of the US General Services Administration. It was a great experience spending a second summer in Washington, and I learned a lot. Most of my work at Data.gov involved building an instance hub of US federal agencies that report to the site. The purpose of the instance hub is to provide authoritative URIs for US federal agencies and to allow for easy linking to and retrieval of data available on them from Data.gov. The process of developing the instance hub began with a SQL dump given to me by Data.gov’s software architect. I threw the SQL dump into Google Refine where I cleaned up and enriched the data. I then exported the refined data as a CSV file and wrote enhancement parameters for Tim Lebo’s csv2rdf4lod tool to convert it to RDF. After compiling an instance of Virtuoso, I imported the RDF and wrote SPARQL queries to get data out. Finally, I wrote a site in PHP to present a listing of agencies as well as individual agency pages. I thought about using a more complicated framework, but in the interest of making the site highly portable and easy to install on Data.gov servers, I opted to just do pure PHP.
Working with just PHP was a really interesting learning experience, as I got to see how things like content negotiation are handled from the inside. I wrote code to send HTTP 303 redirects in response to requests asking for HTML, and to handle the delivery of content to users in various formats as requested in the HTTP Accept header (RDF/XML, Turtle, JSON).
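For readers unfamiliar with content negotiation, the logic can be sketched as below. This is a simplified Python illustration, not the PHP code described above: the list of supported media types, the `_page` redirect target, and the function names are assumptions, and partial wildcards such as `text/*` are deliberately ignored.

```python
# Formats the sketch can serve, in order of server preference (assumed).
SUPPORTED = ["text/html", "application/rdf+xml", "text/turtle", "application/json"]


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media = parts[0].lower()
        if not media:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs.append((media, q))
    # sorted() is stable, so equal q values keep the client's order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def negotiate(header, supported=SUPPORTED):
    """Pick the best supported media type, or None if nothing is acceptable."""
    for media, q in parse_accept(header):
        if q <= 0:
            continue
        if media in supported:
            return media
        if media == "*/*":
            return supported[0]
    return None


def respond(entity_uri, accept_header):
    """Return (status, headers): 303 to an HTML page, 200 for data, else 406."""
    media = negotiate(accept_header or "*/*")
    if media is None:
        return 406, {}
    if media == "text/html":
        # Assumed convention: the human-readable page lives at URI + "_page".
        return 303, {"Location": entity_uri + "_page"}
    return 200, {"Content-Type": media}


print(respond("http://example.org/id/agency/X", "text/html"))
print(respond("http://example.org/id/agency/X", "text/turtle;q=0.9, text/html;q=0.5"))
```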
At the end of the summer I had a fully functional instance hub of agencies reporting to Data.gov, presenting information about them including name, name abbreviation, logo, and website in HTML and supporting content negotiation for RDF representations of the agencies. All data for the site was dynamically queried for from a SPARQL endpoint running on my computer.
I gave a talk on my work in greater technical depth in October for a TWed night. A recording can be found here: http://www.ustream.tv/recorded/25884279
I began my semester at TWC by making a few improvements to my Data.gov instance hub, most notably changing the way that SPARQL queries are used. In the initial version of the site I simply took request URIs and inserted them into a long query string, which was a very ugly, hard-to-work-with solution. After getting some advice from Dominic at TWC, I switched to hosting the SPARQL queries in a separate folder as PHP files. I had PHP fill in a GET variable in the queries where an agency URI would go. This allowed me to then simply grab the query string found in the query file after making a request with the agency URI as a GET variable. I then escaped this string and submitted it to the database to get relevant information out. Not only is this a cleaner way of doing things, it is also more easily maintained; and by abstracting the queries out to another file, they can be reused in multiple contexts.
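The same idea, keeping the query as a template and filling in the agency URI at request time, can be sketched in Python. This is an analogue of the approach rather than the PHP implementation itself: the query text, the `$agency_uri` placeholder, and the validation rule are assumptions, and the template is inlined here only to keep the example self-contained (in practice it would be read from its own file).

```python
import re
from string import Template

# Hypothetical query template; SPARQL's braces do not clash with Template.
AGENCY_QUERY = Template(
    "SELECT ?property ?value\n"
    "WHERE { <$agency_uri> ?property ?value }\n"
)

# Characters that may not appear inside a SPARQL <...> IRI reference.
_ILLEGAL_IRI_CHARS = re.compile(r'[<>"{}|^`\\\s]')


def build_query(template, agency_uri):
    """Fill the agency URI into the template, refusing unsafe input."""
    if _ILLEGAL_IRI_CHARS.search(agency_uri):
        raise ValueError("agency URI contains characters not allowed in an IRI")
    return template.substitute(agency_uri=agency_uri)


print(build_query(AGENCY_QUERY, "http://example.org/id/agency/X"))
```

Rejecting characters that could close the `<...>` reference early is a conservative stand-in for the escaping step mentioned above.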
My next task involved porting the instance hub I made over to Alvaro Graves’ LODSPeaKr framework. I installed an instance of LODSPeaKr on my laptop and configured it to point at my local SPARQL endpoint. I then configured the system to present a layout of information about agencies in the instance hub.
After getting a bit of experience with using LODSPeaKr, I moved on to looking at the current Drupal based instance hub and figuring out how to migrate it over to LODSPeaKr. The Drupal approach is very hacky and hard to deal with, and LODSPeaKr would provide a better interface and significantly easier operation. One of the things I’ve been looking at is using LODSPeaKr to allow for presentation of information at partial URIs. So we might have the URI http://logd.tw.rpi.edu/id/us/fed/agency/Department_of_Health_and_Human_Services/Agency_for_Healthcare_Research_and_Quality. This URI cannot be deconstructed – there is nothing at http://logd.tw.rpi.edu/id/us/fed/agency or http://logd.tw.rpi.edu/id/us/fed/. Using LODSPeaKr would allow us to present information at each successive level of specificity in the URI. So “http://logd.tw.rpi.edu/id/us/fed/” could present everything related to the US federal government, while “http://logd.tw.rpi.edu/id/us/fed/agency” could present a listing of US Federal Agencies.
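To show what “each successive level of specificity” means in practice, this small sketch lists every partial URI contained in a full entity URI. The `partial_uris` helper is hypothetical and says nothing about how LODSPeaKr routes requests; it only enumerates the addresses at which a listing page could be served.

```python
from urllib.parse import urlsplit


def partial_uris(uri):
    """Return every prefix of the URI path, from least to most specific."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    origin = f"{parts.scheme}://{parts.netloc}"
    return [
        origin + "/" + "/".join(segments[:depth])
        for depth in range(1, len(segments) + 1)
    ]


full = (
    "http://logd.tw.rpi.edu/id/us/fed/agency/"
    "Department_of_Health_and_Human_Services/"
    "Agency_for_Healthcare_Research_and_Quality"
)
for level in partial_uris(full):
    print(level)
```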
My current problem is that much of the data in the instance hub backend is not typed, so I don’t have a way of picking out everything that is, say, US and Federal. I’m currently working on how to address this issue. Even if I don’t get it done before the impending close of the semester, I plan on working on this project until it’s done, as I’d really like to see the instance hub improved.
This semester I’ve been attending the weekly Data.gov/Linked Open Government Data meetings at Tetherless World in preparation for a summer internship at Data.gov in Washington, DC. I’m still working through the hiring process for the internship, as there are a number of requirements to complete before actually being formally hired. Additionally, I’ve been trying to do some work with the csv2rdf4lod tool to convert government data on mine safety. I’ve gotten the converter working, but I need to work on enhancing the data conversion manually.
Yesterday I presented a poster on the OrgPedia visualization from last semester at RPI’s third annual Undergraduate Research Symposium. The poster can be found on the TWC site here: http://tw.rpi.edu/web/doc/Using_d3js_Visual_Corporate_Board
This summer it’s looking like I’ll be interning at Data.gov, so I’ve been trying to get more of a feel for the LOGD (linked open government data) programs going on at Tetherless World. I’ve been attending weekly meetings about Data.gov related LOGD projects at TWC, and I’ve also been tasked with working with the csv2rdf4lod tool to convert some government data on mine safety. So far I’ve been extremely busy with 22 academic credits before research and extracurriculars, but I’m enjoying getting to work with more government related research.
This semester at TWC I assisted Xian Li with the OrgPedia organizational transparency site. While I tried to help a bit with the OrgPedia pages and the information they present, my most significant contribution to the project was the board members network demo I created with Bharath Santosh. I already blogged about the creation of the demo here. It was definitely a highlight of the semester for me, and now that I have some experience with using Python to pull and format web data, I’d like to further explore visualization with the d3.js library – perhaps next semester.
Outside of TWC, in the first half of the semester I worked on an independent study about my experiences as an intern in the US House of Representatives this past summer. I ended up writing almost forty single-spaced pages on the role of information technology in the federal government as I observed it both on the clerical day-to-day side and in larger policy issues such as open government data and cybersecurity strategy. My experience in Washington taught me a lot about the policy implications that the sort of work done at the TWC can have, and on a broader level, the interplay of technology and policy at large.
Overall, I had a great semester, and I’m looking forward to a busy schedule in the new year. I plan to continue my involvement with the TWC Undergraduate Lab.
I recently worked with Bharath Santosh to help Xian Li with a demo of the OrgPedia organizational transparency site. The project involved creating an interactive graph visualization of connections between members of corporate boards (the final product can be found here). Given a list of a few hundred stock tickers and access to the LittleSis API, the goal was to ultimately produce a JSON file of board members that could be used by the D3.js force-directed graph framework. I started by looking up each ticker symbol, yielding a JSON file with a unique ID number for each company. My script then queried the API for the actual company page associated with that ID and stored the names, company associations, and URIs of each board member. Finally, a JSON file for the D3.js graph was output describing the ~2800 board members and the links between each of them.
While I had used Python a bit for command line scripting, I hadn’t really dug into it before this project. The work gave me a better taste for the language and its capabilities. I made extensive use of the “urllib” library for accessing web content, and worked with opening up the data in JSON files. Bharath helped me with the syntax of the program and some of the graph construction. While I was aware of Python’s reputation for ease of use and high level abstraction, working with it let me experience this abstraction first hand, and I was very impressed. The ease with which complex multistep operations could be completed let me focus more on the flow of the data through the process rather than the specifics of handling it. The project also gave me a bit more hands on experience with JSON.
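The final step, turning board memberships into a nodes-and-links JSON structure for a force-directed graph, might look roughly like this. It is a sketch under several assumptions: the input is a plain dictionary with placeholder tickers and names (no LittleSis API calls are shown), two people are linked when they sit on the same board, and the `build_graph` helper and field names are illustrative rather than the script’s actual output format.

```python
import json
from itertools import combinations


def build_graph(boards):
    """Build {"nodes": [...], "links": [...]} from {company: [member names]}."""
    index = {}   # member name -> position in the nodes list
    nodes = []
    links = set()
    for company, members in boards.items():
        for name in members:
            if name not in index:
                index[name] = len(nodes)
                nodes.append({"name": name, "companies": []})
            nodes[index[name]]["companies"].append(company)
        # Link every pair of people who share this board (assumed rule).
        ids = sorted({index[name] for name in members})
        links.update(combinations(ids, 2))
    return {
        "nodes": nodes,
        "links": [{"source": s, "target": t} for s, t in sorted(links)],
    }


# Placeholder data, invented purely for illustration.
boards = {
    "AAA": ["Person A", "Person B"],
    "BBB": ["Person B", "Person C"],
}
graph = build_graph(boards)
print(json.dumps(graph, indent=2))
```

Links refer to nodes by their position in the `nodes` list, which is one of the forms a D3 force layout accepts.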