
Summer/Fall 2012 at TWC

This semester at TWC I’ve been working on improving the instance hub project. Over the summer I worked at Data.gov on building an instance hub, and I’ve been trying to apply my knowledge to work here at the lab.

I spent this past summer in Washington, DC, as an intern at Data.gov, part of the US General Services Administration. It was a great experience spending a second summer in Washington, and I learned a lot. Most of my work at Data.gov involved building an instance hub of US federal agencies that report to the site. The purpose of the instance hub is to provide authoritative URIs for US federal agencies and to allow for easy linking to and retrieval of data available on them from Data.gov. The process of developing the instance hub began with a SQL dump given to me by Data.gov’s software architect. I threw the SQL dump into Google Refine where I cleaned up and enriched the data. I then exported the refined data as a CSV file and wrote enhancement parameters for Tim Lebo’s csv2rdf4lod tool to convert it to RDF. After compiling an instance of Virtuoso, I imported the RDF and wrote SPARQL queries to get data out. Finally, I wrote a site in PHP to present a listing of agencies as well as individual agency pages. I thought about using a more complicated framework, but in the interest of making the site highly portable and easy to install on Data.gov servers, I opted to just do pure PHP.

Working with just PHP was a really interesting learning experience, as I got to see how things like content negotiation are handled from the inside. I wrote code to send HTTP 303 redirects in response to requests asking for HTML, and to deliver content to users in the format requested in the HTTP Accept header (RDF/XML, Turtle, or JSON).
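To make the content negotiation step concrete, here is a minimal sketch in Python rather than the PHP the site was actually written in. The media type table, the "/page" prefix for HTML pages, and the function names are all assumptions for illustration; the post does not give the real URL layout.

```python
# Hypothetical set of representations; the real site's exact media
# types are not listed in the post.
SUPPORTED = {"text/html", "application/rdf+xml", "text/turtle", "application/json"}


def parse_accept(header):
    """Return (media_type, q) pairs sorted by descending q value."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so ties keep the order the client listed them in
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(accept_header, default="text/html"):
    """Pick the best supported media type, falling back to the default."""
    for media_type, q in parse_accept(accept_header or ""):
        if q <= 0:
            continue
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return default
    return default


def respond(agency_path, accept_header):
    """Return (status, headers) for a request on an agency identifier URI."""
    media_type = negotiate(accept_header)
    if media_type == "text/html":
        # Assumed layout: HTML pages live under a hypothetical /page prefix
        return 303, {"Location": "/page" + agency_path, "Vary": "Accept"}
    return 200, {"Content-Type": media_type, "Vary": "Accept"}
```

With this sketch, a browser asking for HTML gets redirected with a 303, while a client sending "Accept: text/turtle" gets the Turtle representation served directly.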

At the end of the summer I had a fully functional instance hub of agencies reporting to Data.gov. It presented information about each agency (name, abbreviation, logo, and website) in HTML and supported content negotiation for RDF representations of the agencies. All data for the site was queried dynamically from a SPARQL endpoint running on my computer.

I gave a talk on my work in greater technical depth in October for a TWed night. A recording can be found here: http://www.ustream.tv/recorded/25884279


I began my semester at TWC by making a few improvements to my Data.gov instance hub, most notably changing the way that SPARQL queries are used. In the initial version of the site I simply took request URIs and inserted them into a long query string, which was a very ugly solution that was hard to work with. After getting some advice from Dominic at TWC, I switched to hosting the SPARQL queries in a separate folder as PHP files. I had PHP fill in a GET variable in the queries where an agency URI would go. This allowed me to simply grab the query string produced by the query file after making a request with the agency URI as a GET variable. I then escaped this string and submitted it to the database to get the relevant information out. Not only is this a cleaner way of doing things, it is also easier to maintain, and because the queries are abstracted out into separate files, they can be reused in multiple contexts.
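The idea of keeping queries apart from the page code can be sketched roughly as follows. This is Python, not the PHP files described above, and the query text, the `$agency` placeholder, and the `build_query` helper are hypothetical; the real queries are not shown in this post. The sketch also checks the URI before filling it in, which is one possible way to do the escaping step.

```python
import re
from string import Template

# Hypothetical query template; in the setup described above, this text
# would live in its own file in a separate queries folder.
AGENCY_QUERY = Template("""\
SELECT ?property ?value
WHERE {
  <$agency> ?property ?value .
}
""")

# Characters the SPARQL grammar does not allow inside an IRI reference <...>
_FORBIDDEN = re.compile(r'[<>"{}|^`\\\x00-\x20]')


def build_query(template, agency_uri):
    """Fill the agency URI into the template, rejecting unsafe input."""
    if _FORBIDDEN.search(agency_uri):
        raise ValueError("unsafe characters in URI: %r" % agency_uri)
    return template.substitute(agency=agency_uri)
```

Calling `build_query(AGENCY_QUERY, uri)` returns a query string ready to send to the endpoint, and the same template can be reused by any page that needs it.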

My next task involved porting the instance hub I made over to Alvaro Graves’ LODSPeaKr framework. I installed an instance of LODSPeaKr on my laptop and configured it to point at my local SPARQL endpoint. I then configured the system to present a layout of information about agencies in the instance hub.

After getting a bit of experience with LODSPeaKr, I moved on to looking at the current Drupal-based instance hub and figuring out how to migrate it over to LODSPeaKr. The Drupal approach is very hacky and hard to deal with, and LODSPeaKr would provide a better interface and significantly easier operation. One of the things I’ve been looking at is using LODSPeaKr to present information at partial URIs. So we might have the URI http://logd.tw.rpi.edu/id/us/fed/agency/Department_of_Health_and_Human_Services/Agency_for_Healthcare_Research_and_Quality. This URI cannot be deconstructed: there is nothing at http://logd.tw.rpi.edu/id/us/fed/agency or http://logd.tw.rpi.edu/id/us/fed/. Using LODSPeaKr would allow us to present information at each successive level of specificity in the URI. So “http://logd.tw.rpi.edu/id/us/fed/” could present everything related to the US federal government, while “http://logd.tw.rpi.edu/id/us/fed/agency” could present a listing of US federal agencies.
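As a rough illustration of what "each successive level of specificity" means, the following Python sketch lists the partial URIs that sit above a full agency URI. It says nothing about how LODSPeaKr itself is configured; the `uri_levels` function and the assumption that "/id" is the base path are mine.

```python
from urllib.parse import urlsplit, urlunsplit


def uri_levels(uri, base_path="/id"):
    """List every partial URI below base_path, from least to most specific."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    base = [s for s in base_path.split("/") if s]
    if segments[:len(base)] != base:
        raise ValueError("URI is not under %s: %s" % (base_path, uri))
    levels = []
    for depth in range(len(base) + 1, len(segments) + 1):
        path = "/" + "/".join(segments[:depth])
        levels.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return levels
```

For the agency URI above, this yields http://logd.tw.rpi.edu/id/us, then http://logd.tw.rpi.edu/id/us/fed, then http://logd.tw.rpi.edu/id/us/fed/agency, and so on down to the full URI. Each of those would need its own page for the URI to be fully deconstructable.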

My current problem is that much of the data in the instance hub backend is not typed, so I don’t have a way of picking out everything that is, say, US and Federal. I’m currently working on how to address this issue. Even if I don’t get it done before the impending close of the semester, I plan on working on this project until it’s done, as I’d really like to see the instance hub improved.

  1. December 1, 2012 at 5:36 pm

    Great work Alexei. The work over the summer sounded exciting. I for one am thankful for representing our lab so well. And great work this semester. I for one hope that you decided to continue with our lab.

  2. Bud
    December 1, 2012 at 8:02 pm

    Check out Palantir.com I’m interested if the software might be able to enrich your data analysis….

