
Tales from the Port: Part 2 — Migrating the Database

In retrospect, maybe I shouldn’t have promised to write a blog post every night this week. The port has been going well, but I’ve been working late each night, and it’s just too hard to write clear English prose starting at midnight. So here, at last, is the promised post on migrating Project Quincy’s database from Rails to Django.

My first love in Digital Humanities is data modeling and database architecture. The actual “code” in Project Quincy is pretty basic by professional programming standards; the underlying data structure is the real intellectual achievement. I spent six months of my nine-month fellowship at the Scholars’ Lab designing a database that would effectively and efficiently model historical sources and allow scholars to catalog and analyze their research in meaningful ways. I even wrote a program called DAVILA to auto-generate interactive, color-coded, annotated diagrams of my schema to show other historians how the system works. After all that work was done, designing the interface for The Early American Foreign Service Database (EAFSD) took about two weeks.

As I mentioned last time, Rails and Django are similar frameworks for connecting databases to websites. Both have procedures for creating new database instances in any of several open source systems: MySQL, PostgreSQL, or SQLite3. But I already have a MySQL database with all the information I’ve been entering for the last three years. I really didn’t want to redo all that work, so I kept the same underlying database and connected it to the new Django project, with a few minor changes.

In the past three years, I’ve found a few shortcomings in the data model I created, so I’ve used the port as an opportunity to add a few more tables. Project Quincy records a “latitude” and “longitude” point for every location in the database, but I forgot to indicate which geographic coordinate system the latitude and longitude came from. Luckily for me, all my coordinates were in the same system, so my maps work properly. But I can’t count on that forever, so I added a table called CoordinateSystem. I also extended the table that records which individuals were members of a specific organization. It had a field called “role,” but there was no way to create a list of all those roles and reuse them. I added two new tables, “RoleTitle” and “RoleType,” to allow for lists and grouping by type.

Then there were a few changes required by Django, mostly to my Footnotes module. Since Project Quincy is designed to store scholarly research, it gives users the ability to ‘footnote’ any record in the system by attaching the record to a cited source and saying whether or not that source supports the information in the record. This is accomplished by the Validations table, which can (but does not have to) be connected to any record in the database. This type of unspecified relationship is known as a “polymorphic association,” and Rails and Django implement polymorphic associations differently. Rails uses the name of the table to create the relationship. Django makes a meta-table that holds the names of all the other tables and assigns each a numeric key, so I had to replace the table names in my Validations table with their new keys. Figuring out how to do this took a post to the ever-helpful Stack Overflow, and then I was back in business. The old Footnotes module also had a little “Users” table that kept track of the people who could upload into the system. Django comes with a very powerful authentication system which also records users, so I got rid of my little table and hooked the footnotes module directly into the django_auth_user table.
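The key swap can be sketched with a toy example. The real change ran against MySQL (Django’s meta-table of installed models is `django_content_type`), but here I use an in-memory SQLite database for illustration, and the table and model names are hypothetical stand-ins rather than Project Quincy’s actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Django's meta-table of installed models (simplified)
    CREATE TABLE django_content_type (id INTEGER PRIMARY KEY, model TEXT);
    -- The Validations table, still holding Rails-style name strings
    CREATE TABLE validations (id INTEGER PRIMARY KEY, record_type TEXT);
    INSERT INTO django_content_type VALUES (1, 'location'), (2, 'letter');
    INSERT INTO validations VALUES (10, 'location'), (11, 'letter');
""")

# Add the numeric foreign key Django expects, then fill it in by matching
# the old name strings against the meta-table.
con.executescript("""
    ALTER TABLE validations ADD COLUMN content_type_id INTEGER;
    UPDATE validations SET content_type_id = (
        SELECT id FROM django_content_type
        WHERE django_content_type.model = validations.record_type
    );
""")

rows = con.execute(
    "SELECT id, content_type_id FROM validations ORDER BY id"
).fetchall()
print(rows)  # [(10, 1), (11, 2)]
```

Once every row carries the numeric key, the old name column can be dropped and the foreign key pointed at the meta-table.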

I had greater plans to include an “Events” module. But, as I started to design one, I realized that this was not a decision I should make on my own and under a deadline. Project Quincy is an open source project, and I want other scholars to use it for their research. I need to do more reading on modeling events and talk to people before I commit to one particular structure.

So how did I actually migrate the database? MySQL has a nice command for backing up and redeploying a database: mysqldump. I took a dump (yes, you read that correctly :-) of the database off my server and used it to create a transition database on my local machine. I then went in and made my changes to the transition database directly, safe in the knowledge that I could always restore the original database if I messed up. Once I had the transition database the way I wanted it, I made a second dump and used it to populate the database Django had already created for the new project.
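The round trip looks something like the following. The host, user, and database names here are hypothetical placeholders, not the EAFSD’s actual credentials.

```shell
# 1. Back up the live database from the server:
mysqldump -u quincy_user -p eafsd_production > eafsd_backup.sql

# 2. Load the backup into a local "transition" database that is safe to edit:
mysql -u root -p -e "CREATE DATABASE eafsd_transition"
mysql -u root -p eafsd_transition < eafsd_backup.sql

# 3. After making the schema changes, dump the transition database and load
#    it into the database Django already created for the new project:
mysqldump -u root -p eafsd_transition > eafsd_transition.sql
mysql -u root -p eafsd_django < eafsd_transition.sql
```

Because step 2 works on a copy, any mistake can be undone by reloading the original backup.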

Once all my data was in the new database, I ran an extremely helpful Django command, ‘inspectdb.’ This lovely little program examined my database and created a file with its best guess at how to represent each database table in Django syntax. Then all I had to do was check for errors, and there weren’t many. It mistook my boolean (true/false) fields for integers and wanted me to specify some additional information for self joins (tables containing more than one relationship to the same second table).
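The two fix-ups look something like this in the generated models file. Everything below is illustrative, not Project Quincy’s real schema: the model and field names are hypothetical, and the `related_name` values are the kind of information inspectdb asks you to supply by hand (the `ForeignKey` form shown is the Django 1.x signature current when this was written).

```python
# Generated by: python manage.py inspectdb > models.py  (then hand-corrected)
from django.db import models

class Location(models.Model):
    name = models.CharField(max_length=255)
    # inspectdb guessed: current = models.IntegerField()
    current = models.BooleanField()  # MySQL stores booleans as tinyint(1)

class Route(models.Model):
    # Two relationships to the same second table need distinct
    # related_names, which inspectdb cannot guess on its own.
    origin = models.ForeignKey(Location, related_name="routes_out")
    destination = models.ForeignKey(Location, related_name="routes_in")

    class Meta:
        db_table = "routes"  # keep the existing MySQL table name
```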

Once I had the tables properly represented, it was time to sort them into their appropriate ‘applications.’ One of the biggest differences between Rails and Django is their file structure. Rails creates a folder (with its own nested folders) for every table in the database. Django asks developers to chunk their database into folders called applications, designed to keep similar functions together in the system. Project Quincy was always designed with six modules: Biographical, Locations, Correspondence, Citations, Organizations, and Assignments. Each of these modules has two to eight database tables inside it. One of the biggest decisions I had to make in planning this port was how to use applications. Should I put everything in one app folder, create an app for every module, or find a new way of grouping my system?

To make the decision, I wrote out index cards for each module listing the tables involved and the other modules it related to. I realized that Assignments and Organizations both brought people to a location for a reason, and that I would likely be visualizing those two kinds of relationships in very similar ways. But what should I call the new app? I ran the idea past my father, who has been designing databases since before I was born and recently moved his own development work to Python and Django. He suggested the name “Activities,” and that my future Events module could go in the same application.

After I sorted my tables into their appropriate (and newly created) applications, I synced my Django project with the underlying database. So far, everything looks good.

Tales from the Port: Day 1 — Dry Dock

Welcome to my one week blog series, Tales from the Port, chronicling my rewriting of Project Quincy from Ruby on Rails to Django. This series may be a little rough around the edges — I’ll be writing it every night after I accomplish my goals for that day. But I wanted to give people a window into the life of (at least one) Digital Humanities developer: to see what it’s like to imperfectly translate your research and theories into lines of code and then watch your project come ‘alive.’

Of course, Digital Humanities is not about writing code or knowing how to program. DH is a community of people searching for a new way of working and researching, and we find inspiration in many disciplines. But, this is going to be one of the more intense work weeks of my life to date, so I’m hoping you’ll keep me company.

First, some background:

Project Quincy is an open source software package I wrote a few years back to trace historical networks through time and space. It is an integral part of my dissertation and currently runs The Early American Foreign Service Database, which went live almost two years ago on October 18, 2010. Project Quincy got its start at the University of Virginia’s Scholars’ Lab, where I was a graduate student fellow in 2008-2009. I was very pleased with the system when I first designed it, but technology doesn’t stop when your fellowship ends. Faced with an aging code base and an interface that could no longer accommodate the visual arguments becoming more and more central to my dissertation, I knew it was time to upgrade. I could have taken Project Quincy from Rails 2.3.8 to 3.0 and tweaked the stylesheet along the way, but I am no longer at the University of Virginia. Last summer I was hired by the Brown University Library as their first Digital Humanities Librarian. My new colleagues program (mostly) in Django, and I’ve already met one or two professors here who could probably use the system for their own research. It was time to learn some new skills.

I have thoroughly enjoyed learning Python and Django, so much so that I will probably write more on them once this week is over. Since finishing up the tutorials, I have spent the last two weeks planning the port. As the week unfolds I’ll discuss how the system is changing and my reasons for making those changes. Although both Django and Rails exist to connect databases to websites with minimal headaches for the programmer, they have different affordances and make very different assumptions about what constitutes beautiful code.

So what have I done today?

Today I created the new Django project which I will be extending into ProjectQuincy. I had hoped to have the entire data model rewritten by now, but no plan survives contact with a new development environment . . . Apparently, when I got my new MacBook Pro, my MySQL installation did not survive the transfer. It took a few hours of research, and then a reinstall of MacPorts, before I could really get underway. I will have more to say tomorrow on my changes to the data structure.

Planning this port has been a bittersweet experience. I’ve had a great deal of fun learning a new language and framework. My colleagues at Brown, particularly Birkin Diana and Joseph Rhoads, have been extremely helpful: suggesting good training materials, answering questions, and teaching me the ever crucial “best practices.” Thanks to their help, I am looking forward to having a cleaner, more robust system. But, my fellowship year at Scholars’ Lab is a cherished memory, and so many people there helped and taught me as I figured out how to make The Early American Foreign Service Database a reality. As I worked on the project my friends pitched in, putting their own stamp on the code base. This new, fresh start won’t have code from Bess Sadler, Matt Mitchell, Joe Gilbert, or Wayne Graham. For a little while it will just be my code, and that feels a little lonely. But it won’t last. Soon I’ll be showing the system to my colleagues at Brown, and I can’t wait to see Project Quincy afresh through their eyes.

Am I even qualified?: Writing about Digital History

About two weeks ago, my article “Fielding History: Relational Databases and Prose” went online for open peer review and possible inclusion in the open access essay collection Writing History in the Digital Age, edited by Jack A. Dougherty and Kristen Nawrotzki. If you haven’t heard about Writing History in the Digital Age, you owe it to yourself to head over to the website and learn more. It is a fabulous project and experiment in open peer review and open access publishing. I am honored that my essay has made it this far.

As part of the experiment, the editors have asked their prospective contributors to publicly reflect on their writing process. That’s something I have neglected to do until now, for a variety of reasons . . . the main reason being that this may be the most difficult essay I have ever written.

OK, my master’s thesis was probably harder, but that was five times as long, and I worked on it (on and off) for 4 years before sending it out to be traditionally peer-reviewed. “Fielding History” clocks in at 3,077 words and I only had six weeks to write it. Six weeks in which I nearly gave up on the project approximately four times. One of these times (but not the last), my husband asked why I was quitting. Before I could think, I responded:

“Because I am only qualified to write about 18th Century Diplomats!”

He just stared at me. I stared back. Until the absurdity hit me. I’m an open source developer, a database architect, and a historian. I’ve developed and built databases for three distinct historical projects (including my own dissertation). If I can’t write an article on relational databases and historical writing, then who can?

But that is the trick with an emerging field. It’s hard to know when you are ready to write about it. I know how to become credentialed to write about my 18th Century diplomats. I’m getting a PhD. Most of the time I’m very happy with the boot-strapping culture that imbues Digital Humanities with so much of its energy and allows someone like me to have a great job and contribute to the field without having to wait for tenure or even my dissertation committee to declare me “fit.”

But then I have to write about what I’m doing, and all the doubts pile in. I always wanted to write about the theory of history, even before I entered a PhD program, but I thought I would do it after I retired — an emerita’s retrospective on a life of study. What right do I have to do this when I’m only 29?

What finally got me writing was the realization (voiced by my husband) that this was not about having the final word or all the answers. But I could put in an early word for something I care about and maybe start a conversation. I wrote the piece in first person (a big no-no for academic prose) to emphasize that point. I don’t have the theoretical chops to write in the third person about databases and history. What I have is a case study, a story, that may help others think more reflectively about what we digital historians do every day.

I hope you will head over to Writing History in the Digital Age and look at the really amazing essays my fellow historians have written. Open peer review continues through November 14, 2011. Please comment if you have something to say: having an essay up for open peer review is orders of magnitude more nerve-wracking than wondering if anyone reads your blog.

Maybe stop by and read my piece as well, if the topic interests you. Let’s get this conversation started.

Republicans of Letters

Here are the slides for my January 26th talk at Brown University’s Center for Digital Scholarship, “Republicans of Letters: Historical Social Networks and The Early American Foreign Service Database.”

The abstract ran as follows, “Jean Bauer, an advanced doctoral candidate in the Corcoran Department of History at the University of Virginia and creator of The Early American Foreign Service Database, will discuss her use and creation of digital tools to trace historical social networks through time and space. Drawing on her research into the commercial, kinship, patronage, and correspondence networks that helped form the American diplomatic and consular corps, Bauer will examine how relational databases and computational information design can help scholars identify and analyze historical social networks. The talk will include demos of two open source projects Bauer has developed to help scholars analyze their own research, Project Quincy and DAVILA.”

Some of the slides are pretty text intensive, so if something catches your eye, go ahead and hit pause!

Do You See What I See?

This is the abstract for my talk, “Do You See What I See?: Technical Documentation in Digital Humanities,” which I gave at the 2010 Chicago Colloquium on Digital Humanities and Computer Science.

The actual presentation was more informal and consisted of a series of examples from my various jobs as a database designer.

The slides are embedded below.

*********************

Technical diagrams are wonderfully compact ways of conveying information about extremely complex systems. However, they only work for people who have been trained to read them. Humanists might never see the technical diagrams that underlie the systems they work on, reducing their ability to make realistic plans or demands for their software needs. Conversely, if you design a database for a historian, and then hand him or her a basic E-R (Entity-Relationship) or UML (Unified Modeling Language) diagram, you will end up explaining the diagram’s nomenclature before you can talk about the database (and oftentimes you run out of time before getting back to the research question underlying the database). Either scenario removes the major advantage of technical diagrams and leads to an unnecessary divide between the technical and non-technical members of a digital humanities development team.

True collaboration requires documentation that can be read and understood by all participants. This is possible even for technical diagrams, but not without additional design work. Using the principles of information design, these diagrams can be enhanced through color coding, positioning, and annotation to make their meaning clear to non-technical readers. The end result is a single diagram that provides the same information to all team members. Unfortunately, graphical and information design are specialized fields in their own right, and not necessarily taught to people with backgrounds in systems architecture.

A tool that I have recently designed may provide some first steps in that direction. The program is called DAVILA, an open source relational database schema visualization and annotation tool. It is written in Processing using the toxiclibs physics library and released under the GPLv3. DAVILA comes out of my work on several history database projects, including my own dissertation research on the Early American Foreign Service. As a historian with a background in database architecture and a strong interest in information design, I have tried several ways of annotating technical diagrams to make them more accessible to my non-technical colleagues and employers. However, as the databases increased in complexity, making new diagrams by hand became a time-consuming and frustrating process. The plan was to build a tool that would generate these annotated diagrams quickly enough to accommodate the workflow used in rapid application development.

With DAVILA you fill out a CSV file to label your diagram with basic information about the program (project name, URL, developer names) and license the diagram under the copyright or copyleft of your choice. You can then group your entities into modules, color code those modules, indicate which entity is central to each module, and provide annotation text for every entity in the database.
Once DAVILA is running, users can click and drag the entities into different positions, expand an individual module for more information, or hide the non-central entities in a module to focus on another part of the schema. All in a fun, force-directed environment, courtesy of the toxiclibs physics library. Pressing the space bar saves a snapshot of the window as a timestamped, vector-scaled PDF.

I now use DAVILA to describe my databases and have received positive feedback on the diagrams’ readability from programmers and historians alike. I have little training in visual theory or graphic design and would welcome comments from those with more expertise in those fields. DAVILA also only works with database schemas, but similar tools would be extremely useful for other types of technical diagrams. Collaboration would undoubtedly be improved if, when looking at a technical diagram, we could all see the same thing.

For more on the project see: http://www.jeanbauer.com/davila.html.

And now, without further ado: My Slides

Partial Dates in Rails with Active Scaffold

As a historian I am constantly frustrated (but bemused) by how computers record time. They are so idealistically precise and hopelessly presentist in their default settings that creating intellectually honest digital history becomes impossible without some serious modifications.

In designing Project Quincy, my open-source software package for tracing historical networks through time and space, I quickly realized that how I handled dates would make or break my ability to design the kinds of interfaces and visualizations I needed to perform my analysis.

As a database designer, however, I balk at entering improperly formatted data into the database (I am firm in my belief that this will always come back to bite you in the end). So while MySQL lets me enter an unknown birth date as 1761-00-00 (it doesn’t require proper date formatting unless running in NO_ZERO_DATE mode), if I ever migrated the data to another database (say, Postgres) I would be up to my eyebrows in errors. But I also don’t want to mislead my users into thinking that half the individuals in my database were born on January 1st.

So here are my solutions, drawn from the code of Project Quincy, which powers The Early American Foreign Service Database.

A relatively easy way to format partial dates in your frontend interface is to add three boolean flags to each date: year_known, month_known, and date_known. Then add a method to your application helper (link to code here) to determine how to display each type of partial date.
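The display logic can be sketched like this. The original helper was a Ruby view helper in the Rails application, so this Python version is purely illustrative; only the three flag names come from the scheme above.

```python
def format_partial_date(year, month, day,
                        year_known=True, month_known=True, date_known=True):
    """Render only the components of a date that are actually known."""
    months = ["January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November",
              "December"]
    if not year_known:
        return "date unknown"
    if not month_known:
        return str(year)  # e.g. an unknown birth date displays as just "1761"
    if not date_known:
        return f"{months[month - 1]} {year}"
    return f"{months[month - 1]} {day}, {year}"

print(format_partial_date(1761, 1, 1, month_known=False, date_known=False))
# 1761
print(format_partial_date(1785, 3, 10))
# March 10, 1785
```

The point of the flags is that the stored row can keep placeholder values (like 1761-01-01) for sorting, while the interface never shows a component the sources don’t support.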

For entering partial dates Project Quincy makes extensive use of ActiveScaffold, a Rails plugin that auto-generates an administrative backend. The nice thing about ActiveScaffold is that it is fully customizable. The problem with ActiveScaffold is that the defaults stink, so you basically end up customizing everything.

By default, ActiveScaffold treats date entry as a unified field, so you have to break up the JavaScript that knits day, month, and year together. You also have to change the default from today’s date to blank. If you enter only part of a date, it sets the missing components to the lowest possible value.

Matt Mitchell, former Head of R&D for the University of Virginia Scholars’ Lab, came up with the following elegant solution to my problem:

Create a partial view in /app/views/activescaffold/_common_date_select.html.erb and populate it with the following code.

And activate that partial with a helper method in your application_helper (link here).

And you should be good to go.

**************************************

If the pastie links go down, you can find the partial view and helper methods in Project Quincy on GitHub.

It’s [A]live!

It is with great pleasure, and no small amount of trepidation, that I announce the launch of the Early American Foreign Service Database (EAFSD to its friends). While the EAFSD has been designed as an independent, secondary source publication, it also exists symbiotically with my dissertation “Revolution-Mongers: Launching the U.S. Foreign Service, 1775-1825.”

I created the EAFSD to help me track the many diplomats, consuls, and special agents sent abroad by the various American governments during the first fifty years of American state-building. Currently the database contains basic information about overseas assignments and a few dives into data visualization (an interactive Google map and Moritz Stefaner’s Relation Browser).

I have been a reluctant convert to the principles of Web 2.0, and I keenly feel the anxiety of releasing something before my perfectionist tendencies have been fully exhausted. The pages of the EAFSD are therefore sprinkled with requests for feedback and my (hopefully humorous) under construction page, featuring Benjamin West’s unfinished masterpiece the “American Commissioners of the Preliminary Peace Agreement with Great Britain.”

Over the next few months (and coming years) I will be adding more information to the database, allowing me to trace the social, professional, and correspondence networks from which American foreign service officers drew the information they needed to represent their new (and often disorganized) government. I will also be enhancing the data visualizations to include hypertrees, timelines, and network graphs.

This launch has been over two years in the making. As I look back over that time, I am amazed at the generous support I have received from my colleagues at the University of Virginia and the Digital Humanities community writ large. I wrote an extended acknowledgments page for the EAFSD, my humble attempt to recognize the help and encouragement that made this project possible.

Launching the EAFSD also gives me a chance to test Project Quincy, the open-source software package I am developing for tracing historical networks through time and space. The EAFSD is the flagship (read: guinea pig) application for Project Quincy. I hope my work will allow other scholars to explore the networks relevant to their own research.

To that end the EAFSD is, and always will be, open access and open source.

Introducing DAVILA

I have just released my first open source project. HUZZAH!

DAVILA is a database schema visualization/annotation tool that creates “humanist readable” technical diagrams. It is written in Processing with the toxiclibs physics library and released under the GPLv3. DAVILA takes in the database’s schema and a pipe-separated customization file and uses them to produce an interactive, color-coded, annotated diagram similar in format to UML. There are many applications that will create technical diagrams from a database schema, but as a digital humanist I require more than they can provide.

Technical diagrams are wonderfully compact ways of conveying information about extremely complex systems. But they only work for people who have been trained to read them. If you design a database for a historian, and then hand him or her a basic E-R or UML diagram, you will end up explaining the diagram’s nomenclature before you can talk about the database (and oftentimes you run out of time before getting back to the research question underlying the database). This removes the major advantage of technical diagrams and can also create an unnecessary divide between the technical and non-technical members of a digital humanities development team.

I have become fascinated by how documenting a project (either in development or after release) can build community. I’m not just talking about user generated documentation (ala wikis), but rather the feeling created by a diagram or README file that really takes the time to explain how the software works and why it works the way it does. There is a generosity and even warmth that comes from thoughtful, helpful documentation, just as inadequate documentation can make someone feel stupid, slighted, or unwanted as a user/developer. I will be writing on this topic more in the months to come (perhaps leading up to an article). In the meantime, check out DAVILA and let me know what you think.

Project homepage: http://www.jeanbauer.com/davila.html

The Design Bug

Edward Tufte should come with a warning label. Since I took his course a year ago last October, I have been bitten by the design bug. I realized the depth of this obsession last night while putting together a projected syllabus for a summer course in the History Department. Just a simple word processing document, right? Wrong.

Before I knew it, I was agonizing over font choices (what is wrong with Times New Roman?), getting the spacing just right between the columns (ensuring that the document will have to be exported as a pdf file to avoid disaster), and designing a banner graphic (two versions: a large one for the front page and a smaller one for subsequent pages). And not just a pretty picture, but a semantically rich graphic, which made me think hard about the essential theme of the course before I could render it visually.

This is an internal document! It is only supposed to get the course accredited, but I just can’t send it in without some attention to its visual impact.

I wasn’t always like this. Until about eighteen months ago, I had two intense, but distinct, sets of aesthetic appreciation: one based in logic and one based in visual or written art. I have always been drawn to “elegant solutions,” whether in the relational algebra behind a third normal form database, a well constructed thesis, or a beautiful piece of code. I am also a photographer and the daughter of a novelist, so I prize an arresting composition of shapes or colors or words to convey thoughts and feelings.

My newfound interest in graphic and information design is starting to blend these two senses together, particularly as I seek more effective ways of visually rendering my research on information flows in the Early American Foreign Service.

I don’t know where this newfound interest is taking me, or my scholarship. I only know that, for now, I’m along for the ride.

Control your Vocab (or not)

I am a NINES Graduate Fellow for 2009-2010, and this post was written for the NINES Blog. To see it in its original context, click here.

Yesterday I had two conversations about controlled vocabulary in digital humanities projects (a.k.a. my definition of a really good day). Both conversations centered on the same question: what is the best way to associate documents with subject information? If you don’t attach some keywords or subject categories to your documents, then you can forget about finding anything later. There are, by my count, two main camps for doing this in a digital project — tags and pre-selected keywords.

In my humble opinion, tags are best when you want your users to take ownership of the data. They decide the categories, so in some sense they have a stake in the larger project and how it evolves. You might even be able to tell why people are using the data in the first place by looking at what tags they associate with your (or their) content. On the downside, tags can be problematic for first-time users who need to search (rather than explore) your data. On several occasions I have been confronted with tag clouds that have descended (or ascended) into the realm of performance art. They are fascinating in and of themselves, but fail to provide a meaningful path into the data.

Pre-selected keywords often work best when a clearly defined set of people are in charge of marking up the content. They are great for searching, and if indexed in a hierarchical structure, can provide semantically powerful groupings (especially for geographical information). And if you have a Third Normal Form database, then you never have to worry about misspellings or incorrect associations between your keywords (Disclaimer: I love 3NF databases. I know they don’t work for every project, but when your data fits that structure life is good). As a historian, however, I am wary of keywords that are imposed on a text. If someone calls himself a “justice,” I balk at calling him a “judge” even if it means a more efficient search.

Of course, it all depends on your data and what you want to do with it, but my favorite solution is to have, at minimum, two layers of keywords. The bottom layer reflects the language in the text (similar to tagging), but those terms are then grouped into pre-selected types. So “justice,” “justice of the peace,” “judge,” “lawyer,” “barrister,” and “counselor” all get associated with the type “legal.” You can fake hierarchies with tags, but it requires far more careful attention to tag choices than I typically associate with that methodology.
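The two-tiered idea can be sketched in a few lines. The legal terms are the ones from the example above; the second type and the sample records are invented for illustration.

```python
# Bottom layer: terms exactly as they appear in the sources,
# each grouped into a pre-selected upper-layer type.
keyword_types = {
    "justice": "legal",
    "justice of the peace": "legal",
    "judge": "legal",
    "lawyer": "legal",
    "barrister": "legal",
    "counselor": "legal",
    "merchant": "commercial",  # a hypothetical second type
}

def records_by_type(records, wanted_type):
    """Search on the upper layer without altering the source-language terms."""
    return [r for r in records
            if keyword_types.get(r["keyword"]) == wanted_type]

records = [
    {"name": "Person A", "keyword": "justice"},   # calls himself a justice
    {"name": "Person B", "keyword": "merchant"},
]
print(records_by_type(records, "legal"))
# [{'name': 'Person A', 'keyword': 'justice'}]
```

Searching by type finds the man who called himself a “justice” without ever relabeling him a “judge” in the data itself.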

I implemented the two-tiered approach in Project Quincy, but I would love to hear other suggestions and opinions.