Organization was the most complicated problem we ran into during our project. With 250,000 records, thousands of duplicates, and more than 20 columns in our original data set, we needed to clean the data systematically and meticulously. Using OpenRefine (formerly Google Refine), we reduced the dataset to roughly 7,000 records; the process is explained below.
First, we filtered the data by place using the keyword “Germany”. We included ambiguous data points like "Germany|Flanders."
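We did this filtering in OpenRefine, but the same keyword filter can be sketched in Python with pandas. The column names and toy records below are illustrative assumptions, not our actual data:

```python
import pandas as pd

# Toy records standing in for the original catalog (column names are assumptions)
df = pd.DataFrame({
    "Place": ["Germany", "Germany|Flanders", "France", "Italy"],
    "Title": ["Psalter", "Book of Hours", "Missal", "Gradual"],
})

# Keep any row whose place mentions Germany, which also catches ambiguous
# pipe-delimited values like "Germany|Flanders"
germany = df[df["Place"].str.contains("Germany", na=False)]
print(len(germany))  # 2 rows match
```

A substring match rather than an exact match is what lets the ambiguous pipe-delimited values survive this first pass.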
Second, we filtered the data by language, using "German" and "Latin" as keywords. We treated ambiguous data points like "German|Latin" as their own category. After cleaning up the language variable, we loaded it into R, a statistical programming language, for some basic analysis, such as frequency tables. We realized that the German dialects could be lumped together as simply “German”.
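Our frequency-table check was done in R, but a rough pandas equivalent looks like the sketch below; the dialect labels are hypothetical stand-ins for the ones we actually collapsed:

```python
import pandas as pd

# Hypothetical language values, including dialects and a mixed entry
langs = pd.Series([
    "German", "Middle High German", "Low German", "Latin", "German|Latin",
])

# Frequency table, analogous to table() in R
counts = langs.value_counts()

# Collapse dialect labels into plain "German", leaving mixed
# values like "German|Latin" in their own category
collapsed = langs.replace({"Middle High German": "German", "Low German": "German"})
```

Running `collapsed.value_counts()` after the collapse shows why lumping is safe: the dialect rows simply fold into the "German" count.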
Third, we removed data points with ambiguous place names such as “Germany|France” whose languages were other than “German”. We also created another subset, filtered first by language using “German” as the keyword, to find German manuscripts held in places other than Germany. Using Excel, we combined this new subset with the first subset, taking care to avoid duplicate rows.
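The Excel merge of the two subsets can be sketched in pandas; the `ID` column here is an assumption standing in for whatever key uniquely identifies a record in the catalog:

```python
import pandas as pd

# Subset filtered by place (toy data; "ID" is a hypothetical record key)
subset_place = pd.DataFrame({"ID": [1, 2, 3], "Place": ["Germany"] * 3})

# Subset filtered by language, which overlaps with the first (record 3)
subset_lang = pd.DataFrame({"ID": [3, 4], "Place": ["Germany", "Austria"]})

# Stack the two subsets, then drop duplicate records by key
combined = (
    pd.concat([subset_place, subset_lang])
      .drop_duplicates(subset="ID")
      .reset_index(drop=True)
)
```

Deduplicating on a key column rather than on whole rows avoids keeping two copies of a record whose other fields were cleaned slightly differently in the two subsets.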
Fourth, we cleaned up the “place” variable using OpenRefine’s clustering option. Additionally, we aggregated ambiguous data points such as “Germany|France” to simply “Germany”. To reduce ambiguity and uncertainty, we removed rows whose "Manuscript Date" contained pipes or was blank. Also, because the dates were given only as years, we converted them into a format readable by both CartoDB and Tableau: 01/01/YYYY.
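The date-cleaning step can be sketched in pandas, assuming the dates arrive as strings in a "Manuscript Date" column (the sample values are invented):

```python
import pandas as pd

# Toy date column: a clean year, a piped range, a blank, another clean year
df = pd.DataFrame({"Manuscript Date": ["1450", "1450|1500", "", "1523"]})

# Drop rows whose date contains a pipe or is blank
bad = (
    df["Manuscript Date"].str.contains(r"\|")
    | (df["Manuscript Date"].str.strip() == "")
)
clean = df[~bad].copy()

# Year -> 01/01/YYYY so CartoDB and Tableau parse it as a full date
clean["Date"] = "01/01/" + clean["Manuscript Date"]
```

Pinning every year to January 1st loses no information, since the source only recorded years, but gives the visualization tools a value they can parse on a time axis.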
Fifth, we geocoded the place names using Excel, OpenRefine, and Geocode by Awesome Table (an add-on for Google Sheets). To do this, we filtered our data set down to unique place names and created a new data set. Then, using the Geocode add-on, we retrieved geocoordinates for each place name. We manually edited the place names that Geocode could not identify; after correcting spelling mistakes and simplifying ambiguous place names, such as “Germany, southern” to “Germany”, we obtained geocoordinates for the complete dataset. Using OpenRefine, we then added the latitudes and longitudes as separate columns to our cleaned-up data, matched on place name.
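The final matching step, carried out in OpenRefine, amounts to a left join of the unique-place geocode table back onto the full records. A pandas sketch (place names and approximate coordinates are illustrative):

```python
import pandas as pd

# Full records: one row per manuscript, places repeat
records = pd.DataFrame({"Place": ["Cologne", "Mainz", "Cologne"]})

# Geocode lookup built from the unique place names (coordinates approximate)
geocodes = pd.DataFrame({
    "Place": ["Cologne", "Mainz"],
    "Latitude": [50.94, 49.99],
    "Longitude": [6.96, 8.25],
})

# Left join so every manuscript row picks up its place's coordinates
merged = records.merge(geocodes, on="Place", how="left")
```

Geocoding only the unique place names first, then joining, means each place is looked up once instead of once per manuscript, which matters with thousands of rows.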
Sixth, we combed through the "Title" column, lumping the different types of books into 50 or so general categories, such as "Psalter," "Prayerbook," "Book of Hours," and "Calligraphy." We took some liberties in this lumping: for example, we used "Bible" for titles like "New Testament," "Old Testament," and "Genesis," and "Vitae Sanctorum" for books about the lives of Peter and Paul. Spelling and language variants were also folded into these broad categories. However, given the extensive list of Latin titles and our time constraints, many titles were left as they are; a Latin specialist and additional time would be needed to clean them up further.
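This lumping was manual, but its logic is a lookup table from specific titles to broad categories. A minimal sketch, with a hypothetical mapping covering only the "Bible" example above:

```python
import pandas as pd

titles = pd.Series(["New Testament", "Old Testament", "Genesis", "Psalter"])

# Hypothetical lookup from specific titles to broad categories;
# the real table covered 50 or so categories
category_map = {
    "New Testament": "Bible",
    "Old Testament": "Bible",
    "Genesis": "Bible",
}

# Titles without a mapping keep their original value, mirroring the
# Latin titles we left as-is
categories = titles.map(category_map).fillna(titles)
```

Keeping unmapped titles unchanged rather than dropping them is what lets the uncategorized Latin titles wait for a specialist without losing any records.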