Plain Text
To begin our project, we imported the lyrics and some metadata that we collected from our source from each of the 25 songs into separate plain texts files. These plain text files allow us to have the most basic file format of our collection as back-up, in case anything happened to our XML codes or anywhere in the analysis process.
Mark up
After obtaining the plain text files from the website, we used oXygen to mark the elements in the files with XML. We first identified repeating patterns in the files and used Regex (Regular Expressions) to mark up our files efficiently. Then, after the first files were marked propoerly with elements and attributes, we create a Relax NG Schema to keep our files and mark up style consistent with each other. We tried to keep in as many details as we could without making the code become too complicated, so we chose to separate the metadata part and the actual song part of the files with different tags.
Analysis
We utizlied eXist for the majority of the analysis process. eXist is a platform for us to write Xquery to pull specific data points from our collection and output them into formats to conduct further analysis or for visualization purposes. For a lot of the analysis we utilized eXist, where we wrote Xquery, which we used quite a bit for this project. We were able to utilize eXist while creating some svg graphs as well as writing code to extract elements to run through Cytocscape. We also used Voyant Tools to search through each individual song and find the words that repeated the most, the total words in the songs, and the amount of unique word forms in the songs.
Preparation for Xquery Analysis
EXist allows us to write Xquery code to draw SVG graphs using variables and loops. To do so, we first had to define the axes and what type of chart we were looking to plot. After determining those and deciding on the variables we were going to chart, all that was left was for us to write the code in eXist.
Preparation for Network Analysis
After determining what data points we would be interested in seeing the relationships between, we used Xquery to generate TSV (Tab Separated Values) files to hold the collection of data so that we could import them into Cytoscape later. Cytoscape is a software platform that allows you to output data through networks and diagrams. Inside Cytoscape, we generated networks using different elements such as voice actor, song titles or composer. Cytoscape creates interactive diagrams that show connections and interactions between our selected elements.
Preparation for Natural Language Analysis
This portion of the analysis relies heavily on artificial intelligence and online libraries, so our preparation for it was fairly simple and straight forward. In order for spaCy and pyCharm to be able to read our files properly, we had to create a new plain text file that does not contain any of our mark up nor the formatting, or the metadata as we were not looking to analysis those. For this portion of the analysis, we focused only on the actual lyrics portion.
GitHub
We used GitHub to hold our collection of songs and all of our work, and also to publish our website. Our Github is public, if you'd like to take a look around and see how we operated on this project you can click here.