Heyo, Joe here, coming off of a great semester of working with Marcus to coerce our data to play nice with Python. Our goal was to unite the geographic data we had for the area upstream of each point with the isotope data from Harbin, our hard-working germanium detector, in a format that could be sensibly manipulated with Python for graphing and statistical purposes. The problem was twofold: figuring out a way to sensibly store and access the data, and figuring out how to use that data to make graphs that were understandable and looked nice. I dove into writing a whole tangle of functions to pull out the data of interest, and Marcus became good friends with the matplotlib documentation, his only ally in the noble fight against the matplotlib library.
It must’ve been just about two years ago now that I first started to truly get my hands dirty with both Python and ArcPy, ArcGIS’s Python library. I started with a simple goal: create unique watershed files for each point in a shapefile full of sample collection locations. Through a combination of the ArcPy documentation, Stack Overflow answers, and a dear friend of mine with far more Python experience than me, I was able to create such a script. It was tailored to my specific project, but I tried my best to make it something that could be reused for other projects. Looking back now, I would do it all totally differently, but ya live and learn!
When I began work on my next script, which extracted spatial information for each watershed, I became consumed with finding my way around ArcGIS’s ‘table joins’, which are perhaps the most obtuse way to unite two sets of data. I won't go into detail, but I accomplished my goal and learned a lot about how ArcGIS stores data in the process. Thus began the quest that still consumes me to this day: to avoid using ArcGIS at all costs, offloading as much work as possible to Python.
In the fall, I declared, to no one in particular, my intent to secede from ArcGIS, and began work on a Python project to manage my data, which would only dirty its feet by dipping into ArcGIS as needed for certain spatial analyses, then whisk the results out of the clutches of whatever heinous file Arc would create and into the sanctuary of my Python datatype. Progress was slow, mainly because I kept on trying to start over! My code worked fine, but I was never satisfied with how it was structured. I wanted this to be something that people doing similar, but distinct, work could use. I struggled with how to avoid design decisions specific to my project, which was hard to do when I was also using the code to do my actual project at the same time. Eventually looming deadlines (apparently you need “results” when you “present” at a “conference”) forced me to move forward, so I ended the semester with a datatype to store the data about my samples, some functions to grab that data, and some functions to graph it.
Now, when you saw “functions to graph” above, you may have thought to yourself, “oh, this must be where matplotlib comes into play”, and you would be right, if I had an ounce of sense in me. For a reason I am unable to explain (I’m not sure if it was ignorance about the existence of graphing-specific libraries like matplotlib, hubris, or just naive fondness for LaTeX), I decided to write functions to generate the markup for plots using the PGFPlots package for LaTeX. This meant that instead of calling functions like plt.plot(), I was writing long format strings to generate a file in the LaTeX markup language. The results were rather pleasing, but when I came back to the project in February with Marcus I thought a more straightforward approach would be appropriate. When we found out about matplotlib, I thought, “Now here is the answer to all of our problems! All we have to do is hook up the Python code that stores the data to matplotlib and out will come beautiful graphs”. Sure, I thought we might have to do some tweaking to get graphs up to our very refined standards, but how hard could it be? For that answer, see Marcus’s post.
The template...
...and the result!
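For the curious, here is a minimal sketch of what the "long format strings" approach to PGFPlots can look like. It is not my original code: the function name `pgfplots_scatter`, the template, and the numbers in the usage note are all made up for illustration, and I'm assuming a simple scatter plot with vertical error bars.

```python
# Hypothetical sketch: build PGFPlots markup from a Python format string.
# Literal LaTeX braces are doubled so that str.format leaves them alone.
TEMPLATE = r"""\begin{{tikzpicture}}
\begin{{axis}}[xlabel={{{xlabel}}}, ylabel={{{ylabel}}}]
\addplot+[only marks, error bars/.cd, y dir=both, y explicit]
  table[x=x, y=y, y error=err] {{
x y err
{rows}
}};
\end{{axis}}
\end{{tikzpicture}}
"""


def pgfplots_scatter(points, xlabel, ylabel):
    """Return LaTeX markup for a scatter plot of (x, y, error) tuples."""
    # One whitespace-separated table row per data point.
    rows = "\n".join(f"{x} {y} {err}" for x, y, err in points)
    return TEMPLATE.format(xlabel=xlabel, ylabel=ylabel, rows=rows)
```

Calling something like `pgfplots_scatter([(1.0, 2.0, 0.1), (2.0, 3.5, 0.2)], "Area", "Activity")` gives back a string you can write to a `.tex` file and compile. All those doubled braces are a decent hint as to why plt.plot() started to look appealing.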
So, as Marcus went off to figure out just how to make matplotlib give us graphs that could be read with ease, I went off to figure out how I could pull the data we wanted out of the jumble of samples we were working with. Our dataset was a collection of 83 soil samples from three different field seasons. For each sampling location we determined the area upstream of it, and calculated various geographic parameters.
Now that I had gotten all this data, it was time to get it organized logically. This took some time, but boy was it worth it. Once I knew that all the data would behave the same way, I wrote a series of functions (way too many functions, probably, but once you learn Lisp, there’s no going back) to return lists of the data we actually wanted to plot. If each sample has activity and error values for 3 different isotopes, a thousand different geographic parameters, a location, links to files, and lists of other samples that this sample is related to, it’s not quite plug-and-play. But it got done, and it meant that if some of the values for our samples changed (as they often do), or if new samples got added or old samples got removed, as long as they conformed to the standards, we didn’t have to do a darned thing! Just take the list of sample objects, plug it into the function that pulls out the data you want to graph, and then shoot the result of that into Marcus’s graphing code. Badabing badaboom!
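To make that pipeline a little more concrete, here is a rough sketch of the idea, not the actual project code. The `Sample` class, its field names, the function `xy_with_error`, and the isotope and parameter names in the usage note are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical sample record: one soil sample and its attributes."""
    name: str
    # Maps an isotope name to an (activity, error) pair.
    activity: dict = field(default_factory=dict)
    # Maps a geographic parameter name to its value for the upstream area.
    params: dict = field(default_factory=dict)


def xy_with_error(samples, param, isotope):
    """Return aligned lists (xs, ys, errs) for plotting.

    Samples missing either the parameter or the isotope are skipped,
    so the three lists always have the same length.
    """
    xs, ys, errs = [], [], []
    for s in samples:
        if param in s.params and isotope in s.activity:
            value, error = s.activity[isotope]
            xs.append(s.params[param])
            ys.append(value)
            errs.append(error)
    return xs, ys, errs
```

With a list of samples in hand, `xs, ys, errs = xy_with_error(samples, "mean_slope", "isotope_a")` hands back three lists that could go straight into something like matplotlib's `plt.errorbar(xs, ys, yerr=errs, fmt="o")`. Add or drop samples, and the same call keeps working as long as they follow the same structure.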
Things should be smoother from here on out...until I finally figure out that perfect structure and write the definitive program for managing soil samples, computing watersheds, doing some spatial analyses, and plotting and tabulating the data. Someday it will happen, and we will be better for it.
It's been a great 2+ years working for OGRe, but I wouldn't be surprised if I come back in one form or another, even if it's just to preach about why we should be scripting more and clicking less.