Monday, April 30, 2018

A Little Bit of Big Data

Data crunching.  Over the many years that I’ve written this blog, I’ve touched on different aspects of paleontology and the activities of paleontologists but, for the most part, I’ve ignored one kind of endeavor that has increasingly marked this field in recent decades and that is data crunching.

It’s puzzling that I’ve had this blind spot because for most of those years I’ve been a volunteer at the Smithsonian’s National Museum of Natural History assisting projects that are quintessentially data-driven efforts.  In a post several years ago, I noted that the popular perception of how paleontologists spend their time conflicted with the reality of what such scientists actually did.  Every hour spent in the field collecting fossils generated many hours of lab work, prepping and studying each fossil found.  My take on this issue was certainly incomplete because at that juncture I made no mention of the very different way in which, at present, many paleontologists endeavor to extract meaning from fossils, that is:  staring into a computer screen while manipulating large, often complex sets of data derived from collected fossils.  Yes, “big data” is now very much a part of the work that paleontologists writ large do.

Science historian David Sepkoski, in a delightful commentary on big data in paleontology, writes from a very personal perspective.  His father, paleontologist Jack Sepkoski (who died much too young at age 50 in 1999), made seminal contributions to our understanding of the diversity of life on earth and on patterns of extinction by carefully compiling and analyzing extensive data on marine families and genera.  As a child, David saw his father as an Indiana Jones figure (though the latter was an archaeologist):
The illusion was shattered some years later when I figured out what he actually did:  far from spending his time climbing dangerous cliffs and digging up dinosaurs, Jack Sepkoski spent most of his career in front of a computer, building what would become the first comprehensive database on the fossil record of life.  The analysis that he and his colleagues performed revealed new understandings of phenomena such as diversification and extinction, and changed the way that paleontologists work.  But he was about as different from Indiana Jones as you can get.  (What a Fossil Revolution Reveals About the History of “Big Data”, Aeon, February 12, 2018.)
By building this database and using computer power to manipulate it, Jack Sepkoski and his colleague paleontologist David M. Raup were able to discern the major extinction events that have punctuated life on Earth from deep time to the present.  The ability to assemble a vast quantity of data and analyze it has transformed the research avenues explored by paleontologists.   In an interview, Raup’s wife said of her husband, “He used to say he went into paleontology because it was a field with a lot of data that no one was analyzing.” (Bruce Weber, David M. Raup, Who Transformed Field of Paleontology, Dies at 82, New York Times, July 15, 2015.)  That’s certainly not true now.

A look backward is in order at this juncture.  The manipulation of data to discern patterns in the fossil record isn’t what’s new about this phenomenon; rather, it’s how prevalent it’s become.  I’m quite taken by the work that British geologist John Phillips did in the middle of the 19th century, analyzing the data then available to him to depict the relative abundance of species over time.  Perhaps most importantly, in his 1860 volume titled Life on the Earth:  Its Origin and Succession, Phillips published what paleontologist Douglas H. Erwin has called “the first illustration of the diversity of life through time.”  (Erwin, Extinction:  How Life on Earth Nearly Ended 250 Million Years Ago, 2006, p. 20.)

Phillips’ graphic shows a quite modern take on changes in the diversity of life over long timescales, changes that Erwin describes as “major turnovers in the dominant fossils.”

I cannot leave Phillips without mentioning that as a child, after the death of his parents, he was raised by his uncle William Smith, the now famous geologist.  It was Smith who recognized there was a recurrent order of rock strata and that certain fossils were found only in specific strata, giving rise to the science of stratigraphy.  He created Britain’s first geologic map.  In addition, Phillips merits applause for having bucked the trend followed by so many of his contemporaries (Darwin included); he gave his 1860 volume a wonderfully brief and to the point title.

In his article on big data in paleontology, David Sepkoski makes much the same argument as I’ve made here regarding the very early use of data aggregation, though he turns to the German paleontologist Heinrich Georg Bronn.  Bronn, also in the middle of the 19th century, created a “paper ‘database’ of fossil groups” from which, by means of what we now refer to as spindle diagrams, he illustrated the origin, duration, and extinction of taxa.

I think one sign that big data has come to occupy an important place in paleontology is the growing significance of the Paleobiology Database (PBDB) for analysis in this field.  The PBDB, the product of an international group of paleontologists, seeks to become a comprehensive database recording the taxonomy and collection-based occurrences of fossils around the world and in all time periods.  It is primarily funded by the National Science Foundation and the Department of Geoscience at the University of Wisconsin – Madison.  The PBDB is accessible to the general public, offering various tools for analyzing its data on (as of April 30, 2018):  1,366,816 fossil occurrences of 369,683 taxa housed in 192,792 collections.  There have been 310 official publications based on PBDB data.

A recent example of such a publication is Diversity Change During the Rise of Tetrapods and the Impact of the ‘Carboniferous Rainforest Collapse’, by Emma M. Dunne et al. (Proceedings of the Royal Society B, 2018) which used the PBDB to explore how the Carboniferous rainforest collapse (CRC) might have affected patterns of diversity change among tetrapods from the late Carboniferous into the early Permian.  This analysis, contradicting an earlier hypothesis, found that the CRC was associated with increased interconnectedness among communities, rather than increased endemism.  Dunne and her colleagues turned to the PBDB but only after spearheading an effort to add to it all of the currently published data on the terrestrial tetrapod fossil record across the Paleozoic Era.

The PBDB is a fascinating and challenging tool for looking at ancient life, and I’ve only just scratched the surface of how its riches can be used (and probably won’t ever be able to dig much deeper).  At the most prosaic level, it offers the collectors among us myriad references about different locations at which fossils have been found; some of those sites may still be available for hunting fossils.  More intellectually challenging is utilizing the PBDB to place in context fossils found in the field.  For this, the PBDB is invaluable, offering an avenue to the taxonomic history of a wide array of fossil taxa.  Among the data available are the formations in which these taxa have been found.

What follows is a fairly simple-minded example of a bit of research I undertook with the PBDB into the fossil record of the Ecphora, that beautiful gastropod so familiar to those of us who have collected along the western shore of the Chesapeake Bay.  Pictured below is an Ecphora from the St. Marys Formation (this fossil is presumably younger than 13.8 million years old).

For my adventure in the PBDB, I provide a series of screen shots of the database in action (very low key action, mind you).  Here’s the welcoming splash page for the PBDB (in most cases in the images below only a portion of the screen is shown):

In this instance, I started with the Navigator.  Here a map of the world is covered with a multitude of colored dots, many of them lying atop others, each representing a collection of fossils, color coordinated with the time scale shown at the bottom.

Limiting the taxa (using the beetle icon) to the genus Ecphora generates the following view of the world map highlighting (barely visible) the very few collections in which this genus appears.

With a single exception, the collections with Ecphora are located in the US.  In the view below, the dots are clearer though very much clustered in a mass along the mid-Atlantic coast of the country.  Deep yellow dots predominate, marking sites yielding Miocene Epoch fossils.  The few greenish and pale greenish yellow dots identify sites in different parts of the late Cretaceous Period (older than 66 million years) which are markedly older than the Ecphora sites elsewhere on the map.  The latter all date from the Oligocene through the Pliocene, leaving a gap of some 32 million years or more from those earlier finds.
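The size of that gap is simple subtraction on the geologic time scale. Here is a minimal sketch, assuming the Oligocene began roughly 33.9 million years ago (a boundary age not given in this post) and using the 66-million-year end of the Cretaceous mentioned above. Because the Cretaceous sites are older than 66 million years, the result is a minimum.

```python
# Boundary ages in millions of years ago (Ma).
END_CRETACEOUS_MA = 66.0   # stated in the post
START_OLIGOCENE_MA = 33.9  # assumption: approximate start of the Oligocene

def gap_myr(older_ma: float, younger_ma: float) -> float:
    """Length of the interval between two ages, in millions of years."""
    return older_ma - younger_ma

minimum_gap = gap_myr(END_CRETACEOUS_MA, START_OLIGOCENE_MA)
print(f"Minimum gap: about {minimum_gap:.1f} million years")
```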

Zooming in just on the Chesapeake Bay region brings into focus the wealth of places where Ecphora specimens, now in collections, have been found.

Clicking on each dot will open up a window giving information about the location, the references in the literature to this site, and the occurrence of an Ecphora species at that location.  The picture below shows this information for the site in Texas that is one of the Cretaceous outliers for Ecphora.

Clicking on the link in the middle of the window, which reads “Sohl Wolf City,” reveals the following information about the site.

Opening the “Occurrences” tab at the top of the window tells me which Ecphora species is found here.

Ecphora proquadricostata.   All of the Cretaceous sites in PBDB featuring Ecphora identify either E. proquadricostata or E. sp. as the species in question.  These Cretaceous sightings are a problem I’ll turn to in a bit.

I decided to explore the time ranges for each of the Ecphora species in the PBDB, asking when in the fossil record each species lived.  I wanted to see how much they overlapped with each other and when the whole genus first showed up and when it blinked out.  To do that, I downloaded PBDB’s information on the first and last appearances of each of these Ecphora species using a PBDB-based web app called Fossilworks.  In Fossilworks I requested PBDB occurrence data on all species of the Ecphora genus.  I opened the “Download” tab (see below) and clicked on “Collection, Occurrence, or Specimen Data.”  In the form that opened, I identified the taxon I was interested in as Ecphora.  I made sure to indicate that the taxonomic level I was interested in was set to “Species.”

I then clicked on “Collection Fields” tab and specified in the form that opened up what data I wanted, including minimum and maximum ages.

The screen shot above doesn’t show the whole “Collection Fields” form.  At the bottom of the form is a button to “Create Data Set,” which I clicked.  A window opened up showing the data sets that had been created.
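For readers who would rather script a request than fill in forms, the same kind of query can probably be expressed as a URL. This is not the route taken above, which went through the Fossilworks forms. The sketch below only builds the address and downloads nothing. It assumes the PBDB data service offers an occurrence listing at `data1.2/occs/list.csv` with a `base_name` parameter for the taxon and a `taxon_reso` parameter for the taxonomic level; the endpoint and parameter names should be checked against the PBDB's own documentation before anyone relies on them.

```python
from urllib.parse import urlencode

# Assumed endpoint of the PBDB data service (verify against its documentation).
PBDB_OCCS_CSV = "https://paleobiodb.org/data1.2/occs/list.csv"

def occurrence_url(taxon: str, resolution: str = "species") -> str:
    """Build a download URL for occurrences of a taxon and everything within it."""
    params = {
        "base_name": taxon,        # assumed parameter: the taxon plus its subtaxa
        "taxon_reso": resolution,  # assumed parameter: level of identification
    }
    return f"{PBDB_OCCS_CSV}?{urlencode(params)}"

url = occurrence_url("Ecphora")
print(url)
```

Pasting the printed address into a browser would be one way to see whether the service accepts the request as written.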

I clicked on the taxonomic ranges data set, which opened in Excel.  From that point on, the time range analysis I performed basically followed a very handy PBDB tutorial video.

The product of my efforts manipulating the PBDB data appears below:  an Excel chart that shows the time span for each of the Ecphora species in the PBDB.  The present day is on the right of the chart and the bar for each species tracks its life span as a species, dating back to those late Cretaceous sightings of E. proquadricostata.  As the screen shot above indicates, the underlying data are from 128 separate occurrences of the Ecphora genus.
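The logic behind a time range chart like this one is easy to sketch outside of Excel. The example below is a rough illustration, not the tutorial's method: it assumes each occurrence carries a maximum and a minimum age, and it takes a species' first appearance as the oldest maximum age and its last appearance as the youngest minimum age across its occurrences. The species labels, the ages, and the names `max_ma` and `min_ma` are all placeholders invented for the example; none of them are PBDB values.

```python
from collections import defaultdict

# Placeholder occurrence rows: (species, max_ma, min_ma), ages in millions of years.
# These labels and ages are invented for illustration; they are not PBDB data.
occurrences = [
    ("species_a", 20.0, 16.0),
    ("species_a", 15.0, 11.0),
    ("species_b", 12.0, 8.0),
    ("species_b", 9.0, 5.0),
    ("species_c", 6.0, 3.0),
]

def taxonomic_ranges(rows):
    """Return {species: (first_appearance_ma, last_appearance_ma)}."""
    spans_by_species = defaultdict(list)
    for species, max_ma, min_ma in rows:
        spans_by_species[species].append((max_ma, min_ma))
    ranges = {}
    for species, spans in spans_by_species.items():
        first = max(max_ma for max_ma, _ in spans)  # oldest possible age
        last = min(min_ma for _, min_ma in spans)   # youngest possible age
        ranges[species] = (first, last)
    return ranges

ranges = taxonomic_ranges(occurrences)

# List species from the oldest first appearance to the youngest.
for species, (first, last) in sorted(ranges.items(), key=lambda kv: -kv[1][0]):
    print(f"{species}: {first} to {last} Ma")
```

Each pair of ages corresponds to one horizontal bar in a chart of this kind.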

I haven’t cleaned this chart up very much since, at this point, I don’t have any plans to do more with it.  An initial cleaning would involve dropping the Ecphora sp. entry, whose bar basically shows that a species identification wasn’t possible in several collections that fell across the purported lifespan of the genus (at least, at the beginning and at the end).

More problematic is the inclusion of E. proquadricostata.  As already noted, this species appears in the Cretaceous and then disappears after the end of that period, long before any of the other species show up.  What’s going on here?  Well, exploring this has introduced me to one of the limits of the data aggregation behind the PBDB.  E. proquadricostata is a species over which experts have contended.  Is it truly an Ecphora or does it belong to some other genus entirely?  Paleontologist Norman F. Sohl (author of the reference to the Texas site discussed above) wrote that, given the close similarity he detected between specimens of this species and all of the other Ecphora (known from the Oligocene onward), “it would be unwise to separate this species from the genus Ecphora purely on the basis of time lapse.”  (Neogastropoda, Opisthobranchia, and Basommatophora From the Ripley, Owl Creek, and Prairie Bluff Formations, Geological Survey Professional Paper 331-B, 1964.)  In contrast, paleobiologist Geerat J. Vermeij, citing the arguments of other authors, agreed strongly that this Late Cretaceous species is definitely not a true Ecphora.  (Morphology and Possible Relationships of Ecphora (Cenozoic Gastropod:  Muricidae), The Nautilus, Volume 109, Number 4, 1995.)

If I were to drop both of those questionable entries, then this graphic squares quite nicely with at least one school of thought about the early evolution of the Ecphora species.  As I noted in a previous post on the chemical composition of the Ecphora shells, some have posited that E. wheeleri was the first true species in this genus and that E. tampaensis evolved from it.
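How much those two entries stretch the genus can be sketched in a few lines. The example below is only an illustration of the idea: it computes the time range of a genus from its species' ranges, with and without a set of excluded entries. The two excluded names come from this post, but every age is a round placeholder chosen to mimic the pattern described here, and "Ecphora species A" and "Ecphora species B" are stand-ins rather than real taxa. None of the numbers are PBDB values.

```python
# Placeholder ranges: {name: (first_appearance_ma, last_appearance_ma)}.
# All ages are invented round numbers for illustration, not PBDB values.
species_ranges = {
    "Ecphora proquadricostata": (70.0, 66.0),  # the disputed Cretaceous entry
    "Ecphora sp.": (70.0, 2.0),                # specimens not identified to species
    "Ecphora species A": (30.0, 20.0),         # stand-in for a Cenozoic species
    "Ecphora species B": (18.0, 5.0),          # stand-in for a Cenozoic species
}

def genus_range(ranges, exclude=()):
    """Return (oldest first appearance, youngest last appearance) of the kept entries."""
    kept = [span for name, span in ranges.items() if name not in exclude]
    if not kept:
        raise ValueError("no entries left after exclusions")
    first = max(first_ma for first_ma, _ in kept)
    last = min(last_ma for _, last_ma in kept)
    return first, last

questionable = {"Ecphora proquadricostata", "Ecphora sp."}
everything = genus_range(species_ranges)
cleaned = genus_range(species_ranges, exclude=questionable)
print(f"All entries: {everything[0]} to {everything[1]} Ma")
print(f"Cleaned:     {cleaned[0]} to {cleaned[1]} Ma")
```

Keeping the exclusions in an explicit list, rather than deleting rows by hand, at least leaves a record of which judgment calls went into the cleaned range.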

But, until someone takes the step to rename E. proquadricostata, moving it to another genus, analysis of the Ecphora genus using the PBDB will give the genus an exceedingly long time range that may, in fact, not be accurate.  Caveat emptor.  (To explore the risks and limits of computer analyses of big data sets, one might start with writer Kalev Leetaru’s provocative piece titled How Bad Data Practice is Leading to Bad Research which appeared in Forbes on February 19, 2018.  It’s not focused on the sciences per se but does identify some of the issues that may affect big data use in all fields.)

Nevertheless, the PBDB offers users a taste of this other aspect of the life of a paleontologist – basking in the glow of the computer screen, being much too sedentary for much too long a time.  Field work anyone?
