What is LinkedEarth?

LinkedEarth
Published in CyberPaleo
9 min read · Sep 1, 2021

In our first real post on this Medium publication, it is fitting for us to define what LinkedEarth is, as there is lingering confusion about that, and we can’t exactly fault you, dear reader. LinkedEarth has been in existence for 6 years and a lot has happened.

Origin Story

The year is 2015, and standardizing paleoclimate data to enable new kinds of science is long overdue. Nick McKay, Yolanda Gil and I (Julien Emile-Geay) get our first grant funded under the auspices of the EarthCube program. The 2-year project, called LinkedEarth, aimed to build a platform to facilitate a crowdsourced model of paleoclimate data curation, as well as standards to make those data FAIR (Findable, Accessible, Interoperable, and Reusable). This required developing several things:

  • data standards for paleoclimatology;
  • code that uses those standards to do useful things (see below).

The most important piece was getting the right people involved, because people make the world go round. First, we recruited someone with the rare (possibly unique) combination of talents to help us pull this off: someone with knowledge of both computer science and paleoclimatology, incredible efficiency, organizational skills and work ethic, and more importantly the drive and flexibility of mind to keep learning new things. I am speaking, of course, of Deborah Khider. Deborah was the key reason anything good came out of LinkedEarth, but Deborah alone couldn’t do it. Another postdoc, Daniel Garijo, pulled a lot of weight on the information sciences side of things.

And like all EarthCube projects, it was to be community-driven. LinkedEarth relied heavily on our connections with the PAGES organization and collaborations with the NCEI/NOAA World Data Service for Paleoclimatology to spread the word about various community events, including the first ever International Workshop on Paleoclimate Data Standards (June 2016), which gathered many in that community for very fruitful discussions and decisions.

At this point you might wonder why, of all things, we chose to call the project LinkedEarth. Linked is easy: the vision was to bring to life a world where many paleoclimate datasets could live in cyberspace as linked open data. OK, you say, but what about “Earth”? We’re not dealing with all geoscience data, after all.

That is a fair point (a FAIR one, perhaps). The answer is that our original attempt at getting a crowdsourced data curation system off the ground dates back to 2013. At the time, we were ambitious enough to envision building this for paleobiology, geobiology, and paleoclimatology (silly us!). We couldn’t find a better common term for those three fields than “Earth Sciences”, so LinkedEarth was born. This initial attempt was not funded (in part, because of its overly broad ambitions), so in 2015 we refocused the project on paleoclimatology, but kept the original name, as we still believe the approach is broadly applicable to many realms of the Earth sciences.

What has LinkedEarth ever done for paleoclimatologists?

That is another FAIR question. LinkedEarth has advanced paleoclimate standards on two fronts:

  1. a data model (Linked Paleo Data, or LiPD, described in this publication), which was extended in the LinkedEarth Ontology.
  2. a reporting standard, PaCTS

The LiPD data model can serve to build universal, self-describing containers that are tightly shrinkwrapped around all manner of paleoclimate data (e.g. wood, sediment, coral or glacier-based archives, even documentary data), with the metadata shipping right alongside the data it is meant to describe (if you are familiar with netCDF, it’s a lot like that, but tailored to the particular needs of paleoclimate datasets). What is the point of that, you ask? Shipping information in such a unified container makes it possible to build code that can natively interact with the LiPD data model and perform high-level tasks like age modeling, spectral analysis, or multivariate decompositions. Indeed, geoChronR is built around LiPD, and LiPD is also an important (though not obligatory) component of Pyleoclim. LiPD has also served as the backbone of several large community synthesis projects, including iso2k and Temperature12k, and several more that are in progress. All of these projects and datasets can be found on the LiPDverse.
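To make the idea of a self-describing container concrete, here is a minimal Python sketch. It is a toy, not the actual LiPD specification: the file names (metadata.json, data.csv), the metadata fields and the function names are all assumptions made up for this illustration. The only point it tries to show is that the metadata and the numbers travel together in a single file.

```python
import csv
import io
import json
import zipfile


def write_container(path, metadata, columns):
    """Bundle metadata (as JSON) and data columns (as CSV) into one zip archive.

    Toy illustration only: the layout is hypothetical, not the LiPD format.
    """
    names = list(columns)
    table = io.StringIO()
    writer = csv.writer(table, lineterminator="\n")
    writer.writerow(names)
    # Turn the column-oriented dict into rows.
    writer.writerows(zip(*(columns[name] for name in names)))
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("metadata.json", json.dumps(metadata, indent=2))
        archive.writestr("data.csv", table.getvalue())


def read_container(path):
    """Read the archive back into (metadata dict, dict of numeric columns)."""
    with zipfile.ZipFile(path) as archive:
        metadata = json.loads(archive.read("metadata.json"))
        text = archive.read("data.csv").decode("utf-8")
    reader = csv.reader(io.StringIO(text))
    names = next(reader)
    columns = {name: [] for name in names}
    for row in reader:
        for name, value in zip(names, row):
            columns[name].append(float(value))
    return metadata, columns
```

Because the units and archive type come back with the values, downstream code can check what it has been handed before running an analysis, instead of relying on a column header in a spreadsheet.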

Yet we quickly realized that there was more to a data standard than a format. As an analogy, consider how libraries were run in the old days, with each book represented on library cards neatly packed into old‐fashioned file cabinets. For this system to function, one needed (1) a set of compartments and drawers to house the cards, (2) labels to identify and classify the contents of the drawers, and (3) a disciplined adherence to the classification system. This entailed including essential information required for application and reuse of the cards and the information they contain. Every user had to follow similar guidelines to generate, use, and file the library cards; otherwise, the classification fell apart and the cards might as well have lain about in a random pile, and therefore been useless.

LiPD served as the compartments and drawers, but one also needed to standardize the labels for the drawers (terminology) and define best practices before we could have a system ready for use by everyone. The ontology formalized the relationships among the compartments and their terminology, and served as a skeleton to build the LinkedEarth wiki, a platform able to house paleoclimate datasets that could be curated by paleoclimatologists themselves. The wiki was nucleated around the recently released PAGES 2k database, which comprised nearly 700 LiPD-formatted datasets with relatively uniform naming conventions. The wiki is many things:

  • a data catalog, allowing registered users to upload and edit paleo datasets at will;
  • a data server, allowing people to download datasets corresponding to sophisticated queries (the type a paleo scientist would need to perform in their work);
  • an online educational resource containing information about major paleoclimate archives;
  • a community hub, allowing timestamped discussions on data standards, eventually leading to the PaCTS reporting standard;
  • an ontology-editing tool, allowing users to capture new terms and their definitions to grow the original ontology proposed by the core LinkedEarth team.
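As a rough illustration of the kind of query the data server was meant to answer ("give me all coral records of a given variable in the tropics that cover the 20th century"), here is a small sketch in plain Python. It does not use the wiki's actual query interface; the record fields, the dataset names and the find_datasets function are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalog entry; the fields are illustrative only."""
    name: str
    archive_type: str
    variable: str
    latitude: float
    start_year: int
    end_year: int


def find_datasets(catalog, archive_type, variable, lat_range, covers):
    """Return names of records matching archive, variable, latitude band,
    and fully covering the requested (first_year, last_year) interval."""
    lat_min, lat_max = lat_range
    first, last = covers
    return [
        record.name
        for record in catalog
        if record.archive_type == archive_type
        and record.variable == variable
        and lat_min <= record.latitude <= lat_max
        and record.start_year <= first
        and record.end_year >= last
    ]


# Made-up records, not real datasets.
CATALOG = [
    DatasetRecord("example-coral-A", "coral", "d18O", -5.0, 1800, 2000),
    DatasetRecord("example-coral-B", "coral", "d18O", 35.0, 1700, 1995),
    DatasetRecord("example-tree-C", "wood", "ring width", 10.0, 1000, 2010),
    DatasetRecord("example-coral-D", "coral", "Sr/Ca", 2.0, 1850, 2005),
]

tropical_corals = find_datasets(CATALOG, "coral", "d18O", (-20.0, 20.0), (1900, 1990))
```

A query like this only works if every record uses the same field names and the same terms for archive types and variables, which is exactly why the standardized terminology matters.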

It’s possible that the LinkedEarth wiki might, in fact, have been too many things. Notably, it became obvious from the start that the data server dimension of the wiki was going to be perceived as competing with established repositories like WDS-Paleo and PANGAEA, though this was never our intention. From our perspective, the LinkedEarth wiki was a technology incubator, a way to develop ideas about crowdsourcing the curation of paleoclimate data in a style reminiscent of other community-curated data repositories (e.g. Neotoma or the Paleobiology Database), but leveraging the semantics of linked open data to allow data to be truly FAIR and integrated across the greater EarthCube nebula. In this goal we succeeded, as our semantic model allowed LinkedEarth datasets to become some of the first to be findable via the GeoCODES search engine. They were also the first paleoclimate data to be findable through Google’s dataset search tool.

Ontologies make it possible to richly capture the relationships among entities, supporting the complex queries mentioned above, as well as making it possible to build scientific code that can make effective use of the data. As mentioned above, the wiki also enabled community consultation on the first community reporting standard, PaCTS. Simply put, PaCTS was an attempt to bring together a range of international stakeholders in the paleoscience space to decide on which data fields should be archived to enable the greatest possible re-usability of paleoclimate datasets (the R in FAIR), with a view toward the long term. Through patient and regular community engagement over the course of several years (lasting long after the funded project rode into the sunset), LinkedEarth managed to elicit feedback from nearly 100 of these stakeholders, covering every major paleoclimate archive, to draw up an initial set of recommendations. More on this here. Of course, this was always meant to be a first step. The hard task is now for this standard to be adopted, and/or to evolve with the needs of the broader community.

Lessons Learned

The first phase of LinkedEarth has been rich in lessons. Firstly, we have learned that it is easier to propose standards than to have them adopted. While we have found many stakeholders (particularly publishers) enthusiastic about the idea of data standards, actually adopting them has not generated quite as much enthusiasm. Anybody who has ever proposed standards will tell you that users of anything do not readily relinquish their old tools or approaches, however imperfect those might be. There are two ways this might conceivably change: carrots, or sticks. The latter are in the hands of funding bodies and publishers, who have so far been unwilling to use them. But we can provide carrots.

While most paleoclimatologists are fully aware that an Excel spreadsheet is not a data standard, the more complex LiPD data model undeniably poses a barrier to access: while this remains a long-term dream, there is so far no app installed on all Windows, Apple or Linux computers that can open a LiPD file by double-clicking, the way one can do with Excel (and in some cases, netCDF). Nick has worked relentlessly to lower access barriers, for instance through the web-based LiPD playground. In parallel, our growing ecosystem of LiPD-aware software tools and educational resources (e.g. vignettes, hackathons, seminars, tutorials) is helping introduce LinkedEarth-forged tools to the paleoclimate community, with minimal coding skills required. The idea is simple: by sharing examples of the awesome scientific capabilities supported by our cyberinfrastructure efforts, we hope to entice and inspire a new generation of paleoscientists to adopt, expand and funkify our tools. We already have quite a few examples of the high-profile science that LinkedEarth tools support (e.g. here, here, here and here), and we are busy making more. The hope is that science wins over the most recalcitrant luddites into LiPD Shangri-La, but there is no doubt that this is a slow process. Rome wasn’t built in a day.

Secondly, we have learned to meet people where they are at. While we have found LiPD exceedingly useful in our work, it is by no means an obligatory step. Certainly, the LiPD data model is key to building high-level code that can perform complex workflows in a few keystrokes. However, in many cases the user need not see this data model in the first place. We are working on tools to enable users to lift datasets directly from PANGAEA or WDS-Paleo and import them into their R or Python workspace — without ever getting involved with a LiPD file. WDS-Paleo has also developed a vocabulary for paleoclimate variables (the Paleoenvironmental Standard Terms (PaST) Thesaurus), which provides labels for the quantities in the drawers defined in LiPD and the LinkedEarth Ontology. This rich vocabulary can also be exploited to improve upon Pyleoclim’s automated capabilities, something we plan to do in the near future.
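To show how a controlled vocabulary can help automated tools, here is a minimal sketch of label normalization in Python. The mini-vocabulary below is invented for this example and is not taken from the PaST Thesaurus; the function name normalize_label is likewise hypothetical and is not part of Pyleoclim.

```python
# Hypothetical mini-vocabulary: maps cleaned-up spellings found in the wild
# to a single preferred term. Not actual PaST Thesaurus content.
STANDARD_TERMS = {
    "d18o": "d18O",
    "delta18o": "d18O",
    "δ18o": "d18O",
    "sst": "sea surface temperature",
    "seasurfacetemperature": "sea surface temperature",
}


def normalize_label(label):
    """Return the preferred term for a variable label, or None if unknown.

    Case, surrounding whitespace, spaces, underscores and hyphens are ignored
    so that 'Sea_Surface Temperature' and 'sea-surface-temperature' match.
    """
    key = label.strip().lower()
    for separator in (" ", "_", "-"):
        key = key.replace(separator, "")
    return STANDARD_TERMS.get(key)
```

Returning None for unrecognized labels, rather than guessing, leaves room for a human curator to decide whether a new term belongs in the vocabulary.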

Thirdly, the EarthCube landscape has changed radically since the first round of EarthCube funding. In the atmospheric & oceanic science community, Pangeo has taken off like wildfire, Jupyter notebooks are now the preferred way of sharing geoscientific workflows, and data semantics have been confirmed as the only way to glue EarthCube data together.

Next steps

So what is LinkedEarth in 2021, and where is it headed in the next 3 years? As Odin reminds Thor in Ragnarok:

Well, he did not say that exactly: he said Asgard is not a place, it’s a people. Similarly, LinkedEarth is not a database, a wiki, some sleek code, or 2 data standards. It’s the people of LinkedEarth, all of you who have participated and contributed over the years, that make LinkedEarth what it is. At the moment, the leadership of LinkedEarth is:

We all have different interests and a bunch of irons in the fire, but what unites us all is the use of technology to solve big questions in paleoclimatology. The newly revamped LinkedEarth website shows our current and past activities:

Broadly speaking, LinkedEarth does paleoclimate data science, research & training, working synergistically with existing cyberinfrastructure. For more details on all these components, feel free to dive into the site, and let us know if more details are needed anywhere.

In the next post, Deborah will describe an exciting new round of EarthCube funding we just obtained, which will allow us to build further on these foundations to make it easier and faster for climate scientists to do better science, unlocking the rich (and sometimes messy) information hidden in paleoclimate archives.

Julien, for LinkedEarth

LinkedEarth
CyberPaleo

An organization dedicated to manifesting the future of the paleosciences (the study of past environments, climates, and ecosystems)