Biodiversity Informatics & Herbarium Digitization

Mast, Austin [1], Tian, Shubo [2], He, Zhe [3], Krimmel, Erica [3], Pichardo-Marcano, Fritz [4], Buckley, Mikayla [5], Gomez, Sophia [5], Hennessey, Ashley [5], Horn, Allyson [5], Howell, Olivia [5].

Demonstration of the use of computational linguistics and machine learning to identify phenological anomalies described in the world’s biodiversity specimen records.

Biodiversity specimen collectors are on the front lines of observing biotic anomalies, some of which herald early stages of significant changes (e.g., the arrival of a new disease or the emergence of phenological mismatches). However, the mechanisms by which those valuable observations reach stakeholders who would use the information have been idiosyncratic. Here, we explore the use of computational linguistics and machine learning to identify all anomalies of one type (phenological; related to the timing of life history events) among the world's biodiversity specimen records. As a first step, we coded every occurrence of six focal words that seemed likely to be used in most descriptions of phenological anomalies (early, earlier, earliest, late, later, latest; e.g., “flowering earlier than I’ve ever seen”) in a promising data field (occurrenceRemarks) from the more than 130 million records aggregated by iDigBio. This yielded a subset of 59,420 records, and our classification of them found that only about 3% contained a description of a phenological anomaly. The remaining 97% of examined records used the focal words in ways not suggesting anomalies (e.g., “collected in early morning”). This initial discovery highlights the value of identifying features of anomaly descriptions that can be used to find them in this large and ever-expanding recordset. As a second step, we used a bag-of-words representation of n-grams of 1–5 words in length, transformed that representation into numeric vectors using term frequency-inverse document frequency (TF-IDF), and classified the records using the XGBoost machine learning algorithm. Our initial application of that method trained the algorithm using half of the occurrenceRemarks records and tested it using the other half. Encouragingly, the method classified 91% of the test dataset correctly.
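The first step described above, pulling out records whose occurrenceRemarks text contains one of the six focal words, might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how matching was done, so the whole-word, case-insensitive regular expression, the function names, and the dict-based record layout are all hypothetical.

```python
import re

# The six focal words named in the abstract.
FOCAL_WORDS = ("early", "earlier", "earliest", "late", "later", "latest")

# Assumption: whole-word, case-insensitive matching. The \b anchors keep
# "late" from matching inside words such as "plate" or "lateral".
FOCAL_PATTERN = re.compile(r"\b(" + "|".join(FOCAL_WORDS) + r")\b", re.IGNORECASE)


def find_focal_words(remark):
    """Return the focal words found in one remark string, lowercased."""
    return [m.group(1).lower() for m in FOCAL_PATTERN.finditer(remark)]


def filter_records(records):
    """Keep only records whose occurrenceRemarks text contains a focal word.

    `records` is a hypothetical list of dicts with an "occurrenceRemarks" key;
    missing or empty remarks are treated as containing no focal words.
    """
    return [r for r in records if find_focal_words(r.get("occurrenceRemarks") or "")]
```

A filter like this keeps both “flowering earlier than I’ve ever seen” and “collected in early morning”, which is why the matched records still have to be classified before any of them can be treated as anomaly descriptions.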
We have since coded all occurrences of our six focal words in the iDigBio-aggregated data with the intention of using the approach to find all phenological anomalies described in the more than 2 billion records aggregated at GBIF, and we will share our latest discoveries in the talk.
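To make the feature step of the second stage concrete, the sketch below builds word n-grams of 1–5 words and weights them with TF-IDF using only the Python standard library. It rests on assumptions not stated in the abstract: the tokenizer, the function names, and the particular TF-IDF variant (raw count times log(N / document frequency)) are illustrative choices, and the XGBoost classification step is not shown.

```python
import math
import re
from collections import Counter


def ngrams(text, n_min=1, n_max=5):
    """Return all word n-grams of length n_min..n_max from a remark."""
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(remarks):
    """Turn remarks into TF-IDF vectors over a shared n-gram vocabulary.

    Weight = raw n-gram count * log(N / document frequency). This is one
    common textbook variant; libraries often add smoothing and normalization.
    """
    bags = [Counter(ngrams(r)) for r in remarks]  # bag of n-grams per remark
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # count each n-gram once per remark
    vocab = sorted(doc_freq)
    n_docs = len(remarks)
    idf = {g: math.log(n_docs / doc_freq[g]) for g in vocab}
    vectors = [[bag[g] * idf[g] for g in vocab] for bag in bags]
    return vocab, vectors
```

Under this weighting, an n-gram that appears in every remark receives zero weight, while n-grams specific to a few remarks (for example a phrase such as "earlier than") receive higher weight. Vectors of this kind are what a classifier would be trained on.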


Related Links:
iDigBio


1 - Florida State University, Department of Biological Science, 319 Stadium Drive, Tallahassee, FL, 32306, USA
2 - Florida State University, Department of Statistics, 117 N. Woodward Ave., Tallahassee, FL, 32306, USA
3 - Florida State University, College of Communication and Information, 142 Collegiate Loop, Tallahassee, FL, 32306, USA
4 - Florida State University, Department of Biological Science, 319 Stadium Drive, Tallahassee, FL, 32306, USA
5 - Florida State University, Department of Biological Science, 319 Stadium Drive, Tallahassee, FL, 32306, USA

Keywords:
Phenology
Herbarium
Computational linguistics
Machine learning
Anomaly detection
Global change
Conservation
Biodiversity informatics

Presentation Type: Oral Paper
Number: BI&HD II002
Abstract ID: 248
Candidate for Awards: None


Copyright © 2000-2022, Botanical Society of America. All rights reserved