Research on Climate Change with Statistical Methods

by Massimo
Data Science R&D

Predicting climate change effects

In the past, I supported agrobiodiversity scientists in demonstrating the causal effects of climate change on plant life.

Scientists monitor and track more than 15,000 different plant species worldwide.

More than 1,300 institutes worldwide contribute plant “observations” to a large portal, the Global Biodiversity Information Facility (GBIF), https://www.gbif.org/

Today GBIF stores more than 1.3 billion records catalogued using the Darwin Core data standard, including observational data (GPS coordinates, altitude, etc.).



Challenges in statistics

How can you find evidence that climate change is moving plants and/or changing their distribution across the territory? For example, some plants disappeared from places where they were observed during the 1960s, or appeared in new zones or at higher altitudes. Other plants spread over mountains because rising temperatures made the climate more favorable. Example: Melampyrum nemorosum, a plant that lives in temperate climates, has now been observed in the mountains of Croatia, which have very cold winters.

I collected the required observation data, summarized it with descriptive statistics, and performed regression analysis to find correlations between plant behavior and changes in the climate.


The Dataset

Datasets I worked on:

– Plants: GBIF portal, 250 million observations made from 1920 to today

– Climate data from the World Meteorological Organization's Global Climate Observing System (GCOS): I used about 200k records of historical climate observations

Variables considered:

– 152 variables from plant observations (Darwin Core data standard + observational data)

– 13 variables from climate observations: precipitation, air temperature, humidity, wind speed, pressure, lightning, upper-air temperature, carbon dioxide (CO2), ozone, date, location (GPS coordinates)



Data challenges

  • Direct plant observations made over 60 years were affected by many quality issues:
  1. Human errors: wrong plant names, invalid recorded dates, invalid geodetic datum, non-numeric or swapped elevation values, etc.
  2. Observation technology changed, for example positioning tools (just a few units in the 1960s, while GPS is quite common today with smartphones)
  3. The process for creating observations changed: initially scientists catalogued plants on printed forms, then in Word documents, and only in the last 30 to 40 years have relational databases been adopted. I led a project to convert data from thousands of printed forms into a database of observations with metadata (presented at the TDWG 2010 conference)
  • Sampling was affected by many (non-sampling) biases. Some examples I remember:
  1. Undercoverage bias:
    • The number of plant observations in a specific area depends on the funds allocated to the institutes for this work. Therefore, the recorded presence of plants in a location depended on the number of observations made in that area, and only some institutes had good historical data.
    • Scientists observed species only in some areas, probably close to their institutes, and at different times. This means I had to group the observations on a map by time in order to determine which plants lived in each area.
  2. Information bias:
    1. Each institute uses its own set of tools and procedures to record observations, so descriptive statistics are difficult to compare with each other
    2. Observations happened in different seasons worldwide; in extreme climates (cold winters or hot months) many plants naturally die, so the frequency of observations changes accordingly.
    3. Climate data refers to the location of the meteorological station, not to the location of the plants
  3. Time-varying confounders: factors other than climate change, such as pollution and human activities, can modify the presence of plants in a zone
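
The human errors listed above (invalid dates, bad coordinates, non-numeric or swapped elevation) lend themselves to simple record-level checks. The following is a minimal sketch, not the validation actually used in the project: it assumes each observation is a dict keyed by Darwin Core term names (`scientificName`, `eventDate`, `decimalLatitude`, `decimalLongitude`, `minimumElevationInMeters`, `maximumElevationInMeters`) and that dates are plain ISO `YYYY-MM-DD` strings. The function name `quality_issues` is hypothetical.

```python
from datetime import date


def quality_issues(record):
    """Return a list of quality problems found in one occurrence record (a dict)."""
    issues = []

    # The scientific name must be present and non-empty.
    if not str(record.get("scientificName") or "").strip():
        issues.append("missing scientificName")

    # Assumption: eventDate is a plain ISO date (YYYY-MM-DD).
    try:
        date.fromisoformat(str(record.get("eventDate")))
    except ValueError:
        issues.append("invalid eventDate")

    # Coordinates must be numeric and inside the valid ranges.
    try:
        lat = float(record.get("decimalLatitude"))
        lon = float(record.get("decimalLongitude"))
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            issues.append("coordinates out of range")
    except (TypeError, ValueError):
        issues.append("non-numeric coordinates")

    # Elevation must be numeric, and the minimum must not exceed the maximum.
    try:
        low = float(record.get("minimumElevationInMeters"))
        high = float(record.get("maximumElevationInMeters"))
        if low > high:
            issues.append("elevation min/max swapped")
    except (TypeError, ValueError):
        issues.append("non-numeric elevation")

    return issues
```

Records with an empty result could go on to sampling, while the others would be set aside for manual review. Checks on plant names against a taxonomy and on the geodetic datum are left out of this sketch.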

Methodology for the problem

For some operations, such as interpolating climate data or calculating the centroids of plants on a geographical map, I used the methodologies commonly used by agrobiodiversity scientists and climate specialists.

The methodology I followed was tailored to the data and required some transformations and assumptions:

  1. GROUPING OF OBSERVATIONS BY DATE: To calculate descriptive measures correctly, I grouped plant observations by date of observation: every plant was considered a “living organism” in the quarter of the year in which the observation was recorded. So, if a plant was observed in February 1998, it became part of the group of plants living in that area in the quarter January-March 1998.
  2. IDENTIFICATION OF CENTRES OF MASS OF SAMPLED PLANTS: I treated the geographical distribution of each sampled plant as concentrated in a single point, the centroid; for each sampled plant, I also calculated its mean and dispersion within 10 hectares around its centroid. I then measured the distance in meters that each centroid moved from one quarter to the next. The hypothesis was that centroids moved because plants grew differently as climate factors changed.
  3. INTERPOLATION OF CLIMATE DATA: Since climate observations are made at meteorological stations, I interpolated the climate data to estimate values at the exact location of the sampled plants. I used a deterministic interpolation algorithm called Inverse Distance Weighting (IDW), commonly used by climate scientists.
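
Steps 1 and 2 above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original code: observations are assumed to be `(species, date, lat, lon)` tuples, and the centroid is computed by averaging the points as 3D unit vectors, which is one common way to find a centre of mass on a sphere (the post does not say which centroid method was used). The names `quarter_key`, `centroid` and `quarterly_centroids` are hypothetical.

```python
import math
from collections import defaultdict
from datetime import date


def quarter_key(d):
    """Map a date to (year, quarter), e.g. February 1998 -> (1998, 1)."""
    return (d.year, (d.month - 1) // 3 + 1)


def centroid(points):
    """Centre of mass of (lat, lon) points in degrees, averaged as 3D unit vectors."""
    x = y = z = 0.0
    for lat, lon in points:
        phi, lam = math.radians(lat), math.radians(lon)
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    # Convert the mean vector back to latitude and longitude.
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon


def quarterly_centroids(observations):
    """Group (species, date, lat, lon) tuples by species and quarter, then take centroids."""
    groups = defaultdict(list)
    for species, d, lat, lon in observations:
        groups[(species, quarter_key(d))].append((lat, lon))
    return {key: centroid(pts) for key, pts in groups.items()}
```

For example, `quarter_key(date(1998, 2, 14))` returns `(1998, 1)`, matching the February 1998 case described in step 1.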

Unfortunately, I could consider only the variables with good historical data in the geographical areas identified after sampling the plants: humidity, temperature, pressure, precipitation.



A standard statistical process was followed, taking into account the quality issues and biases described above:

Multistage sampling of plant observations

Stage 1: Cluster sampling based on Institute names; only some Institutes collected good historical data

Stage 2: Stratified sampling based on qualitative variables of the plants (like kingdom, family name, genus name, taxonRank) – made by biologists

Stage 3: Systematic sampling – this avoided processing the whole sampling frame from Stage 2

The resulting sample consisted of 1.2 million observations of about 1,250 plant species.
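
The three stages can be illustrated with a short sketch. It rests on several assumptions that are not in the original work: records are dicts with the Darwin Core keys `institutionCode` and `family`, Stage 1 keeps whole institutes from a hand-picked set of reliable ones, Stage 2 draws a fixed number of records per family (the real strata were defined by biologists), and Stage 3 takes every `step`-th record from a random start. The function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def multistage_sample(records, good_institutes, per_stratum, step, seed=0):
    """Three-stage sample of records (dicts with 'institutionCode' and 'family')."""
    rng = random.Random(seed)

    # Stage 1: keep whole clusters (institutes) judged to have good historical data.
    stage1 = [r for r in records if r["institutionCode"] in good_institutes]

    # Stage 2: stratify by family and draw up to per_stratum records from each stratum.
    strata = defaultdict(list)
    for r in stage1:
        strata[r["family"]].append(r)
    stage2 = []
    for family in sorted(strata):
        members = strata[family]
        stage2.extend(rng.sample(members, min(per_stratum, len(members))))

    # Stage 3: systematic sampling, every step-th record from a random start.
    start = rng.randrange(step)
    return stage2[start::step]
```

Fixing `seed` makes the draw reproducible, which helps when the sample has to be rebuilt after a data correction.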

Definition of response variables

  1. Variation of the GPS position of the centroids in meters (∆centroid) from quarter to quarter: the centroid is the centre of mass of the plants on a map over a quarter of a year. I located the centroids from GPS coordinates and used the haversine formula to measure distances
  2. Number of occurrences found in the 10-hectare area around the centroid
  3. Variation of the Quartile Coefficient of Dispersion (∆QCD): the dispersion of plants in a 10-hectare region around their centroid was measured by the Quartile Coefficient of Dispersion (QCD), defined as (Q3 - Q1) / (Q3 + Q1). The computational complexity of the QCD is good: O(n log n), where n is the number of units observed for a sampled plant, because the cost is dominated by sorting
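
Two of the response variables above rely on standard formulas, sketched here with the Python standard library only. The sketch assumes a spherical Earth with a mean radius of 6,371 km for the haversine distance, and assumes the QCD is applied to a list of positive values such as the distances of each observation from its centroid (the post does not specify the input). The quartile method (`"inclusive"`) is also an arbitrary choice; other conventions give slightly different values.

```python
import math
import statistics

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def qcd(values):
    """Quartile Coefficient of Dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    # statistics.quantiles sorts the data, hence the O(n log n) cost.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / (q3 + q1)
```

∆centroid for a plant is then `haversine_m` applied to its centroids in two consecutive quarters, and ∆QCD is the difference between the `qcd` values of the same two quarters. Note that `qcd` is undefined when Q1 + Q3 is zero.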

Inverse Distance Weighting (IDW) algorithm

  • The dataframe was created locally from the response variables defined above
  • I accessed the historical climate observations from GCOS and applied the Inverse Distance Weighting (IDW) algorithm to interpolate the climate data at the exact location of the centroids of the sampled plants
  • I determined the mean values of the climate data at each point for each quarter (very time-consuming iterations; today an accelerated adaptive inverse distance weighting algorithm could be used)
  • I then created a dataframe joining the sampled plant data, climate data, observation data, and the date expressed as quarter/year
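
For readers unfamiliar with IDW, the idea is that the estimate at a target point is a weighted average of the station values, with weights proportional to 1 / distance^p. Below is a minimal sketch, assuming planar coordinates (for example, projected meters) and the common default p = 2; neither the coordinate system nor the power used in the project is stated in this post, and the function name `idw` is hypothetical.

```python
import math


def idw(stations, target, power=2.0):
    """Inverse Distance Weighting estimate at target from (x, y, value) stations."""
    tx, ty = target
    num = den = 0.0
    for x, y, value in stations:
        dist = math.hypot(x - tx, y - ty)
        if dist == 0.0:
            return value  # the target coincides with a station
        weight = 1.0 / dist ** power
        num += weight * value
        den += weight
    return num / den
```

With two stations at `(0, 0)` and `(2, 0)` reporting 10.0 and 20.0, `idw(stations, (1.0, 0.0))` gives 15.0, since both stations are equally distant. Every estimate loops over all stations, which is why repeating it for each centroid and each quarter becomes slow on large datasets.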

Data analysis

  • Analysis of the dataframe and some findings:
    • From time plots of the samples, I found seasonality and some trends in the number of plants in a region and in their spread within that region
    • With polar seasonal plots and scatterplot matrices, I found good correlations between some groups of plants and the climate data: with the help of biologists, we identified groups of plants that responded more strongly to variations in climate data. Example: plants that live at high altitudes or plants with good autochthonous resilience showed a high correlation between dispersion and temperature and humidity
    • Forecasts from the regression model indicated that, at ground level, plants should increase their dispersion within their region (∆QCD) while their occurrences in that region should drop drastically
    • Some plants found habitat in different locations: this was most evident in dry areas, where the ∆centroid values were higher
    • With altitude of observation as a response variable, I found that some plants changed altitude because of the change of climate in their region. So even if their change in horizontal dispersion (∆QCD) was low, they actually spread vertically over mountains and hills
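
The regression step can be illustrated in its simplest form: an ordinary least-squares fit of one response variable (say ∆QCD) against one climate variable (say mean quarterly temperature), together with the Pearson correlation. This is only a sketch under that single-predictor assumption; the post does not describe the actual model specification, and the function name `ols_fit` is hypothetical.

```python
def ols_fit(x, y):
    """Least-squares slope, intercept and Pearson r for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r
```

Here `x` would hold the interpolated climate values per quarter and `y` the matching response values. A real analysis would also need to handle the seasonality and confounders discussed earlier, which a single-predictor fit ignores.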

Getting data

  • Data accessed and manipulated via the dedicated RESTful API of the data provider GBIF
  • Data engineering of local datasets done on dedicated on-premises servers
  • Development languages: Python and some R scripts, T-SQL for some data manipulation functions
  • Languages for data engineering: SQL and stored procedures in Microsoft SQL Server
  • Geographical tools: ArcGIS, Google Maps
  • Development tools: Eclipse
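
As an illustration of the API access mentioned above, the sketch below builds a query URL for the occurrence search endpoint of the current public GBIF API (v1). It is an assumption that this endpoint and these parameters resemble what was used at the time; the post does not say which calls were made. The helper name `occurrence_search_url` is hypothetical, and no network request is performed here.

```python
from urllib.parse import urlencode

GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def occurrence_search_url(scientific_name, country=None, limit=100, offset=0):
    """Build a GBIF occurrence search URL for georeferenced records of one species."""
    params = {
        "scientificName": scientific_name,
        "hasCoordinate": "true",  # only records with GPS coordinates
        "limit": limit,
        "offset": offset,  # used to page through the results
    }
    if country:
        params["country"] = country  # ISO 3166-1 alpha-2 code, e.g. "HR" for Croatia
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"
```

The resulting URL can be fetched with any HTTP client; paging through a large result set means repeating the call with an increasing `offset`.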

