Database

The PLANTS database

On Apr 14, 2023, Gerry Moore emailed the PLANTS 3 schema (dated Oct 17, 2018), along with an Excel output of the plant_characteristic table.

A few of the column names were different – for example, the schema had "anerobic_tolerance_extent_cd" while the Excel had "ANERB_TOLR_CD".

This wasn't surprising, because he had written,

It needs to be updated, as we have added some new data types (e.g., pollinator, ethnobotany) and made a few other minor tweaks.

This wasn't a problem at the time – I simply built the database based on the column names in the spreadsheet. However, there were a number of issues – specifically the lack of cultivars.

In September, we determined that we actually did have cultivar information – we just had to pull the data from three Excel files:

This was especially tricky, because PLANTS_Characteristics_Data_2022.xlsx had separate rows for each cultivar, but without the cultivar name.

So not only did I need to match SYMBOL → plant_symbol → plant_master_id → cultivar_name, but I also needed to match on the following columns – and even that wasn't straightforward:

PLANTS_Characteristics_Data_2022

plant_growth_requirements

Notes

PLANTS_Characteristics_Data_2022

plant_growth_requirements

Notes

TEMP_TOLR_MIN

temperature_tolerance_min

 

PRECIP_TOLR_MIN

precipitation_tolerance_min

 

PRECIP_TOLR_MAX

precipitation_tolerance_max

 

FROST_FREE_DAY_MIN

frost_free_days_min

 

PLNT_DEN_LOW

planting_density_min

 

ROOT_DPTH_MIN

root_depth_min

 

SOIL_PH_TOLR_MIN

soil_ph_tolerance_min

 

SOIL_ADP_C_TXT_IND

coarse_texture_soil_adaptable_ind

['No', 'Yes'] in PLANTS_Characteristics_Data_2022 corresponds to [0, 1] in plant_growth_requirements.

SOIL_ADP_F_TXT_IND

fine_texture_soil_adaptable_ind

SOIL_ADP_M_TXT_IND

medium_texture_soil_adaptable_ind

FIRE_TOLR_CD

fire_tolerance_id

['High', 'Low', 'Medium', 'None'] in PLANTS_Characteristics_Data_2022 corresponds to [1, 2, 3, 4] in plant_growth_requirements.

HEDG_TOLR_CD

hedge_tolerance_id

CACO3_TOLR_CD

caco3_tolerance_id

DRGHT_TOLR_CD

drought_tolerance_id

ANERB_TOLR_CD

anaerobic_tolerance_id

SLIN_TOLR_CD

salinity_tolerance_id

In addition, multiple cultivars would occasionally match on all columns. Monica Pokorny went to the trouble of differentiating them.

I now had everything needed to build VegSpec, except there was the nagging issue of how to get updates when the PLANTS database changed, and how to get our work back into the PLANTS database.

We eventually got a SQL dump of the database on Sep 27, 2023.

Unfortunately, everything changed. None of the columns are named the same. For example, what was originally grwth_prd_actv_cd was now active_growth, but that was based on a view that included the d_season table's season_id column.

To replicate the existing functionality of VegSpec, I now needed to do 42 JOINs on 18 tables: plant_master_tbl, plant_classifications_tbl, plant_duration, d_plant_duration, plant_growth_habit, d_plant_growth_habit, plant_morphology_physiology, plant_growth_requirements, plant_reproduction, plant_suitability_use, plant_location_characteristic, plant_location, d_plant_nativity, d_season, d_rate, d_extent, d_color, and d_foliage_porosity.

This is the code I came up with, which gives us this output.

Database transition challenges

The PLANTS database is in SQL Server. The schema and data were exported as .sql files, which were then imported into a PostgreSQL database.

SQL Server has unique SQL syntax, so I wrote a program to translate it into PostgreSQL syntax.

Some of the differences between the database systems are listed below.

CREATE TABLE differences

SQL Server

PostgreSQL

Notes

SQL Server

PostgreSQL

Notes

CREATE TABLE [dbo].[d_plant_reference_type](

CREATE TABLE plants3.d_plant_reference_type(

SQL Server uses square brackets to delimit columns.

PostgreSQL uses double quotes, which is the ANSI standard.
But quotes aren't necessary unless the columns have special characters such as spaces.

[plant_reference_type_id] [int] IDENTITY(1,1) NOT NULL,

plant_reference_type_id SERIAL PRIMARY KEY NOT NULL

In SQL Server, [int] IDENTITY(1,1) creates an auto-incrementing column.

The PostgreSQL equivalent is SERIAL.

[plant_reference_media_type] [varchar](10) NULL,

plant_reference_media_type VARCHAR(10) NULL,

 

[plant_reference_source_type] [varchar](25) NULL,

plant_reference_source_type VARCHAR(25) NULL,

 

[active_record_ind] [bit] NOT NULL,

active_record_ind BOOLEAN NOT NULL,

In SQL Server, [bit] creates a Boolean (true/false) column.

The PostgreSQL equivalent is BOOLEAN.

[creation_date] [datetime2](7) NOT NULL,

creation_date TIMESTAMP(6) NOT NULL,

In SQL Server, [datetime2](7) specifies a date/time column with 7 digits of fractional seconds (100 nanoseconds).

The PostgreSQL equivalent to DATETIME2 is TIMESTAMP.
PostgreSQL is limited to 6 digits of precision (microseconds), but I can't imagine why we'd need to go below that.

[created_by] [varchar](23) NULL,

created_by VARCHAR(23) NULL,

 

[last_change_date] [datetime2](7) NULL,

last_change_date TIMESTAMP(6) NULL,

 

[last_changed_by] [varchar](23) NULL,

last_changed_by VARCHAR(23) NULL);

 

CONSTRAINT [PK_REFERENCE_TYPE_ID1] PRIMARY KEY CLUSTERED
(
  [plant_reference_type_id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY] ) ON [PRIMARY]

 

Constraints aren't necessary for VegSpec purposes.

INSERT differences

SQL Server

PostgreSQL

Notes

SQL Server

PostgreSQL

Notes

INSERT [dbo].[d_plant_reference_type] (

INSERT INTO plants3.d_plant_reference_type (

 

[plant_reference_type_id],

plant_reference_type_id,

 

[plant_reference_media_type],

plant_reference_media_type,

 

[plant_reference_source_type],

plant_reference_source_type,

 

[active_record_ind],

active_record_ind,

 

[creation_date],

creation_date,

 

[created_by],

created_by,

 

[last_change_date],

last_change_date,

 

[last_changed_by]

last_changed_by

 

)

)

 

VALUES (

VALUES (

 

1,

1,

 

N'CD-ROM',

'CD-ROM',

"N" is required in SQL Server to specify a Unicode string.

It's not needed in PostgreSQL.

N'database',

'database',

 

1,

true,

active_record_ind is a BIT column in SQL Server.
BIT columns take on the values 1 (true) or 0 (false).

PostgreSQL doesn't have BIT columns. The equivalent is Boolean.
The translator determines if a column is a BIT column and changes it to true or false as appropriate.

Note that MS Access uses Yes/No for a Boolean column, where negative 1 represents true and 0 represents false.
Don't know why Microsoft doesn't like the word "Boolean," or why they can't be consistent across their own products.

CAST(N'2018-08-29T20:20:28.4733333' AS DateTime2),

'2018-08-29T20:20:28.4733333',

PostgreSQL doesn't have a DateTime2 type.
But casting isn't even necessary here.

N'Original 3.0 Migration ',

'Original 3.0 Migration ',

 

CAST(N'2021-01-15T13:31:32.2716512' AS DateTime2),

'2021-01-15T13:31:32.2716512',

 

N'SQL ETL'

'SQL ETL'

 

)

);

Semicolons are needed at the end of PostgreSQL statements.