Database
The PLANTS database
On Apr 14, 2023, Gerry Moore emailed the PLANTS 3 schema (dated Oct 17, 2018), along with an Excel output of the plant_characteristic table.
A few of the column names were different – for example, the schema had "anerobic_tolerance_extent_cd" while the Excel had "ANERB_TOLR_CD".
This wasn't surprising, because he had written,
It needs to be updated, as we have added some new data types (e.g., pollinator, ethnobotany) and made a few other minor tweaks.
This wasn't a problem at the time – I simply built the database based on the column names in the spreadsheet. However, there were a number of issues – specifically the lack of cultivars.
In September, we determined that we actually did have cultivar information – we just had to pull the data from three Excel files:
PLANTS_Characteristics_Data_2022.xlsx (column SYMBOL)
plant master.xlsx (columns plant_symbol and plant_master_id)
plant_growth_requirements.xlsx (columns plant_master_id and cultivar_name)
This was especially tricky, because PLANTS_Characteristics_Data_2022.xlsx had separate rows for each cultivar, but without the cultivar name.
So not only did I need to match SYMBOL → plant_symbol → plant_master_id → cultivar_name, but I also needed to match on the following columns – and even that wasn't straightforward:
PLANTS_Characteristics_Data_2022 | plant_growth_requirements | Notes |
---|---|---|
TEMP_TOLR_MIN | temperature_tolerance_min |
|
PRECIP_TOLR_MIN | precipitation_tolerance_min |
|
PRECIP_TOLR_MAX | precipitation_tolerance_max |
|
FROST_FREE_DAY_MIN | frost_free_days_min |
|
PLNT_DEN_LOW | planting_density_min |
|
ROOT_DPTH_MIN | root_depth_min |
|
SOIL_PH_TOLR_MIN | soil_ph_tolerance_min |
|
SOIL_ADP_C_TXT_IND | coarse_texture_soil_adaptable_ind | ['No', 'Yes'] in PLANTS_Characteristics_Data_2022 corresponds to [0, 1] in plant_growth_requirements. |
SOIL_ADP_F_TXT_IND | fine_texture_soil_adaptable_ind | ” |
SOIL_ADP_M_TXT_IND | medium_texture_soil_adaptable_ind | ” |
FIRE_TOLR_CD | fire_tolerance_id | ['High', 'Low', 'Medium', 'None'] in PLANTS_Characteristics_Data_2022 corresponds to [1, 2, 3, 4] in plant_growth_requirements. |
HEDG_TOLR_CD | hedge_tolerance_id | ” |
CACO3_TOLR_CD | caco3_tolerance_id | ” |
DRGHT_TOLR_CD | drought_tolerance_id | ” |
ANERB_TOLR_CD | anaerobic_tolerance_id | ” |
SLIN_TOLR_CD | salinity_tolerance_id | ” |
In addition, multiple cultivars would occasionally match on all columns. Monica Pokorny went to the trouble of differentiating them.
I now had everything needed to build VegSpec, except there was the nagging issue of how to get updates when the PLANTS database changed, and how to get our work back into the PLANTS database.
We eventually got a SQL dump of the database on Sep 27, 2023.
Unfortunately, everything changed. None of the columns are named the same. For example, what was originally grwth_prd_actv_cd was now active_growth, but that was based on a view that included the d_season table's season_id column.
To replicate the existing functionality of VegSpec, I now needed to do 42 JOINs on 18 tables: plant_master_tbl, plant_classifications_tbl, plant_duration, d_plant_duration, plant_growth_habit, d_plant_growth_habit, plant_morphology_physiology, plant_growth_requirements, plant_reproduction, plant_suitability_use, plant_location_characteristic, plant_location, d_plant_nativity, d_season, d_rate, d_extent, d_color, and d_foliage_porosity.
This is the code I came up with, which gives us this output.
Database transition challenges
The PLANTS database is in SQL Server. The schema and data were exported as .sql files, which were then imported into a PostgreSQL database.
SQL Server has unique SQL syntax, so I wrote a program to translate it into PostgreSQL syntax.
Some of the differences between the database systems are listed below.
CREATE TABLE differences
SQL Server | PostgreSQL | Notes |
---|---|---|
CREATE TABLE [dbo].[d_plant_reference_type]( | CREATE TABLE plants3.d_plant_reference_type( | SQL Server uses square brackets to delimit columns. PostgreSQL uses double quotes, which is the ANSI standard. |
[plant_reference_type_id] [int] IDENTITY(1,1) NOT NULL, | plant_reference_type_id SERIAL PRIMARY KEY NOT NULL | In SQL Server, [int] IDENTITY(1,1) creates an auto-incrementing column. The PostgreSQL equivalent is SERIAL. |
[plant_reference_media_type] [varchar](10) NULL, | plant_reference_media_type VARCHAR(10) NULL, |
|
[plant_reference_source_type] [varchar](25) NULL, | plant_reference_source_type VARCHAR(25) NULL, |
|
[active_record_ind] [bit] NOT NULL, | active_record_ind BOOLEAN NOT NULL, | In SQL Server, [bit] creates a Boolean (true/false) column. The PostgreSQL equivalent is BOOLEAN. |
[creation_date] [datetime2](7) NOT NULL, | creation_date TIMESTAMP(6) NOT NULL, | In SQL Server, [datetime2](7) specifies a date/time column with 7 digits of fractional seconds (100 nanoseconds). The PostgreSQL equivalent to DATETIME2 is TIMESTAMP. |
[created_by] [varchar](23) NULL, | created_by VARCHAR(23) NULL, |
|
[last_change_date] [datetime2](7) NULL, | last_change_date TIMESTAMP(6) NULL, |
|
[last_changed_by] [varchar](23) NULL, | last_changed_by VARCHAR(23) NULL); |
|
CONSTRAINT [PK_REFERENCE_TYPE_ID1] PRIMARY KEY CLUSTERED |
| Constraints aren't necessary for VegSpec purposes. |
INSERT differences
SQL Server | PostgreSQL | Notes |
---|---|---|
INSERT [dbo].[d_plant_reference_type] ( | INSERT INTO plants3.d_plant_reference_type ( |
|
[plant_reference_type_id], | plant_reference_type_id, |
|
[plant_reference_media_type], | plant_reference_media_type, |
|
[plant_reference_source_type], | plant_reference_source_type, |
|
[active_record_ind], | active_record_ind, |
|
[creation_date], | creation_date, |
|
[created_by], | created_by, |
|
[last_change_date], | last_change_date, |
|
[last_changed_by] | last_changed_by |
|
) | ) |
|
VALUES ( | VALUES ( |
|
1, | 1, |
|
N'CD-ROM', | 'CD-ROM', | "N" is required in SQL Server to specify a Unicode string. It's not needed in PostgreSQL. |
N'database', | 'database', |
|
1, | true, | active_record_ind is a BIT column in SQL Server. PostgreSQL doesn't have BIT columns. The equivalent is Boolean. Note that MS Access uses Yes/No for a Boolean column, where negative 1 represents true and 0 represents false. |
CAST(N'2018-08-29T20:20:28.4733333' AS DateTime2), | '2018-08-29T20:20:28.4733333', | PostgreSQL doesn't have a DateTime2 type. |
N'Original 3.0 Migration ', | 'Original 3.0 Migration ', |
|
CAST(N'2021-01-15T13:31:32.2716512' AS DateTime2), | '2021-01-15T13:31:32.2716512', |
|
N'SQL ETL' | 'SQL ETL' |
|
) | ); | Semicolons are needed at the end of PostgreSQL statements. |