Field data shepherding

This document describes the structure of the weedsimagerepo Azure storage account and how the pre-processing metadata tables relate to each other.

Storage structure

There are 5 tables and 1 blob of interest for the purpose of reporting back to partners. The blob contains the image files, which are saved in two formats, JPG and ARW, whereas the tables contain all the pre-processing metadata associated with those images.

Tables

The metadata is organized as a relational database.

Even though the idea was to avoid data repetition, this storage does repeat data across tables. It is also important to note that, although there are 3 levels of data organization, the tables are not related only from one level to the next: the 1st and 3rd level tables are directly related, as are the 1st and 2nd and the 2nd and 3rd.

 

Figure: Diagram representing the 5 pre-processing metadata tables and how they relate.
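
The relationships in the diagram can be sketched in plain Python. This is only an illustration under one assumption: that the MasterRefID field of the second and third level tables holds the RowKey of the matching wirmastermeta entry. The sample rows and the join_on_master helper are hypothetical; real rows would be read from the storage account.

```python
# Hypothetical sample rows; real rows would be read from the Azure tables.
master_rows = [
    {"RowKey": "m-001", "UsState": "MD", "PlantType": "WEEDS"},
]
weed_rows = [
    {"RowKey": "w-001", "MasterRefID": "m-001", "SizeClass": "Small"},
]
image_rows = [
    {
        "RowKey": "i-001",
        "MasterRefID": "m-001",
        "ImageIndex": 0,
        "ImageURL": "https://weedsimagerepo.blob.core.windows.net/weedsimagerepo/TXF03026.ARW",
    },
]


def join_on_master(master, children):
    """Merge each child row with its master row (assumed MasterRefID -> RowKey)."""
    master_by_key = {row["RowKey"]: row for row in master}
    joined = []
    for child in children:
        parent = master_by_key.get(child.get("MasterRefID"))
        if parent is None:
            continue  # a child without a matching master row is skipped here
        # Prefix master fields so RowKey, Timestamp, etc. do not collide.
        merged = {f"master_{name}": value for name, value in parent.items()}
        merged.update(child)
        joined.append(merged)
    return joined


weeds_with_master = join_on_master(master_rows, weed_rows)
images_with_master = join_on_master(master_rows, image_rows)
```

Under the same assumption, second and third level rows that share a MasterRefID value could be matched to each other in the same way.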

These are the 5 tables of relevance for reporting back to our partners, with brief descriptions of their contents:

  • wirmastermeta: first level table. Below are the most relevant fields that it contains:

    • PartitionKey (string, autogenerated): Azure storage name

    • RowKey (string, autogenerated): unique id for each table entry. Used to relate this table to others.

    • UsState (string, user input, dropdown): unique partner code. The partner codes are inconsistent: because some states have more than 1 partner, some codes have 2 characters and others have 4. I would like to modify this so that all codes have 4 characters, the 2 letters of the state followed by 2 numbers (01, 02, etc.). The back-end partner codes are called affiliations in the front end, where they are formed by the US state initials + the primary investigator’s last name for that group (e.g. MD-Mirsky).

    • PlantType (string, user input, dropdown): Three plant categories, all upper case, no spaces.

    • CloudCover (string, user input, dropdown):

    • GroundResidue (string, user input, dropdown and type): type of ground residue, e.g. previous crop in the rotation.

    • GroundCover (string, user input, dropdown): 5 ranges from 0 to 100% coverage.

    • Timestamp (date, autogenerated): date and time of upload to this storage.

    • Username (string, user input, type): This one is a free-for-all; we didn’t ask the users to enter anything specific. In some cases they entered a name, in others just a letter or initials. There are also empty cells due to an early version of the app which didn’t require the users to complete this field. There can be multiple user names per partner code.

    • WeedsOrCrops: This column has a few entries and is redundant: it repeats the contents of the PlantType field.

  • wircovercropsmeta: second level table. Contains PlantType = COVERCROP only data. Things to note about the data in this table: PartitionKey and Affiliation both contain the same information, and this information already exists in the higher level table as UsState. CloudCover, GroundResidue and GroundCover are also repeated from the higher level table.

    • FlowerFruitOrSeeds (Boolean, user input, multiple choice): whether reproductive organs are present or not.

    • CoverCropSpecies (string, user input, dropdown): species of cover crops specifically selected for this repository.

    • CoverCropFamily (string, user input, dropdown): category of cover crop.

There is no reason for this table to exist since all the distinct variables that exist here belong in the higher level table.
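
Before dropping the repeated columns, it may be worth checking that they really agree with the higher level table. The sketch below does that for the fields named above. It assumes that wircovercropsmeta carries a MasterRefID pointing at the wirmastermeta RowKey, as the other second level tables do; the sample values and the find_inconsistencies helper are hypothetical.

```python
# Fields that the text above describes as repeated from wirmastermeta.
REPEATED_FIELDS = ("CloudCover", "GroundResidue", "GroundCover")


def find_inconsistencies(master, covercrops):
    """Return (RowKey, field) pairs where a cover crop row disagrees with its master row."""
    master_by_key = {row["RowKey"]: row for row in master}
    problems = []
    for row in covercrops:
        # Assumption: MasterRefID exists in this table and holds the master RowKey.
        parent = master_by_key.get(row.get("MasterRefID"))
        if parent is None:
            problems.append((row["RowKey"], "MasterRefID"))
            continue
        # PartitionKey, Affiliation and the master UsState should all be the same code.
        codes = {row.get("PartitionKey"), row.get("Affiliation"), parent.get("UsState")}
        if len(codes) > 1:
            problems.append((row["RowKey"], "partner code"))
        for field in REPEATED_FIELDS:
            if row.get(field) != parent.get(field):
                problems.append((row["RowKey"], field))
    return problems


# Hypothetical sample values, for illustration only.
master_rows = [
    {"RowKey": "m-001", "UsState": "MD", "CloudCover": "Cloudy",
     "GroundResidue": "Corn", "GroundCover": "0-20"},
]
covercrop_rows = [
    {"RowKey": "c-001", "MasterRefID": "m-001", "PartitionKey": "MD", "Affiliation": "MD",
     "CloudCover": "Cloudy", "GroundResidue": "Corn", "GroundCover": "0-20"},
    {"RowKey": "c-002", "MasterRefID": "m-001", "PartitionKey": "MD", "Affiliation": "NC",
     "CloudCover": "Cloudy", "GroundResidue": "Corn", "GroundCover": "20-40"},
]
problems = find_inconsistencies(master_rows, covercrop_rows)
```

An empty result would support folding these columns into the higher level table without losing information.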

  • wircropsmeta: second level table. Contains PlantType = CASHCROPS only data.

    • PartitionKey (string, autogenerated): Azure storage name

    • RowKey (string, autogenerated): unique id for each table entry. I don’t see the use for this column: this uid is not used to relate this table to others, nor is it present in the blob.

    • CropName (string, user input, dropdown): cash crop name, 3 categories.

    • MasterRefID (string, autogenerated): unique id for each table entry. Used to relate this table to others.

    • SizeClass (string, user input, dropdown): determined by the size of the target plant. This column was added in the second year of image collection; previously we used Height.

    • Height (string, user input, dropdown): ranges of heights. Determined by the size of the target plant. This field was only used the first year of image collection and was later on replaced by SizeClass.

    • GrowthStage (string, user input, dropdown): growth stages for cotton. This field was meant to be used only for cotton; check on the app.

    • Timestamp (date, autogenerated): date and time of upload to this storage.

    • CottonVariety (string, user input, dropdown): cotton varieties which were specifically selected to be included in this repository.

There is no reason for this table to exist since all the distinct variables that exist here belong in the higher level table.
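
Because Height was only used in the first year and SizeClass afterwards, anything that reports plant size has to look at both fields. A minimal sketch of one way to do that follows; the size_descriptor helper and the sample values are hypothetical, and it prefers SizeClass when both fields happen to be filled.

```python
def size_descriptor(row):
    """Return (field_name, value) for whichever size field is populated.

    SizeClass is checked first because it is the field currently in use;
    Height is the fallback for first-year entries.
    """
    for field in ("SizeClass", "Height"):
        value = row.get(field)
        if value not in (None, ""):
            return field, value
    return None, None


# Hypothetical rows: one from the first year, one from later years.
first_year_row = {"RowKey": "r-001", "Height": "10-20 cm", "SizeClass": ""}
later_row = {"RowKey": "r-002", "SizeClass": "Medium"}

first_year_size = size_descriptor(first_year_row)
later_size = size_descriptor(later_row)
```

Keeping the field name next to the value makes it clear which scale a given entry uses, since the two are not directly comparable.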

  • wirweedsmeta: second level table. Contains PlantType = WEEDS only data.

    • PartitionKey (string, autogenerated): Azure storage name

    • RowKey (string, autogenerated): unique id for each table entry. I don’t see the use for this column: this uid is not used to relate this table to others, nor is it present in the blob.

    • CropOrFallow (string, user input, multiple choice): whether the field where the images of target weeds are collected was planted with a crop or not.

    • MasterRefID (string, autogenerated): unique id for each table entry. Used to relate this table to others.

    • SizeClass (string, user input, dropdown): determined by the size of the target plant. This column was added in the second year of image collection; previously we used Height. Furthermore, there are 2 types of entries, 1, 2 and 3 and Small, Medium and Large, due to a change introduced at some point. The current levels are the latter.

    • FlowerFruitOrSeeds (Boolean, user input, multiple choice): whether reproductive organs are present or not.

    • WeedType (string, user input, dropdown): target weed species common name. These categories have changed since the app was first released, so there may be categories that should be the same but are not, e.g. a two-word name with both words capitalized and the same name with only the first word capitalized.

    • Height (string, user input, dropdown): ranges of heights. Determined by the size of the target plant. This field was only used the first year of image collection and was later on replaced by SizeClass.

    • CropType (string, user input, dropdown): crop that the target weeds are growing in. Only available if the answer to CropOrFallow is Crop.

    • Timestamp (date, autogenerated): date and time of upload to this storage.

  • wirimagerefs: third and lowest level table. This table contains metadata specific to each image file. The ImageURL field contains the image file name, which is used to pair the metadata to the image files.

    • PartitionKey (string, autogenerated): Azure storage name

    • RowKey (string, autogenerated): unique id for each table entry. I don’t see the use for this column: this uid is not used to relate this table to others, nor is it present in the blob.

    • MasterRefID (string, autogenerated): unique id for each table entry. Used to relate this table to others.

    • ImageURL (string, autogenerated): URL formed by the blob URL + / + image file name (e.g. https://weedsimagerepo.blob.core.windows.net/weedsimagerepo/TXF03026.ARW, TXF03026.ARW being the image file name). This field is what allows pairing the table metadata with the images stored in the blob.

    • ImageIndex (integer, autogenerated): order in which the images of the same package were collected. Each package contains 10 images (0-9).

    • Timestamp (date, autogenerated): date and time of upload to this storage.
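
Several of the fields described above have inconsistent categories (UsState, SizeClass, WeedType). The sketch below shows one possible way to flag or normalize them before reporting. It rests on assumptions that should be confirmed: the 4-character pattern is the proposed format rather than the current one, the mapping of 1, 2, 3 to Small, Medium, Large assumes that order, and the weed names are hypothetical sample values.

```python
import re
from collections import defaultdict

# Proposed partner code format: 2 state letters followed by 2 digits (e.g. MD01).
PARTNER_CODE = re.compile(r"[A-Z]{2}\d{2}")

# Assumption: the older numeric levels map to the current labels in this order.
SIZE_CLASS_MAP = {"1": "Small", "2": "Medium", "3": "Large"}


def is_four_char_code(code):
    """True if the code already follows the proposed 4-character format."""
    return PARTNER_CODE.fullmatch(code) is not None


def normalize_size_class(value):
    """Map older numeric entries to the current labels; leave other values as-is."""
    text = str(value).strip()
    return SIZE_CLASS_MAP.get(text, text)


def case_variants(values):
    """Group values that differ only by letter case or extra spaces."""
    groups = defaultdict(set)
    for value in values:
        key = " ".join(value.split()).casefold()
        groups[key].add(value)
    # Only keys with more than one spelling are potential duplicates.
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


# Hypothetical WeedType entries, for illustration only.
duplicates = case_variants(["Palmer Amaranth", "Palmer amaranth", "Kochia"])
```

The case_variants helper only reports candidates; deciding which spelling to keep is left to whoever curates the categories.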

Blob

  • weedsimagerepo: contains all the image files. These files can be paired to the table metadata by matching the image “Name” on this blob with the file name at the end of the ImageURL field in wirimagerefs.
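
The pairing described above can be sketched with the Python standard library alone. The example URL is the one given earlier in this document; the blob_name_from_url and pair_images helpers, the sample rows and the list of blob names are hypothetical, and in practice the blob names would come from listing the container.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit


def blob_name_from_url(image_url):
    """Extract the image file name (last path segment) from an ImageURL value."""
    return unquote(basename(urlsplit(image_url).path))


def pair_images(image_rows, blob_names):
    """Split wirimagerefs rows into those with and without a matching blob."""
    available = set(blob_names)
    paired, missing = [], []
    for row in image_rows:
        name = blob_name_from_url(row["ImageURL"])
        target = paired if name in available else missing
        target.append((name, row["MasterRefID"]))
    return paired, missing


# Hypothetical rows and blob listing, for illustration only.
image_rows = [
    {"MasterRefID": "m-001",
     "ImageURL": "https://weedsimagerepo.blob.core.windows.net/weedsimagerepo/TXF03026.ARW"},
    {"MasterRefID": "m-001",
     "ImageURL": "https://weedsimagerepo.blob.core.windows.net/weedsimagerepo/TXF03027.ARW"},
]
blob_names = ["TXF03026.ARW"]
paired, missing = pair_images(image_rows, blob_names)
```

Rows that end up in missing point to table entries whose image is not in the blob, which is useful to know before reporting back to partners.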