SciNet Data Transfer Strategy

1. Overview

This document gives an overview of the data currently held on NCSU’s NFS storage and is intended to help plan its move to SciNet. The goal is to sync the important data and leave behind anything unnecessary, so that what arrives on SciNet is well organized and easy to access for research.

This includes:

  • What’s in storage? Documenting where everything is currently stored.

  • What stays, what goes? Sorting out the must-have data from the stuff we don’t need.

  • How is it structured? Breaking down the organization of the key folders.

  • Do we transfer everything or just fully processed data? Some backlog batches in semifield-developed-images may have an images folder but lack metadata, meta_masks, or reference folders, meaning they haven’t been fully processed through the annotation pipeline. Do we sync these unprocessed (unannotated) batches or only transfer those that are complete?


2. Existing Data Storage at NCSU

NFS "Lockers" and Contents

| NFS Path | Dataset Type | Key Folders (Must Keep) | Non-Essential/Unrelated Folders | Size Estimate |
|---|---|---|---|---|
| /rsstu/users/s/screberg/longterm_images | Semifield | semifield-uploads, semifield-developed-images, semifield-cutouts | hackathon-beginner-track, hackathon-advanced-track, 0804_cutouts.tar | 29.2 TB |
| /rsstu/users/s/screberg/GROW_DATA | Semifield | semifield-uploads, semifield-developed-images, semifield-cutouts | roboflow_test, GROW WHACK 2022 UAS DATA, pseudo-label-data, semifield-outputs | 29.95 TB |
| /rsstu/users/s/screberg/longterm_images2 | Semifield | semifield-uploads, semifield-developed-images, semifield-cutouts, semifield-utils, semifield-tools, semifield-database | temp_testdata, semif-synthetic-datasets, to_del, train_val_test_512x512 | 14.6 TB |
| /rsstu/users/r/raatwell/longterm_images3 | Field | TBD | TBD | TBD |
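For planning, the size estimates above can be turned into a rough upper bound on transfer time. The sketch below is only illustrative: the throughput values are hypothetical, not measured rates between NCSU and SciNet, and the totals include the non-essential folders, so the real transfer should be smaller. longterm_images3 is left out because its size is still TBD.

```python
# Locker sizes in TB, copied from the table above (longterm_images3 is still TBD).
LOCKER_SIZES_TB = {
    "longterm_images": 29.2,
    "GROW_DATA": 29.95,
    "longterm_images2": 14.6,
}


def total_size_tb(sizes):
    """Sum the per-locker size estimates (TB)."""
    return sum(sizes.values())


def transfer_days(size_tb, throughput_gbit_s):
    """Days needed to move size_tb terabytes at a sustained rate in Gbit/s.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gbit = 1e9 bits).
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (throughput_gbit_s * 1e9)
    return seconds / 86400


if __name__ == "__main__":
    total = total_size_tb(LOCKER_SIZES_TB)
    print(f"Upper bound on semifield data: {total:.2f} TB")
    for rate in (1, 10):  # ASSUMPTION: hypothetical sustained rates, not measurements
        print(f"  at {rate} Gbit/s: about {transfer_days(total, rate):.1f} days")
```

Replacing the hypothetical rates with a measured one from a small test transfer would make the estimate more useful.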


3. Data Organization & Structure

Semifield Data (NFS locations 1-3):

1. semifield-uploads

  • Contains raw, unprocessed images collected from the BenchBot (bbot) across different versions:

    • Version 2 – Vention + Sony camera

    • Version 3 – Amiga + Sony camera

    • Version 3.1 – Amiga + SVS camera

  • Image file types:

    • Sony camera images: .ARW (fully processed)

    • SVS camera images: .RAW (not yet processed)

  • Some batches include an additional SONY folder containing Sony camera captures.

  • Raw images in semifield-uploads are not, and should not be, used directly for computer vision (CV) or deep learning (DL) training. Instead, they are color-corrected and stored in semifield-developed-images.

  • These images represent the “rawest” form of our data.
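Because the raw file extension indicates which camera produced a batch, a quick inventory of semifield-uploads can show how much data is Sony (.ARW) versus SVS (.RAW). The following is a minimal sketch; the function name and the case-insensitive matching are assumptions made for illustration, not part of the existing pipeline.

```python
from collections import Counter
from pathlib import Path

# Extensions taken from the list above; matching is case-insensitive (assumption).
CAMERA_BY_EXTENSION = {".arw": "Sony", ".raw": "SVS"}


def count_raw_files(batch_dir):
    """Count raw files per camera type anywhere under one batch folder.

    The search is recursive, so captures inside an extra SONY subfolder
    are counted as well.
    """
    counts = Counter()
    for path in Path(batch_dir).rglob("*"):
        camera = CAMERA_BY_EXTENSION.get(path.suffix.lower())
        if camera and path.is_file():
            counts[camera] += 1
    return counts
```

Running this over every batch folder would give a per-batch count that can be compared before and after the transfer.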

2. semifield-developed-images

  • Contains color-corrected images and associated metadata.

  • Key subfolders:

    • images – Color-corrected JPGs + .pp3 profiles.

    • metadata – JSON files for each processed image ([Link to metadata Confluence page])

    • meta_masks – Keep semantic masks; ignore instance masks (accuracy issues).

    • reference – Camera position & orientation from SfM reconstruction.

  • Other subfolders may exist but are unnecessary, such as:

    • plant-detections, masks, prediction_images, autosfm – Should not be transferred.

  • Some batches only contain an images folder, meaning they haven’t gone through the annotation pipeline yet.

    • Question: Do we sync these unprocessed batches, or only fully processed batches that include metadata, meta_masks, and reference?
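Whichever way that question is decided, it helps to know how many batches fall into each group. The sketch below separates fully processed batches from incomplete ones by checking for the four key subfolders named above. The function names are hypothetical, and it assumes each batch is a direct child of semifield-developed-images.

```python
from pathlib import Path

# Subfolders a batch needs in order to count as fully processed (from the list above).
REQUIRED_SUBFOLDERS = ("images", "metadata", "meta_masks", "reference")


def missing_subfolders(batch_dir):
    """Return the required subfolders that are absent from one batch folder."""
    batch = Path(batch_dir)
    return [name for name in REQUIRED_SUBFOLDERS if not (batch / name).is_dir()]


def split_batches(developed_root):
    """Split batches into complete ones and incomplete ones.

    Returns (complete, incomplete), where complete is a sorted list of batch
    names and incomplete maps each batch name to its missing subfolders.
    """
    complete, incomplete = [], {}
    for batch in sorted(Path(developed_root).iterdir()):
        if not batch.is_dir():
            continue
        missing = missing_subfolders(batch)
        if missing:
            incomplete[batch.name] = missing
        else:
            complete.append(batch.name)
    return complete, incomplete
```

The incomplete mapping doubles as a backlog list for the annotation pipeline if those batches are left behind for now.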

3. semifield-cutouts

  • Batch folders contain cutout segments used for training.

  • Each cutout segment consists of four files with no subfolders:

    • .png – Cutout image

    • .jpg – Cutout image

    • .json – Metadata file

    • *_mask.png – Corresponding mask

  • [Link to Confluence page on data products]
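The four-files-per-cutout rule can be checked before and after the sync. This sketch assumes the four files of one cutout share a common stem (for example name.png, name.jpg, name.json, name_mask.png); that naming pattern is an assumption based on the *_mask.png entry above and should be confirmed against real batches.

```python
from collections import defaultdict
from pathlib import Path

MASK_SUFFIX = "_mask.png"
EXPECTED_SUFFIXES = {".png", ".jpg", ".json", MASK_SUFFIX}


def group_cutout_files(batch_dir):
    """Map each cutout stem to the set of expected suffixes found for it."""
    groups = defaultdict(set)
    for path in Path(batch_dir).iterdir():
        if not path.is_file():
            continue
        name = path.name
        if name.endswith(MASK_SUFFIX):
            # Check the mask first, otherwise it would be grouped as a plain .png.
            groups[name[: -len(MASK_SUFFIX)]].add(MASK_SUFFIX)
        elif path.suffix in EXPECTED_SUFFIXES:
            groups[path.stem].add(path.suffix)
    return groups


def incomplete_cutouts(batch_dir):
    """Return {stem: missing suffixes} for cutouts that lack any of the four files."""
    return {
        stem: sorted(EXPECTED_SUFFIXES - found)
        for stem, found in group_cutout_files(batch_dir).items()
        if found != EXPECTED_SUFFIXES
    }
```

An empty result for a batch means every cutout in it has all four files under the assumed naming pattern.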


4. Additional Folders in longterm_images2

1. semifield-utils

  • Internal processing/reference folder – Likely should not be public but can be moved.

  • autosfm – Contains shapefiles and ground control points for species assignment in annotation workflows. These files are not directly usable and should not be public.

  • image_development – Holds .pp3 profiles and color matrix for bbot v3.1 images.

    • Still in development, so should not be shared publicly but can be moved.

  • species_information – Important! Stores species class data used in metadata files (developed-images, cutouts).

    • These are the final set-in-stone species names.

    • [Link to Confluence page with species classification]

2. semifield-tools

  • Contains trained ML models for data processing – Should not be public, but can be moved.

  • Non-target weed classifier – Used for data cleaning.

  • Plant segmentation model – Generates masks & cutouts.

3. semifield-database

MUST BE MOVED & SHARED PUBLICLY

  • Contains SQLite3 database file (.db).
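Since the database is a single SQLite3 file, it can be sanity-checked after the copy using only Python's standard library. The sketch below opens the file read-only, lists its tables, and runs SQLite's built-in integrity check. The schema is not described here, so nothing about table names is assumed, and the function names are hypothetical.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def _connect_read_only(db_path):
    """Open a SQLite file read-only so the check cannot modify it."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def list_tables(db_path):
    """Return the table names stored in the database, sorted by name."""
    with closing(_connect_read_only(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def integrity_ok(db_path):
    """Return True if SQLite's integrity check reports no problems."""
    with closing(_connect_read_only(db_path)) as conn:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    return rows == [("ok",)]
```

Comparing the output of list_tables on the NCSU copy and the SciNet copy, together with a file checksum, would give reasonable confidence that the database arrived intact.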

5. Data Products & Prioritization

Must-Have Data Products (In order of priority)

  • semifield-developed-images

  • semifield-cutouts

  • semifield-database

  • semifield-utils (species_information only)

  • semifield-uploads

  • Zipped folders in GROW_DATA/semifield-uploads (Decision Needed: Transfer as-is, unzip first, or exclude?)

Data Not Needed for Transfer (Low Priority / Exclude)

  • semifield-tools (in longterm_images2)

  • Hackathon-related folders

  • Temporary/intermediate datasets

  • Redundant or old test datasets

  • Any batch folders within the semifield-uploads, semifield-developed-images, and semifield-cutouts folders that do not match the required state-date naming format.

  • instance_masks in meta_masks (cannot be verified for accuracy)

  • Exclude autosfm and image_development from public sharing (internal reference only)
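The include and exclude rules above could be captured in a single filter and applied to paths relative to a locker root. This is a sketch under stated assumptions, not the agreed migration logic: the batch-name pattern (two-letter state code, underscore, ISO date) is a hypothetical reading of the "state-date" format and must be replaced with the real convention. Under that pattern, zipped upload folders would also be skipped, and their handling is still undecided.

```python
import re
from pathlib import PurePosixPath

# ASSUMPTION: hypothetical batch-name pattern such as "NC_2023-05-01".
# Replace with the real state-date convention before use.
BATCH_NAME = re.compile(r"[A-Z]{2}_\d{4}-\d{2}-\d{2}")

BATCHED_FOLDERS = {"semifield-uploads", "semifield-developed-images", "semifield-cutouts"}
UNNEEDED_DEVELOPED_SUBFOLDERS = {"plant-detections", "masks", "prediction_images", "autosfm"}


def should_transfer(relative_path):
    """Decide whether a path (relative to a locker root) belongs in the transfer."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return False
    top, rest = parts[0], parts[1:]
    if top == "semifield-database":
        return True
    if top == "semifield-utils":
        # Must-have list: species_information only.
        return bool(rest) and rest[0] == "species_information"
    if top not in BATCHED_FOLDERS:
        # Everything else (semifield-tools, hackathon folders, test data) is skipped.
        return False
    if not rest:
        return True  # the top-level folder itself
    if not BATCH_NAME.fullmatch(rest[0]):
        return False  # batch folder does not follow the naming format
    if top == "semifield-developed-images" and len(rest) > 1:
        if rest[1] in UNNEEDED_DEVELOPED_SUBFOLDERS:
            return False
        if rest[1:3] == ("meta_masks", "instance_masks"):
            return False
    return True
```

Writing the rules as an allowlist means any folder not named in the must-have list is left behind by default, which matches the intent of syncing only what is needed.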


6. Migration Plan & Considerations

Next Steps:

  • Decide if we sync unprocessed (unannotated) batches in semifield-developed-images, or only those with metadata, meta_masks, and reference data.

  • Decide on handling of zipped folders in GROW_DATA/semifield-uploads.

 
