SciNet Data Transfer Strategy
1. Overview
This document gives a general overview of the data held on NCSU’s NFS storage and is intended to help plan its move to SciNet. The goal is to sync the important data and leave behind anything unnecessary, so that what lands on SciNet is well organized and easy to access for research.
This includes:
What’s in storage? Documenting where everything is currently stored.
What stays, what goes? Sorting out the must-have data from the stuff we don’t need.
How is it structured? Breaking down the organization of the key parts.
Do we transfer everything or just fully processed data? Some backlog batches in semifield-developed-images may have an images folder but lack metadata, meta_masks, or reference folders, meaning they haven’t been fully processed through the annotation pipeline. Do we sync these unprocessed (unannotated) batches or only transfer those that are complete?
2. Existing Data Storage at NCSU
NFS "Lockers" and Contents
NFS Path | Dataset Type | Key Folders (Must Keep) | Non-Essential/Unrelated Folders | Size Estimate |
---|---|---|---|---|
| Semifield | | | 29.2 TB |
| Semifield | | | 29.95 TB |
| Semifield | | | 14.6 TB |
| Field | TBD | TBD | TBD |
3. Data Organization & Structure
Semifield Data (NFS locations 1-3):
1. semifield-uploads
Contains raw, unprocessed images collected from the BenchBot (bbot) across different versions:
Version 2 – Vention + Sony camera
Version 3 – Amiga + Sony camera
Version 3.1 – Amiga + SVS camera
Image file types:
Sony cam images → .ARW (fully processed)
SVS cam images → .RAW (not yet processed)
Some batches include an additional SONY folder containing Sony camera captures.
The raw images in semifield-uploads are the “rawest” form of our data. They are not, and should not be, used directly for computer vision or deep learning (CV/DL) training. Instead, they are color corrected and stored in semifield-developed-images.
2. semifield-developed-images
Contains color-corrected images and associated metadata.
Key subfolders:
images – Color-corrected JPGs + .pp3 profiles.
metadata – JSON files for each processed image ([Link to metadata Confluence page])
meta_masks – Keep semantic masks; ignore instance masks (accuracy issues).
reference – Camera position & orientation from SfM reconstruction.
Other subfolders may exist but are unnecessary and should not be transferred, such as plant-detections, masks, prediction_images, and autosfm.
Some batches only contain an images folder, meaning they haven’t gone through the annotation pipeline yet.
Question: Do we sync these unprocessed batches, or only fully processed batches that include metadata, meta_masks, and reference?
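To help answer the question above, it may be useful to know how many batches are actually incomplete. The following is a minimal sketch, assuming each batch is a direct subfolder of semifield-developed-images and that "fully processed" means all four key subfolders are present. The function names are illustrative and not part of any existing tooling.

```python
from pathlib import Path

# Subfolder names taken from the list above; a batch is treated as
# "fully processed" only when all of them exist (an assumption).
REQUIRED_SUBFOLDERS = ("images", "metadata", "meta_masks", "reference")


def missing_subfolders(batch_dir: Path) -> list[str]:
    """Return the required subfolders that are absent from one batch folder."""
    return [name for name in REQUIRED_SUBFOLDERS if not (batch_dir / name).is_dir()]


def split_batches(developed_root: Path) -> tuple[list[str], dict[str, list[str]]]:
    """Split batch folders into complete ones and incomplete ones.

    Incomplete batches are mapped to the subfolders they are missing.
    """
    complete: list[str] = []
    incomplete: dict[str, list[str]] = {}
    for batch_dir in sorted(p for p in developed_root.iterdir() if p.is_dir()):
        missing = missing_subfolders(batch_dir)
        if missing:
            incomplete[batch_dir.name] = missing
        else:
            complete.append(batch_dir.name)
    return complete, incomplete
```

Running split_batches on each NFS location would give a count of unannotated batches, which could inform whether syncing them is worth the storage.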
3. semifield-cutouts
Batch folders contain cutout segments used for training.
Each cutout segment consists of four files with no subfolders:
.png – Cutout image
.jpg – Cutout image
.json – Metadata file
*_mask.png – Corresponding mask
[Link to Confluence page on data products]
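The four-file layout can be checked before and after transfer. The sketch below assumes the four files of a cutout share a common stem (for example `<stem>.png`, `<stem>.jpg`, `<stem>.json`, `<stem>_mask.png`). That naming detail is an assumption based on the `*_mask.png` pattern above and should be confirmed against real batches. The function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: all four files of one cutout share the same stem.
EXPECTED_SUFFIXES = {".png", ".jpg", ".json", "_mask.png"}


def cutout_key(filename: str) -> tuple[str, str]:
    """Split a filename into (cutout stem, suffix kind)."""
    if filename.endswith("_mask.png"):
        return filename[: -len("_mask.png")], "_mask.png"
    path = Path(filename)
    return path.stem, path.suffix.lower()


def incomplete_cutouts(batch_dir: Path) -> dict[str, set[str]]:
    """Map each cutout stem to the expected suffixes it is missing.

    Cutouts that have all four files are left out of the result.
    """
    found: dict[str, set[str]] = defaultdict(set)
    for entry in batch_dir.iterdir():
        if entry.is_file():
            stem, kind = cutout_key(entry.name)
            found[stem].add(kind)
    return {
        stem: EXPECTED_SUFFIXES - kinds
        for stem, kinds in found.items()
        if EXPECTED_SUFFIXES - kinds
    }
```

An empty result for a batch folder means every cutout in it has its image pair, metadata file, and mask.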
4. Additional Folders in longterm_images2
1. semifield-utils
Internal processing/reference folder – Likely should not be public but can be moved.
autosfm – Contains shapefiles and ground control points for species assignment in annotation workflows. These files are not directly usable and should not be public.
image_development – Holds .pp3 profiles and color matrix for bbot v3.1 images. Still in development, so should not be shared publicly but can be moved.
species_information – Important! Stores species class data used in metadata files (developed-images, cutouts). These are the final set-in-stone species names.
[Link to Confluence page with species classification]
2. semifield-tools
Contains trained ML models for data processing – Should not be public, but can be moved.
Non-target weed classifier – Used for data cleaning.
Plant segmentation model – Generates masks & cutouts.
3. semifield-database
MUST BE MOVED & SHARED PUBLICLY
Contains SQLite3 database file (.db).
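Because the database must be moved and shared, a quick sanity check after the copy may be worthwhile. The sketch below uses Python's standard sqlite3 module to open a file read-only, run SQLite's built-in integrity check, and list table names. The function name is illustrative, and nothing here assumes a particular schema for the actual database.

```python
import sqlite3
from pathlib import Path


def inspect_database(db_path: Path) -> dict[str, object]:
    """Open a SQLite file read-only and report integrity status and table names."""
    # mode=ro opens the file read-only, so the check never modifies it.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # PRAGMA integrity_check returns a single row containing "ok"
        # when no problems are found.
        status = conn.execute("PRAGMA integrity_check").fetchone()[0]
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
    finally:
        conn.close()
    return {"integrity": status, "tables": tables}
```

Comparing the output for the NFS copy and the SciNet copy gives a basic check that the file arrived intact.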
5. Data Products & Prioritization
Must-Have Data Products (In order of priority)
semifield-developed-images
semifield-cutouts
semifield-database
semifield-utils (species_information only)
semifield-uploads
Zipped folders in GROW_DATA/semifield-uploads (Decision Needed: Transfer as-is, unzip first, or exclude?)
Data Not Needed for Transfer (Low Priority / Exclude)
semifield-tools (in longterm_images2)
Hackathon-related folders
Temporary/intermediate datasets
Redundant or old test datasets
Any batch folders within semifield-uploads, semifield-developed-images, and semifield-cutouts that do not match the required state-date naming format.
instance_masks in meta_masks (cannot be verified for accuracy)
Exclude autosfm and image_development from public sharing (internal reference only)
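Some of these exclusions can be expressed as a simple path filter. The sketch below covers only the semifield-developed-images batches and assumes paths are given relative to a batch folder, with forward slashes. It also assumes the instance masks live in a subfolder literally named instance_masks inside meta_masks, which should be verified. The names should_transfer and EXCLUDED_TOP_LEVEL are hypothetical.

```python
from pathlib import PurePosixPath

# Subfolders of a semifield-developed-images batch that this document
# marks as unnecessary for transfer.
EXCLUDED_TOP_LEVEL = {"plant-detections", "masks", "prediction_images", "autosfm"}


def should_transfer(relative_path: str) -> bool:
    """Decide whether a path inside a developed-images batch should be synced.

    relative_path is relative to the batch folder, using forward slashes.
    """
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return True
    # Whole subfolders that are not needed.
    if parts[0] in EXCLUDED_TOP_LEVEL:
        return False
    # Instance masks are dropped only when they sit inside meta_masks
    # (folder name is an assumption).
    if parts[:2] == ("meta_masks", "instance_masks"):
        return False
    return True
```

The same rules could be written as exclude patterns for whatever sync tool is chosen. The function form is mainly useful for dry runs that list what would be skipped.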
6. Migration Plan & Considerations
Next Steps:
Decide if we sync unprocessed (unannotated) batches in semifield-developed-images, or only those with metadata, meta_masks, and reference data.
Decide on handling of zipped folders in GROW_DATA/semifield-uploads.