Data Shepherding
SemiField
Below are the steps involved in processing the semi-field data and the estimated person-hours needed to accomplish them:
Assumptions: there will be 9 batches per week, 3 batches from each partner site (MD, NC, and TX).
Glossary
Batch: Collection of images from a unique day and location
Season: A group of plants that are set up at the same time. For example, cover crops planted in fall 2023 and killed in 2024 are one season; weeds planted in spring 2023 and killed at the end of summer 2023 are another season.
Task #1: Pre-processing
Skills needed: basic command line and python skills, attention to detail.
Hours required to complete this task on average per batch: 0.5 h (30 min)
On average, 8 batches can be processed per day because of the time each batch takes to move through the pipeline once the manual inspection has been done.
Subtask #1-1: Pre-processing Backlog
As of Nov 14, 2023, there are 179 batches that need to be processed to get us to the point where we can start maintaining this pipeline on a weekly basis.
Labor estimation:
Total: 90 h
Per day (limited to): 1.5 h
Weeks needed to complete the task: 12 (assuming one person working 5 days a week; see the arithmetic sketch after this list)
Possible candidates for this role: Zack, Jordan, Courtney
Goal: finish processing the backlog by the end of Jan.
Each person is assigned ⅓ of the total batches (179/3 ~60)
Who gets which batches will be on a shared spreadsheet (Preprocessing Backlog Sheet)
Minimum of 16 batches/week/person
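For reference, the backlog numbers above follow directly from the stated assumptions; a minimal sketch of the arithmetic (all values taken from this document):

```python
# Worked arithmetic for the backlog estimates above (all values from this document).
BACKLOG_BATCHES = 179      # batches pending as of Nov 14, 2023
HOURS_PER_BATCH = 0.5      # 30 min of hands-on time per batch
HOURS_PER_DAY = 1.5        # daily cap per person
WORKDAYS_PER_WEEK = 5
PEOPLE = 3                 # Zack, Jordan, Courtney

total_hours = BACKLOG_BATCHES * HOURS_PER_BATCH                          # ~90 h
weeks_single_person = total_hours / (HOURS_PER_DAY * WORKDAYS_PER_WEEK)  # ~12 weeks
batches_per_person = BACKLOG_BATCHES / PEOPLE                            # ~60 batches each

print(f"Total effort: {total_hours:.0f} h")
print(f"Weeks if one person (5 days/week): {weeks_single_person:.0f}")
print(f"Batches per person: {batches_per_person:.0f}")
```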
Subtask #1-2: Pre-processing weekly
This happens on a weekly basis to keep the incoming data moving through the pipeline as it is uploaded to Azure storage.
Labor estimation:
Estimated batches per week: 9 (3 from each site)
Hours/week: 4.5 h (the hours have to be split across at least 3 days)
Candidates for this role: Zack, Jordan
Task #2: SfM bench reconstruction
This is the main bottleneck of the pipeline! Each time the QR codes are moved, a lot of resources need to be pulled in to first identify the issues and then adjust parameters accordingly to rerun the reconstruction. Markers MUST stay fixed at all sites moving forward. With BBot 2.0's RTK capability it's possible that the QR codes won't be needed anymore; until then, we have to guarantee that marker positions stay the same. (A sketch of a simple marker-drift check follows this task's labor estimates.)
Skills needed for Tasks #2-#5: each task demands a mix of technical skills (such as command line usage, Python programming, and GIS knowledge) and soft skills (like attention to detail and problem-solving):
Programming and Scripting
Command line/bash
Python
Git
Logging
Visual inspection
Labor estimation:
If QR codes are not in the same position (Auto SfM needs to be rerun):
Identify issues: 2 hours
Rerun: 2 hours
If QR codes are in the same position:
5 min/batch
45 min/week (Assuming 9 batches per week)
Candidates for this role: Nav and Matthew
It takes ~48 h to get the output of the Auto SfM each time a rerun is needed. If everything goes perfectly, it takes ~4 h on average to process 500 to 700 images (a full bench).
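Because the decision to rerun hinges entirely on whether the markers moved, a small scripted comparison could replace much of the 2 h of issue identification. Below is a minimal, hypothetical sketch; the file names, the CSV schema, and the 2 cm tolerance are assumptions, not part of the current pipeline.

```python
# Hypothetical sketch: flag whether bench markers (QR codes) have moved since the
# last run, which is what forces a full Auto SfM rerun. File names, CSV schema,
# and the 2 cm tolerance are assumptions.
import csv
import math

TOLERANCE_M = 0.02  # assumed acceptable marker drift, in meters

def load_markers(path):
    """Read marker_id -> (x, y) from a CSV with columns: marker_id, x, y."""
    with open(path, newline="") as f:
        return {row["marker_id"]: (float(row["x"]), float(row["y"]))
                for row in csv.DictReader(f)}

def markers_moved(reference_csv, current_csv, tol=TOLERANCE_M):
    ref, cur = load_markers(reference_csv), load_markers(current_csv)
    moved = []
    for marker_id, (rx, ry) in ref.items():
        if marker_id not in cur:
            moved.append((marker_id, "missing"))
            continue
        cx, cy = cur[marker_id]
        if math.hypot(cx - rx, cy - ry) > tol:
            moved.append((marker_id, "shifted"))
    return moved

# An empty list means the 5 min/batch path applies; otherwise plan a rerun.
# print(markers_moved("markers_reference.csv", "markers_batch.csv"))
```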
Task #3: Make shapefile
The shapefile contains the information on which set of pots corresponds to which species. It is created manually, using QGIS to draw a shapefile over the completed orthomosaic of the potting area (a scripted sanity check on the result is sketched after the estimates below).
Skills needed: see task #2
Labor estimation:
1.5 h per season/location
3-4 seasons/year
3 locations
Up to 18 h total per year
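Once a season/location shapefile is drawn in QGIS, a quick scripted check can catch missing species labels or invalid geometries before the file enters the pipeline. A minimal sketch using geopandas (one option among several); the file name and the "species" attribute column are assumptions.

```python
# Hypothetical sanity check for a pot-grid shapefile drawn in QGIS.
import geopandas as gpd

def check_pot_shapefile(path="nc_spring2023_pots.shp"):
    gdf = gpd.read_file(path)
    # Every polygon should carry a species label and a valid geometry.
    missing_species = gdf["species"].isna().sum()
    invalid_geoms = (~gdf.geometry.is_valid).sum()
    print(f"{len(gdf)} pot groups, {gdf['species'].nunique()} species")
    print(f"missing species labels: {missing_species}, invalid geometries: {invalid_geoms}")
    return gdf
```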
Task #4: Put all previously processed and generated data through the rest of the pipeline
This step includes plant detection, labeling, segmentation, and reviewing results on a random sample of ~10 output images (a sampling sketch follows the estimates below).
Skills needed: See task #2
Labor estimation:
Putting data through pipeline and reviewing results: 15 min/batch
Moving final images to local storage and Azure and checking the upload: 10 min/batch
3.75 h/week total
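A minimal sketch of the spot check described above, pulling ~10 random outputs from a batch for manual review; the directory layout and file extension are assumptions.

```python
# Hypothetical sketch: sample ~10 output images from a batch for manual review.
import random
from pathlib import Path

def sample_for_review(batch_dir, n=10, seed=None):
    images = sorted(Path(batch_dir).glob("*.jpg"))
    random.seed(seed)
    return random.sample(images, min(n, len(images)))

# for path in sample_for_review("results/MD_2023-11-14"):
#     print(path)
```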
Field
Task #1: Data Exploration
Skills needed: basic command line and python skills, attention to detail, data science background is a plus.
Hours required to complete this task:
Task #1-1: Assessing Volume and Storage
Volume Assessment: Assess the total volume of the dataset in terms of the number of images and the size of the metadata.
Storage and Accessibility Evaluation: Review current data storage solutions, assess accessibility for processing and analysis, and identify any data retrieval or integrity issues.
JPGs vs. RAW breakdown (see the sketch below)
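A minimal sketch of the volume assessment, assuming the field uploads sit under a single root directory (the path and extensions are assumptions):

```python
# Count files and total size per extension (e.g. .jpg vs RAW formats).
from collections import Counter
from pathlib import Path

def summarize_volume(root):
    counts, bytes_by_ext = Counter(), Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower()
            counts[ext] += 1
            bytes_by_ext[ext] += path.stat().st_size
    for ext, n in counts.most_common():
        print(f"{ext or '(none)'}: {n} files, {bytes_by_ext[ext] / 1e9:.2f} GB")

# summarize_volume("field_uploads/")
```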
Task #1-2: Data Organization and Inspection
Metadata Table Inspection: Examine the metadata table’s structure and content, focusing on key columns and the type of information they contain, like species and location, identifying gaps or potential enhancements (see the sketch after this list).
Correcting metadata errors, standardizing formats, and filtering out unusable images.
Data reports on key metrics
Data Organization Review: Analyze current data organization methods, such as naming conventions, and identify areas for improvement.
Generating batches
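A minimal sketch of the metadata inspection referenced above; the file name and the key columns (image_id, species, location) are assumptions about the table's actual schema.

```python
# Report missing values and cardinality for the key metadata columns.
import pandas as pd

def inspect_metadata(path="field_metadata.csv", key_cols=("image_id", "species", "location")):
    df = pd.read_csv(path)
    print(f"{len(df)} rows, {df.shape[1]} columns")
    for col in key_cols:
        if col not in df.columns:
            print(f"missing expected column: {col}")
        else:
            print(f"{col}: {df[col].isna().sum()} missing values, {df[col].nunique()} unique")
    return df
```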
Task #1-3: Quality Assessment and Anomaly Detection
Visual Inspection of Images: Conduct a preliminary visual inspection of a representative sample of images covering various data sources and plant types to evaluate quality factors like resolution and clarity. Identify any missing image types.
Metadata-Image Correlation: Ensure accurate linking between images and their metadata, checking for discrepancies or missing connections (see the sketch after this list).
Validate plant species: Confirm that the plant species in the image matches the metadata information.
Anomaly Detection: Spot anomalies or outliers in image sizes, formats, or metadata entries.
Reporting Findings: Document findings, highlight issues, and make recommendations for data cleaning and processing improvements.
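A minimal sketch of the metadata-image correlation check; it assumes JPG images whose file stems match an image_id column in the metadata table.

```python
# Flag images without metadata rows and metadata rows without images.
from pathlib import Path
import pandas as pd

def correlate(image_dir, metadata_csv):
    image_ids = {p.stem for p in Path(image_dir).glob("*.jpg")}
    meta_ids = set(pd.read_csv(metadata_csv)["image_id"].astype(str))
    print(f"images without metadata: {len(image_ids - meta_ids)}")
    print(f"metadata rows without images: {len(meta_ids - image_ids)}")
```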
Task #2: Defining Data Product Goals
Clearly articulate desired outputs from the image processing pipeline, such as cutouts, segmentation masks, and bounding box labels. Specify the details of each data product.
Establish a system for managing cutouts or sub-images, focusing on naming conventions, storage structure, and indexing.
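As a starting point for the naming-convention discussion, a hypothetical cutout naming scheme; the fields and their order are assumptions to be agreed on as part of this task.

```python
# Hypothetical cutout naming convention: batch, parent image, cutout index, species.
def cutout_name(batch_id: str, image_id: str, cutout_index: int, species: str) -> str:
    """E.g. cutout_name('NC_2023-11-14', 'IMG_0042', 3, 'cereal_rye')
    -> 'NC_2023-11-14_IMG_0042_003_cereal_rye.png'"""
    return f"{batch_id}_{image_id}_{cutout_index:03d}_{species}.png"
```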
Task #3: Image Processing
Skills needed:
Hours to complete: 3-4 weeks
Task #3-1: Preprocessing
Skills needed:
Hours required to complete: Backlog and ongoing
Potential candidates: Courtney, Jordan, Zack
Image inspection, color correction, and extracting image metadata (this requires downloading the images).
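A minimal sketch of the metadata-extraction step using Pillow; which tags matter, and whether RAW files need a dedicated reader, is still to be decided.

```python
# Hypothetical sketch: pull basic dimensions and EXIF tags from a downloaded image.
from PIL import Image, ExifTags

def image_metadata(path):
    with Image.open(path) as img:
        info = {"width": img.width, "height": img.height, "format": img.format}
        for tag_id, value in img.getexif().items():
            info[ExifTags.TAGS.get(tag_id, str(tag_id))] = value
    return info

# print(image_metadata("IMG_0042.jpg"))
```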
Task #3-2: Pipeline Development Planning
Plan the development of a segmentation pipeline, incorporating insights from previous approaches like semifield, SAM, or in-house models.
Identify technical requirements and necessary tools for the pipeline.
Plan for an iterative development process, allowing for continuous improvements and adjustments based on testing and feedback.
Develop a realistic timeline with key milestones for stages like initial development, testing, and deployment.
Task #3-3: Preliminary Dataset Development - Image Quality Dataset
Creation of the Dataset: During the visual inspection of images, create a preliminary "Image Quality Dataset." This dataset will involve labeling a subsample of images based on their quality.
Labeling Criteria: Labels should be assigned on a per-image level, focusing on various aspects of image quality such as exposure, brightness, focus, saturation, and other relevant factors.
Purpose of the Dataset: The primary goal is to develop a dataset that can be used to train an "image quality" classifier. This classifier will serve as an automated tool to assist in future image quality inspections and streamline the process.
Data Documentation and Preparation: Document the criteria used for labeling and prepare the dataset for integration into the ongoing data inspection pipeline (an example record schema is sketched below).
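A hypothetical example of what one record in the Image Quality Dataset could look like; the fields mirror the criteria above, but the exact schema and label values are assumptions.

```python
# Hypothetical per-image quality label record, appended to a shared CSV.
import csv
import os
from dataclasses import dataclass, asdict

@dataclass
class QualityLabel:
    image_id: str
    exposure: str    # e.g. "under", "ok", "over"
    focus: str       # e.g. "sharp", "soft", "blurry"
    usable: bool
    notes: str = ""

def append_label(label: QualityLabel, path="image_quality_labels.csv"):
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(asdict(label)))
        if write_header:
            writer.writeheader()
        writer.writerow(asdict(label))
```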
Task #4: Automation and Efficiency
Automate tasks such as image/species validation, duplication checking, and metadata validation using scripts or other data management tools (a duplicate-check sketch follows this list).
Create scripts and protocols that the Data Liaisons can use to perform day-to-day inspections.
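A minimal sketch of one of the automations above (duplicate checking) based on file hashes; the root path, the extension, and the choice of MD5 are assumptions.

```python
# Group files by content hash and report any that appear more than once.
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*.jpg"):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        by_hash[digest].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```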
Task #5: Processing Images and Continuous Monitoring
Skills needed:
Hours to complete: Ongoing
Task #5-1: Image Processing Using Segmentation Pipeline
Processing Implementation: Utilize the developed segmentation pipeline to process all accumulated and incoming images.
Debugging and Troubleshooting: Actively debug and troubleshoot any issues that arise during the image processing phase. This involves coordinating with the dev team to identify and resolve technical glitches or inaccuracies in the segmentation process.
Validation of Results: Conduct thorough validation of the processed images to ensure that the results meet the established quality and accuracy standards. This may involve manually inspecting a random subsample of images, labels, and other results.
Task #5-2: Continuous Monitoring
Monitoring Data Uploads:
Monitor data uploads to ensure proper submission of all data products and quickly identify any issues.
Data Storage and Backup:
Implement and maintain robust data storage and backup protocols.
Process Review and Updates:
Regularly review and update processing protocols to accommodate growing data volumes and evolving repository requirements.
Coordination and Communication:
Coordinate with liaisons to understand data submission schedules and unique characteristics of data from the three teams.
Communicate feedback regarding any inconsistencies or issues that need to be addressed.
Reporting on Data Status:
Regularly report on the status, volume, and quality of incoming data, ensuring transparency and informed decision-making (a minimal report sketch follows).
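A minimal sketch of a scripted status report, assuming the metadata table carries capture_date and location columns (both names are assumptions):

```python
# Weekly image counts per site, pulled from the metadata table.
import pandas as pd

def weekly_status(path="field_metadata.csv"):
    df = pd.read_csv(path, parse_dates=["capture_date"])
    report = (df.assign(week=df["capture_date"].dt.to_period("W"))
                .groupby(["location", "week"])
                .size()
                .rename("images")
                .reset_index())
    print(report.to_string(index=False))
    return report
```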