...
Skills needed: See task #2
Labor estimation:
Putting data through the pipeline and reviewing results: 15 min/batch
Moving final images to local storage and Azure and verifying the upload: 10 min/batch
3.75 h/week total
Field
Task #1: Data Exploration
Skills needed: basic command line and python skills, attention to detail, data science background is a plus.
Hours required to complete this task:
Task #1-1: Assessing Volume and Storage
Volume Assessment: Assess the total volume of the dataset in terms of the number of images and the size of the metadata.
Storage and Accessibility Evaluation: Review current data storage solutions, assess accessibility for processing and analysis, and identify any data retrieval or integrity issues.
JPG vs. Raw: Record how many images are stored as JPGs versus raw files and the storage each format consumes (see the counting sketch below).
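A minimal sketch for the volume assessment above, assuming the images sit under a single local root directory (the path and the set of raw extensions are placeholders to adjust):

```python
# Count image files by extension and sum their sizes to support the volume
# assessment and the JPG vs. raw breakdown. IMAGE_ROOT and RAW_EXTENSIONS are
# assumptions to replace with the project's real storage location and camera formats.
from pathlib import Path
from collections import Counter

IMAGE_ROOT = Path("images")                         # hypothetical path
RAW_EXTENSIONS = {".arw", ".cr2", ".nef", ".dng"}   # assumed raw formats

counts, bytes_by_ext = Counter(), Counter()
for path in IMAGE_ROOT.rglob("*"):
    if path.is_file():
        ext = path.suffix.lower()
        counts[ext] += 1
        bytes_by_ext[ext] += path.stat().st_size

for ext, n in counts.most_common():
    kind = "raw" if ext in RAW_EXTENSIONS else "jpg/other"
    print(f"{ext:6s} {kind:10s} {n:8d} files  {bytes_by_ext[ext] / 1e9:6.2f} GB")
```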
Task #1-2: Data Organization and Inspection
Data Organization Review: Analyze current data organization methods, such as naming conventions, and identify areas for improvement.
Metadata Table Inspection: Examine the metadata table’s structure and content, focusing on key columns and the type of information they contain, like species and location, identifying gaps or potential enhancements.
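A quick inspection sketch for the metadata table, assuming it can be loaded as a CSV; the filename and the "species" and "location" column names are assumptions:

```python
# Summarize column types, missing values, and unique counts for key columns.
import pandas as pd

meta = pd.read_csv("metadata.csv")   # hypothetical filename

print(meta.shape)                                   # rows, columns
print(meta.dtypes)                                  # column types
print(meta.isna().sum().sort_values(ascending=False))  # missing values per column

for col in ("species", "location"):                 # assumed column names
    if col in meta.columns:
        print(col, "->", meta[col].nunique(), "unique values")
        print(meta[col].value_counts().head(10))
```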
Task #1-3: Quality Assessment and Anomaly Detection
Visual Inspection of Images: Conduct a preliminary visual inspection of a representative sample of images covering various data sources and plant types to evaluate quality factors like resolution and clarity. Identify any missing image types.
Metadata-Image Correlation: Ensure accurate linking between images and their metadata, checking for discrepancies or missing connections.
Species Validation: Confirm that the plant species shown in each image matches the species recorded in the metadata.
Anomaly Detection: Spot anomalies or outliers in image sizes, formats, or metadata entries.
Reporting Findings: Document findings, highlight issues, and make recommendations for data cleaning and processing improvements.
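A sketch of the metadata-image correlation and size-anomaly checks above, assuming the metadata has an "image_name" column (hypothetical) that matches filenames on disk:

```python
# Cross-check files on disk against metadata rows, then flag file-size outliers
# (e.g. truncated or corrupt files) with a simple z-score cutoff.
from pathlib import Path
import pandas as pd

IMAGE_ROOT = Path("images")            # hypothetical path
meta = pd.read_csv("metadata.csv")     # hypothetical filename

on_disk = {p.name for p in IMAGE_ROOT.rglob("*.jpg")}
in_meta = set(meta["image_name"])      # assumed column name

print("images without metadata:", sorted(on_disk - in_meta)[:20])
print("metadata rows without an image:", sorted(in_meta - on_disk)[:20])

sizes = pd.Series({p.name: p.stat().st_size for p in IMAGE_ROOT.rglob("*.jpg")})
z = (sizes - sizes.mean()) / sizes.std()
print("size outliers:", list(sizes[z.abs() > 3].index)[:20])
```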
Task #1-4: Preliminary Dataset Development - Image Quality Dataset
Creation of the Dataset: During the visual inspection of images, create a preliminary "Image Quality Dataset." This dataset will involve labeling a subsample of images based on their quality.
Labeling Criteria: Labels should be assigned on a per-image level, focusing on various aspects of image quality such as exposure, brightness, focus, saturation, and other relevant factors.
Purpose of the Dataset: The primary goal is to develop a dataset that can be used to train an "image quality" classifier. This classifier will serve as an automated tool to assist in future image quality inspections and streamline the process.
Data Documentation and Preparation: Document the criteria used for labeling and prepare the dataset for integration into the ongoing data inspection pipeline.
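One possible label schema and a small helper for recording per-image labels during visual inspection; the column names, label values, and output filename are assumptions, not a fixed specification:

```python
# Append per-image quality labels to a CSV that can later train the quality classifier.
import csv
import os

FIELDS = ["image_name", "exposure", "brightness", "focus", "saturation", "overall_quality", "notes"]

def record_label(row: dict, path: str = "image_quality_labels.csv") -> None:
    """Append one label row, writing the header if the file is new or empty."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

record_label({"image_name": "plot12_0001.jpg", "exposure": "over", "brightness": "ok",
              "focus": "sharp", "saturation": "ok", "overall_quality": "good", "notes": ""})
```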
Task #1-5: Data Cleaning and Preprocessing
Begin by correcting metadata errors, standardizing formats, and filtering out unusable images.
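A minimal cleaning sketch based on the kinds of issues surfaced in Task #1; the column names, the species-fix mapping, and the reliance on the Task #1-4 quality labels are illustrative assumptions:

```python
# Standardize formats, apply explicit metadata corrections, and drop unusable images.
import pandas as pd

meta = pd.read_csv("metadata.csv")   # hypothetical filename

# Standardize formats: strip whitespace, lower-case species names, parse dates.
meta["species"] = meta["species"].str.strip().str.lower()
meta["capture_date"] = pd.to_datetime(meta["capture_date"], errors="coerce")

# Correct known metadata errors with an explicit mapping built during inspection.
species_fixes = {"unknown sp.": "unknown"}   # example entry only
meta["species"] = meta["species"].replace(species_fixes)

# Filter out images labeled unusable in the Image Quality Dataset (Task #1-4).
quality = pd.read_csv("image_quality_labels.csv")   # assumed output of Task #1-4
unusable = set(quality.loc[quality["overall_quality"] == "unusable", "image_name"])
meta = meta[~meta["image_name"].isin(unusable)].copy()

meta.to_csv("metadata_clean.csv", index=False)
```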
Task #2: Pipeline Development
Skills needed:
Hours to complete: 3-4 weeks
Task #2-1: Defining Data Product Goals
Clearly articulate desired outputs from the image processing pipeline, such as cutouts, segmentation masks, and bounding box labels. Specify the details of each data product.
Establish a system for managing cutouts or sub-images, focusing on naming conventions, storage structure, and indexing.
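One possible cutout naming and storage convention, shown only as a sketch; the fields (batch id, source-image stem, cutout index) and the per-batch folder layout are assumptions to confirm with the team:

```python
# Encode batch, source image, and cutout index in the filename so cutouts can be
# traced back to their source image and indexed or synced one batch at a time.
from pathlib import Path

def cutout_name(batch_id: str, image_stem: str, cutout_index: int) -> str:
    """e.g. NC_2023-07-14, plot12_0001, 3 -> NC_2023-07-14__plot12_0001__003.png"""
    return f"{batch_id}__{image_stem}__{cutout_index:03d}.png"

def cutout_path(root: Path, batch_id: str, image_stem: str, cutout_index: int) -> Path:
    """Store cutouts under <root>/<batch_id>/ so each batch is a self-contained unit."""
    return root / batch_id / cutout_name(batch_id, image_stem, cutout_index)

print(cutout_path(Path("cutouts"), "NC_2023-07-14", "plot12_0001", 3))
```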
Task #2-2: Pipeline Development Planning
Plan the development of a segmentation pipeline, incorporating insights from previous approaches like semifield, SAM, or in-house models.
Identify technical requirements and necessary tools for the pipeline.
Plan for an iterative development process, allowing for continuous improvements and adjustments based on testing and feedback.
Develop a realistic timeline with key milestones for stages like initial development, testing, and deployment.
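To make the SAM option concrete, a sketch of what a SAM-based segmentation stage could look like, assuming Meta's segment-anything package and a downloaded ViT-H checkpoint; this is an illustration for planning, not the final pipeline design:

```python
# Generate candidate masks for one image with SAM's automatic mask generator.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("plot12_0001.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical file
masks = mask_generator.generate(image)   # list of dicts with "segmentation", "bbox", "area", ...
print(len(masks), "candidate masks; largest area:", max(m["area"] for m in masks))
```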
Task #2-3: Automation and Efficiency
Automate tasks such as image/species validation, duplication checking, and metadata validation using scripts or other data management tools.
Create scripts and protocols that the Data Liaison can use to perform day-to-day inspections (a duplication-check example follows below).
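A minimal duplication-check sketch of the kind the Data Liaison could run day to day; it finds exact duplicates via file hashes only (near-duplicate detection would need a different approach), and the image root is a placeholder:

```python
# Group files by content hash and report any hash that maps to more than one file.
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(root: Path) -> dict:
    by_hash = defaultdict(list)
    for path in root.rglob("*.jpg"):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        by_hash[digest].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

for digest, paths in find_exact_duplicates(Path("images")).items():
    print(digest, [p.name for p in paths])
```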
Task #3: Processing Images and Continuous Monitoring
Skills needed:
Hours to complete: Ongoing
Task #3-1: Image Processing Using Segmentation Pipeline
Processing Implementation: Utilize the developed segmentation pipeline to process all accumulated and incoming images.
Debugging and Troubleshooting: Actively debug and troubleshoot any issues that arise during the image processing phase. This involves coordinating with the dev team to identify and resolve technical glitches or inaccuracies in the segmentation process.
Validation of Results: Conduct thorough validation of the processed images to ensure that the results meet the established quality and accuracy standards. This may involve manually inspecting a random subsample of images, labels, and other results.
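A sketch of pulling a random subsample of pipeline outputs for manual review; the directory layout, the "_mask.png" suffix, and the roughly 2% sampling rate are assumptions to tune per batch:

```python
# Copy a random subsample of segmentation masks into a review folder for manual inspection.
import random
import shutil
from pathlib import Path

PROCESSED = Path("processed")    # hypothetical pipeline output directory
REVIEW = Path("to_review")
REVIEW.mkdir(exist_ok=True)

masks = sorted(PROCESSED.rglob("*_mask.png"))
k = min(len(masks), max(1, len(masks) // 50))   # ~2% of each batch
sample = random.sample(masks, k)
for mask in sample:
    shutil.copy(mask, REVIEW / mask.name)
print(f"copied {len(sample)} of {len(masks)} masks for manual review")
```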
Task #3-2: Continuous Monitoring of Data Uploads
Monitor data uploads to ensure proper submission of all data products and quickly identify any issues.
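A sketch of an upload check, assuming the azure-storage-blob package, a connection string in an environment variable, and a hypothetical container name; it compares local filenames for a batch against blobs in the container:

```python
# Report any locally produced files that have not appeared in the Azure container.
import os
from pathlib import Path
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"],   # assumed to be set
    container_name="field-images",                   # hypothetical container name
)

local = {p.name for p in Path("processed").rglob("*.png")}      # hypothetical local output
uploaded = {Path(b.name).name for b in container.list_blobs()}

missing = sorted(local - uploaded)
if missing:
    print("missing from Azure:", missing[:20])
else:
    print("all", len(local), "local files are present in the container")
```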