...
Skills needed: See task #2
Labor estimation:
Putting data through the pipeline and reviewing results: 15 min/batch
Moving final images to local storage and Azure and verifying the upload: 10 min/batch
3.75 h/week total
Field
Task #1: Data Exploration
Skills needed: basic command line and python skills, attention to detail, data science background is a plus.
Hours required to complete this task:
Task #1-1: Assessing Volume and Storage
Volume Assessment: Assess the total volume of the dataset in terms of the number of images and the size of the metadata.
Storage and Accessibility Evaluation: Review current data storage solutions, assess accessibility for processing and analysis, and identify any data retrieval or integrity issues.
JPG vs. Raw: Record how many images are stored as JPGs versus raw files and the storage each format consumes (see the counting sketch below).
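A minimal sketch for the volume assessment above, assuming the images sit under a single local root directory (the path and the set of raw extensions are placeholders to adjust):

```python
# Count image files by extension and sum their sizes to support the volume
# assessment and the JPG vs. raw breakdown. IMAGE_ROOT and RAW_EXTENSIONS are
# assumptions to replace with the project's real storage location and camera formats.
from pathlib import Path
from collections import Counter

IMAGE_ROOT = Path("images")                         # hypothetical path
RAW_EXTENSIONS = {".arw", ".cr2", ".nef", ".dng"}   # assumed raw formats

counts, bytes_by_ext = Counter(), Counter()
for path in IMAGE_ROOT.rglob("*"):
    if path.is_file():
        ext = path.suffix.lower()
        counts[ext] += 1
        bytes_by_ext[ext] += path.stat().st_size

for ext, n in counts.most_common():
    kind = "raw" if ext in RAW_EXTENSIONS else "jpg/other"
    print(f"{ext:6s} {kind:10s} {n:8d} files  {bytes_by_ext[ext] / 1e9:6.2f} GB")
```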
Task #1-2: Data Organization and Inspection
Data Organization Review: Analyze current data organization methods, such as naming conventions, and identify areas for improvement.
Metadata Table Inspection: Examine the metadata table’s structure and content, focusing on key columns and the type of information they contain, like species and location, identifying gaps or potential enhancements.
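A quick inspection sketch for the metadata table, assuming it can be loaded as a CSV; the filename and the "species" and "location" column names are assumptions:

```python
# Summarize column types, missing values, and unique counts for key columns.
import pandas as pd

meta = pd.read_csv("metadata.csv")   # hypothetical filename

print(meta.shape)                                   # rows, columns
print(meta.dtypes)                                  # column types
print(meta.isna().sum().sort_values(ascending=False))  # missing values per column

for col in ("species", "location"):                 # assumed column names
    if col in meta.columns:
        print(col, "->", meta[col].nunique(), "unique values")
        print(meta[col].value_counts().head(10))
```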
Task #1-3: Quality Assessment and Anomaly Detection
Visual Inspection of Images: Conduct a preliminary visual inspection of a representative sample of images covering various data sources and plant types to evaluate quality factors like resolution and clarity. Identify any missing image types.
Metadata-Image Correlation: Ensure accurate linking between images and their metadata, checking for discrepancies or missing connections.
Species Validation: Confirm that the plant species shown in each image matches the species recorded in the metadata.
Anomaly Detection: Spot anomalies or outliers in image sizes, formats, or metadata entries.
Reporting Findings: Document findings, highlight issues, and make recommendations for data cleaning and processing improvements.
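A sketch of the metadata-image correlation and size-anomaly checks above, assuming the metadata has an "image_name" column (hypothetical) that matches filenames on disk:

```python
# Cross-check files on disk against metadata rows, then flag file-size outliers
# (e.g. truncated or corrupt files) with a simple z-score cutoff.
from pathlib import Path
import pandas as pd

IMAGE_ROOT = Path("images")            # hypothetical path
meta = pd.read_csv("metadata.csv")     # hypothetical filename

on_disk = {p.name for p in IMAGE_ROOT.rglob("*.jpg")}
in_meta = set(meta["image_name"])      # assumed column name

print("images without metadata:", sorted(on_disk - in_meta)[:20])
print("metadata rows without an image:", sorted(in_meta - on_disk)[:20])

sizes = pd.Series({p.name: p.stat().st_size for p in IMAGE_ROOT.rglob("*.jpg")})
z = (sizes - sizes.mean()) / sizes.std()
print("size outliers:", list(sizes[z.abs() > 3].index)[:20])
```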
Task #1-4: Preliminary Dataset Development - Image Quality Dataset
Creation of the Dataset: During the visual inspection of images, create a preliminary "Image Quality Dataset." This dataset will involve labeling a subsample of images based on their quality.
Labeling Criteria: Labels should be assigned on a per-image level, focusing on various aspects of image quality such as exposure, brightness, focus, saturation, and other relevant factors.
Purpose of the Dataset: The primary goal is to develop a dataset that can be used to train an "image quality" classifier. This classifier will serve as an automated tool to assist in future image quality inspections and streamline the process.
Data Documentation and Preparation: Document the criteria used for labeling and prepare the dataset for integration into the ongoing data inspection pipeline.
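One possible label schema and a small helper for recording per-image labels during visual inspection; the column names, label values, and output filename are assumptions, not a fixed specification:

```python
# Append per-image quality labels to a CSV that can later train the quality classifier.
import csv
import os

FIELDS = ["image_name", "exposure", "brightness", "focus", "saturation", "overall_quality", "notes"]

def record_label(row: dict, path: str = "image_quality_labels.csv") -> None:
    """Append one label row, writing the header if the file is new or empty."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

record_label({"image_name": "plot12_0001.jpg", "exposure": "over", "brightness": "ok",
              "focus": "sharp", "saturation": "ok", "overall_quality": "good", "notes": ""})
```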
Task #1-5: Data Cleaning and Preprocessing
Begin by correcting metadata errors, standardizing formats, and filtering out unusable images.
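A minimal cleaning sketch based on the kinds of issues surfaced in Task #1; the column names, the species-fix mapping, and the reliance on the Task #1-4 quality labels are illustrative assumptions:

```python
# Standardize formats, apply explicit metadata corrections, and drop unusable images.
import pandas as pd

meta = pd.read_csv("metadata.csv")   # hypothetical filename

# Standardize formats: strip whitespace, lower-case species names, parse dates.
meta["species"] = meta["species"].str.strip().str.lower()
meta["capture_date"] = pd.to_datetime(meta["capture_date"], errors="coerce")

# Correct known metadata errors with an explicit mapping built during inspection.
species_fixes = {"unknown sp.": "unknown"}   # example entry only
meta["species"] = meta["species"].replace(species_fixes)

# Filter out images labeled unusable in the Image Quality Dataset (Task #1-4).
quality = pd.read_csv("image_quality_labels.csv")   # assumed output of Task #1-4
unusable = set(quality.loc[quality["overall_quality"] == "unusable", "image_name"])
meta = meta[~meta["image_name"].isin(unusable)].copy()

meta.to_csv("metadata_clean.csv", index=False)
```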
Task #2: Pipeline Development
Skills needed:
Hours to complete: 3-4 weeks
Task #2-1: Defining Data Product Goals
Clearly articulate desired outputs from the image processing pipeline, such as cutouts, segmentation masks, and bounding box labels. Specify the details of each data product.
Establish a system for managing cutouts or sub-images, focusing on naming conventions, storage structure, and indexing.
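One possible cutout naming and storage convention, shown only as a sketch; the fields (batch id, source-image stem, cutout index) and the per-batch folder layout are assumptions to confirm with the team:

```python
# Encode batch, source image, and cutout index in the filename so cutouts can be
# traced back to their source image and indexed or synced one batch at a time.
from pathlib import Path

def cutout_name(batch_id: str, image_stem: str, cutout_index: int) -> str:
    """e.g. NC_2023-07-14, plot12_0001, 3 -> NC_2023-07-14__plot12_0001__003.png"""
    return f"{batch_id}__{image_stem}__{cutout_index:03d}.png"

def cutout_path(root: Path, batch_id: str, image_stem: str, cutout_index: int) -> Path:
    """Store cutouts under <root>/<batch_id>/ so each batch is a self-contained unit."""
    return root / batch_id / cutout_name(batch_id, image_stem, cutout_index)

print(cutout_path(Path("cutouts"), "NC_2023-07-14", "plot12_0001", 3))
```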
Task #2-2: Pipeline Development Planning
Plan the development of a segmentation pipeline, incorporating insights from previous approaches like semifield, SAM, or in-house models.
Identify technical requirements and necessary tools for the pipeline.
Plan for an iterative development process, allowing for continuous improvements and adjustments based on testing and feedback.
Develop a realistic timeline with key milestones for stages like initial development, testing, and deployment.
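To make the SAM option concrete, a sketch of what a SAM-based segmentation stage could look like, assuming Meta's segment-anything package and a downloaded ViT-H checkpoint; this is an illustration for planning, not the final pipeline design:

```python
# Generate candidate masks for one image with SAM's automatic mask generator.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("plot12_0001.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical file
masks = mask_generator.generate(image)   # list of dicts with "segmentation", "bbox", "area", ...
print(len(masks), "candidate masks; largest area:", max(m["area"] for m in masks))
```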
Task #2-3: Automation and Efficiency
Automate tasks such as image/species validation, duplication checking, and metadata validation using scripts or other data management tools.
Create scripts and protocols that the Data Liaison can use to perform day-to-day inspections (a duplication-check example follows below).
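A minimal duplication-check sketch of the kind the Data Liaison could run day to day; it finds exact duplicates via file hashes only (near-duplicate detection would need a different approach), and the image root is a placeholder:

```python
# Group files by content hash and report any hash that maps to more than one file.
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(root: Path) -> dict:
    by_hash = defaultdict(list)
    for path in root.rglob("*.jpg"):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        by_hash[digest].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

for digest, paths in find_exact_duplicates(Path("images")).items():
    print(digest, [p.name for p in paths])
```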
Task #3: Processing Images and Continuous Monitoring
Skills needed:
Hours to complete: Ongoing
Task #3-1: Image Processing Using Segmentation Pipeline
Processing Implementation: Utilize the developed segmentation pipeline to process all accumulated and incoming images.
Debugging and Troubleshooting: Actively debug and troubleshoot any issues that arise during the image processing phase. This involves coordinating with the dev team to identify and resolve technical glitches or inaccuracies in the segmentation process.
Validation of Results: Conduct thorough validation of the processed images to ensure that the results meet the established quality and accuracy standards. This may involve manually inspecting a random subsample of images, labels, and other results.
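A sketch of pulling a random subsample of pipeline outputs for manual review; the directory layout, the "_mask.png" suffix, and the roughly 2% sampling rate are assumptions to tune per batch:

```python
# Copy a random subsample of segmentation masks into a review folder for manual inspection.
import random
import shutil
from pathlib import Path

PROCESSED = Path("processed")    # hypothetical pipeline output directory
REVIEW = Path("to_review")
REVIEW.mkdir(exist_ok=True)

masks = sorted(PROCESSED.rglob("*_mask.png"))
k = min(len(masks), max(1, len(masks) // 50))   # ~2% of each batch
sample = random.sample(masks, k)
for mask in sample:
    shutil.copy(mask, REVIEW / mask.name)
print(f"copied {len(sample)} of {len(masks)} masks for manual review")
```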
Task #3-2: Continuous Monitoring of Data Uploads
Monitor data uploads to ensure proper submission of all data products and quickly identify any issues.
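A sketch of an upload check, assuming the azure-storage-blob package, a connection string in an environment variable, and a hypothetical container name; it compares local filenames for a batch against blobs in the container:

```python
# Report any locally produced files that have not appeared in the Azure container.
import os
from pathlib import Path
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"],   # assumed to be set
    container_name="field-images",                   # hypothetical container name
)

local = {p.name for p in Path("processed").rglob("*.png")}      # hypothetical local output
uploaded = {Path(b.name).name for b in container.list_blobs()}

missing = sorted(local - uploaded)
if missing:
    print("missing from Azure:", missing[:20])
else:
    print("all", len(local), "local files are present in the container")
```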