...

  • Skills needed: See task #2

  • Labor estimation:

    • Putting data through the pipeline and reviewing results: 15 min/batch

    • Moving final images to local storage and Azure, and checking uploads: 10 min/batch

    • Total: 3.75 h/week (about 9 batches/week at ~25 min/batch)

Field

Task #1: Data Exploration

Skills needed: basic command-line and Python skills, attention to detail; a data science background is a plus.

Hours required to complete this task:

Task #1-1: Assessing Volume and Storage

  • Volume Assessment: Assess the total volume of the dataset in terms of the number of images and the size of the metadata.

  • Storage and Accessibility Evaluation: Review current data storage solutions, assess accessibility for processing and analysis, and identify any data retrieval or integrity issues.

  • JPGs vs. raw files: assess the mix of formats present and how each should be handled (see the sketch below).
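
A minimal sketch of how the volume assessment could be scripted, assuming the images sit under a single root directory (the path below is a placeholder, not the actual storage location):

```python
from collections import Counter
from pathlib import Path

IMAGE_ROOT = Path("/mnt/field_images")  # placeholder; point at the actual image store

counts, total_bytes = Counter(), Counter()
for path in IMAGE_ROOT.rglob("*"):
    if path.is_file():
        ext = path.suffix.lower()
        counts[ext] += 1
        total_bytes[ext] += path.stat().st_size

# Per-extension breakdown answers the "JPGs vs raw" question as well as total volume.
for ext, n in counts.most_common():
    print(f"{ext or '(no extension)'}: {n} files, {total_bytes[ext] / 1e9:.2f} GB")
```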

Task #1-2: Data Organization and Inspection

  • Metadata Table Inspection: Examine the metadata table’s structure and content, focusing on key columns and the type of information they contain, like species and location, identifying gaps or potential enhancements.

  • Metadata Cleaning: Correct metadata errors, standardize formats, and filter out unusable images.

  • Data Reporting: Produce reports on key metrics (e.g., image counts, metadata gaps, per-species coverage).

  • Data Organization Review: Analyze current data organization methods, such as naming conventions, and identify areas for improvement.

  • Batch Generation: Group images into batches for downstream processing (see the sketch below).
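
A rough pandas sketch of the metadata inspection and batch generation steps; the file name, exact column names, and batch size are assumptions rather than confirmed details of the actual table:

```python
import pandas as pd

meta = pd.read_csv("metadata.csv")  # placeholder file name

# Structure and content of the metadata table
print(meta.dtypes)
print(meta[["species", "location"]].isna().sum())  # gaps in key columns
print(meta["species"].value_counts().head(20))     # spot non-standard entries

# Simple sequential batching for downstream processing
BATCH_SIZE = 500                                   # assumed batch size
meta["batch_id"] = meta.index // BATCH_SIZE
meta.to_csv("metadata_with_batches.csv", index=False)
```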

Task #1-3: Quality Assessment and Anomaly Detection

  • Visual Inspection of Images: Conduct a preliminary visual inspection of a representative sample of images covering various data sources and plant types to evaluate quality factors like resolution and clarity. Identify any missing image types.

  • Metadata-Image Correlation: Ensure accurate linking between images and their metadata, checking for discrepancies or missing connections (a cross-check like the one sketched below can surface these).

  • Species Validation: Confirm that the plant species shown in each image matches the metadata.

  • Anomaly Detection: Spot anomalies or outliers in image sizes, formats, or metadata entries.

  • Reporting Findings: Document findings, highlight issues, and make recommendations for data cleaning and processing improvements.
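
One way the metadata-image cross-check and basic anomaly screening could look; the root path, file extension, and "image_name" column are assumptions:

```python
from pathlib import Path
import pandas as pd

IMAGE_ROOT = Path("/mnt/field_images")   # placeholder path
meta = pd.read_csv("metadata.csv")       # assumes an "image_name" column

sizes = pd.Series({p.name: p.stat().st_size for p in IMAGE_ROOT.rglob("*.jpg")})
on_disk, listed = set(sizes.index), set(meta["image_name"])

print("Listed in metadata but missing on disk:", sorted(listed - on_disk)[:10])
print("On disk but absent from metadata:", sorted(on_disk - listed)[:10])

# Flag unusually small or large files (e.g., truncated uploads) as candidate anomalies.
low, high = sizes.quantile(0.01), sizes.quantile(0.99)
print("Size outliers:", sizes[(sizes < low) | (sizes > high)].index.tolist()[:10])
```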

Task #2: Defining Data Product Goals

  • Clearly articulate desired outputs from the image processing pipeline, such as cutouts, segmentation masks, and bounding box labels. Specify the details of each data product.

  • Establish a system for managing cutouts or sub-images, focusing on naming conventions, storage structure, and indexing (one possible naming convention is sketched below).
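
The naming convention itself is a decision to be made in this task; the sketch below is only one possible scheme, with invented field names and codes, to make the discussion concrete:

```python
from dataclasses import dataclass

@dataclass
class Cutout:
    parent_image: str   # stem of the source image the cutout came from
    index: int          # running index of the cutout within that image
    species: str        # species code carried over from the metadata

    def filename(self) -> str:
        # One possible convention: <parent>_<index>_<species>.png
        return f"{self.parent_image}_{self.index:04d}_{self.species}.png"

# Example with made-up values:
print(Cutout("plot12_cam03_20240415", 7, "AMAPA").filename())
# plot12_cam03_20240415_0007_AMAPA.png
```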

Task #3: Image Processing

Skills needed:

Time to complete: 3-4 weeks

Task #3-1: Preprocessing

Skills needed:

Time to complete: backlog, then ongoing

Potential candidates: Courtney, Jordan, Zack

Image inspection, color correction, and extraction of image metadata (this requires downloading the images).
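
A small sketch of what the metadata extraction and a crude color correction could look like, assuming Pillow and NumPy are available; the real color-correction step may instead rely on a color checker or calibrated profiles, and the file name is a placeholder:

```python
import numpy as np
from PIL import ExifTags, Image

def read_exif(path: str) -> dict:
    """Pull capture metadata (camera, exposure time, timestamp, ...) from an image file."""
    exif = Image.open(path).getexif()
    return {ExifTags.TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

def gray_world_balance(path: str) -> np.ndarray:
    """Very rough gray-world white balance; a stand-in for the real correction step."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
    rgb *= rgb.mean() / rgb.mean(axis=(0, 1))   # scale each channel toward the global mean
    return np.clip(rgb, 0, 255).astype(np.uint8)

print(read_exif("example.jpg"))                 # placeholder file name
```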

Task #3-2: Pipeline Development Planning

  • Plan the development of a segmentation pipeline, incorporating insights from previous approaches like semifield, SAM, or in-house models.

  • Identify technical requirements and necessary tools for the pipeline (a minimal stage-interface sketch follows this list).

  • Plan for an iterative development process, allowing for continuous improvements and adjustments based on testing and feedback.

  • Develop a realistic timeline with key milestones for stages like initial development, testing, and deployment.
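
To help pin down technical requirements, the pipeline could be framed as a chain of small, swappable stages so that different segmentation back ends (semifield-style, SAM, or in-house models) can be dropped in during iterative development. The stage names below are illustrative only, not a committed design:

```python
from typing import Any, Callable, Dict, List

# Each stage takes and returns a shared context (paths, masks, cutout records).
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]

def run_pipeline(context: Dict[str, Any], stages: List[Stage]) -> Dict[str, Any]:
    for stage in stages:
        context = stage(context)
    return context

# Placeholder stages to be replaced as the real pipeline is developed:
def preprocess(ctx): ctx["preprocessed"] = True; return ctx
def segment(ctx): ctx["masks"] = []; return ctx
def make_cutouts(ctx): ctx["cutouts"] = []; return ctx

result = run_pipeline({"image_path": "example.jpg"}, [preprocess, segment, make_cutouts])
```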

Task #3-3: Preliminary Dataset Development - Image Quality Dataset

  • Creation of the Dataset: During the visual inspection of images, create a preliminary "Image Quality Dataset." This dataset will involve labeling a subsample of images based on their quality.

  • Labeling Criteria: Labels should be assigned at the per-image level, covering aspects of image quality such as exposure, brightness, focus, saturation, and other relevant factors (a possible label schema is sketched below).

  • Purpose of the Dataset: The primary goal is to develop a dataset that can be used to train an "image quality" classifier. This classifier will serve as an automated tool to assist in future image quality inspections and streamline the process.

  • Data Documentation and Preparation: Document the criteria used for labeling and prepare the dataset for integration into the ongoing data inspection pipeline.
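
A possible shape for the label records, using the quality factors listed above; the field names and allowed values are assumptions to be settled when the labeling criteria are documented:

```python
import csv

FIELDS = ["image_name", "exposure", "brightness", "focus", "saturation", "usable"]

rows = [
    # Made-up example record; one row per inspected image.
    {"image_name": "example_0001.jpg", "exposure": "over", "brightness": "ok",
     "focus": "sharp", "saturation": "ok", "usable": True},
]

with open("image_quality_labels.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```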

Task #4: Automation and Efficiency

  • Automate tasks such as image/species validation, duplication checking, and metadata validation using scripts or other data management tools (for example, hash-based duplicate detection as sketched below).

  • Create scripts and protocols that the Data Liaison can use to perform day-to-day inspections.
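
As one example of the kind of script involved, duplicate images can be flagged by hashing file contents; the root path and file extension below are placeholders:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

IMAGE_ROOT = Path("/mnt/field_images")   # placeholder path

def file_hash(path: Path) -> str:
    """Hash file contents in chunks so large raw files are not loaded into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

groups = defaultdict(list)
for p in IMAGE_ROOT.rglob("*.jpg"):
    groups[file_hash(p)].append(p)

duplicates = {h: paths for h, paths in groups.items() if len(paths) > 1}
print(f"{len(duplicates)} groups of byte-identical images found")
```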

Task #5: Processing Images and Continuous Monitoring

Skills needed:

Time to complete: Ongoing

Task #5-1: Image Processing Using the Segmentation Pipeline

  • Processing Implementation: Utilize the developed segmentation pipeline to process all accumulated and incoming images.

  • Debugging and Troubleshooting: Actively debug and troubleshoot any issues that arise during the image processing phase. This involves coordinating with the dev team to identify and resolve technical glitches or inaccuracies in the segmentation process.

  • Validation of Results: Conduct thorough validation of the processed images to ensure that the results meet the established quality and accuracy standards. This may involve manually inspecting a random subsample of images, labels, and other results (see the sketch below).
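
A simple way to draw the manual-review subsample, assuming processed outputs land in a local directory; the paths, file extension, and sample size are placeholders:

```python
import random
import shutil
from pathlib import Path

PROCESSED = Path("processed/cutouts")    # placeholder output directory
REVIEW = Path("review_sample")
REVIEW.mkdir(exist_ok=True)

results = sorted(PROCESSED.rglob("*.png"))
random.seed(42)                          # fixed seed so the sample can be reproduced in the review log
for p in random.sample(results, k=min(100, len(results))):
    shutil.copy(p, REVIEW / p.name)
```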

Task #5-2: Continuous Monitoring and Data Management

  1. Monitoring Data Uploads:

  • Monitor data uploads to ensure proper submission of all data products and quickly identify any issues (a minimal upload check is sketched after this list).

  2. Data Storage and Backup:

  • Implement and maintain robust data storage and backup protocols.

  3. Process Review and Updates:

  • Regularly review and update processing protocols to accommodate growing data volumes and evolving repository requirements.

  4. Coordination and Communication:

  • Coordinate with liaisons to understand data submission schedules and unique characteristics of data from the three teams.

  • Communicate feedback regarding any inconsistencies or issues that need to be addressed.

  5. Reporting on Data Status:

  • Regularly report on the status, volume, and quality of incoming data, ensuring transparency and informed decision-making.
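
A minimal sketch of an upload check against Azure Blob Storage, assuming the azure-storage-blob (v12) SDK and a connection string in the environment; the container name and local staging directory are placeholders:

```python
import os
from pathlib import Path

from azure.storage.blob import BlobServiceClient   # assumes azure-storage-blob v12

CONTAINER = "field-uploads"                         # placeholder container name
LOCAL_ROOT = Path("outgoing/current_batch")         # placeholder staging directory

service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client(CONTAINER)

expected = {p.name for p in LOCAL_ROOT.rglob("*") if p.is_file()}
uploaded = {Path(blob.name).name for blob in container.list_blobs()}

missing = sorted(expected - uploaded)
print(f"{len(missing)} of {len(expected)} expected files not found in Azure:", missing[:10])
```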