Github repo: https://github.com/precision-sustainable-ag/SemiF-DataReporting

Azure blob containers are read using a pre-authorized (SAS) URL, which is set to expire on 2025-06-30.
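
A minimal sketch of reading a container through such a pre-authorized URL, assuming the azure-storage-blob package; the URL below is a placeholder, not the project's actual one:

  from azure.storage.blob import ContainerClient

  # Placeholder SAS URL; the real one is pre-authorized and expires 2025-06-30
  SAS_URL = "https://<account>.blob.core.windows.net/semifield-uploads?<sas-token>"

  container = ContainerClient.from_container_url(SAS_URL)
  for blob in container.list_blobs():
      # blob.name is the path inside the container, e.g. "<batch>/<image>.jpg"
      print(blob.name, blob.size)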

The Semifield data reporting tool goes through the semifield data in Azure blob containers and in the long-term storage (LTS) locations longterm_images, longterm_images2, and GROW_DATA to gather details about the stored files.

...

All the information gathered is posted to the #semifield-datareports Slack channel as it is generated.
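
One common way to post such messages is a Slack incoming webhook; below is a minimal sketch assuming the requests package and a placeholder webhook URL (the tool itself may authenticate differently, e.g. with a bot token):

  import requests

  WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

  def post_to_slack(text: str) -> None:
      # Posts a plain-text message to the channel tied to the webhook
      resp = requests.post(WEBHOOK_URL, json={"text": text}, timeout=30)
      resp.raise_for_status()

  post_to_slack("SemiF data report: 4 batches not processed yet")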

Reports generated

Data report

  • Storage used by each category of data (in TB):

    • uploaded images

    • cutouts

    • developed images

  • Graph depicting the number of processed vs. unprocessed batches, grouped by location (sketched below)
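
A minimal sketch of producing such a graph, assuming pandas and matplotlib; the counts and location names are illustrative, not real figures:

  import pandas as pd
  import matplotlib.pyplot as plt

  counts = pd.DataFrame(
      {"processed": [120, 95, 60], "unprocessed": [10, 25, 5]},
      index=["NC", "MD", "TX"],  # hypothetical locations
  )
  counts.plot(kind="bar")  # one group of bars per location
  plt.ylabel("Number of batches")
  plt.title("Processed vs unprocessed batches by location")
  plt.tight_layout()
  plt.savefig("batches_by_location.png")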

Actionable items report

  • Number of batches not preprocessed yet

  • Number of batches not processed yet

  • Number of batches present in Azure but not in long-term storage

    • semifield-uploads

    • semifield-developed-images

    • semifield-cutouts

  • Number of uploaded and developed images

  • batch_details.csv: Master table showing batch details across LTS and Azure (includes “deduplicated batch”)

  • semif_developed_duplicates_lts.csv: Details about batches present in multiple LTS locations

General distribution stats

  • general_dist_stats.txt

    • developed image stats (total processed images): total count, by common name, by category, by location, by year

    • cutouts stats: total count, by common name, by category, by location, by year

    • primary cutouts stats: total count, by common name, by category, by location, by year

  • area_by_common_name.csv: Counts grouped by common name and primary vs. non-primary cutouts, bucketed by bbox area (a sketch of this derivation follows below)
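
A sketch of how such a table could be derived, assuming a pandas DataFrame of cutouts with common-name, primary-flag, and bbox-size columns; the column names and bucket edges are assumptions, not the tool's exact ones:

  import pandas as pd

  cutouts = pd.DataFrame({
      "common_name": ["palmer amaranth", "palmer amaranth", "common ragweed"],
      "is_primary": [True, False, True],
      "bbox_w": [120, 40, 300],
      "bbox_h": [80, 30, 200],
  })
  cutouts["bbox_area"] = cutouts["bbox_w"] * cutouts["bbox_h"]
  # Bucket areas into illustrative size bands
  cutouts["area_bucket"] = pd.cut(
      cutouts["bbox_area"],
      bins=[0, 5_000, 50_000, float("inf")],
      labels=["small", "medium", "large"],
  )
  table = (
      cutouts.groupby(["common_name", "is_primary", "area_bucket"], observed=True)
      .size()
      .reset_index(name="count")
  )
  table.to_csv("area_by_common_name.csv", index=False)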

Configuration

Code usage

Github repo: https://github.com/precision-sustainable-ag/SemiF-DataReporting

The tool runs automatically via a cron job on the SUNNY server.

The current cron job runs from the jbshah user’s crontab at 9 AM every Monday.

  • list_blob_contents.py:

    • goes through the Azure blob containers

    • categorizes batches as processed, unprocessed, preprocessed, or unpreprocessed

    • adds batch stats

    • creates separate CSVs

  • local_batch_table_generator.py:

    • goes through the LTS locations (a sketch of such a scan follows this list)

    • categorizes batches into different types

    • generates batch stats

    • creates separate CSVs

  • report.py:

    • copies the separate CSVs generated by list_blob_contents.py and local_batch_table_generator.py into a report folder

    • combines the Azure and LTS CSVs for uploads, developed images, and cutouts

    • deduplicates batches that are present in multiple LTS locations (sketched after this list)

    • generates the summary report and actionable-report messages and sends them to Slack

    • queries the agir.db SQLite database (expected in the code folder) to get general distribution stats for developed images, cutouts, and primary cutouts (a query sketch follows this list)

    • generates the area_by_common_name CSV after fetching the results from the database

    • sends the general distribution stats to Slack

  • cronjob.sh:

    • shell script triggered from the crontab; currently points to the jbshah user’s codebase

    • copies the database into the codebase to reduce network latency

    • triggers the pipeline

    • crontab entry: 0 9 * * 1 /home/jbshah/SemiF-DataReporting/cronjob.sh
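
A minimal sketch of the kind of LTS scan local_batch_table_generator.py performs, assuming batch folders sit directly under each storage root; the mount path and batch layout are assumptions:

  import csv
  from pathlib import Path

  LTS_ROOT = Path("/mnt/longterm_images")  # placeholder mount point

  rows = []
  for batch_dir in sorted(p for p in LTS_ROOT.iterdir() if p.is_dir()):
      files = [f for f in batch_dir.rglob("*") if f.is_file()]
      rows.append({
          "batch": batch_dir.name,
          "file_count": len(files),
          "total_bytes": sum(f.stat().st_size for f in files),
      })

  with open("lts_batches.csv", "w", newline="") as fh:
      writer = csv.DictWriter(fh, fieldnames=["batch", "file_count", "total_bytes"])
      writer.writeheader()
      writer.writerows(rows)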
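
A sketch of the deduplication step in report.py: when a batch appears in more than one LTS location, keep one copy according to a preferred order. The column names and the preference order here are assumptions:

  import pandas as pd

  batches = pd.DataFrame({
      "batch": ["MD_2023-06-01", "MD_2023-06-01", "NC_2023-07-10"],
      "lts_location": ["longterm_images", "longterm_images2", "GROW_DATA"],
  })

  # Lower rank wins when a batch is duplicated across locations
  preference = {"longterm_images": 0, "longterm_images2": 1, "GROW_DATA": 2}
  deduped = (
      batches.assign(rank=batches["lts_location"].map(preference))
      .sort_values("rank")
      .drop_duplicates(subset="batch", keep="first")
      .drop(columns="rank")
  )
  # The rows dropped here are the kind of entries reported in
  # semif_developed_duplicates_lts.csv.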
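
A minimal sketch of querying agir.db for distribution stats, using Python's built-in sqlite3 module; the table and column names are hypothetical, since the actual schema is not documented here:

  import sqlite3

  con = sqlite3.connect("agir.db")
  query = """
      SELECT common_name, COUNT(*) AS n
      FROM developed_images      -- hypothetical table name
      GROUP BY common_name
      ORDER BY n DESC
  """
  for common_name, n in con.execute(query):
      print(f"{common_name}: {n}")
  con.close()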