SemiField Data Reporting
The SemiField data reporting tool scans semifield data in the Azure blob containers and in the long-term storage (LTS) locations longterm_images, longterm_images2, and GROW_DATA to collect details about the stored files.
The tool then generates "batch stats" from these file listings and queries the SQLite database to produce "general distribution stats" on cutouts and developed images.
All of the gathered information is posted to the #semifield-datareports Slack channel as it is generated.
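The Slack step can be illustrated with a minimal sketch using slack_sdk; the bot token environment variable name and the message text below are assumptions, not taken from the repository.

```python
# Minimal sketch of posting a report message to the Slack channel.
# SLACK_BOT_TOKEN and the message text are illustrative assumptions.
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
client.chat_postMessage(
    channel="#semifield-datareports",
    text="SemiField data report: batch stats and distribution stats below.",
)
```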
Reports generated
Data report
- Storage used by each category of data (in TB):
  - uploaded images
  - cutouts
  - developed images
- Graph depicting the number of processed vs. unprocessed batches, grouped by location (a plotting sketch follows this list)
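The processed vs. unprocessed graph could be produced roughly as below; this is a sketch that assumes a batch_details.csv with "location" and "is_processed" columns, which may not match the actual column names.

```python
# Sketch of the processed vs. unprocessed batch graph, grouped by location.
# The CSV and column names ("location", "is_processed") are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

batches = pd.read_csv("batch_details.csv")
counts = batches.groupby(["location", "is_processed"]).size().unstack(fill_value=0)
counts.plot(kind="bar")  # one cluster of bars per location
plt.ylabel("Number of batches")
plt.tight_layout()
plt.savefig("processed_vs_unprocessed.png")
```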
Actionable items report
- Number of batches not preprocessed yet
- Number of batches not processed yet
- Number of batches present in Azure but not in long-term storage (see the sketch after this list), for:
  - semifield-uploads
  - semifield-developed-images
  - semifield-cutouts
- Number of uploaded and developed images
- batch_details.csv: master table showing batch details across LTS and Azure (includes "deduplicated batch")
- semif_developed_duplicates_lts.csv: details about batches present in multiple LTS locations
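The "present in Azure but not in long-term storage" check could be implemented roughly as below; the CSV and column names are assumptions for illustration.

```python
# Sketch of finding batches listed in the Azure CSV but missing from the LTS CSV.
# File and column names are illustrative assumptions.
import pandas as pd

azure = pd.read_csv("azure_batches.csv")
lts = pd.read_csv("lts_batches.csv")

missing_from_lts = sorted(set(azure["batch_id"]) - set(lts["batch_id"]))
print(f"{len(missing_from_lts)} batches present in Azure but not in LTS")
```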
General distribution stats
- general_dist_stats.txt:
  - developed image stats (total processed images): total count, by common name, by category, by location, by year
  - cutout stats: total count, by common name, by category, by location, by year
  - primary cutout stats: total count, by common name, by category, by location, by year
- area_by_common_name.csv: counts grouped by common name and by primary vs. non-primary cutouts, bucketed by bbox area (see the query sketch below)
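The queries behind these stats could look roughly like the sketch below; the table and column names (cutouts, common_name, is_primary, bbox_area) are assumptions about the agir.db schema, and the area buckets are made up for illustration.

```python
# Sketch of counting cutouts per common name, split by primary vs. non-primary
# and bucketed by bbox area. Table/column names and bucket edges are assumptions.
import sqlite3
import pandas as pd

con = sqlite3.connect("agir.db")
query = """
SELECT common_name,
       is_primary,
       CASE
           WHEN bbox_area < 10000 THEN 'small'
           WHEN bbox_area < 100000 THEN 'medium'
           ELSE 'large'
       END AS area_bucket,
       COUNT(*) AS n_cutouts
FROM cutouts
GROUP BY common_name, is_primary, area_bucket
"""
pd.read_sql_query(query, con).to_csv("area_by_common_name.csv", index=False)
con.close()
```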
Configuration
- Azure blob containers are read using a pre-authorized URL, which is set to expire on 2025-06-30 (an access sketch follows this list)
- Slack bot configuration: https://crownteamworkspace.slack.com/marketplace/A0831MD2TPZ-semif-datareporting?settings=1&tab=settings
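Assuming the pre-authorized URL is a SAS-style container URL, the containers could be read roughly as below; the environment variable name is an assumption.

```python
# Sketch of listing blobs from a container via a pre-authorized (SAS) URL.
# The AZURE_SAS_URL environment variable is an illustrative assumption.
import os
from azure.storage.blob import ContainerClient

container = ContainerClient.from_container_url(os.environ["AZURE_SAS_URL"])
for blob in container.list_blobs():
    print(blob.name, blob.size)
```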
Code usage
- GitHub repo: https://github.com/precision-sustainable-ag/SemiF-DataReporting
- The tool runs automatically via a cron job on the SUNNY server. The current cron job is in the jbshah user's crontab and runs at 9 am every Monday.
list_blob_contents.py:
- goes through the Azure blob containers
- categorizes batches into processed, unprocessed, preprocessed, and unpreprocessed
- adds batch stats
- creates separate CSVs (a sketch of the batch-stats step follows this list)
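The per-batch stats step could be sketched as below, assuming the first path segment of each blob name is the batch name; the actual naming convention and output columns may differ.

```python
# Sketch of aggregating per-batch file counts and sizes from a container listing.
# Assumes the first path segment of each blob name is the batch name.
import csv
import os
from collections import defaultdict
from azure.storage.blob import ContainerClient

container = ContainerClient.from_container_url(os.environ["AZURE_SAS_URL"])
stats = defaultdict(lambda: {"files": 0, "bytes": 0})
for blob in container.list_blobs():
    batch = blob.name.split("/", 1)[0]
    stats[batch]["files"] += 1
    stats[batch]["bytes"] += blob.size

with open("azure_batch_stats.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["batch", "files", "size_tb"])
    for batch, s in sorted(stats.items()):
        writer.writerow([batch, s["files"], s["bytes"] / 1e12])
```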
local_batch_table_generator.py:
- goes through the LTS locations
- categorizes batches into the different types
- generates batch stats
- creates separate CSVs (a sketch follows this list)
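The LTS scan could look roughly like this sketch; the mount path and CSV layout are assumptions.

```python
# Sketch of walking one LTS location and recording per-batch file counts and sizes.
# The mount path and CSV layout are illustrative assumptions.
import csv
from pathlib import Path

lts_root = Path("/path/to/longterm_images")  # assumed mount point

rows = []
for batch_dir in sorted(p for p in lts_root.iterdir() if p.is_dir()):
    files = [f for f in batch_dir.rglob("*") if f.is_file()]
    rows.append({
        "batch": batch_dir.name,
        "files": len(files),
        "size_tb": sum(f.stat().st_size for f in files) / 1e12,
    })

with open("lts_batch_stats.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["batch", "files", "size_tb"])
    writer.writeheader()
    writer.writerows(rows)
```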
report.py:
- copies the separate CSVs generated by list_blob_contents.py and local_batch_table_generator.py into a report folder
- combines the Azure and LTS CSVs for uploads, developed images, and cutouts
- deduplicates batches that are present in multiple LTS locations (see the sketch after this list)
- generates the summary report and actionable report messages and sends them to Slack
- queries the agir.db SQLite database (expected in the code folder) to get general distribution stats for developed images, cutouts, and primary cutouts
- generates the area_by_common_name CSV from the results fetched from the database
- sends the general distribution stats to Slack
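The LTS deduplication step could be sketched as below, assuming one batch CSV per LTS location and a "batch_id" column; only the semif_developed_duplicates_lts.csv output name comes from the report itself.

```python
# Sketch of deduplicating batches that appear in more than one LTS location.
# One CSV per LTS location and a "batch_id" column are illustrative assumptions.
import pandas as pd

lts = pd.concat(
    [pd.read_csv(p) for p in ("longterm_images.csv", "longterm_images2.csv", "GROW_DATA.csv")],
    ignore_index=True,
)

# Batches present in more than one LTS location are reported separately.
dupes = lts[lts.duplicated("batch_id", keep=False)]
dupes.to_csv("semif_developed_duplicates_lts.csv", index=False)

# Keep a single row per batch for the combined report.
lts_dedup = lts.drop_duplicates(subset="batch_id", keep="first")
lts_dedup.to_csv("lts_batches_deduplicated.csv", index=False)
```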
cronjob.sh:
- cron shell script triggered from the crontab; currently points to the jbshah user's codebase
- copies the database into the codebase to reduce network latency
- triggers the pipeline
- crontab entry: 0 9 * * 1 /home/jbshah/SemiF-DataReporting/cronjob.sh