
Installing Python and Jupyter Notebook

You can download and install Python from https://www.python.org/downloads/

Check Add python.exe to PATH, then Install Now.

To confirm installation, from a command prompt enter:

py --version
python --version

Both commands should show the version that you downloaded.

Jupyter Notebook is a popular tool for running Python scripts, offering the added benefit of displaying plots directly within the interface. You can install it by running:

pip install notebook

You can then run:

jupyter notebook /path/to/folder


Basic conversions

In each comparison below, the R code is shown first, followed by the equivalent Python code.

Arithmetic operators

# R
x + y
x - y
x * y
x / y
x ^ y

# Python
x + y
x - y
x * y
x / y
x ** y

The only difference is that R uses ^ for exponentiation, while Python uses **.

Comparison operators

# R and Python
x == y
x != y
x > y
x < y
x >= y
x <= y

No differences here.

Logical operators


x && y
x || y
!x
v1 & v2
v1 | v2
import numpy as np

x and y
x or y
not x
np.logical_and(v1, v2)
np.logical_or(v1, v2)

Python’s and, or, and not operators don’t work element-wise on vectors.

numpy is a popular library for this purpose, but it’s not as clean as R’s syntax.
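
As a quick illustration with made-up example vectors, numpy's element-wise functions and the & operator work on arrays, while Python's and operator does not:

import numpy as np

v1 = np.array([True, True, False])
v2 = np.array([True, False, False])

print(np.logical_and(v1, v2))   # [ True False False]
print(v1 & v2)                  # & and | also work element-wise on boolean arrays
# print(v1 and v2)              # would raise ValueError: truth value is ambiguous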

Assignment operator

name <- "John Doe"

name = "John Doe"

R does accept the = syntax for assignment. However, it’s non-idiomatic and can lead to unexpected errors.

For example,

my_function <- function(y = 3, x = 5) {
  print(paste(y, x))
}

my_function(x = 10)   # Passes 10 to the x parameter, and outputs "3 10"

my_function(x <- 20)  # Assigns 20 to the global x variable, which it passes to the y parameter. Outputs "20 5"

To avoid confusion, always use <- for assignment, and use = for passing arguments within function calls.

Ranges

1:10
seq(1, 20, 2)
print(1:10)
range(1, 11)
range(1, 21, 2)
print(list(range(1, 11)))

Ouch. Python’s code is not only longer, but it excludes the final number.
range(1, 11) creates a range object in Python, which represents the numbers 1 through 10.

list(range(1, 11)) creates a list object in Python, which represents [1, 2, 3, … 10]
Both function similarly to the 1:10 vector in R. However, you need to change a range to a list in order to output it.
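
For example, in this small sketch the conversion to a list is only needed for printing; a range can be looped over directly:

r = range(1, 11)
print(list(r))    # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for n in r:       # a range can also be iterated directly, without list()
    print(n)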

Conditional Statements

if (x > 5) {
  print("x is greater than 5")
} else if (x >= 2) {
  print("x is between 2 and 5")
} else {
  print("x is less than 2")
}
if x > 5:
    print("x is greater than 5")
elif x >= 2:
    print("x is between 2 and 5")
else:
    print("x is less than 2")

R requires the condition to be in parentheses. That would also work in Python, but it’s not required.

R begins a block statement with an opening curly brace ({), and it ends it with a closing curly brace (}).

Python is the oddball here, using a colon (:) to begin a block statement. It then uses indentation for the block’s statements. To end a block, you stop indenting.

This can take some getting used to, because relatively few languages use indentation to define blocks.

Note that the recommended indentation for Python is 4 spaces, but consistency is all that matters.

For Loops

for (i in 1:5) {
  print(i)
}
for i in range(1, 6):
    print(i)

Note the similarity to the if statement comparisons above – particularly the absence of parentheses in the iteration expression, and the block syntax.

Functions

my_function <- function(x, y) {
  return(x + y)
}
def my_function(x, y):
    return x + y
    

In Python, the return statement does not require parentheses, unlike R.

Indexing

my_list[1]
my_list[0]

R uses 1-based indexing.

Python uses 0-based indexing.

R is actually the oddball here. Most modern languages use 0-based indexing.
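
A minimal example with a made-up list shows the difference in practice:

my_list = ["a", "b", "c"]
print(my_list[0])   # "a" - in Python the first element is at index 0 (R would use my_list[1])
print(my_list[2])   # "c" - the third element is at index 2 (R would use my_list[3])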

Slicing

my_list <- c(1, 2, 3, 4, 5)
print(my_list[2:4])  # Outputs 2 3 4
my_list = [1, 2, 3, 4, 5]
print(my_list[1:4])  # Outputs [2, 3, 4]

Because Python uses 0-based indexing, the second element is at position 1 instead of 2.

Python supports similar start:end syntax to R. However, the end index is non-inclusive, meaning the slice stops just before it.

Comments

# This is a comment

No difference.

Libraries

library(dplyr)
library(terra)
library(sf)
library(exactextractr)
import pandas as pd
import rasterio
import geopandas as gpd
from rasterstats import zonal_stats

For data manipulation and geospatial analysis, R and Python rely on external libraries.

You can alias libraries in Python using as, which simply saves typing when calling them in your program.

You can selectively import functions from a Python library using from. To do the same in R you need to use ::, such as dplyr::select().
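
As a small sketch (the math module is just an illustration, not one of the libraries used in this project), the two import styles look like this:

# Full import with an alias: call functions with the pd. prefix
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3]})

# Selective import: bring a single name directly into the namespace
from math import sqrt
print(sqrt(16))   # no math. prefix needed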


Example conversions

The following examples demonstrate how to convert the R code at https://github.com/jjennewein/MDA_2023-24 into equivalent Python code.

0.field buffer.R

0.field buffer.py

library(sf)
import geopandas as gpd

Loads a library for handling geospatial data, including working with shapefiles.

In the Python code, gpd is an alias for the geopandas library. Using the alias simply reduces typing and improves code readability.

setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

RStudio defaults to a global working directory, which means file paths must be specified explicitly.

Python defaults to the folder from which you launch the script (usually the source folder if you start it from there).

Jupyter Notebook defaults to where you launch the notebook server.

  • You could launch it from the command line within the source folder:

    cd \psa\MDA_2023-24\python
    jupyter notebook
  • Or you could specify the folder directly when launching Jupyter:

    jupyter notebook \psa\MDA_2023-24\python

The original file path was "E:/Eastern_Shore/MD_enrollment_data_2023_24/fields/", but I don't have access to that location.

To avoid referencing "E:/Eastern_Shore..." explicitly, I saved all the necessary project files to the same folder as the source code, and I added a setwd() statement to the R source. I then removed all references to "E:/Eastern_Shore..."

This way, the R program can run without requiring the full path in filenames.
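
If you do want the Python code to mimic R's setwd(dirname(...)) line, one common approach is to change the working directory to the script's own folder. This is a minimal sketch that assumes the code runs as a .py file (in a Jupyter notebook, __file__ is not defined):

import os

# Change the working directory to the folder that contains this script
os.chdir(os.path.dirname(os.path.abspath(__file__)))
print(os.getcwd())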

mda = st_read("MDAEnrollment2023-2024.shp") %>%
  st_transform(., crs = 32618) %>%
  st_buffer(., dist = -30) %>%
  dplyr::filter(!st_is_empty(.))
mda = gpd.read_file("MDAEnrollment2023-2024.shp") \
      .to_crs(epsg=32618)
mda["geometry"] = mda.geometry.buffer(-30)
mda = mda[~mda.geometry.is_empty]

This code:

  • Reads a shapefile into a spatial data frame (R) or a GeoDataFrame (Python).

  • Converts the coordinate reference system (CRS) to EPSG 32618 (WGS 84 / UTM zone 18N).

  • Applies a negative buffer of 30 meters to shrink the geometries.
    Note that R's st_buffer() returns the sf object with its geometry column already updated, while the GeoPandas buffer() method returns only a GeoSeries, so the result is assigned back to the geometry column explicitly.

  • Filters out any empty geometries resulting from the buffering.
    In Python, ~ is the bitwise NOT operator; applied to a boolean Series it negates each element.

R uses %>% for chaining operations, while Python uses . (dot notation).

In Python, a backslash (\) means line continuation. I used it here to align the code with the R version. But I could have also written it as:
mda = gpd.read_file("MDAEnrollment2023-2024.shp").to_crs(epsg=32618)

Also, note that the dots in st_transform(., crs = 32618) and st_buffer(., dist = -30) are optional but may improve readability.

st_write(mda, "MDAEnrollment2023-2024_Buff30m.shp", append=FALSE)
mda.to_file("MDAEnrollment2023-2024_Buff30m.shp")

Writes the buffered data frame to a new shapefile.

I added append=FALSE to the original source so the file would be overwritten if it exists. GeoPandas automatically overwrites existing files.

1.Extract_HLS_L30 .R

1.Extract_HLS_L30 .py

library(dplyr)
library(terra)
library(sf)
library(exactextractr)
library(glue)
import pandas as pd
import rasterio
import geopandas as gpd
import os
import glob
import numpy as np
from rasterstats import zonal_stats

Syntax is easy. Knowing which libraries to use – not so much!

Here’s a quick description of each Python library used for this program:

  • pandas: Provides data structures such as DataFrames. Aliased as pd.

  • rasterio: Provides functions for processing geospatial raster data.

  • geopandas: Extends pandas to support spatial data. Aliased as gpd.

  • os: Provides operating system features such as changing directories.

  • glob: Provides functions for listing files that match specific patterns.

  • numpy: Provides functions for multi-dimensional arrays and matrices. Aliased as np.

  • rasterstats: Summarizes geospatial raster datasets based on vector geometries. We only need the zonal_stats function from this library.

process_shapefile <- function(sensor, NIR_band) {
def process_shapefile(sensor, NIR_band):

Defines a function that takes the sensor (L30 or S30) and NIR_band (B05 or B8A) as parameters.

  setwd(base_dir)  
  tiles <- list.files()
    os.chdir(base_dir)
    tiles = os.listdir()

Changes to the base directory and gets a list of the tiles.

  extract = fields %>% #set up the output data frame extractions
    st_drop_geometry() %>%
    dplyr::select()
    extract = fields.iloc[:, 0:0]

Creates a dataframe with the number of rows in fields but with no columns.

Note the simple syntax in Python. The single colon : means “all rows.” 0:0 means “no columns.”
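
As a quick sketch with a made-up stand-in for fields, .iloc[:, 0:0] keeps every row but drops every column:

import pandas as pd

fields = pd.DataFrame({"id": [1, 2, 3], "crop": ["corn", "soy", "wheat"]})   # hypothetical stand-in

extract = fields.iloc[:, 0:0]   # all rows, no columns
print(extract.shape)            # (3, 0)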

  for(i in 1:length(tiles)) {
    for i, tile in enumerate(tiles):

Creates a loop to iterate through the tiles.

Note that the Python code could be written as for tile in tiles. However, the i variable gives us the opportunity to show the status of the program, as described later.
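
Here is a small sketch of enumerate with hypothetical folder names, showing how the index supports a status message:

tiles = ["tile_A", "tile_B", "tile_C"]   # hypothetical folder names

for i, tile in enumerate(tiles):
    print(f"Processing tile {i + 1} of {len(tiles)}: {tile}")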

    tile_dir <- file.path(base_dir, tiles[i])
    setwd(tile_dir)
        tile_dir = os.path.join(base_dir, tile)
        os.chdir(tile_dir)

Changes to the directory of the current tile.
Note that the Python enumerate function allows us to grab both the tile and its index into the tiles list.
Therefore, we don’t need to refer to tiles[i], although that would work just as well.

    scenes <- list.files(pattern = glob2rx(glue("HLS.{sensor}*")))
        scenes = glob.glob(f"HLS.{sensor}*")

Grabs the files whose names match the pattern HLS.{sensor}*.

Python's f-strings work similarly to R's glue() function.
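
For example, with a made-up sensor value, the f-string fills in the placeholder much like glue() does in R:

sensor = "L30"             # hypothetical value
pattern = f"HLS.{sensor}*"
print(pattern)             # HLS.L30*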

    for (j in 1:length(scenes)) {
        for j, scene in enumerate(scenes):

Creates a loop to iterate through each tile's scenes.

      print(glue("Processing scene {j} of {length(scenes)} in tile {i} of {length(tiles)}"))
            print(f"Processing scene {j + 1} of {len(scenes)} in tile {i + 1} of {len(tiles)}")

Shows the current status. Because Python's enumerate counts from 0, the code prints j + 1 and i + 1 so the counts match R's 1-based values.

      scene_dir <- file.path(tile_dir, scenes[j])
      setwd(scene_dir)
            scene_dir = os.path.join(tile_dir, scene)
            os.chdir(scene_dir)

Changes to the scene/date folder.

      bands = list.files() #list rasters in the scene/date folder
      list_NIR <- bands[substring(bands, 36, 38) == NIR_band]
      list_B04 <- bands[substring(bands, 36, 38) == "B04"]
            list_NIR = glob.glob(f"*.{NIR_band}.tif")[0]
            list_B04 = glob.glob("*.B04.tif")[0]

Grabs the filenames of the NIR and B04 bands.

Note how Python’s glob.glob function simplifies this.
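
As a small sketch, glob.glob() returns a list of matching file names, and the [0] above keeps only the first match (this sketch adds a guard for the case where nothing matches):

import glob

matches = glob.glob("*.B04.tif")   # list of matching file names (possibly empty)
if matches:
    list_B04 = matches[0]          # keep only the first match, as the code above does
    print(list_B04)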

      rast.stk <- rast(c(list_B04, list_NIR))
            rast_stk = [rasterio.open(list_B04), rasterio.open(list_NIR)]

Opens the B04 and NIR bands. In R this is a true raster stack; in Python it is a list of rasterio dataset objects, and the pixel values are read later.

      names(rast.stk) <- c(substring(list_B04, 1, 38), substring(list_NIR, 1, 38))
            band_names = [list_B04[:38], list_NIR[:38]]
            rast_stk_named = {name: rast for name, rast in zip(band_names, rast_stk)}

Sets the names of the raster stack based on file names.

      list_cloud_mask <- which(substring(bands, 36, 38) == "Fma")
      cloud_mask <- rast(bands[list_cloud_mask])
            list_cloud_mask = glob.glob("*.Fmask.tif")[0]
            with rasterio.open(list_cloud_mask) as cloud_mask_src:
                cloud_mask = cloud_mask_src.read(1).astype(float)

Finds the Fmask cloud-mask raster and loads it. The Python version reads it into a floating-point array so that NaN values can be assigned in the next steps.

      cloud_mask0 <- cloud_mask
      cloud_mask64 <- cloud_mask
      cloud_mask128 <- cloud_mask
            cloud_mask0 = np.copy(cloud_mask)
            cloud_mask64 = np.copy(cloud_mask)
            cloud_mask128 = np.copy(cloud_mask)

The R code initially creates references to cloud_mask, so cloud_mask0, cloud_mask64, and cloud_mask128 start out as "aliases" of cloud_mask.

This is an optimization (copy-on-modify), which prevents R from copying data that might never change.

However, the data does change in the following code. When that happens, R automatically turns each reference into its own copy of cloud_mask.

The Python code makes this explicit by creating copies of cloud_mask up front; plain assignment in Python would only create additional references.
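
A quick sketch with a made-up array shows why the explicit np.copy matters in Python:

import numpy as np

a = np.array([0.0, 64.0, 128.0])

b = a              # plain assignment: b is just another reference to the same array
b[0] = np.nan
print(a[0])        # nan (a changed too)

c = np.copy(a)     # an independent copy, similar to what R's copy-on-modify produces
c[1] = np.nan
print(a[1])        # 64.0 (a is unaffected)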

      cloud_mask0[cloud_mask0 != 0] <- NA
      cloud_mask64[cloud_mask64 != 64] <- NA
      cloud_mask128[cloud_mask128 != 128] <- NA
            cloud_mask0[cloud_mask0 != 0] = np.nan
            cloud_mask64[cloud_mask64 != 64] = np.nan
            cloud_mask128[cloud_mask128 != 128] = np.nan

Sets values that do not match each condition to NA (R) or NaN (Python).

      cloud_mask_merge <- terra::merge(cloud_mask0, cloud_mask64, cloud_mask128, na.rm = TRUE)
            stacked_masks = np.stack([cloud_mask128, cloud_mask64, cloud_mask0])
            cloud_mask_merge = np.nanmax(stacked_masks, axis=0)

Merges the cloud mask layers for 0, 64, and 128. The Python version stacks the three arrays and takes the per-pixel maximum while ignoring NaN, which gives the same result here because each pixel keeps its value in at most one of the three masks.

      stk.mask <- mask(rast.stk, mask = cloud_mask_merge)
            stk_mask = [np.where(~np.isnan(cloud_mask_merge), band.read(1).astype(float), np.nan)
                        for band in rast_stk]

Applies the cloud mask to the raster stack. Pixels where the merged cloud mask is NaN are set to NaN in each band.

      ext <- exact_extract(stk.mask,fields %>% st_transform(., crs = crs(stk.mask)),'median')
            fields_transformed = fields.to_crs(rast_stk[0].crs)
            ext = [zonal_stats(fields_transformed, band, affine=rast_stk[0].transform, stats=['median'])
                   for band in stk_mask]

Re-projects the fields to the raster CRS, then computes the median of each masked band within each field polygon.

CONTINUE FROM HERE

fields = st_read("MDAEnrollment2023-2024_Buff30m.shp")
fields = gpd.read_file("MDAEnrollment2023-2024_Buff30m.shp")

Reads the buffered shapefile.

plot(fields[1])
fields.plot(edgecolor='black')

Plots the shapefile.

edgecolor isn’t required in the Python code. It simply makes the output consistent with the R output.
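
Note that GeoPandas plots through matplotlib. Jupyter displays the figure automatically, but if you run the code as a plain .py script you typically need to show it explicitly, roughly like this:

import geopandas as gpd
import matplotlib.pyplot as plt

fields = gpd.read_file("MDAEnrollment2023-2024_Buff30m.shp")
fields.plot(edgecolor='black')
plt.show()   # needed in a plain .py script; Jupyter shows the plot automatically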
