The grab_and_go Module

The grab_and_go module provides a streamlined interface for downloading and processing satellite data in a single operation. It handles the entire workflow from retrieving data from remote servers to extracting fields of interest and saving them to disk.

Functions

grab(aios_ds, t0, t1, verbose=True, skip_download=False)

Retrieves files from a data source within a given time range.

Parameters:
  • aios_ds (AIOS_DataSet) – Dataset object containing source and collection details

  • t0 (str or datetime) – Start time

  • t1 (str or datetime) – End time

  • verbose (bool) – Enable verbose output

  • skip_download (bool) – Skip the download step (for testing)

Returns:

List of local file paths if files are downloaded, otherwise None

Return type:

list or None

Raises:

ValueError – If the data source is not supported
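
Because t0 and t1 accept either a string or a datetime, callers may find it convenient to normalize both values before requesting a time range. The following is a minimal sketch under stated assumptions: the helper name to_datetime is hypothetical and not part of the module, and how grab converts its arguments internally is not documented here.

```python
from datetime import datetime


def to_datetime(value):
    """Return a datetime for an ISO-format string or a datetime.

    Hypothetical helper for illustration; not part of grab_and_go.
    """
    if isinstance(value, datetime):
        return value
    if isinstance(value, str):
        # fromisoformat accepts date-only strings such as '2020-01-01'
        return datetime.fromisoformat(value)
    raise TypeError(f"Expected str or datetime, got {type(value).__name__}")


# Normalize both ends of the range and check their order
t0 = to_datetime("2020-01-01")
t1 = to_datetime(datetime(2020, 1, 2, 12, 0))
if t0 >= t1:
    raise ValueError("t0 must be earlier than t1")
```

Checking the ordering up front gives a clearer error than an empty search result would.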

extract(aios_ds, local_files, exdict, n_cores, debug=False, single=False, verbose=True)

Extracts data from local files using specified extraction parameters.

Parameters:
  • aios_ds (AIOS_DataSet) – Dataset object containing field information

  • local_files (list) – Paths to the local files to process

  • exdict (dict) – Dictionary of extraction parameters

  • n_cores (int) – Number of cores to use for multiprocessing

  • debug (bool) – Enable debugging mode

  • single (bool) – Enable single process mode

  • verbose (bool) – Enable verbose output

Returns:

Tuple of fields, inpainted masks, metadata, and times

Return type:

tuple

Raises:

ValueError – If the dataset field is not supported
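
The n_cores and single parameters suggest a choice between a worker pool and a plain loop. The sketch below illustrates that general pattern with the standard library; it is not the module's implementation. The names map_files and extract_one are hypothetical, and the worker is a trivial stand-in for real field extraction.

```python
from concurrent.futures import ProcessPoolExecutor


def extract_one(path):
    """Hypothetical per-file worker; a stand-in for real extraction."""
    return len(path)


def map_files(local_files, worker, n_cores, single=False):
    """Apply worker to each file, in one process or across n_cores processes."""
    if single or n_cores <= 1:
        # Single-process mode: easier to step through in a debugger
        return [worker(f) for f in local_files]
    with ProcessPoolExecutor(max_workers=n_cores) as pool:
        # pool.map preserves the order of local_files in its results
        return list(pool.map(worker, local_files))


if __name__ == "__main__":
    print(map_files(["a.nc", "bb.nc"], extract_one, n_cores=2, single=True))
```

Running in a single process is usually the simpler way to investigate a failing file, since exceptions surface directly rather than through the pool.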

run(dataset, tstart, tend, eoption_file, ex_file, tbl_file, n_cores, tdelta={'days': 1}, verbose=True, debug=False, debug_noasync=False, save_local_files=False)

Complete end-to-end pipeline to grab and extract data from a dataset.

Parameters:
  • dataset (str) – Name of the dataset (e.g., 'VIIRS_NPP')

  • tstart (str) – Start time in ISO format (e.g., '2020-01-01')

  • tend (str) – End time in ISO format

  • eoption_file (str) – Filename of extraction options

  • ex_file (str) – Output HDF5 filename for extracted data

  • tbl_file (str) – Output parquet filename for metadata

  • n_cores (int) – Number of cores to use

  • tdelta (dict) – Length of each processing chunk (default: {'days': 1})

  • verbose (bool) – Enable verbose output

  • debug (bool) – Enable debug mode

  • debug_noasync (bool) – Debug mode that runs without asynchronous execution

  • save_local_files (bool) – Keep downloaded files after processing

Returns:

None
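
To make the role of tdelta concrete, the sketch below splits a time range into consecutive chunks. It assumes that tdelta holds keyword arguments for datetime.timedelta, which is consistent with the default {'days': 1} but is not confirmed by this documentation. The function name iter_chunks is hypothetical.

```python
from datetime import datetime, timedelta


def iter_chunks(tstart, tend, tdelta):
    """Yield (start, end) datetime pairs covering [tstart, tend).

    Assumption: tdelta is a dict of datetime.timedelta keyword arguments.
    """
    step = timedelta(**tdelta)
    if step <= timedelta(0):
        raise ValueError("tdelta must describe a positive duration")
    start = datetime.fromisoformat(tstart)
    end = datetime.fromisoformat(tend)
    while start < end:
        stop = min(start + step, end)  # clip the last chunk to tend
        yield start, stop
        start = stop


# Two one-day chunks: Jan 1 to Jan 2, then Jan 2 to Jan 3
chunks = list(iter_chunks("2020-01-01", "2020-01-03", {"days": 1}))
```

Smaller chunks generally limit how many downloaded files need to be held on disk at once, at the cost of more iterations.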

Extraction Parameters

The extraction options file (eoption_file) should be a JSON file with the following parameters:

  • field_size (int): Size of the field to extract in pixels

  • clear_threshold (float): Percentage threshold for clear conditions

  • nadir_offset (int): Offset from nadir in pixels

  • temp_bounds (list): Temperature bounds [min, max] in degrees Celsius

  • nrepeat (int): Number of repetitions for extraction

  • sub_grid_step (int): Step size for sub-grid extraction

  • grow_mask (bool): Whether to grow the cloud mask

  • inpaint (bool): Whether to perform inpainting on masked regions
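
As an illustration of the parameters listed above, the sketch below writes an options file with the standard json module and checks that each parameter is present with the expected type. All values are placeholders chosen for the example, not recommended settings, and the check_options helper is hypothetical rather than part of the module.

```python
import json
import os
import tempfile

# Placeholder values for illustration only; not recommended settings
options = {
    "field_size": 192,
    "clear_threshold": 95.0,
    "nadir_offset": 0,
    "temp_bounds": [-3.0, 34.0],
    "nrepeat": 1,
    "sub_grid_step": 2,
    "grow_mask": False,
    "inpaint": True,
}

# Expected Python type(s) for each parameter, following the list above
expected_types = {
    "field_size": int,
    "clear_threshold": (int, float),
    "nadir_offset": int,
    "temp_bounds": list,
    "nrepeat": int,
    "sub_grid_step": int,
    "grow_mask": bool,
    "inpaint": bool,
}


def check_options(opts):
    """Return a list of problems found in an options dict (empty if none)."""
    problems = []
    for key, typ in expected_types.items():
        if key not in opts:
            problems.append(f"missing parameter: {key}")
        elif not isinstance(opts[key], typ):
            problems.append(f"wrong type for {key}: {type(opts[key]).__name__}")
    return problems


# Write the options to a JSON file, then read them back and check them
path = os.path.join(tempfile.mkdtemp(), "extract_example.json")
with open(path, "w") as f:
    json.dump(options, f, indent=2)
with open(path) as f:
    loaded = json.load(f)
problems = check_options(loaded)
```

The resulting file path is what would be passed as eoption_file.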

Example Usage

Basic usage with VIIRS NPP data:

import asyncio
from wrangler.grab_and_go import run

# Define extraction options file
extract_file = 'extract_viirs_std.json'

# Run the pipeline to download and process data
run(
    dataset='VIIRS_NPP',          # Dataset name
    tstart='2024-01-01',          # Start date
    tend='2024-01-02',            # End date
    eoption_file=extract_file,    # Extraction options
    ex_file='output.h5',          # Output data file
    tbl_file='metadata.parquet',  # Output metadata file
    n_cores=4                     # Number of processing cores
)

Handling Larger Time Periods

For processing larger time periods efficiently:

import pandas as pd
from datetime import timedelta
from wrangler.grab_and_go import run

# Process one week at a time
start_date = pd.to_datetime('2024-01-01')
end_date = pd.to_datetime('2024-01-31')

current_date = start_date
while current_date < end_date:
    next_date = current_date + timedelta(days=7)

    # Ensure we don't go past the end date
    if next_date > end_date:
        next_date = end_date

    # Process this time chunk
    run(
        dataset='VIIRS_NPP',
        tstart=current_date.isoformat(),
        tend=next_date.isoformat(),
        eoption_file='extract_viirs_std.json',
        ex_file=f'viirs_{current_date.strftime("%Y%m%d")}.h5',
        tbl_file=f'viirs_meta_{current_date.strftime("%Y%m%d")}.parquet',
        n_cores=4
    )

    current_date = next_date

Output Structure

The extraction process produces two main outputs:

  1. HDF5 File (ex_file) - Contains the extracted data:

    • fields: Extracted field data (n_fields × field_size × field_size)

    • inpainted_masks: Inpainted mask data

    • metadata: Array of metadata for each field

  2. Parquet File (tbl_file) - Contains all metadata in tabular format:

    • filename: Original source file

    • row, col: Position in the original granule

    • lat, lon: Geographic coordinates

    • clear_fraction: Fraction of clear pixels

    • field_size: Size of the extracted field

    • datetime: Timestamp of the data

    • ex_filename: Path to the extraction file
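
The two outputs can be inspected with h5py and pandas. The sketch below assumes both packages are installed, along with a parquet engine such as pyarrow, and that the HDF5 dataset is named fields as listed above. The function name summarize_outputs is hypothetical.

```python
def summarize_outputs(ex_file, tbl_file):
    """Return the 'fields' shape, the HDF5 dataset names, and the metadata table."""
    # Imported inside the function so the sketch can be loaded even when
    # these third-party packages are not installed.
    import h5py
    import pandas as pd

    with h5py.File(ex_file, "r") as f:
        # Expected layout: (n_fields, field_size, field_size)
        fields_shape = f["fields"].shape
        dataset_names = list(f.keys())

    # Requires a parquet engine such as pyarrow or fastparquet
    table = pd.read_parquet(tbl_file)
    return fields_shape, dataset_names, table
```

For the basic example, this would be called as summarize_outputs('output.h5', 'metadata.parquet').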

Notes

  • Currently supports PODAAC data sources

  • Only SST (Sea Surface Temperature) fields are supported

  • Uses multiprocessing for parallel extraction of fields

  • Automatically validates the metadata table before saving
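
The validation performed by the module is not described here. As a rough illustration of one kind of check a caller could run on the metadata table, the sketch below reports which of the columns listed under Output Structure are absent. The names EXPECTED_COLUMNS and missing_columns are hypothetical.

```python
# Column names taken from the Output Structure section
EXPECTED_COLUMNS = (
    "filename", "row", "col", "lat", "lon",
    "clear_fraction", "field_size", "datetime", "ex_filename",
)


def missing_columns(columns):
    """Return the expected column names that are absent, in listed order."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Example: a table that only has the first two columns
absent = missing_columns(["filename", "row"])
```

With a pandas DataFrame loaded from tbl_file, the same check would be missing_columns(table.columns).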