The grab_and_go Module
======================

The ``grab_and_go`` module provides a streamlined interface for downloading and processing satellite data in a single operation. It handles the entire workflow, from retrieving data from remote servers to extracting fields of interest and saving them to disk.

Functions
---------

.. function:: grab(aios_ds, t0, t1, verbose=True, skip_download=False)

   Retrieves files from a data source within a given time range.

   :param aios_ds: Dataset object containing source and collection details
   :type aios_ds: AIOS_DataSet
   :param t0: Start time
   :type t0: str or datetime
   :param t1: End time
   :type t1: str or datetime
   :param verbose: Enable verbose output
   :type verbose: bool
   :param skip_download: Skip the download step (for testing)
   :type skip_download: bool
   :return: List of local file paths if files are downloaded, otherwise None
   :rtype: list or None
   :raises ValueError: If the data source is not supported

.. function:: extract(aios_ds, local_files, exdict, n_cores, debug=False, single=False, verbose=True)

   Extracts data from local files using specified extraction parameters.

   :param aios_ds: Dataset object containing field information
   :type aios_ds: AIOS_DataSet
   :param local_files: Paths to the local files to process
   :type local_files: list
   :param exdict: Dictionary of extraction parameters
   :type exdict: dict
   :param n_cores: Number of cores to use for multiprocessing
   :type n_cores: int
   :param debug: Enable debugging mode
   :type debug: bool
   :param single: Enable single process mode
   :type single: bool
   :param verbose: Enable verbose output
   :type verbose: bool
   :return: Tuple of fields, inpainted masks, metadata, and times
   :rtype: tuple
   :raises ValueError: If the dataset field is not supported

.. function:: run(dataset, tstart, tend, eoption_file, ex_file, tbl_file, n_cores, tdelta={'days':1}, verbose=True, debug=False, debug_noasync=False, save_local_files=False)

   Complete end-to-end pipeline to grab and extract data from a dataset.

   :param dataset: Name of the dataset (e.g., 'VIIRS_NPP')
   :type dataset: str
   :param tstart: Start time in ISO format (e.g., '2020-01-01')
   :type tstart: str
   :param tend: End time in ISO format
   :type tend: str
   :param eoption_file: Filename of extraction options
   :type eoption_file: str
   :param ex_file: Output HDF5 filename for extracted data
   :type ex_file: str
   :param tbl_file: Output parquet filename for metadata
   :type tbl_file: str
   :param n_cores: Number of cores to use
   :type n_cores: int
   :param tdelta: Time delta for processing chunks
   :type tdelta: dict
   :param verbose: Enable verbose output
   :type verbose: bool
   :param debug: Enable debug mode
   :type debug: bool
   :param debug_noasync: Debug without async
   :type debug_noasync: bool
   :param save_local_files: Keep downloaded files after processing
   :type save_local_files: bool
   :return: None

Extraction Parameters
---------------------

The extraction options file (``eoption_file``) should be a JSON file with the following parameters:

* ``field_size`` (int): Size of the field to extract in pixels
* ``clear_threshold`` (float): Percentage threshold for clear conditions
* ``nadir_offset`` (int): Offset from nadir in pixels
* ``temp_bounds`` (list): Temperature bounds [min, max] in degrees Celsius
* ``nrepeat`` (int): Number of repetitions for extraction
* ``sub_grid_step`` (int): Step size for sub-grid extraction
* ``grow_mask`` (bool): Whether to grow the cloud mask
* ``inpaint`` (bool): Whether to perform inpainting on masked regions

Example Usage
-------------

Basic usage with VIIRS NPP data:

.. code-block:: python

   from wrangler.grab_and_go import run

   # Define extraction options file
   extract_file = 'extract_viirs_std.json'

   # Run the pipeline to download and process data
   run(
       dataset='VIIRS_NPP',          # Dataset name
       tstart='2024-01-01',          # Start date
       tend='2024-01-02',            # End date
       eoption_file=extract_file,    # Extraction options
       ex_file='output.h5',          # Output data file
       tbl_file='metadata.parquet',  # Output metadata file
       n_cores=4                     # Number of processing cores
   )

Handling Larger Time Periods
----------------------------

For processing larger time periods efficiently:

.. code-block:: python

   import pandas as pd
   from datetime import timedelta
   from wrangler.grab_and_go import run

   # Process one week at a time
   start_date = pd.to_datetime('2024-01-01')
   end_date = pd.to_datetime('2024-01-31')

   current_date = start_date
   while current_date < end_date:
       next_date = current_date + timedelta(days=7)

       # Ensure we don't go past the end date
       if next_date > end_date:
           next_date = end_date

       # Process this time chunk
       run(
           dataset='VIIRS_NPP',
           tstart=current_date.isoformat(),
           tend=next_date.isoformat(),
           eoption_file='extract_viirs_std.json',
           ex_file=f'viirs_{current_date.strftime("%Y%m%d")}.h5',
           tbl_file=f'viirs_meta_{current_date.strftime("%Y%m%d")}.parquet',
           n_cores=4
       )

       current_date = next_date

Output Structure
----------------

The extraction process produces two main outputs:

1. HDF5 File (``ex_file``)

   - ``fields``: Extracted field data (n_fields × field_size × field_size)
   - ``inpainted_masks``: Inpainted mask data
   - ``metadata``: Array of metadata for each field

2. Parquet File (``tbl_file``)

   - Contains all metadata in tabular format:

     - ``filename``: Original source file
     - ``row``, ``col``: Position in the original granule
     - ``lat``, ``lon``: Geographic coordinates
     - ``clear_fraction``: Fraction of clear pixels
     - ``field_size``: Size of the extracted field
     - ``datetime``: Timestamp of the data
     - ``ex_filename``: Path to the extraction file

Notes
-----

* Currently supports PODAAC data sources
* Only SST (Sea Surface Temperature) fields are supported
* Uses multiprocessing for parallel extraction of fields
* Automatically validates the metadata table before saving
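
As a minimal sketch of the options file described under Extraction Parameters, the snippet below writes a JSON file containing the eight documented keys. The helper name ``write_extraction_options``, the constant ``EXAMPLE_OPTIONS``, the output filename, and every value shown are hypothetical placeholders chosen for illustration. They are not defaults taken from the package, so substitute values appropriate to your dataset.

.. code-block:: python

   import json
   from pathlib import Path

   # Keys documented under "Extraction Parameters". All values are
   # hypothetical placeholders, not defaults shipped with the package.
   EXAMPLE_OPTIONS = {
       "field_size": 192,            # pixels
       "clear_threshold": 95.0,      # percent
       "nadir_offset": 0,            # pixels
       "temp_bounds": [-3.0, 34.0],  # [min, max] in degrees Celsius
       "nrepeat": 1,
       "sub_grid_step": 2,
       "grow_mask": False,
       "inpaint": True,
   }


   def write_extraction_options(path, **overrides):
       """Write the example options to ``path`` as JSON.

       Keyword arguments replace the placeholder values; names that are
       not among the documented keys raise ``KeyError``.
       """
       unknown = set(overrides) - set(EXAMPLE_OPTIONS)
       if unknown:
           raise KeyError(f"Unknown extraction option(s): {sorted(unknown)}")
       options = {**EXAMPLE_OPTIONS, **overrides}
       Path(path).write_text(json.dumps(options, indent=2))
       return options


   if __name__ == "__main__":
       # Hypothetical filename; pass the same path as ``eoption_file``.
       write_extraction_options("extract_example.json", field_size=128)

The resulting file can then be passed to ``run`` as ``eoption_file``. Rejecting unknown keys is a design choice of this sketch, intended to catch misspelled option names before a long download starts.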