GEONETClass: Downloading Data From Amazon AWS With Python and Rclone (Part II)

AWS_Tuto_Logo.png

Hi community!

In the first part of this blog series, we learned how to use the Rclone tool to download data from Amazon AWS, first using Rclone commands directly and then using Python scripts.

As a follow-up to the previous blog post, we are now going to show:

  • How to download only the GOES-R imagery from minutes 20 and 50 of every hour (to complement the data available on GNC-A).
  • Suggest another scheme to download the data from minutes 20 and 50, using the awscli utility and adapting the example script provided by Dr. Marcial Garbanzo in this blog post (suggested by Demilson Quintão, a GNC-A user).
  • Suggest another Python solution that does not use Rclone (as mentioned by Paulo Alexandre Mello in the Part I comments section).

GOES-R Imagery in GEONETCast-Americas

In GNC-A we have imagery from both GOES-16 and GOES-17 (Bands 02, 07, 08, 09, 13, 14 and 15). Right now there are 4 images available each hour, from minutes 00, 10, 30 and 40. Check out below an example list of files received in GNC-A today for Band 13, since 7 AM:

GNC_Reception_Times.png

Downloading imagery only from minutes 20 and 50

The code snippet below shows an approach to detect whether a GOES-R image is from minute 20 or 50.

import re  # Regular expression operations

file_name = "OR_ABI-L2-CMIPF-M6C10_G16_s20191300020310_e20191300030029_c20191300030106.nc"
# Search the file name to check if the GOES image is from minute 20 or 50.
# You may change the "20" and "50" to the minute(s) you want.
regex = re.compile(r'(?:s.........20|s.........50)..._')
finder = re.findall(regex, file_name)
# If "matches" is "0", it is not from minute 20 or 50. If it is "1", we may download the file
matches = len(finder)
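To see why the pattern works, note that the scan-start timestamp in a GOES-R file name has the form `sYYYYDDDHHMMSSs`, so the nine wildcards after the literal "s" skip over the year, julian day and hour, landing exactly on the minute digits. A quick self-contained demonstration (the second file name is a hypothetical minute-00 example, built only for contrast):

```python
import re

# Pattern from the snippet above: matches the scan-start ("s") timestamp
# whose minute digits are 20 or 50. The nine dots cover YYYY + DDD + HH.
regex = re.compile(r'(?:s.........20|s.........50)..._')

def is_minute_20_or_50(file_name):
    """Return True if the GOES-R file's scan start minute is 20 or 50."""
    return len(re.findall(regex, file_name)) > 0

# First name is the real example from the post; second is hypothetical (minute 00)
f20 = "OR_ABI-L2-CMIPF-M6C10_G16_s20191300020310_e20191300030029_c20191300030106.nc"
f00 = "OR_ABI-L2-CMIPF-M6C10_G16_s20191300000310_e20191300010029_c20191300010106.nc"

print(is_minute_20_or_50(f20))  # True  -> scan started at minute 20, download it
print(is_minute_20_or_50(f00))  # False -> scan started at minute 00, skip it

# Equivalent check by slicing: the minute digits sit 9 characters
# after the "s" marker (after YYYY + DDD + HH)
minute = f20.split('_s')[1][9:11]
print(minute)  # '20'
```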

Please find below the full Python script used to download data from these minutes:

############################################################
# LICENSE
# Copyright (C) 2019 - INPE - NATIONAL INSTITUTE FOR SPACE RESEARCH
# This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
# This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
# You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
############################################################
# Required Modules
import os           # Miscellaneous operating system interfaces
import subprocess   # The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.
import datetime     # Basic date and time types
import sys          # System-specific parameters and functions
import platform     # Access to underlying platform’s identifying data
import re           # Regular expression operations
osystem = platform.system()
# On Windows the rclone binary is "rclone.exe"; elsewhere there is no extension
extension = '.exe' if osystem == "Windows" else ''

# Welcome message
print ("GOES-R Big Data Python / Rclone Downloader")

# Desired Data
BUCKET = 'noaa-goes16'     # For GOES-R the buckets are: ['noaa-goes16', 'noaa-goes17']
PRODUCT = 'ABI-L2-CMIPF'   # Choose from ['ABI-L1b-RadC', 'ABI-L1b-RadF', 'ABI-L1b-RadM', 'ABI-L2-CMIPC', 'ABI-L2-CMIPF', 'ABI-L2-CMIPM', 'ABI-L2-MCMIPC', 'ABI-L2-MCMIPF', 'ABI-L2-MCMIPM']
# Use UTC directly, so the hour never wraps past 23 and the julian day stays consistent
utc_now = datetime.datetime.utcnow()
YEAR = str(utc_now.year)                                # Current year (UTC)
JULIAN_DAY = str(utc_now.timetuple().tm_yday).zfill(3)  # Julian day (UTC), zero-padded to 3 digits as in the bucket paths
HOUR = str(utc_now.hour).zfill(2)                       # Current hour (UTC), with 2 digits

print("Current year, julian day and hour (UTC):")
print("YEAR: ", YEAR)
print("JULIAN DAY: ", JULIAN_DAY)
print("HOUR (UTC): ", HOUR)

CHANNELS = ['C09', 'C13']   # Choose from ['C01', 'C02', 'C03', 'C04', 'C05', 'C06', 'C07', 'C08', 'C09', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16']
OUTDIR = "C:\\Rclone\\"     # Choose the output directory

# Loop through all channels chosen in the list
for CHANNEL in CHANNELS:
    # Get output from rclone command, based on the desired data
    files = subprocess.check_output('rclone' + extension + " " + 'ls publicAWS:' + BUCKET + "/" + PRODUCT + "/" + YEAR + "/" + JULIAN_DAY + "/" + HOUR + "/", shell=True)
    # Change type from 'bytes' to 'string'
    files = files.decode()
    # Split the output on newlines and drop any empty entries
    files = [x for x in files.split('\n') if x]
    # Get only the file names for a specific channel
    files = [x for x in files if CHANNEL in x ]
    # Get only the file names, without the file sizes
    files = [i.split(" ")[-1] for i in files]
    # Print the file names list
    #print ("File list for this particular time, date and channel:")
    #for i in files:
    #    print(i)
    if not files:
        print("No files available yet... Exiting loop")
        break # No new files available in the cloud yet. Exiting the loop.
    print ("Checking if the file is on the daily log...")
    # If the log file doesn't exist yet, create one
    file = open('goes16_aws_log_' + str(datetime.datetime.now())[0:10] + '.txt', 'a')
    file.close()
    # Put all file names on the log in a list
    log = []
    with open('goes16_aws_log_' + str(datetime.datetime.now())[0:10] + '.txt') as f:
        log = f.readlines()
    # Remove the line feeds
    log = [x.strip() for x in log]
    if files[-1] not in log:
        print(files[-1])
        print ("Checking if the file is from minute 20 or 50...")
        # Search in the file name if the image from GOES is from minute 20 or 50.
        # You may change the "20" and "50" to the minute (s) you want.
        regex = re.compile(r'(?:s.........20|s.........50)..._')
        finder = re.findall(regex, files[-1])
        # If "matches" is "0", it is not from minute 20 or 50. If it is "1", we may download the file
        matches = len(finder)
        if matches == 0: # If there are no matches
            print("This is not an image from minute 20 or 50... Exiting loop.")
            break # This is not an image from minute 20 or 50. Exiting the loop.
        else:
            print("Image is from minute 20 or 50.")
        print ("Downloading the file for channel: ", CHANNEL)
        # Download the most recent file for this particular hour
        os.system('rclone' + extension + " " + 'copy publicAWS:' + BUCKET + "/" + PRODUCT + "/" + YEAR + "/" + JULIAN_DAY + "/" + HOUR + "/" + files[-1] + " " + OUTDIR)
        print ("Download finished!")
        print ("Putting the file name on the daily log...")
        # Put the processed file on the log
        with open('goes16_aws_log_' + str(datetime.datetime.now())[0:10] + '.txt', 'a') as log:
            log.write(str(datetime.datetime.now()))
            log.write('\n')
            log.write(files[-1] + '\n')
            log.write('\n')
    else:
        print("This file was already downloaded.")
        print(files[-1])
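The daily-log bookkeeping in the script above (open the log, strip the line feeds, check membership, append on success) can be exercised in isolation. A minimal sketch, with the same log naming scheme as the script; the `already_downloaded` and `record_download` helper names are our own, for illustration:

```python
import datetime
import os

def already_downloaded(file_name, log_path):
    """Check whether file_name is already recorded in the daily log."""
    if not os.path.exists(log_path):
        return False
    with open(log_path) as f:
        return file_name in (line.strip() for line in f)

def record_download(file_name, log_path):
    """Append the timestamp and the file name to the daily log."""
    with open(log_path, 'a') as log:
        log.write(str(datetime.datetime.now()) + '\n')
        log.write(file_name + '\n\n')

# Example run (file name taken from the post; the log is created on demand)
log_path = 'goes16_aws_log_' + str(datetime.datetime.now())[0:10] + '.txt'
name = "OR_ABI-L2-CMIPF-M6C10_G16_s20191300020310_e20191300030029_c20191300030106.nc"

record_download(name, log_path)
print(already_downloaded(name, log_path))  # True -> would be skipped next time
```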

And please find below the “Python cron simulator” that will call the script above every 20 seconds (you may change this interval as you wish). Note: You may still use CRON, INCRON, Windows Task Scheduler, etc. This is just an alternative.

############################################################
# LICENSE
# Copyright (C) 2019 - INPE - NATIONAL INSTITUTE FOR SPACE RESEARCH
# This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
# This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
# You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
############################################################
import sched, time # Scheduler library
import os          # Miscellaneous operating system interfaces

# Interval in seconds
seconds = 20

# Call the function for the first time without the interval
print("\n")
print("------------- Calling Monitor Script --------------")
script = 'python aws_goes_downloader.py'
os.system(script)
print("------------- Monitor Script Executed -------------")
print("Waiting for next call. The interval is", seconds, "seconds.")

# Scheduler function
s = sched.scheduler(time.time, time.sleep)

def call_monitor(sc):
    print("\n")
    print("------------- Calling Monitor Script --------------")
    script = 'python aws_goes_downloader.py'
    os.system(script)
    print("------------- Monitor Script Executed -------------")
    print("Waiting for next call. The interval is", seconds, "seconds.")
    s.enter(seconds, 1, call_monitor, (sc,))
    # Keep calling the monitor

# Call the monitor
s.enter(seconds, 1, call_monitor, (s,))
s.run()

Another approach, suggested by Demilson Quintão (IPMET Bauru – Brazil)

Demilson, a GNC-A user, complements his GNC-A station data using the example script from the following blog post:

https://geonetcast.wordpress.com/2018/01/10/script-to-download-goes-16-netcdfs-from-amazon-s3/

This is what he is doing:

  • When the GOES-R imagery from minute 10 or 40 arrives at the GNC-A station, he downloads the data from minute 20 or 50, respectively, from AWS.
  • Due to the rebroadcast latency of GNC-A, by the time the files from minutes 10 or 40 arrive, the files from minutes 20 and 50 are already available on AWS.
  • To detect that these files have arrived at his Linux workstation, he uses INCRONTAB. This tool triggers processes based on file system events. Among these events is IN_CLOSE_WRITE, the one used by Demilson, which fires when a file finishes being written to the system.
  • INCRON works almost like CRONTAB (options -l, -e, etc.). However, INCRON is much more efficient in this case for the sake of timing: it runs only when a new file is written, whereas with CRONTAB you have to choose a polling interval.
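As a sketch, an INCRON entry for this setup could look like the line below (added with `incrontab -e`). The watched directory and the script path are hypothetical; `$@` expands to the watched path and `$#` to the name of the file that triggered the event:

```
# <watched directory>   <event mask>     <command>
/data/gnc-a/goes16      IN_CLOSE_WRITE   /usr/bin/python /home/user/aws_goes_downloader.py $@/$#
```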

Suggested by Blog Reader (Paulo Alexandre Mello): Downloading Data From AWS without using Rclone

Paulo Alexandre Mello, from Brazil, suggested another solution in the first post's comments section:

Comments_AWS_Paulo

You may check the goes-py utility at the following link:

https://github.com/palexandremello/goes-py

Thanks for the suggestion Paulo!

Stay tuned for news!