JSON files from S3 downloading as text files #23514

@kalyani-subbiah

Description

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

No response

Apache Airflow version

2.3.0 (latest released)

Operating System

Mac OS Mojave 10.14.6

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened

When I download a JSON file from S3 using the S3Hook:

filename = s3_hook.download_file(bucket_name=self.source_s3_bucket, key=key, local_path="./data")

The file is downloaded as a plain text file whose name starts with airflow_temp_, rather than keeping its original name or .json extension.
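
This seems to be by design: as far as I can tell from the provider source, S3Hook.download_file streams the object into a NamedTemporaryFile and returns that temporary path, so the original key name and extension are lost. Roughly, the hook's internals look like this (a paraphrase, not the exact provider code; s3_obj here stands for the hook's internal object handle and local_path for the argument of the same name):

from tempfile import NamedTemporaryFile

# The object is written to a randomly named temp file inside local_path,
# and that temp path is what the caller gets back.
with NamedTemporaryFile(dir=local_path, prefix="airflow_temp_", delete=False) as local_tmp_file:
    s3_obj.download_fileobj(local_tmp_file)
filename = local_tmp_file.name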

What you think should happen instead

It would be nice for the file to be downloaded as a .json file, or to keep the same filename it has in S3. As it stands, extra code is needed to read the file back into a dictionary (e.g. ast.literal_eval), and there is no guarantee that the JSON structure is preserved.
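
In the meantime, a minimal workaround sketch (assuming the objects really are plain JSON and that renaming the temp file inside local_path is acceptable; the target-path construction is mine, not part of the hook's API):

import json
import os

filename = s3_hook.download_file(bucket_name=self.source_s3_bucket, key=key, local_path="./data")
# Rename the airflow_temp_* file back to the original key's basename
target = os.path.join("./data", os.path.basename(key))
os.rename(filename, target)
with open(target) as handle:
    filing = json.load(handle)  # parses straight to a dict, no ast.literal_eval needed

(Newer releases of the amazon provider appear to add a preserve_file_name argument to download_file for exactly this case; worth checking the provider changelog.)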

How to reproduce

In the code below, s3_conn_id refers to an Airflow connection and the S3 buckets exist on AWS.
This is the custom operator class:

import logging
import pickle

from airflow.models.baseoperator import BaseOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


class S3SearchFilingsOperator(BaseOperator):
    """
    Queries the Datastore API and uploads the processed info as a CSV to the S3 bucket.

    :param source_s3_bucket: Source S3 bucket
    :param source_s3_directory: Source S3 directory
    :param s3_conn_id: S3 connection ID
    :param destination_s3_bucket: Destination S3 bucket
    :param destination_s3_directory: Destination S3 directory
    :param search_terms: Terms to search for
    """

    def __init__(
        self,
        source_s3_bucket=None,
        source_s3_directory=None,
        s3_conn_id=None,
        destination_s3_bucket=None,
        destination_s3_directory=None,
        search_terms=None,
        *args,
        **kwargs) -> None:

        super().__init__(*args, **kwargs)

        self.source_s3_bucket = source_s3_bucket
        self.source_s3_directory = source_s3_directory
        self.s3_conn_id = s3_conn_id
        self.destination_s3_bucket = destination_s3_bucket
        self.destination_s3_directory = destination_s3_directory
        self.search_terms = search_terms or []

    def execute(self, context):
        """Executes the operator."""
        s3_hook = S3Hook(self.s3_conn_id)
        keys = s3_hook.list_keys(bucket_name=self.source_s3_bucket)
        for key in keys:
            # download_file returns the path of the local temp file it wrote
            filename = s3_hook.download_file(
                bucket_name=self.source_s3_bucket, key=key, local_path="./data"
            )
            logging.info(filename)
            with open(filename, "rb") as handle:
                filing = pickle.loads(handle.read())
            logging.info(filing.keys())
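
One more detail in the repro: source_s3_directory is accepted but never used, so execute downloads every key in the bucket. Since list_keys accepts a prefix argument, the listing was presumably meant to be scoped like this:

# Hypothetical fix: limit the listing to the source "directory" prefix
keys = s3_hook.list_keys(
    bucket_name=self.source_s3_bucket,
    prefix=self.source_s3_directory,
)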

And this is the DAG file:

from datetime import timedelta

from airflow import DAG
from airflow.utils.dates import days_ago

from keywordSearch.operators.s3_search_filings_operator import S3SearchFilingsOperator

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": days_ago(2),
    "email": ["[email protected]"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(seconds=30),
}

with DAG(
    "keyword-search-full-load",
    default_args=default_args,
    description="Syntax Keyword Search",
    max_active_runs=1,
    schedule_interval=None,
) as dag:

    op3 = S3SearchFilingsOperator(
        task_id="s3_search_filings",
        source_s3_bucket="processed-filings",
        source_s3_directory="citations",
        s3_conn_id="Syntax_S3",
        destination_s3_bucket="keywordsearch",
        destination_s3_directory="results",
    )

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
