Skip to content

Comments

Update download_file method in s3_hook to accept files and directories#23654

Closed
kalyani-subbiah wants to merge 6 commits intoapache:mainfrom
kalyani-subbiah:main
Closed

Update download_file method in s3_hook to accept files and directories#23654
kalyani-subbiah wants to merge 6 commits intoapache:mainfrom
kalyani-subbiah:main

Conversation

@kalyani-subbiah
Copy link

@kalyani-subbiah kalyani-subbiah commented May 11, 2022

closes: #23514

Change the download_file method in s3_hook (aws provider) to use the download_file method from boto3 instead of download_fileobj. This allows a local filename to be passed as an argument, making it easier to track files in a bucket.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels May 11, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented May 11, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

@kalyani-subbiah kalyani-subbiah May 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also don't rename the parameter like that.
renaming it is a breaking change and we don't want to break user code if we can avoid it.
Users may use this function directly in thier code.

If you decide that it's best to rename (and can explain why this rename is needed) then you must support both local_path and filename.
for local_path you need to raise deprecation warning.

Copy link
Author

@kalyani-subbiah kalyani-subbiah May 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. In that case we can keep the argument local_path.
If the local_path argument is a path to a directory, then we run the old download_fileobj method from boto3. If its a path to a file, then we run download_file from boto3. I think this workflow would be non-breaking.
We would still have to update the docs to reflect this change though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't get me wrong.. if you think it's best to use the new name we can do the deprecation..
it's not that complex it just add 3-4 code lines :)
See for example:

def get_records(self, sql: str = "", parameters: Optional[dict] = None, hql: str = ""):
""":sphinx-autoapi-skip:"""
if hql:
warnings.warn(
"The hql parameter has been deprecated. You should pass the sql parameter.",
DeprecationWarning,
stacklevel=2,
)
sql = hql

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Deprecations should be when we have some external circumstances - such as changing naming conventions from libraries etc. But if we are in full control of the API and can choose, then it's better to keep it backwards compatible.

Copy link
Author

@kalyani-subbiah kalyani-subbiah May 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! I have updated this pull request based on the feedback above. It's now non-breaking. Thanks so much everyone.
I've added a conditional for using local_path as path to a directory or a file.
Also added an additional unit-test.

…oto3 instead of download_fileobj in order to keep track of the file name & update unit tests
@kalyani-subbiah kalyani-subbiah changed the title Update download_file method in s3_hook to accept filename argument Update download_file method in s3_hook to accept files and directories May 13, 2022
@potiuk
Copy link
Member

potiuk commented May 17, 2022

Needsrebase/fixing static checks.

s3_hook.get_key.assert_called_once_with(key, bucket)
s3_obj.download_fileobj.assert_called_once_with(mock_temp_file)

def test_download_file2(self, s3_bucket):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a more descriptive test/method name. What is this testing that test_download_file doesn't? These are test methods, so you can rename them to something like test_download_file_with_filename and test_download_file_with_directory or something like that, which is much more useful.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for pointing it out - will do once I get some time.

@potiuk
Copy link
Member

potiuk commented Jun 4, 2022

How about fixing static checks and doc failures @kalyani-subbiah - we won't even look at merging it before they are fixed really.

@potiuk
Copy link
Member

potiuk commented Jun 4, 2022

Rebase and use pre-commit/breeze to build docs.

@potiuk
Copy link
Member

potiuk commented Jun 19, 2022

Docs build failing.

@eladkal
Copy link
Contributor

eladkal commented Jul 22, 2022

@kalyani-subbiah are you still working on this PR?

@github-actions
Copy link

github-actions bot commented Sep 6, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 6, 2022
@github-actions github-actions bot closed this Sep 11, 2022
@o-nikolas
Copy link
Contributor

o-nikolas commented Sep 28, 2022

Checking in again here, @kalyani-subbiah are you still working on this PR? It looks like this one was close to the finish line but needed some rebasing and docs fixes.

@eladkal
Copy link
Contributor

eladkal commented Oct 5, 2022

Followup: #26886

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:amazon AWS/Amazon - related issues stale Stale PRs per the .github/workflows/stale.yml policy file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Json files from S3 downloading as text files

7 participants