Unable to open a file in GCS #36993

@Fokko

Describe the bug, including details regarding any error messages, version, and platform.

I'm writing integration tests against a local GCS instance using fake-gcs-server. Writing a file and fetching its metadata both succeed, but opening the same file for reading fails with a FileNotFoundError:

➜  python git:(fd-gcs) ✗ ipython
Python 3.9.17 (main, Jun 20 2023, 18:00:22) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pyarrow.fs import GcsFileSystem
   ...: from datetime import datetime
   ...: 
   ...: fs = GcsFileSystem(
   ...:   access_token='anon',
   ...:   credential_token_expiration=datetime(2023, 8, 2, 16, 30, 4),
   ...:   scheme='http',
   ...:   endpoint_override='0.0.0.0:4443'
   ...: )

In [2]: location = 'warehouse/vo.txt'
   ...: 
   ...: with fs.open_output_stream(location) as f:
   ...:   print(f.write(b"foo"))
3

In [3]: print(fs.get_file_info(location))
<FileInfo for 'warehouse/vo.txt': type=FileType.File, size=3>

In [4]: with fs.open_input_file(location) as f:
   ...:   print(f.read())
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[4], line 1
----> 1 with fs.open_input_file(location) as f:
      2   print(f.read())

File ~/Library/Python/3.9/lib/python/site-packages/pyarrow/_fs.pyx:763, in pyarrow._fs.FileSystem.open_input_file()

File ~/Library/Python/3.9/lib/python/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/Library/Python/3.9/lib/python/site-packages/pyarrow/error.pxi:113, in pyarrow.lib.check_status()

FileNotFoundError: [Errno 2] google::cloud::Status(NOT_FOUND: Permanent error in Read(): ). Detail: [errno 2] No such file or directory

This can be reproduced with:

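from pyarrow.fs import GcsFileSystem
from datetime import datetime

# Point PyArrow at the local fake-gcs-server instead of the real GCS endpoint.
fs = GcsFileSystem(
  access_token='anon',
  credential_token_expiration=datetime(2023, 8, 2, 16, 30, 4),
  scheme='http',
  endpoint_override='0.0.0.0:4443'
)

location = 'warehouse/vo.txt'

# Writing succeeds and reports 3 bytes written.
with fs.open_output_stream(location) as f:
  print(f.write(b"foo"))

# The metadata lookup also succeeds (type=FileType.File, size=3).
print(fs.get_file_info(location))

# Reading the same object back raises FileNotFoundError.
with fs.open_input_file(location) as f:
  print(f.read())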

The fake-gcs-server log for the failing calls with PyArrow:

time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 404 59"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o?prefix=1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt%2F&pageToken= HTTP/1.1\" 200 27"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"POST /upload/storage/v1/b/warehouse/o?uploadType=resumable&name=1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 200 335"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"PUT /upload/storage/v1/b/warehouse/o?uploadType=resumable&name=1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt&upload_id=43a8ec7bc33a15592b750fc916790750 HTTP/1.1\" 200 570"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 200 570"
time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /warehouse/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 404 10"

The last call is the one that returns the 404. Unlike the earlier GET requests, its path seems to be missing the /storage/v1/b/ prefix.
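To make the difference between the last two GET requests concrete, here is a minimal sketch that parses two lines copied from the log above and flags whether each request path starts with /storage/v1/b/. The regular expression, the `summarize` helper, and the `JSON_API_PREFIX` name are illustrative assumptions for this sketch. They are not part of PyArrow or fake-gcs-server.

```python
import re

# Prefix used by the GET requests that succeed in the log above
# (the constant name is hypothetical, chosen for this sketch).
JSON_API_PREFIX = "/storage/v1/b/"

# Two lines copied verbatim from the fake-gcs-server log above.
LOG_LINES = [
    r'time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /storage/v1/b/warehouse/o/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 200 570"',
    r'time="2023-08-02T14:35:57Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:35:57 +0000] \"GET /warehouse/1bc68628-a1d3-4081-b3f1-9d69224ddd5c.txt HTTP/1.1\" 404 10"',
]

# Matches the escaped request section: \"METHOD target HTTP/x.y\" status size
REQUEST_RE = re.compile(
    r'\\"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/\d\.\d\\" (?P<status>\d{3}) \d+'
)


def summarize(line):
    """Return (method, path, status, has_json_api_prefix), or None if no match."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    # Drop the query string so only the path is compared against the prefix.
    path = match["target"].split("?", 1)[0]
    return (
        match["method"],
        path,
        int(match["status"]),
        path.startswith(JSON_API_PREFIX),
    )


for line in LOG_LINES:
    print(summarize(line))
```

For these two lines, the request with the prefix is the one that returned 200, and the request without it is the one that returned 404.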

The log for the equivalent code using GCSSpec:

time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o/d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt HTTP/1.1\" 404 59"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o?delimiter=/&prefix=d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt/ HTTP/1.1\" 200 27"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o/d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt HTTP/1.1\" 404 59"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"POST /upload/storage/v1/b/warehouse/o?uploadType=resumable HTTP/1.1\" 200 335"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"POST /upload/storage/v1/b/warehouse/o?uploadType=resumable&name=d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt&upload_id=2b6f8d48acf8dd87cc86d1e51bd3120e HTTP/1.1\" 200 570"
time="2023-08-02T14:36:10Z" level=info msg="172.19.0.1 - - [02/Aug/2023:14:36:10 +0000] \"GET /storage/v1/b/warehouse/o/d3057e83-52ab-4ce4-b16f-d55af7ba3525.txt HTTP/1.1\" 200 570"

This only seems to happen when endpoint_override is set.

Component(s)

Python
