Add GCSFile Storage #149

Merged
JacobHayes merged 4 commits into golden from gcs
Mar 15, 2022

Conversation

JacobHayes (Member) commented Dec 29, 2021

Adds GCSFile and GCSFilePartition Storage types + I/O for JSON and Pickle formats.

Tests for external resources can be a bit tricky, but gcsfs (and the original google-cloud-storage lib) support a STORAGE_EMULATOR_HOST env var. During gcs test setup, we spin up an instance of gcp-storage-emulator (there are a couple of alternatives, namely fake-gcs-server) and give each test a separate bucket for isolation.
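A minimal sketch of the two pieces described above, using only the standard library: pointing clients at an emulator through STORAGE_EMULATOR_HOST, and generating a distinct bucket name per test. The helper names `emulator_env` and `unique_bucket_name`, the `arti-test` prefix, and the example URL are hypothetical and are not taken from this PR; creating the bucket itself depends on the client library and is left out.

```python
import os
import uuid
from collections.abc import Iterator
from contextlib import contextmanager
from unittest import mock


@contextmanager
def emulator_env(url: str) -> Iterator[str]:
    """Set STORAGE_EMULATOR_HOST for the duration of the block, then restore it."""
    with mock.patch.dict(os.environ, {"STORAGE_EMULATOR_HOST": url}):
        yield url


def unique_bucket_name(prefix: str = "arti-test") -> str:
    """Return a bucket name that is very unlikely to collide between tests."""
    # uuid4().hex is 32 lowercase hex characters, which keeps the name within
    # the 3-63 character limit for bucket names.
    return f"{prefix}-{uuid.uuid4().hex}"
```

In a pytest suite, a session-scoped fixture could wrap `emulator_env` around the running emulator, and a function-scoped fixture could create (and later delete) a bucket named by `unique_bucket_name()` so tests never share state.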

Closes #121

codecov-commenter commented Mar 13, 2022

Codecov Report

Merging #149 (680336e) into golden (38469ba) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            golden      #149   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           35        38    +3     
  Lines         2173      2242   +69     
  Branches       477       489   +12     
=========================================
+ Hits          2173      2242   +69     

Impacted Files | Coverage Δ
--- | ---
src/arti/io/json_gcsfile_python.py | 100.00% <100.00%> (ø)
src/arti/io/pickle_gcsfile_python.py | 100.00% <100.00%> (ø)
src/arti/storage/google/cloud/storage.py | 100.00% <100.00%> (ø)

JacobHayes (Member, Author) commented Mar 14, 2022

Development on https://gitlab.com/potato-oss/google-cloud/gcloud-storage-emulator seems to have slowed, so I tried the https://github.com/oittaa/gcp-storage-emulator fork, which supports a bit more. However, I still hit a couple of issues (e.g. oittaa/gcp-storage-emulator#147). A quick test with https://github.com/fsouza/fake-gcs-server (which GCSFS uses in its test suite) seemed to work, but it's not available as a Python lib, so we would need to shell out to it or distribute it somehow.

The gcp-storage-emulator maintainer has been quite responsive, so I'll go with that and keep fake-gcs-server as a fallback.

--

For posterity, here are a few different (over-engineered) versions of the emulator fixture I tried while going back and forth here...

gcp-storage-emulator

import os
from collections.abc import Generator
from unittest import mock

import gcp_storage_emulator.server
import pytest


@pytest.fixture(scope="session")
def gcs_emulator() -> Generator[tuple[str, int], None, None]:
    # port=0 -> run on a random open port, which we have to look up after the server starts.
    server = gcp_storage_emulator.server.create_server("localhost", 0, in_memory=True)
    server.start()
    try:
        # Relies on private attributes to find the bound port; may break across emulator versions.
        host, port = server._api._httpd.socket.getsockname()
        with mock.patch.dict(os.environ, {"STORAGE_EMULATOR_HOST": f"http://{host}:{port}"}):
            yield host, port
    finally:
        server.stop()

fake-gcs-server docker

I originally tried this Docker version (similar to GCSFS's test suite), but the GitHub Mac runner doesn't have docker-for-mac installed due to EULA issues.

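import os
import time
from collections.abc import Generator
from unittest import mock

import pytest
import requests
import sh


@pytest.fixture(scope="session")
def gcs_emulator(
    fake_gcs_server_version: str = "1.37.0", port: int = 2784
) -> Generator[str, None, None]:
    if not sh.which("docker"):
        # pytest.skip() raises on its own, so no explicit `raise` is needed.
        pytest.skip("docker not available to run fake-gcs-server")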
    container = "arti-test-gcs-emulator"
    url = f"http://localhost:{port}"
    sh.docker.run(
        "-d",
        f"--name={container}",
        f"-p={port}:{port}",
        f"fsouza/fake-gcs-server:{fake_gcs_server_version}",
        "-backend=memory",
        "-scheme=http",
        f"-external-url={url}",
        f"-port={port}",
        f"-public-host={url}",
    )
    try:
        time.sleep(0.25)
        requests.get(url + "/storage/v1/b").raise_for_status()
        with mock.patch.dict(os.environ, {"STORAGE_EMULATOR_HOST": url}):
            yield url
    finally:
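        sh.docker.rm("-fv", container)

fake-gcs-server binaries

import os
import platform
import shutil
import time
from collections.abc import Generator
from pathlib import Path
from unittest import mock

import pytest
import requests
import sh

MACHINE_MAP = {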
    "aarch64": "arm64",
    "arm64": "arm64",
    "x86_64": "amd64",
}
BIN_CACHE_DIR = Path(__file__).parent / ".bin_cache"
BIN_MACHINE = MACHINE_MAP[platform.machine()]
BIN_SYSTEM = platform.system()


def _get_fake_gcs_server_cmd(
    machine: str = BIN_MACHINE, system: str = BIN_SYSTEM, version: str = "1.37.0"
) -> sh.Command:
    binpath = BIN_CACHE_DIR / f"fake-gcs-server-{version}-{system}-{machine}"
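    if not binpath.exists():
        # Ensure the cache directory exists before writing the download into it.
        BIN_CACHE_DIR.mkdir(parents=True, exist_ok=True)
        tgz_name = f"{binpath}.tgz"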
        url = f"https://github.com/fsouza/fake-gcs-server/releases/download/v{version}/fake-gcs-server_{version}_{system}_{machine}.tar.gz"
        with requests.get(url, stream=True) as resp:
            if resp.status_code != requests.codes.ok:
                pytest.skip(f"fake-gcs-server for {system} {machine} is not available")
            with open(tgz_name, "wb") as tgz:
                shutil.copyfileobj(resp.raw, tgz)
        sh.tar("-xf", tgz_name, "fake-gcs-server")
        sh.mv("fake-gcs-server", binpath)
        sh.rm(tgz_name)
    return sh.Command(binpath)


@pytest.fixture(scope="session")
def gcs_emulator(port: int = 2784) -> Generator[str, None, None]:
    url = f"http://localhost:{port}"
    fake_gcs_server_proc = _get_fake_gcs_server_cmd()(
        "-backend=memory",
        "-scheme=http",
        f"-external-url={url}",
        f"-port={port}",
        f"-public-host={url}",
        _bg=True,
    )
    try:
        time.sleep(0.25)
        requests.get(url + "/storage/v1/b").raise_for_status()
        with mock.patch.dict(os.environ, {"STORAGE_EMULATOR_HOST": url}):
            yield url
    finally:
        fake_gcs_server_proc.terminate()
        fake_gcs_server_proc.wait()
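Both fake-gcs-server fixtures above use a fixed port (2784) and a fixed `time.sleep(0.25)` before the first request. As a hedged alternative, the sketch below asks the OS for a free port and polls the server until it responds. The helper names `find_free_port` and `wait_until_ready` are hypothetical, not part of this PR or of any library mentioned here, and the sketch uses `urllib` instead of `requests` only to stay within the standard library.

```python
import socket
import time
import urllib.request


def find_free_port(host: str = "localhost") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def wait_until_ready(url: str, timeout: float = 10.0, interval: float = 0.1) -> None:
    """Poll `url` until it returns a 2xx status, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval + 1) as resp:
                if 200 <= resp.status < 300:
                    return
        except OSError:
            # Covers refused connections, timeouts, and HTTP error statuses
            # (urllib's URLError and HTTPError both derive from OSError).
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{url} was not ready within {timeout} seconds")
        time.sleep(interval)
```

In the fixtures, `port` could default to `find_free_port()` and the sleep plus first `requests.get(...)` could become `wait_until_ready(url + "/storage/v1/b")`. Note the port is released before the emulator binds it, so another process could still claim it in between; this narrows the collision window rather than closing it.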

JacobHayes force-pushed the gcs branch 5 times, most recently from 685edd4 to b005093 (March 15, 2022 02:59)
JacobHayes marked this pull request as ready for review (March 15, 2022 03:37)
JacobHayes merged commit 5d94004 into golden (Mar 15, 2022)
JacobHayes deleted the gcs branch (March 15, 2022 03:45)
Development

Successfully merging this pull request may close these issues.

Partitioned GCS Artifact

2 participants