
Run Unity Catalog in a Docker container#116

Merged
haogang merged 49 commits intounitycatalog:mainfrom
jeanboutros:ft_dockerise
Aug 1, 2024
Conversation

@jeanboutros
Contributor

Description of changes

For this PR I took an alternative approach to #18 and #22 and created a Dockerfile with build and start scripts that require minimal intervention and interaction with the codebase.

From the codebase, the only change is the .gitignore, to which I added .DS_Store; this can be helpful in the future for contributors using macOS.

PR #22 is a great start but is perhaps oversimplified.

PR #18 has good thought put into it, but I wanted to stay close to the recommended way of running Unity Catalog as outlined in the project's README. I tried not to fiddle directly with the jars and instead use the provided /bin/start-uc-server script to run the catalog. With this approach the Dockerfile remains focused on building the environment, and any changes to how the environment should run can be made in the future inside the start-uc-server script rather than the Dockerfile.

Rationale of the PR

This pull request introduces a way to run Unity Catalog using Docker containers. It provides a Dockerfile that builds the necessary environment and separate bash scripts for building and starting the catalog. This simplifies the process for users by requiring minimal interaction with the codebase itself. The included README provides detailed instructions on how to use these scripts to build and run the Unity Catalog container.
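As a rough illustration of the approach described above (not the PR's exact Dockerfile), a multi-stage image that builds the project and then defers to the project's own start script might look like the sketch below. The base images, working directory, and port are assumptions; `build/sbt` and `bin/start-uc-server` are the commands the project's README recommends.

```dockerfile
# Hypothetical sketch of the approach -- not this PR's actual Dockerfile.
# Stage 1: build the project (base image and paths are assumptions).
FROM eclipse-temurin:17-jdk AS build
WORKDIR /app
COPY . .
RUN ./build/sbt package

# Stage 2: run via the provided start script, so run-time behaviour
# stays in start-uc-server rather than in the Dockerfile.
FROM eclipse-temurin:17-jre
WORKDIR /app
COPY --from=build /app /app
EXPOSE 8080
CMD ["bin/start-uc-server"]
```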

Note

The README.md contains two API calls that create an external and a managed table.
These API calls are not working yet because they are not supported by the catalog.

Signed-off-by: Jean Boutros [email protected]

@MrPowers
Contributor

MrPowers commented Jul 1, 2024

PR looks cool, but I'm not qualified to give a technical review.

Can you just confirm that this PR moves us in the direction of the ultimate objective of having an image on Dockerhub that users can easily pull to run UC locally?

@jeanboutros
Contributor Author

> PR looks cool, but I'm not qualified to give a technical review.
>
> Can you just confirm that this PR moves us in the direction of the ultimate objective of having an image on Dockerhub that users can easily pull to run UC locally?

@MrPowers I just checked and there isn't an official account for unitycatalog on hub.docker.com.
If you can create one, or whoever owns the unitycatalog GitHub account can, then in a follow-up PR we can create a GitHub Action to automatically push the Docker image to Docker Hub.

That would require some GitHub Actions configuration, such as secrets, to authenticate with Docker Hub and be able to push the image.
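For reference, such a follow-up workflow could look roughly like the sketch below. The workflow name, trigger, secret names, and image tag are all placeholders and not part of this PR; the `docker/login-action` and `docker/build-push-action` steps are the standard way to authenticate with Docker Hub and push an image.

```yaml
# Hypothetical follow-up workflow (not part of this PR); secret names are placeholders.
name: publish-docker-image
on:
  push:
    tags: ["v*"]
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Authenticate with Docker Hub using repository secrets.
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      # Build the image and push it, tagged with the git tag that triggered the run.
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: unitycatalog/unitycatalog:${{ github.ref_name }}
```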

This PR is one step closer to the ultimate objective 😄

@jeanboutros jeanboutros requested a review from Fokko July 2, 2024 16:57
@jeanboutros
Contributor Author

@Fokko Check out how it has become.
I still need to run additional tests, but I think it's getting closer to where you wanted to take this.
One thing about the uber-jar: I still opted to pull the files from GitHub rather than copy them, because Docker cannot COPY from a parent directory outside the build context. However, I added a version argument that can be passed so that the build script can build a specific version such as 0.1.0-PREVIEW. Of course this means we need to start adding tags to the repository, which I will open an issue for soon.
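To illustrate the version argument mentioned above (the ARG name and the tag are assumptions, not this PR's exact interface), the build script can forward a Docker build arg like so:

```shell
# Hypothetical usage sketch; the ARG name is an assumption.
# The Dockerfile would declare a default, e.g.:  ARG VERSION=main
# The build script then overrides it for a specific tagged release:
docker build --build-arg VERSION=0.1.0-PREVIEW -t unitycatalog:0.1.0-PREVIEW .
```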

@Fokko
Contributor

Fokko commented Jul 2, 2024

@jeanboutros I think it makes sense to move the Dockerfile to the parent directory. Instead of an uber-jar, I think #96 is also a good option

…ild and run scripts. Also moved the build and run scripts to the bin folder
@jeanboutros
Contributor Author

@Fokko thanks for the example Dockerfile. I took it and made small changes, such as adding args at the beginning to give the build process more flexibility to override the default values.

I think we're in a good place now. Check the build script and the run scripts.

I still think we need a Docker Compose file, but only once we stabilise the Dockerfiles.

One thing I am concerned about is that I am no longer able to make an API call to the container. Do you think this is related to the way we are building the JAR, or is it related to the flavour of the Docker base image that we have chosen (Alpine)?

@jeanboutros jeanboutros requested a review from Fokko July 2, 2024 23:30
@jeanboutros
Contributor Author

@jeanboutros When testing the catalog the way it is described in the quickstart, the following commands work:

* Your example command:  `uc-cli catalog list`

* The schema listing command from the [quickstart](https://github.com/unitycatalog/unitycatalog/blob/main/docs/quickstart.md): `uc-cli table list --catalog unity --schema default`

However, when running the read table command from the quickstart, `uc-cli table read --full_name unity.default.numbers`, I get the following error:

```
Exception in thread "main" java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.10-5a43f12f-fffd-4406-b20f-c2033f85e925-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.10-5a43f12f-fffd-4406-b20f-c2033f85e925-libsnappyjava.so)
        at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
        at java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(Unknown Source)
        at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(Unknown Source)
        at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(Unknown Source)
        at java.base/java.lang.ClassLoader.loadLibrary(Unknown Source)
        at java.base/java.lang.Runtime.load0(Unknown Source)
        at java.base/java.lang.System.load(Unknown Source)
        at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:182)
        at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:157)
        at org.xerial.snappy.Snappy.init(Snappy.java:70)
        at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
        at org.apache.parquet.hadoop.codec.SnappyDecompressor.decompress(SnappyDecompressor.java:62)
        at org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51)
        at java.base/java.io.DataInputStream.readFully(Unknown Source)
        at java.base/java.io.DataInputStream.readFully(Unknown Source)
        at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:286)
        at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
        at org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
        at org.apache.parquet.column.impl.ColumnReaderBase.readPageV1(ColumnReaderBase.java:680)
        at org.apache.parquet.column.impl.ColumnReaderBase.access$300(ColumnReaderBase.java:57)
        at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:623)
        at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:620)
        at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:120)
        at org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:620)
        at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:594)
        at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:735)
        at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
        at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:47)
        at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
        at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
        at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
        at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
        at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
        at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:141)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
        at org.apache.parquet.hadoop.ParquetRecordReaderWrapper.nextKeyValue(ParquetRecordReaderWrapper.java:28)
        at io.delta.kernel.defaults.internal.parquet.ParquetFileReader$1.hasNext(ParquetFileReader.java:85)
        at io.delta.kernel.defaults.engine.DefaultParquetHandler$1.hasNext(DefaultParquetHandler.java:76)
        at io.delta.kernel.defaults.engine.DefaultParquetHandler$1.hasNext(DefaultParquetHandler.java:87)
        at io.delta.kernel.Scan$1.hasNext(Scan.java:160)
        at io.unitycatalog.cli.delta.DeltaKernelReadUtils.readData(DeltaKernelReadUtils.java:127)
        at io.unitycatalog.cli.delta.DeltaKernelUtils.readDeltaTable(DeltaKernelUtils.java:117)
        at io.unitycatalog.cli.TableCli.readTable(TableCli.java:150)
        at io.unitycatalog.cli.TableCli.handle(TableCli.java:52)
        at io.unitycatalog.cli.UnityCatalogCli.main(UnityCatalogCli.java:82)
```

If I understand it correctly, a library is missing: namely `ld-linux-x86-64.so.2`, which is part of glibc.
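That diagnosis matches a known musl-vs-glibc issue: Alpine-based images ship musl libc, so glibc's dynamic loader `ld-linux-x86-64.so.2`, which snappy-java's bundled native library links against, is absent. Two common mitigations, sketched here as assumptions rather than as this PR's final fix:

```dockerfile
# Option 1: add the gcompat glibc compatibility layer on Alpine.
RUN apk add --no-cache gcompat

# Option 2: switch to a glibc-based base image instead of Alpine, e.g.:
# FROM eclipse-temurin:17-jre-jammy
```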

On it now, sorry it took a while for me to respond.

@jeanboutros
Contributor Author

@farkas93 the error in the CLI is fixed. Can you try now?

@farkas93

> @farkas93 the error in the CLI is fixed. Can you try now?

I will be back from holidays next week Wednesday. Will try then :)

@haogang haogang merged commit 7564c77 into unitycatalog:main Aug 1, 2024
@dennyglee dennyglee added this to the 0.2 milestone Aug 2, 2024
kevinzwang pushed a commit to kevinzwang/unitycatalog that referenced this pull request Oct 10, 2024
Signed-off-by: Jean Boutros <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Denny Lee <[email protected]>
Signed-off-by: Kevin Wang <[email protected]>
@llvll0hsen

llvll0hsen commented Feb 12, 2025

Hey, I just faced the same problem when running `uc-cli table read --full_name unity.default.numbers` on Windows 11. Running the following command and restarting the container fixed the issue:

```
apk update && apk add --no-cache gcompat
```

@jeanboutros
Contributor Author

I think this is related to the compression of parquet files and decompressing them on the fly when reading a table from the catalog.
Good catch!
