ARROW-1780 - JDBC Adapter to convert Relational Data objects to Arrow Data Format Vector Objects#1759
Conversation
…or objects creation.
Used YAML for the test data. Fixed issues in vector creation for CLOB.
Fixed code to handle only one column in select query.
Added new test file
Conflicts:
  java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrow.java
  java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java
  java/adapter/jdbc/src/test/java/org/apache/arrow/adapter/jdbc/JdbcToArrowTest.java
…ush in forked branch
Pull Request created for merging Code Coverage related changes
laurentgo left a comment
You might want to check the checkstyle results: it looks like the way the code is indented is quite different from the rest of the Arrow code base...
public static VectorSchemaRoot sqlToArrow(ResultSet resultSet) throws SQLException, IOException {
    Preconditions.checkNotNull(resultSet, "JDBC ResultSet object can not be null");
    return sqlToArrow(resultSet, Calendar.getInstance());
Timezone/Locale should always be specified (UTC, Locale.ROOT)?
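To make the point concrete: Calendar.getInstance() with no arguments picks up the JVM's default timezone and locale, which makes timestamp conversion machine-dependent. A minimal JDK-only sketch of pinning both explicitly (independent of the adapter code itself):

```java
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

public class UtcCalendarExample {
    public static void main(String[] args) {
        // Calendar.getInstance() uses the JVM default timezone/locale (machine-dependent).
        Calendar implicit = Calendar.getInstance();

        // Pinning timezone and locale makes the conversion deterministic across machines.
        Calendar explicit = Calendar.getInstance(TimeZone.getTimeZone("UTC"), Locale.ROOT);

        System.out.println(explicit.getTimeZone().getID()); // prints "UTC"
    }
}
```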
RootAllocator rootAllocator = new RootAllocator(Integer.MAX_VALUE);
VectorSchemaRoot root = sqlToArrow(resultSet, rootAllocator, calendar);
rootAllocator.close();
if the allocator is closed, I guess it means data is invalidated? You might prefer not to provide this method...
This (rootAllocator.close()) was already removed and probably refers to earlier code, so we should be okay now.
case Types.VARBINARY:
case Types.LONGVARBINARY:
    updateVector((VarBinaryVector) root.getVector(columnName),
        rs.getBytes(i), !rs.wasNull(), rowCount);
}

} catch (Exception e) {
    e.printStackTrace();
The exception should probably be left to propagate if you want the test framework to fail properly?
} catch (Exception e) {
    e.printStackTrace();
} finally {
What about using the try-with-resources pattern?
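For reference, try-with-resources closes the resource automatically even when an exception is thrown, removing the need for an explicit finally block. A minimal JDK-only sketch, where the Resource class is hypothetical and stands in for a JDBC Connection or an Arrow allocator:

```java
public class TryWithResourcesExample {
    // Hypothetical AutoCloseable standing in for a Connection or allocator.
    static class Resource implements AutoCloseable {
        boolean closed = false;
        @Override
        public void close() {
            closed = true; // invoked automatically at the end of the try block
        }
    }

    public static void main(String[] args) {
        Resource captured;
        try (Resource r = new Resource()) {
            captured = r;
            // use the resource here; no explicit finally needed
        }
        System.out.println(captured.closed); // prints "true"
    }
}
```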
java/adapter/jdbc/pom.xml
</dependencies>

<build>
can you check/fix the indentation?
}

private static void updateVector(BitVector bitVector, boolean value, boolean isNonNull, int rowCount) {
    NullableBitHolder holder = new NullableBitHolder();
Is it better to use the holder vs calling bitVector.setSafe(rowCount, isNonNull ? 1 : 0, value ? 1 : 0) directly? (cc @siddharthteotia)
I think we can continue to use holder so as to be consistent with other parts of the code. What do you think?
I think it's fine to use any of the APIs exposed by vectors.
if (read == -1) {
    break;
}
arrowBuf.setBytes(total, new ByteArrayInputStream(bytes, 0, read), read);
I think you don't need to wrap it in a stream; there's a method accepting a byte[] argument directly.
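The same point can be illustrated with JDK types: wrapping a byte[] in a ByteArrayInputStream only to drain it into a buffer is redundant when the buffer offers a direct (byte[], offset, length) overload. This sketch uses java.nio.ByteBuffer purely as a stand-in for ArrowBuf:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class DirectBytesCopyExample {
    public static void main(String[] args) throws IOException {
        byte[] bytes = {1, 2, 3, 4};
        int read = bytes.length;

        // Roundabout: wrap the array in a stream, then drain the stream into the buffer.
        ByteBuffer viaStream = ByteBuffer.allocate(16);
        ByteArrayInputStream in = new ByteArrayInputStream(bytes, 0, read);
        byte[] tmp = new byte[read];
        int n = in.read(tmp);
        viaStream.put(tmp, 0, n);

        // Direct: the buffer already accepts (byte[], offset, length).
        ByteBuffer direct = ByteBuffer.allocate(16);
        direct.put(bytes, 0, read);

        System.out.println(Arrays.equals(viaStream.array(), direct.array())); // prints "true"
    }
}
```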
assertEquals(rowCount, varBinaryVector.getValueCount());

for (int j = 0; j < varBinaryVector.getValueCount(); j++) {
    assertEquals(Arrays.hashCode(values[j]), Arrays.hashCode(varBinaryVector.get(j)));
Why not check for array equality using assertArrayEquals?
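Comparing Arrays.hashCode values is weaker than comparing contents: distinct arrays can share a hash code, so the assertion could pass on unequal data, whereas assertArrayEquals compares element by element and reports the first differing index on failure. A JDK-only demonstration of such a collision:

```java
import java.util.Arrays;

public class HashCodeCollisionExample {
    public static void main(String[] args) {
        // Two different byte arrays with identical Arrays.hashCode values:
        // hash({0, 31}) = 31 * (31 * 1 + 0) + 31 = 992
        // hash({1, 0})  = 31 * (31 * 1 + 1) + 0  = 992
        byte[] a = {0, 31};
        byte[] b = {1, 0};

        System.out.println(Arrays.hashCode(a) == Arrays.hashCode(b)); // prints "true"
        System.out.println(Arrays.equals(a, b));                      // prints "false"
    }
}
```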
binary_field12 BINARY(100), varchar_field13 VARCHAR(256), blob_field14 BLOB, clob_field15 CLOB, char_field16 CHAR(16), bit_field17 BIT);'

data:
  - 'INSERT INTO table1 VALUES (101, 1, 45, 12000, 92233720, 17345667789.23, 56478356785.345, 56478356785.345, PARSEDATETIME(''12:45:35 GMT'', ''HH:mm:ss z''),
I noticed that all the rows are basically the same across all the tests. Is there a specific reason for that (compared to having only one row, for example)?
There is no particular reason for that; it just gives you a larger number of rows.
Committed files with the changes made to address the review comments.
@atuldambalkar @YashpalThakur @yashpal is this no longer WIP? If so, can you update the PR title? I think this needs a last look from @siddharthteotia before merging.

@wesm We are pretty much done with the code review changes and are now waiting for final comments or a PR merge from @laurentgo and @siddharthteotia.

Removed the [WIP] from the title.
Double[] arr = new Double[values.length];
int i = 0;
for (String str : values) {
    arr[i++] = Double.parseDouble(str);
I don't think this indentation is correct and consistent with what is followed in the rest of the codebase.
Can you please do checkstyle validation on all files?
LGTM. But I think we need to fix the indentation that has been followed in this patch; it seems different from the rest of the Java code base in Arrow. @atuldambalkar, can you please run checkstyle validation? I was under the impression that it was part of the travis-ci build.

Thanks, @siddharthteotia, for your comments. I have fixed the indentation for all the JDBC adapter code files based on the checkstyle warnings. Please take a look.

@siddharthteotia it may have been better to use the merge tool for this patch since there were multiple authors (at least 2) -- we can do some squashing ourselves to try to preserve at least the authorship (though in this case it would be a lot of work). GitHub doesn't handle multiple authors in the merge UI as far as I can tell. I think this is an exceptional case; if something like this happens again we should either clean up the commits before merging, or use the GitHub UI and write the authors' names in the commit message.
… Data Format Vector Objects (apache#1759)

This patch adds JDBC adapter support for Arrow.
This enhancement converts JDBC ResultSet relational objects to Arrow columnar data Vector objects. The code is under the directory "java/adapter/jdbc/src/main".
The API has the following static methods in the class org.apache.arrow.adapter.jdbc.JdbcToArrow:
public static VectorSchemaRoot sqlToArrow(Connection connection, String query)
public static ArrowDataFetcher jdbcArrowDataFetcher(Connection connection, String tableName)
The utility uses the following data mapping to convert JDBC/SQL data types to Arrow data types:
CHAR --> ArrowType.Utf8
NCHAR --> ArrowType.Utf8
VARCHAR --> ArrowType.Utf8
NVARCHAR --> ArrowType.Utf8
LONGVARCHAR --> ArrowType.Utf8
LONGNVARCHAR --> ArrowType.Utf8
NUMERIC --> ArrowType.Decimal(precision, scale)
DECIMAL --> ArrowType.Decimal(precision, scale)
BIT --> ArrowType.Bool
TINYINT --> ArrowType.Int(8, signed)
SMALLINT --> ArrowType.Int(16, signed)
INTEGER --> ArrowType.Int(32, signed)
BIGINT --> ArrowType.Int(64, signed)
REAL --> ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)
FLOAT --> ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)
DOUBLE --> ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
BINARY --> ArrowType.Binary
VARBINARY --> ArrowType.Binary
LONGVARBINARY --> ArrowType.Binary
DATE --> ArrowType.Date(DateUnit.MILLISECOND)
TIME --> ArrowType.Time(TimeUnit.MILLISECOND, 32)
TIMESTAMP --> ArrowType.Timestamp(TimeUnit.MILLISECOND, timezone=null)
CLOB --> ArrowType.Utf8
BLOB --> ArrowType.Binary
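The mapping above can be sketched as a switch over java.sql.Types constants. This is a hedged, self-contained illustration that returns descriptive strings instead of actual ArrowType objects so it runs without the Arrow dependency (the real adapter code builds ArrowType instances; the method name here is hypothetical):

```java
import java.sql.Types;

public class JdbcTypeMappingSketch {
    // Illustrative only: describes the Arrow type the adapter maps each JDBC type to.
    static String toArrowType(int jdbcType) {
        switch (jdbcType) {
            case Types.CHAR:
            case Types.NCHAR:
            case Types.VARCHAR:
            case Types.NVARCHAR:
            case Types.LONGVARCHAR:
            case Types.LONGNVARCHAR:
            case Types.CLOB:
                return "ArrowType.Utf8";
            case Types.BIT:
                return "ArrowType.Bool";
            case Types.TINYINT:
                return "ArrowType.Int(8, signed)";
            case Types.SMALLINT:
                return "ArrowType.Int(16, signed)";
            case Types.INTEGER:
                return "ArrowType.Int(32, signed)";
            case Types.BIGINT:
                return "ArrowType.Int(64, signed)";
            case Types.REAL:
            case Types.FLOAT:
                return "ArrowType.FloatingPoint(SINGLE)";
            case Types.DOUBLE:
                return "ArrowType.FloatingPoint(DOUBLE)";
            case Types.BINARY:
            case Types.VARBINARY:
            case Types.LONGVARBINARY:
            case Types.BLOB:
                return "ArrowType.Binary";
            default:
                throw new UnsupportedOperationException("Unmapped JDBC type: " + jdbcType);
        }
    }

    public static void main(String[] args) {
        System.out.println(toArrowType(Types.INTEGER)); // prints "ArrowType.Int(32, signed)"
        System.out.println(toArrowType(Types.CLOB));    // prints "ArrowType.Utf8"
    }
}
```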
JUnit test cases are under java/adapter/jdbc/src/test. The test cases use an H2 in-memory database.
I am still working on adding and automating additional test cases.