ZEPPELIN-1515. Notebook: HDFS as a backend storage (Use hadoop client jar) #2455
zjffdu wants to merge 1 commit into apache:master
Conversation
@Leemoonsoo @felixcheung @khalidhuseynov Could you help review it? Thanks
Hi, recently I've found that some versions of hadoop-related libraries crash the Jetty used in the Zeppelin server. I've also changed the YARN cluster manager to use another classloader. What do you think about using another classloader when running the hdfs notebook storage? It would mean more changes on the Zeppelin server side, but it would be better for the future.
@jongyoul What kind of issue do you see in the cluster manager PR? Usually this is due to a jar version conflict; you can exclude the conflicting transitive jar from the hadoop client library.
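For the jar-conflict case mentioned here, the usual fix is a Maven exclusion on hadoop-client. The snippet below only illustrates the mechanism; the servlet-api artifact is a guess at the kind of transitive jar that could clash with Zeppelin's Jetty, not a confirmed culprit.

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <!-- illustrative only: drop a transitive jar that conflicts with the
         servlet/Jetty version bundled in the Zeppelin server -->
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```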
conf/zeppelin-site.xml.template

<property>
  <name>zeppelin.notebook.storage</name>
  <value>org.apache.zeppelin.notebook.repo.HdfsNotebookRepo</value>
  <description>local notebook persistence layer implementation</description>
</property>
local notebook persistence layer -> hdfs notebook persistence layer ?
docs/setup/storage/storage.md

## Notebook Storage in Hdfs repository <a name="Hdfs"></a>

Notes may be stored in hdfs, so that multiple zeppelin instance can share the same notes. It supported all the versions of hadoop 2.x. If you use `HdfsNotebookRepo`, then `zeppelin.notebook.dir` is the path on hdfs. And you need to specify `HADOOP_CONF_DIR` in `zeppelin-env.sh` so that zeppelin can find the right hadoop configuration files.
capital Zeppelin
Notes may be stored in hdfs, so that multiple zeppelin instance can share the same notes.
- if this is the intended goal, will there be any race condition on write when several instances are changing it at the same time?
Right, but I think it is supposed to be handled in a base class.
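A minimal sketch of what "handled in a base class" could mean: a common repo base that serializes saves within one JVM. This is illustrative only, not the actual Zeppelin NotebookRepo hierarchy, and it would not coordinate writes across separate Zeppelin instances; cross-instance saves on the shared hdfs path remain last-writer-wins.

```java
// Hypothetical sketch, not the real Zeppelin NotebookRepo API.
abstract class SerializedSaveRepo {
  private final Object saveLock = new Object();

  // storage-specific write, e.g. against HDFS
  protected abstract void doSave(String noteId, String json);

  // the base class serializes saves inside one JVM; concurrent Zeppelin
  // instances would still race on the shared storage path
  public final void save(String noteId, String json) {
    synchronized (saveLock) {
      doSave(noteId, json);
    }
  }
}
```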
zeppelin-zengine/pom.xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.6.0</version>
</dependency>
use a property, hadoop.version, instead of hardcoding 2.6.0
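The suggestion would look like this in the pom: declare the version once as a property and reference it from the dependency. The property name hadoop.version comes from the review comment; the default value just mirrors the hardcoded 2.6.0.

```xml
<properties>
  <hadoop.version>2.6.0</hadoop.version>
</properties>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>
```

Builders can then override the version on the command line (e.g. `-Dhadoop.version=2.7.0`) without editing the pom.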
InputStream in = new ByteArrayInputStream(note.toJson().getBytes("UTF-8"));
IOUtils.copyBytes(in, fs.create(tmpNotePath), hadoopConf);
Path notePath = new Path(notebookDir.toString() + "/" + note.getId() + "/note.json");
fs.delete(notePath);
should it overwrite without warning like this?
I am afraid this is all we can do for now. Each time we save a note, it overwrites the previous one. And actually you don't know whether the existing file is the previous note or some other file.
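The save sequence under discussion (write to a temp path, then replace note.json) can be sketched with a local-filesystem analogue; java.nio.file stands in for the Hadoop FileSystem API here, and the class and path names are invented for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Local-filesystem analogue of the HdfsNotebookRepo save path: write the
// json to a temp file first, then swap it in, so a crash mid-write never
// leaves a truncated note.json behind.
class NoteSaveSketch {

  static Path save(Path notebookDir, String noteId, String json) throws IOException {
    Path noteFolder = notebookDir.resolve(noteId);
    Files.createDirectories(noteFolder);
    Path tmpNotePath = noteFolder.resolve(".note.json.tmp");
    Path notePath = noteFolder.resolve("note.json");
    // 1. write the new content to the temp path
    Files.write(tmpNotePath, json.getBytes(StandardCharsets.UTF_8));
    // 2. replace the old file; on HDFS this is a delete + rename pair,
    //    since HDFS rename refuses to overwrite an existing path
    Files.move(tmpNotePath, notePath, StandardCopyOption.REPLACE_EXISTING);
    return notePath;
  }
}
```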
docs/setup/storage/storage.md
<property>
  <name>zeppelin.notebook.storage</name>
  <value>org.apache.zeppelin.notebook.repo.HdfsNotebookRepo</value>
  <description>notebook persistence layer implementation</description>
</property>
I guess it should be changed to the same description as above, with hdfs
Force-pushed e31a803 to 960d9f3
if (fs.exists(tmpNotePath)) {
  fs.delete(tmpNotePath, true);
}
InputStream in = new ByteArrayInputStream(note.toJson().getBytes("UTF-8"));
I think we would need zConf.ZEPPELIN_ENCODING instead of hard-coding UTF-8 in all these encoding cases
public List<NoteInfo> list(AuthenticationInfo subject) throws IOException {
  List<NoteInfo> noteInfos = new ArrayList<>();
  for (FileStatus status : fs.globStatus(new Path(notebookDir, "*/note.json"))) {
    NoteInfo noteInfo = new NoteInfo(status.getPath().getParent().getName(), "", null);
Shouldn't we create a Note object first and then new NoteInfo(note)? Otherwise it would be missing name and config
This is what I'm confused about. I don't know the purpose of NoteInfo; previously I thought NoteInfo represents the basic metadata of a Note, not including the note content. And from my testing, this PR works well. Besides, I created ZEPPELIN-2712 for the issue that we read the Note twice; the root cause is that we construct NoteInfo from Note.
Good point. As I checked, only the id field of NoteInfo is used to retrieve the whole note, which would lead to a duplicate get operation. From another point of view, if there were no special folder structure like noteId/note.json, it would make more sense to retrieve List<Note> in the list() method. So the approach used here looks good for now.
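The listing convention discussed here (every child folder holding a note.json is one note, and the folder name is the note id, so list() never has to parse the json) can be sketched with a local-filesystem analogue; names are invented for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Analogue of fs.globStatus(new Path(notebookDir, "*/note.json")): collect
// the parent folder names as note ids without reading any note content.
class NoteListSketch {

  static List<String> listNoteIds(Path notebookDir) throws IOException {
    List<String> ids = new ArrayList<>();
    try (DirectoryStream<Path> children = Files.newDirectoryStream(notebookDir)) {
      for (Path child : children) {
        // only folders that actually contain a note.json count as notes
        if (Files.isRegularFile(child.resolve("note.json"))) {
          ids.add(child.getFileName().toString());
        }
      }
    }
    return ids;
  }
}
```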
@zjffdu Most conflicts will be removed by excluding jars, but how do we guarantee those libraries are not related to my features? I'm just curious.
@jongyoul Yeah, since your PR is also using the hadoop jar, I think we need to do 2 things:
Let me know your concerns, thanks.
@zjffdu I think the zeppelin.sh file should be updated to include HADOOP_CONF_DIR in the classpath
Thanks for the review. @hayssams I did that in zeppelin-daemon.sh.
Yes, but sometimes we need to launch Zeppelin through zeppelin.sh as well, e.g. in Mesos via Marathon
@hayssams Any reason why
@zjffdu Yes, when you want to start Zeppelin not as a daemon.
Also add
Force-pushed 54bb18e to 7059244
I also verified the kerberized hdfs manually on my local box. @jongyoul @Leemoonsoo @felixcheung @prabhjyotsingh @hayssams Please help review it, thanks
public <T> T callHdfsOperation(final HdfsOperation<T> func) throws IOException {
  if (isSecurityEnabled) {
    UserGroupInformation.getLoginUser().reloginFromKeytab();
re-login on every operation?
reloginFromKeytab() will check whether the ticket has expired; if not, it won't re-login
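The cheap-guard behavior described here (call it before every operation, pay for a real login only when the credential is stale) can be sketched generically. The names and explicit TTL below are invented for illustration; Hadoop's UserGroupInformation applies its own renewal window internally.

```java
// Illustrative expiry-guarded refresh, mimicking why calling
// reloginFromKeytab() per operation is inexpensive: the costly step
// (a KDC round trip in the real case) only happens after expiry.
class ExpiryGuardedLogin {
  private final long ttlMillis;
  private long expiresAt = 0L;  // 0 forces a login on first use
  private int loginCount = 0;   // counts the expensive refreshes

  ExpiryGuardedLogin(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  // safe to call before every operation; only refreshes when stale
  void reloginIfExpired(long nowMillis) {
    if (nowMillis >= expiresAt) {
      loginCount++;  // stand-in for the real re-login work
      expiresAt = nowMillis + ttlMillis;
    }
  }

  int loginCount() {
    return loginCount;
  }
}
```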
Tested locally, works as expected; tried both environments, with and without Kerberos.
Thanks for the review @prabhjyotsingh @hayssams, will merge it if there are no more comments
public Void call() throws IOException {
  Path noteFolder = new Path(notebookDir.toString() + "/" + noteId);
  fs.delete(noteFolder, true);
  return null;
there's no return value from this method as declared in the signature?
Right, there's nothing that needs to be returned from this method.
What is this PR for?
This PR adds hdfs as another implementation of NotebookRepo. There's another PR about using webhdfs to implement it, but the hdfs client library is actually wire-compatible across major versions (see http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility); with webhdfs the code would become more complicated and might lose some hdfs features. This PR is also required for HA of Zeppelin, so that multiple Zeppelin instances can share notes via hdfs. I add hadoop-client to the pom file, so Zeppelin will package the hadoop client jar into its binary distribution. This is because Zeppelin may be installed on a gateway machine where no hadoop is installed (only the hadoop configuration files exist on that machine). And since the hadoop client works with multiple versions of hadoop, it is fine to package it into the binary distribution; Spark also packages the hadoop client jar in its binary distribution.
What type of PR is it?
[Feature]
Todos
* [ ] - Task
What is the Jira issue?
* https://issues.apache.org/jira/browse/ZEPPELIN-1515
How should this be tested?
Unit test is added. Also manually verified in a single-node cluster.
Screenshots (if appropriate)
Questions:
* Do the license files need updating? No
* Are there breaking changes for older versions? No
* Does this need documentation? No

Author: Jeff Zhang <[email protected]>

Closes apache#2455 from zjffdu/ZEPPELIN-1515 and squashes the following commits:

b3e83ab [Jeff Zhang] ZEPPELIN-1515. Notebook: HDFS as a backend storage (Read & Write Mode)

(cherry picked from commit 30bfcae)
Change-Id: Iecfbd1b5d2f1cb6b3c167906091e2c90524eb787