
ZEPPELIN-1515. Notebook: HDFS as a backend storage (Use hadoop client jar)#2455

Closed
zjffdu wants to merge 1 commit into apache:master from zjffdu:ZEPPELIN-1515

Conversation

@zjffdu
Contributor

@zjffdu zjffdu commented Jun 30, 2017

What is this PR for?

This PR is trying to add hdfs as another implementation of NotebookRepo. There's another PR about using webhdfs to implement that. Actually, the hdfs client library is compatible across major versions (see http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility); if we used webhdfs, the code would become more complicated and we might lose some features of hdfs.

This PR is also required for HA of zeppelin, so that multiple zeppelin instances can share notes via hdfs. I add hadoop-client in the pom file, so zeppelin will package the hadoop client jar into its binary distribution. This is because zeppelin may be installed on a gateway machine where no hadoop is installed (only the hadoop configuration files exist on this machine). And since the hadoop client works with multiple versions of hadoop, it is fine to package it into the binary distribution. Spark also packages the hadoop client jar in its binary distribution.
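To make the setup concrete, here is a minimal configuration sketch, assuming the property name and class shown later in this review (`zeppelin.notebook.storage`, `org.apache.zeppelin.notebook.repo.HdfsNotebookRepo`) plus a `HADOOP_CONF_DIR` export in `zeppelin-env.sh`; exact file locations may differ per install:

```xml
<!-- zeppelin-site.xml: switch the notebook storage to the HDFS-backed repo.
     With this PR, zeppelin.notebook.dir then becomes a path on HDFS. -->
<property>
  <name>zeppelin.notebook.storage</name>
  <value>org.apache.zeppelin.notebook.repo.HdfsNotebookRepo</value>
  <description>hdfs notebook persistence layer implementation</description>
</property>
```

In addition, `HADOOP_CONF_DIR` must be set in `zeppelin-env.sh` (e.g. `export HADOOP_CONF_DIR=/etc/hadoop/conf`) so the repo can find `core-site.xml` and `hdfs-site.xml`.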

What type of PR is it?

[Feature]

Todos

  • - Task

What is the Jira issue?

  • https://issues.apache.org/jira/browse/ZEPPELIN-1515
How should this be tested?

Unit test is added. Also manually verify it in a single node cluster.

Screenshots (if appropriate)

Questions:

  • Do the license files need updating? No
  • Are there breaking changes for older versions? No
  • Does this need documentation? No

@zjffdu
Contributor Author

zjffdu commented Jun 30, 2017

@Leemoonsoo @felixcheung @khalidhuseynov Could you help review it? Thanks

@jongyoul
Member

Hi, recently I've found that some versions of hadoop-related libraries crash jetty, which is used in the Zeppelin server. I've also changed the yarn cluster manager to use another classloader. What do you think about using another classloader while running the hdfs notebook storage? It would mean more changes on the Zeppelin server side, but it will be better for the future.

@zjffdu
Contributor Author

zjffdu commented Jun 30, 2017

@jongyoul What kind of issue do you see in the cluster manager PR? Usually this is due to a jar version conflict. You can exclude the conflicting transitive jar of the hadoop client library.
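As a hedged sketch of the exclusion approach suggested above (the `javax.servlet` artifact here is purely illustrative of a jar that can clash with Zeppelin's jetty; the real culprit would depend on the actual conflict, and `hadoop.version` assumes the property the review asks for instead of a hardcoded 2.6.0):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <!-- illustrative: drop a transitive jar that conflicts with Zeppelin's own server stack -->
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```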

<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.HdfsNotebookRepo</value>
<description>local notebook persistence layer implementation</description>
Member


local notebook persistence layer -> hdfs notebook persistence layer ?


## Notebook Storage in Hdfs repository <a name="Hdfs"></a>

Notes may be stored in hdfs, so that multiple zeppelin instance can share the same notes. It supported all the versions of hadoop 2.x. If you use `HdfsNotebookRepo`, then `zeppelin.notebook.dir` is the path on hdfs. And you need to specify `HADOOP_CONF_DIR` in `zeppelin-env.sh` so that zeppelin can find the right hadoop configuration files.
Member


capital Zeppelin

Notes may be stored in hdfs, so that multiple zeppelin instance can share the same notes.

  • if this is the intended goal, will there be any race condition on write when several instances are changing it at the same time?

Contributor Author


Right, but I think it is supposed to be handled in a base class.

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
Member


use a property, hadoop.version instead of hardcoding to 2.6

InputStream in = new ByteArrayInputStream(note.toJson().getBytes("UTF-8"));
IOUtils.copyBytes(in, fs.create(tmpNotePath), hadoopConf);
Path notePath = new Path(notebookDir.toString() + "/" + note.getId() + "/note.json");
fs.delete(notePath);
Member


should it overwrite without warning like this?

Contributor Author


I am afraid this is all we can do for now. Each time we save a note, it overwrites the previous one. And actually you don't know whether the existing file is the previous note or some other file.
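The save sequence under discussion (write to a temporary path, then replace `note.json`) can be sketched as follows. This is not the PR's actual code: `java.nio` stands in for Hadoop's `FileSystem` API (`fs.create`, `fs.delete`, `fs.rename`), and the class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the tmp-file-then-replace save pattern discussed above.
public class SafeNoteSave {
  public static void save(Path noteDir, String noteJson) throws IOException {
    Files.createDirectories(noteDir);
    Path tmp = noteDir.resolve("note.json.tmp");
    Path dest = noteDir.resolve("note.json");
    // Write to the tmp path first, so a crash mid-write never corrupts note.json.
    Files.write(tmp, noteJson.getBytes(StandardCharsets.UTF_8));
    // Replace the previous copy in one step; each save overwrites the last one.
    Files.move(tmp, dest, StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("note-demo");
    save(dir, "{\"id\":\"demo\"}");
    save(dir, "{\"id\":\"demo\",\"rev\":2}"); // second save silently overwrites
    System.out.println(new String(
        Files.readAllBytes(dir.resolve("note.json")), StandardCharsets.UTF_8));
  }
}
```

As the review notes, the overwrite happens without warning; conflict detection would have to live in a base class or a higher layer.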


## Notebook Storage in Hdfs repository <a name="Hdfs"></a>

Notes may be stored in hdfs, so that multiple zeppelin instance can share the same notes. It supported all the versions of hadoop 2.x. If you use `HdfsNotebookRepo`, then `zeppelin.notebook.dir` is the path on hdfs. And you need to specify `HADOOP_CONF_DIR` in `zeppelin-env.sh` so that zeppelin can find the right hadoop configuration files.
Member


minor: It supports?

<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.HdfsNotebookRepo</value>
<description>notebook persistence layer implementation</description>
Member


I guess it should be changed to the same description as above, with hdfs

@zjffdu zjffdu force-pushed the ZEPPELIN-1515 branch 2 times, most recently from e31a803 to 960d9f3 Compare July 1, 2017 12:32
if (fs.exists(tmpNotePath)) {
fs.delete(tmpNotePath, true);
}
InputStream in = new ByteArrayInputStream(note.toJson().getBytes("UTF-8"));
Member

@khalidhuseynov khalidhuseynov Jul 1, 2017


I think we would need zConf.ZEPPELIN_ENCODING instead of hard-coding UTF-8 in all these encoding cases

public List<NoteInfo> list(AuthenticationInfo subject) throws IOException {
List<NoteInfo> noteInfos = new ArrayList<>();
for (FileStatus status : fs.globStatus(new Path(notebookDir, "*/note.json"))) {
NoteInfo noteInfo = new NoteInfo(status.getPath().getParent().getName(), "", null);
Member


shouldn't we create Note object first and then new NoteInfo(note)? otherwise it would be missing name and config

Contributor Author


This is what confuses me. I don't know the purpose of NoteInfo; previously I thought NoteInfo represents the basic metadata of a Note, not including the note content. And from my testing, this PR works well. Besides, I created ZEPPELIN-2712 because we would otherwise read each Note twice; the root cause is that we construct NoteInfo from Note.

Member

@khalidhuseynov khalidhuseynov Jul 1, 2017


Good point. As I checked, only the id field of NoteInfo is used to retrieve the whole note, which would lead to a duplicate get operation. From another point of view, if there were no special folder structure like noteId/note.json, it would make more sense to retrieve List<Note> in the list() method. So the approach used here looks good for now.
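The `list()` logic being reviewed (glob `*/note.json` under the notebook dir and keep only the parent folder name as the note id) can be sketched like this. `java.nio` stands in for Hadoop's `fs.globStatus(new Path(notebookDir, "*/note.json"))`, and the class name is illustrative:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: derive note ids from the noteId/note.json folder layout,
// without opening any note.json (so name/config metadata stays unread).
public class ListNoteIds {
  public static List<String> list(Path notebookDir) throws IOException {
    List<String> ids = new ArrayList<>();
    try (DirectoryStream<Path> dirs = Files.newDirectoryStream(notebookDir)) {
      for (Path d : dirs) {
        if (Files.isRegularFile(d.resolve("note.json"))) {
          ids.add(d.getFileName().toString()); // note id == folder name
        }
      }
    }
    Collections.sort(ids);
    return ids;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("notebook");
    for (String id : new String[]{"2A94M5J1Z", "2B11QXYZW"}) {
      Files.createDirectories(root.resolve(id));
      Files.write(root.resolve(id).resolve("note.json"), "{}".getBytes());
    }
    Files.createDirectories(root.resolve("empty")); // no note.json -> skipped
    System.out.println(list(root));
  }
}
```

This mirrors the trade-off in the thread: cheap listing, at the cost of NoteInfo missing name and config until the full note is fetched.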

@zjffdu zjffdu changed the title ZEPPELIN-1515. Notebook: HDFS as a backend storage (Read & Write Mode) ZEPPELIN-1515. Notebook: HDFS as a backend storage (Use hadoop client jar) Jul 2, 2017
@jongyoul
Member

jongyoul commented Jul 6, 2017

@zjffdu Most conflicts will be removed by excluding jars, but how do we guarantee those libraries don't interfere with my features? I'm just curious.

@zjffdu
Contributor Author

zjffdu commented Jul 6, 2017

@jongyoul Yeah, since your PR is also using the hadoop jar, I think we need to do 2 things:

  • Use the same hadoop jar version
  • Decide whether to include it in the distribution or not. For this, I personally prefer to include the hadoop jar in the distribution, for several reasons:
    • It is much easier to implement; we don't need to find the jars at runtime ourselves.
    • Zeppelin might be installed on a gateway machine where hadoop is not installed; in this case, zeppelin would not work because it could not find the hadoop jars.
    • Spark puts the hadoop jar in its distribution and proves this works, so I think we can trust this approach.

Let me know your concern, thanks.

@zjffdu zjffdu closed this Jul 17, 2017
@zjffdu zjffdu reopened this Jul 17, 2017
@hayssams
Contributor

@zjffdu I think that the zeppelin.sh file should be updated with the HADOOP_CONF_DIR in the classpath

@zjffdu
Contributor Author

zjffdu commented Aug 1, 2017

Thanks for the review, @hayssams. I did that in zeppelin-daemon.sh.

@hayssams
Contributor

hayssams commented Aug 1, 2017

Yes, but sometimes we need to launch Zeppelin through zeppelin.sh as well, e.g. in Mesos via Marathon

@zjffdu
Contributor Author

zjffdu commented Aug 2, 2017

@hayssams Any reason why zeppelin-daemon.sh doesn't fit for you? Zeppelin does not guarantee that zeppelin.sh can launch the zeppelin server properly. zeppelin-daemon.sh not only adds HADOOP_CONF_DIR but also other libraries; if you don't use zeppelin-daemon.sh, zeppelin-server may fail to launch.

@hayssams
Contributor

hayssams commented Aug 2, 2017

@zjffdu Yes, when you want to start zeppelin not as a daemon.
Please also note the docker image relies on zeppelin.sh

@zjffdu
Contributor Author

zjffdu commented Aug 4, 2017

Also added HADOOP_CONF_DIR in zeppelin.sh, but it looks like there's some code duplication between zeppelin.sh and zeppelin-daemon.sh; we need a follow-up PR to remove this duplication.
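A minimal sketch of what the launch scripts do with `HADOOP_CONF_DIR` (simplified for illustration; the variable names and the `/tmp` fallback are assumptions for the sketch, not the scripts' actual code):

```shell
# If HADOOP_CONF_DIR is set and exists, append it to the server classpath
# so HdfsNotebookRepo can pick up core-site.xml / hdfs-site.xml.
ZEPPELIN_CLASSPATH="zeppelin-server.jar"
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/tmp}"   # illustrative fallback so the sketch runs anywhere
if [ -n "${HADOOP_CONF_DIR}" ] && [ -d "${HADOOP_CONF_DIR}" ]; then
  ZEPPELIN_CLASSPATH="${ZEPPELIN_CLASSPATH}:${HADOOP_CONF_DIR}"
fi
echo "${ZEPPELIN_CLASSPATH}"
```

The same snippet living in both zeppelin.sh and zeppelin-daemon.sh is exactly the duplication the comment above wants to factor out.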

@zjffdu
Contributor Author

zjffdu commented Aug 21, 2017

I also verified kerberized hdfs manually on my local box. @jongyoul @Leemoonsoo @felixcheung @prabhjyotsingh @hayssams Please help review it, thanks


public <T> T callHdfsOperation(final HdfsOperation<T> func) throws IOException {
if (isSecurityEnabled) {
UserGroupInformation.getLoginUser().reloginFromKeytab();
Contributor


re login on every operation?


@prabhjyotsingh
Contributor

Tested on local, works as expected, have tried on both environments, with and without Kerberos.
LGTM, except a minor comment.
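The `callHdfsOperation` wrapper reviewed above can be sketched like this. The Kerberos relogin is stubbed with a counter; the real code calls Hadoop's `UserGroupInformation.getLoginUser().reloginFromKeytab()` (a `checkTGTAndReloginFromKeytab()`-style call would skip the relogin while the ticket is still valid, which is one answer to the "relogin on every operation?" question). Class and field names here are illustrative:

```java
import java.io.IOException;

// Sketch: funnel every HDFS call through one method that refreshes the
// Kerberos login first when security is enabled.
public class HdfsCaller {
  interface HdfsOperation<T> { T call() throws IOException; }

  private final boolean isSecurityEnabled;
  int relogins = 0; // exposed only so the demo can observe the stub

  HdfsCaller(boolean secure) { this.isSecurityEnabled = secure; }

  <T> T callHdfsOperation(HdfsOperation<T> func) throws IOException {
    if (isSecurityEnabled) {
      relogins++; // stand-in for UserGroupInformation relogin
    }
    return func.call();
  }

  public static void main(String[] args) throws IOException {
    HdfsCaller caller = new HdfsCaller(true);
    String result = caller.callHdfsOperation(() -> "note.json saved");
    System.out.println(result + ", relogins=" + caller.relogins);
  }
}
```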

@zjffdu zjffdu closed this Aug 21, 2017
@zjffdu zjffdu reopened this Aug 21, 2017
@zjffdu
Contributor Author

zjffdu commented Aug 22, 2017

Thanks for the review @prabhjyotsingh @hayssams, will merge it if there are no more comments

public Void call() throws IOException {
Path noteFolder = new Path(notebookDir.toString() + "/" + noteId);
fs.delete(noteFolder, true);
return null;
Member


there's no return value from this method as declared in the signature?

Contributor Author


Right, there's nothing that needs to be returned from this method.

@asfgit asfgit closed this in 30bfcae Aug 24, 2017
prabhjyotsingh pushed a commit to prabhjyotsingh/zeppelin that referenced this pull request Oct 23, 2017
… jar)

This PR is trying to add hdfs as another implementation of `NotebookRepo`. There's another PR about using webhdfs to implement that. Actually, the hdfs client library is compatible across major versions (see http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility); if we used webhdfs, the code would become more complicated and we might lose some features of hdfs.

This PR is also required for HA of zeppelin, so that multiple zeppelin instances can share notes via hdfs. I add hadoop-client in the pom file, so zeppelin will package the hadoop client jar into its binary distribution. This is because zeppelin may be installed on a gateway machine where no hadoop is installed (only the hadoop configuration files exist on this machine). And since the hadoop client works with multiple versions of hadoop, it is fine to package it into the binary distribution. Spark also packages the hadoop client jar in its binary distribution.

[Feature]

* [ ] - Task

* https://issues.apache.org/jira/browse/ZEPPELIN-1515

Unit test is added.  Also manually verify it in a single node cluster.

* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No

Author: Jeff Zhang <[email protected]>

Closes apache#2455 from zjffdu/ZEPPELIN-1515 and squashes the following commits:

b3e83ab [Jeff Zhang] ZEPPELIN-1515. Notebook: HDFS as a backend storage (Read & Write Mode)

(cherry picked from commit 30bfcae)

Change-Id: Iecfbd1b5d2f1cb6b3c167906091e2c90524eb787


6 participants