Conference Paper · December 2016
A Study On Distributed File Systems
A STUDY ON DISTRIBUTED FILE SYSTEMS: An Example of NFS, Ceph, Hadoop

1MAHMUT UNVER, 2ATILLA ERGUZEN
1,2Computer Engineering, Kırıkkale University, Turkey
Email: [email protected], [email protected]

Abstract: Distributed File Systems (DFS) are systems that share disks, storage areas and other resources across both local and wide area networks. Today, DFSs make it possible to operate on big data and to perform large-scale computations and transactions. DFSs are classified according to their working logic and architecture; the classifications in this study are based on fault tolerance, replication, naming, synchronization and design purpose. First, the general design of DFSs is examined. Then the advantages and disadvantages of Ceph, Hadoop and the Network File System (NFS), three commonly used systems, are discussed.

Keywords: Distributed file system, Network File System (NFS), Hadoop, Ceph, fault tolerance, synchronization, replication, naming, operating system.
1. INTRODUCTION

Computer systems have undergone major evolutions. The first was the development of powerful microprocessors in the 1980s, moving from 8-bit to 64-bit processing. These computers were as strong as mainframe computers, while their instruction-processing costs were low. The second evolution was the widespread use of high-speed local networks with large numbers of nodes, which made it possible to transfer a gigabit of data per second. As a result of these developments, distributed systems built from multiple computers connected by high-speed networks appeared, rather than a single powerful computer with one processor [1].

The first DFSs were developed in the 1970s. They were storage systems connected through FTP-like structures, and they were not widely used because of their limited storage space. L. Svoboda reported the first study on DFSs [2], covering several DFSs of that period such as LOCUS, ACORN, SWALLOW and XDFS. Studies on DFSs have continued ever since. Today's DFSs are generally designed analogously to classical time-sharing systems and are usually based on the UNIX file system. Their purpose is to combine the files and storage systems of different computers [3].

DFSs process data generated in different forms on digital platforms, and they do so safely, efficiently and rapidly. The rapid growth of data and the need for rapid access to it have caused data storage resources to grow; this large increase in data created a new concept, Big Data. Distributed file systems are used to process big data and to perform operations on it quickly, and they are now used effectively by cloud systems. A DFS file is stored on one or more computers, each of which is a server, and computers called clients access those files as if they were a single file [4].

DFSs have been designed for different goals. For example, the purpose of the Andrew File System (AFS) is a DFS that can support up to 5000 clients [5]. The Network File System (NFS) uses the Remote Procedure Call (RPC) communication model. RPC creates an intermediate layer between server and client: the client performs operations without knowing the server's file system, which allows clients and servers with different file systems to work together smoothly [6]. The purpose of the Google File System (GFS) is to work with big data, which it achieves by using a large amount of low-cost equipment. Another DFS with a very different structure is XFS, which keeps very large files stable. XFS does not have a dedicated server; the entire file system is distributed over the clients. The Ceph DFS separates the data from the metadata that holds information about the data; it replicates data and so increases the system's fault tolerance.

In this study, DFSs are compared according to specific classifications. The introduction gives general information about DFSs. The second part describes the general architectural structures of DFSs and explains the basic concepts. The third chapter determines and explains the classification criteria to be compared. In the fourth chapter, currently active DFSs are described according to the criteria specified in the third chapter. The last part presents the results and comparisons.

2. GENERAL STRUCTURE OF DISTRIBUTED FILE SYSTEMS

The overall design goal of DFSs is to use fewer local hardware resources by sharing hardware resources. Besides the hardware advantages, they also have advantages in managing files. This is also
important in general design. For example, attention has been paid to the transparency level of the DFS in order to overcome access problems caused by the network [7]. DFSs are designed to provide file services to file system clients. In this structure, clients use interfaces to create, delete, read and write files and to perform directory operations. The operating system used to perform these operations may be a distributed operating system, or an intermediate layer between the operating system and the distributed file system [8].

Fig.1. The remote access model: the client sends requests to access a remote file; the file stays on the server.

The architecture of a DFS is generally based on three structures:
- Server-client based structures
- Cluster-based structures
- Symmetric structures

The server-client architecture has been used extensively in DFSs. There are two models in this architecture.

Fig.2. The upload/download access model: (1) the file is moved to the client, (2) accesses are done on the client, (3) when the client is done, the file is returned to the server.

The first is the remote access model. In this model, the client provides an interface with various file operations, and file operations are performed through this interface; the server has to respond to each request. The second model is the upload/download model. Unlike the remote access model, this model downloads the file that the client will process, and the client then accesses the file locally. The server-client model is used in NFS, which is nowadays the most widely used DFS [1].

Cluster-based architecture also does not have a single server. There are multiple servers in the system. One of the servers is the master server, which keeps the metadata of the data; the other servers are chunk servers. With more than one chunk server, the system can handle multiple clients at the same time, and very large data can be processed with this architecture. An example of this architecture is the Google File System (GFS).

Fig.3. Cluster-based architecture: the GFS client sends a file name and chunk index to the master and receives a contact address; it then exchanges the chunk ID, range and chunk data with the chunk servers, which report their state to the master through chunk-server instructions.

The most important difference among DFSs with a symmetric architecture is whether they create a file system on top of a distributed storage layer or store all files directly in the participating nodes. This architecture consists of three separate layers: the first layer provides basic decentralized lookup facilities, the middle layer is a fully distributed block-oriented storage layer, and the top layer implements the file system [1].

3. CLASSIFICATION CRITERIA

DFSs have several characteristics that affect server quality. The most important of these are as follows:

A. Fault tolerance: When any part of the distributed system fails, the failure is tolerated without being felt by the client [1].

B. Transparency: The distributed system looks like a single server to the client. This is the most important criterion affecting system design.

C. Replication: More than one copy of the files used in the system is created and stored in the distributed system, which improves reliability. If one copy is not accessible, the system continues to work using another copy.

D. Synchronization: There are copies of a file on different servers. A change a client makes in one copy is also made in the other copies.

E. Naming: Names identify all resources in the distributed system: computers, services, users and remote objects. The distributed system is to
make a consistent naming of objects; if it does not, the objects cannot be accessed.

4. DISTRIBUTED FILE SYSTEMS

4.1. Network File System (NFS)

Development of NFS started in 1984, in a project by Sun Microsystems. It is the most widely used and implemented DFS on UNIX systems. It uses the Remote Procedure Call (RPC) model for communication [9].

Fig.4. NFS architecture: on both client and server, the system call layer sits above the virtual file system layer, which dispatches either to the local file system interface or to the NFS client/server interface; the NFS client and server communicate through RPC client and server stubs over the network.

The latest version is NFS version 4. The basic design is a distributed implementation of the classic UNIX file system. A virtual file system is used; it works as an intermediate layer and allows clients to work easily with different file systems. It is an interface placed between the operating system calls and the file system calls. In the latest version, more than one command can be sent in a single RPC.

Fault tolerance is high in NFS. Information about the status of files is kept, and in case of an error originating from the client, the server is notified. No file replication is done in NFS; instead, the entire system is replicated. Files are cached, and the copy in the cache is compared with the copy on the server. If the times differ, the file has been changed and the cache is discarded [10].

NFS does not use a synchronization method, because files are not replicated; operations are performed on a single server. This is the upload/download model described earlier. The client that made the last modification on a file sends the latest data to the server. When a client wants to open a file that is cached and locked, the cache is updated by revalidating it from the server. Consistency is ensured in this way [11] [12]. With NFS, only the file system can be shared; printers and modems cannot be shared. The objects to be shared may be part of a directory or a file.

With NFS, a local disk installation for each application is not required: applications can be shared via the server, and the same machine can be both server and client. As a result, NFS reduces the cost of data storage.

4.2. CEPH

Ceph is open source. Today, this object-based distributed file system is increasing in popularity. The first report on it was published by Weil et al. in 2006 [13]. It was acquired by Red Hat in 2014, and seven major versions have been developed; the latest stable version is "Giant". It offers three different storage architectures at the same time: object-based storage, block-based storage and a file system.

The most important features of Ceph are reliability and scalability. Metadata is the data that holds information about the data. In most distributed systems, the data and the metadata describing it are located on separate servers, and the data cannot be accessed when the metadata is not available. Ceph does not need a metadata server lookup of this kind: instead of asking a metadata server, clients use an algorithm, called CRUSH, that determines the location of the data. Clients use this algorithm to compute the position of a dataset and read it. With this algorithm, there is no problem of unreachable metadata.

In the Ceph DFS, more than one copy of the data is kept, distributed over the servers; this is how it performs replication.

According to workload measurements, Ceph has very good input/output performance. It has scalable metadata management that allows up to 250,000 metadata transactions per second. It can be integrated into well-known cloud and virtualization systems (such as OpenStack, CloudStack and VMware).

Fig.5. Ceph system architecture: clients (e.g. bash, ls, libfuse, and vfs/fuse in the Linux kernel) perform metadata operations against the metadata cluster and read and write file data directly to the object storage cluster; the metadata cluster also stores its own metadata in the object storage cluster.
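The key idea behind CRUSH described above is that placement is computed, not looked up: every client runs the same deterministic function over the object name and the set of storage devices (OSDs), so all clients agree on where an object lives without contacting a metadata server. The following is a minimal sketch of that idea using rendezvous-style hashing; it is an illustration of computed placement only, not the real CRUSH algorithm, and the names `place_object` and `osd.N` are hypothetical.

```python
import hashlib

def place_object(object_name: str, osds: list, replicas: int = 3) -> list:
    """Rank every OSD by a hash of (object, osd) and keep the top `replicas`.

    Because the ranking depends only on the object name and the OSD names,
    any client can recompute the same placement independently -- the
    property that lets Ceph clients avoid a metadata-server lookup."""
    def weight(osd: str) -> int:
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).hexdigest()
        return int(digest, 16)

    ranked = sorted(osds, key=weight, reverse=True)
    return ranked[:replicas]

# Example: eight simulated OSDs, three replicas per object.
osds = [f"osd.{i}" for i in range(8)]
print(place_object("myfile.chunk0", osds))
```

A side effect of this scheme (shared with the real CRUSH) is that when one OSD disappears, only the objects that were mapped to it move; objects placed on the surviving OSDs keep their positions, which limits data migration after a failure.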
4.3. HADOOP

Hadoop provides a distributed file system known as the Hadoop Distributed File System (HDFS). Hadoop uses the high-level Java programming language [14] and the MapReduce architecture. It is a framework that allows analysis and transformation of very big data sets.

Fig.6. Hadoop architecture.

HDFS can reliably store very large data sets. It is also designed to stream these data sets to the client application with high bandwidth [15].

Hadoop works with a backup copy of the data, assuming that compute elements and storage may fail [16]. If a fault occurs, the copy that resides on another node is replicated again. This keeps the data safe. Hadoop is scalable and works at petabyte scale [17].

Hadoop is used by many large companies today and is preferred in both industrial and academic settings. Companies such as Twitter, LinkedIn, eBay, AOL, Alibaba, Yahoo, Facebook, Adobe and IBM use Hadoop [18].

5. COMPARISON AND CONCLUSIONS

Today, distributed file systems are used in large-scale companies as well as in small-scale companies and projects. With the growth of the data produced in recent years, they are preferred especially for storing and managing big data. Cloud data storage is a subservice of distributed systems, and distributed systems are used in cloud systems across the Internet.

NFS, one of the DFSs discussed in this work, is based on distributed systems and is a distributed version of the UNIX file system. It is used to store big data and to share hardware and software resources. The fact that Ceph does not need a metadata server, its most distinctive feature, ensures that the system can work reliably and makes it stand out. Due to its scalability, Ceph can process big data. There are over 200 developers and more than 50 supporting companies, which makes it ideal for low-budget institutions.

Hadoop can handle the very large data that many companies nowadays prefer. It contains many stacked servers. Like Ceph, it keeps data and metadata on separate servers. Fault tolerance makes it safe and scalable. It also establishes a base for the cloud technology heavily used at present. It will be the most widely used DFS in the future.

REFERENCES

[1] A. S. Tanenbaum and M. V. Steen, Distributed Systems: Principles and Paradigms, 2nd ed., USA: Pearson Prentice Hall, 2006.

[2] L. Svoboda, "File Servers for Network-based Distributed Systems," ACM Computing Surveys, vol. 16, no. 4, pp. 353-398, 1984.

[3] U. Ergun, S. Eken and A. Sayar, "Guncel Dagitik Dosya Sistemlerinin Karsilastirilmali Analizi" [A Comparative Analysis of Current Distributed File Systems], in 6. Muhendislik ve Teknoloji Sempozyumu, Ankara, Turkey, 2013.

[4] P. J. Braam, "The Coda Distributed File System," Linux Journal, vol. 50, no. 6, pp. 10-20, 1998.

[5] M. Satyanarayanan, "Scalable, Secure, and Highly Available Distributed File Access," Computer, vol. 23, no. 5, pp. 9-18, 1990.

[6] A. Siegel, K. Birman and K. Marzullo, "Deceit: A Flexible Distributed File System," in Proc. Workshop on the Management of Replicated Data, Houston, TX, USA, 1990.

[7] "Coda web site," [Online]. Available: http://www.coda.cs.cmu.edu/ljpaper/lj.html. [Accessed 16 November 2016].

[8] E. Levy, "Distributed File Systems: Concepts and Examples," ACM Computing Surveys, vol. 22, no. 4, pp. 321-374, 1990.

[9] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh and B. Lyon, "Design and Implementation of the Sun Network File System," in Proceedings of the USENIX Conference, Portland, 1985.

[10] G. Coulouris, J. Dollimore and T. Kindberg, Distributed Systems: Concepts and Design, USA: Addison-Wesley Publishing Company, 2011.

[11] C. Juszczak, "Improving the Performance and Correctness of an NFS Server," in Proc. Summer USENIX, USA, 1990.
[12] B. Karasulu and S. Korukoğlu, "Modern Dağıtık Dosya Sistemlerinin Yapısal Karşılaştırılması" [A Structural Comparison of Modern Distributed File Systems], in X. Akademik Bilisim, Çanakkale, Turkey, 2008.

[13] S. Weil, S. Brandt, E. Miller, D. Long and C. Maltzahn, "Ceph: A Scalable, High-Performance Distributed File System," in 7th Symposium on Operating Systems Design and Implementation (OSDI), USENIX, USA, pp. 307-320, 2006.

[14] M. Grossman, M. Breternitz and V. Sarkar, "HadoopCL2: Motivating the Design of a Distributed, Heterogeneous Programming System with Machine-Learning Applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 3, pp. 762-775, 2016.

[15] N. Shankaran and R. Sharma, "Cloud Storage Systems: A Survey," Indiana University, USA, 2011.

[16] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System," in IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), NV, USA, pp. 1-10, 2010.

[17] G. Yavuz, S. Aytekin and M. Akçay, "Apache Hadoop ve Dagıtık Sistemler Üzerindeki Rolü" [Apache Hadoop and Its Role in Distributed Systems], Journal of the Institute of Science & Technology of Dumlupinar University, no. 27, pp. 43-54, 2012.

[18] "PoweredBy," [Online]. Available: https://wiki.apache.org/hadoop/PoweredBy. [Accessed 15 November 2016].