Survey
Introduction
Disk space in secondary storage is used for the physical storage of programs that are logically executed in main memory. The operating system allocates space for a program in physical memory. The allocation of disk space involves several activities, such as searching for a range of free space, indexing data on the disk, and retrieving and ordering content, which together enable efficient utilization of disk space. Much earlier research has been carried out on this problem, notably the family of buddy systems.
Earlier work covered contiguous, linked, and indexed allocation, where the major contribution lies in defining data structures and methods that enhance the effective utilization of disk space.
Prior to 1970, disc allocation was administered by means of a map held in core, with one bit per disc page, using a method devised by J. L. Smith. B. J. Austin retained all of Smith's procedure, except that the technique for searching the map was improved. Allocation was dynamic, in that space was given to a file only as needed, not reserved in advance. The system also maintained a list of holes in the disc map; this list was added to whenever a search for allocation found a string of adjacent free pages whose length was less than that required for the request in hand.
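To make the mechanism concrete, the following small C sketch (illustrative only, not Smith's or Austin's actual code) shows the core operation both methods depend on: scanning an in-core allocation map, one bit per disc page, for a string of adjacent free pages.

    /* Minimal sketch of a one-bit-per-page allocation map search. */
    #include <stdio.h>
    #include <string.h>

    #define NPAGES 64
    static unsigned char map[NPAGES / 8];   /* bit i set => page i is free */

    static int bit(int i)        { return (map[i / 8] >> (i % 8)) & 1; }
    static void clear_bit(int i) { map[i / 8] &= ~(1u << (i % 8)); }

    /* Find the first run of `need` adjacent free pages; return its start,
     * or -1 if no such run exists.  A shorter run found along the way is
     * what Austin's variant would record in the hole list. */
    static int find_run(int need)
    {
        int run = 0;
        for (int i = 0; i < NPAGES; i++) {
            run = bit(i) ? run + 1 : 0;
            if (run == need)
                return i - need + 1;
        }
        return -1;
    }

    int main(void)
    {
        memset(map, 0xFF, sizeof map);      /* all pages free initially */
        clear_bit(5); clear_bit(6);         /* mark pages 5-6 in use */
        int start = find_run(8);
        printf("8 free pages starting at page %d\n", start);
        if (start >= 0)
            for (int i = start; i < start + 8; i++)
                clear_bit(i);               /* allocate the run */
        return 0;
    }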
File-loading experiments showed that the proposed allocation algorithm produced about 80 page tables, where previous allocation algorithms produced about 540. A programmer may produce a file in which the logical addressing space is non-contiguous; such a file always has a page table. The time for the file-loading process was considerably reduced for two reasons: the number of disc operations is reduced, and the time spent searching the map is reduced.
Dumping and reloading the file storage takes a little less than two hours of machine time: the load takes about three quarters of an hour and the dump, which involves verification of the tapes produced, takes about an hour. With the previous algorithm the number of page tables immediately after a file load was 500 to 1,000, and during a day an additional 100 to 200 would be formed.
The system no longer spends a large amount of time searching the allocation map when the disc is practically full, although no quantitative evaluation of this effect is available. The algorithm appears satisfactory for both file loading and normal operation.
D. E. Gold et al. proposed a basic permutation algorithm and its variations; their work deals with masking rotational latency. Algorithms for the permutation of blocks in a memory hierarchy were proposed: a slow-to-fast transition algorithm, a fast-to-slow transition algorithm, an arbitrary permutation algorithm, a standard permutation algorithm, and permutation assignment algorithms, with the addition and deletion of blocks treated as special cases of permutation. The method is implemented for a head-per-track disk system by performing row permutations in an intermediate (bulk storage) random access memory, as described in their example. The first row permutation is accomplished by accumulating blocks in an output buffer as they come from primary memory; when a row is accumulated, these blocks start to be output to the disk in their permuted order. The final row permutations are performed similarly, by accumulating a row of blocks in the input buffer before transmission to primary memory in permuted order.
Howard Lee Morgan proposed an optimization model for the assignment of files to disk packs, and of packs to either resident or nonresident status. Heuristics are suggested for those cases in which it is inefficient to compute the actual optimum.
When the amount of space required for file storage exceeds the amount which can be kept
online, decisions must be made as to which files are to be permanently resident and which
mountable. These decisions will affect the number of mount requests issued to the operators.
This is often a bottleneck in a computing facility, and reducing the number of mounts thus
decreases turnaround time.
In summary, the problem of optimally using disk storage devices is a complex one, with effects contributed by queuing and scheduling as well as by space allocation. The paper attempts to describe where space allocation fits in, and to prescribe some methods for handling this important problem; test cases of multiple-knapsack algorithms were executed.
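Morgan's residency decision has a knapsack flavor: packs compete for a limited number of drives. The following C sketch is only an illustration under assumed data (the access counts, pack count, and drive count are all invented, and this is not the paper's actual model): a 0/1 knapsack in which each resident pack occupies one drive and the goal is to maximize the accesses served without mounts.

    #include <stdio.h>

    #define NPACKS 5
    #define DRIVES 3

    int main(void)
    {
        /* accesses[i]: expected accesses per day to pack i (hypothetical) */
        int accesses[NPACKS] = { 120, 75, 60, 30, 10 };
        /* best[d]: max accesses served by resident packs using d drives */
        long best[DRIVES + 1] = { 0 };

        for (int i = 0; i < NPACKS; i++)          /* 0/1 knapsack, weight 1 */
            for (int d = DRIVES; d >= 1; d--)
                if (best[d - 1] + accesses[i] > best[d])
                    best[d] = best[d - 1] + accesses[i];

        printf("accesses served from resident packs: %ld per day\n",
               best[DRIVES]);
        return 0;
    }

Because every pack here weighs exactly one drive, the knapsack degenerates to picking the most-accessed packs; the general model becomes interesting when packs have different sizes or files must first be packed onto packs.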
Kenneth K. Shen et al. presented an extension of the buddy method, called the weighted buddy method, for dynamic storage allocation. The weighted buddy method allows block sizes of 2^k and 3*2^k, whereas the original buddy method allowed only block sizes of 2^k. This extension is achieved at an additional cost of only two bits per block. Simulation
results are presented which compare this method with the buddy method. These results
indicate that, for a uniform request distribution, the buddy system has less total memory
fragmentation than the weighted buddy algorithm. However, the total fragmentation is
smaller for the weighted buddy method when the requests are for exponentially distributed
block sizes.
James A. Hinds proposed a simple scheme for determining the location of a block of storage relative to other blocks. This scheme is applicable to buddy-type storage allocation systems.
Warren Burton worked on a generalization of the buddy system for storage allocation. The set of permitted block sizes {SIZE_i}, i = 0 to n, must satisfy the condition SIZE_i = SIZE_{i-1} + SIZE_{i-k(i)}, where k may be any meaningful integral-valued function. This makes it possible to force logical storage blocks to coincide with physical storage blocks, such as tracks and cylinders.
James L. Peterson presented two algorithms for implementing any of a class of buddy
systems for dynamic storage allocation. Each buddy system corresponds to a set of recurrence
relations which relate the block sizes provided to each other. Analyses of the internal
fragmentation of the binary buddy system, the Fibonacci buddy system, and the weighted
buddy system are given. Comparative simulation results are also presented for internal,
external, and total fragmentation.
The total fragmentation of the weighted buddy system is generally worse than that of the binary buddy system. The total fragmentation of the Fibonacci buddy system varies widely because of its internal fragmentation characteristics. Still, the variation among these buddy systems is not great, and the lower execution time of the binary buddy would therefore seem to recommend it for general use, although the execution time of the Fibonacci buddy system is not much greater. The weighted buddy system seems to be less desirable than either the binary or the Fibonacci system owing to its higher execution time and greater external fragmentation.
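The recurrence-relation view of these systems is easy to make concrete. The following C sketch (an illustration only; the seed values are chosen arbitrarily) generates permitted block-size sequences from SIZE_i = SIZE_{i-1} + SIZE_{i-k}: k = 1 yields the binary buddy's powers of two, k = 2 the Fibonacci buddy, while the weighted buddy instead interleaves the two doubling sequences 2^k and 3*2^k.

    #include <stdio.h>

    /* Print n sizes generated by SIZE[i] = SIZE[i-1] + SIZE[i-k]. */
    static void recurrence_sizes(const char *name, int k, int n)
    {
        long s[32];
        /* seed values chosen for illustration; real systems pick seeds
         * to match their smallest allocatable blocks */
        for (int i = 0; i < k; i++) s[i] = i + 1;
        for (int i = k; i < n; i++) s[i] = s[i - 1] + s[i - k];
        printf("%-10s:", name);
        for (int i = 0; i < n; i++) printf(" %ld", s[i]);
        printf("\n");
    }

    int main(void)
    {
        recurrence_sizes("binary", 1, 10);     /* 1 2 4 8 16 ... */
        recurrence_sizes("Fibonacci", 2, 10);  /* 1 2 3 5 8 13 ... */

        printf("%-10s:", "weighted");          /* sizes 2^k and 3*2^k */
        for (long p = 1; p <= 64; p *= 2)
            printf(" %ld %ld", p, 3 * p);
        printf("\n");
        return 0;
    }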
The problem of disk allocation for Cartesian product files can be stated as follows: given a k-attribute Cartesian product file and an m-disk system, allocate buckets among the m disks in such a way that, for all possible partial match queries, the concurrency of disk accesses is maximized. The Disk Modulo (DM) allocation method is described first, and it is shown to be strictly optimal under many conditions commonly occurring in practice, including all possible partial match queries when the number of disks is 2 or 3. It is also shown that, although it has good performance, the DM allocation method is not strictly optimal for all possible partial match queries when the number of disks is greater than 3. The General Disk Modulo (GDM) allocation method is then described, and a sufficient but not necessary condition for strict optimality of the GDM method for all partial match queries and any number of disks is derived. Simulation studies comparing the DM and random allocation methods in terms of the average number of disk accesses, in response to various classes of partial match queries, show the former to be significantly more effective even when the number of disks is greater than 3, that is, even in cases where the DM method is not strictly optimal. The results that have been derived formally and shown by simulation can be used for more effective design of optimal file systems for partial match queries. When considering multiple-disk systems with independent access paths, it is important to ensure that similar records are clustered into the same or similar buckets, while similar buckets are dispersed uniformly among the disks.
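The DM rule itself is a one-line mapping: bucket (i1, ..., ik) is placed on disk (i1 + ... + ik) mod m. A minimal C sketch:

    #include <stdio.h>

    #define M 3   /* number of disks */

    /* Disk Modulo: bucket index vector -> disk number. */
    static int dm_disk(const int idx[], int k)
    {
        int sum = 0;
        for (int a = 0; a < k; a++) sum += idx[a];
        return sum % M;
    }

    int main(void)
    {
        /* 2-attribute file, 4 x 4 buckets: print the disk of each bucket */
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) {
                int idx[2] = { i, j };
                printf("%d ", dm_disk(idx, 2));
            }
            printf("\n");
        }
        return 0;
    }

Every row and every column of the printed grid cycles through all M disks, which is why a partial match query that fixes one attribute can engage all disks concurrently.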
Free disk space management: Matthew S. Hecht et al. note that schemes for managing free disk pages are so widely known that they must be considered part of the folklore of computer science. Two popular data structures are bitmaps and linked lists. The bitmap has bit position i set when disk page number i is free, and cleared when disk page number i is in use. The linked list contains the page numbers of free disk pages. Less widely known, however, are variations of such schemes that preserve the consistency of these data across failures. While recoverable schemes for managing free disk pages are all based on the principle of maintaining a copy (complete, or incremental with a base) on memory media with an independent failure mode, the details of such schemes vary considerably. The general problem considered here is how to make the free-disk-space data structures survive two kinds of failures: (1) failure of main memory (e.g., loss of power) resulting in loss of its contents, and (2) failure of a disk transfer resulting in an unreadable disk page. The paper presents a programming technique, using a linked list for managing the free disk pages of a file system and using shadowing (also known as careful replacement [10]) for failure recovery, which enjoys the following properties:
(1) The state of allocation at the previous checkpoint (a consistent system state preserved on
disk) is always maintained on the disk.
(2) The data structure describing free space on disk is never copied during a checkpoint or recovery (from a main-memory failure) operation.
(3) A window of only two pages of main memory is required for accessing and maintaining
the data describing free space.
(4) System information need be written to disk only during a checkpoint, rather than every
time it changes.
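As a rough illustration of the linked-list half of this design, the C sketch below pops free page numbers from chained index pages, so only a small window of the structure need be examined at a time. The shadowing and checkpoint machinery is omitted, and the in-memory chain stands in for index pages that would actually live on disk; none of this is the paper's code.

    #include <stdio.h>

    struct index_page {
        int free_pages[4];        /* page numbers of free disk pages */
        int count;                /* how many entries are valid */
        struct index_page *next;  /* next index page in the chain */
    };

    /* Pop one free page number, advancing the chain as pages empty out. */
    static int alloc_page(struct index_page **head)
    {
        while (*head && (*head)->count == 0)
            *head = (*head)->next;
        if (!*head) return -1;                    /* no free pages left */
        return (*head)->free_pages[--(*head)->count];
    }

    int main(void)
    {
        struct index_page b = { { 9, 8, 7, 6 }, 4, NULL };
        struct index_page a = { { 3, 2 }, 2, &b };
        struct index_page *head = &a;
        for (int i = 0; i < 7; i++)
            printf("allocated page %d\n", alloc_page(&head));
        return 0;
    }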
Lorie [7] describes a scheme similar to ours that uses bitmaps and shadowing. Gray [1, 2] describes the update-in-place with logging paradigm that can be applied to the problem of managing free disk pages across failures. Sturgis, Mitchell, and Israel [9] (see also Mitchell
and Dion [8]) describe an abstraction called stable storage whereby a page-allocation bitmap
is recorded redundantly on disk; the second page is not written until the first has been written
successfully.
Distributed file systems: Bruce Walker et al. presented LOCUS, a distributed operating system which supports transparent access to data through a network-wide file system, permits automatic replication of storage, supports transparent distributed process execution, supplies a number of high-reliability functions such as nested transactions, and is upward compatible with Unix. Partitioned operation of subnets and their dynamic merge is also supported. The system has been operational for about two years at UCLA and extensive experience in its use has been obtained. The complete system architecture is outlined in the paper, and that
experience is summarized. The most obvious conclusion to be drawn from the LOCUS work
is that a high performance, network transparent, distributed file system which contains all of
the various functions indicated throughout this paper, is feasible to design and implement,
even in a small machine environment. Replication of storage is valuable, both from the user
and the system's point of view. However, much of the work is in recovery and in dealing with
the various races and failures that can exist. Nothing is free. In order to avoid performance
degradation when resources are local, the cost has been converted into additional code and
substantial care in implementation architecture. LOCUS is approximately a third bigger than
Unix and certainly more complex. The difficulties involved in dynamically reconfiguring an
operating system are both intrinsic to the problem, and dependent on the particular system.
Rebuilding lock tables and synchronizing processes running in separate environments are
problems of inherent difficulty. Most of the system dependent problems can be avoided,
however, with careful design. The fact that LOCUS uses specialized protocols for operating
system to operating system communication made it possible to control message traffic quite
selectively. The ability to alter specific protocols to simplify the reconfiguration solution was
particularly appreciated. The task of developing a protocol by which sites would agree about
the membership of a partition proved to be surprisingly difficult. Balancing the needs of
protocol synchronization and failure detection while maintaining good performance presented
a considerable challenge. Since reconfiguration software is run precisely when the network is
flaky, those problems are real, and not events that are unlikely. Nevertheless, it has been
possible to design and implement a solution that exhibits reasonably high performance.
Further work is still needed to assure that scaling to a large network will successfully
maintain that performance characteristic, but our experience with the present solution makes
us quite optimistic. In summary, however, use of LOCUS indicates the enormous value of a
highly transparent, distributed operating system. Since file activity often is the dominant part
of the operating system load, it seems clear that the LOCUS architecture, constructed on a
distributed file system base, is rather attractive.
Philip D. L. Koch observed that the buddy system is known for its speed and simplicity; however, high internal and external fragmentation have made it unattractive for use in operating system file layout. A variant of the binary buddy system that reduces fragmentation is described: files are allocated in up to t extents, and inoptimally allocated files are periodically reallocated. The Dartmouth Time-Sharing System (DTSS) uses this method. Several installations of DTSS, representing different classes of workload, are studied to measure the method's performance. Internal fragmentation varies from 2-6 percent, and external fragmentation varies from 0-10 percent for expected request sizes. Less than 0.1 percent of the CPU is spent executing the algorithm. In addition, most files are stored contiguously on disk. The mean number of extents per file is less than 1.5, and the upper bound is t. Compared to the file layout method used by UNIX, the buddy system results in more efficient access but less efficient utilization of disk space. As disks become larger and less expensive per byte, strategies that achieve efficient I/O throughput at the expense of some storage loss become increasingly attractive.
Khaled A. S. Abdel-Ghaffar et al.: the problem of disk allocation addresses the issue of how to
distribute a file on several disks in order to maximize concurrent disk accesses in response to
a partial match query. In this paper a coding-theoretic analysis of this problem is presented,
and both necessary and sufficient conditions for the existence of strictly optimal allocation
methods are provided. Based on a class of optimal codes, known as maximum distance
separable codes, strictly optimal allocation methods are constructed. Using the necessary
conditions proved, we argue that the standard definition of strict optimality is too strong and
cannot be attained, in general. Hence, we reconsider the definition of optimality. Instead of
basing it on an abstract definition that may not be attainable, we propose a new definition
based on the best possible allocation method. Using coding theory, allocation methods that
are optimal according to our proposed criterion are developed. In this paper we have
presented a coding-theoretic analysis of the disk allocation problem. We have shown the
equivalence of the problem of strictly optimal disk allocation and the class of MDS codes.
One main open problem in this area is the development of tight necessary and sufficient
conditions for the existence of optimal disk allocation [8]. The results in Section 4 provide
both necessary and sufficient conditions for the existence of such methods. These results
formalize the intuitive ideas developed by Faloutsos and Metaxas [6], as well as extend and
generalize several other previous results, especially those presented by Sung [18]. Unlike the
necessary condition in [18] (Theorem 3.1), which considered binary Cartesian product files,
Theorem 4.1 applies to any p-ary Cartesian product file. Our results also complement the results in [5] and [8], where strictly optimal disk allocation methods are derived when p > m. In this paper we have studied the case p < m (this is especially important since, often, p = 2). For this case, our results imply better allocation methods than those mentioned in the literature. For example, when p = 8, n = 6, and m = 64, the field exclusive method of [8] gives a response time of 2.4 on average for queries with two unspecified fields. However, our techniques result in a strictly optimal allocation method that, of course, has response time 1
techniques result in a strictly optimal allocation method that, of course, has response time 1
for such queries. Using the necessary conditions proved, we have argued that the standard
definition of strict optimality is too strong and cannot be attained, in general. Therefore, we
have defined a new criterion for optimality whose objective is to design allocation methods
that yield response time 1 for all queries with a minimum number of specified attributes.
Using coding theory, we have determined this minimum number for binary Cartesian product
files that have up to 16 attributes, assuming that the number of disks is a power of 2. Our
approach contrasts with most previous results where ad hoc methods were used [4, 5, 6, 8,
17]. Based on results from coding theory, we have constructed optimal allocation methods for
response time 1, thus, in this case, resulting in better performance than previous methods. For
example, compared to [6], when p = 2, n = 8, and m = 16, our approach ensures a response
time 1 if five attributes are specified, whereas in [6] six or more attributes must be specified
to guarantee such a response time. Furthermore, we have shown that if all queries with s specified attributes have response time p^(n-s)/m, where m is a power of p, then all queries with s' < s specified attributes have response time p^(n-s')/m. This implies that, if an
allocation method can be optimized for all queries with a particular number of unspecified
attributes, then it is optimized for all queries with more unspecified attributes. Finally, we
have addressed some of the implications of our results on the design of multiple-disk file
systems. As mentioned earlier, the number of attributes of a given file is usually
predetermined by the application. Hence, the system designer has two parameters left to
determine: p, the number of partitions the attribute domains are partitioned into; and m, the
number of disks on which the file is distributed. In Section 4 we have shown that, for any
given p, the range of choices for m is fairly limited when the goal is to design a strictly
optimal allocation method. In contrast, our new optimality criterion, which ensures response
time 1 with a minimum number of specified attributes, provides the system designer with
much more flexibility in determining p and m. Although the results concerning our new
optimality criterion are based on the assumption that m is a power of p, our approach gives
tight bounds on the performance of the best possible allocation methods even if this
assumption does not hold. This can be achieved by replacing m, which is not necessarily a power of p, by p^(ceil(log_p m)) and p^(floor(log_p m)), that is, by considering allocation methods where the number of disks is a power of p close to m. For example, if p = 2, n = 8, and m = 5, we may get a lower bound on s1, the minimum number such that any query with s1 specified attributes has a response time of 1, by replacing m = 5 by m = 8. From Table III, we get s1 >= 7.
On the other hand, we know from the same table that there is an allocation method for m = 4 disks that gives a response time of 1 for any query with seven specified attributes. This allocation method can be applied to the case of five disks (by using only four disks) to guarantee that every query with s1 = 7 specified attributes has a response time of 1. From the lower bound, we know that this value of s1 is optimal; that is, there is no allocation method that gives a response time of 1 for every query with six specified attributes if the number of disks equals 5.
1998 - Efficient Disk Allocation for Fast Similarity Searching
Sunil Prabhakar et al. presented a new scheme which provides good declustering for similarity
searching. In particular, it does global declustering as opposed to local declustering, exploits
the availability of extra disks and does not limit the partitioning of the data space. Our
technique is based upon the Cyclic declustering schemes which were developed for range and
partial match queries. We establish, in general, that Cyclic declustering techniques
outperform previously proposed techniques. The problem of efficient similarity searching is
becoming important for databases as non-textual information is stored. The problem reduces
to one of finding nearest-neighbors in high-dimensional spaces. In this paper, a new disk
allocation method for declustering high-dimensional data to optimize nearest-neighbor
queries is developed. The new scheme, called cyclic allocation, is simple to implement and is
a general allocation method in that it imposes no restrictions on the partitioning of the data
space. Furthermore, it exploits the availability of any number of disks to improve
performance. Finally, by varying the skip values the method can be adapted to yield
allocations that are optimized for various criteria. We demonstrated the superior performance of the cyclic approach compared to existing schemes, both those that were originally designed for range queries (FX, DM and HCAM) and those designed specifically for nearest neighbors (NOD). The FX and DM schemes are found to be inappropriate for nearest-neighbor queries. HCAM performs reasonably well for odd numbers of disks, but extremely poorly for even numbers. NOD was found not to achieve as much parallelism as Cyclic for
most cases, except when retrieving only direct neighbors with a small number of disks. NOD
also has the potential to give better performance for some dimensions when the number of
disks is close to that required to achieve near-optimality. On the other hand, NOD is restricted
to 2-way partitioning of each dimension, and its cost remains the same even when more disks
beyond those required for near-optimal declustering are available. This results in a saturation
of the gains produced by NOD beyond this point. In contrast, the Cyclic approach is not
restricted to 2-way partitioning and makes use of all available disks. In fact, its cost tracks the
lower bound and reduces as the number of disks increases. Overall we observe that the Cyclic
scheme gives the best performance for nearest-neighbor queries more consistently than any
other scheme. Given the success of the cyclic schemes for two dimensional range queries
[PAAE98], and the flexibility for nearest-neighbor queries, we expect that it will give good
performance for systems that require both types of queries.
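The cyclic schemes referred to above reduce to a simple mapping: cell (i1, ..., id) of the partitioned data space is assigned to disk (i1*h1 + ... + id*hd) mod M for fixed skip values h1, ..., hd. The C sketch below uses invented skip values purely for illustration; the paper chooses them to optimize particular query types.

    #include <stdio.h>

    #define M 5                       /* number of disks */

    /* Cyclic allocation: grid cell -> disk, given per-dimension skips. */
    static int cyclic_disk(const int cell[], const int skip[], int d)
    {
        int disk = 0;
        for (int a = 0; a < d; a++)
            disk = (disk + cell[a] * skip[a]) % M;
        return disk;
    }

    int main(void)
    {
        int skip[2] = { 1, 2 };       /* illustrative skip values */
        for (int i = 0; i < 5; i++) {
            for (int j = 0; j < 5; j++) {
                int cell[2] = { i, j };
                printf("%d ", cyclic_disk(cell, skip, 2));
            }
            printf("\n");
        }
        /* neighboring cells land on different disks, so the buckets near
         * a query point can be fetched from many disks in parallel */
        return 0;
    }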
2003 An application of a context-aware file system
Ubiquitous computing environments stretch the requirements of traditional infrastructures
used to facilitate the development of applications. Activities are often supported by
collections of applications, some of which are automatically launched with little or no human
intervention. This task-driven environment challenges existing application construction and
data management techniques. In this paper, we describe a file system that organises
application data based on contextual information, imports user data based on its physical
presence, and supports format conversions to accommodate device context. We describe
several applications that we have developed within our ubiquitous computing infrastructure
and show how they leverage the novel features of our file system to simplify their
complexity. Ubiquitous computing is poised to revolutionise computer systems design and
use. New challenges arise beyond what is required in typical distributed systems, such as how
to integrate context, account for mobility and integrate heterogeneous devices. Applications
are no longer tied to a single machine, but may be mapped to a physical space based on
resource availability or the role that the space is playing. Applications running in a physical
space may be affected by context, such as the location, what is happening in the space or who
is present. Our file system is used to address these issues by restricting the scope of
information to what is meaningful for the current context, such as application configurations
or application data, making personal information available conditioned on user presence, and
adapting content for resource availability through dynamic data types. While much work has
been conducted in location-aware data retrieval, we believe that other types of context
information can be useful in such environments.
A design philosophy of our work has been to create a programmable system in which the user
retains a level of control. By allowing agreements between a space and its applications, users
are able to tag data in such a way that it can become available to those applications
conditioned on the current context and the system is able to operate with deterministic
behaviour. The applications that we have built are able to take advantage of the features of the
system to simplify functions that usually require additional effort on the part of the developer
or user. We have found the performance of the system to easily satisfy the requirements of the
applications we have developed. As we build more applications, we will be able to determine
what types of different scenarios and applications are best supported by such a system.
Through our experiences with the system, we have identified several areas for improvement
and we will be incorporating some of these functions into the infrastructure. Future work will
also involve the inclusion of access control. Our research group is currently working on a
role-based security framework that will be integrated into the system.
2003 - The Google File System
We have designed and implemented the Google File System, a scalable distributed file
system for large distributed data-intensive applications. It provides fault tolerance while
running on inexpensive commodity hardware, and it delivers high aggregate performance to a
large number of clients. While sharing many of the same goals as previous distributed file
systems, our design has been driven by observations of our application workloads and
technological environment, both current and anticipated, that reflect a marked departure from
some earlier file system assumptions. This has led us to reexamine traditional choices and
explore radically different design points. The file system has successfully met our storage
needs. It is widely deployed within Google as the storage platform for the generation and
processing of data used by our service as well as research and development efforts that
require large data sets. The largest cluster to date provides hundreds of terabytes of storage
across thousands of disks on over a thousand machines, and it is concurrently accessed by
hundreds of clients. In this paper, we present file system interface extensions designed to
support distributed applications, discuss many aspects of our design, and report
measurements from both micro-benchmarks and real world use. The Google File System
demonstrates the qualities essential for supporting large-scale data processing workloads on
commodity hardware. While some design decisions are specific to our unique setting, many
may apply to data processing tasks of a similar magnitude and cost consciousness. We started
by reexamining traditional file system assumptions in light of our current and anticipated
application workloads and technological environment. Our observations have led to radically
different points in the design space. We treat component failures as the norm rather than the
exception, optimize for huge files that are mostly appended to (perhaps concurrently) and
then read (usually sequentially), and both extend and relax the standard file system interface
to improve the overall system. Our system provides fault tolerance by constant monitoring,
replicating crucial data, and fast and automatic recovery. Chunk replication allows us to
tolerate chunk server failures. The frequency of these failures motivated a novel online repair
mechanism that regularly and transparently repairs the damage and compensates for lost
replicas as soon as possible. Additionally, we use checksumming to detect data corruption at
the disk or IDE subsystem level, which becomes all too common given the number of disks
in the system. Our design delivers high aggregate throughput to many concurrent readers and
writers performing a variety of tasks. We achieve this by separating file system control,
which passes through the master, from data transfer, which passes directly between chunk
servers and clients. Master involvement in common operations is minimized by a large chunk
size and by chunk leases, which delegates authority to primary replicas in data mutations.
This makes possible a simple, centralized master that does not become a bottleneck. We
believe that improvements in our networking stack will lift the current limitation on the write
throughput seen by an individual client. GFS has successfully met our storage needs and is
widely used within Google as the storage platform for research and development as well as
production data processing. It is an important tool that enables us to continue to innovate and
attack problems on the scale of the entire web.
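The separation of control and data paths starts with a simple client-side translation. The C sketch below illustrates it: with a fixed chunk size, a byte offset within a file maps to a chunk index, which the client sends to the master to obtain the chunk handle and replica locations. The 64 MB chunk size is the figure used in the paper; the function name is invented for illustration.

    #include <stdio.h>

    #define CHUNK_SIZE (64L << 20)   /* 64 MB chunks */

    /* Translate a byte offset within a file into a chunk index. */
    static long chunk_index(long byte_offset)
    {
        return byte_offset / CHUNK_SIZE;
    }

    int main(void)
    {
        long offsets[] = { 0, (100L << 20), (1L << 33) };
        for (int i = 0; i < 3; i++)
            printf("offset %11ld -> chunk index %ld\n",
                   offsets[i], chunk_index(offsets[i]));
        return 0;
    }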
D-GRAID, a semantically smart disk system, delivers graceful degradation and live-block recovery, and, through access-driven diffusion, good performance. We conclude with a discussion of the lessons we have learned in the process of implementing D-GRAID:
Limited knowledge within the disk does not imply limited functionality. One of the main
contributions of this article is a demonstration of both the limits of semantic knowledge, as
well as the proof via implementation that despite such limitations, interesting functionality
can be built inside of a semantically smart disk system. We believe any semantic disk system
must be careful in its assumptions about file system behavior, and hope that our work can
guide others who pursue a similar course.
Semantically smart disks would be easier to build with some help from above. Because of
the way file systems reorder, delay, and hide operations from disks, reverse engineering
exactly what they are doing at the SCSI level is difficult. We believe that small modifications
to file systems could substantially lessen this difficulty. For example, if the file system could
inform the disk whenever it believes the file system structures are in a consistent on-disk
state, many of the challenges in the disk would be lessened. This is one example of many
small alterations that could ease the burden of semantic disk development.
Semantically smart disks stress file systems in unexpected ways. File systems were not
built to operate on top of disks that behave as D-GRAID does; specifically, they may not
behave particularly well when part of a volume address space becomes unavailable. Perhaps
because of its heritage as an OS for inexpensive hardware, Linux file systems handle
unexpected conditions fairly well. However, the exact model for dealing with failure is
inconsistent: data blocks could be missing and then reappear, but the same is not true for
inodes. As semantically smart disks push new functionality into storage, file systems may
potentially need to evolve to accommodate them.
2006 - An approach to virtual allocation in storage systems
Storage area networks (SANs) and storage virtualization [Huang et al. 2004] allow storage
devices to be shared by multiple heterogeneous operating systems. However, native file
systems, such as Windows NTFS or Linux ext2, expect to have exclusive access to their
volumes. In other words, each operating system reserves storage devices for its own use, and
the space in a storage device owned by one operating system cannot be used by another. This
problem seriously hampers the flexibility of storage management [Menon et al. 2003]. This
lack of flexibility manifests in the following restrictions:
(a) File systems cannot use storage space in another device owned by another file system; and
(b) a computer can only create files on devices it owns.
Traditional file systems are closely tied to their underlying storage. File systems currently
employ a static allocation approach, where the entire storage space is claimed at the time of
file system creation. This is somewhat similar to running with only physical memory in a
memory system. Memory systems employ virtual memory for many reasons: to allow
flexible allocation of resources, to share memory safely, etc. Traditional static allocation of
storage at the time of file system creation lacks such flexibility. To address this problem and
to enable storage as a discoverable resource, we have developed a virtual allocation technique
at the block level. Virtual allocation has the following combination of characteristics:
It uses the generic block interface widely employed in today's systems. This allows it to
support a wide range of heterogeneous platforms, and permits the simplest reuse of existing
operating and file systems technologies.
It provides a mechanism to create virtual storage systems, where unused storage space can
be provided to any application or file system as a discoverable resource.
It can be built on existing systems with little change to operating systems.
File system-level: IBM's Storage Tank [Menon et al. 2003] separates metadata operations
from data operations to allow a common pool of storage space to be shared across
heterogeneous environments.
Log-structured file systems (e.g., LFS [Rosenblum and Ousterhout 1992] and WAFL [Hitz et
al. 1994]) allow storage allocation to be detached from file systems because all new data is
written at the end of a log. File systems such as ReiserFS [Reiser 2004] and JFS (journaled
file system) [Best 2000] support expansion of the file systems. None of these systems allocate
storage space based on actual usage.
Block-level: Veritas Volume Manager [Veritas 2002] and other existing storage resource
management (SRM) products, such as IBM Tivoli Storage Manager [Kaczmarski et al. 2003],
provide considerable flexibility in storage management, but allocate storage space based on
file system size. Loge [English and Stepanov 1992] separates storage allocation from the file
system to optimize write operations; it writes blocks near the current disk-head position. The
Loge system does not provide storage allocation flexibility with traditional file systems. By
contrast, the focus of this article is on integration of the flexible storage allocation scheme
into a traditional file system environment. HP's AutoRAID employed dynamic allocation in order to reduce the small-write penalty of RAID systems [Wilkes et al. 1996]. Petal [Lee and Thekkath 1996] proposes a disk-like interface which provides an easy-to-manage distributed storage system to many clients (e.g., file systems and databases). Petal's virtual disks cleanly separate a client's view of storage from the physical resources, which allows sharing of the
physical resources more flexibly among many clients, but it still allocates storage space
according to the configured file system size. Recently, a storage resource allocation technique
called thin provisioning has been introduced [3PARdata 2003], which provides storage
allocation flexibility, as our system does. However, the details about their architecture,
implementation, and performance are unknown. Virtual allocation is on-demand allocation; it
is orthogonal to storage virtualization products sold by vendors. Storage virtualization hides
the connectivity, physical characteristics, and organization of devices from file systems.
However, existing products allocate storage based on file system size, and don't allow
sharing of unused space. This is akin to programs allocating the maximum amount of
memory they may need and not relinquishing or sharing it, even when they actually need
much smaller amounts of memory. This is because the products employ static allocation,
where every mapping between the logical and physical blocks is predetermined for a given
size when creating a logical volume. By contrast, in VA, every mapping is determined
dynamically as data is written (on demand) so that storage can be allocated based on actual
use. Using VA, several different file systems may share a single storage resource, or a single
file system may span multiple storage resources, resulting in better flexibility.
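The allocate-on-write idea can be sketched in a few lines of C. This is an illustration under assumed names and sizes, not the article's code: the logical-to-physical block map starts empty, and a physical block is bound to a logical block only on its first write.

    #include <stdio.h>

    #define NLOGICAL  16
    #define NPHYSICAL 8               /* can be smaller than logical space */

    static int map[NLOGICAL];         /* -1 => not yet backed by storage */
    static int next_free = 0;         /* next unallocated physical block */

    /* Return the physical block backing lblock, allocating on demand. */
    static int write_block(int lblock)
    {
        if (map[lblock] < 0) {        /* first write: allocate on demand */
            if (next_free == NPHYSICAL) return -1;   /* storage exhausted */
            map[lblock] = next_free++;
        }
        return map[lblock];
    }

    int main(void)
    {
        for (int i = 0; i < NLOGICAL; i++) map[i] = -1;
        /* a file system that claims 16 blocks but writes only 2 of them
         * consumes only 2 physical blocks */
        printf("logical 10 -> physical %d\n", write_block(10));
        printf("logical  2 -> physical %d\n", write_block(2));
        printf("logical 10 -> physical %d\n", write_block(10)); /* reused */
        printf("physical blocks in use: %d\n", next_free);
        return 0;
    }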
Conclusion
We have proposed virtual allocation employing an allocate-on-write policy for improving the
flexibility of managing storage across multiple file systems/platforms. By separating storage
allocation from file system creation, common storage space can be pooled across different
file systems and flexibly managed to meet the needs of different file systems. Virtual
allocation also makes it possible for storage devices to expand easily, so that existing devices
incorporate other available resources. We have shown that this scheme can be implemented
with existing file systems, without significant changes to the operating systems. Our
experimental results from a Linux PC-based prototype system demonstrated the effectiveness
of our approach.
2006 - The Conquest File System: Better Performance Through a Disk/Persistent-RAM Hybrid Design
Modern file systems assume the use of disk, a system-wide performance bottleneck for over a decade. Current disk
assume the use of disk, a system-wide performance bottleneck for over a decade. Current disk
caching and RAM file systems either impose high overhead to access memory content or fail
to provide mechanisms to achieve data persistence across reboots. The Conquest file system
is based on the observation that memory is becoming inexpensive, which enables all file
system services to be delivered from memory, except for providing large storage capacity.
Unlike caching, Conquest uses memory with battery backup as persistent storage, and
provides specialized and separate data paths to memory and disk. Therefore, the memory data
path contains no disk-related complexity. The disk data path consists of optimizations only
for the specialized disk usage pattern. Compared to a memory-based file system, Conquest
incurs little performance overhead. Compared to several disk-based file systems, Conquest
achieves 1.3x to 19x faster memory performance, and 1.4x to 2.0x faster performance when
exercising both memory and disk. Conquest realizes most of the benefits of persistent RAM
at a fraction of the cost of a RAM-only solution. It also demonstrates that disk-related
optimizations impose high overheads for accessing memory content in a memory-rich
environment. The Conquest disk/persistent-RAM hybrid file system addresses the
performance problem of disk. The key observation is that the cost of persistent RAM (e.g.,
battery-backed DRAM) is declining rapidly, and the assumption of RAM as a scarce resource
is becoming less true for average users. Conquest explores these emerging memory-rich
environments and their effects on file system architecture and better performance. The
Conquest experience also teaches the following lessons: (1) current operating systems have a deep-rooted assumption of high-latency storage throughout the computing stack, which is difficult to bypass or remove; (2) file systems designed for disks fail to exploit the full potential of memory performance in a memory-rich environment; and (3) separating data paths into low- and high-latency storage, and matching workload characteristics to appropriate storage media, can yield significant performance gains and data path simplifications.
Conquest is a disk/persistent-RAM hybrid file system that delivers all file system services
from persistent RAM, with the single exception that high-capacity storage is still provided by
traditional disks. In essence, Conquest provides two specialized and simplified data paths to
persistent-RAM and disk storage: (1) It stores all small files and metadata (e.g., directories
and file attributes) in RAM; and (2) disk holds only the data content of the remaining large
files (with their metadata stored in persistent RAM).
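The placement rule just described can be sketched in a few lines of C; the 1 MB threshold below is an assumed cutoff for illustration, not a figure taken from the paper.

    #include <stdio.h>

    #define LARGE_FILE_THRESHOLD (1L << 20)   /* 1 MB, hypothetical cutoff */

    enum medium { PERSISTENT_RAM, DISK };

    /* Decide where the *data* of a file lives; metadata always stays in
     * persistent RAM in this scheme. */
    static enum medium place_data(long file_size)
    {
        return file_size < LARGE_FILE_THRESHOLD ? PERSISTENT_RAM : DISK;
    }

    int main(void)
    {
        long sizes[] = { 4096, 512 * 1024, 8L << 20 };
        for (int i = 0; i < 3; i++)
            printf("%8ld bytes -> %s\n", sizes[i],
                   place_data(sizes[i]) == DISK ? "disk" : "persistent RAM");
        return 0;
    }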
By partitioning data in this fashion, Conquest performs all file system management on
memory-resident data structures, thereby minimizing disk accesses. Tailoring both file system
data structures and management to the physical characteristics of memory significantly
improves performance, as compared to disk-only designs. In addition, traversing the memory
data path incurs no disk-related overhead, and the disk data path consists of only the
processing costs for handling access patterns that are suitable for disk. Another benefit of
Conquest is its ability to provide a smooth and cost effective transition from disk-based to
persistent-RAM-based storage. Unlike other memory file systems [McKusick et al. 1990;
Douglis et al. 1994; Wu and Zwaenepoel 1994], Conquest allows persistent RAM to assume
more file system responsibility as memory prices decline. Thus, it can achieve most of the
benefits of persistent RAM, without the high cost of RAM-only solutions. The experience of
designing and implementing Conquest offers several major lessons:
The handling of disk characteristics permeates file system design, even at levels above the
device layer. For example, the default VFS routines contain read ahead and buffer cache
mechanisms that add high and unnecessary overheads to low-latency main store. Because of
the need to bypass these mechanisms, building Conquest was much more difficult than we
initially expected. For example, certain internal storage routines anticipate the data structures
associated with disk handling. Reusing these routines either involves constructing memory-specific access routines from scratch or finding ways to invoke them with memory-based data
structures.
File systems that are optimized for disk are not suitable for an environment where memory is abundant. For example, ext2, reiserfs, and SGI XFS do not exploit the speed of RAM as well as had been anticipated. Disk-related optimizations impose high overheads on in-memory accesses.
The Transparent File System: We believe that a key barrier to the adoption of contributory storage systems is that contributing a large quantity of local storage interferes with the principal user of the machine.
To overcome this barrier, we introduce the Transparent File System (TFS). TFS provides
background tasks with large amounts of unreliable storage (all of the currently available space) without impacting the performance of ordinary file access operations. We show that
TFS allows a peer-to-peer contributory storage system to provide 40% more storage at twice
the performance when compared to a user-space storage mechanism. We analyze the impact
of TFS on replication in peer-to-peer storage systems and show that TFS does not appreciably
increase the resources needed for file replication.
Contributions:
This article presents the Transparent File System (TFS), a file system that can contribute
100% of the idle space on a disk while imposing a negligible performance penalty on the
local user. TFS operates by storing files in the free space of the file system so that they are
invisible to ordinary files. In essence, normal file allocation proceeds as if the system were
not contributing any space at all. We show in Section 5 that TFS imposes nearly no overhead
on the local user. TFS achieves this both by minimizing interference with the file system's block allocation policy and by sacrificing persistence for contributed space: normal files may
overwrite contributed space at any time. TFS takes several steps that limit this unreliability,
but because contributory applications are already designed to work with unreliable machines,
they behave appropriately in the face of unreliable files. Furthermore, we show that TFS does
not appreciably impact the bandwidth needed for replication. Users typically create little data
in the course of a day [Bolosky et al. 2000], thus the erasure of contributed storage is
negligible when compared to the rate of machine failures.
TFS is especially useful for replicated storage systems executing across relatively stable
machines with plentiful bandwidth, as in a university or corporate network. This environment
is the same one targeted by distributed storage systems such as FARSITE [Adya et al. 2002].
As others have shown previously, for high-failure modes such as wide-area Internet-based
systems, the key limitation is the bandwidth between nodes, not the total storage. The
bandwidth needed to replicate data after failures essentially limits the amount of storage the
network can use [Blake and Rodrigues 2003]. In a stable network, TFS offers substantially
more storage than dynamic, user-space techniques for contributing storage.
Using free disk space: Recognizing that the file system on a typical desktop is nearly half-empty, researchers
have investigated ways to make use of extra storage. FS2 [Huang et al. 2005] is a file system
that uses extra space on the disk for block-level replication to reduce average seek time. FS2
dynamically analyzes disk traffic to determine which blocks are frequently used together. It
then creates replicas of blocks so that the spatial locality on disk matches the observed
temporal locality. FS2 uses a policy that deletes replicas on-demand as space is needed. We
believe that it could benefit from a TFS-like allocation policy, where all replicas except the
primary one would be stored as transparent blocks. In this way, the entire disk could be used
for block replication. A number of peer-to-peer storage systems have been proposed that
make use of replication and free disk space to provide reliability. These include distributed
hash tables such as Chord [Morris et al. 2001] and Pastry [Rowstron and Druschel 2001a], as
well as complete file systems like the Chord file system
[Dabek et al. 2001], and Past [Rowstron and Druschel 2001b].
CA-NFS: A Congestion-Aware Network
File System
Alexandros Batsakis et al.
We develop a holistic framework for adaptively scheduling asynchronous requests in
distributed file systems. The system is holistic in that it manages all resources, including
network bandwidth, server I/O, server CPU, and client and server memory utilization. It
accelerates, defers, or cancels asynchronous requests in order to improve applicationperceived performance directly. We employ congestion pricing via online auctions to
coordinate the use of system resources by the file system clients so that they can detect
shortages and adapt their resource usage. We implement our modifications in the CongestionAware Network File System (CA-NFS), an extension to the ubiquitous network file system
(NFS). Our experimental result shows that CA-NFS results in a 20% improvement in
execution times when compared with NFS for a variety of workloads. CA-NFS introduces a
new dimension in resource management by implicitly managing and coordinating the usage
of the file system resources among all clients. It unifies fairness and priorities in a single
framework that assures that realizing optimization goals will benefit file system users, not the
file system servers.
Flash-Aware Cluster Allocation Method Based on Filename Extension for FAT File System
To support conventional file systems such as FAT, the NAND flash memory needs the FTL to
provide transparent block device emulation to the file system. However, the log-block FTL,
the most popularly used FTL, suffers from the performance decline in workloads that contain
many random writes. When the file system is not aware of the NAND flash, it often allocates
files fragmented in the NAND flash, and this causes many random writes. Consequently, to
improve the performance of the FTL, the file system must allocate files defragmented in the
NAND flash. In this paper, we propose a NAND flash-aware cluster allocation method for
FAT file system, named FECA (Flash-aware Extension-based Cluster Allocation). FECA exploits the following two observations. The first is that the effort to defragment small-sized files may not improve performance in all cases.
The second one is that there is a very strong correlation between the size and the filename
extension of files in most cases. Based on those observations, FECA predicts sizes of files by
using their extensions and determines the allocation policy for them. To evaluate the
effectiveness of FECA, we devise two defragmentation metrics considering the features of
the NAND flash. We prove that FECA outperforms previous methods in terms of both
metrics through extensive experiments. The results show that FECA improves the
performance by 10% and reduces the garbage collection frequency by up to 35% compared to
the previous methods.
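FECA's central decision, predicting a file's size class from its filename extension and choosing an allocation policy accordingly, can be sketched as follows; the extension table and policy names here are invented for illustration and are not taken from the paper.

    #include <stdio.h>
    #include <string.h>

    /* extensions whose files are usually large and worth defragmenting
     * (hypothetical table) */
    static const char *large_ext[] = { "mp3", "avi", "jpg", NULL };

    static const char *choose_policy(const char *name)
    {
        const char *dot = strrchr(name, '.');
        if (dot)
            for (int i = 0; large_ext[i]; i++)
                if (strcmp(dot + 1, large_ext[i]) == 0)
                    return "contiguous clusters (defragment)";
        /* small files: defragmenting buys little, allocate simply */
        return "first free cluster";
    }

    int main(void)
    {
        const char *files[] = { "song.mp3", "cache.tmp", "page.html" };
        for (int i = 0; i < 3; i++)
            printf("%-10s -> %s\n", files[i], choose_policy(files[i]));
        return 0;
    }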
The file system space is managed in a unit called cluster or block that is a group of
consecutive sectors of the disk [9]. In this paper, to prevent confusion with NAND block, we
use the term cluster only. We assume that a cluster corresponds to a page of the NAND flash.
There are two existing methods popularly used to allocate free cluster: SCA (Simple Cluster
Allocation), and GCA (Greedy Cluster Allocation). Figure 1 illustrates how these methods
work. We assume that the FTL is a log-block mapping scheme and the logical space of file
system aligns with NAND block of physical space. In the figure, logical space means the
status of the space shown to the file system, and physical space indicates the status of the
physical NAND block space. They are aligned with each other, but only physical space may
contain obsolete pages, which were not erased after invalidation. Each slot of logical and
physical space corresponds to a cluster and a page, respectively. As shown in this figure,
clusters 0-2, 6-8, and a-b of logical space were already allocated to some files. Note that the
file system sees clusters 3-5, and 9 as free clusters, but they are actually obsolete in physical
space. We now describe what will happen when files A and B, whose sizes are five clusters
and one cluster respectively, are written in each file system.
5.1 Experimental environments
We implemented the allocation methods on a real NAND flash-based embedded board. The
test board is equipped with S3C2413 (ARM926) MCU, 128MB SDRAM, and 1GB MLC
(Multi Level Cell) NAND Flash [14]. The access latency of the read, write, and erase
operation is 121us/page, 861us/page, and 1.5ms/block, respectively. The latency includes the
transfer time between the microprocessor and the page register in the NAND flash, as well as
transfer time between the page register and the NAND flash basic cell. We use the log-block
mapping FTL and the FAT-compatible file system. We set the cluster size as 2KB, so that 128
clusters are situated in one cluster group (NAND block). To speed up the experiments, we restrict the file system size to 512MB, but we believe that we would get similar results with a larger size.
In order to collect FDI and SDI, we devise IO control interfaces named IOCTL_FDI and
IOCTL_SDI. Upon the IOCTL_FDI request, the FDI value is returned by scanning the entire
files. For the IOCTL_SDI request, the SDI value is calculated by using cluster group bitmaps.
We target a smart phone that supports full Internet browsing, where read, write, and delete
operations happen frequently. To obtain the workload characteristic, we analyze several web
server workloads [15][16] and Microsoft Explorer Internet Temporary Files. As a result, most
files involved in the Internet browsing are under 256 KB, and 90 % of them are under 16 KB
as shown in figure 5. Besides, general systems are known to handle mostly small-sized files [7]
[17]. To reflect such workloads, we make two groups of files: the small-sized and the large-sized group. The former consists of files whose sizes are uniformly distributed between 1 byte and 16 KB. The latter consists of files whose sizes
6. CONCLUSION
Since FAT is widely used, it is necessary for CE devices to support FAT-compatible file
systems. However, due to the unique physical characteristics of the NAND flash memory
used in most CE devices, existing allocation methods optimized to disk-based storage
systems show inefficient behaviors when they are adopted by the NAND flash. In this paper,
we have proposed the flash-aware cluster allocation method optimized to FAT on the NAND
flash. Since the defragmentation is important for the NAND flash to improve the write
performance, we first have suggested the defragmentation policy taking the size of the file
into account. Then, by using the fact that the extension of the file is strongly correlated to the
size, we have proposed a flash-aware cluster allocation method based on the filename
extension, called FECA. To evaluate the effectiveness of FECA, we also have designed two
indices, FDI (File Defragmentation Index) and SDI (Space Defragmentation Index), that
represent the degree of the defragmentation by considering the features of the NAND flash.
Several experimental results have shown that FECA increases both FDI and SDI, and hence significantly improves the performance in comparison with SCA and GCA.
Request Bridging and Interleaving: Improving the Performance of Small Synchronous
Updates under Seek-Optimizing Disk Subsystems
Write-through caching in modern disk drives enables the protection of data in the event of power failures, as well as from certain disk errors, where the write-back cache does not. Host systems can achieve these benefits at the price of significant performance degradation, especially for small disk writes. We present new block-level techniques to address the
performance problem of write-through caching disks. Our techniques are strongly motivated
by some interesting results when the disk-level caching is turned off. By extending the
conventional request merging, request bridging increases the request size and amortizes the
inherent delays in the disk drive across more bytes of data. Like sector interleaving, request
interleaving rearranges requests to prevent the disk head from missing the target sector
position in close proximity, and thus reduces disk latency. We have evaluated our block-level
approach using a variety of I/O workloads and shown that it increases disk I/O throughput by
up to about 50%. For some real-world workloads, the disk performance is comparable or
even superior to that of using the write-back disk cache. In practice, our simple yet effective
solutions achieve better tradeoffs between data reliability and disk performance when applied
to write-through caching disks.
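The bridging idea can be pictured with a short sketch. This is a schematic illustration under assumed data structures (the request layout, the gap threshold, and the sector-read callback are all hypothetical), not the authors' kernel implementation; request interleaving, which reorders rather than merges requests, is omitted for brevity.

    /*
     * Schematic sketch of request bridging at the block layer: two writes
     * that are close but not contiguous become one larger request by reading
     * the intervening sectors and writing them back unchanged.
     */
    #include <string.h>

    #define SECTOR_SIZE  512
    #define BRIDGE_LIMIT 64   /* max gap (sectors) worth bridging; assumed tunable */

    struct request {
        unsigned long start;      /* first sector */
        unsigned long nsectors;   /* length in sectors */
        unsigned char *data;      /* payload, nsectors * SECTOR_SIZE bytes */
    };

    /* Returns 1 and fills 'merged' if a and b (a before b) were bridged.
     * 'buf' must hold the combined payload; read_sectors fetches the gap. */
    int bridge_requests(const struct request *a, const struct request *b,
                        struct request *merged, unsigned char *buf,
                        int (*read_sectors)(unsigned long start,
                                            unsigned long count,
                                            unsigned char *dst))
    {
        unsigned long gap = b->start - (a->start + a->nsectors);
        if (b->start < a->start + a->nsectors || gap > BRIDGE_LIMIT)
            return 0;   /* overlapping, or too far apart to be worth it */

        /* Fill the gap with its current contents so the single bridged
         * write does not corrupt the data between the two requests. */
        if (read_sectors(a->start + a->nsectors, gap,
                         buf + a->nsectors * SECTOR_SIZE) != 0)
            return 0;

        memcpy(buf, a->data, a->nsectors * SECTOR_SIZE);
        memcpy(buf + (a->nsectors + gap) * SECTOR_SIZE,
               b->data, b->nsectors * SECTOR_SIZE);

        merged->start    = a->start;
        merged->nsectors = a->nsectors + gap + b->nsectors;
        merged->data     = buf;
        return 1;
    }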
The proposed block I/O techniques, request bridging and interleaving, improve small-write
performance by efficiently merging and reordering requests. In multithreaded and
multiprocess environments, our techniques have proven especially effective at handling
concurrent I/O streams that produce a significant fraction of slow, small block writes in disk
drives. From a detailed analysis of block-level disk operations, we have identified the causes
of the performance degradation in write-through caching disks, which are otherwise a good
means of ensuring data integrity against power failures and some disk errors. In
write-intensive multiprocess or multithreaded environments, using a write-through cache
with request bridging and interleaving is more beneficial to both data integrity and system
performance than using a write-back cache. To conclude, our approach offers the
best-performing tradeoff between performance and data reliability for synchronous disk I/O
operations. We have implemented the proposed technique in the Linux kernel and conducted
experiments using several benchmarks. Our experiments show that the combination of these
techniques increases disk I/O throughput by up to about 50%. For some real-world
workloads, the disk performance was comparable or even superior to that of using a
write-back disk cache. The proposed technique has been designed to minimize possible
negative impact on multi-streamed I/O requests issued concurrently to a write-through disk;
that is, sequential single-stream writes and read-mostly workloads are out of scope. However,
for popular multicore CPUs and the rapidly increasing number of I/O-intensive parallel
applications running on such machines, we expect our approach to become even more
effective as long as the semantic gap caused by the narrow interface between disk I/O layers
persists.
FRASH: Exploiting Storage Class Memory in Hybrid File System for Hierarchical Storage
JAEMIN JUNG and YOUJIP WON
We develop a novel hybrid file system, FRASH, for storage-class memory and NAND flash.
Despite the promising physical characteristics of storage-class memory, its scale is an order
of magnitude smaller than that of current storage devices. This makes it less than desirable
for use as an independent storage device. We carefully analyze in-memory and on-disk file
system objects in a log-structured file system, and exploit both the memory and storage
aspects of storage-class memory to overcome the drawbacks of the current log-structured
file system. FRASH provides a hybrid view of storage-class memory: it harbors in-memory
data structures as well as on-disk structures. It provides nonvolatility to key data structures
that have been maintained in memory in a legacy log-structured file system. This approach
greatly improves mount latency and effectively resolves the robustness issue. By maintaining
the on-disk structure in storage-class memory, FRASH provides byte-addressability to file
system objects and per-page metadata, and consequently greatly improves I/O performance
compared to the legacy log-structured approach. While storage-class memory offers byte
granularity, it is still far slower than its DRAM counterpart. We develop a copy-on-mount
technique to overcome the access latency difference between main memory and
storage-class memory. Our file system reduces mount time by 92% and increases file system
I/O performance by 16%. In this work, we develop a hybrid file system, FRASH, for
storage-class memory and NAND flash. Once realized at the proper scale,
storage-class memory will clearly resolve significant issues in current storage and memory
systems. Despite all these promising characteristics, for the next few years, the scale of
storage-class memory devices will be an order of magnitude smaller (e.g., 1/1000) than the
current storage devices. We argue that storage-class memory should be exploited as a new
hybrid layer between main memory and storage, rather than positioning itself as a full
substitute for memory or storage. Via this approach, storage-class memory can complement
the physical characteristics of the two: that is, the volatility of main memory and the block
access granularity of storage. The key ingredient in this file system design is how to use
storage class memory in a system hierarchy. It can be mapped onto the main memory address
space. In this case, it is possible to provide nonvolatility to data stored in the respective
address range. On the other hand, storage-class memory can be used as part of the block
device. In this case, I/O speed will become faster, and it is possible that an I/O-bound
workload will become a CPU-bound workload. The data structures and objects to be
maintained in storage class memory should be selected very carefully, since storage-class
memory is still too small to accommodate all file system objects. In this work, we exploit
both the memory and storage aspects of the storage class memory. FRASH provides a hybrid
view of the storage-class memory. It harbors in-memory data as well as on-disk structures for
the file system. By maintaining on-disk structure in storage-class memory, FRASH provides
byte-addressability to on-disk file system objects and per-page metadata. The
contribution of the FRASH file system is threefold: (i) mount latency, which has been
regarded as a major drawback of the log-structured file system, is decreased by an order of
magnitude; (ii) I/O performance improves significantly via migrating an on-disk structure to
the storage-class memory layer; and (iii) by maintaining the directory snapshot and file tree in
the storage-class memory, the system becomes more robust against unexpected failures. In
summary, we successfully developed a state-of-the-art hybrid file system and showed that
storage class memory can be exploited effectively to resolve the various technical issues in
existing file systems.
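The copy-on-mount idea can be pictured with a toy sketch, assuming hypothetical structure names; FRASH's actual metadata layout is more involved. At mount, the persistent copy held in storage-class memory is copied into DRAM, served at DRAM speed, and flushed back when the file system is unmounted.

    /*
     * Toy illustration of copy-on-mount (not FRASH's actual code): hot file
     * system metadata kept persistently in SCM is copied into DRAM at mount,
     * served from DRAM, and written back on unmount.
     */
    #include <stdlib.h>
    #include <string.h>

    struct fs_meta {                    /* hypothetical hot metadata block */
        unsigned long inode_map[1024];
        unsigned long segment_usage[1024];
    };

    static struct fs_meta *scm_meta;    /* persistent copy, mapped from SCM */
    static struct fs_meta *dram_meta;   /* fast working copy in DRAM */

    int mount_meta(struct fs_meta *scm_region)
    {
        scm_meta  = scm_region;
        dram_meta = malloc(sizeof(*dram_meta));
        if (!dram_meta)
            return -1;
        memcpy(dram_meta, scm_meta, sizeof(*dram_meta));  /* copy-on-mount */
        return 0;
    }

    void unmount_meta(void)
    {
        memcpy(scm_meta, dram_meta, sizeof(*scm_meta));   /* flush back to SCM */
        free(dram_meta);
    }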
MFTL: A Design and Implementation for MLC Flash Memory Storage Systems
JEN-WEI HSIEH
NAND flash memory has gained its popularity in a variety of applications as a storage
medium due to its low power consumption, non-volatility, high performance, physical
stability, and portability. In particular, Multi-Level Cell (MLC) flash memory, which provides
a lower-cost and higher-density solution, has occupied the largest share of the NAND
flash-memory market. However, MLC flash memory also introduces new challenges: (1)
Pages in a block must be written sequentially. (2) Information indicating that a page has
become obsolete cannot be recorded in its spare area, due to the limit on the number of
partial programming operations. Since most applications access NAND flash memory through
the FAT file system,
this article designs an MLC Flash Translation Layer (MFTL) for flash-memory storage
systems which takes constraints of MLC flash memory and access behaviors of FAT file
system into consideration. A series of trace-driven simulations was conducted to evaluate the
performance of the proposed scheme. Although MFTL is designed for MLC flash memory
and FAT file system, it is applicable to SLC flash memory and other file systems as well. Our
experiment results show that the proposed MFTL could achieve a good performance for
various access patterns even on SLC flash memory.
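Constraint (1) can be illustrated with a small sketch: an FTL may track, per block, the next page that is allowed to be programmed and reject out-of-order writes. The structure and function names below are ours, not MFTL's.

    /*
     * Illustration of the MLC constraint that pages within a block must be
     * programmed in ascending order: track the next programmable page per
     * block and refuse out-of-order programs.
     */
    #include <stdint.h>

    #define PAGES_PER_BLOCK 128

    struct block_state {
        uint16_t next_page;   /* next page index that may be programmed */
    };

    /* Returns 0 on success, -1 if the write would violate sequential order. */
    int program_page(struct block_state *blk, uint16_t page_index)
    {
        if (page_index != blk->next_page || page_index >= PAGES_PER_BLOCK)
            return -1;        /* out-of-order program: forbidden on MLC */
        /* ... issue the actual NAND program command here ... */
        blk->next_page++;
        return 0;
    }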
2.3. Data Structures
There are two major data structures to implement address translation: a direct map and an
inverse map [Gal and Toledo 2005]. The direct map is the fundamental data structure of an
FTL, which maps a logical address to a physical address. The translation process can be as
simple as an array lookup, although it may also involve searching a tree. At the same time, a
logical address may need to go through several passes of translation to get its corresponding
physical address. The direct map usually resides in SRAM or the flash memory itself, but for
the sake of efficiency, at least the referenced parts of the direct map should be kept in SRAM.
The inverse map is essentially made up of identifiers stored with flash pages or blocks, and it
is always kept in flash memory. When a physical page or block of flash memory is scanned,
in order to identify valid data during garbage collection or to recover the FTL after a system
failure, its logical address can be obtained directly from the inverse map. In brief, the inverse map
guarantees that we can always identify the logical address of a page or block, and the direct
map provides fast address translation. It should be noted that an FTL may not necessarily
employ a direct map and that not all FTLs store a complete mirror of the direct map in
permanent storage.
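A minimal sketch of the two maps is given below, assuming a simple page-level FTL with illustrative sizes; real FTLs differ in mapping granularity and in how much of the direct map is cached in SRAM.

    /*
     * Minimal sketch of the two FTL mapping structures described above.
     * The direct map gives O(1) logical-to-physical lookup; the inverse map
     * is the per-physical-page tag (kept in each page's spare area) from
     * which the direct map can be rebuilt by scanning flash.
     */
    #include <stdint.h>

    #define NUM_LOGICAL_PAGES  4096   /* illustrative sizes */
    #define NUM_PHYSICAL_PAGES 4096
    #define INVALID_PPN        0xFFFFFFFFu

    /* Direct map: logical page number -> physical page number. */
    static uint32_t direct_map[NUM_LOGICAL_PAGES];

    /* Inverse map: metadata persisted with each physical page. */
    struct spare_area {
        uint32_t lpn;     /* owning logical page */
        uint8_t  valid;   /* cleared when the page is superseded */
    };
    static struct spare_area inverse_map[NUM_PHYSICAL_PAGES];

    /* Fast path: translation is a single array lookup. */
    uint32_t lookup(uint32_t lpn)
    {
        return direct_map[lpn];
    }

    /* Recovery path: rebuild the direct map from the inverse map, which is
     * exactly the role the text above ascribes to it. */
    void rebuild_direct_map(void)
    {
        for (uint32_t l = 0; l < NUM_LOGICAL_PAGES; l++)
            direct_map[l] = INVALID_PPN;
        for (uint32_t p = 0; p < NUM_PHYSICAL_PAGES; p++)
            if (inverse_map[p].valid)
                direct_map[inverse_map[p].lpn] = p;
    }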
Conclusion
Flash is a promising memory technology that has been maturing for decades. Since in-place
updates are no longer supported, address translation technology becomes an indispensable
approach to make flash memory a new layer in the modern storage hierarchy. FTL is a
software layer built in the firmware or in the system software that implements address
translation, garbage collection, wear leveling, and so forth, and wraps the flash memory into a
normal block device like magnetic disks. Unfortunately, many of these technologies are either
undocumented or only described in patents. This survey provides a broad overview of some
typical state-of-the-art address translation technologies described in patents, journals, and
conference proceedings. We hope that this survey will facilitate future research and
development of flash-based applications.
This article proposed a multi-level cell (MLC) flash memory translation layer, MFTL, for
MLC flash memory storage systems. It provides a transparent way for existing file systems to
directly access large-capacity MLC flash memory. Observing that most existing user
applications access NAND flash memory under FAT file system, this article takes constraints
of MLC flash memory and access behaviors of FAT file system into consideration. The
proposed MFTL scheme treats different access behaviors in an adaptive manner. It adopts a
hybrid mapping approach to balance performance against the RAM requirement. Since data
integrity is an important issue for storage systems, MFTL also provides an effective and
efficient mechanism for recovery from a system crash. An extensive set of
experiments has been conducted to evaluate the performance of the proposed MFTL. Our
experiment results show that the proposed scheme achieves a good performance in terms of
extra page writes and block erasures while the RAM requirement is largely reduced compared
with existing schemes.
SCMFS: A File System for Storage Class Memory and its Extensions
Modern computer systems have been built around the assumption that persistent storage is
accessed via a slow, block-based interface. However, emerging nonvolatile memory
technologies (sometimes referred to as storage class memory (SCM)), are poised to
revolutionize storage systems. The SCM devices can be attached directly to the memory bus
and offer fast, fine-grained access to persistent storage. In this article, we propose a new file
system, SCMFS, which is specially designed for Storage Class Memory. SCMFS is
implemented on the virtual address space and utilizes the existing memory management
module of the operating system to help manage the file system space. As a result, we largely
simplified the file system operations of SCMFS, which allowed us to better explore the
performance gain from SCM. We have implemented a prototype in Linux and evaluated its
performance through multiple benchmarks. The experimental results show that SCMFS
outperforms other memory resident file systems, tmpfs, ramfs and ext2 on ramdisk, and
achieves about 70% of memory bandwidth for file read/write operations. In this article, we
propose a new file system, SCMFS, which is specifically designed for SCM. For
compatibility, this file system exports the same interfaces as regular file systems do, so that
all existing applications can work on it. In this file system, we aim to minimize the CPU
overhead of file system operations. We build our file system on the virtual memory space
and utilize the memory management unit (MMU) to map file system addresses to physical
addresses on SCM. The layouts in both physical and virtual address
spaces are very simple. We also keep the space contiguous for each file in SCMFS, to
simplify the process of handling read/write requests in the file system. We will show, through
results based on multiple benchmarks, that the simplicity of SCMFS makes it easy to
implement and improves the performance. In this article, we focus on building a file system
on a single processor. Our approach can be extended to larger, multiprocessor, systems
through distributed shared memory.
The primary contributions of this article are as follows.
This article proposes a new, low-overhead file system for Storage Class Memory. The
proposed file system, SCMFS, largely simplifies regular file operations by leveraging the
existing memory management module of the operating system. We employ super pages
within our file system to relieve the pressure on system resources such as TLB, to further
improve performance. We show that SCMFS outperforms tmpfs, ramfs, and ext2 on ramdisk,
and achieves about 70% of memory bandwidth for file read/write operations. The
remainder of the article is organized as follows. We will discuss related work in Section 2. In
Section 3, we provide design and implementation details of SCMFS, which are then
evaluated in Section 4. We present further extensions of SCMFS in Section 5. Section 6
concludes the article. A number of file systems have been designed for flash devices
(YAFFS; Woodhouse [2011]). BPFS [Condit et al. 2009] is proposed as a file system designed for
nonvolatile byte-addressable memory, which uses shadow paging techniques to provide fast
and consistent updates. Our file system aims to simplify the design and eliminate unnecessary
overheads to improve the performance. DFS [Josephson et al. 2010] is the file system most
similar to our own. DFS incorporates the functionality of block management in the device
driver and firmware to simplify the file system, and also keeps the files contiguous in a huge
address space. It is designed for a PCIe-based SSD device by Fusion-io, and relies on specific
features in the hardware. Since there is no Linux kernel level implementation of BPFS, we
did not compare our performance to BPFS. In this work, we aim to design a file system for
SCM devices. With traditional persistent storage devices, the overhead caused by I/O latency
is much higher than that of the file system layer itself, so storage system performance
usually depends on the device's characteristics and the performance of the I/O scheduler.
However, in the case where the storage device is directly attached to the memory bus, the
storage device will share some critical system resources with the main memory. They will
share the bandwidth of the memory bus, the CPU caches, and TLBs for both instructions and
data. We believe that the lower complexity of the file system can reduce the CPU overhead in
the storage system and thus improve the total performance. Our design is motivated by the
need to minimize the number of operations required to carry out file system requests. SCMFS
simplifies these data structures by allocating contiguous logical address space within each
file. We build the file system on virtual address space, which can be larger than the physical
address space of the SCM. Each file is allocated to a contiguous virtual address space while
the file data may be stored in physical pages that are not contiguous. In SCMFS, with the
contiguity inside each file, we do not need complicated data structures to keep track of logical
address space, and simply store the start address and the size for each file. This mechanism
significantly simplifies the processing of read/write requests. To locate the requested data,
we simply add the offset to the start address of the file. The actual physical location
of the data is available through the page mapping data structures, again leveraging the system
infrastructure.
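A sketch of this addressing scheme is shown below, with hypothetical structure names: because each file's virtual address range is contiguous, a read needs only a bounds check and one address computation, while the MMU resolves the (possibly non-contiguous) physical SCM pages.

    /*
     * Sketch of SCMFS-style file addressing: each file occupies a contiguous
     * virtual address range, so a read is just start + offset; page tables
     * handle the mapping to physical SCM pages.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct scmfs_inode {            /* hypothetical in-memory inode */
        uint8_t *start;             /* start of the file's virtual range */
        size_t   size;              /* current file size in bytes */
    };

    /* Read 'len' bytes at 'offset' into 'buf'; returns bytes copied. */
    size_t scmfs_read(const struct scmfs_inode *ino, size_t offset,
                      void *buf, size_t len)
    {
        if (offset >= ino->size)
            return 0;
        if (len > ino->size - offset)
            len = ino->size - offset;
        /* One address computation; no extent tree or block list to walk. */
        memcpy(buf, ino->start + offset, len);
        return len;
    }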
REFERENCES
1. Gold, D. E., & Kuck, D. J. (1974). A model for masking rotational latency by dynamic
disk allocation. Communications of the ACM, 17, 278–288. doi:10.1145/360980.361006
2. Morgan, H. L. (1974). Optimal space allocation on disk storage devices.
Communications of the ACM, 17(3), 139–142. doi:10.1145/360860.360867
3. Shen, K. K., & Peterson, J. L. (1974). A weighted buddy method for dynamic storage
allocation. Communications of the ACM, 17, 558–562. doi:10.1145/355620.361164
4. Burton, W. (1976). A buddy system variation for disk storage allocation.
Communications of the ACM, 19(7), 416–417. doi:10.1145/360248.360259
5. Peterson, J. L., & Norman, T. A. (1977). Buddy systems. Communications of the ACM,
20, 421–431. doi:10.1145/359605.359626
6. Du, H. C., & Sobolewski, J. S. (1982). Disk allocation for Cartesian product files on
multiple-disk systems. ACM Transactions on Database Systems, 7(1), 82–101.
doi:10.1145/319682.319698
7. Hecht, M. S., & Gabbe, J. D. (1983). Shadowed management of free disk pages with a
linked list. ACM Transactions on Database Systems, 8(4), 503–514.
doi:10.1145/319996.320002
8. Walker, B., Popek, G., English, R., Kline, C., & Thiel, G. (1983). The LOCUS
distributed operating system. Proceedings of the Ninth ACM Symposium on Operating
Systems Principles (SOSP '83), 49–70.
9. Koch, P. D. L. (1987). Disk file allocation based on the buddy system. ACM
Transactions on Computer Systems, 5(4), 352–370. doi:10.1145/29868.29871
10. Levy, E., & Silberschatz, A. (1990). Distributed file systems: Concepts and examples.
ACM Computing Surveys, 22(4), 321–374.
11. Abdel-Ghaffar, K. A. S., & El Abbadi, A. (1993). Optimal disk allocation for partial
match queries. ACM Transactions on Database Systems, 18(1), 132–156.
doi:10.1145/151284.151288
12. Krieger, O., & Stumm, M. (1997). HFS: A performance-oriented flexible file system
based on building-block compositions. ACM Transactions on Computer Systems,
15(3), 286–321. doi:10.1145/263326.263356
13. Prabhakar, S., Agrawal, D., & El Abbadi, A. (1998). Efficient disk allocation for fast
similarity searching. Proceedings of the Tenth ACM Symposium on Parallel Algorithms
and Architectures (SPAA '98), 78–87.
14. Hess, C. K., & Campbell, R. H. (2003). An application of a context-aware file system.
Personal and Ubiquitous Computing, 7(6), 339–352. doi:10.1007/s00779-003-0250-y
15. Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google file system. ACM
SIGOPS Operating Systems Review, 37(5), 29–43. doi:10.1145/1165389.945450
16. Gal, E., & Toledo, S. (2005). Algorithms and data structures for flash memories. ACM
Computing Surveys, 37(2), 138–163. doi:10.1145/1089733.1089735
17. Sivathanu, M., Prabhakaran, V., Arpaci-Dusseau, A. C., & Arpaci-Dusseau, R. H.
(2005). Improving storage system availability with D-GRAID. ACM Transactions on
Storage, 1(2), 133–170. doi:10.1145/1063786.1063787
18. Kang, S., & Reddy, A. L. N. (2006). An approach to virtual allocation in storage
systems. ACM Transactions on Storage, 2(4), 371–399. doi:10.1145/1210596.1210597
19. Wang, A.-I. A., Kuenning, G., Reiher, P., & Popek, G. (2006). The Conquest file
system. ACM Transactions on Storage, 2(3), 309–348. doi:10.1145/1168910.1168914
20. Agrawal, N., Bolosky, W. J., Douceur, J. R., & Lorch, J. R. (2007). A five-year study
of file-system metadata. ACM Transactions on Storage, 3(3).
21. Cipar, J., Corner, M. D., & Berger, E. D. (2007). Contributing storage using the
transparent file system. ACM Transactions on Storage, 3(3), Article 12.
doi:10.1145/1288783.1288787
22. Batsakis, A., Burns, R., Kanevsky, A., Lentini, J., & Talpey, T. (2009). CA-NFS: A
congestion-aware network file system. ACM Transactions on Storage, 5(4), 1–24.
doi:10.1145/1629080.1629085
23. Thomasian, A., & Blaum, M. (2009). Higher reliability redundant disk arrays. ACM
Transactions on Storage, 5(3), 1–59. doi:10.1145/1629075.1629076
24. Ryu, S., Lee, C., Yoo, S., & Seo, S. (2010). Flash-aware cluster allocation method
based on filename extension for FAT file system. Proceedings of the 2010 ACM
Symposium on Applied Computing (SAC '10), 502. doi:10.1145/1774088.1774192
25. Jung, J., Won, Y., Kim, E., Shin, H., & Jeon, B. (2010). FRASH: Exploiting storage
class memory in hybrid file system for hierarchical storage. ACM Transactions on
Storage, 6(1), 1–25. doi:10.1145/1714454.1714457
26. Shin, D. I., Yu, Y. J., Kim, H. S., Eom, H., & Yeom, H. Y. (2011). Request bridging
and interleaving: Improving the performance of small synchronous updates under
seek-optimizing disk subsystems. ACM Transactions on Storage, 7(2), 1–31.
doi:10.1145/1970348.1970349
27. Wu, X., & Reddy, A. L. N. (2011). SCMFS: A file system for Storage Class Memory.
Proceedings of the 2011 International Conference for High Performance Computing,
Networking, Storage and Analysis (SC '11), 1–11. doi:10.1145/2063384.2063436
28. Hsieh, J.-W., Wu, C.-H., & Chiu, G.-M. (2012). MFTL: A design and implementation
for MLC flash memory storage systems. ACM Transactions on Storage, 8(2), 1–29.
doi:10.1145/2180905.2180908
29. Paulo, J., & Pereira, J. (2014). A survey and classification of storage deduplication
systems. ACM Computing Surveys, 47(1), 1–30. doi:10.1145/2611778
Microsoft was granted a series of patents for key parts of the FAT file system in the 1990s.
All four patents relate to the long-filename extension to FAT, first seen in Windows 95.
Limitations when using the FAT32 file system with Windows XP:
Clusters cannot be 64 kilobytes (KB) or larger. If clusters are 64 KB or larger, some programs
(such as Setup programs) may incorrectly calculate disk space.
A FAT32 volume must contain a minimum of 65,527 clusters. You cannot increase the cluster
size on a volume that uses the FAT32 file system so that it contains fewer than 65,527
clusters.
The maximum disk size is approximately 8 terabytes when you take into account the
following variables: The maximum possible number of clusters on a FAT32 volume is
268,435,445, and there is a maximum of 32 KB per cluster, along with the space required for
the file allocation table (FAT).
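The quoted bound follows directly from the arithmetic, as this quick check shows (the product covers the data area only; the FAT itself takes additional space):

    /* Worked check of the ~8 TB FAT32 bound quoted above:
     * 268,435,445 clusters x 32 KB per cluster. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long clusters      = 268435445ULL;
        unsigned long long cluster_bytes = 32ULL * 1024;        /* 32 KB */
        unsigned long long total = clusters * cluster_bytes;

        /* Prints 8796092661760 bytes, i.e. just under 8 TiB (2^43 bytes). */
        printf("%llu bytes = %.6f TiB\n", total, total / 1099511627776.0);
        return 0;
    }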
You cannot decrease the cluster size on a FAT32 volume so that the size of the FAT is larger
than 16 megabytes (MB) minus 64 KB.
You cannot format a volume larger than 32 gigabytes (GB) in size using the FAT32 file
system during the Windows XP installation process. Windows XP can mount and support
FAT32 volumes larger than 32 GB (subject to the other limits), but you cannot create a FAT32
volume larger than 32 GB by using the Format tool during Setup. If you need to format a
volume that is larger than 32 GB, use the NTFS file system to format it. Another option is to
start from a Microsoft Windows 98 or Microsoft Windows Millennium Edition (Me) Startup
disk and use the Format tool included on the disk.
FILE SYSTEM

Table 3 Comparison of File Systems

File System | Operating System | Characteristics
FAT32 | Currently, FAT32 can be used only in Windows 98 and Windows 95 OSR2. | Most operating systems (MS-DOS, Windows 98, Windows NT, OS/2, and UNIX) are designed to implement and use it.
The largest possible file for a FAT32 drive is 4 GB minus 2 bytes. Win32-based applications
can open files this large without special handling. However, non-Win32-based applications
must use Int 21h Function 716Ch (FAT32) with the EXTENDED_SIZE (1000h) open flag.
The FAT32 file system uses 4 bytes per cluster within the file allocation table. This differs
from the FAT16 file system, which uses 2 bytes per cluster, and the FAT12 file system,
which uses 1.5 bytes per cluster within the file allocation table.
Note that the high 4 bits of the 32-bit values in the file allocation table for FAT32 are reserved
and are not part of the cluster number. Applications that directly read a FAT32 file allocation
table must mask off these bits and preserve them when writing new values.
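For illustration, the masking and preservation can be written as below; 0x0FFFFFFF is the standard FAT32 cluster-number mask, and the helper names are ours:

    /* FAT32 entries are 32 bits, but only the low 28 bits form the cluster
     * number; the high 4 bits are reserved and must be preserved on writes,
     * as the text above requires. */
    #include <stdint.h>

    #define FAT32_CLUSTER_MASK 0x0FFFFFFFu

    static inline uint32_t fat32_get_cluster(uint32_t entry)
    {
        return entry & FAT32_CLUSTER_MASK;        /* mask off reserved bits */
    }

    static inline uint32_t fat32_set_cluster(uint32_t old_entry, uint32_t next)
    {
        /* Keep the reserved high 4 bits exactly as they were. */
        return (old_entry & ~FAT32_CLUSTER_MASK) | (next & FAT32_CLUSTER_MASK);
    }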
Table 10.2 provides a comparison of FAT16 and FAT32 cluster sizes according to drive size.

Table 10.2 Cluster sizes of FAT16 and FAT32

Drive size        | FAT16 cluster size | FAT32 cluster size
256 MB – 511 MB   | 8 KB               | Not supported
512 MB – 1023 MB  | 16 KB              | 4 KB
1024 MB – 2 GB    | 32 KB              | 4 KB
2 GB – 8 GB       | Not supported      | 4 KB
8 GB – 16 GB      | Not supported      | 8 KB
16 GB – 32 GB     | Not supported      | 16 KB
> 32 GB           | Not supported      | 32 KB
Understanding FAT32
FAT32 provides the following enhancements over previous implementations of the FAT file
system:
Uses space more efficiently. FAT32 uses smaller clusters (4 KB clusters for drives up
to 8 GB in size), resulting in 10 to 15% more efficient use of disk space relative to
large FAT drives, and reducing the resources the computer needs to operate.
More robust. FAT32 has the ability to relocate the root directory and use the backup
copy of the FAT instead of the default copy. In addition, the boot record on FAT32
drives has been expanded to include a backup of critical data structures. This means
that FAT32 drives are less susceptible to a single point of failure than existing FAT
volumes.
Programs load up to 50% faster. FAT32's smaller cluster size enables the new and
improved Disk Defragmenter to optimally locate the portions of an application and its
supporting files needed at the time it is loaded.
All of Microsoft's bundled disk utilities (Format, FDISK, Defrag, MS-DOS and Windows
ScanDisk, and DriveSpace) have been revised to work with FAT32.
Important: When the Drive Converter Wizard is done, the Disk Defragmenter utility runs. It
is important that you let Disk Defragmenter run to completion after converting to FAT32. Not
defragmenting the disk after converting to FAT32 will result in an even less efficient and
slower computer than before the conversion.
You cannot dual boot Windows 98 and Windows NT 4.0 if you use FAT32. Windows NT 4.0
cannot access or boot from a FAT32 drive.