Ever-increasing research and development in science and engineering relies on
simulations and on the analysis of observational data, both of which require
high-performance computing (HPC) systems. HPC systems have already reached
peta-scale in many deployments, and efforts are underway to design and scale
systems to exa-scale. A
typical large-scale data-intensive application may use several layers of
software. For example, an operational weather
modeling application may use Parallel netCDF (PnetCDF) for data representation and storage, which
in turn may be implemented in parallel using MPI-IO for portability, which may
be layered on top of a parallel file system on a cluster.
An I/O delegate software system uses a set of processes as gateways to the file system, carrying out I/O tasks on behalf of an application.
The goal of this approach is to run
a caching layer between the I/O middleware
and the user application space.
Collaborative Caching
Aiming to improve the performance of non-contiguous, independent I/O, we have developed a
distributed file caching mechanism in the I/O delegates. Small, non-contiguous
I/O requests are commonly seen in production applications. ROMIO, a popular
MPI-IO implementation developed at Argonne National Laboratory, uses a
two-phase I/O strategy to reorganize these requests into large, contiguous
ones. This strategy has demonstrated very successful for many I/O patterns.
However, two-phase I/O is only applicable for collective I/O functions and
collective I/O requires all processes open the shared file to synchronously participate
in the I/O call, which may introduce process idle time. Without process
synchronization, the MPI independent I/O functions have even less opportunity
for better performance. Our design of using data caching at delegate side can
reduce the performance gap between MPI independent and collective I/O. The idea
is to replace the data reorganization among the application processes as
currently done in the collective I/O with the data redistribution between
application processes and the delegates. With the help of caching , the small,
noncontiguous requests from independent I/O can be first buffered at the
delegates. Once filled by different and successive independent requests, the
cache pages will be flushed to the file system. This caching mechanism not only
inherits the traditional data caching benefit of fast read-write operations for
repeated access pattern, but also improves the performance for write-only
operations which represent the majority I/O pattern in today's large-scale
simulation applications.
I/O Delegate File View Coordination
One of the obstacles to high-performance parallel I/O is the overhead of the data consistency control carried out by the underlying file system. Because the file system protects data consistency through file locking, concurrent I/O operations can be serialized by lock conflicts. It is therefore important that I/O requests be reorganized in our delegate system so that lock conflicts are minimized or eliminated entirely. We will develop a method that assigns disjoint file regions to the I/O delegates, so that lock conflicts and overlapping accesses are resolved within the delegate system instead of being passed to the file system, where they would incur a much higher penalty. The partitioning will take into account the underlying file system's striping configuration, such as the number of I/O servers across which files are striped.