Welcome to MACSio Documentation

Building MACSio

MACSio’s design is described in detail in an accompanying design document.

A key aspect about MACSio’s design relevant to the installation process is that MACSio is composed of a growing list of I/O plugins. Each plugin implements one or more strategies for marshalling MACSio’s data object(s) between primary (volitile) and secondary (non-volitile) storage (e.g. memory and disk). An I/O strategy may involve the use of one or more strategy-specific I/O libraries. Therefore, each plugin often introduces additional, but optional, build dependencies. A given installation of MACSio can include all or just a subset of the available plugins. This is all determined at MACSio build time.

Installing via Spack

The easiest way to build MACSio may be to use spack. In theory, the steps involved are as simple as

% git clone https://github.com/spack/spack.git
% . spack/share/spack/setup-env.sh
% spack install macsio

Note, however, that these few steps may install a lot of software dependencies, many which you may not ordinarily expect. The operation may take more than an hour to complete and the resulting package installations will be hashed according to Spack’s hashing specifications which may impact any subsequent software development activity you intend to do with the resulting MACSio installation.

Nonetheless, Spack is often the simplest way to install MACSio.

The Spack installation of MACSio attempts to build all of MACSio’s plugins.

Installing Manually

MACSio’s only required dependency is json_cwx. MPI is not even a required dependency. However, building without MPI will in all likelihood result in a MACSio installation that isn’t very useful in assessing I/O performance at scale. Building without MPI, however, is useful in debugging basic functionality of an installation.

MACSio has several optional dependencies, each one associated with a given plugin. For example, the Exodus plugin requires the Exodus and netCDF libraries which may, in turn, require the HDF5 library. The Silo plugin requires the Silo library and, optionally, the HDF5 library.

For dependencies associated with each plugin, see the CMakeLists.txt file in the plugin directory for telltail notes about plugin dependencies.

Once json-cwx and any optional dependencies have been successfully installed, CMake is used to build MACSio and any of the desired plugins (builds with silo by default)

% mkdir build
% cd build
% cmake -DCMAKE_INSTALL_PREFIX=[desired-install-location] \
      -DWITH_JSON-CWX_PREFIX=[path to json-cwx] \
      -DWITH_SILO_PREFIX=[path to silo] ..
% make
% make install
NOTE: Some options for the cmake line:
  • Build docs: -DBUILD_DOCS=ON

  • Build HDF5 Plugin: -DENABLE_HDF5_PLUGIN=ON -DWITH_HDF5_PREFIX=[path to hdf5]

  • Build TyphonIO Plugin: -DENABLE_TYPHONIO_PLUGIN=ON -DWITH_TYPHONIO_PREFIX=[path to typhonio]

  • Build PDB Plugin: -DENABLE_PBD_PLUGIN=ON

  • Build Exodus Plugin: -DENABLE_EXODUS_PLUGIN=ON -DWITH_EXODUS_PREFIX=[path to exodus]

Although MACSio is C Language, at a minimum it must be linked using a C++ linker due to its use of non-constant expressions in static initializers to affect the static plugin behavior. However, its conceivable that some C++’isms have crept into the code causing warnings or outright errors with some C compilers. If you encounter such a situation, please file an issue for it so we are aware of the issue and can fix it.

Note that the list of available plugins is determined by which plugins are compiled and linked with MACSio’s main. So, plugins are determined at link-time, not run-time.

Using MACSio

MACSio is probably very different from many other I/O benchmarking tools you may be familiar with. To orient yourself, it may be useful to read the first sections of the original design document.

In particular, in MACSio_ speak, when we talk about I/O requests, request sizes, frequencies, etc., we speak about them in terms of the operations of a real application performing dumps of its mesh and field data.

By default, MACSio’s command-line arguments are designed to maintain constant per-task I/O workload as task count is varied. This means MACSio, by default, exhibits weak scaling behavior. This does not mean, however, that strong scaling scenarios cannot also be handled. It means only that extra work is involved in constructing command-line arguments to ensure a strong scaling goal is achieved if that is desired.

MACSio has a large number of command-line arguments. In addition, each plugin may define additional command-line arguments. Full documentation of all MACSio’s command-line arguments and their meaning can be obtained with the command

% ./macsio --help | more

Be ready for a lot of output!

Here, we will describe only some of the basic arguments necessary to do initial testing that MACSio is installed correctly and to scale to large sizes.

All command-line arguments specified after the keyword argument --plugin_args are passed to the plugin and not interpreted by MACSio’s main.

Summary of Key Command Line Arguments

MACSio has many command-line arguments and that number is only likely to grow with time. We discuss here, only those the most relevant to initially getting started running MACSio tests.

MACSio is different from I/O benchmarking tools because it constructs and marshals data as real data objects commonly used in scientific computing applications. All of its command-line arguments are designed in these terms and one has to understand how those choices effect I/O workload created by MACSio.

In the descriptions below, default values for all command-line arguments are indicated in square brackets.

–interface--interface %s [miftmpl]

Specify the name of the interface (e.g. plugin) to be tested. Use keyword list to print a list of all known interface names and then exit. Examples:

  • Get list of available plugins

% ./macsio --interface list
List of available I/O-library plugins...
"miftmpl", "hdf5", "silo", "typhonio"
  • Use the HDF5 plugin for a given test

% ./macsio --interface hdf5
–parallel_file_mode--parallel_file_mode %s %d [MIF 4]

Specify the parallel file mode. There are several choices. Not all parallel modes are supported by all plugins. Use ‘MIF’ for Multiple Independent File (MIF) mode and then also specify the number of files. Or, use ‘MIFFPP’ for MIF mode and one file per task and where macsio uses known task count. Use ‘MIFOPT’ for MIF mode and let MACSio determine an optimum file count based on heuristics. Use ‘SIF’ for SIngle shared File mode. If you also give a file count for SIF mode, then MACSio will perform a sort of hybrid combination of MIF and SIF modes. It will produce the specified number of files by grouping tasks in the the same way MIF does, but I/O within each group will be to a single, shared file using SIF mode and a subsetted communicator. When using SIF parallel mode, be sure you are running on a true parallel file system (e.g. GPFS or Lustre).

–part_type--part_type %s [rectilinear]

Options are ‘uniform’, ‘rectilinear’, ‘curvilinear’, ‘unstructured’ and ‘arbitrary’. Generally, this option impacts only the I/O worload associated with the mesh object itself and not any variables defined on the mesh. However, not all I/O libraries (or their associated MACSio plugins) support all mesh types and when making comparisons it is important to have the option of specifying various mesh types.

–part-dim--part_dim %d [2]

Spatial dimension of mesh parts; 1, 2, or 3. In most cases, 2 is a good choice because it makes downstream visualization of MACSio data easier and more natural. While MACSio is designed such that we would not ordinarily expect I/O workload to be substantially different for different spatial dimensions, this isn’t always known to be true for all possible plugins ahead of time.

–part_size--part_size %d [80000]

Per-task mesh part size. This becomes the nominal I/O request size used by each task when marshaling data. A following B | K | M | G character indicates ‘B’ytes, ‘K’ilo-, ‘M’ega- or ‘G’iga- bytes representing powers of either 1000 or 1024 depending on the selected units prefix system. With no size modifier character, ‘B’ytes is assumed. Mesh and variable data is then sized by MACSio to hit this target byte count in I/O requests. However, due to constraints involved in creating valid mesh topology and variable data with realistic variation in features (e.g. zone- and node-centering), this target byte count is hit exactly for only the most commonly used objects and approximately for other objects.

–avg_num_parts--avg_num_parts %f [1]

The average number of mesh parts per task. Non-integral values are acceptable. For example, a value that is half-way between two integers, K and K+1, means that half the task have K mesh parts and half have K+1 mesh parts, a typical scanrio for multi-physics applications. As another example, a value of 2.75 here would mean that 75% of the tasks get 3 parts and 25% of the tasks get 2 parts. Note that the total number of parts is this number multiplied by the task count. If the result of that product is non-integral, it will be rounded and a warning message will be generated.

–vars_per_part--vars_per_part %d [20]

Number of mesh variables on each part. This controls the number of I/O requests each task makes to complete a given dump. Typical physics simulations run with anywhere from just a few effectively to several hundred mesh variables. Note that the choice in mesh part_type sets a lower bound on the effective number of mesh variables marshaled by MACSio due to the storage involved for the mesh coordinate and topology data alone. For example, for a uniform mesh this lower bound is effectively zero because there is no coordinate or topology data for the mesh itself. This is also almost true for rectilinear meshes. For curvilinear mesh the lower bound is the number of spatial dimensions and for unstructured mesh it is the number of spatial dimensions plus 2^number of topological dimensions.

Note

uniform vs. rectilinear not fully defined here.

–num_dumps--num_dumps %d [10]

Total number of dumps to marshal

–dataset_growth--dataset_growth %f [1]

A multiplier factor by which the volume of data will grow between dump iterations If no value is given or the value is <1.0 no dataset changes will take place.

–meta_type--meta_type %s [tabular]

Specify the type of metadata objects to include in each main dump. Options are ‘tabular’ or ‘amorphous’. For tabular type data, MACSio will generate a random set of tables of somewhat random structure and content. For amorphous, MACSio will generate a random hierarchy of random type and sized objects.

–meta_size--meta_size %d %d [10000 50000]

Specify the size of the metadata objects on each task and separately, the root (or master) task (MPI rank 0). The size is specified in terms of the total number of bytes in the metadata objects MACSio creates. For example, a type of tabular and a size of 10K bytes might result in 3 random tables; one table with 250 unnamed records where each record is an array of 3 doubles for a total of 6000 bytes, another table of 200 records where each record is a named integer value where each name is length 8 chars for a total of 2400 bytes and a 3rd table of 40 unnamed records where each record is a 40 byte struct comprised of ints and doubles for a total of 1600 bytes.

–compute_work_intensity--compute_work_intensity %d [0]

Add some compute workload (e.g. give the tasks something to do) between I/O dumps. There are three levels of ‘compute’ that can be performed as follows:

  • Level 1: Perform a basic sleep operation (this is the default)

  • Level 2: Perform some simple FLOPS with randomly accessed data

  • Level 3: Solves the 2D Poisson equation via the Jacobi iterative method

This input is intended to be used in conjunection with –compute_time which will roughly control how much time is spent doing work between dumps.

–time_randomize--time_randomize [0]

Make randomization in MACSio vary from dump to dump within a given run and from run to run by using PRNGs seeded by time.

–plugin-args--plugin_args

All arguments after this sentinel are passed to the I/O plugin plugin.

MACSio Command Line Examples

  • To run with Multiple Independent File (MIF) mode to on 93 tasks to 8 HDF5 files…

    mpirun -np 93 macsio --interface hdf5 --parallel_file_mode MIF 8
    
  • Same as above to but a Single Shared File (SIF) mode to 1 HDF5 file (note: this is possible with the same plugin because the HDF5 plugin in MACSio has been designed to support both the MIF and SIF parallel I/O modes.

    mpirun -np 93 macsio --interface hdf5 --parallel_file_mode SIF 1
    
  • Default per-task request size is 80,000 bytes (10K doubles). To use a different request size, use –part_size. For example, to run on 128 tasks, 8 files in MIF mode where I/O request size is 10 megabytes, use

    mpirun -np 128 macsio --interface hdf5 --parallel_file_mode MIF_ 8 --part_size 10M
    

    Here, the M after the 10 means either decimal Megabytes (Mb) or binary Mibibytes (Mi) depending on setting for --units_prefix_system. Default is binary.

  • To use H5Z-ZFP compression plugin (note: we’re talking about an HDF5 plugin here), be sure to have the plugin compiled and available with the same compiler and version of HDF5 you are using with MACSio. Here, we demonstrate a MACSio command line that runs on 4 tasks, does MIF parallel I/O mode to 2 files, on a two dimensional, rectilinear mesh with an average number of parts per task of 2.5 and a nominal I/O request size of 40,000 bytes. The args after --plugin-args are to specify ZFP compression parameters to the HDF5 plugin. In this case, we use H5Z-ZFP compression plugin in rate mode with a bit-rate of 4.

    env HDF5_PLUGIN_PATH=<path-to-plugin-dir> mpirun -np 4 ./macsio --interface hdf5 --parallel_file_mode MIF 2 --avg_num_parts 2.5 --part_size 40000 --part_dim 2 --part_type rectilinear --num_dumps 2 --plugin_args --compression zfp rate=4
    

    where path-to-plugin-dir is the path to the directory containing libh5zzfp.{a,so,dylib}

Weak Scaling Study Command-Line Example

Suppose you want to perform a weak scaling study with MACSio in MIF parllel I/O mode and where per-task I/O requests are nominally 100 kilobytes and each task has 8 mesh parts.

All MACSio command line arguments remain the same. The only difference is the task count you execute MACSio with.

for n in 32 64 128 256 512 1024 2048 4096
do
    mpirun -np $n macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 32
done

Now, the above example started with a task count of 32 and 32 files in MIF mode and kept the file count constant. It is concievable that if you continued this study to larger and larger scales, you may also want the MIF file count to vary somewhat as well. Here is an example of doing that.

Example

Which when run, results in the following sequence of MACSio command-lines.

mpirun -np 32 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 32
mpirun -np 64 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 64
mpirun -np 128 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 64
mpirun -np 256 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 64
mpirun -np 512 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 64
mpirun -np 1024 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 64
mpirun -np 2048 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 64
mpirun -np 4096 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 64
mpirun -np 8192 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 64
mpirun -np 16384 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 128
mpirun -np 32768 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 128
mpirun -np 65536 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 128
mpirun -np 131072 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 256
mpirun -np 262144 macsio --interface hdf5 --avg_num_parts 8 --part_size 100K --parallel_file_mode MIF 256

Because the default mesh type is rectilinear (which involves very little in the way of I/O for the mesh itself), the default number of mesh variables is 20, for a typical plugin running the MIF parallel I/O mode, each task will emit 8*20 (there are 8 mesh parts per task), 100K writes for a given dump. The default number of dumps is 10, so this will be repeated 10 times. For a --parallel_file_mode MIF 32 run, each dump will produce 32 files for a total of 320 files produced for that first run. In the first run, because file count and task count are 1:1, each file will contain just 8 mesh parts. For the last run, there are 256 files per dump but 262144 tasks, each task having 8 mesh parts. So, each file would, in that case, contain 8192 mesh parts.

Now, because the HDF5 plugin also supports the SIF parallel I/O mode, we could run the same sequence of tests in that mode by replacing MIF XX with SIF.

Strong Scaling Study Command-Line Example

Suppose we wish to perform a strong scaling study. In this case, we need to settle upon the global final mesh size and then construct MACSio command lines for each run such that the task count together with the per-task command-line arguments results in the same (or approximately so) global final mesh object in each run.

In the preceding weak scaling example, MACSio generated a global mesh of size in the range [8*32*100K, 8*262144*100K]. Selecting a middle-of-the-range run of 8*8192*100K (6,710,886,400 bytes) as a nominal global mesh size, we can then use a given task count to determine part size and average part count to hit that target global size. We demonstrate this in the following code block…

When this shell code is run, it results in the following sequence of MACSio command-lines. Note that we cap the total number of files at 1024. Depending on the particular system where this is run, this cap may be too low or too high. The best number is the number of I/O nodes a given instance of MACSio can see when running. Because this number typically varies with task count and where on the system the tasks are actually allocated, this number is not always easily known or obtained.

mpirun -np 32 ./macsio --interface hdf5 --avg_num_parts 2048 --part_size 102400 --parallel_file_mode MIF 32
mpirun -np 64 ./macsio --interface hdf5 --avg_num_parts 1024 --part_size 102400 --parallel_file_mode MIF 64
mpirun -np 128 ./macsio --interface hdf5 --avg_num_parts 512 --part_size 102400 --parallel_file_mode MIF 128
mpirun -np 256 ./macsio --interface hdf5 --avg_num_parts 256 --part_size 102400 --parallel_file_mode MIF 256
mpirun -np 512 ./macsio --interface hdf5 --avg_num_parts 128 --part_size 102400 --parallel_file_mode MIF 512
mpirun -np 1024 ./macsio --interface hdf5 --avg_num_parts 64 --part_size 102400 --parallel_file_mode MIF 1024
mpirun -np 2048 ./macsio --interface hdf5 --avg_num_parts 32 --part_size 102400 --parallel_file_mode MIF 1024
mpirun -np 4096 ./macsio --interface hdf5 --avg_num_parts 16 --part_size 102400 --parallel_file_mode MIF 1024
mpirun -np 8192 ./macsio --interface hdf5 --avg_num_parts 8 --part_size 102400 --parallel_file_mode MIF 1024
mpirun -np 16384 ./macsio --interface hdf5 --avg_num_parts 4 --part_size 102400 --parallel_file_mode MIF 1024
mpirun -np 32768 ./macsio --interface hdf5 --avg_num_parts 2 --part_size 102400 --parallel_file_mode MIF 1024
mpirun -np 65536 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 102400 --parallel_file_mode MIF 1024
mpirun -np 131072 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 51200 --parallel_file_mode MIF 1024
mpirun -np 262144 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 25600 --parallel_file_mode MIF 1024

To perform a strong scaling study in SIF parallel I/O mode, just replace the trailing MIF $nf in the above MACSio command line above with SIF.

This strong scaling scenario maintains constant part size (up to the last three iterations) and just varies the part count to hit the target global size. Instead, we may want to do a strong scaling scenario where we maintain just a single part per task and vary that single part’s size to hit the target global size. This is demonstrated with following code…

which prodcues the following sequence of MACSio command-lines…

mpirun -np 32 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 209715200 --parallel_file_mode MIF 32
mpirun -np 64 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 104857600 --parallel_file_mode MIF 64
mpirun -np 128 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 52428800 --parallel_file_mode MIF 128
mpirun -np 256 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 26214400 --parallel_file_mode MIF 256
mpirun -np 512 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 13107200 --parallel_file_mode MIF 512
mpirun -np 1024 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 6553600 --parallel_file_mode MIF 1024
mpirun -np 2048 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 3276800 --parallel_file_mode MIF 1024
mpirun -np 4096 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 1638400 --parallel_file_mode MIF 1024
mpirun -np 8192 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 819200 --parallel_file_mode MIF 1024
mpirun -np 16384 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 409600 --parallel_file_mode MIF 1024
mpirun -np 32768 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 204800 --parallel_file_mode MIF 1024
mpirun -np 65536 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 102400 --parallel_file_mode MIF 1024
mpirun -np 131072 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 51200 --parallel_file_mode MIF 1024
mpirun -np 262144 ./macsio --interface hdf5 --avg_num_parts 1 --part_size 25600 --parallel_file_mode MIF 1024

The difference in these two strong scaling scenarious is that the first one will suffer from per-part overheads. Each task will have per-part memory overheads and per-part I/O request overheads which the second scenario does not suffer.

Assessing Performance Achieved by MACSio

On the one hand, MACSio tries to boil all the performance data down to a single, scalar metric; the average time to complete a dump or the average bandwidth for completing a dump. This information is captured in task 0’s section (header) MACSio’s main log file, macsio-log.log.

An example is shown below for 10 dumps from a 4 task run.

The key line in the file, for task 0, is the total bytes divided by the time between the last task to finish and the first task to start which looks similar to the line below…

Info:"macsio_main.c":727:Total Bytes:  57.4580 Mi; Last finisher - First starter =   1.5012 secs; BW =  38.2758 Mi/sec:::

Detailed Performance Data

Detailed performance data is captured by MACSio as timers and then dumped to strings (see MACSIO_TIMING_DumpTimersToStrings() for specific format of dumped timer data strings) for output to a log file upon exit.

The performance data gathered by MACSio is wholly dependent on the degree to which developers of MACSio or its plugins have instrumented MACSio with suitable timing calls. If timing information provided is not sufficient, the solution is to submit a PR with relevant timer calls added.

Each task maintains its own unique set of timers for its own activities. In addition, MACSio will reduce all the task-specific timers just prior to completion. All timer data, the task-specific timers together with the reduced timers, is then dumped to a MACSio log file with the name macsio-timings.log. That single file captures all of the performance data for a given run of MACSio.

Validating Data MACSio Produces

Most MACSio plugins produce valid scientific computing data which can be processed by other common workflow tools and, in particular, visualization tools such as VisIt, ParaView or Ensight.

The images here demonstrate MACSio, 2D data written with the Silo plugin

_images/macsio_vis1.png

VisIt displaying MACSio produced data on 4 tasks and 2.5 parts per task with the Silo plugin. From left to right, the visualizations display the parallel decomposition of the mesh into blocks, a smoothly varying, node-centered variable on the mesh, a high noise node-centered variable and a low-noise node-centered variable.

_images/macsio_vis2.png

VisIt displaying MACSio produced data on 2048 tasks and 2.5 parts per task with the Silo plugin The left image shows the parallel decomposition of the mesh. The right image shows a smoothly varying node-centered variable. In this example, there were 2048 tasks and 5120 mesh parts.

Welcome to MACSio Documentation

Main

The design of MACSio is described in more detail in an accompanying design document. Here, we provide only a high level overview.

MACSio is divided into two halves; the MACSio main driver and the I/O plugin(s).

The main driver generates the scientific data objects (examples of which are pictured below) typical of HPC, multi-physcis simulation codes. It also orchestrates a number of activities such as compute and/or communication workloads to be mixed with the I/O workload, time and space performance data gathering, event logging and other support operations.

_images/scientific_data_objects.png

Scientific computing data objects typical of mesh-based multi-physics simulation codes

The other half of MACSio is the I/O plugin(s). A MACSio plugin orchestrates marshalling of data between memory and disk. Different plugins exist to implement different I/O proxies. For example, the Silo plugin serves as a proxy for many (not all) LLNL simulation codes. The Exodus plugin serves as proxy for many Sandia simulation codes. On the other hand, instead of representing a proxy for any specific organization or code, the HDF5 plugin serves as a proxy for the best way to use HDF5 for various parallel I/O paradigms. It also provides a number of command-line options to control low-level features of the HDF5 library to help probe the impact of various options on the performance space.

Note

Should extend MACSio to support scripted sequences of dumps

MACSio’s main driver accepts a slew of command-line arguments. If MACSio has been built, you can obtain help for MACSio’s main driver arguments

MACSio’s command-line arguments are designed to give the user control over the nominal I/O request sizes emitted from MPI ranks for mesh bulk data and for amorphous metadata. The user specifies a size, in bytes, for mesh pieces. MACSio then computes a mesh part size, in nodes, necessary to hit this target byte count for double precision data. MACSio will determine an N dimensional logical size of a mesh piece that is a close to equal dimensional as possible. In addition, the user specifies an average number of mesh pieces that will be assigned to each MPI rank. This does not have to be a whole number. When it is a whole number, each MPI rank has the same number of mesh pieces. When it is not, some processors have one more mesh piece than others. This is common of HPC multi-physics applications. Together, the total processor count and average number of mesh pieces per processor gives a total number of mesh pieces that comprise the entire mesh. MACSio then finds an N dimensional arrangement (N=[1,2,3]) of the pieces that is as close to equal dimension as possible. If mesh piece size or total count of pieces wind up being prime numbers, MACSio will only be able to factor these into long, narrow shapes where 2 (or 3) of the dimensions are of size 1. That will make examination of the resulting data using visualization tools like VisIt a little less convenient but is otherwise harmless from the perspective of driving and assessing I/O performance.

Mesh and Field Data Generation

MACSio is capable of generating a wide variety of mesh and variable data and amorphous metadata typical of HPC multi-physics applications. Currently, MACSio can generate structured, rectilinear, unstructured and arbitrary polyhedral, multi-block meshes in 2 or 3 dimensions. In addition, regardless of the particular type of mesh MACSio generates for purposes of I/O performance testing, it stores and marshalls all of the resultant data in an uber JSON-C object that is passed around witin MACSio and between MACSIO and its I/O plugins.

MACSio employs a very simple algorithm to generate and then decompose a mesh in parallel. However, the decomposition is also general enough to create multiple mesh pieces on individual MPI ranks and for the number of mesh pieces to vary, somewhat, between MPI ranks. At present, there is no support to explicitly specify a particular arrangement of mesh pieces and MPI ranks. However, such enhancement is planned for the near future.

Once the global whole mesh shape is determined as a count of total pieces and as counts of pieces in each of the logical dimensions, MACSio uses a very simple algorithm to assign mesh pieces to MPI ranks. The global list of mesh pieces is numbered starting from 0. First, the number of pieces to assign to rank 0 is chosen. When the average piece count is non-integral, it is a value between K and K+1. So, MACSio randomly chooses either K or K+1 pieces but being carful to weight the randomness so that once all pieces are assigned to all ranks, the average piece count per rank target is achieved. MACSio then assigns the next K or K+1 numbered pieces to the next MPI rank. It continues assigning pieces to MPI ranks, in piece number order, until all MPI ranks have been assigned pieces. The algorithm runs indentically on all ranks. When the algorithm reaches the part assignment for the rank on which its executing, it then generates the K or K+1 mesh pieces for that rank. Although the algorithm is essentially a sequential algorithm with asymptotic behavior O(#total pieces), it is primarily a simple book-keeping loop which completes in a fraction of a second even for more than one million pieces.

Each piece of the mesh is a simple rectangular region of space. The spatial bounds of that region are easily determined. Any variables to be placed on the mesh can be easily handled as long as the variable’s spatial variation can be described in the global goemetric space.

Data API

group MACSIO_DATA

Data (Mesh) Generation.

Functions

struct json_object *MACSIO_DATA_GenerateTimeZeroDumpObject(struct json_object *main_obj, int *rank_owning_chunkId)

Parameters
  • main_obj: The main JSON object holding mesh, field, amorphous data

  • rank_owning_chunkId: missing this info

int MACSIO_DATA_GetRankOwningPart(struct json_object *main_obj, int chunkId)

Given a chunkId, return rank of owning task.

int MACSIO_DATA_ValidateDataRead(struct json_object *main_obj)

Not yet implemented.

int MACSIO_DATA_SimpleAssignKPartsToNProcs(int k, int n, int my_rank, int *my_part_cnt, int **my_part_ids)

Not yet implemented.

struct json_object *MACSIO_DATA_EvolveDataset(struct json_object *main_obj, int *dataset_evolved, float factor, int growth_bytes)

Vary mesh and field data including changing not only values but sizes.

Random (Amourphous) Object Generation

In addition to the data necessary to represent the main mesh and field MACSio produces, it is also common for HPC multi-physics code to generate modest amount of amourphous metadata dealing with such things as materials, material models and material properties tables, user input controls, performance tracking information, library versions and data provenenance, debugging information, etc. This kind of information often takes the form of a large hierarchy of key-value pairs where the values can sometimes be arrays on the order of the size of the number of tasks, etc.

Two methods in MACSio’s data generation package help to create some random but nonetheless statistically similar kinds of data.

group MACSIO_RANDOBJ

Construct a random JSON object.

Functions

struct json_object *MACSIO_DATA_MakeRandomObject(int maxd, int nraw, int nmeta, unsigned maxds)

Parameters
  • maxd: maximum depth of object tree

  • nraw: bytes of raw data (in extarr objects) in final object

  • nmeta: bytes of meta data (in non-extarr objects) in final object

  • maxds: maximum size of any given raw (extarr) object

struct json_object *MACSIO_DATA_MakeRandomTable(int nrecs, int totbytes)

Construct a random table of some number of random JSON objects.

Parameters
  • nrecs: number of records in the table

  • totbytes: total bytes in the table

Randomization

Randomization of various of MACSio’s behaviors may be explicitly enforced or prevented. However, to ensure such behaviors, random number generation needs to be handled carefully especially when dealing with reproducibility from run to run and/or agreement across parallel tasks or both.

For this reason, MACSio creates and initializes several default pseudo random number generators (PRNGs), all of which utilize the C standard library random() method.

Pseudo Random Number Generator (PRNG) support is handled by maintaining a series of state vectors for calls to initstate/setstate of C library random() and friends. Multiple different PRNGs are supported each having zero or more special properties depending on how they are initialized (e.g. the seed used from run to run and/or from task to task) as well as how they are expected to be used/called. For example, if a PRNG is intended to produce identical values on each task, then it is up to the caller to ensure the respective PRNG is called consistently (e.g. collectively) across tasks. Otherwise, unintended or indeterminent behavior may result.

The default PRNGs MACSio creates are…

NaiveMD_random_naive

The naive prng is fine for any randomization that does not need special behaviors. This is the one that is expected to be used most of the time. Think of it as a wholesale replacement for using C library random() directly. MD_random is a shorthand alias for MD_random_naive

Naive, Time VaryingMD_random_naive_tv

Like MD_random_naive except that it is seeded based on current time and will therefore ensure variation from run to run (e.g. is time varying). It should be used in place of MD_random_naive where randomization from run to run is needed.

Naive, Rank VaryingMD_random_naive_rv

Like MD_random_naive but ensures variation across tasks (e.g. is rank varying) by seeding based on task rank. It should be used in place of MD_random_naive where randomization from task to task is needed.

Naive, Rank and Time VaryingMD_random_naive_rtv

Like MD_random_naive_rv but also ensures results vary from rank to rank and run to run.

Rank InvariantMD_random_rankinv

Is neither seeded based on rank or time nor shall be used by a caller in way that would cause inconsistency across ranks. It is up to the caller to ensure it is called consistently (e.g. collectively) across ranks. Otherwise indeterminent behavior may result.

Rank Invariant, Time VaryingMD_random_rankinv_tv

Like MD_random_rankinv except ensures variation in randomization from run to run. So, for each given run, all ranks will agree on randomization but from run to run, that agreement will vary.

If these PRNGs are not sufficient, developers are free to create other PRNGs for other specific purposes using MACSIO_DATA_CreatePRNG().

PRNG API

group MACSIO_PRNGS

Randomization and Pseudo Random Number Generators.

Defines

MD_random_naive()

Get next random value from naive PRNG. Same as MD_random().

The naive prng is fine for any randomization that does not need special behaviors. This is the one that is expected to be used most of the time.

MD_random()

Convenience shorthand for MD_random_naive()

MD_random_naive_tv()

Get next random value from naive, time-variant PRNG.

The naive_tv PRNG is like naive except that it is seeded based on current time and so will change from run to run (e.g. is time-variant). It should be used in place of MD_random_naive() when randomization from run to run is required.

MD_random_naive_rv()

Get next random value from naive, rank-variant PRNG.

Like MD_random_naive() but guarantees variation across ranks.

MD_random_naive_rtv()

Get next random value from naive, rank- and time-variant PRNG.

Like MD_random_naive_rv() but guarantees variation from run to run.

MD_random_rankinv()

Get next random value from rank-invariant PRNG.

The rank_invariant PRNG is neither seeded based on rank nor shall be used in way that would cause inconsistency across ranks. Failing to call it collectively will likely cause indeterminent behavior. There is no enforcement of this requirement. However, a sanity check is performed at application termination…there is a collective MPI_BXOR reduce on the current value from this PRNG and if the result is non-zero, an error message is generated in the logs.

MD_random_rankinv_tv()

Get next random value from rank-invariant but time-variant PRNG.

Like the MD_random_rankinv() except causes variation in randomization from run to run

Functions

int MACSIO_DATA_CreatePRNG(unsigned seed)

Create a Pseudo Random Number Generator (PRNG)

Parameters
  • seed: seed for initializing random number generator

Return

An integer identifier for the created PRNG

void MACSIO_DATA_ResetPRNG(int prng_id)

Reset a PRNG This will cause the specified PRNG to restart the sequence of pseudo random numbers it generates.

Parameters
  • prng_id: identifier for the PRNG to reset

long MACSIO_DATA_GetValPRNG(int prng_id)

Get next value from PRNG.

Parameters
  • prng_id: identifer of the desired PRNG to get a value from

Return

Next value from the PRNG sequence. Logically equivalent to calling random().

void MACSIO_DATA_DestroyPRNG(int prng_id)

Free resources associated a PRNG.

Parameters
  • prng_id: identifer of the desired PRNG to free

void MACSIO_DATA_InitializeDefaultPRNGs(unsigned mpi_rank, unsigned curr_time)

Initialize default Pseudo Random Number Generators (PRNGs) Should be called near beginning of application startup. Creates the (default) PRNGs described here. Applications are free to create others as needed.

Parameters
  • mpi_rank: unsigned representation of MPI rank of calling processor

  • curr_time: unsigned but otherwise arbitrary representation of current time but which is equal on all ranks yet guaranteed to vary from run to run.

void MACSIO_DATA_FinalizeDefaultPRNGs(void)

Free up resources for default PRNGs Should be called near the termination of application.

PRNG Example (tstprng.c)

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#ifdef HAVE_MPI
#include <mpi.h>
#endif

#include <macsio_data.h>

int main(int argc, char **argv)
{
    int i, id1, id2, id3, id5;
    int parallel = 0, ckfail = 0;
    long tseed;
    long series1[100], series2[100], series3[100], series4[100][5], series5[100][5];
    int rank=0, size=1;
    struct timeval tv;

    for (i = 1; i < argc; i++)
    {
        if (!strcasestr(argv[i], "--parallel"))
            parallel = 1;
        else if (!strcasestr(argv[i], "--ckfail"))
            ckfail = 1;
    }

#ifdef HAVE_MPI
    if (parallel)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    }
#endif

    gettimeofday(&tv, 0);
    tseed = (long) (((unsigned int) (tv.tv_sec + tv.tv_usec)) & 0xFFFFFFFF);

    MACSIO_DATA_InitializeDefaultPRNGs(rank, tseed);

    id1 = MACSIO_DATA_CreatePRNG(0xDeadBeef);
    id2 = MACSIO_DATA_CreatePRNG(tseed); /* time varying output */
    id3 = MACSIO_DATA_CreatePRNG(0xBabeFace);
    id5 = MACSIO_DATA_CreatePRNG(0xDeadBeef);

    for (i = 0; i < 100; i++)
    {
        int j, id4 = MACSIO_DATA_CreatePRNG(0x0BedFace);
        series1[i] = MACSIO_DATA_GetValPRNG(id1);
        series2[i] = i%3?0:MACSIO_DATA_GetValPRNG(id2);
        series3[i] = i%5?0:MACSIO_DATA_GetValPRNG(id3);
        for (j = 0; j < 5; j++)
            series4[i][j] = MACSIO_DATA_GetValPRNG(id4);
        MACSIO_DATA_DestroyPRNG(id4);
        for (j = 0; j < 5; j++)
            series5[i][j] = MACSIO_DATA_GetValPRNG(id5);
        MACSIO_DATA_ResetPRNG(id5);
        if (i && !(i%50))
            MACSIO_DATA_ResetPRNG(id2);
        MD_random();
        MD_random_rankinv();
        MD_random_rankinv_tv();
        if (ckfail && rank == 0 && i < 5)
        {
            MD_random_rankinv();
            MD_random_rankinv_tv();
        }
    }
    if (memcmp(&series2[0], &series2[51], 40*sizeof(long)))
        return 1;
    if (memcmp(&series4[0], &series4[1], 5*sizeof(long)))
        return 1;
    if (memcmp(&series4[1], &series4[23], 5*sizeof(long)))
        return 1;
    if (memcmp(&series5[0], &series5[1], 5*sizeof(long)))
        return 1;
    if (memcmp(&series5[1], &series5[23], 5*sizeof(long)))
        return 1;

    MACSIO_DATA_DestroyPRNG(id1);
    MACSIO_DATA_DestroyPRNG(id2);
    MACSIO_DATA_DestroyPRNG(id3);
    MACSIO_DATA_DestroyPRNG(id5);

#ifdef HAVE_MPI
    if (parallel)
    {
        /* Lets confirm the rank-invariant PRNGs agree and issue a message if not */
        unsigned rand_check[2] = {MD_random_rankinv(), MD_random_rankinv_tv()};
        unsigned rand_check_r[2];
        MPI_Reduce(rand_check, rand_check_r, 2, MPI_UNSIGNED, MPI_BXOR, 0, MPI_COMM_WORLD);
        assert(rand_check_r[0] == 0 && rand_check_r[1] == 0);
    }
#endif

    MACSIO_DATA_FinalizeDefaultPRNGs();

#ifdef HAVE_MPI
    if (parallel)
        MPI_Finalize();
#endif

    return 0;
}

Plugins

Ordinarily, when we think of plugins in software, we think of shared libraries being opened with dlopen() at runtime. To avoid the requirement for shared libraries and dlopen(), MACSio uses a simpler link-time approach to handling plugins called static plugins. In this case, the plugin’s available to MACSio are determined when MACSio’s main() is linked. Whatever plugins are included on the link line are then available to MACSio.

MACSio exploits a feature in C++ which permits initialization of static variables via non-constant expressions In fact, the only reason MACSio requires a C++ compiler is for the final link to support this feature.

All symbols in a plugin source file are defined with static scope. Every plugin defines an int register_this_plugin(void) function and initializes a static dummy integer to the result of a call to register_this_plugin() like so…

static int register_this_plugin(void)
{
  MACSIO_IFACE_Handle_t iface;

  strcpy(iface.name, iface_name);
  strcpy(iface.ext, iface_ext);

  if (!MACSIO_IFACE_Register(&iface))
      MACSIO_LOG_MSG(Die, ("Failed to register interface \"%s\"", iface.name));
}
static int dummy = register_this_plugin();

In the above code example, the call to MACSIO_IFACE_Register() calls a method in MACSio’s main which does the work of adding the plugin’s interface specification (MACSIO_IFACE_Handle_t) to a statically defined table of plugins. The plugin’s id is its location in that table.

Each plugin is defined by a file such as macsio_foo.c for a plugin named foo in the plugins directory. macsio_foo.c implements the MACSIO_IFACE interface for the foo plugin.

At the time the executable loads, the register_this_plugin() method is called. Note that this is called long before even main() is called. The call to MACSIO_IFACE_Register() from within register_this_plugin() winds up adding the plugin to MACSio’s global list of plugins. This happens for each plugin. The order in which they are added to MACSio doesn’t matter because plugins are identified by their (unique) names. If MACSio encounters a case where two different plugins have the same name, then it will abort and inform the user of the problem. The remedy is to adjust the name of one of the two plugins. MACSio is able to call static methods defined within the plugin via function callback pointers registered with the interface.

group MACSIO_IFACE

Plugin Interface Specification.

Defines

MACSIO_IFACE_MAX_COUNT

Maximum number of plugins possible.

MACSIO_IFACE_MAX_NAME

Maximum length of any plugin’s name.

Typedefs

typedef void (*DumpFunc)(int argi, int argc, char **argv, json_object *main_obj, int dumpNum, double dumpTime)

Main mesh+field dump (write) function specification.

typedef void (*LoadFunc)(int argi, int argc, char **argv, char const *path, json_object *main_obj, json_object **data_read_obj)

Main mesh+field load (read) function specification.

typedef int (*ProcessArgsFunc)(int argi, int argc, char **argv)

Command-line argument processing function for plugin.

typedef int (*QueryFeaturesFunc)(void)

Function to query plugin’s features (not currently used)

typedef int (*IdentifyFileFunc)(char const *pathname)

Function to ask plugin if it recognizes a given file.

typedef struct MACSIO_IFACE_Handle_t MACSIO_IFACE_Handle_t

Plugin interface handle type.

Functions

int MACSIO_IFACE_Register(MACSIO_IFACE_Handle_t const *iface)

Register a plugin with MACSIO.

Parameters
  • iface: The interface specification for a new plugin

This function should be called from within a code-block which is executed as part of a static initiliztion of the plugin’s local, static variables like so…

static int register_this_plugin(void)
{
    MACSIO_IFACE_Handle_t iface;
    iface.name = "foobar";
    iface.ext = ".fb";
    .
    .
    .
    MACSIO_IFACE_Register(&iface);
    return 0;
}
static int dummy = register_this_plugin()

void MACSIO_IFACE_GetIds(int *cnt, int **ids)

Return all registered plugin ids.

Parameters
  • cnt: the number of registered plugins available

  • ids: the list of ids of the registered plugins

void MACSIO_IFACE_GetIdsMatchingFileExtension(char const *ext, int *cnt, int **ids)

Return plugin that recognize files with a given extension.

Parameters
  • ext: the file extension to query

  • cnt: the number of plugins that recognize the extension

  • ids: the list of ids of the matching plugins

int MACSIO_IFACE_GetId(char const *name)

return id of plugin given its name

Return

On error, returns -1. Otheriwse a non-negative valid plugin id.

char const *MACSIO_IFACE_GetName(int id)

Get name of plugin given its id.

Return

On error, returns (char*)0. Otheriwse a non-null name of valid plugin.

MACSIO_IFACE_Handle_t const *MACSIO_IFACE_GetByName(char const *name)

Get interface handle for a plugin given its name.

Return

On error, returns (MACSIO_IFACE_Handle_t*)0. Otherwise returns a valid plugin interface handle.

MACSIO_IFACE_Handle_t const *MACSIO_IFACE_GetById(int id)

Get interface handle for a plugin given its id.

Return

On error, returns (MACSIO_IFACE_Handle_t*)0. Otherwise returns a valid plugin interface handle.

Variables

char name[MACSIO_IFACE_MAX_NAME]

Name of this plugin

char ext[MACSIO_IFACE_MAX_NAME]

File extension of files associated with this plugin

int slotUsed

[Internal] indicate if this position in table is used

ProcessArgsFunc processArgsFunc

Plugin’s command-line argument processing callback

DumpFunc dumpFunc

Plugin’s main dump (write) function callback

LoadFunc loadFunc

Plugin’s main load (read) function callback

QueryFeaturesFunc queryFeaturesFunc

Plugin’s callback to query its feature set (not in use)

IdentifyFileFunc identifyFileFunc

Plugin’s callback to indicate if it thinks it owns a file

struct MACSIO_IFACE_Handle_t
#include <macsio_iface.h>

Plugin interface handle type.

Public Members

char name[MACSIO_IFACE_MAX_NAME]

Name of this plugin

char ext[MACSIO_IFACE_MAX_NAME]

File extension of files associated with this plugin

int slotUsed

[Internal] indicate if this position in table is used

ProcessArgsFunc processArgsFunc

Plugin’s command-line argument processing callback

DumpFunc dumpFunc

Plugin’s main dump (write) function callback

LoadFunc loadFunc

Plugin’s main load (read) function callback

QueryFeaturesFunc queryFeaturesFunc

Plugin’s callback to query its feature set (not in use)

IdentifyFileFunc identifyFileFunc

Plugin’s callback to indicate if it thinks it owns a file

Example (macsio_miftmpl.c)

Writing a new plugin for MACSio involves a mininum of two new files in the plugins directory. One is the C or C++ source code for the plugin. The other is a .make file that includes various plugin-specific variable definitions and logic to decode library dependencies that effect.

The miftmpl plugin is intended to serve as a template for how to create a basic MIF mode plugin for MACSio. This template code does indeed actually function correctly as a MACSio plugin. It does so by writing MACSio’s internal JSON objects repesenting each mesh part as ascii strings to the individual files.

Each processor in a MIF group serializes each JSON object representing a mesh part to an ascii string. Then, each of these strings is appended to the end of the file. For each such string, the plugin maintains knowledge of the mesh part’s ID, the filename it was written to and the offset within the file.

The filenames, offsets and mesh part IDs are then written out as a separate JSON object to a root or master file. Currently, there is no plugin in VisIt to read these files and display them. But, this example code does help to outline the basic work to write a MIF plugin.

In practice, this plugin could have simply written the entire JSON object from each processor to its MIF group’s file. However, in doing that, the resulting file would not “know” such things as how many mesh parts there are or where a given mesh part is located in the file set. So, we wind up writing JSON objects for each part individually so that we can keep track of where they all are in the fileset.

Some of the aspects of this plugin code exist here only to serve as an example in writing a MIF plugin and are non-essential to the proper operation of this plugin.

MACSio uses a static load approach to its plugins. The MACSio main executable must be linked with all the plugins it expects to use.

In writing any MACSio plugin (MIF or SSF) be sure to declare all of your plugin’s symbols (functions, local variables, etc.) as static. Each plugin is being linked into MACSio’s main and any symbols that are not static file scope will wind up appearing in and therefore being vulnerable too global namespace collisions. The plugin’s main interface methods to MACSio are handled via registration of a set of function pointers.

Functions

int process_args(int argi, int argc, char *argv[])

Process command-line arguments specific to this plugin.

Parameters
  • argi: Argument index of first argument that is specific to this plugin

  • argc: argc as passed into main

  • argv: argv as passed into main

Uses MACSIO_CLARGS_ProcessCmdline() to do its work.

This example plugin is implemented to route command line arguments to memory locations (e.g. static variables) here in the plugin. Alternatively, a plugin can choose to route the results of MACSIO_CLARGS_ProcessCmdline() to a JSON object. MACSio’s main is implemented that way.

void *CreateMyFile(const char *fname, const char *nsname, void *userData)

CreateFile MIF Callback.

Parameters
  • fname: Name of the MIF file to create

  • nsname: Name of the namespace within the file for caller should use.

  • userData: Optional plugin-specific user-defined data

This implments the MACSIO_MIF_CreateFile callback needed for a MIF mode plugin.

Return

A void pointer to the plugin-specific file handle

void *OpenMyFile(const char *fname, const char *nsname, MACSIO_MIF_ioFlags_t ioFlags, void *userData)

OpenFile MIF Callback.

Parameters
  • fname: Name of the MIF file to open

  • nsname: Name of the namespace within the file caller should use

  • ioFlags: Various flags indicating behavior/options

  • userData: Optional plugin-specific user-defined data

This implments the MACSIO_MIF_OpenFile callback needed for a MIF mode plugin.

Return

A void pointer to the plugin-specific file handle

int CloseMyFile(void *file, void *userData)

CloseFile MIF Callback.

Parameters
  • file: A void pointer to the plugin specific file handle

  • userData: Optional plugin specific user-defined data

This implments the MACSIO_CloseFile callback needed for a MIF mode plugin.

json_object *write_mesh_part(FILE *myFile, char const *fileName, json_object *part_obj)

Write a single mesh part to a MIF file.

Parameters
  • myFile: The file handle being used in a MIF dump

  • fileName: Name of the MIF file

  • part_obj: The json object representing this mesh part

All this method does is serialize the JSON object for the given mesh part to an ASCII string and then appends/writes that string at the end of the current file.

After serializing the object to an ASCII string and writing it to the file, the memory for the ASCII string is released by json_object_free_printbuf().

Return

A tiny JSON object holding the name of the file, the offset at which the JSON object for this part was written in the file and the part’s ID.

void main_dump(int argi, int argc, char **argv, json_object *main_obj, int dumpn, double dumpt)

Main MIF dump implementation for this plugin.

Parameters
  • argi: Command-line argument index at which first plugin-specific arg appears

  • argc: argc from main

  • argv: argv from main

  • main_obj: The main json object representing all data to be dumped

  • dumpn: The number/index of this dump. Each dump in a sequence gets a unique, monotone increasing index starting from 0

  • dumpt: The time to be associated with this dump (like a simulation’s time)

This is the function MACSio main calls to do the actual dump of data with this plugin.

It uses MACSIO_MIF twice; once for the main dump and a second time to create the root (or master) file. However, in the second use, the file count is set to 1. That means that the root file is effectively written using serial (e.g. non-parallel) I/O.

It is a useful exercise to ask how we might improve the implementation here to avoid writing the root file using serial I/O.

int register_this_interface()

Method to register this plugin with MACSio main.

Due to its use to initialize a file-scope, static const variable, this function winds up being called at load time (e.g. before main is even called).

Its purpose is to add key information about this plugin to MACSio’s global interface table.

Variables

char const *iface_name = "miftmpl"

Name of the interface this plugin uses

char const *iface_ext = "json"

Default file extension for files generated by this plugin

int json_as_html = 0

Use HTML output instead of raw ascii

int my_opt_one

Example of a static scope, plugin-specific variable to be set in process_args to control plugin behavior

int my_opt_two

Another example variable to control plugin behavior

char *my_opt_three_string

Another example variable to control plugin behavior

float my_opt_three_float

Another example variable to control plugin behavior

int const dummy = register_this_interface()

Dummy initializer to trigger register_this_interface by the loader.

This one statement is the only statement requiring compilation by a C++ compiler. That is because it involves initialization and non constant expressions (a function call in this case). This function call is guaranteed to occur during initialization (that is before even ‘main’ is called) and so will have the effect of populating the iface_map array merely by virtue of the fact that this code is linked with a main.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334

#include <json-cwx/json.h>

#include <macsio_clargs.h>
#include <macsio_iface.h>
#include <macsio_log.h>
#include <macsio_main.h>
#include <macsio_mif.h>
#include <macsio_utils.h>

#include <stdio.h>

#ifdef HAVE_MPI
#include <mpi.h>
#endif

/*!
\defgroup plugins Plugins
@{
*/

/*!
\addtogroup MIF Template
\brief A simple MIF Plugin Template
@{
*/

static char const *iface_name = "miftmpl"; /**< Name of the interface this plugin uses */
static char const *iface_ext = "json";     /**< Default file extension for files generated by this plugin */
static int json_as_html = 0;               /**< Use HTML output instead of raw ascii */
static int my_opt_one;                     /**< Example of a static scope, plugin-specific variable to be set in
                                                process_args to control plugin behavior */
static int my_opt_two;                     /**< Another example variable to control plugin behavior */
static char *my_opt_three_string;          /**< Another example variable to control plugin behavior */
static float my_opt_three_float;           /**< Another example variable to control plugin behavior */

/*!
\brief Process command-line arguments specific to this plugin

Uses MACSIO_CLARGS_ProcessCmdline() to do its work.

This example plugin is implemented to route command line arguments to memory locations
(e.g. static variables) here in the plugin.  Alternatively, a plugin can choose to
route the results of MACSIO_CLARGS_ProcessCmdline() to a JSON object. MACSio's main is
implemented that way.
*/
static int process_args(
    int argi,      /**< [in] Argument index of first argument that is specific to this plugin */
    int argc,      /**< [in] argc as passed into main */
    char *argv[]   /**< [in] argv as passed into main */
)
{
    /* Can use MACSIO_CLARGS_TOJSON here instead in which case pass the pointer to
       a json_object* as first arg and eliminate all the pointers to specific
       variables. The args will be returned as a json-c object. */
    const MACSIO_CLARGS_ArgvFlags_t argFlags = {MACSIO_CLARGS_WARN, MACSIO_CLARGS_TOMEM};

    MACSIO_CLARGS_ProcessCmdline(0, argFlags, argi, argc, argv,
        "--json_as_html", "",
            "Write files as HTML instead of raw ascii [false]",
            &json_as_html,
        "--my_opt_one", "",
            "Help message for my_opt_one which has no arguments. If present, local\n"
            "var my_opt_one will be assigned a value of 1 and a value of zero otherwise.",
            &my_opt_one,
        "--my_opt_two %d", MACSIO_CLARGS_NODEFAULT,
            "Help message for my_opt_two which has a single integer argument",
            &my_opt_two,
        "--my_opt_three %s %f", MACSIO_CLARGS_NODEFAULT,
            "Help message for my_opt_three which has a string argument and a float argument",
            &my_opt_three_string, &my_opt_three_float,
           MACSIO_CLARGS_END_OF_ARGS);

    return 0;
}

/*!
\brief CreateFile MIF Callback

This implments the MACSIO_MIF_CreateFile callback needed for a MIF mode plugin.

\return A void pointer to the plugin-specific file handle
*/
static void *CreateMyFile( 
    const char *fname,     /**< [in] Name of the MIF file to create */
    const char *nsname,    /**< [in] Name of the namespace within the file for caller should use. */
    void *userData         /**< [in] Optional plugin-specific user-defined data */
)
{
    FILE *file = fopen(fname, "w");
    return (void *) file;
}

/*!
\brief OpenFile MIF Callback

This implments the MACSIO_MIF_OpenFile callback needed for a MIF mode plugin.

\return A void pointer to the plugin-specific file handle
*/
static void *OpenMyFile(
    const char *fname,            /**< [in] Name of the MIF file to open */
    const char *nsname,           /**< [in] Name of the namespace within the file caller should use */
    MACSIO_MIF_ioFlags_t ioFlags, /**< [in] Various flags indicating behavior/options */
    void *userData                /**< [in] Optional plugin-specific user-defined data */
)
{
    FILE *file = fopen(fname, "a+");
    return (void *) file;
}

/*!
\brief CloseFile MIF Callback

This implments the MACSIO_CloseFile callback needed for a MIF mode plugin.
*/
static int CloseMyFile(
    void *file,      /**< [in] A void pointer to the plugin specific file handle */
    void *userData   /**< [in] Optional plugin specific user-defined data */
)
{
    return fclose((FILE*) file);
}

/*!
\brief Write a single mesh part to a MIF file

All this method does is serialize the JSON object for the given mesh
part to an ASCII string and then appends/writes that string at the
end of the current file.

After serializing the object to an ASCII string and writing it to the
file, the memory for the ASCII string is released by json_object_free_printbuf().

\return A tiny JSON object holding the name of the file, the offset at
which the JSON object for this part was written in the file and the part's ID.
*/
static json_object *write_mesh_part(
    FILE *myFile,          /**< [in] The file handle being used in a MIF dump */
    char const *fileName,  /**< [in] Name of the MIF file */
    json_object *part_obj  /**< [in] The json object representing this mesh part */
)
{
    json_object *part_info = json_object_new_object();

//#warning SOMEHOW SHOULD INCLUDE OFFSETS TO EACH VARIABLE
    /* Write the json mesh part object as an ascii string */
    fprintf(myFile, "%s\n", json_object_to_json_string_ext(part_obj, JSON_C_TO_STRING_PRETTY));
    json_object_free_printbuf(part_obj);

    /* Form the return 'value' holding the information on where to find this part */
    json_object_object_add(part_info, "partid",
//#warning CHANGE NAME OF KEY IN JSON TO PartID
        json_object_new_int(json_object_path_get_int(part_obj, "Mesh/ChunkID")));
    json_object_object_add(part_info, "file",
        json_object_new_string(fileName));
    json_object_object_add(part_info, "offset",
        json_object_new_double((double) ftello(myFile)));

    return part_info;
}

/*!
\brief Main MIF dump implementation for this plugin

This is the function MACSio main calls to do the actual dump of data with this plugin.

It uses \ref MACSIO_MIF twice; once for the main dump and a second time to create the
root (or master) file. However, in the second use, the file count is set to 1. That
means that the root file is effectively written using serial (e.g. non-parallel) I/O.

It is a useful exercise to ask how we might improve the implementation here to avoid
writing the root file using serial I/O.
*/
static void main_dump(
    int argi,               /**< [in] Command-line argument index at which first plugin-specific arg appears */
    int argc,               /**< [in] argc from main */
    char **argv,            /**< [in] argv from main */
    json_object *main_obj,  /**< [in] The main json object representing all data to be dumped */
    int dumpn,              /**< [in] The number/index of this dump. Each dump in a sequence gets a unique,
                                      monotone increasing index starting from 0 */
    double dumpt            /**< [in] The time to be associated with this dump (like a simulation's time) */
)
{
    int i, rank, numFiles;
    char fileName[256];
    FILE *myFile;
    MACSIO_MIF_ioFlags_t ioFlags = {MACSIO_MIF_WRITE,(unsigned int) JsonGetInt(main_obj,"clargs/exercise_scr")&0x1};
    MACSIO_MIF_baton_t *bat;
    json_object *parts;
    json_object *part_infos = json_object_new_array();

    /* process cl args */
    process_args(argi, argc, argv);

    /* ensure we're in MIF mode and determine the file count */
//#warning SIMPLIFY THIS LOGIC USING NEW JSON INTERFACE
    json_object *parfmode_obj = json_object_path_get_array(main_obj, "clargs/parallel_file_mode");
    if (parfmode_obj)
    {
        json_object *modestr = json_object_array_get_idx(parfmode_obj, 0);
        json_object *filecnt = json_object_array_get_idx(parfmode_obj, 1);
        if (!strcmp(json_object_get_string(modestr), "SIF"))
        {
            MACSIO_LOG_MSG(Die, ("miftmpl plugin cannot currently handle SIF mode"));
        }
        else
        {
            numFiles = json_object_get_int(filecnt);
        }
    }
    else
    {
        char const * modestr = json_object_path_get_string(main_obj, "clargs/parallel_file_mode");
        if (!strcmp(modestr, "SIF"))
        {
            MACSIO_LOG_MSG(Die, ("miftmpl plugin cannot currently handle SIF mode"));
        }
        else if (!strcmp(modestr, "MIFMAX"))
            numFiles = json_object_path_get_int(main_obj, "parallel/mpi_size");
        else if (!strcmp(modestr, "MIFAUTO"))
        {
            /* Call MACSio utility to determine optimal file count */
        }
    }

    bat = MACSIO_MIF_Init(numFiles, ioFlags, MACSIO_MAIN_Comm, 3,
        CreateMyFile, OpenMyFile, CloseMyFile, 0);

    rank = json_object_path_get_int(main_obj, "parallel/mpi_rank");

    /* Construct name for the silo file */
    sprintf(fileName, "%s_json_%05d_%03d.%s",
        json_object_path_get_string(main_obj, "clargs/filebase"),
        MACSIO_MIF_RankOfGroup(bat, rank),
        dumpn,
        json_object_path_get_string(main_obj, "clargs/fileext"));

    MACSIO_UTILS_RecordOutputFiles(dumpn, fileName);

    myFile = (FILE *) MACSIO_MIF_WaitForBaton(bat, fileName, 0);

    parts = json_object_path_get_array(main_obj, "problem/parts");
    for (i = 0; i < json_object_array_length(parts); i++)
    {
        json_object *this_part = json_object_array_get_idx(parts, i);
        json_object_array_add(part_infos, write_mesh_part(myFile, fileName, this_part));
    }

    /* Hand off the baton to the next processor. This winds up closing
     * the file so that the next processor that opens it can be assured
     * of getting a consistent and up to date view of the file's contents. */
    MACSIO_MIF_HandOffBaton(bat, myFile);

    /* We're done using MACSIO_MIF for these files, so finish it off */
    MACSIO_MIF_Finish(bat);

    /* Use MACSIO_MIF a second time to manage writing of the master/root
       file contents. This winds up being serial I/O but also means we
       never collect all info on all parts to any single processor. */
//#warning THERE IS A BETTER WAY TO DO USING LOOP OF NON-BLOCKING RECIEVES
    bat = MACSIO_MIF_Init(1, ioFlags, MACSIO_MAIN_Comm, 5,
        CreateMyFile, OpenMyFile, CloseMyFile, 0);

    /* Construct name for the silo file */
    sprintf(fileName, "%s_json_root_%03d.%s",
        json_object_path_get_string(main_obj, "clargs/filebase"),
        dumpn,
        json_object_path_get_string(main_obj, "clargs/fileext"));

    MACSIO_UTILS_RecordOutputFiles(dumpn, fileName);


    /* Wait for MACSIO_MIF to give this processor exclusive access */
    myFile = (FILE *) MACSIO_MIF_WaitForBaton(bat, fileName, 0);

//#warning FIX THE STRING THAT WE PRODUCE HERE SO ITS A SINGLE JSON ARRAY OBJECT
    /* This processor's work on the file is just to write its part_infos */
    fprintf(myFile, "%s\n", json_object_to_json_string_ext(part_infos, JSON_C_TO_STRING_PRETTY));

    MACSIO_MIF_HandOffBaton(bat, myFile);

    MACSIO_MIF_Finish(bat);

    /* decriment ref-count (and free) part_infos */
    json_object_put(part_infos);
}

/*!
\brief Method to register this plugin with MACSio main

Due to its use to initialize a file-scope, static const variable, this
function winds up being called at load time (e.g. before main is even called).

Its purpose is to add key information about this plugin to MACSio's global
interface table.
*/
static int register_this_interface()
{
    MACSIO_IFACE_Handle_t iface;

    if (strlen(iface_name) >= MACSIO_IFACE_MAX_NAME)
        MACSIO_LOG_MSG(Die, ("Interface name \"%s\" too long", iface_name));

    /* Populate information about this plugin */
    strcpy(iface.name, iface_name);
    strcpy(iface.ext, iface_ext);
    iface.dumpFunc = main_dump;
    iface.processArgsFunc = process_args;

    /* Register this plugin */
    if (!MACSIO_IFACE_Register(&iface))
        MACSIO_LOG_MSG(Die, ("Failed to register interface \"%s\"", iface_name));

    return 0;
}

/*!
\brief Dummy initializer to trigger register_this_interface by the loader

This one statement is the only statement requiring compilation by
a C++ compiler. That is because it involves initialization and non
constant expressions (a function call in this case). This function
call is guaranteed to occur during *initialization* (that is before
even 'main' is called) and so will have the effect of populating the
iface_map array merely by virtue of the fact that this code is linked
with a main.
*/
static int const dummy = register_this_interface();

/*!@}*/

/*!@}*/

HDF5 Plugin

The `HDF5`_ plugin is designed to support both MIF and SSF parallel I/O paradigms. In SSF mode, writes can be collective or independent. In MIF mode, only independent writes are supported because each task creates its own datasets.

In addition, it supports Gzip, szip and zfp compression but only in MIF mode.

Finally, there are a number of command-line arguments to the plugin that allow control of a number of low-level `HDF5`_ features. More can be easily added.

There have been some recent additions to help improve scalable performance of the HDF5_ library and support for these features are yet to have been added to the plugin.

There is a minor difference in the representation of the mesh and its field between the MIF and SSF parallel I/O modes. In MIF mode, each task outputs a chunk of mesh such that nodes on the exterior of the chunk are duplicates of the equivalent nodes on neighboring chunks. However, in SSF mode, the whole mesh is assembled into a single, monolithic whole mesh. The nodes that are duplicated in neighboring mesh chunks are factored out of the actual write requests to HDF5 to avoid lower levels in the I/O stack from having to handle coincident writes.

Command-Line Arguments

Note

Need to auto-gen the command-line args doc from the source code or a run of the executable with –help

Dependencies

To use either gzip or szip compression with the `HDF5`_ plugin, the HDF5 library to which MACSio is linked must have been compiled with support for zlib and szip.

However, to use zfp compression with the `HDF5`_ plugin, all that is necessary is that the H5Z-ZFP filter be available to the `HDF5`_ library to load at run time as any other filter plugin. The H5Z-ZFP filter must have been compatably compiled. In addition, the environment variable, HDF5_PLUGIN_PATH set to include a directory containining the H5Z-ZFP filter plugin library. For example,

env HDF5_PLUGIN_PATH=<path-to-plugin-dir> mpirun -np 4 ./macsio ...

Defines

H5Pset_zfp_rate_cdata(R, N, CD)

H5Z-ZFP generic interface for setting rate mode.

H5Pset_zfp_precision_cdata(P, N, CD)

H5Z-ZFP generic interface for setting precision mode.

H5Pset_zfp_accuracy_cdata(A, N, CD)

H5Z-ZFP generic interface for setting accuracy mode.

H5Pset_zfp_expert_cdata(MiB, MaB, MaP, MiE, N, CD)

H5Z-ZFP generic interface for setting expert mode.

MAINZER_PARAMS

Typedefs

typedef struct _user_data user_data_t

User data for MIF callbacks.

Functions

hid_t make_fapl()

create HDF5 library file access property list

int get_tokval(char const *src_str, char const *token_to_match, void *val_ptr)

Utility to parse compression string command-line args.

Parameters
  • src_str: CL arg string to be parsed

  • token_to_match: a token in the string to be matched including a trailing scanf format specifier

  • val_ptr: Pointer to memory where parsed value should be placed

Does a case-insensitive search of src_str for token_to_match not including the trailing format specifier. Upon finding a match, performs a scanf at the location of the match into temporary memory confirming the scan will actually succeed. Then, performs the scanf again, storying the result to the memory indicated in val_ptr.

Return

0 on error, 1 on success

hid_t make_dcpl(char const *alg_str, char const *params_str, hid_t space_id, hid_t dtype_id)

create HDF5 library dataset creation property list

Parameters
  • alg_str: compression algorithm string

  • params_str: compression params string

  • space_id: HDF5 dataspace id for the dataset

  • dtype_id: HDF5 datatype id for the dataset

If the dataset size is below the minsize threshold, no special storage layout or compression action is taken.

Chunking is initially set to single-chunk. However, for szip compressor, chunking can be set by command-line arguments.

int process_args(int argi, int argc, char *argv[])

Process command-line arguments an set local variables.

Parameters
  • argi: argument index to start processing argv

  • argc: argc from main

  • argv: argv from main

void main_dump_sif(json_object *main_obj, int dumpn, double dumpt)

Single shared file implementation of main dump.

Parameters
  • main_obj: main json data object to dump

  • dumpn: dump number (like a cycle number)

  • dumpt: dump time

void *CreateHDF5File(const char *fname, const char *nsname, void *userData)

MIF create file callback for HDF5 MIF mode.

Parameters
  • fname: file name

  • nsname: curent task namespace name

  • userData: user data specific to current task

void *OpenHDF5File(const char *fname, const char *nsname, MACSIO_MIF_ioFlags_t ioFlags, void *userData)

MIF Open file callback for HFD5 plugin MIF mode.

Parameters
  • fname: filename

  • nsname: namespace name for current task

  • userData: task specific user data for current task

int CloseHDF5File(void *file, void *userData)

MIF close file callback for HDF5 plugin MIF mode.

Parameters
  • file: void* to hid_t of file to cose

  • userData: task specific user data

void write_mesh_part(hid_t h5loc, json_object *part_obj)

Write individual mesh part in MIF mode.

Parameters
  • h5loc: HDF5 group id into which to write

  • part_obj: JSON object for the mesh part to write

void main_dump_mif(json_object *main_obj, int numFiles, int dumpn, double dumpt)

Main dump output for HDF5 plugin MIF mode.

Parameters
  • main_obj: main data object to dump

  • numFiles: MIF file count

  • dumpn: dump number (like a cycle number)

  • dumpt: dump time

void main_dump(int argi, int argc, char **argv, json_object *main_obj, int dumpn, double dumpt)

Main dump callback for HDF5 plugin.

Parameters
  • argi: arg index at which to start processing argv

  • argc: argc from main

  • argv: argv from main

  • main_obj: main json data object to dump

  • dumpn: dump number

  • dumpt: dump time

Selects between MIF and SSF output.

int register_this_interface()

Function called during static initialization to register the plugin.

Variables

char const *iface_name = "hdf5"

name of this plugin

char const *iface_ext = "h5"

file extension for files managed by this plugin

int use_log = 0

Use HDF5’s logging fapl

int no_collective = 0

Use HDF5 independent (e.g. not collective) I/O

int no_single_chunk = 0

disable single chunking

int silo_block_size = 0

block size for silo block-based VFD

int silo_block_count = 0

block count for silo block-based VFD

int sbuf_size = -1

HDF5 library sieve buf size

int mbuf_size = -1

HDF5 library meta blocck size

int rbuf_size = -1

HDF5 library small data block size

int lbuf_size = 0

HDF5 library log flags

const char *filename
hid_t fid
hid_t dspc = -1
int show_errors = 0
char compression_alg_str[64]
char compression_params_str[512]
int dummy = register_this_interface()

Static initializer statement to cause plugin registration at link time.

this one statement is the only statement requiring compilation by a C++ compiler. That is because it involves initialization and non constant expressions (a function call in this case). This function call is guaranteed to occur during initialization (that is before even ‘main’ is called) and so will have the effect of populating the iface_map array merely by virtue of the fact that this code is linked with a main.

struct _user_data

User data for MIF callbacks.

Public Members

hid_t groupId

HDF5 hid_t of current group

Parallel I/O Paradigms

MACSio provides utility packages to support the development of plugins handling a variety of parallel I/O paradigms and variations thereof. These include

  • Multiple Independent File (MIF) parallel I/O * With and without SCR * With and without naive Two-Phase

  • Single, Shared File (SSF) parallel I/O

  • Hybrid MIF/SSF where each MIF file is handled via SSF

  • Multi-Pass

  • Application managed/extra I/O nodes.

To be clear, MACSio itself does NOT implement any of these parallel I/O paradigms. Rather, MACSio provides some utility packages that enable plugin developers to implement various parallel I/O paradigms. These packages are described here.

Multiple Independent File (MIF) Parallel I/O

MACSio’s Multiple Independent File (MIF) package is designed to support the development of plugins utilizing the MIF paradigm.

In the multiple, independent file paradigm, parallelism is achieved through simultaneous access to multiple files. The application divides itself into file groups. For each file group, the application manages exclusive access among all the tasks of the group. I/O is serial within groups but parallel across groups. The number of files (groups) is wholly independent from the number of processors or mesh-parts. It is often chosen to match the number of independent I/O pathways available in the hardware between the compute nodes and the filesystem. The MIF paradigm is sometimes also called N->M because it is N tasks writing to M files (M<N).

In this paradigm I/O requests are almost exclusively independent. However, there are scenarios where collective I/O requests can be made to work and might even make sense in the MIF paradigm. MIF is often referred to as Poor Man’s parallel I/O because the onus is on the application to manage the distribution of data across potentially many files. In truth, this illuminates the only salient distinction between Single Shared File (SSF) and MIF. In either paradigm, if you dig deep enough into the I/O stack, you soon discover that data is always being distributed across multiple files. The only difference is where in the stack the distribution into files is handled. In the MIF paradigm, it is handled explicitly by the application itself. In a typical SSF paradigm, it is handled by the underling filesystem.

In MIF, MPI tasks are divided into N groups and each group is responsible for creating one of the N files. At any one moment, only one MPI task from each group has exclusive access to the file. Hence, I/O is serial within a group. However, because one task in each group is writing to its group’s own file, simultaneously, I/O is parallel across groups.

_images/pmpio_diagram.png

In the example in the diagram, the computational object is decomposed into 16 pieces, called domains, distributed among 6 tasks. Some tasks have 3 domains and some have 2. Tasks are formed into 3 groups each group producing one file. Finally, a small “master” file is written to capture information on how all the pieces are distributed among the files.

A call to MACSIO_MIF_Init() establishes the mapping of tasks to file groups.

Within a group, access to the group’s file is handled in a round-robin fashion. The first MPI task in the group creates the file and then iterates over all mesh pieces it has. For each mesh piece, it creates a sub-directory within the file (e.g., a separate namespace for the objects). It repeats this process for each mesh piece it has. Then, the first MPI task closes the file and hands off exclusive access to the next task in the group. That MPI task opens the file and iterates over all domains in the same way. Exclusive access to the file is then handed off to the next task. This process continues until all processors in the group have written their domains to unique sub-directories in the file.

Calls to MACSIO_MIF_WaitForBaton() and MACSIO_MIF_HandOffBaton() handle this handshaking of file creations and/or opens and bracket blocks of code that are performing MIF I/O.

After all groups have finished with their files, an optional final step may involve creating a master file which contains special metadata objects that point at all the pieces of mesh (domains) scattered about in the N files.

A call to MACSIO_MIF_Finish() frees up resources associated with handling MIF mappings.

The basic coding structure for a MIF I/O operation is as follows…

MACSIO_MIF_baton_t bat = MACSIO_MIF_Init(. . .);
SomeFileType *fhndl = (SomeFileType *) MACSIO_MIF_WaitForBaton(bat, "GroupName", "ProcName");

    /* this task's work on the file */

MACSIO_MIF_HandOffBaton(bat, fhndl);
MACSIO_MIF_Finalize(bat);

Setting N to be equal to the number of MPI tasks, results in a file-per-process configuration, which is typically not recommended. However, some applications do indeed choose to run this way with good results. Alternatively, setting N equal to 1 results in effectively serializing the I/O and is also certainly not recommended. For large, parallel runs, there is typicall a sweet spot in the selection of N which results in peak I/O performance rates. If N is too large, the I/O subsystem will likely be overwhelmed; setting it too small will likely underutilize the system resources. This is illustrated of files and MPI task counts.

_images/Ale3d_io_perf2.png

Time to write a restart file from Ale3d as a function of MPI Task count and MIF file count (circa 1999)

This approach to scalable, parallel I/O was originally developed in the late 1990s by Rob Neely, a lead software architect on ALE3D at the time. It and variations thereof have since been adopted by many codes and used productively through several transitions in orders of magnitude of MPI task counts from hundreds then to hundreds of thousands today.

There are a large number of advantages to MIF-IO over SSF-IO.

  • MIF enables the use of serial (e.g. non-parallel) I/O libraries. One caveat is that such libraries do need to provide namespace features (e.g. sub-directories).

  • MIF is a much simpler programming model because it frees developers from having to think in terms of collective I/O operations. The code to write data from one MPI task doesn’t depend on or involve other tasks. For large multi-physics applications where the size, shape and even existence of data can vary dramatically among MPI tasks, this is invaluable in simplifying I/O code development.

  • MIF alleviates any need for global-to-local and local-to-global remapping upon every exchange of data between the application and its file.

  • Some Data-in-transit (DIT) services (e.g. those that may change size or shape such as compression) are easier to apply in a MIF setting because processors are freed from having to coordinate with each other on changes in data size/shape as it is moved to the file.

  • Good performance demands very little in the way of extra/advanced features from the underlying I/O hardware and filesystem. A relatively dumb filesysem can get it right and perform well.

  • Application controlled throttling of I/O is easily supported in a MIF setting because the number of concurrent operations is explicitly controlled. This can help to avoid overloading the underlying I/O subsystems.

  • MIF is consistent with the way leading-edge commercial big data I/O in map-reduce operations is handled. Data sets are broken into pieces and stored in the filesystem as a collection of shards and different numbers of parallel tasks can process different numbers of shards.

MACSio’s MIF Package and Naive Two-Phase I/O

Warning

FEATURE CURRENTLY IN DESIGN

An additional option available in MACSio’s MIF package is support for a naive form of two-phase I/O. In this mode, I/O requests from each parallel task are accumulated into a local ram-disk or other form of node-local storage. This process proceeds in an embarrasingly parallel way until either a threshold of storage is reached or the plugin explicitly indicates a flush is desired. At this point, all parallel tasks block while local data accumulated on each task is aggregted to aggregator tasks where the actual I/O is performed. To accomplish this, MACSio implements what amount to a virtual file such that all I/O operations performed by a task are captured in this vitual file. Aggregation is performed by message passing these virtual files to aggregator tasks which essentially turn around and replay.

MACSio’s MIF Package and SCR

These MIF utilities are designed to support use in conjunction with the Scalable Checkpoint / Restart (SCR) library. However, use of SCR may place additional restrictions on the tasks-to-files mapping depending, partially, on whether SCR is configured to write to node-local storage. For example, SCR is typically supported only in file-per-processor mappings.

MIF API

struct _MACSIO_MIF_ioFlags_t

Bit Field struct for I/O flags.

Public Members

unsigned int do_wr

bit0: 1=write, 0=read

unsigned int use_scr

bit1: 1=use SCR, 0=don’t use SCR

group MACSIO_MIF

Utilities supporting Multiple Indpendent File (MIF) Parallel I/O.

Defines

MACSIO_MIF_READ
MACSIO_MIF_WRITE

Typedefs

typedef struct _MACSIO_MIF_ioFlags_t MACSIO_MIF_ioFlags_t

Bit Field struct for I/O flags.

typedef struct _MACSIO_MIF_baton_t MACSIO_MIF_baton_t

Opaque struct holding private implementation of MACSIO_MIF_baton_t.

typedef void *(*MACSIO_MIF_CreateCB)(const char *fname, const char *nsname, void *udata)
typedef void *(*MACSIO_MIF_OpenCB)(const char *fname, const char *nsname, MACSIO_MIF_ioFlags_t ioFlags, void *udata)
typedef int (*MACSIO_MIF_CloseCB)(void *file, void *udata)

Functions

MACSIO_MIF_baton_t *MACSIO_MIF_Init(int numFiles, MACSIO_MIF_ioFlags_t ioFlags, MPI_Comm mpiComm, int mpiTag, MACSIO_MIF_CreateCB createCb, MACSIO_MIF_OpenCB openCb, MACSIO_MIF_CloseCB closeCb, void *clientData)

Initialize MACSIO_MIF for a MIF I/O operation.

Parameters
  • numFiles: Number of resultant files. Note: this is entirely independent of number of tasks. Typically, this number is chosen to match the number of independent I/O pathways between the nodes the application is executing on and the filesystem. Pass MACSIO_MIF_MAX for file-per-processor. Pass MACSIO_MIF_AUTO (currently not supported) to request that MACSIO_MIF determine and use an optimum file count.

  • ioFlags: See MACSIO_MIF_ioFlags_t for meaning of flags.

  • mpiComm: The MPI communicator containing all the MPI ranks that will marshall data in the MIF I/O operation.

  • mpiTag: MPI message tag MACSIO_MIF will use in all MPI messages for this MIF I/O context.

  • createCb: Callback MACSIO_MIF should use to create a group’s file

  • openCb: Callback MACSIO_MIF should use to open a group’s file

  • closeCb: Callback MACSIO_MIF should use to close a group’s file

  • clientData: Optional, client specific data MACSIO_MIF will pass to callbacks

Creates and returns a MACSIO_MIF baton object establishing the mapping between MPI ranks and file groups for a MIF I/O operation.

All processors in the mpiComm communicator must call this function collectively with identical values for numFiles, ioFlags, and mpiTag.

If a given execution context has multiple active MACSIO_MIF_baton_t objects, the caller must ensure that each passed a different mpiTag.

The resultant baton object is used in subsequent calls to WaitFor and HandOff the baton to the next processor in each group.

The createCb, openCb, closeCb callback functions are used by MACSIO_MIF to execute baton waits and handoffs during which time a group’s file will be closed by the HandOff function and opened by the WaitFor method except for the first processor in each group which will create the file.

Processors in the mpiComm communicator are broken into numFiles groups. If there is a remainder, R, after dividing the communicator size into numFiles groups, then the first R groups will have one additional processor.

Return

The MACSIO_MIF baton object

void MACSIO_MIF_Finish(MACSIO_MIF_baton_t *bat)

End a MACSIO_MIF I/O operation and free resources.

Parameters
  • bat: The MACSIO_MIF baton handle

void *MACSIO_MIF_WaitForBaton(MACSIO_MIF_baton_t *Bat, char const *fname, char const *nsname)

Wait for exclusive access to the group’s file.

Parameters
  • Bat: The MACSIO_MIF baton handle

  • fname: The filename

  • nsname: The namespace within the file to be used for this task’s objects.

All tasks in mpiComm argument to MACSIO_MIF_Init() call this function collectively. For the first task in each group, this call returns immediately. For all others in the group, it blocks, waiting for the task before it to finish its work on the group’s file and call MACSIO_MIF_HandOffBaton().

Return

A void pointer to whatever data instance the createCb or openCb methods return. The caller must cast this returned pointer to the correct type.

int MACSIO_MIF_HandOffBaton(MACSIO_MIF_baton_t const *Bat, void *file)

Release exclusive access to the group’s file.

Parameters
  • Bat: The MACSIO_MIF baton handle

  • file: A void pointer to the group’s file handle

This function is called only by the current task holding exclusive access to a group’s file and closes the group’s file for the calling task handing off control to the next task in the group.

Return

The integer value returned from the MACSIO_MIF_CloseCB callback.

int MACSIO_MIF_RankOfGroup(MACSIO_MIF_baton_t const *Bat, int rankInComm)

Rank of the group in which a given (global) rank exists.

Parameters
  • Bat: The MACSIO_MIF baton handle

  • rankInComm: The (global) rank of a task for which the rank of its group is desired

Given the rank of a task in mpiComm used in the MACSIO_MIF_Init() call, this function returns the rank of the group in which the given task exists. This function can be called from any rank and will return correct values for any rank it is passed.

int MACSIO_MIF_RankInGroup(MACSIO_MIF_baton_t const *Bat, int rankInComm)

Rank within a group of a given (global) rank.

Parameters
  • Bat: The MACSIO_MIF baton handle

  • rankInComm: The (global) rank of a task for which it’s rank in a group is desired

Given the rank of a task in mpiComm used in the MACSIO_MIF_Init() call, this function returns its rank within its group. This function can be called from any rank and will return correct values for any rank it is passed.

struct _MACSIO_MIF_ioFlags_t
#include <macsio_mif.h>

Bit Field struct for I/O flags.

Public Members

unsigned int do_wr

bit0: 1=write, 0=read

unsigned int use_scr

bit1: 1=use SCR, 0=don’t use SCR

Single Shared File (SSF) Parallel I/O

Command Line Argument Parsing

The command line argument package simplifies a number of aspects of defining and also parsing command line arguments for MACSio’s main code or any of its plugins. The package supports command line arguments and their options with the followin features…

  • boolean arguments (the argument’s presence or absence maps to 0 or 1)

  • integer, double and string valued arguments

  • multiple arguments for a given command-line option

  • default values for arguments

  • option groupings

  • help strings and printing of usage with --help option

  • pretty (somewhat) formatting of long help strings to terminal width

  • separating arguments into sub-command groups.

  • routing the parsed results to caller supplied local variables or constructing a returned JSON object.

  • error detection and logging

  • collective execution in parallel * argc, argv inputs ensured consistent across all tasks * results collectively unified across all tasks in parallel

The package is designed to do parsing of simple command-line arguments and assign values associated with them either to caller-supplied scalar variables or to populate a json_object. The work-horse function, MACSIO_CLARGS_ProcessCmdline(), is used in the following manner…

After the argi, argc, argv trio of arguments, the remaining arguments come in groups of either 3 (when mapping to a json_object) or 4 (when mapping to scalar program variables).

The first of these is an argument format specifier much like a printf statement. It indicates the type of each parameter to the argument as well as the number of parameters. Presently, it understands only %d, %f and %s types.

The second is a string argument indicating the default value, as a string.

The third is a help string for the argument. Note, you can make this string as long as the C-compiler will permit. You should not embed any newline characters as the routine to print the help string will do that for you. However, you may embed newline characters if you wish to control specific line breaking when output.

The fourth argument is present only when mapping to scalar program variables and is a pointer to the variable location(s) where values will be stored.

Command line arguments for which only existence on the command-line is tested assume a caller-supplied return value of int and will be assigned 1 if the argument exists on the command line and 0 otherwise.

Do not name any argument with a substring help as that is reserved for obtaining help. Also, do not name any argument with the string end_of_args as that is used to indicate the end of the list of arguments passed to the function.

If any argument on the command line has the substring help, help will be printed by task 0 and then this function calls MPI_Finalize() (in parallel) and exit(1).

This function must be called collectively in MPI_COMM_WORLD. All tasks are guaranteed to complete with identical results.

CLARGS API

group MACSIO_CLARGS

Command-line argument parsing utilities.

Defines

MACSIO_CLARGS_WARN
MACSIO_CLARGS_ERROR
MACSIO_CLARGS_TOMEM
MACSIO_CLARGS_TOJSON
MACSIO_CLARGS_ASSIGN_OFF
MACSIO_CLARGS_ASSIGN_ON
MACSIO_CLARGS_HELP
MACSIO_CLARGS_OK
MACSIO_CLARGS_GRP_SEP_STR
MACSIO_CLARGS_GRP_BEG
MACSIO_CLARGS_GRP_END
MACSIO_CLARGS_ARG_GROUP_BEG(GRPNAME, GRPHELP)

Begin option group (for TOJSON routing invokation)

MACSIO_CLARGS_ARG_GROUP_END(GRPNAME)

End option group (for TOJSON routing invokation)

MACSIO_CLARGS_ARG_GROUP_BEG2(GRPNAME, GRPHELP)

Begin option group (for TOMEM routing invokation)

MACSIO_CLARGS_ARG_GROUP_END2(GRPNAME)

End option group (for TOMEM routing invokation)

MACSIO_CLARGS_END_OF_ARGS

Moniker to terminate argument list.

MACSIO_CLARGS_NODEFAULT

Value to indicate no default specified.

Typedefs

typedef struct _MACSIO_CLARGS_ArgvFlags_t MACSIO_CLARGS_ArgvFlags_t

Functions

int MACSIO_CLARGS_ProcessCmdline(void **retobj, MACSIO_CLARGS_ArgvFlags_t flags, int argi, int argc, char **argv, ...)

Command-line argument parsing, default values and help output.

Parameters
  • retobj: Optional void pointer to a returned object encoding the command line options. Currently only JSON_C object is supported. Must specify TOJSON routing.

  • argi: First arg index at which to to start processing argv

  • argc: argc from main

  • argv: argv from main

Avoid definining any argument with the substring --help or --no-strict. These are command-line option keywords reserved by MACSio. --help is a request to print usage and exit. no-strict disables strict command-line option error response which ordinarily causes an abort and instead issues warnings.

In parallel, this function must be called collectively by all ranks in MPI_COMM_WORLD. argc and argv on task rank 0 are broadcast to all tasks. If request for help (e.g. --help is in argv) or error(s) are encountered, all tasks are broadcast this outcome. Only task rank 0 will print usage or error messages.

Return

An integer value of either MACSIO_CLARGS_OK, MACSIO_CLARGS_HELP or MACSIO_CLARGS_ERROR.

struct _MACSIO_CLARGS_ArgvFlags_t
#include <macsio_clargs.h>

Public Members

unsigned int error_mode

0=warn, 1=abort

unsigned int route_mode

0=(TOMEM) scalar variables, 1=(TOJSON) json_object, 2,3 unused

unsigned int defaults_mode

0=assign defaults, 1=do not assign defaults

Example

int doMultifile, numCycles, Ni, Nj;
double Xi, Xj;
MACSIO_CLARGS_ArgvFlags_t flags;
json_object *obj = 0;
int argi = 1; /* start at argv[1] not argv[0] */

flags.error_mode = MACSIO_CLARGS_ERROR;
flags.route_mode = MACSIO_CLARGS_TOMEM;

MACSIO_CLARGS_ProccessCmdline(obj, flags, argi, argc, argv,
    "--multifile", "", /* example boolean with no default */
        "if specified, use a file-per-timestep",
        &doMultifile,
    "--numCycles %d", "10", /* example integer with default of 10 */
        "specify the number of cycles to run",
        &numCycles,
    "--dims %d %f %d %f", "25 5 35 7", /* example multiple values */
        "specify size (logical and geometric) of mesh",
        &Ni, &Xi, &Nj, &Xj,
MACSIO_CLARGS_END_OF_ARGS);
./macsio --help
Usage and Help for "./macsio" Defaults, if any, in square brackets after argument definition
  --multifile [] if specified, use a file-per-timestep
  --numCycles %d [10] specify the number of cycles to run
  --dims %d %f %d %f [25 5 35 7] specify size (logical and geometric) of mesh

Timing

MACSio’s timing package includes utilities to support the creation of a number of timers to time sections of code.

Timers are initialized/started with a user-defined label, and an optional group mask and iteration number. A hash of the timer is computed from its label and group mask combined with the source file name and line number. The resulting hash value is used to identify the timer and returned as the timer’s id when the timer is created/started.

Timers can be iterated and re-started. These are two different concepts. During iteration, the same timer can be used to time the same section of code as it gets executed multiple times. Each time represents a different iteration of the timer. The timer will keep track of the minimum time, maximum time, average time and variance of times over all iterations it is invoked. However, apart from these running statistics, a timer maintains no memory of past values. If a timer is iterated 10 times, it does not maintain knowledge of all 10 individual times. It maintains only knowledge of the running statistics; min, max, avg, var. The algorithm it uses to maintain these running statistics is the Welford Online algorithm.

Within any timer iteration, a timer can be stopped and restarted. If a timer is being used to time a block of code that involves some occasional and perhaps irrlevant alternative logic path, the timer can be stopped until execution returns from the alternative path. Then the timer can be restarted. This can only be done within a given iteration.

Timers can also be grouped into a small number (<64) of different groups. A timer group is created with a call to MACSIO_TIMING_GroupMask(). Timers are included in one or more groups by or’ing the associated group masks into the gmask argument to MACSIO_TIMING_StartTimer(). A timer can be a member of multiple groups For example, it is possible to maintain a set of timers for MIF file I/O apart from a set of timers being used to time a particular plugin’s operations to marshal a mesh or variable to/from persistent storage. Or, operations use to interact with filesystem metadata directly (e.g. mkdir, chdir, readir, etc.) can be maintained separately from timers used for entirely other purposes.

Finally, timers can be reduced parallel tasks thereby creating a statistical summary of timer information across tasks.

A timer is initialized/started by a call to MACSIO_TIMING_StartTimer() or the shorthand macro MT_StartTimer. This call returns the timer’s ID which is used in a subsequent call to MACSIO_TIMING_StopTimer() to stop the timer.

MACSIO_TIMING_TimerId_t tid = MT_StartTimer("my timer", MACSIO_TIMING_GROUP_NONE, MACSIO_TIMING_ITER_AUTO);

...do some work here...

MT_StopTimer(tid);

In the above code, the call to MT_StartTimer starts a timer for a new (automatic) iteration. In this simple examle, we do not worry about timer group masks.

By default, the timing package uses MPI_Wtime() but a caller can set MACSIO_TIMING_UseMPI_Wtime to zero to instead use gettimeofday().

Examples of the use of the logging package can be found in Example (tsttiming.c).

Timing API

group MACSIO_TIMING

Timing utilities.

Defines

MACSIO_TIMING_ITER_AUTO

Automatic iteration numbering Use for iter arg to MT_StartTimer() for automatic iteration numbering of the timer explicitly.

MACSIO_TIMING_ITER_IGNORE

What is this?

MACSIO_TIMING_INVALID_TIMER

May be returned from MT_StartTimer()

MACSIO_TIMING_NO_GROUP

Group mask when timer is not assigned to any group.

MACSIO_TIMING_ALL_GROUPS

Group mask representing all groups.

MT_Time

Shorthand for MACSIO_TIMING_GetCurrentTime()

__BASEFILE__

Same as basename(FILE) Used to strip unnecessary path terms from in messages.

MT_StartTimer(LAB, GMASK, ITER)

Shorthand for MACSIO_TIMING_StartTimer()

Parameters
  • [in] LAB: User defined timer label string

  • [in] GMASK: User defined group mask. Use MACSIO_TIMING_NO_GROUP if timer grouping is not needed.

  • [in] ITER: The iteration number. Use MACSIO_TIMING_ITER_IGNORE if timer iteration is not needed.

MT_StopTimer(ID)

Shorthand for MACSIO_TIMING_StopTimer()

Parameters

Typedefs

typedef unsigned int MACSIO_TIMING_TimerId_t
typedef unsigned long long MACSIO_TIMING_GroupMask_t

Functions

MACSIO_TIMING_GroupMask_t MACSIO_TIMING_GroupMask(char const *grpName)

Create a group name and mask.

Parameters
  • grpName: Name of the group for which group mask is needed

A small number of groups (less than 64) can be defined into which timers can be grouped Timers can be assigned to multiple groups by or’ing the resulting group masks.

MACSIO_TIMING_TimerId_t MACSIO_TIMING_StartTimer(char const *label, MACSIO_TIMING_GroupMask_t gmask, int iter_num, char const *file, int line)

Create/Start a timer.

Parameters
  • label: User defined label to be assigned to the timer

  • gmask: Mask to indicate the timer’s group membership

  • iter_num: Iteration number

  • file: The source file name

  • line: The source file line number

This call either creates a new timer and starts it or starts a new iteration of an existing timer.

Return

A hash derived from a string catenation the label, gmask, file and line.

double MACSIO_TIMING_StopTimer(MACSIO_TIMING_TimerId_t id)

Stop a timer.

Parameters

This call stops a currently running timer.

Return

Returns the time for the current iteration of the timer

double MACSIO_TIMING_GetTimerDatum(MACSIO_TIMING_TimerId_t tid, char const *field)

Get specific data field from a timer.

Parameters

Possible field names are…

  • “start_time” time last iteration was started

  • ”iter_count” number of times the timer has been triggered (iterated)

  • ”iter_time” total time over all iterations

  • ”min_iter” iteration at which minimum time was observed

  • ”max_iter” iteration at which maximum time was observed

  • ”min_rank” task rank where minimum time occurred (only valid on reduced timers)

  • ”max_rank” task rank where maximum time occurred (only valid on reduced timers)

  • ”depth” how deeply this timer is nested within other timers

  • ”total_time” total time spent in this timer

  • ”min_time” minimum time observed for this timer

  • ”max_time” maximum time observed for this timer

  • ”running_mean” current average time for this timer

  • ”running_var” current variance for this timer

Where applicable, returned values are over either

  • all iterations (when using non-reduced timers) or

  • all iterations and ranks (when using reduced timers).

double MACSIO_TIMING_GetReducedTimerDatum(MACSIO_TIMING_TimerId_t tid, char const *field)

Get data field a specific reduced timer.

Parameters

For field names, see MACSIO_TIMING_GetTimerDatum()

void MACSIO_TIMING_DumpTimersToStrings(MACSIO_TIMING_GroupMask_t gmask, char ***strs, int *nstrs, int *maxlen)

Dump timers to ascii strings.

Parameters
  • gmask: Group mask to filter only timers belonging to specific groups

  • strs: Array of strings, one for each timer, returned to caller. Caller must free.

  • nstrs: Number of strings returned to caller

  • maxlen: The maximum length of all strings

This call will find all used timers in the hash table matching the specified gmask group mask and dumps each timer to a string. For convenience, the maximum length of the strings is also returned. This is to facilitate dumping the strings to a MACSIO_LOG. Pass MACSIO_TIMING_ALL_GROUPS if you don’t care about timer groups.

Each timer is dumped to a string with the following informational fields catenated together and separated by colons…

  • TOT=%10.5f summed total time spent in this timer over all iterations (and over all ranks when reduced)

  • CNT=%04d summed total total number of iterations in this timer (and over all ranks when reduced)

  • MIN=%8.5f(%4.2f):%06d

    • minimum time recorded for this timer over all iterations (and ranks when reduced)

    • (%4.2f) the number of standard deviations of the min from the mean time

    • :%06d task rank where the minimum was observed (only valid when reduced)

  • AVG=%8.5f the mean time of this timer over all iterations (and ranks when reduced)

  • MAX=%8.5f(%4.2f):%06d,

    • maximum time recorded for this timer over all iterations (and ranks when reduced)

    • (%4.2f) the number of standard deviations of the max from the mean time

    • :%06d task rank where the maximum was observed. (only valid when reduced)

  • DEV=%8.8f standard deviation observed for all iterations of this timer

  • FILE=%s the source file where this timer is triggered

  • LINE=%d the source line number where this timer is triggered

  • LAB=%s the user-defined label for this timer

void MACSIO_TIMING_ReduceTimers(MPI_Comm comm, int root)

Reduce timers across MPI tasks.

Parameters
  • comm: The MPI communicator to use for the reduction

  • root: The MPI rank of the root task to be used for the reduction

Computes a parallel reduction across MPI tasks of all timers.

void MACSIO_TIMING_DumpReducedTimersToStrings(MACSIO_TIMING_GroupMask_t gmask, char ***strs, int *nstrs, int *maxlen)

Dump reduced timers to ascii strings.

Parameters
  • gmask: Group mask to filter only timers belonging to specific groups

  • strs: Array of strings, one for each timer, returned to caller. Caller must free

  • nstrs: Number of strings returned to caller

  • maxlen: The maximum length of all strings

Similar to DumpTimersToStrings except this call dumps reduced timers

void MACSIO_TIMING_ClearTimers(MACSIO_TIMING_GroupMask_t gmask)

Clear a group of timers.

Parameters
  • gmask: Group mask to filter only timers belonging to specific groups

Clears and resets timers of specified group. Pass MACSIO_TIMING_ALL_GROUPS to clear all timers.

double MACSIO_TIMING_GetCurrentTime(void)

Get current time.

Depending on current value of MACSIO_TIMING_UseMPI_Wtime, uses either MPI_WTime() or gettimeofday().

Variables

int MACSIO_TIMING_UseMPI_Wtime

Integer variable to control function used to get timer values.

A non-zero value indicates that MACSIO_TIMING should use MPI_Wtime(). Otherwise, it will use gettimeofday().

Example (tsttiming.c)

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156

#include <math.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#ifdef HAVE_MPI
#include <mpi.h>
#endif

#include <macsio_log.h>
#include <macsio_timing.h>

int MACSIO_MAIN_Rank;
int MACSIO_MAIN_Size;
int MACSIO_MAIN_Comm;

static void dsleep(double delay)
{
    double dsec = floor(delay);
    long nsec = (long) ((delay - dsec) * 1e+9);
    time_t sec = (time_t) dsec;
    struct timespec ts = {sec, nsec};
    nanosleep(&ts, 0);
}

void func2();

void func4()
{
    static int iter = 0;
    double delay = (random() % 1000) / (double) 10000; /* random up to 1/10th second of sleep */
    MACSIO_TIMING_TimerId_t tid = MT_StartTimer("func4", MACSIO_TIMING_ALL_GROUPS, iter++);
    dsleep(delay);
    MT_StopTimer(tid);
}

void func3()
{
    static int iter = 0;
    MACSIO_TIMING_TimerId_t tid = MT_StartTimer("func3", MACSIO_TIMING_ALL_GROUPS, iter++);
    func4();
    func2();
    MT_StopTimer(tid);
}

void func2()
{
    static int iter = 0;
    MACSIO_TIMING_TimerId_t tid = MT_StartTimer("func2", MACSIO_TIMING_ALL_GROUPS, iter++);
    dsleep(0.02);
    MT_StopTimer(tid);
}

void func1()
{
    static int iter = 0;
    MACSIO_TIMING_TimerId_t tid = MT_StartTimer("func1", MACSIO_TIMING_ALL_GROUPS, iter);
    MACSIO_TIMING_TimerId_t tid2;

    func2();
    dsleep(0.01);
    tid2 = MT_StartTimer("call to func3 from func1", MACSIO_TIMING_ALL_GROUPS, iter++);
    func3();
    MT_StopTimer(tid2);
    func2();

    MT_StopTimer(tid);
}

int main(int argc, char **argv)
{
    int i, rank = 0, size = 1;
    MACSIO_TIMING_TimerId_t a, b;
    char **timer_strs;
    int ntimer_strs, maxstrlen;

#ifdef HAVE_MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#endif

    MACSIO_LOG_DebugLevel = 1; /* should only see debug messages level 1 and 2 */
    srandom(0xDeadBeef); /* used to vary length of some timers */

    if (size > 8)
    {
        if (!rank)
            fprintf(stderr, "This test only appropriate for 8 or fewer processors\n");
#ifdef HAVE_MPI
        MPI_Abort(MPI_COMM_WORLD, 1);
#endif
        exit(1);
    }

    a = MT_StartTimer("main", MACSIO_TIMING_ALL_GROUPS, 0);

    func1();

    if (rank > 2)
        func4();

    b = MT_StartTimer("call to func3 from main", MACSIO_TIMING_ALL_GROUPS, 0);
    func3();
    MT_StopTimer(b);

    if (rank < 2)
        func1();

    MT_StopTimer(a);

    MACSIO_TIMING_DumpTimersToStrings(MACSIO_TIMING_ALL_GROUPS, &timer_strs, &ntimer_strs, &maxstrlen);

#ifdef HAVE_MPI
    {
        int rbuf[2], sbuf[2] = {ntimer_strs, maxstrlen};
        MPI_Allreduce(sbuf, rbuf, 2, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
        MACSIO_LOG_MainLog = MACSIO_LOG_LogInit(MPI_COMM_WORLD, "tsttiming.log", rbuf[1]+32, 2*rbuf[0]+4,0);
    }
#else
    MACSIO_LOG_MainLog = MACSIO_LOG_LogInit(0, "tsttiming.log", maxstrlen+4, ntimer_strs+4, 0);
#endif

    for (i = 0; i < ntimer_strs; i++)
    {
        MACSIO_LOG_MSG(Dbg1, (timer_strs[i]));
        free(timer_strs[i]);
    }
    free(timer_strs);

#ifdef HAVE_MPI
    MACSIO_TIMING_ReduceTimers(MPI_COMM_WORLD, 0);
    if (!rank)
    {
        MACSIO_TIMING_DumpReducedTimersToStrings(MACSIO_TIMING_ALL_GROUPS, &timer_strs, &ntimer_strs, &maxstrlen);
        MACSIO_LOG_MSG(Dbg1, ("#####################Reduced Timer Values######################"));
        for (i = 0; i < ntimer_strs; i++)
        {
            MACSIO_LOG_MSG(Dbg1, (timer_strs[i]));
            free(timer_strs[i]);
        }
        free(timer_strs);
    }
#endif

    MACSIO_LOG_LogFinalize(MACSIO_LOG_MainLog);

    MACSIO_TIMING_ClearTimers(MACSIO_TIMING_ALL_GROUPS);

#ifdef HAVE_MPI
    MPI_Finalize();
#endif

    return 0;
}

Logging

MACSio’s logging package is a small utility for logging various kinds of messages from processors. This includes debugging messages, warnings and errors.

A log is a single text file that is divided into groups of message lines. Each processor gets its own group of lines within the file it may write to. Rank 0 gets the first group of lines. Rank 1, the next group of lines and so forth. Each line has a maximum length too.

When a log is created with MACSIO_LOG_LogInit(), the caller specifies the MPI communicator the log will be used for, the number of message lines per task to allocate plus a count of extra lines for rank 0 and the maximum length of any message. Rank 0 initializes the text file with all space characters except for header lines to distinguish each processor’s group of lines in the file. Note that for a large run on say 10^5 processors, it may take rank 0 several seconds to initialize this file.

Messages are restricted to a single line of text. Any embedded new-line characters are removed from a message and replaced with a ‘!’ character. If a processor’s message is longer than the maximum length of a line for the log, the message is truncated to fit within the line. As processors issue messages, they are written to the processors group of lines in round-robin fashion. As the set of messages issued from a processor reaches the end of its group of lines within the file, it starts issuing new messages at the beginning of the group. So, older messages can wind up getting overwritten by newer messages. However, all the most recent messages prior to any significant error will at least be captured in the log.

Parallelism in writing to a MACSIO_LOG file is achieved by ensuring no two processor attempt to write data in overlapping regions in the file by using pwrite() to do the actual writes.

MACSio’s main creates a default log, MACSIO_LOG_MainLog, on the MACSIO_MAIN_Comm. That log is probably the only log needed by MACSio proper or any of its plugins. The convenience macro, MACSIO_LOG_MSG, is the only method one need to worry about to log messages to the main log. That macro will also capture more detailed information regarding error states around the time the message issue issued including the file and line number, the system errno and the most recent MPI error state.

If you use MACSIO_LOG_MSG, messages are allowed one of several severities; Dbg1, Dbg2, Dbg3, Warn, Err and Die. A severity of Die causes an abort. Depending on the current debug level setting, Dbg1 - Dbg3 messages may or may not be recorded to a log.

MACSio’s main also creates a log for issuing messages to stderr, MACSIO_LOG_StdErr. However, there is no convenience macro like MACSIO_LOG_MSG for logging messages to MACSIO_LOG_StdErr. A caller simply has to use the logging interface methods to issue messages to MACSIO_LOG_StdErr.

However, any plugin or other portion of MACSio may, optionally, create its own, private, log using MACSIO_LOG_LogInit(). Once a log is created, any processor can independently issue messages to the log. There is no parallel communication involved in issuing messages to logs.

Examples of the use of the logging package can be found in Example (tstlog.c).

Logging API

group MACSIO_LOG

Message logging utilities.

Defines

MACSIO_LOG_DEFAULT_LINE_COUNT
MACSIO_LOG_DEFAULT_EXTRA_LINES
MACSIO_LOG_DEFAULT_LINE_LENGTH
__BASEFILE__

Same as basename(FILE) Used to strip unnecessary path terms from in messages.

MACSIO_LOG_MSG(SEV, MSG)

Convenience macro for logging a message to the main log.

Parameters
  • [in] SEV: Abbreviated message severity (e.g. ‘Dbg1’, ‘Warn’)

  • [in] MSG: Caller’s sprintf-style message enclosed in parenthises (e.g. ‘(“Rank %d failed”,rank))’

MACSIO_LOG_MSGV(VSEV, MSG)

Alterantive to MACSIO_LOG_MSG when severity is a runtime variable.

Parameters
  • [in] VSEV: Runtime variable in which message severity is stored

  • [in] MSG: Caller’s sprintf-style message enclosed in parenthises (e.g. ‘(“Rank %d failed”,rank))’

MACSIO_LOG_MSGL(LOG, SEV, MSG)

Convenience macro for logging a message to any specific log.

Parameters
  • [in] LOG: The log handle

  • [in] SEV: Abbreviated message severity (e.g. ‘Dbg1’, ‘Warn’)

  • [in] MSG: Caller’s sprintf-style message enclosed in parenthises (e.g. ‘(“Rank %d failed”,rank))’

MACSIO_LOG_MSGLV(LOG, VSEV, MSG)

Convenience macro for logging a message with variable severity to any specific log.

Parameters
  • [in] LOG: The log handle

  • [in] VSEV: Runtime variable in which message severity is stored

  • [in] MSG: Caller’s sprintf-style message enclosed in parenthises (e.g. ‘(“Rank %d failed”,rank))’

Typedefs

typedef struct _log_flags_t log_flags_t
typedef struct _MACSIO_LOG_LogHandle_t MACSIO_LOG_LogHandle_t
typedef enum _MACSIO_LOG_MsgSeverity_t MACSIO_LOG_MsgSeverity_t
typedef struct _MACSIO_LOG_LogHandle_t MACSIO_LOG_LogHandle_t

Enums

enum _MACSIO_LOG_MsgSeverity_t

Values:

enumerator MACSIO_LOG_MsgDbg1

Debug level 1: For coarse grained debugging messages (rare enough performance isn’t effected)

enumerator MACSIO_LOG_MsgDbg2

Debug level 2: For moderate grained debugging messages (may effect performance)

enumerator MACSIO_LOG_MsgDbg3

Debug level 3: For fine grained debugging messages (most likely effects performance)

enumerator MACSIO_LOG_MsgInfo

Informational messages

enumerator MACSIO_LOG_MsgWarn

Warnings of minor problems that can be recovered from without undue effects

enumerator MACSIO_LOG_MsgErr

Error conditions that result in unexpected behavior

enumerator MACSIO_LOG_MsgDie

Unrecoverable errors

Functions

char const *MACSIO_LOG_MakeMsg(char const *format, ...)

Internal convenience method to build a message from a printf-style format string and args.

Parameters
  • format: A printf-like error message format string.

This method is public only because it is used within the MACSIO_LOG_MSG convenience macro.

MACSIO_LOG_LogHandle_t *MACSIO_LOG_LogInit(MPI_Comm comm, char const *path, int line_len, int lines_per_proc, int extra_lines_proc0)

Initialize a log.

Parameters
  • comm: MPI Communicator of tasks that will issue messages to this log

  • path: The name of the log file

  • line_len: The length of each message line in the log file

  • lines_per_proc: The number of message lines for each MPI task

  • extra_lines_proc0: The number of extra message lines for processor rank 0

All processors in the comm communicator must call this function collectively with identical values for path, line_len, and lines_per_proc.

void MACSIO_LOG_LogMsg(MACSIO_LOG_LogHandle_t const *log, char const *fmt, ...)

Issue a printf-style message to a log.

Parameters
  • log: The handle for the specified log

  • fmt: A printf-style format string for the log message

May be called independently by any processor in the communicator used to initialize the log.

void MACSIO_LOG_LogMsgWithDetails(MACSIO_LOG_LogHandle_t const *log, char const *linemsg, MACSIO_LOG_MsgSeverity_t sevVal, char const *sevStr, int sysErrno, int mpiErrno, char const *theFile, int theLine)

Convenience method for building a detailed message for a log.

Parameters
  • log: Log handle to issue message to

  • linemsg: Caller’s message string

  • sevVal: Caller’s message severity value

  • sevStr: Caller’s message severity abbreviation string

  • sysErrno: Current (most recnet) system’s errno

  • mpiErrno: Current (most recent) MPI error

  • theFile: Caller’s file name

  • theLine: Caller’s line number within the file

void MACSIO_LOG_LogFinalize(MACSIO_LOG_LogHandle_t *log)

Finalize and close an open log Should be called collectively by all processors that created the log.

Parameters
  • log: The log to be closed

Variables

int mpi_errno

Error code returned by most recent MPI call.

MACSIO uses mpi_errno much like the system’s errno. However, there is no way to enforce that any particular MPI function calls set mpi_errno. MACSIO relies upon the honor system of develepors to always make MPI calls by setting mpi_errno to the return value of those calls. Assuming this practice is followed throughout MACSIO and any of its plugins, then the global variable mpi_errno should always hold the MPI error return value of the most recent MPI call.

int MACSIO_LOG_DebugLevel

Filtering level for debugging messages.

MACSIO generates 3 levels of debug messages numbered 1, 2 and 3. Each level is intended to represent more and more detailed debugging messages. Level 1 messages are issued rare enough by MACSIO (e.g. coarse grained) that they will not impact performance. Level 3 messages can be issued often enough by MACSIO that they will almost certainly impact performance. Level 2 messages are somewhere in between. The global variable MACSIO_LOG_DebugLevel is used to filter what messages generated by MACSIO that will actually make it into a detailed log message. If MACSIO_LOG_DebugLevel is set to level N, then messages at and below N debug level will appear in a log when written via MACSIO_LOG_LogMsgWithDetails. However, messages above this level will be discarded. As developers write code blocks in MACSIO, developers must choose what kind of messages are issued and, for debugging kinds of messages, the appropriate debug level.

MACSIO_LOG_LogHandle_t *MACSIO_LOG_MainLog

Log handle for MACSIO’s main log.

Typically, developers will use MACSIO_LOG_MSG macro to log messages. Such messages are automatically logged to the main log. The main log is created by MACSIO’s main.

MACSIO_LOG_LogHandle_t *MACSIO_LOG_StdErr

Log handle for MACSIO’s stderr output.

In rare cases, developers may want to issues messages to stderr. In this case, MACSIO’s stderr log handle should be used.

struct _log_flags_t

Public Members

unsigned int was_logged

Indicates if a message was ever logged to the log

struct _MACSIO_LOG_LogHandle_t

Public Members

MPI_Comm comm

MPI Communicator of tasks that will issue messages to this log

char *pathname

Name of the log file

int logfile

Log file file descriptor

int rank

Rank of the processor that created this log handle

int size

Size of the communicator that created this log handle

int log_line_length

Maximum length of a message line in the log file

int lines_per_proc

Number of message lines allocated in the file for each processor

int extra_lines_proc0

Additional number of message lines for processor with MPI rank 0

int current_line

Index into this processor’s group of lines in the log file at which the next message will be written

log_flags_t flags

Informational flags regarding the log

Example (tstlog.c)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87

#include <errno.h>
#include <stdlib.h>
#include <strings.h>

#include <macsio_log.h>

#ifdef HAVE_MPI
#include <mpi.h>
#endif

int MACSIO_MAIN_Rank;
int MACSIO_MAIN_Size;
int MACSIO_MAIN_Comm;

int main (int argc, char **argv)
{
    int i;
    int rank=0, size=1;
    int num_cols = 128, num_rows = 20, extra_lines = 20;
#ifdef HAVE_MPI
    MPI_Comm comm = MPI_COMM_WORLD;
#else
    int comm = 0;
#endif

    for (i = 0; i < argc; i++)
    {
        if (!strncasecmp(argv[i], "num_cols=", 9))
            num_cols = strtol(argv[i]+9, 0, 10);
        else if (!strncasecmp(argv[i], "num_rows=", 9))
            num_rows = strtol(argv[i]+9, 0, 10);
        else if (!strncasecmp(argv[i], "extra_lines=", 9))
            extra_lines = strtol(argv[i]+9, 0, 10);
    }

#ifdef HAVE_MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);
#endif

    MACSIO_LOG_DebugLevel = 2; /* should only see debug messages level 1 and 2 */
    MACSIO_LOG_MainLog = MACSIO_LOG_LogInit(comm, "tstlog.log", num_cols, num_rows, extra_lines);
    MACSIO_LOG_StdErr = MACSIO_LOG_LogInit(comm, 0, 0, 0, 0);

    if (rank == 1)
    {
        MACSIO_LOG_MSG(Dbg1, ("I am staring with processor 1"));
        MACSIO_LOG_MSG(Dbg2, ("Test output of a very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very, very long line"));
    }
    else if (rank == 2)
    {
        MACSIO_LOG_MSG(Warn, ("Starting on proc 2"));
        MACSIO_LOG_LogMsg(MACSIO_LOG_StdErr, "Logging a message to stderr for rank %d", rank);
    }
    else if (rank == 0)
    {
        errno = EOVERFLOW;
#ifdef HAVE_MPI_
        mpi_errno = MPI_ERR_COMM;
#else
        mpi_errno = 0;
#endif

        MACSIO_LOG_MSG(Err, ("I am here on proc 0"));
    }
    else
    {
        int i;
        MACSIO_LOG_MsgSeverity_t sev = rank%2?MACSIO_LOG_MsgDbg2:MACSIO_LOG_MsgDbg3;
        for (i = 0; i < 25; i++)
        {
            MACSIO_LOG_MSGV(sev, ("Outputing line %d for rank %d\n", i, rank));
        }
    }

    MACSIO_LOG_LogFinalize(MACSIO_LOG_StdErr);
    MACSIO_LOG_LogFinalize(MACSIO_LOG_MainLog);

#ifdef HAVE_MPI
    MPI_Finalize();
#endif

    return 0;

}

Contributing to MACSio Documentation

  • Use :any: for links into Doxygon content instead of :c:func: as described in breathe documentation. The reason is that in order for breathe’s :c:func: to work, one has to create breathe-specific sphinx directives for each and every Doxygen target (which IMHO kinda defeats the purpose of breathe). Also, it apparently only works for functions anyways, not types or macros or structs

  • Put Doxygen markup only in header files. Avoid using it in implementation (c) files.

  • Keep Doxygen markup fairly terse and to the point and put more of the prose and lengthy descriptions in .rst files.

  • Use .. only:: internals to conditionally include blocks that are used solely to document internal functionality of a given package. See example in macsio_mif.rst.

  • To build locally env DONT_INSTALL_BREATHE=1 READTHEDOCS=1 sphinx-build -b html . _build -a [-t internals]