Efficient File Copying On Linux

Mar 22, 2017
In response to my last post about dd, a friend of mine noticed that GNU cp always uses a 128 KB buffer size when copying a regular file; this is also the buffer size used by GNU cat. If you use strace to watch what happens when copying a file, you should see a lot of 128 KB read/write sequences:

$ strace -s 8 -xx cp /dev/urandom /dev/null

read(3, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
write(4, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
read(3, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
write(4, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
read(3, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
write(4, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
read(3, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
write(4, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072

As you can see, each copy is operating on buffers 131072 bytes in size, which is 128 KB. GNU cp is part of the GNU coreutils project, and if you go diving into the coreutils source code you’ll find this buffer size is defined in the file src/ioblksize.h. The comments in this file are really fascinating. The author of the code in this file (Jim Meyering) did a benchmark using dd if=/dev/zero of=/dev/null with different values of the block size parameter, bs. On a wide variety of systems, including older Intel CPUs, modern high-end Intel CPUs, and even an IBM POWER7 CPU, a 128 KB buffer size is fastest. I used gnuplot to graph these results, shown below. Higher transfer rates are better, and the different symbols represent different system configurations.

[Graph: transfer rate vs. buffer size for each benchmarked system configuration]

Most of the systems get faster transfer rates as the buffer size approaches 128 KB. After that, performance generally degrades slightly.

The file includes a cryptic, but interesting, explanation of why 128 KB is the best buffer size. Normally with these system calls it’s more efficient to use larger buffer sizes. This is because the larger the buffer size used, the fewer system calls need to be made. So why the drop off in performance when a buffer larger than 128 KB is used?

When copying a file, GNU cp will first call posix_fadvise(2) on the source file with POSIX_FADV_SEQUENTIAL as the “advice” flag. As the name implies, this gives a hint to the kernel that cp plans to scan the source file sequentially. This causes the Linux kernel to use “readahead” for the file. On Linux you can also initiate readahead using madvise(2). There’s also a system call actually called readahead(2), but it has a slightly different use case.
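The pattern cp uses here, a sequential-access hint followed by fixed-size read/write calls, can be sketched in a few lines of Python. This is a simplified illustration of the same system-call sequence, not coreutils' actual implementation:

```python
import os

def copy_sequential(src, dst, bufsize=128 * 1024):
    """Copy src to dst the way GNU cp does: hint sequential access,
    then loop over fixed-size read/write calls."""
    fd_in = os.open(src, os.O_RDONLY)
    fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # Ask the kernel to start readahead on the source file
        # (not available on every platform, hence the guard).
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd_in, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            buf = os.read(fd_in, bufsize)
            if not buf:            # EOF
                break
            off = 0
            while off < len(buf):  # a write may be partial
                off += os.write(fd_out, buf[off:])
    finally:
        os.close(fd_in)
        os.close(fd_out)
```

Tracing this function with strace shows the same 131072-byte read/write pairs as cp.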

When you read(2) data from a regular file, if you’re lucky some or all of the data you plan to read will already be in the kernel’s page cache. The page cache is a cache of disk pages stored in kernel memory. Normally this works on an LRU basis, so when you read a page from disk the kernel first checks the page cache, and if the page isn’t in the cache it reads it from disk and copies it into the page cache (possibly evicting an older page from the cache). This means the first access to a disk page actually requires going to disk, but subsequent accesses can simply copy the data from main memory if the disk page is still in the page cache.
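The LRU behavior described above is easy to model. This toy cache is a sketch of the idea only, not the kernel's actual page cache code:

```python
from collections import OrderedDict

class PageCache:
    """Toy LRU cache of disk pages: a hit moves the page to the
    most-recently-used end; a miss 'reads from disk', caches the page,
    and may evict the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, page_no):
        if page_no in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_no)      # now most recently used
        else:
            self.misses += 1
            self.pages[page_no] = True           # "copy from disk"
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)   # evict the LRU page
```

Reading the same page twice produces one miss and then one hit, exactly the first-access-goes-to-disk behavior described above.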

When the kernel initiates readahead, it makes a best effort to prefetch pages that it thinks will be needed imminently. In particular, when accessing a file sequentially, the kernel will attempt to prefetch upcoming parts of the file as the file is read. When everything is working correctly, one can get a high cache hit rate even if the file contents weren’t already in the page cache when the file was initially opened. In fact, if the file is actually accessed sequentially, there’s a good chance of getting a 100% hit rate from the page cache when the kernel is doing readahead.

There’s a trade-off here, because if the kernel prefetches pages more aggressively there will be a higher cache hit rate; but if the kernel is too aggressive, it may wastefully prefetch pages that aren’t actually going to be read. What actually happens is the kernel has a readahead buffer size configured for each block device, and the readahead kernel thread will prefetch at most that much data for files on that block device. You can see the readahead buffer size using the blockdev command:

# Get the readahead size for /dev/sda
$ blockdev --getra /dev/sda
256
The units returned by blockdev are 512-byte "sectors" (even though my Intel SSD doesn’t actually have true disk sectors). Thus a return value of 256 corresponds to a 128 KB buffer size. You can see how this is implemented by the kernel in the file mm/readahead.c, in particular in the function ondemand_readahead(), which calls get_init_ra_size(). From my non-expert reading of the code, it appears that the code looks at the number of pages in the file, and for large files a maximum value of 128 KB is used. Note that this is highly specific to Linux: other Unix kernels may or may not implement readahead, and if they do, there’s no guarantee that they’ll use the same readahead buffer size.
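The sector-to-bytes conversion is worth sanity checking:

```python
# blockdev --getra reports readahead in 512-byte "sectors",
# independent of the device's real sector size.
SECTOR_SIZE = 512

def readahead_bytes(sectors):
    """Convert a blockdev --getra result to bytes."""
    return sectors * SECTOR_SIZE

print(readahead_bytes(256))  # 131072, i.e. 128 KB
```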

So how is this related to disk transfer rates? As noted earlier, typically one wants to minimize the number of system calls made, as each system call has overhead. In this case that means we want to use as large a buffer size as possible. On the other hand, performance will be best when the page cache hit rate is high. A buffer size of 128 KB fits both of these constraints—it’s the maximum buffer size that can be used before readahead will stop being effective. If a larger buffer size is used, read(2) calls will block while the kernel waits for the disk to actually return new data.

In the real world a lot of other things will be happening on the host, so there’s no guarantee that the stars will align perfectly. If the disk is very fast, the effect of readahead is diminished, so the penalty for using a larger buffer size might not be as bad. It’s also possible to race the kernel here: a userspace program could try to read a file faster than the kernel can prefetch pages, which will make readahead less effective. But on the whole, we expect a 128 KB buffer size to be most effective, and that’s exactly what the benchmark above demonstrates.
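If you want to repeat a rough version of the dd benchmark yourself, a sketch like the following measures read throughput for a given buffer size. It is crude compared to the coreutils benchmark, and read_throughput is a name I made up for illustration:

```python
import os
import time

def read_throughput(path, bufsize, total_bytes):
    """Read total_bytes from path in bufsize chunks and return MB/s.
    A crude stand-in for `dd if=... of=/dev/null bs=...`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        remaining = total_bytes
        start = time.perf_counter()
        while remaining > 0:
            buf = os.read(fd, min(bufsize, remaining))
            if not buf:          # EOF (won't happen for /dev/zero)
                break
            remaining -= len(buf)
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
    return (total_bytes - remaining) / elapsed / 1e6
```

On a Linux machine, something like `for bs in (4096, 65536, 131072, 1048576): print(bs, read_throughput("/dev/zero", bs, 256 * 1024 * 1024))` reproduces the shape of the ioblksize.h experiment.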


Haven App: Keep Watch


About Haven
Haven is for people who need a way to protect their personal spaces and possessions without compromising their own privacy. It is an Android application that leverages on-device sensors to provide monitoring and protection of physical spaces. Haven turns any Android phone into a motion, sound, vibration and light detector, watching for unexpected guests and unwanted intruders. We designed Haven for investigative journalists, human rights defenders, and people at risk of forced disappearance to create a new kind of herd immunity. By combining the array of sensors found in any smartphone, with the world’s most secure communications technologies, like Signal and Tor, Haven prevents the worst kind of people from silencing citizens without getting caught in the act.

View our full Haven App Overview presentation for more about the origins and goals of the project.

Announcement and Public Beta
We are announcing Haven today, as an open-source project, along with a public beta release of the app. We are looking for contributors who understand that physical security is as important as digital, and who have an understanding of, and compassion for, the kind of threats faced by the users and communities we want to support. We also think it is really cool and cutting edge, making use of encrypted messaging and onion routing in whole new ways. We believe Haven points the way to a more sophisticated approach to securing communication within networks of things and home automation systems.

Learn more about the story of this project at the links below:

Haven: Building the Most Secure Baby Monitor Ever?
Snowden’s New App Uses Your Smartphone To Physically Guard Your Laptop
Snowden’s New App Turns Your Phone Into a Home Security System
Project Team
Haven was developed through a collaboration between Freedom of the Press Foundation and Guardian Project. Prototype funding was generously provided by FoPF, and donations to support continuing work can be contributed through their site: https://freedom.press/donate-support-haven-open-source-project/

Freedom of the Press Foundation Guardian Project

Safety through Sensors
Haven only saves images and sound when triggered by motion or volume, and stores everything locally on the device. You can position the device’s camera to capture visible motion, or set your phone somewhere discreet to just listen for noises. Get secure notifications of intrusion events instantly and access the logs remotely or anytime later.

The following sensors are monitored for a measurable change, which is then recorded to an event log on the device:

Accelerometer: phone’s motion and vibration
Camera: motion in the phone’s visible surroundings from front or back camera
Microphone: noises in the environment
Light: change in light from ambient light sensor
Power: detect device being unplugged or power loss
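The trigger-and-log idea behind these sensors can be sketched as follows. The class name and threshold values here are illustrative only, not Haven's actual Android code:

```python
import time

class SensorMonitor:
    """Minimal sketch: record an event when a reading departs from a
    baseline by more than a per-sensor sensitivity threshold."""

    def __init__(self, thresholds):
        self.thresholds = thresholds   # e.g. {"microphone": 10.0}
        self.baselines = {}
        self.event_log = []

    def feed(self, sensor, value):
        # The first reading from a sensor establishes its baseline.
        base = self.baselines.setdefault(sensor, value)
        if abs(value - base) > self.thresholds.get(sensor, float("inf")):
            self.event_log.append((time.time(), sensor, value))
```

Small fluctuations around the baseline are ignored; only readings beyond the configured sensitivity land in the event log.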
The application can be built using Android Studio and Gradle. It relies on a number of third-party dependencies, all of which are free, open-source, and listed at the end of this document.

You can currently get the Haven BETA release in one of three ways:

Download Haven from Google Play
First, install F-Droid, the open-source app store, and second, add our Haven Nightly “Bleeding Edge” repository by scanning the QR code below:

or add this repository manually in F-Droid’s Settings->Repositories: https://guardianproject.github.io/haven-nightly/fdroid/repo/

Grab the APK files from the GitHub releases page
You can, of course, build the app yourself, from source.

If you are an Android developer, you can learn more about how to make use of F-Droid in your development workflow, for nightly builds, testing, reproducibility, and more, here: F-Droid Documentation

Why no iPhone Support?
While we hope to support a version of Haven that runs directly on iOS devices in the future, iPhone users can still benefit from Haven today. You can purchase an inexpensive Android phone for less than $100 and use it as your “Haven Device” that you leave behind, while you keep your iPhone with you. If you run Signal on your iPhone, you can configure Haven on Android to send encrypted notifications, with photos and audio, directly to you. If you enable the “Tor Onion Service” feature in Haven (which requires installing the “Orbot” app as well), you can remotely access all Haven log data from your iPhone, using the Onion Browser app.

So, no, iPhone users, we didn’t forget about you, and we hope you’ll pick up an Android burner today for a few bucks!

Haven is meant to provide an easy onboarding experience that walks the user through configuring the sensors on their device to best detect intrusions into their environment. The current implementation has some of this implemented, but we are looking to improve this user experience dramatically.

Main view
The application’s main view allows the user to choose which sensors to use and set the corresponding level of sensitivity. A security code must be provided, which is required to disable monitoring. A phone number can also be set; if any of the sensors is triggered, a message is sent to the specified number.

When one of the sensors is triggered (reaches the sensitivity threshold), a notification is sent through the following channels, if enabled.

SMS: a message is sent to the number specified when monitoring started
Signal: if configured, can send end-to-end encrypted notifications via Signal
Notifications are sent through a service running in the background that is defined in the class MonitorService.
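The fan-out to enabled channels might look like the following sketch. The function and the channel callables are hypothetical stand-ins, not Haven's actual MonitorService API:

```python
def notify_channels(event, channels):
    """Send an intrusion event through every enabled channel.
    `channels` maps a channel name to a send callable (an SMS gateway,
    a Signal client, ...); all stubs here are illustrative."""
    delivered = []
    for name, send in channels.items():
        try:
            send(event)
            delivered.append(name)
        except Exception:
            pass  # a failing channel must not block the others
    return delivered
```

The key design point is isolation: if the Signal daemon is unreachable, the SMS notification still goes out.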

Remote Access
All event logs and captured media can be remotely accessed through a Tor Onion Service. Haven must be configured as an Onion Service, and requires the device to also have Orbot: Tor for Android installed and running.

This project contains source code or library dependencies from the following projects:

SecureIt project available at: https://github.com/mziccard/secureit Copyright (c) 2014 Marco Ziccardi (Modified BSD)
libsignal-service-java from Open Whisper Systems: https://github.com/WhisperSystems/libsignal-service-java (GPLv3)
signal-cli from AsamK: https://github.com/AsamK/signal-cli (GPLv3)
Sugar ORM from chennaione: https://github.com/chennaione/sugar/ (MIT)
Square’s Picasso: https://github.com/square/picasso (Apache 2)
JayDeep’s AudioWife: https://github.com/jaydeepw/audio-wife (MIT)
AppIntro: https://github.com/apl-devs/AppIntro (Apache 2)
Guardian Project’s NetCipher: https://guardianproject.info/code/netcipher/ (Apache 2)
NanoHttpd: https://github.com/NanoHttpd/nanohttpd (BSD)
Milosmns’ Actual Number Picker: https://github.com/milosmns/actual-number-picker (GPLv3)
Fresco Image Viewer: https://github.com/stfalcon-studio/FrescoImageViewer (Apache 2)
Facebook Fresco Image Library: https://github.com/facebook/fresco (BSD)
Audio Waveform Viewer: https://github.com/derlio/audio-waveform (Apache 2)
FireZenk’s AudioWaves: https://github.com/FireZenk/AudioWaves (MIT)
MaxYou’s SimpleWaveform: https://github.com/maxyou/SimpleWaveform (MIT)
haven is maintained by guardianproject.

Environment Modules

Welcome to the Environment Modules open source project. The Environment Modules package provides for the dynamic modification of a user’s environment via modulefiles.

What are Environment Modules?
Typically users initialize their environment when they log in by setting environment information for every application they will reference during the session. The Environment Modules package is a tool that simplifies shell initialization and lets users easily modify their environment during the session with modulefiles.

Each modulefile contains the information needed to configure the shell for an application. Once the Modules package is initialized, the environment can be modified on a per-module basis using the module command which interprets modulefiles. Typically modulefiles instruct the module command to alter or set shell environment variables such as PATH, MANPATH, etc. modulefiles may be shared by many users on a system and users may have their own collection to supplement or replace the shared modulefiles.

Modules can be loaded and unloaded dynamically and atomically, in a clean fashion. All popular shells are supported, including bash, ksh, zsh, sh, csh, tcsh, and fish, as well as some scripting languages such as perl, ruby, tcl, and python.

Modules are useful in managing different versions of applications. Modules can also be bundled into metamodules that will load an entire suite of different applications.

Latest source release
Release notes, Migrating (2017-10-16)
Source repository
How to install Modules, reference manual page for the module(1) command and for modulefile(4) script, Frequently Asked Questions, …
Documentation portal
Mailing list: questions, comments or development suggestions for the Modules community can be sent to the modules-interest mailing list.
Bug report: spotted a bug? Report it to our tracker.
Quick examples
Here is an example of loading a module on a Linux machine under bash.
% module load gcc/6.1.1
% which gcc
Now we’ll switch to a different version of the module
% module switch gcc gcc/6.3.1
% which gcc
And now we’ll unload the module altogether
% module unload gcc
% which gcc
gcc not found
Now we’ll log into a different machine, using a different shell (tcsh).
tardis-> module load gcc/6.3.1
tardis-> which gcc
Note that the command line is exactly the same, but the path has automatically been configured for the correct architecture.

About Modules
John L. Furlani, Peter W. Osel, “Abstract Yourself With Modules”, Proceedings of the Tenth Large Installation Systems Administration Conference (LISA ’96), pp.193-204, Chicago, IL, September 29 – October 4, 1996.
John L. Furlani, “Modules: Providing a Flexible User Environment”, Proceedings of the Fifth Large Installation Systems Administration Conference (LISA V), pp. 141-152, San Diego, CA, September 30 – October 3, 1991.
Erich Whitney, Mark Sprague, “Drag Your Design Environment Kicking and Screaming into the 90’s With Modules!”, Synopsys Users’ Group, Boston 2001
About Modules contributed / based tools
Richard Elling, Matthew Long, “user-setup: A system for Custom Configuration of User Environments, or Helping Users Help Themselves”, from Proceedings of the Sixth Systems Administration Conference (LISA VI), pp. 215-223, Long Beach, CA, October 19-23, 1992.
Brock Palen and Jeff Squyres – Research, Computing and Engineering – “RCE 60: Modules”, September 20, 2011.
Related tools
Flavours is a wrapper built on top of the Modules C-version to simplify the organization and presentation of software requiring multiple builds against different compilers, MPI libraries, processor architectures, etc. This package is written and maintained by Mark Dixon.

Env2 is a Perl script to convert environment variables between scripting languages. For example, convert a csh setup script to bash or the other way around. Supports bash, csh, ksh, modulecmd, perl, plist, sh, tclsh, tcsh, vim, yaml and zsh. This package is written and maintained by David C. Black.

Software Collections is a Red Hat project that enables you to build and concurrently install multiple RPM versions of the same components on your system, without impacting the system versions of the RPM packages installed from your distribution. Once installed, a software collection is enabled with the scl command, which relies on Modules for the user environment setup.

The OSCAR Cluster Project uses modules along with a tool called switcher. Read about switcher and modules in section 4.11 of the OSCAR Cluster User’s Guide.

Reference installations
NERSC, the National Energy Research Scientific Computing Center, has a great introduction and help page: Modules Approach to Environment Management.

The University of Minnesota CSE-IT manages software in Unix using Modules. They give some insight of their Modules usage and provide details on the way they have setup their Modules environment.

Modules is covered by the GNU General Public License, version 2 and the GNU Lesser General Public License, version 2.1. Copyright © 1996-1999 John L. Furlani & Peter W. Osel, © 1998-2017 R.K.Owen, © 2002-2004 Mark Lakata, © 2004-2017 Kent Mein, © 2016-2017 Xavier Delaruelle. All rights reserved. Trademarks used are the property of their respective owners.

Environment Modules – A Great Tool for Clusters


HPC – Admin Magazine



Sooner or later every cluster develops a plethora of tools and libraries for applications or for building applications. Often the applications or tools need different compilers or different MPI libraries, so how do you handle situations in which you need to change tool sets or applications? You can do it the hard way, or you can do it the easy way with Environment Modules.

Jeff Layton
When people first start using clusters, they tend to stick with whatever compiler and MPI library came with the cluster when it was installed. As they become more comfortable with the cluster, using the compilers, and using the MPI libraries, they start to look around at other options: Are there other compilers that could perhaps improve performance? Similarly, they might start looking at other MPI libraries: Can they help improve performance? Do other MPI libraries have tools that can make things easier? Perhaps even more importantly, these people would like to install the next version of the compilers or MPI libraries so they can test them with their code. So this forces a question: How do you have multiple compilers and multiple MPI libraries on the cluster at the same time and not get them confused? I’m glad you asked.

The Hard Way

If you want to change your compiler or libraries – basically anything to do with your environment – you might be tempted to change your $PATH in the .bashrc file (if you are using Bash) and then log out and log back in whenever you need to change your compiler/MPI combination. Initially this sounds like a pain, and it is, but it works to some degree. It doesn’t work, however, when you want to run multiple jobs, each with a different compiler/MPI combination.

For example, say I have a job using the GCC 4.6.2 compilers with Open MPI 1.5.2, and another job using GCC 4.5.3 and MPICH2. If both jobs are in the queue at the same time, how can I control my .bashrc to make sure each job has the correct $PATH? The only way to do this is to restrict myself to one job in the queue at a time; when it finishes, I can change my .bashrc and submit a new job. Even for something as simple as code development, you have to watch when each job runs to make sure your .bashrc matches the job.

The Easy Way

A much better way to handle compiler/MPI combinations is to use Environment Modules. (Be careful not to confuse “environment modules” with “kernel modules.”) According to the website, “The Environment Modules package provides for the dynamic modification of a user’s environment via modulefiles.” Although this might not sound earth shattering, it actually is a quantum leap for using multiple compilers/MPI libraries, but you can use it for more than just that, which I will talk about later.

You can use Environment Modules to alter or change environment variables such as $PATH, $MANPATH, $LD_LIBRARY_PATH, and others. Because most job scripts for resource managers, such as LSF, PBS-Pro, and MOAB, are really shell scripts, you can incorporate Environment Modules into the scripts to set the appropriate $PATH for your compiler/MPI combination, or any other environment variables an application requires for operation.

How you install Environment Modules depends on how your cluster is built. You can build it from source, as I will discuss later, or you can install it from your package manager. Just be sure to look for Environment Modules.

Using Environment Modules

To begin, I’ll assume that Environment Modules is installed and functioning correctly, so you can now test a few of the options typically used. In this article, I’ll be using some examples from TACC. The first thing to check is what modules are available to you by using the module avail command:

[laytonjb@dlogin-0 ~]$ module avail

------------------------------------------- /opt/apps/intel11_1/modulefiles -------------------------------------------
fftw3/3.2.2 gotoblas2/1.08 hdf5/1.8.4 mkl/ mvapich2/1.4 netcdf/4.0.1 openmpi/1.4

------------------------------------------------ /opt/apps/modulefiles ------------------------------------------------
gnuplot/4.2.6 intel/11.1(default) papi/3.7.2
intel/10.1 lua/5.1.4 pgi/10.2

-------------------------------------------------- /opt/modulefiles ---------------------------------------------------
Linux TACC TACC-paths cluster
----------------------------------------------- /cm/shared/modulefiles ------------------------------------------------
acml/gcc/64/4.3.0 fftw3/gcc/64/3.2.2 mpich2/smpd/ge/open64/64/1.1.1p1
acml/gcc/mp/64/4.3.0 fftw3/open64/64/3.2.2 mpiexec/0.84_427
acml/gcc-int64/64/4.3.0 gcc/4.3.4 mvapich/gcc/64/1.1
acml/gcc-int64/mp/64/4.3.0 globalarrays/gcc/openmpi/64/4.2 mvapich/open64/64/1.1
acml/open64/64/4.3.0 globalarrays/open64/openmpi/64/4.2 mvapich2/gcc/64/1.2
acml/open64-int64/64/4.3.0 hdf5/1.6.9 mvapich2/open64/64/1.2
blacs/openmpi/gcc/64/1.1patch03 hpl/2.0 netcdf/gcc/64/4.0.1
blacs/openmpi/open64/64/1.1patch03 intel-cluster-checker/1.3 netcdf/open64/64/4.0.1
blas/gcc/64/1 intel-cluster-runtime/2.1 netperf/2.4.5
blas/open64/64/1 intel-tbb/ia32/22_20090809oss open64/
bonnie++/1.96 intel-tbb/intel64/22_20090809oss openmpi/gcc/64/1.3.3
cmgui/5.0 iozone/3_326 openmpi/open64/64/1.3.3
default-environment lapack/gcc/64/3.2.1 scalapack/gcc/64/1.8.0
fftw2/gcc/64/double/2.1.5 lapack/open64/64/3.2.1 scalapack/open64/64/1.8.0
fftw2/gcc/64/float/2.1.5 mpich/ge/gcc/64/1.2.7 sge/6.2u3
fftw2/open64/64/double/2.1.5 mpich/ge/open64/64/1.2.7 torque/2.3.7
fftw2/open64/64/float/2.1.5 mpich2/smpd/ge/gcc/64/1.1.1p1
This command lists what environment modules are available. You’ll notice that TACC has a very large number of possible modules that provide a range of compilers, MPI libraries, and combinations. A number of applications show up in the list as well.

You can check which modules are “loaded” in your environment by using the list option with the module command:

[laytonjb@dlogin-0 ~]$ module list
Currently Loaded Modulefiles:
1) Linux 2) intel/11.1 3) mvapich2/1.4 4) sge/6.2u3 5) cluster 6) TACC
This indicates that when I log in, I have six modules already loaded for me. If I want to use any additional modules, I have to load them manually:

[laytonjb@dlogin-0 ~]$ module load gotoblas2/1.08
[laytonjb@dlogin-0 ~]$ module list
Currently Loaded Modulefiles:
1) Linux 3) mvapich2/1.4 5) cluster 7) gotoblas2/1.08
2) intel/11.1 4) sge/6.2u3 6) TACC
You can just cut and paste from the list of available modules to load the ones you want or need. (This is what I do, and it makes things easier.) By loading a module, you will have just changed the environment variables defined for that module. Typically this is $PATH, $MANPATH, and $LD_LIBRARY_PATH.
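The effect of a module load on $PATH can be demonstrated directly. This sketch (with made-up directory names) shows that prepending a module's bin directory changes which gcc a lookup finds:

```python
import os
import shutil
import tempfile

def prepend_path(env_path, directory):
    """What `module load` effectively does to $PATH: put the module's
    bin directory in front so its executables win the lookup."""
    return directory + os.pathsep + env_path if env_path else directory

def demo_module_load():
    """Create two fake gcc installs and show which one PATH resolves to
    before and after 'loading' the newer module."""
    with tempfile.TemporaryDirectory() as d:
        old_bin = os.path.join(d, "gcc-old", "bin")
        new_bin = os.path.join(d, "gcc-new", "bin")
        for bindir in (old_bin, new_bin):
            os.makedirs(bindir)
            exe = os.path.join(bindir, "gcc")
            with open(exe, "w") as f:
                f.write("#!/bin/sh\n")
            os.chmod(exe, 0o755)
        path = old_bin                          # before: only the old gcc
        before = shutil.which("gcc", path=path)
        path = prepend_path(path, new_bin)      # after "module load"
        after = shutil.which("gcc", path=path)
        return before, after
```

Because the new directory is prepended rather than appended, the newly loaded module shadows whatever was on $PATH before it.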

To unload or remove a module, just use the unload option with the module command, but you have to specify the complete name of the environment module:

[laytonjb@dlogin-0 ~]$ module unload gotoblas2/1.08
[laytonjb@dlogin-0 ~]$ module list
Currently Loaded Modulefiles:
1) Linux 2) intel/11.1 3) mvapich2/1.4 4) sge/6.2u3 5) cluster 6) TACC
Notice that the gotoblas2/1.08 module is no longer listed. Alternatively, you can unload all loaded environment modules using module purge:

[laytonjb@dlogin-0 ~]$ module purge
[laytonjb@dlogin-0 ~]$ module list
No Modulefiles Currently Loaded.
You can see here that after the module purge command, no more environment modules are loaded.

If you are using a resource manager (job scheduler), you are likely creating a script that requests the resources and runs the application. In this case, you might need to load the correct Environment Modules in your script. Typically after the part of the script in which you request resources (in the PBS world, these are defined as #PBS commands), you will then load the environment modules you need.

Now that you’ve seen a few basic commands for using Environment Modules, I’ll go into a little more depth, starting with installing from source. Then I’ll use the module in a job script and write my own module.

Building Environment Modules for Clusters

In my opinion, the quality of open source code has improved over the last several years to the point at which building and installing is fairly straightforward, even if you haven’t built any code before. If you haven’t built code, don’t be afraid to start with Environment Modules.

For this article, as an example, I will build Environment Modules on a “head” node in the cluster in /usr/local. I will assume that you have /usr/local NFS-exported to the compute nodes, or some other filesystem or directory that is mounted on the compute nodes (perhaps a global filesystem). If you are building and testing your code on a production cluster, be sure to check that /usr/local is mounted on all of the compute nodes.

To begin, download the latest version – it should be a *.tar.gz file. (I’m using v3.2.6, but the latest as of writing this article is v3.2.9). To make things easier, build the code in /usr/local. The documentation that comes with Environment Modules recommends that it be built in /usr/local/Modules/src. As root, run the following commands:

% cd /usr/local
% mkdir Modules
% cd Modules
% mkdir src
% cp modules-3.2.6.tar.gz /usr/local/Modules/src
% gunzip -c modules-3.2.6.tar.gz | tar xvf -
% cd modules-3.2.6
At this point, I would recommend you carefully read the INSTALL file; it will save your bacon. (The first time I built Environment Modules, I didn’t read it and had lots of trouble.)

Before you start configuring and building the code, you need to fulfill a few prerequisites. First, you should have Tcl installed, as well as the Tcl Development package. Because I don’t know what OS or distribution you are running, I’ll leave to you the tasks of installing Tcl and Tcl Development on the node where you will be building Environment Modules.

At this point, you should configure and build Environment Modules. As root, enter the following commands:

% cd /usr/local/Modules/src/modules-3.2.6
% ./configure
% make
% make install
The INSTALL document recommends making a symbolic link in /usr/local/Modules connecting the current version of Environment Modules to a directory called default:

% cd /usr/local/Modules
% sudo ln -s 3.2.6 default
The reason they recommend the symbolic link is that, if you upgrade Environment Modules to a new version, you build it in /usr/local/Modules/src and then simply update the default symbolic link in /usr/local/Modules to point at the new version, which makes upgrading easier.

The next thing to do is copy one (possibly more) of the init files for Environment Modules to a global location for all users. For my particular cluster, I chose to use the sh init file. This file will configure Environment Modules for all of the users. I chose to use the sh version rather than csh or bash, because sh is the least common denominator:

% sudo cp /usr/local/Modules/default/init/sh /etc/profile.d/modules.sh
% chmod 755 /etc/profile.d/modules.sh
Now users can use Environment Modules by just putting the following in their .bashrc or .profile:

% . /etc/profile.d/modules.sh
As a simple test, you can run the above script and then type the command module. If you get some information about how to use modules, such as what you would see if you used the -help option, then you have installed Environment Modules correctly.

Environment Modules in Job Scripts

In this section, I want to show you how you can use Environment Modules in a job script. I am using PBS for this quick example, with this code snippet for the top part of the job script:

#PBS -S /bin/bash
#PBS -l nodes=8:ppn=2

. /etc/profile.d/modules.sh
module load compiler/pgi6.1-X86_64
module load mpi/mpich-1.2.7

(insert mpirun command here)
At the top of the code snippet are the PBS directives, which begin with #PBS. After the PBS directives, I invoke the Environment Modules startup script (modules.sh). Immediately after that, you should load the modules you need for your job. For this particular example, taken from a three-year-old job script of mine, I’ve loaded a compiler (pgi 6.1-x86_64) and an MPI library (mpich-1.2.7).

Building Your Own Module File

Creating your own module file is not too difficult. If you happen to know some Tcl, then it’s pretty easy; however, even if you don’t know Tcl, it’s simple to follow an example to create your own.

The modules themselves define what you want to do to the environment when you load the module. For example, you can create new environment variables that you might need to run the application or change $PATH, $LD_LIBRARY_PATH, or $MANPATH so a particular application will run correctly. Believe it or not, you can even run code within the module or call an external application. This makes Environment Modules very, very flexible.

To begin, remember that all modules are written in Tcl, which makes them very programmable. For the example here, all of the module files go in /usr/local/Modules/default/modulefiles. In this directory, you can create subdirectories to better label or organize your modules.

In this example, I’m going to create a module for gcc-4.6.2, which I built and installed in my home directory. To begin, I create a subdirectory called compilers for any module file that has to do with compilers. Environment Modules has a sort of template you can use to create your own module, which I used as the starting point for mine. As root, do the following:

% cd /usr/local/Modules/default/modulefiles
% mkdir compilers
% cp modules compilers/gcc-4.6.2
The new module will appear in the module list as compilers/gcc-4.6.2. I recommend that you look at the template to get a feel for the syntax and what the various parts of the modulefile are doing. Again, recall that Environment Modules uses Tcl as its language, but you don't have to know much Tcl to create a module file. The module file I created follows:

## modules compilers/gcc-4.6.2
## modulefiles/compilers/gcc-4.6.2. Written by Jeff Layton
proc ModulesHelp { } {
global version modroot

puts stderr "compilers/gcc-4.6.2 - sets the Environment for GCC 4.6.2 in my home directory"
}

module-whatis "Sets the environment for using gcc-4.6.2 compilers (C, Fortran)"

# for Tcl script use only
set topdir /home/laytonj/bin/gcc-4.6.2
set version 4.6.2
set sys linux86

setenv CC $topdir/bin/gcc
setenv GCC $topdir/bin/gcc
setenv FC $topdir/bin/gfortran
setenv F77 $topdir/bin/gfortran
setenv F90 $topdir/bin/gfortran
prepend-path PATH $topdir/include
prepend-path PATH $topdir/bin
prepend-path MANPATH $topdir/man
prepend-path LD_LIBRARY_PATH $topdir/lib
The file might seem a bit long, but it is actually fairly compact. The first section provides help with this particular module if a user asks for it (the line that begins with puts stderr); for example:

home8:~> module help compilers/gcc-4.6.2

----------- Module Specific Help for 'compilers/gcc-4.6.2' --------

compilers/gcc-4.6.2 - sets the Environment for GCC 4.6.2 in my home directory
You can have multiple lines of help text by using several puts stderr lines in the module (the template has several).

After the help section in the ModulesHelp procedure, another line provides some simple information when a user uses the whatis option; for example:

home8:~> module whatis compilers/gcc-4.6.2
compilers/gcc-4.6.2 : Sets the environment for using gcc-4.6.2 compilers (C, Fortran)
After the help and whatis definitions is a section where I create whatever environment variables are needed, as well as modify $PATH, $LD_LIBRARY_PATH, and $MANPATH or other standard environment variables. To make life a little easier, I defined some local variables: topdir, version, and sys. I only used topdir, but I defined the other two in case I needed to go back and modify the module later (the variables can help remind me what the module was designed to do).

In this particular modulefile, I defined a set of environment variables pointing to the compilers (CC, GCC, FC, F77, and F90). After defining those environment variables, I modified $PATH, $LD_LIBRARY_PATH, and $MANPATH so that the compiler was first in these paths by using the prepend-path directive.

This basic module is pretty simple, but you can get very fancy if you want or need to. For example, you could make a module file dependent on another module file so that you have to load a specific module before you load the one you want. Or, you can call external applications – for example, to see whether an application is installed and functioning. You are pretty much limited only by your needs and imagination.
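As a quick illustration of those two ideas, here is a hypothetical modulefile fragment: prereq makes the module refuse to load unless an MPI module is already loaded, conflict blocks loading it alongside a competing module, and Tcl's exec calls out to an external program. The module names and path are made up for the example.

```tcl
#%Module1.0
## Hypothetical fragment showing dependencies and external calls

# Refuse to load unless an MPI module is already loaded
prereq mpi/mpich-1.2.7

# Refuse to load alongside a conflicting compiler module
conflict compilers/gcc-4.6.2

# Call an external program and act on the result
if { [catch { exec /usr/bin/test -d /opt/myapp }] } {
    puts stderr "Warning: /opt/myapp not found; is the application installed?"
}
```

The prereq and conflict commands are the standard way Environment Modules expresses dependencies; the exec call shows that anything Tcl can run, a module can run.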

Making Sure It Works Correctly

Now that you’ve defined a module, you need to check to make sure it works. Before you load the module, check to see which gcc is being used:

home8:~> which gcc
home8:~> gcc -v
Reading specs from /usr/lib/gcc/i386-redhat-linux/3.4.3/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --with-system-zlib --enable-__cxa_atexit
--host=i386-redhat-linux
Thread model: posix
gcc version 3.4.3 20050227 (Red Hat 3.4.3-22.1)
This means gcc is currently pointing to the system gcc. (Yes, this is a really old gcc; I need to upgrade my simple test box at home).

Next, load the module and check which gcc is being used:

home8:~> module avail

------------------------------ /usr/local/Modules/versions ------------------------------

-------------------------- /usr/local/Modules/3.2.6/modulefiles --------------------------
compilers/gcc-4.6.2 dot module-info null
compilers/modules module-cvs modules use.own
home8:~> module load compilers/gcc-4.6.2
home8:~> module list
Currently Loaded Modulefiles:
1) compilers/gcc-4.6.2
home8:~> which gcc
home8:~> gcc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ./configure --prefix=/home/laytonj/bin/gcc-4.6.2
--enable-languages=c,fortran --enable-libgomp
Thread model: posix
gcc version 4.6.2
This means if you used gcc, you would end up using the version built in your home directory.

As a final check, unload the module and recheck where the default gcc points:

home8:~> module unload compilers/gcc-4.6.2
home8:~> module list
No Modulefiles Currently Loaded.
home8:~> which gcc
home8:~> gcc -v
Reading specs from /usr/lib/gcc/i386-redhat-linux/3.4.3/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --with-system-zlib --enable-__cxa_atexit
Thread model: posix
gcc version 3.4.3 20050227 (Red Hat 3.4.3-22.1)
Notice that after you unload the module, the default gcc goes back to the original version, which means the environment variables are probably correct. If you want to be more thorough, you should check all of the environment variables before loading the module, after the module is loaded, and then after the module is unloaded. But at this point, I’m ready to declare success!
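One way to do that thorough check is to snapshot the environment at each stage and diff the snapshots. The sketch below uses a plain PATH edit as a stand-in for module load/unload (so it runs anywhere); with real modules, you would run the same env snapshots around module load and module unload.

```shell
#!/bin/sh
# Snapshot the environment before any change
env | sort > before.txt

# Stand-in for "module load": prepend a (fake) directory to PATH
OLD_PATH=$PATH
PATH=/tmp/fake-gcc/bin:$PATH
export PATH
env | sort > after.txt

# Show exactly which variables changed (here: only PATH)
diff before.txt after.txt || true

# Stand-in for "module unload": restore PATH
PATH=$OLD_PATH
export PATH
env | sort > restored.txt

# A clean unload means no differences against the original snapshot
diff before.txt restored.txt && echo "environment restored"
```

If the final diff is empty, the load/unload pair left the environment exactly as it found it, which is precisely what a well-behaved module should do.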

Final Comments

For clusters, Environment Modules are pretty much the best solution for handling multiple compilers, multiple libraries, or even applications. They are easy to use, even for command-line beginners: just a few commands let you add modules to and remove them from your environment. You can even use them in job scripts. As you also saw, it's not too difficult to write your own module and use it. Environment Modules are truly one of the indispensable tools for clusters.

© 2017 Linux New Media USA, LLC
