This article documents the use of the Anaconda data science platform at NSC and some related concepts, e.g., the conda command, conda-forge, mamba, etc. It includes advanced topics related to integrating compiled and Anaconda-provided software. More specific help on using Python at NSC (via Anaconda or other ways) is found at the NSC software page for Python.
The Anaconda data science platform is maintained by Anaconda Inc. It provides a way to distribute and run software primarily for scientific computing and data analysis. The emphasis is on Python and R with various supporting modules. However, much other software is also covered (in particular, software and libraries that the Python and R modules depend on). In practice, Anaconda works similarly to other container technologies: one can run a program (e.g., written in Python, R, a binary, etc.) in an Anaconda runtime environment - or “conda environment” for short - created from Anaconda packages that provide precise versions of a set of supporting software and libraries. These environments help ensure reproducible behavior across different systems.
The container-like nature of conda environments poses challenges for HPC clusters. Some software and libraries need to interact properly with supercomputing hardware to avoid breakage or degraded performance. Furthermore, the Anaconda-provided versions of some system programs may behave differently from those provided by NSC's systems, which can lead to unexpected behavior that is difficult to diagnose.
Conda-forge is a community-driven library of packages that work with the Anaconda system. These packages provide a rich library of additional software beyond the packages maintained by Anaconda Inc. However, since these are provided by the community of users, they may have undergone less testing and security review than packages provided by Anaconda Inc.
Mamba is an alternative open-source implementation of the conda tool used to set up and maintain conda environments, meant to address performance issues with the standard conda tool.
Check the availability of NSC's Anaconda modules using the module avail command:
$ module avail Anaconda
...
Anaconda/2021.05-nsc1
Anaconda/2022.05-nsc1
...
The default location for conda environment installations is ~/.conda in your home directory. This location can be problematic since these environments can become very large. Therefore, it is suggested to redirect this directory using a symbolic link to a project space. For example, if the ~/.conda directory already exists:
$ mv ~/.conda /proj/ourprojname/users/x_abcde/conda
$ ln -s /proj/ourprojname/users/x_abcde/conda ~/.conda
(If you do not already have a ~/.conda directory, just skip the first command.)
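To confirm that the redirection is in place, you can check that ~/.conda now is a symbolic link pointing into the project space (the project and username below are the same placeholders as in the example above); the output should look roughly like:
$ ls -ld ~/.conda
lrwxrwxrwx ... /home/x_abcde/.conda -> /proj/ourprojname/users/x_abcde/conda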
Loading an NSC Anaconda module only gives you access to the conda command (i.e., it does not alter your environment to enable a “base” conda environment). After loading an Anaconda module, you can issue conda create to create a customized Python environment with exactly the packages (and versions) you need. A basic example for creating a conda environment called myenv with Python 3.8 and the pandas and seaborn packages:
$ module load Anaconda/2022.05-nsc1
$ conda create -n myenv python=3.8 pandas seaborn
$ conda activate myenv
Now the command python will refer to the python installed in this environment and provide access to pandas and seaborn, e.g.:
$ which python
~/.conda/envs/myenv/bin/python
$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'1.4.2'
You can, of course, run a Python program using this installed python as usual, e.g.:
$ python my_pandas_python_program.py
You do not have to create the environment again when you log in next time. You simply reactivate the same environment using:
$ module load Anaconda/2022.05-nsc1
$ conda activate myenv
and then, e.g., run a Python program in this environment:
$ python my_pandas_python_program.py
You can see what environments you have created by:
$ conda env list
Once you have activated an environment, you can modify it, e.g., by installing additional packages:
$ conda install cython
However, the dependency resolution for a request like this can sometimes lead to unforeseen and destructive changes to the environment, even as far as, e.g., removing packages or downgrading the Python version. Hence, we recommend that environments intended for use “in production” are not altered this way but reinstalled with all desired software constraints specified in a single go using a (possibly very long) conda create command (or use the conda feature to specify these requirements in a file: conda env create -f environment.yml). Furthermore, you may want to try the mamba alternative to conda (see the section Mamba and Mambaforge), which tends to be more reliable, faster, and give clearer output when figuring out how to alter an environment.
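As a minimal sketch of the file-based approach (the environment name and package selection below are purely illustrative), such a file can be written and then installed in a single step:
$ cat > environment.yml <<'EOF'
name: myprodenv
dependencies:
  - python=3.8
  - pandas
  - seaborn
  - cython
EOF
$ conda env create -f environment.yml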
To find packages to install, you can use conda search, which lists all available versions of matching packages, e.g.:
$ conda search sympy
Loading channels: done
# Name Version Build Channel
sympy 1.1.1 py27_0 pkgs/main
sympy 1.1.1 py27hc28188a_0 pkgs/main
[...]
Alternatively, there is also an online list of packages available.
To list the packages installed in an activated environment, use conda list. You can check for the presence of a specific package using grep:
$ conda list | grep -i scipy
scipy 1.8.0 py38h56a6a73_1 conda-forge
If you find yourself doing much work in the same conda environment, activating that environment automatically on every login may seem attractive. However, NSC strongly recommends against this: having every login run inside a conda environment may have far-reaching side effects that can be difficult to diagnose. It is, however, fine to load an appropriate Anaconda module (without also automatically activating an environment). You can do so by adding the following lines to your ~/.bashrc file:
export NSC_MODULE_SILENT=1
module load Anaconda/2021.05-nsc1
unset NSC_MODULE_SILENT
Here, the steps with NSC_MODULE_SILENT hide the verbose output when loading the module. With this in your .bashrc, you only have to issue conda activate <name> to activate your desired environment after logging in.
The conda-forge community-driven library of packages greatly extends the software available for installation with conda. Since these packages are provided by the community of users, they may have undergone less testing and security review than packages provided by Anaconda Inc.
To instruct the conda command to also locate packages in the conda-forge repository, add the flag -c conda-forge to the install or create commands, e.g., to install the package ase:
$ conda install -c conda-forge ase
You can also add conda-forge to the channels that are automatically considered for all installations in an active environment by:
$ conda config --add channels conda-forge
In this setup, it is generally recommended also to turn on “strict channel priority,” which alters the behavior of conda so that a package available from multiple channels is only considered from the first channel that provides it. This avoids mixing packages from different channels in ways that may result in unintended behavior. To activate this setting, do:
$ conda config --set channel_priority strict
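These channel settings are stored in the .condarc file in your home directory. You can inspect the resulting configuration with, e.g.:
$ conda config --show channels
$ conda config --show channel_priority
$ cat ~/.condarc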
Mamba is a drop-in alternative to the conda command; one of its central aims is to address performance issues with the standard conda tool.
NSC makes mamba (and conda) available via the Mambaforge distribution (a variant of Miniforge), with which users can create, alter, and activate conda environments.
For example, to set up a customized environment and run a Python program my_scipy_python_program.py in it:
$ module load Mambaforge/4.12.0-0-nsc1
$ mamba create -n myenv python=3.8 scipy=1.5.2
$ mamba activate myenv
$ python my_scipy_python_program.py
To run the Python program in the same environment when logging in the next time:
$ module load Mambaforge/4.12.0-0-nsc1
$ mamba activate myenv
$ python my_scipy_python_program.py
Notes:
- mamba works interchangeably with exactly the same conda environment directories as the conda command from the Anaconda modules. Hence, it is generally safe to swap between loading the Anaconda and Mambaforge modules while working with the same conda environments.
- Unlike the Anaconda modules, the Mambaforge modules have conda-forge as the default channel for packages. Without configuring it to do so, mamba will not install the packages provided by Anaconda Inc.
- You should be able to replace more or less any occurrence of conda in this document with mamba.
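For example, since the Mambaforge modules read the same environment directories, an environment created earlier with an Anaconda module (such as myenv from the examples above) is directly visible:
$ module load Mambaforge/4.12.0-0-nsc1
$ mamba env list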
As mentioned above, software using MPI for parallelization poses additional challenges since it must interact correctly with NSC hardware. There are two main strategies to get Anaconda-installed MPI software to work at NSC:
1. Use conda to install both the MPI software and MPI-supporting libraries that are compatible with NSC hardware.
2. Keep the MPI libraries external to the conda dependencies, making the conda software use the usual NSC-provided binaries and libraries.

Alternative 1 tends to happen automatically if one installs MPI software without carefully considering the dependencies. However, there is a risk that one ends up with a conda-installed MPI configuration and libraries incompatible with the NSC setup, or with degraded performance of the MPI communication. Nevertheless, such degraded performance may be acceptable for software where the performance of the inter-process communication is not critical. This situation appears somewhat common for Python programs using MPI via, e.g., the mpi4py Python module.
The chance of alternative 1 working well is improved by asking conda to install specific MPI packages compatible with the versions provided by NSC: either a carefully selected version of OpenMPI or, possibly easier, a version of MPICH compatible with the Intel MPI provided by NSC via the MPICH ABI compatibility. A good choice on Tetralith/Sigma is the 4.<something> version series of OpenMPI, or MPICH version 3.3.2.
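As a rough sketch of this approach (check the currently available versions with conda search before relying on these exact pins), one can pin the MPI package when creating the environment and then run a minimal mpi4py check, preferably inside an interactive or batch job:
$ module load Anaconda/2022.05-nsc1
$ conda create -n mpi_env -c conda-forge python=3.8 "mpich=3.3.2" mpi4py
$ conda activate mpi_env
$ mpirun -np 2 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(c.Get_rank(), 'of', c.Get_size())"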
For alternative 2, one instead asks conda to install packages named openmpi=<something>=external_* or mpich=<something>=external_*, corresponding as closely as possible to the versions available at NSC. These packages do not install any MPI binaries or libraries into the conda environment.
However, for this to work, one also needs to provide a way for the conda-installed software to find the NSC MPI libraries. The most direct way to do so is to manually place a symbolic link to the appropriate NSC MPI library in the lib directory of the conda environment. For example:
$ conda install "openmpi=4.1.4=external_*" -c conda-forge
$ ln -s /software/sse/easybuild/prefix/software/OpenMPI/4.0.5-GCC-10.2.0/lib/libmpi.so.40 ~/.conda/envs/myenv/lib/.
or
$ conda install "mpich=3.4.3=external_*" -c conda-forge
$ ln -s /software/sse/easybuild/prefix/software/impi/2018.1.163-GCC-6.4.0-2.28/intel64/lib/release/libmpi.so.12 ~/.conda/envs/myenv/lib/.
Depending on the details of how the software integrates with MPI, this may or may not work. If you get errors referring to “missing symbols”, etc., feel free to contact NSC support for help. Alternatively, it may be easier to get software with tricky dependencies on MPI to work by following the below instructions for building a pip-provided version of the software instead of installing a pre-built conda version.
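One way to sanity-check which MPI library a conda-installed package will actually load is to inspect its compiled extension with ldd; the path below is only an illustration (it depends on your environment name, Python version, and the package in question, here mpi4py):
$ ldd ~/.conda/envs/myenv/lib/python3.8/site-packages/mpi4py/MPI.cpython-38-x86_64-linux-gnu.so | grep libmpi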
If you cannot locate a package for a particular software via conda (either from the Anaconda Inc. default channel or conda-forge), it may be possible to use an alternative Python-oriented software package manager, pip. The pip command is installable as an Anaconda package. Packages installed via this pip command are placed into your active conda environment and will thus not affect your Python environment when the environment is not activated. (Hence, never use the --user flag to pip inside a conda environment, since that flag will override this useful behavior.)
Some packages that can be installed with pip require compilation (i.e., they are not pure Python). Such installations introduce extra complications that are addressed in the sections below.
Helpful notes:
The pip and conda commands will try to independently maintain their own sets of dependencies, which can lead to major issues with version conflicts. A good strategy is to add all packages you need with conda first (preferably in a single go with one conda create command) and then only add packages with pip. Avoid going back and forth between conda and pip.
For pip packages that do not require compilation: only install via pip if you cannot find the corresponding package in conda. (For packages that require compilation, see below for other constraints to consider.)
To install a package with pip, first make sure you have the Anaconda version of pip in your environment, then use the PyPI search function to find the package name and install the package:
$ conda install pip
$ pip install python-hostlist
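To make sure you are using the pip that belongs to the active conda environment (and not some other pip on the system), check that it resolves to a path inside the environment, for example:
$ which pip
~/.conda/envs/myenv/bin/pip
$ pip --version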
There are two main options for compiling software in relation to Anaconda: either use the NSC-provided compilers or the compilers provided by the conda-forge package compilers.
In general, a binary executable and all its library dependencies (linked or dynamically loaded) should use a single compiler; one may otherwise see “missing symbols”, version conflicts, or other errors. One may thus run into such issues if software built using the NSC compilers links to, dynamically loads, or is dynamically loaded by conda-provided software that comes pre-built in binary format.
In particular, interpreters, e.g., Python, R, Perl, Octave, etc., have some modules/packages that dynamically load compiled binary libraries. Hence, if, for example, the PyYAML Python package is built using NSC compilers, one may encounter problems if it is imported into a conda-provided python. Likewise, a scientific software package downloaded from GitHub and compiled with NSC compilers inside a conda environment, linked with the conda-provided libxml2 library, may also break. On the other hand, there is no issue if the scientific software instead executes the conda-provided python as a separate process and runs a script using the conda-provided libxml2 Python bindings.
Using the NSC compilers is the recommended way to build software that integrates with other software in the conda environment by executing binaries (i.e., not via linked or dynamically loaded libraries). The following example shows how to set up a conda environment and use NSC-provided compilers to build a C program with source code in my_example_program.c that expects to be able to execute python with numpy and scipy available:
$ module load Anaconda/2022.05-nsc1
$ conda create -n example_env python=3 numpy scipy
$ conda activate example_env
$ module load buildenv-intel/2018a-eb
$ icc my_example_program.c -o my_example_program
If the program being compiled integrates more tightly with conda-installed libraries (e.g., via linking or dynamic loading), or you are building, e.g., a Python module to be dynamically loaded by the conda-provided python, the recommended strategy is to install and use the conda-provided compilers. Here is an example of how to set up a conda environment and build my_example_program.c using compilers provided by the compilers package from conda-forge:
$ module load Anaconda/2022.05-nsc1
$ conda create -n example_buildenv -c conda-forge python=3 compilers
$ conda activate example_buildenv
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" cc my_example_program.c -o my_example_program
In the last line, we modify the PATH environment variable just while executing cc to circumvent helper scripts in the NSC environment meant to aid compilation with the NSC-provided compilers.
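To verify that the cc picked up this way really is the conda-provided compiler rather than a system or NSC one, you can check with the same temporary PATH:
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" which cc
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" cc --version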
Some packages that can be installed via pip require compilation. It is possible to instruct pip to use NSC-provided compiler commands for such builds. For example, the following instructions set up a conda environment and use the NSC compilers to build the PyYAML Python module using pip:
$ module load Anaconda/2022.05-nsc1
$ conda create -n yaml_env -c conda-forge python=3 pip
$ conda activate yaml_env
$ module load buildenv-intel/2018a-eb
$ CC=icc CXX=icpc pip install pyyaml
The above can work for relatively simple pip packages (and at the time of writing, it seems to work for pyyaml). Nevertheless, as discussed above, this build may lead to problems (see Mixing conda and software compiled from source code or via pip) since we end up with a binary YAML library built with the NSC compilers that will be dynamically loaded into the conda-provided Python.
Hence, similar to when compiling software from source, the recommended strategy is to instead install and use the conda-provided compilers. Here is an example of how to do so to build the pyyaml pip package:
$ module load Anaconda/2022.05-nsc1
$ conda create -n yaml_buildenv -c conda-forge python=3 pip compilers
$ conda activate yaml_buildenv
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" pip install pyyaml --no-cache-dir --global-option=build_ext --global-option="--rpath=$ORIGIN/../../.."
Notes:
- We modify the PATH environment variable just while executing pip install pyyaml to circumvent helper scripts in the NSC environment meant to aid compilation with NSC-provided compilers.
- We give pip the --no-cache-dir and the two --global-option parameters to ensure the rpath feature is set up for the compiled library consistently with how other conda-installable software is built.

Software that simultaneously integrates with conda-installed packages and uses MPI for parallelization adds another layer of complexity. This situation is addressed in the next section.
Just as for non-MPI software, if the software being compiled only interacts with software in the conda environment by invoking binaries, the recommended way to build it is to use the NSC toolchains. For example, let's consider a software package that uses MPI, is built using the make command, and, when run, invokes python as a binary expecting numpy and scipy to be available. This software can be built with the following steps:
$ module load Anaconda/2022.05-nsc1
$ conda create -n example_mpi_env python=3 numpy scipy
$ conda activate example_mpi_env
$ module load buildenv-intel/2018a-eb
$ CC=mpicc make
Where, in this example, we assume that setting the environment variable CC to mpicc is how to tell the Makefile to build the software using that compiler (other software may use other means of configuration).
As a counter-example, the following example tries to create a conda environment in which to build and install the asap3 pip package, compiling it using an NSC toolchain:
$ module load Anaconda/2022.05-nsc1
$ conda create -n asap3_env -c conda-forge python=3 numpy ase
$ conda activate asap3_env
$ module load buildenv-intel/2018a-eb
$ CC=mpiicc CXX=mpiicpc pip install asap3
However, the resulting asap3 library does not work:
$ python3 ./md.py
...
ImportError: /software/sse/easybuild/prefix/software/GCCcore/6.4.0/lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /home/rar/.conda/envs/asap3_env/lib/python3.11/site-packages/scipy/linalg/_matfuncs_sqrtm_triu.cpython-311-x86_64-linux-gnu.so)
As discussed in the previous sections, asap3 provides a library to be loaded dynamically into the python provided by conda. In this case, this leads to a mismatch between libstdc++ versions (the missing GLIBCXX version in the error message). However, even if this clash had not occurred, asap3 would have to interact with ase, which also comes with MPI support but is compiled using a different set of MPI libraries, which would likely have caused further issues.
Two strategies for resolving this issue are explored in the subsections below.
The most straightforward strategy to build MPI software that integrates tightly with conda-provided packages is to combine the conda-provided compilers in the compilers package with the MPI libraries provided by NSC. Let's consider a software package that uses MPI, is built using the make command, and links to the libnetcdf library provided by the conda environment. This software can be built with the following steps:
$ module load Anaconda/2022.05-nsc1
$ conda create -n example_mpi_env -c conda-forge python=3 compilers libnetcdf numpy scipy
$ conda activate example_mpi_env
$ module load buildenv-intel/2018a-eb
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin:/software/sse/easybuild/prefix/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/bin64" CC=mpigcc CXX=mpigxx I_MPI_CC="cc" I_MPI_CXX="c++" make
Where, in the example, we assume that the Makefile is set up to use the environment variables CC and CXX to build the software using those compilers (other software may use other means of configuration). On the line executing the make command we:
- modify the PATH environment variable to circumvent helper scripts in the NSC environment meant to aid compilation with NSC-provided compilers, and instead point at specific versions of the Intel MPI wrappers suitable for combination with the conda-provided compilers.
- make these MPI wrappers use the conda-provided GNU cc and c++ compilers via the I_MPI_CC and I_MPI_CXX environment variables.

Similarly, to build the asap3 pip package, the steps would be as follows:
$ module load Anaconda/2022.05-nsc1
$ conda create -n buildenv_mpi -c conda-forge python=3 "mpich=3.4.3=external_*" compilers numpy ase
$ conda activate buildenv_mpi
$ module load buildenv-intel/2018a-eb
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin:/software/sse/easybuild/prefix/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/bin64" CC=mpigcc CXX=mpigxx I_MPI_CC="cc" I_MPI_CXX="c++" pip install asap3 --no-cache-dir --global-option=build_ext --global-option="--rpath=$ORIGIN/../../.."
As an alternative to the above strategy, it is possible to set up a conda environment with the compilers package and a conda-provided MPI package that is sufficiently compatible with the NSC MPI setup. The following steps build asap3 this way:
$ module load Anaconda/2022.05-nsc1
$ conda create -n buildenv_mpi -c conda-forge python=3 "mpich=3.4.3" compilers numpy ase
$ conda activate buildenv_mpi
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" CC=mpicc CXX=mpic++ MPICH_CC="" MPICH_CXX="" pip install asap3 --no-cache-dir --global-option=build_ext --global-option="--rpath=$ORIGIN/../../.."
Where, on the line executing the build command pip install asap3, we:
- modify the PATH environment variable to circumvent helper scripts in the NSC environment meant to aid compilation with NSC-provided compilers.
- tell pip to use the MPI-wrapped versions of the C and C++ compilers, i.e., mpicc and mpic++.
- set MPICH_CC and MPICH_CXX to empty values to ensure these wrappers use the standard (i.e., the conda-provided) C and C++ compilers.

The above build and environment often work due to the MPICH ABI compatibility with Intel MPI.
When the above build has been completed with the conda-provided compilers and MPI, one may want to explore replacing the conda-provided MPICH package with one that references the external system MPI, aiming for the type of environment recommended in Conda packages and MPI, which could potentially improve MPI performance. The steps to do so are as follows:
$ conda install -c conda-forge "mpich=3.4.3=external_*"
$ ln -s /software/sse/easybuild/prefix/software/impi/2018.1.163-GCC-6.4.0-2.28/intel64/lib/release/libmpi.so.12 "$CONDA_PREFIX/lib/."
However, depending on the details of how the software integrates with the MPI library, this may not work. In the case of asap3, this strategy does not work, and trying to start a parallelized MPI run now results in a runtime “No MPI error”.