Anaconda

This article documents the use of the Anaconda data science platform at NSC and some related concepts, e.g., the conda command, conda-forge, mamba, etc. It includes advanced topics related to integrating compiled and Anaconda-provided software. More specific help on using Python at NSC (via Anaconda or other ways) is found at the NSC software page for Python.

Introduction

The Anaconda data science platform is maintained by Anaconda Inc. It provides a way to distribute and run software primarily for scientific computing and data analysis. The emphasis is on Python and R with various supporting modules. However, much other software is also covered (in particular, software and libraries that the Python and R modules depend on). In practice, Anaconda works similarly to other container technologies: one can run a program (e.g., written in Python, R, a binary, etc.) in an Anaconda runtime environment - or "conda environment" for short - created from Anaconda packages that provide precise versions of a set of supporting software and libraries. These environments help ensure reproducible behavior across different systems.

The container-like nature of conda environments poses challenges for HPC clusters. Some software and libraries need to interact properly with supercomputing hardware to avoid breakage or degraded performance. Furthermore, the Anaconda-provided versions of some system programs may behave differently from those provided by NSC's systems, which can lead to unexpected behavior that is difficult to diagnose.

Conda-forge is a community-driven library of packages that work with the Anaconda system. These packages provide a rich library of additional software beyond the packages maintained by Anaconda Inc. However, since these are provided by the community of users, they may have undergone less testing and security review than packages provided by Anaconda Inc.

Mamba is an alternative open-source implementation of the conda tool for setting up and maintaining conda environments. It is meant to address performance issues with the standard conda tool.

Basic use of Anaconda at NSC

Check the availability of NSC's Anaconda modules using the module avail command:

$ module avail Anaconda
...
Anaconda/2021.05-nsc1
Anaconda/2022.05-nsc1
...

By default, conda environments are installed under ~/.conda in your home directory. This location can be problematic since these environments can become very large. Therefore, it is suggested to redirect this directory to a project space using a symbolic link. For example, if the ~/.conda directory already exists:

$ mv ~/.conda /proj/ourprojname/users/x_abcde/conda
$ ln -s /proj/ourprojname/users/x_abcde/conda ~/.conda

(If you do not already have a ~/.conda directory, just skip the first command.)
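
The same steps can be wrapped in a small shell function that also covers the case where ~/.conda does not exist yet. This is only a sketch: the function name relocate_dir is made up for this example, and the project path in the usage comment is the placeholder used above.

```shell
# Hypothetical helper: move a directory to a new location and leave a
# symbolic link behind. Use an absolute path for the target.
relocate_dir() {
    src=$1
    dst=$2
    # Nothing to do if the source has already been redirected.
    if [ -L "$src" ]; then
        echo "$src is already a symbolic link; nothing to do" >&2
        return 0
    fi
    # Refuse to touch an existing target, to avoid mixing directories.
    if [ -e "$dst" ]; then
        echo "$dst already exists; refusing to overwrite it" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dst")" || return 1
    if [ -d "$src" ]; then
        mv "$src" "$dst" || return 1   # existing environments move along
    else
        mkdir "$dst" || return 1       # no source yet: start with an empty target
    fi
    ln -s "$dst" "$src"
}

# Example use on the cluster:
#   relocate_dir ~/.conda /proj/ourprojname/users/x_abcde/conda
```

Because the function returns early when ~/.conda is already a symbolic link, calling it a second time should leave an existing setup untouched.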

Loading an NSC Anaconda module only gives you access to the conda command (i.e., it does not alter your environment to enable a "base" conda environment). After loading an Anaconda module, you can issue conda create to create a customized Python environment with exactly the packages (and versions) you need. A basic example for creating a conda environment called myenv with Python 3.8 with the pandas and seaborn packages:

$ module load Anaconda/2022.05-nsc1
$ conda create -n myenv python=3.8 pandas seaborn
$ conda activate myenv

Now the command python will refer to the python installed in this environment and provide access to pandas and seaborn, e.g.:

$ which python
~/.conda/envs/myenv/bin/python
$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'1.4.2'

You can, of course, run a python program using this installed python as usual, e.g.:

$ python my_pandas_python_program.py

You do not have to create the environment again when you log in next time. You simply reactivate the same environment using:

$ module load Anaconda/2022.05-nsc1
$ conda activate myenv

and then, e.g., run a Python program in this environment:

$ python my_scipy_python_program.py

You can see what environments you have created by:

$ conda env list

Once you have activated an environment, you can modify it, e.g., by installing additional packages:

$ conda install cython

However, the dependency resolution for a request like this can sometimes lead to unforeseen and destructive changes to the environment, such as removing packages or even downgrading the Python version. Hence, we recommend that environments intended for use "in production" are not altered this way. Instead, recreate them with all desired software constraints specified in a single go, either with one (possibly very long) conda create command or by listing the requirements in a file and running conda env create -f environment.yml. Furthermore, you may want to try the mamba alternative to conda (see the section Mamba and Mambaforge), which tends to be faster, more reliable, and clearer in its output when working out how to alter an environment.
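
As an illustration of the file-based approach, the following sketch writes an environment.yml that corresponds to the myenv example above with cython added. The package list is an assumption made for this example, not a recommended set.

```shell
# Write a minimal environment file describing the whole environment in one place.
# The package selection mirrors the myenv example and is only illustrative.
cat > environment.yml <<'EOF'
name: myenv
dependencies:
  - python=3.8
  - pandas
  - seaborn
  - cython
EOF

# Show what was written.
cat environment.yml
```

With an Anaconda module loaded, conda env create -f environment.yml then resolves all listed requirements together rather than one install at a time.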

To find packages to install, you can use conda search, which lists all available versions of matching packages, e.g.:

$ conda search sympy

Loading channels: done
# Name                       Version           Build  Channel
sympy                          1.1.1          py27_0  pkgs/main
sympy                          1.1.1  py27hc28188a_0  pkgs/main
[...]

Alternatively, there is also an online list of packages available.

To list the packages installed in an activated environment use conda list. You can check for the presence of a specific package using grep:

$ conda list | grep -i scipy
scipy                     1.8.0            py38h56a6a73_1    conda-forge

If you find yourself doing much work in the same conda environment, activating that environment automatically on every login may seem attractive. NSC strongly recommends against this: having every login run inside a conda environment may have far-reaching side effects that can be difficult to diagnose. It is fine, however, to load an appropriate Anaconda module automatically (without also activating an environment). You can do so by adding the following lines to your ~/.bashrc file:

export NSC_MODULE_SILENT=1
module load Anaconda/2021.05-nsc1
unset NSC_MODULE_SILENT

Here, the steps with NSC_MODULE_SILENT hide the verbose output when loading the module. With this in your .bashrc you only have to issue conda activate <name> to activate your desired environment after logging in.

Conda-forge

The Conda-forge community-driven library of packages greatly extends the software available for installation with conda. Since these packages are provided by the community of users, they may have undergone less testing and security review than packages provided by Anaconda Inc.

To instruct the conda command to locate packages also in the conda-forge repository, add the flag -c conda-forge to the install or create commands, e.g., to install the package ase:

$ conda install -c conda-forge ase

You can also add conda-forge to the channels that are automatically considered for all installations in an active environment by:

$ conda config --add channels conda-forge

In this setup, it is generally recommended also to turn on "strict channel priority," which alters the behavior of conda so that packages available from multiple sources are only considered from the first source where they are available. This behavior avoids mixing packages from multiple sources in ways that may result in unintended behavior. To activate this setting, do:

$ conda config --set channel_priority strict

Mamba and Mambaforge

Mamba is a drop-in alternative to the conda command for which one of the central aims is to address performance issues with the standard conda tool.

NSC makes mamba (and conda) available via the Mambaforge miniforge distribution, with which users can create, alter and activate conda environments. For example, to set up a customized environment and run a Python program my_scipy_python_program.py in it:

$ module load Mambaforge/4.12.0-0-nsc1
$ mamba create -n myenv python=3.8 scipy=1.5.2
$ mamba activate myenv
$ python my_scipy_python_program.py

To run the python program in the same environment when logging in the next time:

$ module load Mambaforge/4.12.0-0-nsc1
$ mamba activate myenv
$ python my_scipy_python_program.py

Notes:

  • mamba works interchangeably with exactly the same conda environment directories as the conda command from the Anaconda modules. Hence, it is generally safe to swap between loading Anaconda and Mambaforge modules, working with the same conda environments.
  • However, unlike the Anaconda modules, the Mambaforge modules have conda-forge as the default channel for packages. Unless configured to do so, mamba will not install the packages provided by Anaconda Inc.

You should be able to replace more or less any occurrence of conda in this document with mamba.

Conda packages and MPI

As mentioned above, software using MPI for parallelization poses additional challenges since it must interact correctly with NSC hardware. There are two main strategies to get Anaconda-installed MPI software to work at NSC:

  1. Use the dependency resolution in conda to install both the MPI software and MPI-supporting libraries that are compatible with NSC hardware.
  2. Install a conda package to abstractly represent the system MPI libraries to satisfy the conda dependencies, making the conda software use the usual NSC-provided binaries and libraries.

Alternative 1 tends to happen automatically if one installs MPI software without carefully considering the dependencies. However, there is a risk that one ends up with a conda-installed MPI configuration and libraries that are incompatible with the NSC setup or that give degraded MPI communication performance. Nevertheless, such degraded performance may be acceptable for software where the performance of the inter-process communication is not critical. This situation appears somewhat common for Python programs using MPI via, e.g., the mpi4py Python module.

The chance of alternative 1 working well is improved by asking conda to install specific MPI packages compatible with the versions provided by NSC: either a carefully selected version of OpenMPI or, possibly easier, a version of MPICH that is compatible with the Intel MPI provided by NSC through the MPICH ABI compatibility. Good choices on Tetralith/Sigma are the 4.<something> version series of OpenMPI and MPICH version 3.3.2.

For alternative 2 one instead asks conda to install packages named as openmpi=<something>=external_* or mpich=<something>=external_*, corresponding as closely as possible to the versions available at NSC. These packages do not install any MPI binaries or libraries into the conda environment.

However, for this to work, one also needs to provide a way for the conda-installed software to find the NSC MPI libraries. The most direct way to do so is to manually create a symbolic link to the appropriate NSC MPI library into the lib directory of the conda environment. For example:

$ conda install "openmpi=4.1.4=external_*" -c conda-forge
$ ln -s /software/sse/easybuild/prefix/software/OpenMPI/4.0.5-GCC-10.2.0/lib/libmpi.so.40 ~/.conda/envs/myenv/lib/.

or

$ conda install "mpich=3.4.3=external_*" -c conda-forge
$ ln -s /software/sse/easybuild/prefix/software/impi/2018.1.163-GCC-6.4.0-2.28/intel64/lib/release/libmpi.so.12 ~/.conda/envs/myenv/lib/.

Depending on the details of how the software integrates with MPI, this may or may not work. If you get errors referring to "missing symbols", etc., feel free to contact NSC support for help. Alternatively, it may be easier to get software with tricky dependencies on MPI to work by following the instructions below for building a pip-provided version of the software instead of installing a pre-built conda version.
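
One simple thing to rule out before contacting support is a symbolic link that points at a path that does not exist, e.g., because of a typo in the version directory. The following sketch checks for that; the function name link_resolves is made up for this example.

```shell
# Hypothetical helper: succeed only if $1 is a symbolic link whose target exists.
# [ -e ] follows the link, so it fails for a dangling link.
link_resolves() {
    [ -L "$1" ] && [ -e "$1" ]
}

# Example use inside the activated environment:
#   link_resolves "$CONDA_PREFIX/lib/libmpi.so.12" || echo "link missing or dangling" >&2
```

A link that resolves is of course no guarantee that the library is compatible; it only excludes one trivial failure mode.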

Using pip in conda environments

If you cannot locate a package for a particular software via conda (either from the Anaconda Inc. default channel or conda-forge) it may be possible to use an alternative Python-oriented software package manager, pip. The pip command is installable as an Anaconda package. Packages installed via this pip command are placed into your active conda environment and will thus not affect your Python environment when the environment is not activated. (Hence, never use the --user flag to pip inside a conda environment since that flag will override this useful behavior.)

Some packages that can be installed with pip require compilation (i.e., they are not pure Python). Such installations introduce extra complications that are addressed in the sections below.

Helpful notes:

  • The pip and conda commands will try to independently maintain their own sets of dependencies, which can lead to major issues with version conflicts. A good strategy is to add all packages you need with conda first (preferably in a single go with one conda create command) and then only add packages with pip. Avoid going back and forth between conda and pip.

  • For pip packages that do not require compilation: only install via pip if you cannot find the corresponding package in conda. (For packages that require compilation, see below for other constraints to consider.)

To install a package with pip, first make sure you have the Anaconda version of pip in your environment. Then use the PyPI search function to find the package name, and install the package:

$ conda install pip
$ pip install python-hostlist
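
To confirm that the pip command found on your PATH is the one inside the active conda environment, a check along the following lines may help. It is a sketch that relies on the CONDA_PREFIX variable set by conda activate; the function name pip_in_active_env is made up for this example.

```shell
# Hypothetical helper: succeed only if "pip" resolves to the copy inside the
# currently activated conda environment.
pip_in_active_env() {
    # CONDA_PREFIX is empty or unset when no environment is active.
    [ -n "${CONDA_PREFIX:-}" ] || return 1
    case "$(command -v pip)" in
        "$CONDA_PREFIX"/bin/pip) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use:
#   pip_in_active_env && pip install python-hostlist
```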

Mixing conda and software compiled from source code or via pip

There are two main options for compiling software in relation to Anaconda: either use the NSC-provided compilers or the compilers provided by the conda-forge package compilers.

In general, a binary executable and all its library dependencies (linked or dynamically loaded) should be built with a single compiler; one may otherwise see "missing symbols", version conflicts, or other errors. Such issues can thus arise when software built using the NSC compilers links to, dynamically loads, or is dynamically loaded by conda-provided software that is distributed pre-built in binary format.

In particular, interpreters, e.g., Python, R, Perl, Octave, etc., have some modules/packages that dynamically load compiled binary libraries. Hence, if, for example, the PyYAML Python package is built using NSC compilers, one may encounter problems if it is imported into a conda-provided python. Likewise, a scientific software package downloaded from GitHub and compiled with NSC compilers inside a conda environment, linked with the conda-provided libxml2 library, may also break. On the other hand, there is no issue if the scientific software instead executes the conda-provided python as a separate process and runs a script using the conda-provided libxml2 Python bindings.

Compilation of (non-MPI) software from source

Using the NSC compilers is the recommended way to build software that integrates with other software in the conda environment by executing binaries (i.e., not via linked or dynamically loaded libraries). The following example shows how to set up a conda environment and use NSC-provided compilers to build a C program with source code in my_example_program.c that expects to be able to execute python with numpy and scipy available:

$ module load Anaconda/2022.05-nsc1
$ conda create -n example_env python=3 numpy scipy
$ conda activate example_env
$ module load buildenv-intel/2018a-eb
$ icc my_example_program.c -o my_example_program

If the program being compiled integrates more tightly with conda-installed libraries (e.g., via linking or dynamic loading) or you are building, e.g., a Python module to be dynamically loaded by the conda-provided python, the recommended strategy is to install and use the conda-provided compilers. Here is an example of how to set up a conda environment and build my_example_program.c using compilers provided by the compilers package from conda-forge:

$ module load Anaconda/2022.05-nsc1
$ conda create -n example_buildenv -c conda-forge python=3 compilers
$ conda activate example_buildenv
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" cc my_example_program.c -o my_example_program

In the last line, we modify the PATH environment variable just while executing cc to circumvent helper scripts in the NSC environment meant to aid compilation with the NSC-provided compilers.
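
A variable assignment placed in front of a command applies only to that command, so your interactive shell keeps its usual PATH. To preview which executable a name would resolve to under such a restricted PATH, one could use a helper like the following sketch (the function name resolve_with_path is made up for this example).

```shell
# Hypothetical helper: print the executable that name $2 resolves to when the
# search path is $1. The subshell keeps the caller's PATH untouched.
resolve_with_path() {
    ( PATH=$1; command -v "$2" )
}

# Example use inside the activated environment:
#   resolve_with_path "$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" cc
```

If the compilers package is installed in the environment, the result is expected to be a path under $CONDA_PREFIX/bin.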

Compilation of (non-MPI) software via pip

Some packages that can be installed via pip require compilation. It is possible to instruct pip to use NSC-provided compiler commands for such builds. For example, the following instructions set up a conda environment and use the NSC compilers to build the PyYAML Python module using pip:

$ module load Anaconda/2022.05-nsc1
$ conda create -n yaml_env -c conda-forge python=3 pip
$ conda activate yaml_env
$ module load buildenv-intel/2018a-eb
$ CC=icc CXX=icpc pip install pyyaml

The above can work for relatively simple pip packages (and at the time of writing, it seems to work for pyyaml). Nevertheless, as discussed above (see Mixing conda and software compiled from source code or via pip), this build may lead to problems since we end up with a binary YAML library built with the NSC compilers that will be dynamically loaded into the conda-provided Python.

Hence, similar to when compiling software from source, the recommended strategy is to instead install and use the conda-provided compilers. Here is an example of how to do so to build the pyyaml pip package:

$ module load Anaconda/2022.05-nsc1
$ conda create -n yaml_buildenv -c conda-forge python=3 pip compilers
$ conda activate yaml_buildenv
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" pip install pyyaml --no-cache-dir --global-option=build_ext --global-option='--rpath=$ORIGIN/../../..'

Note:

  • In the last line, we modify the PATH environment variable just while executing pip install pyyaml to circumvent helper scripts in the NSC environment meant to aid compilation with NSC-provided compilers.
  • We give pip the --no-cache-dir and the two --global-option parameters to ensure the rpath feature is set up for the compiled library consistently with how other conda-installable software is built.
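
The rpath value is meant to contain the literal string $ORIGIN, which the dynamic linker replaces with the directory of the loaded library at run time. Whether the shell passes it through literally depends on the quoting, as the following sketch shows (the variable names are made up for this example).

```shell
# ORIGIN is normally not set as a shell variable; make that explicit here.
unset ORIGIN

double_quoted="--rpath=$ORIGIN/../../.."   # the shell expands $ORIGIN (to nothing)
single_quoted='--rpath=$ORIGIN/../../..'   # passed through literally

printf '%s\n' "$double_quoted" "$single_quoted"
# prints:
#   --rpath=/../../..
#   --rpath=$ORIGIN/../../..
```

After a build, readelf -d on the resulting .so file lists any RPATH or RUNPATH entry that was recorded, which makes it possible to check that the literal form ended up in the library.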

Software that simultaneously integrates with conda-installed packages and uses MPI for parallelization adds another layer of complexity. This situation is addressed in the next section.

Compilation of MPI software with NSC-provided toolchains

Just as for non-MPI software, if the software that is being compiled only interacts with software in the conda environment by invoking binaries, the recommended way to build it is to use the NSC toolchains. For example, let's consider a software using MPI, which is built using the make command and, when run, invokes python as a binary expecting numpy and scipy to be available. This software can be built with the following steps:

$ module load Anaconda/2022.05-nsc1
$ conda create -n example_mpi_env python=3 numpy scipy
$ conda activate example_mpi_env
$ module load buildenv-intel/2018a-eb
$ CC=mpicc make

In this example, we assume that setting the environment variable CC to mpicc is how to tell the Makefile to build the software using that compiler (other software may use other means of configuration).

As a counter-example, the following steps try to create a conda environment in which to build and install the asap3 pip package, compiling it using an NSC toolchain:

$ module load Anaconda/2022.05-nsc1
$ conda create -n asap3_env -c conda-forge python=3 numpy ase
$ conda activate asap3_env
$ module load buildenv-intel/2018a-eb
$ CC=mpiicc CXX=mpiicpc pip install asap3

However, the resulting asap3 library does not work:

$ python3 ./md.py
...
ImportError: /software/sse/easybuild/prefix/software/GCCcore/6.4.0/lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /home/rar/.conda/envs/asap3_env/lib/python3.11/site-packages/scipy/linalg/_matfuncs_sqrtm_triu.cpython-311-x86_64-linux-gnu.so)

As discussed in the previous sections, asap3 provides a library to be loaded dynamically into the python provided by conda. In this case, this leads to a mismatch between versions of the GNU C++ standard library (libstdc++), as indicated by the GLIBCXX version in the error message. However, even if this clash had not occurred, asap3 would have to interact with ase, which also comes with MPI support but is compiled using a different set of MPI libraries, which would likely have caused further issues.
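
To see which GLIBCXX versions a given copy of libstdc++.so.6 provides, one can list the version tags embedded in the file. The following is a sketch that assumes GNU grep and GNU sort; the function name glibcxx_versions is made up for this example.

```shell
# Hypothetical helper: list the GLIBCXX_* version tags embedded in a file.
# Assumes GNU grep (-a treats the binary as text) and GNU sort (-V version sort).
glibcxx_versions() {
    grep -aoE 'GLIBCXX_[0-9]+(\.[0-9]+)*' "$1" | sort -Vu
}

# Example use, comparing the newest tag in the library from the error message
# with the one in the activated conda environment (if it ships its own copy):
#   glibcxx_versions /software/sse/easybuild/prefix/software/GCCcore/6.4.0/lib64/libstdc++.so.6 | tail -n 1
#   glibcxx_versions "$CONDA_PREFIX/lib/libstdc++.so.6" | tail -n 1
```

If the first command reports a lower version than the one named in the error message, the library that was picked up at run time is too old for the conda-provided binaries.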

Two strategies for resolving this issue are explored in the subsections below.

Compilation of MPI software with NSC-provided MPI and conda-provided compilers

The most straightforward strategy to build MPI software that integrates tightly with conda-provided packages is to combine the conda-provided compilers in the compilers package with the MPI libraries provided by NSC. Let's consider a software using MPI, which is built using the make command and links to the libnetcdf library provided by the conda environment. This software can be built with the following steps:

$ module load Anaconda/2022.05-nsc1
$ conda create -n example_mpi_env -c conda-forge python=3 compilers libnetcdf
$ conda activate example_mpi_env
$ module load buildenv-intel/2018a-eb
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin:/software/sse/easybuild/prefix/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/bin64" CC=mpigcc CXX=mpigxx I_MPI_CC="cc" I_MPI_CXX="c++" make

In this example, we assume that the Makefile is set up to use the environment variables CC and CXX to build the software using those compilers (other software may use other means of configuration). On the line executing the make command we:

  • Modify the PATH environment variable to circumvent helper scripts in the NSC environment meant to aid compilation with NSC-provided compilers, and to instead point at a specific version of the Intel MPI wrappers.
  • Set CC and CXX to the Intel MPI wrappers for GNU compilers (mpigcc and mpigxx), and use the I_MPI_CC and I_MPI_CXX environment variables to make these wrappers invoke the conda-provided GNU cc and c++ compilers.

Similarly, to build the asap3 pip package, the steps would be as follows:

$ module load Anaconda/2022.05-nsc1
$ conda create -n buildenv_mpi -c conda-forge python=3 "mpich=3.4.3=external_*" compilers numpy ase
$ conda activate buildenv_mpi
$ module load buildenv-intel/2018a-eb
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin:/software/sse/easybuild/prefix/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/bin64" CC=mpigcc CXX=mpigxx I_MPI_CC="cc" I_MPI_CXX="c++" pip install asap3 --no-cache-dir --global-option=build_ext --global-option='--rpath=$ORIGIN/../../..'

Compilation of MPI software with conda-provided compilers and MPI

As an alternative to the above strategy, it is possible to set up a conda environment with the compilers package and a conda-provided MPI package that is sufficiently compatible with the NSC MPI setup. The following steps build asap3 this way:

$ module load Anaconda/2022.05-nsc1
$ conda create -n buildenv_mpi -c conda-forge python=3 "mpich=3.4.3" compilers numpy ase
$ conda activate buildenv_mpi
$ PATH="$CONDA_PREFIX/bin:/usr/local/bin:/usr/bin" CC=mpicc CXX=mpic++ MPICH_CC="" MPICH_CXX="" pip install asap3 --no-cache-dir --global-option=build_ext --global-option='--rpath=$ORIGIN/../../..'

Where, on the line executing the build command pip install asap3 we:

  • Modify the PATH environment variable to circumvent helper scripts in the NSC environment meant to aid compilation with NSC-provided compilers.
  • Use environment variables to instruct pip to use the MPI-wrapped versions of the C and C++ compilers, i.e., mpicc and mpic++.
  • Empty the environment variables MPICH_CC and MPICH_CXX to ensure these wrappers use the standard (i.e., the conda-provided) C and C++ compilers.

The above build and environment often work due to the MPICH ABI compatibility with Intel MPI.

Once the above build has completed with the conda-provided compilers and MPI, one may want to explore replacing the conda-provided MPICH package with one that references the external system MPI software. The aim is to end up with the type of environment described as alternative 2 in Conda packages and MPI, which could potentially improve MPI performance. The steps to do so are as follows:

$ conda install -c conda-forge "mpich=3.4.3=external_*"
$ ln -s /software/sse/easybuild/prefix/software/impi/2018.1.163-GCC-6.4.0-2.28/intel64/lib/release/libmpi.so.12 "$CONDA_PREFIX/lib/."

However, depending on the details of how the software integrates with the MPI library, this may not work. In the case of asap3, this strategy does not work, and trying to start a parallelized MPI run now results in a runtime "No MPI error".

