Anaconda

This article covers use of the Anaconda data science platform at NSC and some related concepts (e.g., the conda command and conda-forge).

Introduction

The Anaconda data science platform is maintained by Anaconda Inc. It provides a way to distribute and run software, primarily for scientific computing and data analysis. The emphasis is on Python and R with various supporting modules, but much other software is also covered (in particular software and libraries that the Python and R modules depend on). In practice, Anaconda works similarly to container technologies: one can run a program (e.g., a Python or R program, or a binary) in an Anaconda runtime environment created from Anaconda packages that provide precise versions of a set of supporting software and libraries. This helps ensure reproducible behavior across different systems.

The container-like Anaconda environments pose challenges on HPC clusters. Some software and libraries need to interact properly with the supercomputing hardware, or one will experience breakage or degraded performance. Furthermore, the Anaconda-provided versions of some system programs may behave differently from those in NSC's standard setup, which can lead to issues that are difficult to diagnose.

Conda-forge is a community-driven library of packages that work with the Anaconda system. These packages provide a rich library of additional software beyond the packages maintained by Anaconda Inc. However, since these are provided by the community of users, they may have undergone less testing and security review than packages provided by Anaconda Inc.

Mamba is an alternative open-source implementation of the tool used to set up and maintain Anaconda environments. It is meant to address performance issues with the standard conda tool.

Basic use of Anaconda at NSC

Check the availability of NSC's Anaconda modules using the module avail command:

$ module avail Anaconda
...
Anaconda/2020.07-nsc1
Anaconda/2021.05-nsc1
...

The default location for conda environment installations is ~/.conda in your home directory. This can be problematic since these installations can become very large. We therefore suggest redirecting this directory to your project space using a symbolic link. For example, if the ~/.conda directory already exists:

mv ~/.conda /proj/ourprojname/users/x_abcde/
ln -s /proj/ourprojname/users/x_abcde/.conda ~/.conda

(If you do not already have a ~/.conda directory, just skip the first command.)
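
The two cases above can be combined into one small shell function. This is a minimal sketch, not an NSC-provided tool: the function name relocate_conda_dir is hypothetical, and the project path is the same placeholder as in the commands above.

```shell
# Hypothetical helper (not an NSC tool): move ~/.conda into project storage
# and leave a symbolic link behind in its place.
relocate_conda_dir() {
    target_parent="${1:-}"              # e.g. /proj/ourprojname/users/x_abcde
    conda_dir="${2:-$HOME/.conda}"      # directory to relocate

    if [ -z "$target_parent" ] || [ ! -d "$target_parent" ]; then
        echo "usage: relocate_conda_dir <existing project directory>" >&2
        return 1
    fi
    if [ -L "$conda_dir" ]; then
        echo "$conda_dir is already a symbolic link; nothing to do"
        return 0
    fi
    if [ -e "$target_parent/.conda" ]; then
        echo "$target_parent/.conda already exists; refusing to overwrite" >&2
        return 1
    fi
    if [ -d "$conda_dir" ]; then
        mv "$conda_dir" "$target_parent/.conda"   # existing installation: move it
    else
        mkdir "$target_parent/.conda"             # nothing yet: start empty
    fi
    ln -s "$target_parent/.conda" "$conda_dir"
}
```

After defining the function, it would be called as relocate_conda_dir /proj/ourprojname/users/x_abcde. It refuses to overwrite an existing .conda directory in the project space, so running it twice is harmless.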

The NSC Anaconda modules only give you access to the conda command (i.e., they do not alter your environment to enable a "base" Anaconda environment). After loading an Anaconda module, you can issue conda create to create a customized Python environment with exactly the packages (and versions) you need. A basic example that creates an Anaconda environment called myownenv with Python 3.8 and the pandas and seaborn packages:

$ module load Anaconda/2021.05-nsc1
$ conda create -n myownenv python=3.8 pandas seaborn
$ conda activate myownenv

Now the command python will refer to the python installed in this environment with access to pandas and seaborn, e.g.:

$ which python
~/.conda/envs/myownenv/bin/python
$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'1.4.2'

You can, of course, run a python program using this installed python as usual, e.g.:

python my_scipy_python_program.py

The next time you log in, you do not have to create the environment again, but can activate it using:

$ module load Anaconda/2021.05-nsc1
$ conda activate myownenv

and then, e.g., run a Python program in this environment:

python my_scipy_python_program.py

You can see what environments you have created by:

$ conda env list

Once you have activated an environment, you can modify it, e.g., by installing additional packages:

$ conda install cython

However, the dependency resolution for a request like this can sometimes lead to unforeseen and destructive changes to the environment, even as far as removing packages or downgrading the Python version. Hence, we recommend that environments used "in production" are not altered this way, but are instead recreated with all desired software constraints specified in one go, using a (possibly very long) conda create command or a file that lists the requirements (conda env create -f environment.yml). Furthermore, the mamba alternative to conda (see below under Mambaforge) is usually more reliable and faster, and gives clearer output when figuring out how to alter an environment.
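
As an illustration of the file-based approach, the following sketch writes an environment.yml describing the myownenv example from above. The environment name, the pinned Python version and the package list are illustrative only; adjust them to your needs.

```shell
# Write an environment specification file for conda.
# The name and package list mirror the earlier myownenv example and are
# illustrative only.
cat > environment.yml <<'EOF'
name: myownenv
dependencies:
  - python=3.8
  - pandas
  - seaborn
EOF
```

With an Anaconda module loaded, conda env create -f environment.yml then builds the environment from this file with all constraints resolved together. Keeping the file alongside your project gives a record of how the environment was specified.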

To find packages to install you can use conda search, which lists all available versions of matching packages, e.g.:

$ conda search sympy

Loading channels: done
# Name                       Version           Build  Channel
sympy                          1.1.1          py27_0  pkgs/main
sympy                          1.1.1  py27hc28188a_0  pkgs/main
[...]

Alternatively, there is also an online list of packages available.

To list the packages installed in an activated environment, use conda list. You can check for the presence of a specific package using grep:

$ conda list | grep -i scipy
scipy                     1.8.0            py38h56a6a73_1    conda-forge

If you find yourself often working in the same conda environment, it may seem attractive to activate the environment automatically on login. However, NSC strongly recommends against this: having every login run inside Anaconda may have far-reaching side effects that can be difficult to diagnose. It should be fine, though, to load an appropriate Anaconda module without automatically activating an environment. You can do so by adding the following lines to your ~/.bashrc file (the steps with NSC_MODULE_SILENT hide the rather lengthy output when loading the module):

export NSC_MODULE_SILENT=1
module load Anaconda/2021.05-nsc1
unset NSC_MODULE_SILENT
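
If you want a shortcut without activating anything at login, one option is a small shell function in ~/.bashrc that runs the load and activate commands only when you call it. This is a sketch under assumptions: the function name use_myownenv is hypothetical, the environment name is the one from the earlier example, and it presumes that conda activate works in your shell once the module is loaded, as in the interactive examples above.

```shell
# Hypothetical convenience function for ~/.bashrc. Defining it changes
# nothing at login; the environment is only activated when you type
# "use_myownenv" in a shell.
use_myownenv() {
    export NSC_MODULE_SILENT=1      # suppress the lengthy module output
    module load Anaconda/2021.05-nsc1 || { unset NSC_MODULE_SILENT; return 1; }
    unset NSC_MODULE_SILENT
    conda activate myownenv
}
```

Because the function body only runs on request, ordinary logins and batch jobs are unaffected.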

Conda-forge

The conda-forge community-driven library of packages greatly extends the software available for installation with conda. Since these packages are provided by the community of users, they may have undergone less testing and security review than packages provided by Anaconda Inc.

To instruct the conda command to also look for packages in the conda-forge repository, add the flag -c conda-forge to the install or create commands, e.g., to install the package ase:

$ conda install -c conda-forge ase

You can also add conda-forge to the channels that are automatically considered for all installations. By default this setting is stored in your user configuration file (~/.condarc) and thus applies to all your environments:

$ conda config --add channels conda-forge

In this setup it is generally recommended to also turn on Anaconda's strict channel priority, which alters the behavior so that a package available from multiple channels is only considered from the highest-priority channel that provides it. This avoids mixing packages from multiple sources in ways that may result in unintended behavior. This is done with:

$ conda config --set channel_priority strict
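
Settings made with conda config are stored in a conda configuration file, normally ~/.condarc. As a rough guide to what to expect there, the sketch below writes an equivalent file to condarc.example in the current directory for comparison; it does not touch your real configuration. The exact content conda writes may differ between versions, so treat this as an assumption to check against your own ~/.condarc.

```shell
# Illustrative only: what ~/.condarc is expected to contain after the two
# "conda config" commands above (assumption; verify against your own file).
cat > condarc.example <<'EOF'
channels:
  - conda-forge
  - defaults
channel_priority: strict
EOF
cat condarc.example
```

Channels are listed in priority order, so in this example conda-forge is consulted before the default channel.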

Mambaforge

Mamba is a drop-in alternative to the conda command for which one of the central aims is to address performance issues with the standard conda tool.

NSC makes mamba (and conda) available via the Mambaforge miniforge distribution, with which users can create named Anaconda-like environments and activate them. Note: in contrast to the Anaconda modules, the miniforge distribution sets conda-forge as the default channel for packages.

An example of setting up a customized environment and running a Python program my_scipy_python_program.py in it:

$ module load Mambaforge/4.12.0-0-nsc1
$ mamba create -n myenv python=3.8 scipy=1.5.2
$ mamba activate myenv
$ python my_scipy_python_program.py

To run the python program in the same environment when logging in the next time:

$ module load Mambaforge/4.12.0-0-nsc1
$ mamba activate myenv
$ python my_scipy_python_program.py

Using pip inside a conda environment

If you cannot locate a package for a particular piece of software via conda (either from the Anaconda Inc. default channel or from conda-forge), it may be possible to use the alternative (Python-oriented) package manager pip. The pip command is installable as an Anaconda package. Packages installed via this pip command are placed into your active Anaconda environment and will thus not affect your Python environment when the environment is not active. (Never use the --user flag to pip inside an Anaconda environment.)

Do not install packages via pip if you can find packages installable via conda. The pip and conda commands each maintain their own set of dependencies independently, which can lead to major issues with version conflicts. The best strategy is to first add all relevant packages with conda and only then add the remaining packages with pip. Avoid going back and forth between conda and pip.

Important: if the packages you want to install with pip require compilation (i.e., they are not pure Python), you must first set up a working NSC compiler environment, or the installation will fail with messages about a missing compiler or with compilation errors. See the section on compilation of software inside an Anaconda environment below. The most difficult situation is if the packages require compilation and rely on interaction with MPI. In this case you need to carefully consider the MPI sections below before pip will work.

To install a package with pip, first use the PyPI search function to see if the software is available and what the package name is. Then make sure you have the Anaconda version of pip in your environment:

conda install pip

Then install the package by name, e.g., python-hostlist:

pip install python-hostlist
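
Before running pip install, it can be worth confirming that the pip found on your PATH really belongs to the active environment. The following sketch does this by comparing its location with CONDA_PREFIX, the variable conda sets to the directory of the active environment. The function name pip_is_from_active_env is hypothetical.

```shell
# Hypothetical check: succeed only if the pip on PATH lives inside the
# currently active conda environment.
pip_is_from_active_env() {
    if [ -z "${CONDA_PREFIX:-}" ]; then
        echo "no conda environment is active" >&2
        return 1
    fi
    pip_path="$(command -v pip)" || {
        echo "pip not found; try: conda install pip" >&2
        return 1
    }
    case "$pip_path" in
        "$CONDA_PREFIX"/*)
            echo "pip belongs to the active environment: $pip_path" ;;
        *)
            echo "pip is outside the active environment: $pip_path" >&2
            return 1 ;;
    esac
}
```

It can be chained with the installation, e.g. pip_is_from_active_env && pip install python-hostlist, so that nothing is installed when the check fails.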

Anaconda and MPI

There are two main strategies to get Anaconda-installed MPI software to work at NSC:

  1. Use the dependency resolution in conda to install both the software using MPI and MPI-supporting libraries.
  2. Install a conda package to represent the system MPI to satisfy dependencies, and then use the NSC-provided binaries and libraries.

Alternative 1 is generally what happens automatically if one installs MPI software without carefully considering the dependencies. However, this strategy requires the MPI configuration and libraries to be compatible with the NSC setup, and even when that is the case, the performance of MPI communication is unlikely to be optimal. Nevertheless, this may be acceptable for software whose performance is not limited by inter-process communication. We find this situation to be the typical one for Python programs using MPI (likely via mpi4py).

For the best chance of alternative 1 working, try to install a version of OpenMPI compatible with the versions provided by NSC, or a version of MPICH compatible with the Intel MPI provided by NSC. A good choice on Tetralith/Sigma is the 4.<something> version series of OpenMPI and MPICH version 3.3.2.

For alternative 2, the setup is similar, but now explicitly install an openmpi=<something>=external_* or mpich=<something>=external_* package corresponding as closely as possible to the versions available at NSC. This means no MPI binaries or libraries are installed into the conda environment. The most direct way to get the Anaconda MPI software to run in this situation is to manually create a link to the appropriate NSC MPI library in the lib directory inside your conda environment, e.g.:

  conda install "openmpi=4.1.4=external_*" -c conda-forge
  ln -s /software/sse/easybuild/prefix/software/OpenMPI/4.0.5-GCC-10.2.0/lib/libmpi.so.40 ~/.conda/envs/myenv/lib/.

or

  conda install "mpich=3.4.3=external_*" -c conda-forge
  ln -s /software/sse/easybuild/prefix/software/impi/2018.1.163-GCC-6.4.0-2.28/intel64/lib/release/libmpi.so.12 ~/.conda/envs/myenv/lib/.
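
The linking step can be wrapped in a small function that checks its inputs first. This is a sketch: link_system_mpi is a hypothetical name, and it assumes an environment is active so that CONDA_PREFIX points at it (an environment directory can also be passed explicitly as the second argument).

```shell
# Hypothetical helper: link a system MPI library into the lib directory of
# a conda environment, refusing to overwrite anything that is already there.
link_system_mpi() {
    mpi_lib="${1:-}"                        # full path to the system libmpi.so.*
    env_prefix="${2:-${CONDA_PREFIX:-}}"    # conda environment directory

    if [ ! -f "$mpi_lib" ]; then
        echo "MPI library not found: $mpi_lib" >&2
        return 1
    fi
    if [ -z "$env_prefix" ] || [ ! -d "$env_prefix/lib" ]; then
        echo "no lib directory found in environment: $env_prefix" >&2
        return 1
    fi
    link="$env_prefix/lib/$(basename "$mpi_lib")"
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "$link already exists; remove it first if it should be replaced" >&2
        return 1
    fi
    ln -s "$mpi_lib" "$link"
}
```

With the environment activated, the OpenMPI example above would become link_system_mpi /software/sse/easybuild/prefix/software/OpenMPI/4.0.5-GCC-10.2.0/lib/libmpi.so.40.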

(More information on explicitly tested setups is planned to be added here. In the meantime, if you need this and have trouble getting it to work, email NSC support.)

Compilation of software inside an Anaconda environment

Technically, all the usual NSC build environments are available to compile software also when an Anaconda environment is active. If you want to build software interacting with Anaconda via, e.g., running the Anaconda environment version of Python via the python binary, you should be able to simply load and use one of the usual NSC buildenvs. For example:

module load buildenv-intel/2018a-eb
mpiicc my_example_program.c

In case you want to install a package using pip that requires compilation, you can instruct pip to use specific compiler commands as follows:

module load buildenv-intel/2018a-eb
CC=mpiicc CXX=mpiicpc pip install asap3

However, in cases where your software needs to integrate more tightly with the software in the Anaconda environment, e.g., link with libraries installed via conda, it may be a more successful strategy to install the same compilers that were used to build other Anaconda software. You do this as follows:

...

At this point you can compile software that does not rely on MPI.

Software that simultaneously integrates with Anaconda and requires use of MPI (through something different than) adds another layer of complexity.

Compilation of software inside an Anaconda environment using MPI

The most straightforward way to build MPI software inside an Anaconda environment, so that it behaves as similarly as possible to packages installable by conda, is to first install the Anaconda compilers and NSC-compatible MPI binaries into the environment, then build the software, and finally replace the MPI package with one referencing the external system MPI software.

The steps should be similar to the following, using the pip-buildable asap3 software as an example:

conda create -n myenv -c conda-forge python=3 "mpich=3.4.3" c-compiler compilers cxx-compiler fortran-compiler
conda activate myenv
export NSC_LD_FLAG=0
CC=mpicc CXX=mpic++ pip install asap3 --no-cache-dir --global-option=build_ext --global-option='--rpath=$ORIGIN/../../..'
conda install "mpich=3.4.3=external_*" -c conda-forge
ln -s /software/sse/easybuild/prefix/software/impi/2018.1.163-GCC-6.4.0-2.28/intel64/lib/release/libmpi.so.12 ~/.conda/envs/myenv/lib/.
