This article covers the use of the Anaconda data science platform at NSC and some related concepts (e.g., the conda command and conda-forge).
The Anaconda data science platform is maintained by Anaconda Inc. It provides a way to distribute and run software, primarily for scientific computing and data analysis. The emphasis is on Python and R with various supporting modules. However, much other software is also covered (in particular, software and libraries that the Python and R modules depend on). In practice, Anaconda works similarly to container technologies: one can run a program (e.g., written in Python or R, or a binary) in an Anaconda runtime environment created from Anaconda packages that provide precise versions of a set of supporting software and libraries. This helps ensure reproducible behavior across different systems.
The container-like Anaconda environments pose challenges for HPC clusters. Some software and libraries need to interact properly with supercomputing hardware, or one will experience breakage or degraded performance. Furthermore, the Anaconda-provided versions of some system programs may behave differently from those in NSC's standard setup, which can lead to issues that are difficult to diagnose.
Conda-forge is a community-driven library of packages that work with the Anaconda system. These packages provide a rich library of additional software beyond the packages maintained by Anaconda Inc. However, since these are provided by the community of users, they may have undergone less testing and security review than packages provided by Anaconda Inc.
Mamba is an alternative open-source implementation of the tool used to set up and maintain Anaconda environments, meant to address performance issues with the standard conda command.
Check the availability of NSC's Anaconda modules using the module avail command:
$ module avail Anaconda
...
Anaconda/2020.07-nsc1
Anaconda/2021.05-nsc1
...
The default location for conda environment installations is ~/.conda in your home directory. This can be problematic since these installations can become very large. Therefore, it is suggested to redirect this directory to your project space using a symbolic link. For example, if the ~/.conda directory already exists:
mv ~/.conda /proj/ourprojname/users/x_abcde/
ln -s /proj/ourprojname/users/x_abcde/.conda ~/.conda
(If you do not already have a ~/.conda directory, just skip the first command.)
The NSC Anaconda modules only give you access to the conda command (i.e., they do not alter your environment to enable a "base" Anaconda environment). After loading an Anaconda module, you can issue conda create to create a customized Python environment with exactly the packages (and versions) you need. A basic example that creates an Anaconda environment called myownenv with Python 3.8 and the pandas and seaborn packages:
$ module load Anaconda/2021.05-nsc1
$ conda create -n myownenv python=3.8 pandas seaborn
$ conda activate myownenv
Now the command python will refer to the python installed in this environment, with access to the pandas and seaborn packages:
$ which python
~/.conda/envs/myownenv/bin/python
$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'1.4.2'
You can, of course, run a Python program using this installed python as usual, e.g., python my_program.py (where my_program.py stands in for your own script).
When you log in next time, you do not have to create the environment again, but can activate it using:
$ module load Anaconda/2021.05-nsc1
$ conda activate myownenv
and then, e.g., run a Python program in this environment (the script name is a placeholder):
$ python my_program.py
You can see which environments you have created with:
$ conda env list
Once you have activated an environment, you can modify it, e.g., by installing additional packages:
$ conda install cython
However, the dependency resolution for a request like this can sometimes lead to unforeseen and destructive changes to the environment, even going as far as removing packages or downgrading the Python version. Hence, we recommend that environments used "in production" are not altered this way, but instead reinstalled with all desired software constraints specified in one go using a (possibly very long) conda create command (or by using the conda feature for specifying these requirements in a file: conda env create -f environment.yml). Furthermore, the mamba alternative to conda (see below under Mambaforge) is usually more reliable, faster, and gives clearer output when figuring out how to alter an environment.
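As a sketch of the file-based approach, the example environment from earlier could be described as follows. The layout (name, channels, dependencies) is the standard conda environment file format; the channel listed and the exact set of packages are assumptions made for illustration and should be adapted to your needs.

```shell
# Write an environment specification for the example environment used above.
# The channel and package choices here are illustrative assumptions.
cat > environment.yml <<'EOF'
name: myownenv
channels:
  - defaults
dependencies:
  - python=3.8
  - pandas
  - seaborn
  - cython
EOF

# With an Anaconda module loaded, the environment would then be created
# in one go with:
#   conda env create -f environment.yml
```

Keeping this file under version control makes it straightforward to recreate the environment from scratch instead of modifying it in place.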
To find packages to install, you can use conda search, which lists all available versions of matching packages, e.g.:
$ conda search sympy
Loading channels: done
# Name                       Version           Build  Channel
sympy                          1.1.1          py27_0  pkgs/main
sympy                          1.1.1  py27hc28188a_0  pkgs/main
[...]
Alternatively, there is also an online list of packages available.
To list the packages installed in an activated environment, use conda list. You can check for the presence of a specific package using grep:
$ conda list | grep -i scipy
scipy                     1.8.0            py38h56a6a73_1    conda-forge
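Note that grep -i scipy matches every line containing that string, including other packages whose names merely contain it. A minimal sketch of an exact-name check is shown below; the function name conda_has_package is hypothetical, and it assumes the usual conda list layout with the package name in the first column and comment lines starting with #.

```shell
# Hypothetical helper: succeed if a package with exactly the given name
# appears in `conda list`-style output read from standard input.
# Assumes the package name is the first whitespace-separated column and
# that comment lines start with '#'.
conda_has_package() {
    awk -v name="$1" '
        $1 !~ /^#/ && tolower($1) == tolower(name) { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Intended use inside an activated environment:
#   conda list | conda_has_package scipy && echo "scipy is installed"
```

The comparison is case-insensitive, mirroring the -i flag in the grep example.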
If you often find yourself working in the same conda environment, it may seem attractive to activate the environment automatically on login. However, NSC strongly recommends against this: having every login run inside Anaconda may have far-reaching side effects that can be difficult to diagnose. It should be fine, though, to load an appropriate Anaconda module without automatically activating an environment. You can do so by adding the following lines to your ~/.bashrc file (where the steps with NSC_MODULE_SILENT hide the rather lengthy output when loading the module):
export NSC_MODULE_SILENT=1
module load Anaconda/2021.05-nsc1
unset NSC_MODULE_SILENT
The Conda-forge community-driven library of packages greatly extends the software available for installation with conda. Since these packages are provided by the community of users, they may have undergone less testing and security review than packages provided by Anaconda Inc.
To instruct the conda command to also locate packages in the conda-forge repository, add the flag -c conda-forge to the conda install or conda create commands, e.g., to install the package ase:
$ conda install -c conda-forge ase
You can also add conda-forge to the channels that are automatically considered for all installations in an active environment with:
$ conda config --add channels conda-forge
In this setup, it is generally recommended to also turn on Anaconda's strict channel priority, which alters the behavior so that a package available from multiple sources is only considered from the first source in which it is available. This avoids mixing packages from multiple sources in ways that may result in unintended behavior. This is done with:
$ conda config --set channel_priority strict
Mamba is a drop-in alternative to the conda command; one of its central aims is to address performance issues with the standard conda. At NSC it is available via the Mambaforge (miniforge) distribution, with which users can create named Anaconda-like environments and activate them. Note: unlike the Anaconda modules, the miniforge distribution sets conda-forge as the default channel for packages.
An example that sets up a customized environment and runs a Python program my_scipy_python_program.py in it:
$ module load Mambaforge/4.12.0-0-nsc1
$ mamba create -n myenv python=3.8 scipy=1.5.2
$ mamba activate myenv
$ python my_scipy_python_program.py
To run the Python program in the same environment the next time you log in:
$ module load Mambaforge/4.12.0-0-nsc1
$ mamba activate myenv
$ python my_scipy_python_program.py
If you cannot locate a package for a particular piece of software via conda (either from the Anaconda Inc. default channel or from conda-forge), it may be possible to use the alternative (Python-oriented) software package manager pip. The pip command is installable as an Anaconda package. Packages installed via this pip command are placed into your active Anaconda environment and will thus not affect your Python environment when the environment is not active. (Never use the --user flag to pip inside an Anaconda environment.)
Do not install packages via pip if you can find packages installable via conda. The pip and conda commands will try to maintain their own sets of dependencies independently, which can lead to major issues with version conflicts. The best strategy is to first add all relevant packages with conda and only then add the remaining packages with pip. Avoid going back and forth between conda and pip.
Important: if the packages you want to install with pip require compilation (i.e., they are not pure Python), you must first set up a working NSC compiler environment, or the installation will fail with messages about a missing compiler or with compilation errors. See the section on compiling software integrating with Anaconda below. The most difficult situation is when the packages require compilation and also rely on interaction with MPI. In this case, you need to carefully consider those sections below before pip will work.
To install a package with pip, first use the PyPI search function to see if the software is available and what the package name is. Then make sure you have the Anaconda version of pip in your environment:
conda install pip
Then just install the package by name, e.g.:
pip install python-hostlist
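Before running pip install, it can be worth confirming that the pip found on your PATH really belongs to the active environment rather than to the system. The sketch below relies on the CONDA_PREFIX variable, which conda sets when an environment is activated; the function name pip_is_from_active_env is hypothetical.

```shell
# Hypothetical helper: succeed only if the first `pip` on PATH lives
# inside the currently active conda environment.
pip_is_from_active_env() {
    # CONDA_PREFIX is set by `conda activate`; if it is empty, no
    # environment is active.
    if [ -z "${CONDA_PREFIX:-}" ]; then
        echo "no conda environment is active" >&2
        return 1
    fi

    local pip_path
    pip_path="$(command -v pip)" || {
        echo "no pip found on PATH; try: conda install pip" >&2
        return 1
    }

    case "$pip_path" in
        "$CONDA_PREFIX"/*) return 0 ;;
        *)
            echo "pip resolves to $pip_path, outside $CONDA_PREFIX" >&2
            return 1
            ;;
    esac
}

# Intended use:
#   pip_is_from_active_env && pip install python-hostlist
```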
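There are two main strategies to get Anaconda-installed MPI software to work at NSC:
1. Use conda to install both the software using MPI and the MPI-supporting libraries.
2. Use conda to install the software using MPI, but have it use the MPI libraries provided by NSC instead of conda-installed ones.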
Alternative 1 is generally what happens automatically if one installs MPI software without carefully considering the dependencies. However, this strategy requires the MPI configuration and libraries to be compatible with the NSC setup, and even when that is the case, the performance of MPI communication is unlikely to be optimal. Nevertheless, this may be acceptable for software whose performance is not limited by inter-process communication. We find this to be the typical situation for Python programs using MPI (likely via mpi4py).
For the best chance of alternative 1 working, try to install a version of OpenMPI compatible with the versions provided by NSC, or a version of MPICH compatible with the Intel MPI provided by NSC. A good choice on Tetralith/Sigma is the 4.<something> version series of OpenMPI and MPICH version 3.3.2.
For alternative 2, the setup is similar, but now explicitly install an openmpi=<something>=external_* or mpich=<something>=external_* module corresponding as closely as possible to the versions available at NSC. This means no MPI binaries or libraries are installed into the conda environment. The most direct way to get the Anaconda MPI software to run in this situation is to manually create a link to the appropriate NSC MPI library in the lib directory inside your conda environment, e.g., for OpenMPI:
conda install "openmpi=4.1.4=external_*" -c conda-forge
ln -s /software/sse/easybuild/prefix/software/OpenMPI/4.0.5-GCC-10.2.0/lib/libmpi.so.40 ~/.conda/envs/myenv/lib/.
or, for MPICH together with NSC's Intel MPI:
conda install "mpich=3.4.3=external_*" -c conda-forge
ln -s /software/sse/easybuild/prefix/software/impi/2018.1.163-GCC-6.4.0-2.28/intel64/lib/release/libmpi.so.12 ~/.conda/envs/myenv/lib/.
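A slightly more defensive variant of the linking step checks that the NSC library actually exists and links it into the active environment via CONDA_PREFIX (set by conda on activation) instead of a hard-coded path. This is a sketch only: the function name link_system_mpi is hypothetical, and the library path in the usage comment is the one from the example above.

```shell
# Hypothetical helper: link a system MPI library into the lib directory
# of the active conda environment. Argument: full path to the library.
link_system_mpi() {
    local lib="$1"

    if [ -z "${CONDA_PREFIX:-}" ]; then
        echo "activate the conda environment first" >&2
        return 1
    fi
    if [ ! -e "$lib" ]; then
        echo "MPI library not found: $lib" >&2
        return 1
    fi

    local target
    target="$CONDA_PREFIX/lib/$(basename "$lib")"

    # A regular file here suggests a non-external MPI package is installed;
    # do not overwrite it.
    if [ -e "$target" ] && [ ! -L "$target" ]; then
        echo "$target is a regular file; refusing to replace it" >&2
        return 1
    fi

    mkdir -p "$CONDA_PREFIX/lib" || return 1
    # -f replaces a stale link left over from an earlier attempt.
    ln -sf "$lib" "$target"
}

# Intended use, with the environment activated:
#   link_system_mpi /software/sse/easybuild/prefix/software/OpenMPI/4.0.5-GCC-10.2.0/lib/libmpi.so.40
```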
(More information on explicitly tested setups is planned to be added here. In the meantime, if you need this and have trouble getting it to work, email NSC support.)
Technically, all the usual NSC build environments are available to compile software also when an Anaconda environment is active. If you want to build software interacting with Anaconda via, e.g., running the Anaconda environment version of Python via the python binary, you should be able to simply load and use one of the usual NSC buildenvs. For example:
module load buildenv-intel/2018a-eb
mpiicc my_example_program.c
In case you want to install a package using pip that requires compilation, you can instruct pip to use specific compiler commands like this:
module load buildenv-intel/2018a-eb
CC=mpiicc CXX=mpiicpc pip install asap3
However, in cases where your software needs to integrate more tightly with the software in the Anaconda environment, e.g., link with libraries installed via conda, it may be a more successful strategy to install the same compilers that were used to build other Anaconda software. The MPI example further below does this by adding the conda-forge packages c-compiler, cxx-compiler, fortran-compiler, and compilers to the environment.
At this point you can compile software that does not rely on MPI.
Software that simultaneously integrates with Anaconda and requires the use of MPI (through something other than mpi4py) adds another layer of complexity.
The most straightforward way to build MPI software inside an Anaconda environment so that it behaves as similarly as possible to modules installable by conda is to first install the Anaconda compilers and NSC-compatible MPI binaries into the environment, then build the software, and then replace the MPI module with one referencing the external system MPI software.
The steps should be similar to this, using the pip-buildable asap3 software as an example:
conda create -n myenv -c conda-forge python=3 "mpich=3.4.3" c-compiler compilers cxx-compiler fortran-compiler
conda activate myenv
export NSC_LD_FLAG=0
CC=mpicc CXX=mpic++ pip install asap3 --no-cache-dir --global-option=build_ext --global-option='--rpath=$ORIGIN/../../..'
conda install "mpich=3.4.3=external_*" -c conda-forge
ln -s /software/sse/easybuild/prefix/software/impi/2018.1.163-GCC-6.4.0-2.28/intel64/lib/release/libmpi.so.12 ~/.conda/envs/myenv/lib/.
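One detail worth checking in the pip command above is how the shell treats $ORIGIN in the rpath option. $ORIGIN is a token meant for the dynamic linker (it stands for the directory containing the loaded object), so it has to reach the build unexpanded. The following sketch, which assumes that no shell variable named ORIGIN is set to anything, shows the difference between the two quoting styles:

```shell
# Stand-in for the usual situation where no ORIGIN shell variable exists.
ORIGIN=""

# Double quotes: the shell substitutes $ORIGIN before pip ever sees it.
double_quoted="--rpath=$ORIGIN/../../.."

# Single quotes: the literal text $ORIGIN is passed through unchanged.
single_quoted='--rpath=$ORIGIN/../../..'

printf 'double quotes: %s\n' "$double_quoted"
printf 'single quotes: %s\n' "$single_quoted"
```

With double quotes the option degenerates into a path relative to the filesystem root, so single quotes (or a backslash before the dollar sign) are the safer choice here.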