

This page lists problems that an administrator may encounter when using HPC Gateway, together with their solutions. The list is not exhaustive and will be extended based on feedback from HPC Gateway usage. We also highly recommend building and sharing your own troubleshooting wiki page locally for your administrators.

You can also consult the User's TroubleShooting.


Problem: Python commands issue SSLError

If the Python commands fail with SSL-related errors, the cause may be a configuration problem between Python and SSL.

If the HTTP interface is also open, you can confirm the problem by setting the environment variable HPCG_BASE_URL to point at the HTTP interface instead of the HTTPS interface. The commands should then work, because SSL is no longer needed. To use this as a workaround, change the value of HPCG_BASE_URL in /opt/hpcg/core/etc/profile.sh.

Check that the Python you are using is the one deployed by the HPC Gateway installer and that its SSL module is correctly installed.

[hpcgadmin@hpcgdemo ~]$ which python
/opt/hpcg/external/python2.7/bin/python

[hpcgadmin@hpcgdemo ~]$ python -c "import ssl; print ssl.OPENSSL_VERSION"
OpenSSL 1.0.1e-fips 11 Feb 2013

If the SSL module is not correctly installed, you must install the openssl and openssl-devel packages and reinstall HPC Gateway or just recompile/install python.
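The one-line check above can also be wrapped in a small script that reports a missing ssl module instead of raising a traceback. This is only a sketch: the function name check_ssl_module is ours and is not part of HPC Gateway. It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

import sys


def check_ssl_module():
    """Return (ok, detail). ok is True when the ssl module can be imported."""
    try:
        import ssl
    except ImportError as exc:
        # Typical symptom of a Python compiled without openssl-devel present.
        return False, str(exc)
    return True, ssl.OPENSSL_VERSION


if __name__ == "__main__":
    ok, detail = check_ssl_module()
    print("python executable:", sys.executable)
    print("ssl module:", ("OK, " if ok else "MISSING, ") + detail)
```

Run it with the same python that "which python" reports, so that the result applies to the interpreter HPC Gateway actually uses.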

We have also experienced problems with the specific version 1.0.1e-15 of OpenSSL on RedHat 6.5. You can check which version is installed on your system.

[hpcgadmin@rnd01 ~]$ rpm -qa | grep openssl
openssl-1.0.1e-15.el6.x86_64
openssl-devel-1.0.1e-15.el6.x86_64   # <== This is a mandatory package that must be installed before HPC Gateway installation

If you are using 1.0.1e-15, you should upgrade it with the package manager.
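If several machines have to be checked, the rpm output can be filtered automatically. The sketch below assumes the package naming shown above (name-version-release.dist.arch). The function name affected_openssl_packages is hypothetical, and the only version treated as affected is the 1.0.1e-15 mentioned in this page.

```python
import re

# Version and release reported as problematic in this page (RedHat 6.5).
AFFECTED = ("1.0.1e", "15")

# Matches "openssl-1.0.1e-15.el6.x86_64" and "openssl-devel-1.0.1e-15.el6.x86_64".
_PKG_RE = re.compile(r"^(openssl(?:-devel)?)-(\d[^-]*)-(\d+)")


def affected_openssl_packages(rpm_output):
    """Return the openssl package lines whose version-release is 1.0.1e-15."""
    hits = []
    for line in rpm_output.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = _PKG_RE.match(line)
        if match and (match.group(2), match.group(3)) == AFFECTED:
            hits.append(line)
    return hits
```

Feed it the text produced by "rpm -qa | grep openssl". An empty list means that the affected release was not found in that output.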


Note: openssl-devel is mandatory when installing (compiling) python, otherwise the ssl module cannot be installed and therefore cannot be used at runtime.

# yum search openssl-devel
...

# yum install openssl-devel.x86_64
...

Problem: SSH DSA keys slowdown/failures

We have observed problems when using DSA keys with recent OpenSSH daemons (openHPC and Ubuntu > 16). In addition, DSA keys are considered too weak and should be regarded as deprecated.

We recommend using SSH RSA keys and not DSA. If you are experiencing some problem with the SSH/SFTP commands issued by HPC Gateway, make sure you are not using DSA keys.
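A quick way to tell whether a key is DSA is to look at its text. OpenSSH public keys of type DSA start with the key type "ssh-dss", and old-style PEM private keys carry a "BEGIN DSA PRIVATE KEY" header. The helper below is a sketch under those two assumptions; the name is_dsa_key is ours. It cannot classify private keys stored in the newer "BEGIN OPENSSH PRIVATE KEY" format, so check the matching .pub file in that case.

```python
def is_dsa_key(text):
    """Return True if `text` looks like an OpenSSH DSA public or PEM private key."""
    stripped = text.strip()
    if stripped.startswith("-----BEGIN DSA PRIVATE KEY-----"):
        return True
    fields = stripped.split()
    # In a .pub file the first field is the key type, e.g. ssh-rsa or ssh-dss.
    return bool(fields) and fields[0] == "ssh-dss"


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path) as handle:
            kind = "DSA (replace it)" if is_dsa_key(handle.read()) else "not DSA"
        print(path + ": " + kind)
```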


Problem: Authentication fails

There are several reasons why a user cannot log in to HPC Gateway:

  • The password is wrong
  • The user is not defined in the database
  • The user's uid is too low

The password is wrong

Make sure the password you type is correct, that Caps Lock is off, and that the keyboard is not set to another language layout.

The user is not defined in the database

The HPC Gateway authentication and authorisation mechanisms are described in the identity management wiki.

Check the database using the RoboMongo tool (if installed) or the hpcg_dbase_print.py command line.

$ hpcg_dbase_print.py -k configs.webserver.settings
[{u'key': u'autoPopulate', u'value': u'true'}, {u'key': u'defaultTeam', u'value': u'568e3a3cddff3a6ccdaf92c8'}]

$ hpcg_dbase_print.py -k users.hpcgadmin
{u'info': {u'phone': u'', u'role': u'admin', u'address': u''}, u'creationDate': 1464357258L, u'wikiURI': u'', u'tags': [u'hpc', u'hpcgadmin', u'admin'], u'admin': True, u'statusLifecycle': u'active', u'teams': [{u'role': u'admin', u'name': u'Public', u'id': u'568e3a3cddff3a6ccdaf92c8'}], u'modificationDate': 1464357258L, u'fullName': u'hpcgadmin', u'_id': ObjectId('5748518a0ec8114072f11cfe'), u'email': u'', u'projects': [], u'name': u'hpcgadmin'}

If the user is not defined and autoPopulate is false, consider one of the following:

  • Set autoPopulate to True
  • Add the user to the database using the hpcg_users.py command line
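The settings printed by hpcg_dbase_print.py are a list of key/value entries, and the value of autoPopulate is a string ('true' in the output above). The sketch below shows how such a list could be interpreted. The function name is hypothetical, and treating a missing entry as "disabled" is our assumption, not documented behaviour.

```python
def auto_populate_enabled(settings):
    """Return True if the autoPopulate entry holds the string 'true'."""
    for entry in settings:
        if entry.get("key") == "autoPopulate":
            return str(entry.get("value", "")).strip().lower() == "true"
    # Assumption: no autoPopulate entry is treated as disabled.
    return False


if __name__ == "__main__":
    # Same structure as the configs.webserver.settings output shown above.
    settings = [
        {"key": "autoPopulate", "value": "true"},
        {"key": "defaultTeam", "value": "568e3a3cddff3a6ccdaf92c8"},
    ]
    print(auto_populate_enabled(settings))
```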

The user's uid is too low

For security reasons, a user's uid must exceed a minimum value before the user can connect to HPC Gateway. For instance, root must not connect to HPC Gateway, and more generally system users should not be able to connect. This limit is set in the jetty configuration file ${HPCG_HOME}/core/jetty/etc/login.conf. The uid of the user must be greater than uidMin.

$ cat ${HPCG_HOME}/core/jetty/etc/login.conf
...
ssh-login-module {
   com.fujitsu.fse.torii.authentication.SshLoginModule required
   debug="true"
   hostname="localhost"
   port="22"
   uidMin="1000";
};


Note that the user id is defined in /etc/passwd and the system user id limit is defined in /etc/login.defs

$ grep hpcgadmin /etc/passwd
hpcgadmin:x:1008:1008::/home/hpcgadmin:/bin/bash

$ grep UID_MIN /etc/login.defs
UID_MIN                  1000
SYS_UID_MIN               201
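The comparison between a user's uid and uidMin can be sketched as follows. The function names are ours. The strict "greater than" follows the wording of this page literally; whether a uid exactly equal to uidMin is accepted has not been verified here.

```python
def uid_from_passwd_line(line):
    """Return (user name, uid) from one /etc/passwd line."""
    fields = line.strip().split(":")
    return fields[0], int(fields[2])


def uid_allowed(passwd_line, uid_min):
    """Return True if the uid is strictly greater than uid_min."""
    _, uid = uid_from_passwd_line(passwd_line)
    return uid > uid_min


if __name__ == "__main__":
    line = "hpcgadmin:x:1008:1008::/home/hpcgadmin:/bin/bash"
    print(uid_from_passwd_line(line), uid_allowed(line, 1000))
```

Pass it the line returned by "grep <user> /etc/passwd" and the uidMin value read from login.conf.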

Problem: The user cannot browse server or submit jobs on cluster

Go through the following steps. If any step fails, that is the problem to resolve.

  • Step 1: “hpcgadmin” must be able to open an SSH session on behalf of the user

Run the following command:

$ id    # make sure you are hpcgadmin
uid=10020(hpcgadmin) gid=10020(hpcgadmin) groups=10020(hpcgadmin)

$ ssh -i /opt/hpcg/repo/etc/sys/root/id_rsa_hpcg_<server>  <user>@<server>

If the connection fails, this is the problem: hpcgadmin must be able to connect to the user's account using the HPC Gateway private key. This is normally set up when the user first connects to HPC Gateway, through the script /opt/hpcg/core/etc/profile.d/hpcg.profile.sh.

Check the script, and check whether errors appear when the user logs in over a standard ssh connection.

  • Step 2: the command “groups” must return non-empty output

Run the following command:

$ id    # make sure you are hpcgadmin
uid=10020(hpcgadmin) gid=10020(hpcgadmin) groups=10020(hpcgadmin)

$ ssh -i /opt/hpcg/repo/etc/sys/root/id_rsa_hpcg_<server>  <user>@<server>  "groups ; echo \$?"

HPC Gateway executes the “groups” command to assess rights on folders and files during file system browsing, for example to allow (or disallow) file editing with the text editor.
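Step 2 can be scripted when many users have to be checked. The sketch below builds the same ssh command as above and interprets its output: the last line is the exit status printed by "echo $?", and everything before it is the group list. The function names, and the user and server names in the usage example, are hypothetical. The key path follows the one shown in this page.

```python
def build_groups_check(user, server, key_dir="/opt/hpcg/repo/etc/sys/root"):
    """Return the argv list for the Step 2 ssh command."""
    key = "{}/id_rsa_hpcg_{}".format(key_dir, server)
    # No local shell is involved when argv is passed as a list,
    # so "$?" reaches the remote shell without escaping.
    return ["ssh", "-i", key, "{}@{}".format(user, server), "groups ; echo $?"]


def parse_groups_output(output):
    """Return (groups, exit_status); exit_status is None if it cannot be read."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines or not lines[-1].isdigit():
        return [], None
    groups = " ".join(lines[:-1]).split()
    return groups, int(lines[-1])


def step2_ok(output):
    """Step 2 passes when groups exits with 0 and prints at least one group."""
    groups, status = parse_groups_output(output)
    return status == 0 and len(groups) > 0


if __name__ == "__main__":
    import subprocess

    argv = build_groups_check("alice", "cluster1")  # hypothetical names
    output = subprocess.check_output(argv).decode()
    print("Step 2 OK" if step2_ok(output) else "Step 2 FAILED")
```

Run it as hpcgadmin, as for the manual check.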


Problem: The wiki is slow and sometimes does not display images

The wiki is built on dokuwiki embedded in jetty, which uses php-cgi. php-cgi can be configured to serve HTTP requests better.

  • Step 1: check PHP_FCGI_CHILDREN and PHP_FCGI_MAX_REQUESTS environment variables

You can check if you have defaults in /opt/hpcg/etc/setenv_php.sh

$ cat /opt/hpcg/etc/setenv_php.sh

# This file is generated by HPC Gateway installer

export hpcg_php_home=/usr/bin
export HPCG_PHP_HOME=/usr/bin

export PHP_FCGI_CHILDREN=4
export PHP_FCGI_MAX_REQUESTS=128

If PHP_FCGI_CHILDREN and PHP_FCGI_MAX_REQUESTS do not exist or their values are too low, consider overriding them in your local profile, located in /opt/hpcg/repo/etc/profile.<hostname>.sh

You will need to source the updated profile and restart the php process after such an update.

  • Step 2: increase memory_limit in /etc/php.ini

memory_limit = 128M
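The check in Step 1 can be automated by parsing the export lines of setenv_php.sh. This is a sketch only: the function names are ours, and the thresholds default to the installer values shown above (4 and 128), which is an assumption and not a tuning recommendation.

```python
import re

_EXPORT_RE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_exports(text):
    """Return a dict of the variables set by 'export NAME=value' lines."""
    values = {}
    for line in text.splitlines():
        match = _EXPORT_RE.match(line)
        if match:
            values[match.group(1)] = match.group(2).strip().strip("\"'")
    return values


def php_fcgi_warnings(values, min_children=4, min_requests=128):
    """Return a list of warnings for missing or low PHP_FCGI_* settings."""
    checks = (("PHP_FCGI_CHILDREN", min_children),
              ("PHP_FCGI_MAX_REQUESTS", min_requests))
    warnings = []
    for name, minimum in checks:
        raw = values.get(name)
        if raw is None:
            warnings.append(name + " is not set")
        elif not raw.isdigit() or int(raw) < minimum:
            warnings.append("{} is {!r}, below {}".format(name, raw, minimum))
    return warnings


if __name__ == "__main__":
    with open("/opt/hpcg/etc/setenv_php.sh") as handle:
        for warning in php_fcgi_warnings(parse_exports(handle.read())):
            print(warning)
```

The script only reads the installer-generated file. Values overridden in the local profile are not seen by it and must be checked separately.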


Problem: RVEC notifications do not work

The RVEC server is allocated, but the color stays green. This means that the notification from the RVEC server to HPC Gateway does not work. Check that:

  • The RVEC servlet does not have security constraints in /opt/hpcg/core/jetty/webapps/torii/WEB-INF/web.xml.
  • Port 8080 is set as the Manager port in the RVEC Management settings. If it is not set to 8080, the RVEC manager uses by default the same port as the connection. If you are using 8443 (and you should), the notification cannot work without this setting.
  • The Manager IP is set to the host name rather than the IP address. With some proxies the connection does not work with the IP, so the name must be set directly (ssf.fujitsu.fr and not the IP 69.244.109.5).

Problem: Error during fetch in

The input data is not copied into the run directory. This means that the local “hpcg_copy.py” script did not work. You can get more information by:

  • Opening the task details
  • Going to the runlog directory
  • Opening <batch>-<jobid>.out

Here are some common errors and their recovery:

  • hpcg_copy.py not found: the script is not in the /home/hpcgadmin/hpcgateway/utilities directory. Restart the cluster agent and check that the script is in the right location.
  • the name of the server is not identical to the hostname: hpcg_copy relies on the name of the server to run the scp command. The name of the server must therefore be in /etc/hosts to be recognized as the server.
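For the second error, a small helper can confirm that the server name appears in /etc/hosts. This is a sketch: the function names and the host names in the comments are hypothetical, and it only inspects /etc/hosts, not DNS or other name services.

```python
def names_in_hosts(hosts_text):
    """Return the set of host names and aliases listed in /etc/hosts content."""
    names = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments
        if len(fields) >= 2:
            names.update(fields[1:])  # fields[0] is the IP address
    return names


def server_known(server_name, hosts_text):
    """Return True if server_name is a name or alias in the hosts content."""
    return server_name in names_in_hosts(hosts_text)


if __name__ == "__main__":
    import sys

    with open("/etc/hosts") as handle:
        content = handle.read()
    for name in sys.argv[1:]:  # e.g. the server name as defined in HPC Gateway
        print(name, "found" if server_known(name, content) else "NOT found")
```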

Problem: Error during fetch in