Talk:Release-2010.1

Cluster

cfengine

cluster_talk:Cfengine/302
trouble:
/usr/lib/libdb-4.3.so: could not read symbols: File in wrong format
identification: The db4 packages are installed for two architectures (i386 and x86_64).
solution: Remove the i386 db4 package (e.g. yum -y remove db4-4.3.29-9.fc6.i386).
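
To find other packages installed for more than one architecture, the rpm query output can be filtered. This is a minimal sketch, not part of the original notes; the function name multiarch_names is hypothetical. It reads "name arch" pairs from standard input and prints every name that occurs with more than one architecture.

```shell
# multiarch_names: read "name arch" pairs (one per line) from stdin and
# print each package name that appears with more than one architecture.
# The function name is an illustrative assumption.
multiarch_names() {
    # sort -u drops exact duplicates (same name and same arch, e.g. two
    # kernel versions), so only distinct architectures are counted.
    sort -u |
        awk '{ count[$1]++ } END { for (n in count) if (count[n] > 1) print n }' |
        sort
}
```

A possible invocation is rpm -qa --qf '%{NAME} %{ARCH}\n' | multiarch_names. On a node that is deliberately set up with both 32bit and 64bit libraries, many entries in the result are expected and harmless.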

torque

cluster_talk:Torque/216/server
trouble:
error: Failed dependencies:
 libtorque.so.0()(64bit) is needed by torque-server-2.1.6-1cri_2dgrid_sl4.x86_64
identification: The Torque server was installed without torque-client-XXX.rpm, which leads to the dependency error above.
solution: Download and install the client package first:

wget torque-client-XXX.rpm

rpm -ihv torque-client-XXX.rpm


trouble: After running

echo ${TORQUE_SERVER} > /var/spool/pbs/server_name

the command pbsnodes -a fails with:

No default server name.

identification: The variable TORQUE_SERVER was not set, so no server name was written to the file.
solution: Set TORQUE_SERVER to the host name of the Torque server and write the file again:

echo ${TORQUE_SERVER} > /var/spool/pbs/server_name
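
A small guard can prevent an empty server_name file from being written when the variable is unset. The following is only a sketch; the function name write_server_name is hypothetical and not part of Torque.

```shell
# write_server_name NAME FILE
# Write NAME to FILE, but refuse to write an empty name.
# Hypothetical helper for illustration only.
write_server_name() {
    name=$1
    file=$2
    if [ -z "$name" ]; then
        echo "server name is empty; not writing $file" >&2
        return 1
    fi
    printf '%s\n' "$name" > "$file"
}
```

It could be called as write_server_name "${TORQUE_SERVER}" /var/spool/pbs/server_name, which fails with a message instead of silently creating an empty file.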


worker node

cluster_talk:Worker/gLite/32
trouble: Compatibility Mode
identification: This version of the Worker Node is intended for setting up x86_64 machines in "compatibility" mode, i.e. both 32bit and 64bit libraries are provided. The YAIM tool configures the LD_LIBRARY_PATH variable accordingly.
solution: In order to get both 32bit and 64bit libraries, both versions of certain rpms must be installed. On an x86_64 machine yum usually installs the 64bit rpm in preference to the 32bit rpm, but for certain rpms this seems not to be the case (for reasons not yet known). It has been observed that certain binaries in /opt/lcg/bin may be 32bit on the x86_64 Worker Node. However, this should not be a problem, as 32bit binaries can be executed on an x86_64 machine.

trouble: dCache client doesn't work in SL5 (Bug 54065)
identification: A bug in the dCache client on SL5 prevents it from working properly.
solution: Use lcg_util as the data management client instead.


Grid middleware

glite

middleware_talk:Glite
trouble: The home directories of user accounts are removed periodically.
identification: The cause is the cronjob
/etc/cron.d/cleanup-grid-accounts
It writes its log files to
/var/log/cleanup-grid-accounts.log*
and its behaviour can be controlled with
/opt/lcg/etc/cleanup-grid-accounts.conf
solution: The recommended solution is to disable that cronjob completely on the gLite CE. Please be aware that:

  • The cronjob entry will be recreated by YAIM if you reconfigure your node (see /opt/glite/yaim/functions/config_users).
  • If you installed the gLite software on the cluster nodes (i.e. WNs) from RPMs instead of the tarball package, all WNs are affected as well and run the cleanup cronjob.

trouble: The maui client does not connect to the maui server. For example:

showq

ERROR: lost connection to server
ERROR: cannot request service (status))

identification: There is a key exchange problem between server and client.
solution: Please see https://iwrdgus.fzk.de/ws/ticket_info.php?ticket=1040


YAIM configuration variables

Link: https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#site_info_pre

middleware_talk:Glite/31/server
trouble: Globus error 3. an I/O operation failed.
identification: Got a job held event
solution: 1. Memory problem. Usually not related to an actual I/O failure. More commonly, GRAM has run out of available memory. It can also be caused by a lack of disk or quota space, or by permission issues with the home directory.

2. https://twiki.cern.ch/twiki/bin/view/CMS/ClassificationOfTheGRIDErrorsByGenericReasonsAndCategories

3. (from Jan Ploski <Jan.Ploski@offis.de>, University of Oldenburg)

The cause of the problem is a bug in

/opt/globus/lib/perl/Globus/GRAM/Helper.pm

The procedure "l_check_memory" contained in it tries to calculate the free memory. What it actually does, however, is add up the "free" and "swap" values in the output of the Unix command "mem". buffers/cached are not taken into account, so it is wrongly assumed that the machine has no free memory (swap is 0 in my case, because I did not set up any for the Xen machine).
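
To illustrate the accounting difference described in point 3, the sketch below adds buffers and cached memory to the free memory. It is not the Globus code and does not patch Helper.pm; it only shows the calculation, assuming input in the format of the Linux /proc/meminfo file (fields MemFree, Buffers and Cached, in kB). The function name usable_kb is hypothetical.

```shell
# usable_kb: read /proc/meminfo-style text from stdin and print the sum of
# MemFree, Buffers and Cached in kB. Illustration only, not the
# l_check_memory procedure from Globus.
usable_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { print sum + 0 }'
}
```

On a Linux host it can be run as usable_kb < /proc/meminfo. A machine with little "free" memory and no swap may still report a large value here, which is the situation the bug misjudges.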


gLite error FAQ on wiki.fzk.de/mwg

globus toolkit

middleware_talk:Globus
trouble: File Transfer Errors
globus-url-copy gsiftp://<middleware server>/etc/hosts file:///tmp/hosts_copy 
error: globus_ftp_client: the server responded with an error
530 530-globus_xio: Authentication Error
530-OpenSSL Error: s3_srvr.c:2010: in library: SSL routines, 
function SSL3_GET_CLIENT_CERTIFICATE: no certificate returned
530-globus_gsi_callback_module: Could not verify credential
530-globus_gsi_callback_module: The certificate is not yet valid:
Cert with subject: /C=DE/O=GermanGrid/OU=FZK/CN=Foued Jrad/CN=694386833 is not yet valid
check clock skew between hosts.
530 End.
identification: The clocks of the hosts are not synchronized.
solution: Check that the NTP daemon is running and correctly configured on the middleware frontend.
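
A quick way to see whether two hosts disagree about the time is to compare their clocks in seconds since the epoch. This is a minimal sketch; the function name clock_skew is hypothetical, and how the two timestamps are collected is left to the caller.

```shell
# clock_skew A B
# Print the absolute difference between two epoch timestamps in seconds.
# Hypothetical helper for illustration only.
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo "$diff"
}
```

For example, clock_skew "$(date +%s)" "$(ssh frontend date +%s)" prints the skew between the local host and a host reachable as frontend (a placeholder name). A difference of more than a few seconds suggests that NTP is not working on one of the two hosts.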

middleware_talk:Globus/42/server
trouble: After $GPT_LOCATION/sbin/gpt-postinstall has finished, the following output appears:

Creating state file directory.

WARNING: It looks like /usr/local/globus/tmp/gram_job_state may not be on a local filesystem.
WARNING: The test for local file systems is not 100% reliable. Ignore the below if this is a false positive.
WARNING: The jobmanager requires state dir to be on a local filesystem
WARNING: Rerun the jobmanager setup script with the -state-dir=<state dir> option.
Done.

Reading gatekeeper configuration file... Determining system information... Creating job manager configuration file... Done ..Done

identification:
solution:

Unicore

middleware_talk:Unicore
trouble: keytool -import -file /etc/grid-security/hostcert.pem -keystore /opt/unicore6/certs/truststore.jks

Enter keystore password: keytool error: java.lang.Exception: Input not an X.509 certificate

identification: The root cause of the error has not been clearly identified.
solution: Delete the header in the certificate file up to the -----BEGIN CERTIFICATE----- line.
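
Instead of editing the file by hand, the PEM block can be extracted with sed. This sketch assumes the certificate file contains a standard PEM block delimited by -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- lines; the function name pem_only is hypothetical.

```shell
# pem_only: read a certificate file from stdin and print only the lines
# from the BEGIN CERTIFICATE marker to the END CERTIFICATE marker,
# dropping any header text before the PEM block.
# Hypothetical helper for illustration only.
pem_only() {
    sed -n '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/p'
}
```

For example, pem_only < /etc/grid-security/hostcert.pem > /tmp/hostcert-clean.pem writes a cleaned copy (the output path is a placeholder), which can then be passed to keytool -import -file.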

Data storage

ogsadai

data_talk:Ogsadai/2.2/server
trouble: ant listResourcesClient -Ddai.url=https://${OGSADAI_HOST}:8443/wsrf/services/ogsadai/DataService

Buildfile: build.xml

setupClientSecurity:

listResourcesClient:

    [java] A problem arose during communication with service https://dgiref-ogsadai.fzk.de:8443/wsrf/services/ogsadai/DataService?WSDL.
    [java] Connection refused

BUILD FAILED /localhome/globus/ogsadai-wsrf-2.2/build.xml:1537: Java returned: 1

identification:
solution:

dcache

data_talk:Dcache/195/server
trouble:
rpm -ihv dcache-srmclient-1.9.5-3.noarch.rpm 
error: Failed dependencies:
java >= 1.5 is needed by dcache-srmclient-1.9.5-3.noarch

However, the installed Java version is already >= 1.5.

identification: RPM does not recognise the installed Java as satisfying the java dependency.
solution: yum install java-1.6.0-sun-compat

trouble: createuser -U postgres --no-superuser --no-createrole --createdb --pwprompt srmdcache

createuser: could not connect to database postgres: FATAL: Ident authentication failed for user "postgres"

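identification: The user is not allowed to connect to the database.
solution: Change the authentication method in pg_hba.conf from ident sameuser to trust, so that the end of the file looks like this:
tail /var/lib/pgsql/data/pg_hba.conf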
 
# "local" is for Unix domain socket connections only
#local   all         all                               ident sameuser
local   all         all                               trust
# IPv4 local connections:
#host    all         all         127.0.0.1/32          ident sameuser
host    all         all         127.0.0.1/32          trust
# IPv6 local connections:
#host    all         all         ::1/128               ident sameuser
host    all         all         ::1/128               trust

trouble:
/opt/d-cache/libexec/chimera/chimera-nfs-run.sh start
/opt/d-cache/libexec/chimera/chimera-nfs-run.sh: line 18: /opt/d-cache/config/dCacheSetup: No such file or directory
identification:
solution:

data_talk:Dcache/195/server/test/notes
Testing the installed dCache system
  • dCache web interface
If everything is configured correctly and running, you can open the web interface of your dCache instance in a browser at the generic address http://dcache-headnode.yourDomain:2288. For the dCache reference installation this is http://dgiref-dcache.fzk.de:2288 and can be viewed here: D-Grid dCache reference installation web interface.
  • Accessing the file system with standard commands
Listing, creating and removing files works as expected:
[root@dgiref-dcache ~]# ls /pnfs/fzk.de/data/dgtest/
[root@dgiref-dcache ~]# touch /pnfs/fzk.de/data/dgtest/test.blub
[root@dgiref-dcache ~]# ls /pnfs/fzk.de/data/dgtest/
test.blub
[root@dgiref-dcache ~]# rm /pnfs/fzk.de/data/dgtest/test.blub
rm: remove regular empty file `/pnfs/fzk.de/data/dgtest/test.blub'? y
[root@dgiref-dcache ~]# ls /pnfs/fzk.de/data/dgtest/
Copying data with cp does not work, but that is normal:
[root@dgiref-dcache ~]# cp /bin/bash /pnfs/fzk.de/data/dgtest/test.blub
cp: closing `/pnfs/fzk.de/data/dgtest/test.blub': Input/output error
  • copying data using dCache protocols

Use a UI server (dgiref-login.fzk.de) and VOMS proxies.

Without a VOMS proxy you get this error:
[user@dgiref-login]$ dccp /bin/bash gsidcap://dgiref-dcache.fzk.de:22128/pnfs/fzk.de/data/dgtest/testbin-1
Error ( POLLIN POLLERR POLLHUP) (with data) on control line [3]
Failed to create a control line
Error ( POLLIN POLLERR POLLHUP) (with data) on control line [5]
Failed to create a control line
Failed open file in the dCache.
Can't open destination file : Server rejected "hello"
System error: Input/output error
[user@dgiref-login]$ voms-proxy-info

Couldn't find a valid proxy.
[user@dgiref-login]$ voms-proxy-init -voms dgtest
Cannot find file or dir: /home/site/user/.glite/vomses
Enter GRID pass phrase:
Your identity: /C=DE/O=GermanGrid/OU=FZK/CN=user
Creating temporary proxy ................................................... Done
Contacting  dgrid-voms.fzk.de:15000 [/O=GermanGrid/OU=FZK/CN=host/dgrid-voms.fzk.de] "dgtest" Done
Creating proxy ...................................... Done
Your proxy is valid until Mon Nov 10 23:26:03 2008
[user@dgiref-login]$ voms-proxy-info
subject   : /C=DE/O=GermanGrid/OU=FZK/CN=user/CN=proxy
issuer    : /C=DE/O=GermanGrid/OU=FZK/CN=user
identity  : /C=DE/O=GermanGrid/OU=FZK/CN=user
type      : proxy
strength  : 512 bits
path      : /tmp/x509up_u10824
timeleft  : 11:59:58

Now you can copy data into dCache (using GSIdcap):

The first time the command is issued it returns an error message, but the file exists in dCache afterwards (see below):
[user@dgiref-login]$ dccp /bin/bash gsidcap://dgiref-dcache.fzk.de:22128/pnfs/fzk.de/data/dgtest/testbin-1
Command failed!
Server error message for [1]: "no such file or directory /pnfs/fzk.de/data/dgtest/testbin-1" (errno 10001).
585908 bytes in 0 seconds
Files in dCache cannot be overwritten or edited (this is intended):
[user@dgiref-login]$ dccp /bin/bash gsidcap://dgiref-dcache.fzk.de:22128/pnfs/fzk.de/data/dgtest/testbin-1
Command failed!
Server error message for [2]: "File is readOnly" (errno 1).
Failed open file in the dCache.
Can't open destination file : "File is readOnly"
System error: Input/output error
You may only write to directories linked to your VO; this is also intended.
In the future, dCache admins will have detailed control over user privileges.
[user@dgiref-login]$ dccp /bin/bash gsidcap://dgiref-dcache.fzk.de:22128/pnfs/fzk.de/data/textgrid/testbin-1
Command failed!
Server error message for [1]: "no such file or directory /pnfs/fzk.de/data/textgrid/testbin-1" (errno 10001).
Command failed!
Server error message for [2]: "Permission denied (Parent)" (errno 2).
Failed open file in the dCache.
Can't open destination file : "Permission denied (Parent)"
System error: Input/output error

To verify the transfer, copy the file back and compare the checksums:

[user@dgiref-login]$ dccp gsidcap://dgiref-dcache.fzk.de:22128/pnfs/fzk.de/data/dgtest/testbin-1 /tmp/testfilefromdCache
585908 bytes in 0 seconds

[user@dgiref-login]$ md5sum /bin/bash /tmp/testfilefromdCache 
dc4e36cfdf491029a67f4e317cab3151  /bin/bash
dc4e36cfdf491029a67f4e317cab3151  /tmp/testfilefromdCache
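
The comparison of the two checksums can also be scripted. This is a minimal sketch using md5sum as in the transcript above; the function name same_md5 is hypothetical.

```shell
# same_md5 FILE_A FILE_B
# Return 0 if both files have the same MD5 checksum, 1 if they differ,
# and 2 if one of the files cannot be read.
# Hypothetical helper for illustration only.
same_md5() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    sum_a=$(md5sum < "$1" | cut -d' ' -f1)
    sum_b=$(md5sum < "$2" | cut -d' ' -f1)
    [ -n "$sum_a" ] && [ "$sum_a" = "$sum_b" ]
}
```

For example, same_md5 /bin/bash /tmp/testfilefromdCache && echo "transfer OK" prints the message only when the file fetched from dCache is identical to the original.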

The same procedure with the srmcp tool (using the GridFTP protocol), also known as "srm put":

Note: srmcp is used instead of dccp.
Do not worry about the messages returned.
[user@dgiref-login]$ srmcp file:////bin/bash gridftp://dgiref-dcache.fzk.de:2811/pnfs/fzk.de/data/dgtest/testbin-2
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm

Then fetch the file from dCache again ("srm get"). Unfortunately, dCache in the D-Grid reference installation only supports srm put in stream mode, which means transferring data with a single stream (normally up to ten). The reason for this is unknown and has to be investigated.

Note: srmcp is used instead of dccp.
Do not worry about the messages returned.
[user@dgiref-login]$ srmcp -debug=true -streams_num=1 \
> gridftp://dgiref-dcache.fzk.de:2811/pnfs/fzk.de/data/dgtest/testbin-2 file:////dev/null
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm


  • Deleting files in dCache

The easiest way is to delete the namespace entry:

[root@dgiref-dcache ~]# rm /pnfs/fzk.de/data/dgtest/testbin-1
rm: remove regular empty file `/pnfs/fzk.de/data/dgtest/testbin-1'? y

The file will disappear not immediately but shortly afterwards, so that dCache can prioritize read actions over deletion.

Regular users can delete files from the UI with srmrm (equals srm -rm):

[user@dgiref-login]$  srmrm srm://dgiref-dcache.fzk.de:8443/pnfs/fzk.de/data/dgtest/testbin-1
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm

GUI pcells for dCache

The core mechanism for administering a dCache instance is the admin interface. This is a service you can connect to using an ssh client. Using the admin interface, you can communicate with the various components that make up dCache, query their status and update their behaviour. Although dCache is a distributed system, you only ever connect to a single node; dCache routes your messages internally. The source for pcells, a graphical user interface which greatly simplifies working with the admin interface, as well as an installation guide, can be found at dCache.org. Once pcells is installed, these are the steps to take in order to connect to your dCache system:

  1. start pcells and open new session
  2. adjust settings by clicking on Setup
    1. addresses
      hostname = dCache_headnode.yourDomain
  3. login as admin with (default) passphrase dickerel

Of course, nobody is forced to use this GUI; the admin interface can still be accessed with a plain ssh client:

ssh -1 -c blowfish -p 2223 -l admin dCache_headnode.yourDomain

Guidance on how to use the admin interface is beyond the scope of this documentation. Please refer to the wiki at dCache.org.
