Talk:Release-2010.1
Cluster
cfengine
| trouble: | /usr/lib/libdb-4.3.so: could not read symbols: File in wrong format |
| identification: | Packages for db4 installed for two architectures (i386 and x86_64). |
| solution: | Remove the i386 build of db4 (e.g. yum -y remove db4-4.3.29-9.fc6.i386)
|
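To spot this kind of conflict before linking fails, it can help to list every package that is installed for more than one architecture. The following is a minimal sketch, not part of the original notes: the function name multi_arch_packages is hypothetical, and it only assumes input lines of the form "name arch".

```shell
#!/bin/sh
# Print every package name that appears with more than one architecture.
# Input: lines of the form "<name> <arch>" on stdin.
multi_arch_packages() {
    # sort -u collapses several versions of the same name/arch pair
    sort -u | awk '
        { count[$1]++; archs[$1] = archs[$1] " " $2 }
        END { for (name in count) if (count[name] > 1) print name ":" archs[name] }
    ' | sort
}

# Typical use on an RPM-based host (not executed here):
#   rpm -qa --queryformat '%{NAME} %{ARCH}\n' | multi_arch_packages
```

Having both architectures installed is not always wrong (see the Worker Node compatibility mode on this page), so the list is only a starting point for deciding what to remove.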
torque
| trouble: | error: Failed dependencies: libtorque.so.0()(64bit) is needed by torque-server-2.1.6-1cri_2dgrid_sl4.x86_64 |
| identification: | Installing the Torque server without torque-client-XXX.rpm produces this error. |
| solution: | wget torque-client-XXX.rpm
rpm -ihv torque-client-XXX.rpm |
| trouble: | After running
echo ${TORQUE_SERVER} > /var/spool/pbs/server_name
the command pbsnodes -a fails with: No default server name. |
| identification: | Setting the variable value failed (TORQUE_SERVER was presumably empty or unset when the command ran). |
| solution: | Make sure TORQUE_SERVER is set, then write the configuration again:
echo ${TORQUE_SERVER} > /var/spool/pbs/server_name |
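If TORQUE_SERVER is empty when the echo runs, the file is written without a server name and pbsnodes has nothing to read. A minimal sketch of a guard against that, assuming this is the cause; the function name write_server_name is hypothetical:

```shell
#!/bin/sh
# Write the Torque server name to a file, refusing to write an empty value.
# $1: server host name
# $2: target file (this page uses /var/spool/pbs/server_name)
write_server_name() {
    if [ -z "$1" ]; then
        echo "server name is empty; set TORQUE_SERVER first" >&2
        return 1
    fi
    printf '%s\n' "$1" > "$2"
}

# Typical use (not executed here):
#   write_server_name "${TORQUE_SERVER}" /var/spool/pbs/server_name
```

The check runs before the redirection, so an existing server_name file is left untouched when the value is missing.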
worker node
| trouble: | Compatibility Mode |
| identification: | This version of the Worker Node is intended for setting up x86_64 machines in "compatibility" mode, i.e. both 32bit and 64bit libraries are provided. The YAIM tool configures the LD_LIBRARY_PATH variable accordingly. |
| solution: | In order to get both 32bit and 64bit libraries, both versions of certain rpms must be installed. On an x86_64 machine yum usually installs the 64bit rpm over the 32bit rpm, but for certain rpms this seems not to be the case (for yet unknown reasons). It has been observed that certain binaries in /opt/lcg/bin may be 32bit on the x86_64 Worker Node. However, this should not be a problem, as 32bit binaries can be executed on an x86_64 machine. |
| trouble: | dCache client doesn't work in SL5 (Bug 54065) |
| identification: | There is a bug in the dCache client in SL5 that prevents it from working properly. |
| solution: | Use lcg_util as data management client instead. |
Grid middleware
glite
| trouble: | The home directories of the user accounts are removed periodically |
| identification: | The cause is the cronjob /etc/cron.d/cleanup-grid-accounts |
| solution: | This cronjob creates logfiles:
/var/log/cleanup-grid-accounts.log* /opt/lcg/etc/cleanup-grid-accounts.conf The recommended solution is to completely disable that cronjob in the gLite CE. Please be aware that:
|
| trouble: | The maui client does not connect to the maui server. For example:
ERROR: lost connection to server ERROR: cannot request service (status)) |
| identification: | There is a key exchange problem between server and client |
| solution: | please see https://iwrdgus.fzk.de/ws/ticket_info.php?ticket=1040 |
- YAIM configuration variables
Link: https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#site_info_pre
| trouble: | Globus error 3. an I/O operation failed. |
| identification: | Got a job held event |
| solution: | 1. Memory problem. Usually not related to an actual I/O failure. More commonly, GRAM has run out of available memory. Can also be generated by lack of disk/quota space, or permissions issues with the home directory.
2. https://twiki.cern.ch/twiki/bin/view/CMS/ClassificationOfTheGRIDErrorsByGenericReasonsAndCategories
3. (from Jan Ploski <Jan.Ploski@offis.de>, Uni Oldenburg) The cause of the problem is a bug in /opt/globus/lib/perl/Globus/GRAM/Helper.pm. The procedure "l_check_memory" in that file tries to calculate the free memory. What it actually does is sum the "free" and "swap" values from the output of the Unix command "mem". buffers/cached are not taken into account, so it wrongly concludes that the machine has no free memory (swap is 0 in my case, because I did not set up any for the Xen machine). |
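To illustrate the point about buffers/cached: on Linux, memory used for buffers and page cache can usually be reclaimed, so a more realistic estimate of usable memory adds both to the free value. The sketch below is not the fix for Helper.pm, only an illustration under that assumption; the function name reclaimable_free_kb is hypothetical, and it reads the MemFree, Buffers and Cached fields in /proc/meminfo format.

```shell
#!/bin/sh
# Sum MemFree, Buffers and Cached (in kB) from /proc/meminfo-style input on stdin.
# Swap fields are deliberately ignored.
reclaimable_free_kb() {
    awk '
        $1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { total += $2 }
        END { print total + 0 }
    '
}

# Typical use on Linux (not executed here):
#   reclaimable_free_kb < /proc/meminfo
```

On a machine with little free memory, no swap and a large page cache, this value stays well above zero, whereas a sum of free and swap alone would suggest that memory is exhausted.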
gLite error FAQ on wiki.fzk.de/mwg
globus toolkit
| trouble: | File Transfer Errors
globus-url-copy gsiftp://<middleware server>/etc/hosts file:///tmp/hosts_copy
error: globus_ftp_client: the server responded with an error
530 530-globus_xio: Authentication Error
530-OpenSSL Error: s3_srvr.c:2010: in library: SSL routines, function SSL3_GET_CLIENT_CERTIFICATE: no certificate returned
530-globus_gsi_callback_module: Could not verify credential
530-globus_gsi_callback_module: The certificate is not yet valid: Cert with subject: /C=DE/O=GermanGrid/OU=FZK/CN=Foued Jrad/CN=694386833 is not yet valid check clock skew between hosts.
530 End. |
| identification: | There is a time synchronization problem between the hosts. |
| solution: | Check that the NTP daemon is running and correctly configured on the middleware frontend |
| trouble: | After $GPT_LOCATION/sbin/gpt-postinstall has finished, the following output appears:
Creating state file directory.
WARNING: It looks like /usr/local/globus/tmp/gram_job_state may not be on a local filesystem.
WARNING: The test for local file systems is not 100% reliable. Ignore the below if this is a false positive.
WARNING: The jobmanager requires state dir to be on a local filesystem
WARNING: Rerun the jobmanager setup script with the -state-dir=<state dir> option.
Done.
Reading gatekeeper configuration file...
Determining system information...
Creating job manager configuration file...
Done ..Done |
| identification: | |
| solution: |
Unicore
| trouble: | keytool -import -file /etc/grid-security/hostcert.pem -keystore /opt/unicore6/certs/truststore.jks
Enter keystore password: keytool error: java.lang.Exception: Input not an X.509 certificate |
| identification: | The root cause of the error is not clearly identified. |
| solution: | Delete the header in the certificate file, i.e. everything before the BEGIN line. |
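Instead of editing the certificate by hand, the PEM block can be extracted into a copy. This is a minimal sketch, assuming the file contains a standard "-----BEGIN CERTIFICATE-----" / "-----END CERTIFICATE-----" pair; the function name pem_body and the output path are hypothetical.

```shell
#!/bin/sh
# Print only the PEM certificate block from stdin,
# dropping any text before the BEGIN line or after the END line.
pem_body() {
    sed -n '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/p'
}

# Typical use (not executed here; /tmp/hostcert-clean.pem is a hypothetical path):
#   pem_body < /etc/grid-security/hostcert.pem > /tmp/hostcert-clean.pem
#   keytool -import -file /tmp/hostcert-clean.pem -keystore /opt/unicore6/certs/truststore.jks
```

Working on a copy leaves the original host certificate untouched for other services that read it.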
Data storage
ogsadai
| trouble: | ant listResourcesClient -Ddai.url=https://${OGSADAI_HOST}:8443/wsrf/services/ogsadai/DataService
Buildfile: build.xml
setupClientSecurity:
listResourcesClient:
[java] A problem arose during communication with service https://dgiref-ogsadai.fzk.de:8443/wsrf/services/ogsadai/DataService?WSDL.
[java] Connection refused
BUILD FAILED
/localhome/globus/ogsadai-wsrf-2.2/build.xml:1537: Java returned: 1 |
| identification: | |
| solution: |
dcache
| trouble: | rpm -ihv dcache-srmclient-1.9.5-3.noarch.rpm
error: Failed dependencies: java >= 1.5 is needed by dcache-srmclient-1.9.5-3.noarch
This happens although the installed Java version is >= 1.5. |
| identification: | Problem with Java dependencies |
| solution: | yum install java-1.6.0-sun-compat |
| trouble: | createuser -U postgres --no-superuser --no-createrole --createdb --pwprompt srmdcache
createuser: could not connect to database postgres: FATAL: Ident authentication failed for user "postgres" |
| identification: | The user is not allowed to connect to the database |
| solution: | tail /var/lib/pgsql/data/pg_hba.conf
# "local" is for Unix domain socket connections only
#local all all ident sameuser
local all all trust
# IPv4 local connections:
#host all all 127.0.0.1/32 ident sameuser
host all all 127.0.0.1/32 trust
# IPv6 local connections:
#host all all ::1/128 ident sameuser
host all all ::1/128 trust |
| trouble: | /opt/d-cache/libexec/chimera/chimera-nfs-run.sh start
/opt/d-cache/libexec/chimera/chimera-nfs-run.sh: line 18: /opt/d-cache/config/dCacheSetup: No such file or directory |
| identification: | |
| solution: |
- data talk:Dcache/195/server/test/notes
- Testing installed dCache system
- dCache web interface
- If everything is correctly configured and running, you can open the web interface of your dCache instance in a browser at the generic address http://dcache-headnode.yourDomain:2288. For the dCache reference installation this is http://dgiref-dcache.fzk.de:2288 and can be viewed here: D-Grid dCache reference installation web interface.
- accessing file system with standard commands
| everything alright | [root@dgiref-dcache ~]# ls /pnfs/fzk.de/data/dgtest/
[root@dgiref-dcache ~]# touch /pnfs/fzk.de/data/dgtest/test.blub
[root@dgiref-dcache ~]# ls /pnfs/fzk.de/data/dgtest/
test.blub
[root@dgiref-dcache ~]# rm /pnfs/fzk.de/data/dgtest/test.blub
rm: remove regular empty file `/pnfs/fzk.de/data/dgtest/test.blub'? y
[root@dgiref-dcache ~]# ls /pnfs/fzk.de/data/dgtest/ |
| doesn't work, but that is normal! | [root@dgiref-dcache ~]# cp /bin/bash /pnfs/fzk.de/data/dgtest/test.blub
cp: closing `/pnfs/fzk.de/data/dgtest/test.blub': Input/output error |
- copying data using dCache protocols
Use a UI-server and voms-proxies! (dgiref-login.fzk.de)
| without a voms-proxy you get this error | [user@dgiref-login]$ dccp /bin/bash gsidcap://dgiref-dcache.fzk.de:22128/pnfs/fzk.de/data/dgtest/testbin-1
Error ( POLLIN POLLERR POLLHUP) (with data) on control line [3]
Failed to create a control line
Error ( POLLIN POLLERR POLLHUP) (with data) on control line [5]
Failed to create a control line
Failed open file in the dCache.
Can't open destination file : Server rejected "hello"
System error: Input/output error |
[user@dgiref-login]$ voms-proxy-info
Couldn't find a valid proxy.
[user@dgiref-login]$ voms-proxy-init -voms dgtest
Cannot find file or dir: /home/site/user/.glite/vomses
Enter GRID pass phrase:
Your identity: /C=DE/O=GermanGrid/OU=FZK/CN=user
Creating temporary proxy ................................................... Done
Contacting dgrid-voms.fzk.de:15000 [/O=GermanGrid/OU=FZK/CN=host/dgrid-voms.fzk.de] "dgtest" Done
Creating proxy ...................................... Done
Your proxy is valid until Mon Nov 10 23:26:03 2008
[user@dgiref-login]$ voms-proxy-info
subject : /C=DE/O=GermanGrid/OU=FZK/CN=user/CN=proxy
issuer : /C=DE/O=GermanGrid/OU=FZK/CN=user
identity : /C=DE/O=GermanGrid/OU=FZK/CN=user
type : proxy
strength : 512 bits
path : /tmp/x509up_u10824
timeleft : 11:59:58
Now you can copy data into dCache (using GSIdcap):
| on the first attempt an error message is returned, but the file exists in dCache (see below) | [user@dgiref-login]$ dccp /bin/bash gsidcap://dgiref-dcache.fzk.de:22128/pnfs/fzk.de/data/dgtest/testbin-1
Command failed!
Server error message for [1]: "no such file or directory /pnfs/fzk.de/data/dgtest/testbin-1" (errno 10001).
585908 bytes in 0 seconds |
| cannot overwrite/edit files in dCache (that's intended!) | [user@dgiref-login]$ dccp /bin/bash gsidcap://dgiref-dcache.fzk.de:22128/pnfs/fzk.de/data/dgtest/testbin-1
Command failed!
Server error message for [2]: "File is readOnly" (errno 1).
Failed open file in the dCache.
Can't open destination file : "File is readOnly"
System error: Input/output error |
| You may only write to directories linked to your VO. This is also intended. In the future, dCache admins will have detailed control over user privileges. |
[user@dgiref-login]$ dccp /bin/bash gsidcap://dgiref-dcache.fzk.de:22128/pnfs/fzk.de/data/textgrid/testbin-1
Command failed!
Server error message for [1]: "no such file or directory /pnfs/fzk.de/data/textgrid/testbin-1" (errno 10001).
Command failed!
Server error message for [2]: "Permission denied (Parent)" (errno 2).
Failed open file in the dCache.
Can't open destination file : "Permission denied (Parent)"
System error: Input/output error |
To verify the transfer, copy the file back and compare the checksums:
[user@dgiref-login]$ dccp gsidcap://dgiref-dcache.fzk.de:22128/pnfs/fzk.de/data/dgtest/testbin-1 /tmp/testfilefromdCache
585908 bytes in 0 seconds
[user@dgiref-login]$ md5sum /bin/bash /tmp/testfilefromdCache
dc4e36cfdf491029a67f4e317cab3151 /bin/bash
dc4e36cfdf491029a67f4e317cab3151 /tmp/testfilefromdCache
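When this check is scripted, comparing the two checksums by eye is replaced by an exit status. A minimal sketch, assuming md5sum is available as in the session above; the function name same_md5 is hypothetical.

```shell
#!/bin/sh
# Succeed (exit 0) only if both files are readable and have the same MD5 checksum.
# Returns 2 if either file cannot be read, 1 if the checksums differ.
same_md5() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    a=$(md5sum < "$1" | cut -d' ' -f1)
    b=$(md5sum < "$2" | cut -d' ' -f1)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Typical use after copying the file back (not executed here):
#   same_md5 /bin/bash /tmp/testfilefromdCache && echo "transfer verified"
```

The separate return code for unreadable files avoids treating two failed reads as a match.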
Same procedure with srmcp tool (using GridFTP protocol) (also known as "srm put"):
| Note: Usage of srmcp instead of dccp! Don't worry about the error message returned. |
[user@dgiref-login]$ srmcp file:////bin/bash gridftp://dgiref-dcache.fzk.de:2811/pnfs/fzk.de/data/dgtest/testbin-2
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm |
Then fetch the file from dCache again ("srm get"). Unfortunately, dCache in the D-Grid reference installation only supports srm put in stream mode, which means that data is transferred with a single stream only (normally up to ten). The reason for this is unknown and has to be investigated.
| Note: Usage of srmcp instead of dccp! Don't worry about the error message returned. |
[user@dgiref-login]$ srmcp -debug=true -streams_num=1 \
> gridftp://dgiref-dcache.fzk.de:2811/pnfs/fzk.de/data/dgtest/testbin-2 file:////dev/null
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm |
- Deleting files in dCache
Easiest is to delete the namespace entry:
[root@dgiref-dcache ~]# rm /pnfs/fzk.de/data/dgtest/testbin-1
rm: remove regular empty file `/pnfs/fzk.de/data/dgtest/testbin-1'? y
The file does not disappear immediately but shortly afterwards, so that dCache can prioritize read operations over deletions.
Regular users can delete files from the UI with srmrm (equals srm -rm):
[user@dgiref-login]$ srmrm srm://dgiref-dcache.fzk.de:8443/pnfs/fzk.de/data/dgtest/testbin-1
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm
- GUI pcells for dCache
The core mechanism for administering a dCache instance is the admin interface. This is a service you can connect to using an ssh client. Using the admin interface, you can communicate with the various components making up dCache, query their status and update their behaviour. Although dCache is a distributed system, you only ever connect to a single node; dCache will route your messages internally. The source for pcells, a graphical user interface which greatly simplifies working with the admin interface, as well as an installation guide can be found at dCache.org. Once pcells is installed, these are the steps to take in order to connect to your dCache system:
- start pcells and open new session
- adjust settings by clicking on Setup
- addresses
- hostname = dCache_headnode.yourDomain
- addresses
- login as admin with (default) passphrase dickerel
The GUI is optional; the admin interface can also be accessed with a plain ssh client:
ssh -1 -c blowfish -p 2223 -l admin dCache_headnode.yourDomain
Guidance on how to use the admin interface is out of the scope of this documentation. Please look into the wiki at dCache.org.