Differences between revisions 6 and 613 (spanning 607 versions)

General Information

This page lists announcements and status messages for IT services managed by ISG.EE.
For notifications and announcements of central IT services managed by ID, please visit https://www.ethz.ch/services/de/it-services/service-desk.html
For a detailed status overview of central IT services managed by ID, please visit https://ueberwachung.ethz.ch

Status-Key
	Status/green.gif: Resolved
	Status/orange.gif: Still working but with some errors
	Status/red.gif: Pending

Current status reports

Planned project/archive storage downtime and client reboot

Status: Status/green.gif

2020-07-11 12:00: Migration has been completed, all services are back to operational state.
2020-07-11 08:00: Migration started; services are shut down.
2020-07-11 08:00-12:00: Start of planned maintenance work. Project/archive storage services (known under the names "ocean", "bluebay", "lagoon" and "benderstor") will not be available. ISG-managed Linux clients will be rebooted.

svn.ee.ethz.ch downtime for server upgrade

Status: Status/green.gif

2020-06-04 07:05: Webservices for managing SVN repositories are enabled.
2020-06-04 06:15: The system upgrade is done and access to the SVN repositories via the svn and https transport protocols is back online.
2020-06-04 06:00: The server servicing the SVN repositories will be upgraded to a new operating system version. During this timeframe outages for access to the SVN repositories are expected.

European HPC cluster abuse

Status: Status/green.gif
Recently European HPC clusters have been attacked and abused for mining purposes. The D-ITET Slurm and SGE clusters have not been compromised. We are monitoring the situation closely.

2020-05-17 08:30: No successful login from known attacker IP addresses could be found, and none of the files that indicate a compromise have been found on our file systems.
2020-05-16 14:30: No unusual cluster job activity was observed.

D-ITET Netscratch downtime for server upgrade

Status: Status/green.gif

2020-05-04 06:00: Server upgrade has been completed.
2020-05-04 06:00: The server servicing the D-ITET Netscratch service will be upgraded to a new operating system version. During this timeframe outages of the NFS service are expected.

Network outage ETx router

Status: Status/green.gif

2020-04-07 05:30: There was an issue on the router rou-etx. The ID networking team tackled and solved the issue. There was an interruption of about 10 minutes for the ETx networking zone, affecting almost all systems maintained by ISG.EE.

login.ee.ethz.ch: Reboot for maintenance

Status: Status/green.gif

2020-04-06 05:35: The system behind login.ee.ethz.ch has been rebooted for maintenance and to increase available resources.

See the information on accessing D-ITET resources remotely. To distribute the load better, users are encouraged to use the VPN service whenever possible.

itet-stor (FindYourData) Server maintenance: Reconfiguration of VM parameters

Status: Status/green.gif

2020-02-18 19:03: System again up and running.
2020-02-18 19:00: Scheduled downtime for the itet-stor/FindYourData service due to maintenance work on the underlying server.

itet-stor (FindYourData) Server migration: New operating system version

Status: Status/green.gif

2020-01-20 07:15: OS upgrade done, there were short interruptions to the itet-stor/FindYourData service.
2020-01-20 06:00: We will update the server servicing the FindYourData service from Debian jessie 8 to Debian stretch 9. There will be short downtimes accessing this service during the update.

Project Storage maintenance

Status: Status/green.gif

2019-10-01 06:00: The required maintenance was performed and the storage services (NFS and Samba) are back online.
2019-10-01 05:50: We will perform HW maintenance of the system holding the project storage. NFS and Samba services to access the storage will be offline for 10-15min between 05:50 and 06:15.

Power outage

Status: Status/green.gif

2019-09-17 12:11: Power reestablished.
2019-09-17 10:29: Power outage in ET area. For up-to-date information please visit https://ethz.ch/services/de/it-services/service-desk.html#meldungen

Outage of Project Storage

Status: Status/green.gif

2019-09-17 12:11: When power was reestablished after the outage, the storage system failed again until 12:30.
2019-09-05 15:30: Project homes on bluebay are unavailable until 19:10 due to another hardware failure
2019-08-28 07:00: While the service is running stably, we are still in contact with the cluster filesystem developers to further track down the issues seen in recent weeks.
2019-08-22 06:00: Planned reboot of second system in cluster and related management node.
2019-08-21 06:00: Planned reboot of first system (expect downtime of 5-10 minutes).
2019-08-18 03:45: Full sync of the missing storage target finally succeeded. Full redundancy has been regained for now. We are still in contact with the developers of the cluster filesystem to further analyse the problem.
2019-08-14 08:00: Full redundancy is still not regained. We are also in contact with the developers of the cluster filesystem to analyse the current problems.
2019-08-12 07:00: Sync of the last storage target is ongoing and reached 82%.
2019-08-08 18:00: Heavy load on the storage system while the resync was still running caused high access latencies. Several services were unavailable or responded very slowly until 22:30. The situation is normalizing, but until the full re-sync succeeds there is potential for further outages.
2019-08-08 15:55: The storage system is operational again and an initial high load phase due to resynchronization is finished.
2019-08-08 14:28: There are again problems with the storage servers...
2019-08-07 16:30: Due to errors while re-syncing the remaining target we needed to restart a full sync on this last target. Unfortunately this means that redundancy in the storage system will not yet be regained.
2019-08-07 06:55: High I/O load on the system during further syncing caused accessing issues on the projects between 06:15 and 06:55. There is still an issue with one target which needs a re-sync which might cause further access outages.
2019-08-06 10:50: Re-syncing is still ongoing, one storage target is missing, redundancy of the underlying storage system is not yet regained.
2019-08-05 13:00: Re-syncing of the broken targets is still ongoing, thus redundancy of the storage system is not yet regained.
2019-08-02 17:00: NFS throughput has normalized, causes are investigated. Recurrences have to be expected.
2019-08-02 15:50: High load on the storage server slowed NFS connections to the point of timeouts.
2019-08-02 15:05: The synchronization from the healthy system 2 to system 1 has again been started. Once this synchronization is complete (in roughly 3 days), the system is finally redundant again.
2019-08-02 12:45: The hardware has been fixed. The storage services (NFS/Samba) have been powered on.
2019-08-02 11:50: The failing hardware component is currently being replaced by the hardware supplier.
2019-08-02 10:00: The failing hardware of system 1 has been identified. Availability of hardware replacement parts and successful data consistency check will determine the availability of the storage services.
2019-08-02 08:30: The replaced hardware in system 1 is again faulty, while system 2 seems (hardware-wise) ok. Due to data inconsistency, the storage service can not provide its services.
2019-08-01 16:30: While system 1 has not synchronized all data from system 2, the latter system goes down. Starting from here, project and archive storage is not available anymore.
2019-07-31 22:00: A high load of one of the backup processes leads to an NFS service interruption. After postponing that backup process, NFS recovers.
2019-07-31 21:30: The hardware failure led to two corrupt storage targets on system 1. After reinitializing the affected targets the synchronization of data from the healthy system 2 starts.
2019-07-31 15:45: Service is restored, hardware exchange was successful. Restoring full redundancy is pending.
2019-07-31 15:00: Project homes on bluebay are unavailable due to replacement of faulty hardware part. Estimated downtime of ~1h.
2019-07-31 07:40: Service is restored, still running with reduced redundancy
2019-07-31 06:20: Project homes on bluebay are unavailable due to deactivation of faulty hardware part
2019-07-31 02:40: Storage system 2 took over. Service is restored but running with reduced redundancy.
2019-07-31 02:30: Project homes on bluebay are unavailable due to a hardware failure on system 1

Outage net_scratch service

Status: Status/green.gif

2019-08-28 07:00: The old server is now decommissioned. We will re-bootstrap the service on new hardware in the coming weeks and re-announce the service when it is ready to be used.
2019-08-12 07:00: Read-only access to partially recoverable data has been removed again as planned. We are planning to re-instantiate the service on new hardware.
2019-07-26 12:00: Read-only access to partially recoverable data has been made available on login.ee.ethz.ch under /mnt. This volume will be available until August 9th.
2019-07-23 08:00: We are in contact with the hardware vendor to decide on further steps for the respective RAID controller and to double-check whether there is an issue with it. Data on the net_scratch is very likely lost.
2019-07-22 15:35: We are still trying to recover the data, but the filesystem was corrupted due to a hardware hiccup on the underlying RAID system. Affected by this is /itet-stor/<username>/net_scratch.
2019-07-22 12:00: Filesystem errors for the device holding the net_scratch service were reported. The server is currently offline due to severe errors on the filesystem.

Disabling ptrace on managed Debian/GNU Linux computers

Status: Status/green.gif

2019-07-22 09:00

New images are rolled out. As clients reboot they will automatically pick up this new release with a fixed Linux kernel.

2019-07-21 15:00

New images are available. On its next reboot, a managed client will boot with a patched kernel and the mitigation disabled.

2019-07-19 13:00

A current Linux kernel bug allows any unprivileged user to gain root access. A proof of concept code snippet that exploits this vulnerability is publicly available.
To protect our systems we temporarily disabled ptrace(2) on all managed Linux systems. All software depending on ptrace(2) will completely or at least partially fail. A prominent example is the GNU Debugger gdb.
A patched Linux kernel will come soon. Once this new kernel is running, we will enable ptrace again.
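
The announcement above does not say how ptrace was disabled. As an illustration only, the following sketch assumes the Yama security module is in use and reports its ptrace restriction level; the sysctl path and the helper names `describe_ptrace_scope` and `read_ptrace_scope` are assumptions for this example, not the mechanism ISG.EE actually applied.

```python
from pathlib import Path

# Meanings of the Yama "ptrace_scope" sysctl values (assumption: Yama is the
# mechanism in use; the status message itself does not name one).
PTRACE_SCOPE_MEANINGS = {
    0: "classic: a process may trace other processes of the same user",
    1: "restricted: only descendants or explicitly allowed tracers",
    2: "admin-only: tracing requires CAP_SYS_PTRACE",
    3: "no attach: ptrace is disabled and the value cannot be changed once set",
}


def describe_ptrace_scope(raw_value: str) -> str:
    """Translate the raw content of the sysctl file into a readable description."""
    scope = int(raw_value.strip())
    return PTRACE_SCOPE_MEANINGS.get(scope, f"unknown value {scope}")


def read_ptrace_scope(path: str = "/proc/sys/kernel/yama/ptrace_scope"):
    """Return the description for this host, or None if the file does not exist."""
    sysctl_file = Path(path)
    if not sysctl_file.exists():
        return None
    return describe_ptrace_scope(sysctl_file.read_text())


if __name__ == "__main__":
    print(read_ptrace_scope())
```

On a host where tracing is restricted at levels 2 or 3, tools such as gdb would be expected to fail when attaching to a process, which matches the behaviour described in the announcement.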

Outage of Project Storage

Status: Status/green.gif

2019-07-18 18:45: Service is back to normal
2019-07-18 17:00: Project homes on bluebay are unavailable

D-ITET services outages

Status: Status/green.gif

2019-04-24 14:18: Virtualisation cluster is back with full redundancy. Virtual machines affecting users were already back online at 07:30.
2019-04-24 13:00: An issue between the redundant switches caused the network problem, leaving the cluster in an inconsistent state and rebooting all virtual machines. The networking team is further investigating the issue between the switches.
2019-04-24 09:00: Further analysis ongoing, but the health status of the virtualisation cluster was affected, leading to resets of all hosted virtual machines.
2019-04-24 07:30: We are bringing services back to normal.
2019-04-24 07:00: A planned outage in the ETZ building caused stability issues on several D-ITET services, in particular HOME and mail services.

Downtime HOME server: Repair filesystem inconsistency

Status: Status/green.gif

2019-02-25 06:18: System is back online.
2019-02-25 06:00: We identified a usage accounting discrepancy on one filesystem of the HOME server, which requires taking down the server and repairing the underlying filesystem. The home storage is the default storage location for personal accounts on computers managed by ISG.EE.

One of two RDS servers is not reachable

Status: Status/green.gif

2019-01-25 23:50: Maintenance issues have been resolved. All RDS servers are up and running now.
2019-01-25 14:00: The RDS maintenance window is over but one server still has pending updates. Logins are not allowed until this issue has been fixed.

Upgrade license, database and distributed computing management server itetmaster01

Status: Status/green.gif

2018-11-12 12:00: PostgreSQL services are online.
2018-11-12 09:31: Arton-Cluster and Condor Grid updates are finished.
2018-11-12 07:30: Arton-Cluster updated (final checks pending)
2018-11-12 07:00: System and license services upgraded. Still pending: Arton, Condor and PostgreSQL Upgrades.
2018-11-12 06:00 - 12:00: Planned upgrade of itetmaster01

Maintenance Project Storage

Status: Status/green.gif

2018-11-08 07:00: Services back online (some recovering slowly)
2018-11-08 06:15: Starting downtime for project storage due to an important maintenance on master node.

D-ITET RDS frontend is difficult to reach due to AD name resolution issues

Status: Status/green.gif

2018-11-02 08:00: RDS is working normally. Protective measures were put into place to ensure the AD name is updated correctly.
2018-11-01 09:45: worli.d.ethz.ch can not be resolved by the AD name service. DNS works fine but AD-DNS synchronization seems to be in an unstable state. We are in contact with the responsible team of central IT services.

WORKAROUND:
- D-ITET users: Connect directly to vela.ee.ethz.ch
- IFE users: Connect directly to neon.ee.ethz.ch

Jabba Tape Library HW Problems

Status: Status/green.gif

2018-08-17 07:00: Update: All tape library issues have been solved. All backup and archive services are back online
2018-08-15 12:00: Update: There are still issues with the tape library. It will be down for at least another day.
2018-08-13 11:30: Update: The failed tape drive has been replaced. The management PC is still not working. Because the hardware is very old, the supply of spare parts takes longer than usual. The tape library remains offline at least until tomorrow afternoon.
2018-08-10 10:00: Due to a failed tape drive and a defective management PC Jabba's Tape Library is not working. The Jabba server is not affected. We are in contact with IBM and we hope the problem will be fixed by next Monday.

What does this mean for you:

BACKUP
- Store new data: New data are stored to a SAM-FS cache area first, but can not be written to tape afterwards, i.e. the backup process can not be completed, but it will be started automatically as soon as the tape library works again.
- Access existing data: Only access to data still available in the SAM-FS cache area is possible. NO ACCESS to data located on tape only.
ARCHIVE
- Store new data: New data are stored to a SAM-FS cache area first. In a second step the data are copied to an archive disk, but the second copy to tape will fail, i.e. the archive process can only be half completed. The copy to tape will be started automatically as soon as the tape library works again.
- Access existing data: there is no limitation in access to archive data

D-ITET mail server downtime: New operating system version

Status: Status/green.gif

2018-06-16 08:24: System is up and running, all tests done. Everything should work as intended. If you find errors, please contact us under <support@ee.ethz.ch>
2018-06-16 08:08: System is up and running, we are performing some final tests before releasing access with little delay as previously announced.
2018-06-16 07:11: Everything works as planned currently
2018-06-16 06:00: Due to a planned operating system update, the D-ITET mail server will be unavailable today, June 16, 2018 between 06:00 and 08:00.

Major outage virtualization cluster/networking switch

Status: Status/green.gif

2018-04-24 08:56: Sending of emails is restored. Incoming mail should not be lost for any properly configured sending email server, since the issues caused a temporary error notification to the sending server, which should in turn retry submitting the email later with some delay.
2018-04-24 07:45: Bringing back online most important services, including home service; issue being investigated.
2018-04-24 06:29: Major outage of the networking/virtualization cluster taking down important D-ITET services (home server, parts of the mail system, Linux clients).

Jabba Maintenance

Status: Status/green.gif

2018-04-06 08:10: Jabba is back online
2018-04-06 07:00: Jabba is offline due to maintenance work

D-ITET Storage Migration

Status: Status/green.gif

2018-03-10 15:00: Migration of user homes completed.
2018-03-10 14:15: User homes migrated, access is unblocked again, some post-migration tasks still pending.
2018-03-10 10:00: D-ITET user homes will be migrated from ID Storage to D-ITET Storage. During the whole migration time access to the user homes for the affected users is blocked. Affected users are informed directly by an email.

svn.ee.ethz.ch Server migration: New operating system version

Status: Status/green.gif

2018-02-12 08:55: Server upgrade has been completed and all services up and running again.
2018-02-12 06:15: Start updating server from Debian Wheezy 7 to Debian Stretch 9. Downtimes for https://svn.ee.ethz.ch, svn://svn.ee.ethz.ch and https://svnmgr.ee.ethz.ch.

Cronbox/Login Server migration: New operating system version

Status: Status/green.gif

2018-02-05 07:00

The host mira has been upgraded to Debian 9 Stretch. The SSH host key fingerprints for RSA and ED25519 are:

4096 MD5:fc:a8:00:5b:64:90:86:a1:fb:49:75:ef:55:58:90:b3 (RSA)
4096 SHA256:v48HAAAjr+avnPAESdQzazSriKYZeTGGtIPKfoE8Dg0 (RSA)
256 SHA256:SgvaiZyIgzujLJdbtRij5VGUOXm/IuAs3MkMYtGZNhc (ED25519)
256 MD5:3b:b0:1a:8a:ea:0a:e5:ea:bb:9e:bb:5c:ef:24:c3:92 (ED25519)

The SSH host key is also listed on: https://people.ee.ethz.ch/
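
Fingerprints in the two formats listed above can be recomputed from a public key and compared with the published values. The following is a minimal sketch, assuming the key is available as a line in the usual OpenSSH `<type> <base64> [comment]` form; the function names `ssh_fingerprints` and `matches_published` are made up for this example.

```python
import base64
import hashlib


def ssh_fingerprints(public_key_line: str) -> dict:
    """Return the MD5 and SHA256 fingerprints of one OpenSSH public key line."""
    # Assumed input format: "<type> <base64 blob> [comment]".
    fields = public_key_line.split()
    blob = base64.b64decode(fields[1])

    # SHA256 style: unpadded base64 of the SHA-256 digest of the key blob.
    digest = hashlib.sha256(blob).digest()
    sha256 = base64.b64encode(digest).decode().rstrip("=")

    # MD5 style: colon-separated hex bytes of the MD5 digest of the key blob.
    md5_hex = hashlib.md5(blob).hexdigest()
    md5 = ":".join(md5_hex[i:i + 2] for i in range(0, len(md5_hex), 2))

    return {"SHA256": "SHA256:" + sha256, "MD5": "MD5:" + md5}


def matches_published(public_key_line: str, published: str) -> bool:
    """Check a key line against a published 'SHA256:...' or 'MD5:...' value."""
    return published in ssh_fingerprints(public_key_line).values()
```

The same values are printed by `ssh-keygen -l -f <keyfile>` (add `-E md5` for the MD5 form); the sketch only shows what that comparison computes.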

2018-01-31 11:00: The host mira holding the cronbox and login service will be upgraded to Debian 9 Stretch on 2018-02-05 at 06:10.

Upgrade of Server itetnas02

Status: Status/green.gif

2018-01-25 07:30: Upgrade completed.
2018-01-24 16:45: On 2018-01-25 around 06:10 we will upgrade the server itetnas02. Several short outages of the file services (Samba, NFS) are expected. Services for project accounts and dedicated shares for biwi, ibt, ini and tik are affected.

Archived status reports

2015 2014 2013 2012 2011 2010

CategoryEDUC

-  ⇤ ← Revision 6 as of 2010-11-15 15:13:32 → 
  Size: 603
  Editor: bonaccos
  Comment:
+   ← Revision 613 as of 2020-08-31 11:38:06 → ⇥
  Size: 22158
  Editor: alders
  Comment:
-= Status =

<<Anchor(2010-11-15-servers-down)>>

== 2010-11-15: cooling water system outage for some clusters ==

Last Friday evening one of the cooling water pumps installed in ETZ/D/96.2 stopped working correctly. This forced some of the racks in this server room to shut down in order to protect the servers from thermal damage. All clusters from IFH, IBT, BIWI, TIK, IKT and VAW were affected.


The facility management is working on solving the problem. A technician is scheduled to be on-site, we expect the problem to be remedied at 5 PM. The technician is currently on-site.
+#rev 2018-08-27 mreimers

= General Information =
 * This page lists announcements and status messages for IT services managed by [[http://www.isg.ee.ethz.ch/|ISG.EE]].
 * For notifications and announcements of central IT services managed by ID, please visit https://www.ethz.ch/services/de/it-services/service-desk.html
 * For a detailed status overview of central IT services managed by ID, please visit https://ueberwachung.ethz.ch

||||<style="border-width: 1px 0px; border-color: rgb(85, 136, 238); padding: 0.6em;">'''Status-Key''' ||
||<style="border: medium none;"> {{attachment:Status/green.gif}} ||<style="border: medium none;">Resolved ||
||<style="border: medium none;"> {{attachment:Status/orange.gif}} ||<style="border: medium none;">Still working but with some errors ||
||<style="border-width: medium medium 1px; border-top: medium none rgb(85, 136, 238); border-left: medium none rgb(85, 136, 238); border-right: medium none rgb(85, 136, 238); border-color: rgb(85, 136, 238);"> {{attachment:Status/red.gif}} ||<style="border-width: medium medium 1px; border-top: medium none rgb(85, 136, 238); border-left: medium none rgb(85, 136, 238); border-right: medium none rgb(85, 136, 238); border-color: rgb(85, 136, 238);">Pending ||

= Current status reports =

<<Anchor(2020-07-11-storage-downtime)>>
== Planned project/archive storage downtime and client reboot ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-07-11 12:00:: Migration has been completed, all services are back to operational state.

  2020-07-11 08:00:: Migration started; services are shut down.

  2020-07-11 08:00-12:00:: Start of planned maintenance work. Project/archive storage services (known under the names "ocean", "bluebay", "lagoon" and "benderstor") will not be available. ISG-managed Linux clients will be rebooted.



<<Anchor(2020-06-04-svnsrv-upgrade)>>
== svn.ee.ethz.ch downtime for server upgrade ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-06-04 07:05:: Webservices for managing SVN repositories are enabled.
  2020-06-04 06:15:: The system upgrade is done and access to the SVN repositories via the `svn` and `https` transport protocols is back online.
  2020-06-04 06:00:: The server servicing the SVN repositories will be upgraded to a new operating system version. During this timeframe outages for access to the SVN repositories are expected.

<<Anchor(2020-05-17-cluster-abuse)>>
== European HPC cluster abuse ==
'''Status:''' {{attachment:Status/green.gif}}<<BR>>
Recently European HPC clusters have been attacked and abused for mining purposes. The D-ITET Slurm and SGE clusters have not been compromised. We are monitoring the situation closely.
  2020-05-17 08:30:: No successful login from known attacker IP addresses could be found, and none of the files that indicate a compromise have been found on our file systems.
  2020-05-16 14:30:: No unusual cluster job activity was observed.

<<Anchor(2020-05-04-itetnas04-upgrade)>>
== D-ITET Netscratch downtime for server upgrade ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-05-04 06:00:: Server upgrade has been completed.
  2020-05-04 06:00:: The server servicing the D-ITET Netscratch service will be upgraded to a new operating system version. During this timeframe outages of the NFS service are expected.

<<Anchor(2020-04-07-network-interuption)>>
== Network outage ETx router ==
'''Status:''' {{attachment:Status/green.gif}}
  2020-04-07 05:30:: There was an issue on the router `rou-etx`. The ID networking team tackled and solved the issue. There was an interruption of about 10 minutes for the ETx networking zone, affecting almost all systems maintained by ISG.EE.

<<Anchor(2020-04-06-mira-maintenance)>>
== login.ee.ethz.ch: Reboot for maintenance ==
'''Status:''' {{attachment:Status/green.gif}}
  2020-04-06 05:35:: The system behind `login.ee.ethz.ch` has been rebooted for maintenance and to increase available resources.

See the [[RemoteAccess|information on accessing D-ITET resources remotely]]. To distribute the load better, users are encouraged to use the VPN service whenever possible.

<<Anchor(2020-02-18-nostro-maintenance)>>
== itet-stor (FindYourData) Server maintenance: Reconfiguration of VM parameters ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-02-18 19:03:: System again up and running.
  2020-02-18 19:00:: Scheduled downtime for the [[Workstations/FindYourData|itet-stor/FindYourData service]] due to maintenance work on the underlying server.

<<Anchor(2020-01-20-nostro-os-upgrade)>>
== itet-stor (FindYourData) Server migration: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-01-20 07:15:: OS upgrade done, there were short interruptions to the [[Workstations/FindYourData|itet-stor/FindYourData service]].
  2020-01-20 06:00:: We will update the server servicing the [[Workstations/FindYourData|FindYourData service]] from Debian jessie 8 to Debian stretch 9. There will be short downtimes accessing this service during the update.

<<Anchor(2019-10-01-maintenance-project-storage)>>
== Project Storage maintenance ==
'''Status:''' {{attachment:Status/green.gif}}
  2019-10-01 06:00:: The required maintenance was performed and the storage services (NFS and Samba) are back online.
  2019-10-01 05:50:: We will perform HW maintenance of the system holding the project storage. NFS and Samba services to access the storage will be offline for 10-15min between 05:50 and 06:15.

<<Anchor(2019-09-17-power-outage)>>
== Power outage ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-09-17 12:11:: Power reestablished.
  2019-09-17 10:29:: Power outage in ET area. For up-to-date information please visit https://ethz.ch/services/de/it-services/service-desk.html#meldungen

<<Anchor(2019-07-31-outage-project-storage)>>
== Outage of Project Storage ==
'''Status:''' {{attachment:Status/green.gif}}  

  2019-09-17 12:11:: When power was reestablished after the outage, the storage system failed again until 12:30.
  2019-09-05 15:30:: Project homes on bluebay are unavailable until 19:10 due to another hardware failure
  2019-08-28 07:00:: While the service is running stably, we are still in contact with the cluster filesystem developers to further track down the issues seen in recent weeks.
  2019-08-22 06:00:: Planned reboot of second system in cluster and related management node.
  2019-08-21 06:00:: Planned reboot of first system (expect downtime of 5-10 minutes).
  2019-08-18 03:45:: Full sync of the missing storage target finally succeeded. Full redundancy has been regained for now. We are still in contact with the developers of the cluster filesystem to further analyse the problem.
  2019-08-14 08:00:: Full redundancy is still not regained. We are also in contact with the developers of the cluster filesystem to analyse the current problems.
  2019-08-12 07:00:: Sync of the last storage target is ongoing and reached 82%.
  2019-08-08 18:00:: Heavy load on the storage system while the resync was still running caused high access latencies. Several services were unavailable or responded very slowly until 22:30. The situation is normalizing, but until the full re-sync succeeds there is potential for further outages.
  2019-08-08 15:55:: The storage system is operational again and an initial high load phase due to resynchronization is finished.
  2019-08-08 14:28:: There are again problems with the storage servers...
  2019-08-07 16:30:: Due to errors while re-syncing the remaining target we needed to restart a full sync on this last target. Unfortunately this means that redundancy in the storage system will not yet be regained.
  2019-08-07 06:55:: High I/O load on the system during further syncing caused accessing issues on the projects between 06:15 and 06:55. There is still an issue with one target which needs a re-sync which might cause further access outages.
  2019-08-06 10:50:: Re-syncing is still ongoing, one storage target is missing, redundancy of the underlying storage system is not yet regained.
  2019-08-05 13:00:: Re-syncing of the broken targets is still ongoing, thus redundancy of the storage system is not yet regained.
  2019-08-02 17:00:: NFS throughput has normalized, causes are investigated. Recurrences have to be expected.
  2019-08-02 15:50:: High load on the storage server slowed NFS connections to the point of timeouts.
  2019-08-02 15:05:: The synchronization from the healthy system 2 to system 1 has again been started. Once this synchronization is complete (in roughly 3 days), the system is finally redundant again.
  2019-08-02 12:45:: The hardware has been fixed. The storage services (NFS/Samba) have been powered on.
  2019-08-02 11:50:: The failing hardware component is currently being replaced by the hardware supplier.
  2019-08-02 10:00:: The failing hardware of system 1 has been identified. Availability of hardware replacement parts and successful data consistency check will determine the availability of the storage services.
  2019-08-02 08:30:: The replaced hardware in system 1 is again faulty, while system 2 seems (hardware-wise) ok. Due to data inconsistency, the storage service can not provide its services.
  2019-08-01 16:30:: While system 1 has not synchronized all data from system 2, the latter system goes down. Starting from here, project and archive storage is not available anymore.
  2019-07-31 22:00:: A high load of one of the backup processes leads to an NFS service interruption. After postponing that backup process, NFS recovers.
  2019-07-31 21:30:: The hardware failure led to two corrupt storage targets on system 1. After reinitializing the affected targets the synchronization of data from the healthy system 2 starts.
  2019-07-31 15:45:: Service is restored, hardware exchange was successful. Restoring full redundancy is pending.
  2019-07-31 15:00:: Project homes on bluebay are unavailable due to replacement of faulty hardware part. Estimated downtime of ~1h.
  2019-07-31 07:40:: Service is restored, still running with reduced redundancy
  2019-07-31 06:20:: Project homes on bluebay are unavailable due to deactivation of faulty hardware part
  2019-07-31 02:40:: Storage system 2 took over. Service is restored but running with reduced redundancy.
  2019-07-31 02:30:: Project homes on bluebay are unavailable due to a hardware failure on system 1


<<Anchor(2019-07-22-outage-net-scratch)>>
== Outage net_scratch service ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-08-28 07:00:: The old server is now decommissioned. We will re-bootstrap the service on new hardware in the coming weeks and re-announce the service when it is ready to be used.

  2019-08-12 07:00:: Read-only access to partially recoverable data has been removed again as planned. We are planning to re-instantiate the service on new hardware.

  2019-07-26 12:00:: Read-only access to partially recoverable data has been made available on `login.ee.ethz.ch` under `/mnt`. This volume will be available until August 9th.

  2019-07-23 08:00:: We are in contact with the hardware vendor to decide on further steps for the respective RAID controller and to double-check whether there is an issue with it. Data on the `net_scratch` is very likely lost.

  2019-07-22 15:35:: We are still trying to recover the data, but the filesystem was corrupted due to a hardware hiccup on the underlying RAID system. Affected by this is [[Services/NetScratch|/itet-stor/<username>/net_scratch]].

  2019-07-22 12:00:: Filesystem errors for the device holding the `net_scratch` service were reported. The server is currently offline due to severe errors on the filesystem.

<<Anchor(2019-07-19-ptrace-disable)>>
== Disabling ptrace on managed Debian/GNU Linux computers ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-07-22 09:00::
   New images are rolled out. As clients reboot they will automatically pick up this new release with a fixed Linux kernel.

  2019-07-21 15:00::
   New images are available. On its next reboot, a managed client will boot with a patched kernel and the mitigation disabled.

  2019-07-19 13:00::
   A current Linux kernel bug allows any unprivileged user to gain root access. A proof of concept code snippet that exploits this vulnerability is publicly available.

   To protect our systems we temporarily disabled {{{ptrace(2)}}} on all managed Linux systems. All software depending on {{{ptrace(2)}}} will completely or at least partially fail. A prominent example is the GNU Debugger {{{gdb}}}.

   A patched Linux kernel will come soon. Once this new kernel is running, we will enable {{{ptrace}}} again.

<<Anchor(2019-07-18-outage-project-storage)>>
== Outage of Project Storage ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-07-18 18:45:: Service is back to normal
  2019-07-18 17:00:: Project homes on bluebay are unavailable

<<Anchor(2019-04-24-Power-Outage-Services-down)>>
== D-ITET services outages ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-04-24 14:18:: Virtualisation cluster is back with full redundancy. User-facing virtual machines were already back online at 07:30.
  2019-04-24 13:00:: An issue between the redundant switches caused the network problem, which left the cluster in an inconsistent state and rebooted all virtual machines. The networking team is investigating the issue between the switches further.
  2019-04-24 09:00:: Further analysis is ongoing, but the health status of the virtualisation cluster was affected, leading to resets of all hosted virtual machines.
  2019-04-24 07:30:: We are bringing services back to normal.
  2019-04-24 07:00:: A planned power outage in the ETZ building caused stability issues on several D-ITET services, in particular HOME and mail services.

<<Anchor(2019-02-25-verdi-repair-filesystem-inconsitency)>>
== Downtime HOME server: Repair filesystem inconsistency ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-02-25 06:18:: System is back online.
  2019-02-25 06:00:: We identified a filesystem usage accounting discrepancy on one filesystem of the HOME server. Repairing the underlying filesystem required taking the server down. The home storage is the default storage location for personal accounts on computers managed by ISG.EE.

<<Anchor(2019-01-25-D-ITET-RDS-Maintenance)>>
== One of two RDS servers is not reachable ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-01-25 23:50:: Maintenance issues have been resolved. All RDS servers are up and running now.
  2019-01-25 14:00:: The RDS maintenance window has ended, but one server still has pending updates. Logins are not allowed until this issue has been fixed.


<<Anchor(2018-11-06-D-ITET-itetmaster01)>>
== Upgrade license, database and distributed computing management server itetmaster01 ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-11-12 12:00:: PostgreSQL services are online.
  2018-11-12 09:31:: Arton-Cluster and Condor Grid updates are finished.
  2018-11-12 07:30:: Arton-Cluster updated (final checks pending)
  2018-11-12 07:00:: System and license services upgraded. Still pending: Arton, Condor and PostgreSQL Upgrades.
  2018-11-12 06:00 - 12:00:: Planned upgrade of `itetmaster01`

<<Anchor(2018-11-08-beegfs-storage-maintenance)>>
== Maintenance Project Storage ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-11-08 07:00:: Services back online (some recovering slowly)
  2018-11-08 06:15:: Starting downtime for project storage due to important maintenance on the master node.

<<Anchor(2018-11-01-D-ITET-RDS-UNSTABLE)>>
== D-ITET RDS frontend is difficult to reach due to AD name resolution issues ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-11-02 08:00:: RDS is working normally. Protective measures were put in place to ensure the AD name is updated correctly.
  2018-11-01 09:45:: worli.d.ethz.ch cannot be resolved by the AD name service. DNS works fine, but AD-DNS synchronization seems to be in an unstable state. We are in contact with the responsible team of central IT services.

 * WORKAROUND:
  * D-ITET users: Connect directly to vela.ee.ethz.ch
  * IFE users: Connect directly to neon.ee.ethz.ch


<<Anchor(2018-08-10-jabba-tape-library-problems)>>
== Jabba Tape Library HW Problems ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-08-17 07:00:: '''Update:''' All tape library issues have been solved. All backup and archive services are back online

  2018-08-15 12:00:: '''Update:''' There are still issues with the tape library. It will be down for at least another day.

  2018-08-13 11:30:: '''Update:''' The failed tape drive has been replaced. The management PC is still not working. Because the hardware is very old, obtaining spare parts takes longer than usual. The tape library remains offline until at least tomorrow afternoon.

  2018-08-10 10:00:: Due to a failed tape drive and a defective management PC, Jabba's tape library is not working. The Jabba server is not affected. We are in contact with IBM and hope the problem will be fixed by next Monday.

What does this mean for you:
  * BACKUP
    * Store new data: New data is stored in a SAM-FS cache area first but cannot be written to tape afterwards. The backup process therefore cannot be completed; it will start automatically as soon as the tape library works again.
    * Access existing data: Only data still available in the SAM-FS cache area can be accessed. NO ACCESS to data located on tape only.

  * ARCHIVE
    * Store new data: New data is stored in a SAM-FS cache area first. In a second step the data is copied to an archive disk, but the second copy to tape will fail. The archive process can therefore only be half completed. The copy to tape will start automatically as soon as the tape library works again.
    * Access existing data: There is no limitation on access to archive data.

<<Anchor(2018-06-16-D-ITET-mail-server-downtime)>>
== D-ITET mail server downtime: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-06-16 08:24:: System is up and running, all tests done. Everything should work as intended. If you find errors, please contact us at <support@ee.ethz.ch>
  2018-06-16 08:08:: System is up and running. We are performing some final tests before releasing access, slightly later than previously announced.
  2018-06-16 07:11:: Everything is currently going as planned.
  2018-06-16 06:00:: Due to a planned operating system update, the D-ITET mail server will be unavailable today, June 16, 2018 between 06:00 and 08:00.

<<Anchor(2018-04-24-major-outage-incident)>>
== Major outage virtualization cluster/networking switch ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-04-24 08:56:: Sending of emails is restored. Incoming mail should not have been lost for any properly behaving sending server: the issue caused a temporary error response to the sending server, which should in turn retry delivery later, with some delay.
  2018-04-24 07:45:: Bringing the most important services back online, including the home service; the issue is being investigated.
  2018-04-24 06:29:: Major outage of the networking/virtualization cluster, taking down important D-ITET services (home server, parts of the mail system, Linux clients).
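
The retry behaviour mentioned in the 08:56 entry follows from the SMTP reply classes: a 4yz reply is a transient failure that a sending server is expected to retry later, while a 5yz reply is permanent. The following is only a minimal sketch of that decision. Which reply code the D-ITET mail server actually returned during the outage is not stated here, and the function name `sender_action` is hypothetical.

```python
def sender_action(reply_code: int) -> str:
    """Map an SMTP reply code to what a well-behaved sending server does.

    Illustrative only: the name and the returned labels are assumptions,
    not part of any mail server's API.
    """
    if 200 <= reply_code < 300:
        return "delivered"      # positive completion, nothing to retry
    if 400 <= reply_code < 500:
        return "retry-later"    # transient failure: keep the mail queued
    if 500 <= reply_code < 600:
        return "bounce"         # permanent failure: report to the sender
    raise ValueError(f"not a final SMTP reply code: {reply_code}")
```

A sending server that received a transient reply such as 451 would keep the message in its queue and try again, which is why such mail arrives late rather than being lost.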

<<Anchor(2018-04-06-jabba-maintenance)>>
== Jabba Maintenance ==
'''Status:''' {{attachment:Status/green.gif}}
  2018-04-06 08:10:: Jabba is back online
  2018-04-06 07:00:: Jabba is offline due to maintenance work

<<Anchor(2018-03-10-d-itet-storage-migration)>>
== D-ITET Storage Migration ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-03-10 15:00:: Migration of user homes completed.
  2018-03-10 14:15:: User homes migrated, access is unblocked again, some post-migration tasks still pending.
  2018-03-10 10:00:: D-ITET user homes will be migrated from ID Storage to D-ITET Storage. Access to the user homes is blocked for affected users during the entire migration. Affected users are informed directly by email.

<<Anchor(2018-02-12-svnsrv-os-upgrade)>>
== svn.ee.ethz.ch Server migration: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-02-12 08:55:: Server upgrade has been completed and all services are up and running again.

  2018-02-12 06:15:: Starting the server upgrade from Debian 7 Wheezy to Debian 9 Stretch. Downtime is expected for `https://svn.ee.ethz.ch`, `svn://svn.ee.ethz.ch` and `https://svnmgr.ee.ethz.ch`.

<<Anchor(2018-02-05-cronbox-os-upgrade)>>
== Cronbox/Login Server migration: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-02-05 07:00:: The host `mira` has been upgraded to Debian 9 Stretch. The SSH host key fingerprints for RSA and ED25519 are:
  {{{
4096 MD5:fc:a8:00:5b:64:90:86:a1:fb:49:75:ef:55:58:90:b3 (RSA)
4096 SHA256:v48HAAAjr+avnPAESdQzazSriKYZeTGGtIPKfoE8Dg0 (RSA)
256 SHA256:SgvaiZyIgzujLJdbtRij5VGUOXm/IuAs3MkMYtGZNhc (ED25519)
256 MD5:3b:b0:1a:8a:ea:0a:e5:ea:bb:9e:bb:5c:ef:24:c3:92 (ED25519)
}}}
The SSH host key is also listed at https://people.ee.ethz.ch/
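
To compare a host key against the fingerprints above, the fingerprint can be recomputed from the public key line (for example one collected with `ssh-keyscan`). OpenSSH's `SHA256:` form is the unpadded Base64 of the SHA-256 digest of the decoded key blob, and the `MD5:` form is the colon-separated hex MD5 digest of the same blob. The sketch below assumes a standard one-line OpenSSH public key; the helper names are hypothetical.

```python
import base64
import hashlib
from typing import Iterable


def key_blob(public_key_line: str) -> bytes:
    """Decode the key blob from a line '<type> <base64-blob> [comment]'."""
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<type> <base64-blob> [comment]'")
    return base64.b64decode(fields[1], validate=True)


def sha256_fingerprint(blob: bytes) -> str:
    """Return the fingerprint in OpenSSH's 'SHA256:' notation."""
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the Base64 digest without trailing '=' padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def md5_fingerprint(blob: bytes) -> str:
    """Return the fingerprint in the legacy colon-separated 'MD5:' notation."""
    hex_digest = hashlib.md5(blob).hexdigest()
    pairs = [hex_digest[i:i + 2] for i in range(0, len(hex_digest), 2)]
    return "MD5:" + ":".join(pairs)


def matches_published(public_key_line: str, published: Iterable[str]) -> bool:
    """True if either fingerprint of the key appears in the published list."""
    blob = key_blob(public_key_line)
    known = set(published)
    return sha256_fingerprint(blob) in known or md5_fingerprint(blob) in known
```

In use, `published` would hold the fingerprint strings listed above without the leading key length and trailing key type, e.g. `"SHA256:SgvaiZyIgzujLJdbtRij5VGUOXm/IuAs3MkMYtGZNhc"`.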

  2018-01-31 11:00:: The host `mira` holding the cronbox and login service will be upgraded to Debian 9 Stretch on 2018-02-05 at 06:10.

<<Anchor(2018-01-25-upgrade-itetnas02)>>
== Upgrade of Server itetnas02 ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-01-25 07:30:: Upgrade completed.

  2018-01-24 16:45:: On 2018-01-25 around 06:10 we will upgrade the server `itetnas02`. Several short outages of file services (Samba, NFS) are expected. Services for project accounts and dedicated shares for biwi, ibt, ini and tik are affected.


= Archived status reports =

[[Status/Archive/2015|2015]]
[[Status/Archive/2014|2014]]
[[Status/Archive/2013|2013]]
[[Status/Archive/2012|2012]]
[[Status/Archive/2011|2011]]
[[Status/Archive/2010|2010]]

[[CategoryEDUC]]