General Information

Status-Key

Status/green.gif: Resolved
Status/orange.gif: Still working but with some errors
Status/red.gif: Pending

Project Storage maintenance

Status: Status/green.gif

2019-10-01 06:00
The required maintenance was performed and the storage services (NFS and Samba) are back online.
2019-10-01 05:50
We will perform hardware maintenance on the system holding the project storage. The NFS and Samba services used to access the storage will be offline for 10-15 minutes between 05:50 and 06:15.

Power outage

Status: Status/green.gif

2019-09-17 12:11
Power reestablished.
2019-09-17 10:29

Power outage in ET area. For up-to-date information please visit https://ethz.ch/services/de/it-services/service-desk.html#meldungen

Outage of Project Storage

Status: Status/green.gif

2019-09-17 12:11
When power was reestablished after the outage, the storage system failed again and remained unavailable until 12:30.
2019-09-05 15:30
Project homes on bluebay are unavailable until 19:10 due to another hardware failure
2019-08-28 07:00
While the service is running stably, we are still in ongoing communication with the cluster filesystem developers to further track down the issues seen in recent weeks.
2019-08-22 06:00
Planned reboot of the second system in the cluster and the related management node.
2019-08-21 06:00
Planned reboot of first system (expect downtime of 5-10 minutes).
2019-08-18 03:45
Full sync of the missing storage target finally succeeded. Full redundancy is regained for now. We are still in contact with the developers of the cluster filesystem to further analyse the problem.
2019-08-14 08:00
Full redundancy is still not regained. We are also in contact with the developers of the cluster filesystem to analyse the current problems.
2019-08-12 07:00
Sync of the last storage target is ongoing and reached 82%.
2019-08-08 18:00
Heavy load on the storage system while it still needed to perform the resync caused high access latencies. Several services were unavailable or responded very slowly until 22:30. The situation is normalizing, but until the full re-sync succeeds there is potential for further outages.
2019-08-08 15:55
The storage system is operational again and an initial high load phase due to resynchronization is finished.
2019-08-08 14:28
There are again problems with the storage servers...
2019-08-07 16:30
Due to errors while re-syncing the remaining target we needed to restart a full sync on this last target. Unfortunately this means that redundancy in the storage system will not yet be regained.
2019-08-07 06:55
High I/O load on the system during further syncing caused access issues on the project storage between 06:15 and 06:55. One target still needs a re-sync, which might cause further access outages.
2019-08-06 10:50
Re-syncing is still ongoing. One storage target is missing, so redundancy of the underlying storage system is not yet regained.
2019-08-05 13:00
Re-syncing of the broken targets is still ongoing, thus redundancy of the storage system is not yet regained.
2019-08-02 17:00
NFS throughput has normalized; the causes are being investigated. Recurrences have to be expected.
2019-08-02 15:50
High load on the storage server slowed NFS connections to the point of timeouts.
2019-08-02 15:05
The synchronization from the healthy system 2 to system 1 has been started again. Once this synchronization is complete (in roughly 3 days), the system will finally be redundant again.
2019-08-02 12:45
The hardware has been fixed. The storage services (NFS/Samba) have been powered on.
2019-08-02 11:50
The failing hardware component is currently being replaced by the hardware supplier.
2019-08-02 10:00
The failing hardware of system 1 has been identified. Availability of hardware replacement parts and successful data consistency check will determine the availability of the storage services.
2019-08-02 08:30
The replaced hardware in system 1 is faulty again, while system 2 seems (hardware-wise) ok. Due to data inconsistency, the storage system cannot provide its services.
2019-08-01 16:30
Before system 1 has synchronized all data from system 2, the latter goes down. From this point on, project and archive storage are no longer available.
2019-07-31 22:00
A high load of one of the backup processes leads to an NFS service interruption. After postponing that backup process, NFS recovers.
2019-07-31 21:30
The hardware failure led to two corrupt storage targets on system 1. After reinitializing the affected targets, the synchronization of data from the healthy system 2 starts.
2019-07-31 15:45
Service is restored, hardware exchange was successful. Restoring full redundancy is pending.
2019-07-31 15:00
Project homes on bluebay are unavailable due to the replacement of a faulty hardware part. Estimated downtime of ~1h.
2019-07-31 07:40
Service is restored, still running with reduced redundancy
2019-07-31 06:20
Project homes on bluebay are unavailable due to the deactivation of a faulty hardware part
2019-07-31 02:40
Storage system 2 took over. Service is restored but running with reduced redundancy.
2019-07-31 02:30
Project homes on bluebay are unavailable due to a hardware failure on system 1

Outage net_scratch service

Status: Status/green.gif

2019-08-28 07:00
The old server is now decommissioned. We will re-bootstrap the service on new hardware in the coming weeks and re-announce the service when it is ready for use.
2019-08-12 07:00
Read-only access to partially recoverable data has been removed again as planned. We are planning to re-instantiate the service on new hardware.
2019-07-26 12:00

Read-only access to partially recoverable data has been made available on login.ee.ethz.ch under /mnt. This volume will be available until August 9th.

2019-07-23 08:00

We are in contact with the hardware vendor to determine further steps regarding the respective RAID controller and to double-check whether there is an issue with it. Data on net_scratch is most likely lost.

2019-07-22 15:35

We are still trying to recover the data, but the filesystem was corrupted by a hardware hiccup on the underlying RAID system. This affects /itet-stor/<username>/net_scratch.

2019-07-22 12:00

Filesystem errors for the device holding the net_scratch service were reported. The server is currently offline due to severe errors on the filesystem.

Disabling ptrace on managed Debian GNU/Linux computers

Status: Status/green.gif

2019-07-22 09:00
  • New images are rolled out. As clients reboot they will automatically pick up this new release with a fixed Linux kernel.
2019-07-21 15:00
  • New images are available. On its next reboot, a managed client will boot with a patched kernel and the mitigation disabled.
2019-07-19 13:00
  • A current Linux kernel bug allows any unprivileged user to gain root access. A proof of concept code snippet that exploits this vulnerability is publicly available.

    To protect our systems we temporarily disabled ptrace(2) on all managed Linux systems. All software depending on ptrace(2) will completely or at least partially fail. A prominent example is the GNU Debugger gdb.

    A patched Linux kernel will come soon. Once this new kernel is running, we will enable ptrace again.
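
For users wondering whether ptrace is currently restricted on a given machine, the following is a minimal sketch only. It assumes the restriction is exposed through the Yama security module's ptrace_scope setting; the announcement above does not say which mechanism was used, so a missing or "classic" value does not prove that ptrace is usable. The function names are illustrative.

```python
from pathlib import Path

# Assumption: the kernel uses the Yama LSM. The announcement does not state
# how ptrace was disabled, so this file may be absent or not reflect it.
YAMA_PTRACE_SCOPE = Path("/proc/sys/kernel/yama/ptrace_scope")

# Meanings of the Yama ptrace_scope values as documented for the Linux kernel.
SCOPE_MEANINGS = {
    0: "classic ptrace permissions",
    1: "restricted: only descendants may be traced",
    2: "admin-only: CAP_SYS_PTRACE required",
    3: "no attach: ptrace disabled, cannot be changed at runtime",
}


def describe_scope(raw: str) -> str:
    """Translate the raw file content into a human-readable description."""
    value = int(raw.strip())
    return SCOPE_MEANINGS.get(value, f"unknown value {value}")


def read_scope(path: Path = YAMA_PTRACE_SCOPE):
    """Return a description of the current setting, or None if unavailable."""
    try:
        return describe_scope(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    result = read_scope()
    print(result if result is not None else "Yama ptrace_scope not available")
```

If the script reports a restricted setting, tools such as gdb that attach to running processes can be expected to fail in the way described above.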

Outage of Project Storage

Status: Status/green.gif

2019-07-18 18:45
Service is back to normal
2019-07-18 17:00
Project homes on bluebay are unavailable

D-ITET services outages

Status: Status/green.gif

2019-04-24 14:18
Virtualisation cluster is back with full redundancy. User-affecting virtual machines were already back online at 07:30.
2019-04-24 13:00
An issue between the redundant switches caused the network problem, which left the cluster in an inconsistent state and rebooted all virtual machines. The networking team is further investigating the issue between the switches.
2019-04-24 09:00
Further analysis is ongoing, but the health status of the virtualisation cluster was affected, leading to resets of all hosted virtual machines.
2019-04-24 07:30
We are bringing services back to normal.
2019-04-24 07:00
A planned outage in the ETZ building caused stability issues on several D-ITET services, in particular the HOME and mail services.

Downtime HOME server: Repair filesystem inconsistency

Status: Status/green.gif

2019-02-25 06:18
System is back online.
2019-02-25 06:00
We identified a filesystem usage accounting discrepancy on one filesystem on the HOME server, which requires taking down the server and repairing the underlying filesystem. The home storage is the default storage location for personal accounts on computers managed by ISG.EE.

One of two RDS servers is not reachable

Status: Status/green.gif

2019-01-25 23:50
Maintenance issues have been resolved. All RDS servers are up and running now.
2019-01-25 14:00
The RDS maintenance window has ended, but one server still has pending updates. Logins are not allowed until this issue has been fixed.

Status/Archive/2019 (last edited 2020-08-31 11:41:55 by alders)