Differences between revisions 494 and 653 (spanning 159 versions)
Revision 494 as of 2018-04-24 13:48:18
Size: 10594
Editor: bonaccos
Comment:
Revision 653 as of 2022-07-11 12:47:04
Size: 9846
Editor: maegger
Comment:
Line 1: Line 1:
#rev 2018-08-27 mreimers
#rev 2020-08-31 alders
Line 13: Line 16:
<<Anchor(2018-04-24-major-outage-incident)>> <<Anchor(2022-07-16-planned-server-downtime)>>
== Planned ISG.EE Mailsystem downtime because of IMAP server platform switch ==
'''Status:''' {{attachment:Status/red.gif}}
Line 15: Line 20:
== Major outage virtualization cluster/networking switch ==
  2022-07-16 07:00:: On 2022-07-16 between 07:00 and 12:00, ISG will migrate the IMAP server to another system. During the migration, email will be inaccessible for the following domains:
   * ee.ethz.ch
   * vision.ee.ethz.ch
   * isi.ee.ethz.ch
   * mwe.ee.ethz.ch
   * mins.ee.ethz.ch
   * nari.ee.ethz.ch

<<Anchor(2022-07-09-netscratch-maintenance)>>
== NetScratch Server Filesystem Maintenance ==
Line 18: Line 32:
  2018-04-24 08:56:: Sending of emails is restored. Incoming mail should not have been lost for any properly configured sending mail server: the outage returned a temporary error to the sending server, which should retry delivery later, with some delay.
  2018-04-24 07:45:: Bringing back online most important services, including home service; issue being investigated.
  2018-04-24 06:29:: Major outage of Networking/virtualization Cluster taking down important D-ITET Services (home Server, partially mailsystem, Linux clients).
  2022-07-10 10:55:: Filesystem for [[Services/NetScratch|NetScratch]] is online and in read-write mode.
  2022-07-10 09:30:: The first run of checks is complete, and we are proceeding with the next steps to bring the filesystem back online.
  2022-07-09 13:00:: The [[Services/NetScratch|NetScratch]] filesystem will be put into read-only mode for maintenance.
Line 22: Line 36:
<<Anchor(2018-04-06-jabba-maintenance)>>

== Jabba Maintenance ==
'''Status:''' {{attachment:Status/green.gif}}
  2018-04-06 08:10:: Jabba is back online
  2018-04-06 07:00:: Jabba is offline due to maintenance work

<<Anchor(2018-03-10-d-itet-storage-migration)>>

== D-ITET Storage Migration ==
'''Status:''' {{attachment:Status/green.gif}}
 
  2018-03-10 15:00:: Migration of user homes completed.
  2018-03-10 14:15:: User homes migrated, access is unblocked again, some post-migration tasks still pending.
  2018-03-10 10:00:: D-ITET user homes will be migrated from ID Storage to D-ITET Storage. During the whole migration time access to the user homes for the affected users is blocked. Affected users are informed directly by an email.

<<Anchor(2018-02-12-svnsrv-os-upgrade)>>
== svn.ee.ethz.ch Server migration: New operating system version ==
<<Anchor(2022-06-16-exchange-support)>>
== Unreachable ee.ethz.ch Email recipients over ID Exchange Mailserver ==
Line 42: Line 40:
  2018-02-12 08:55:: Server upgrade has been completed and all services up and running again.
  
  2018-02-12 06:15:: Start updating server from Debian Wheezy 7 to Debian Stretch 9. Downtimes for `https://svn.ee.ethz.ch`, `svn://svn.ee.ethz.ch` and `https://svnmgr.ee.ethz.ch`.
  2022-06-16 16:45:: Configuration issue has been resolved.
  2022-06-16 15:00:: Emails with ee.ethz.ch recipients sent over the ID Exchange Server do not reach destination. ID Exchange Admins are working on fixing the problem.
Line 46: Line 43:
<<Anchor(2018-02-05-cronbox-os-upgrade)>>
== Cronbox/Login Server migration: New operating system version ==
<<Anchor(2021-11-01-home-maintenance)>>
== HOME Server maintenance to repair filesystem inconsistency ==
Line 50: Line 47:
  2018-02-05 07:00:: The host `mira` has been upgraded to Debian 9 Stretch. The SSH host key fingerprints for RSA and ED25519 are:
  {{{
4096 MD5:fc:a8:00:5b:64:90:86:a1:fb:49:75:ef:55:58:90:b3 (RSA)
4096 SHA256:v48HAAAjr+avnPAESdQzazSriKYZeTGGtIPKfoE8Dg0 (RSA)
256 SHA256:SgvaiZyIgzujLJdbtRij5VGUOXm/IuAs3MkMYtGZNhc (ED25519)
256 MD5:3b:b0:1a:8a:ea:0a:e5:ea:bb:9e:bb:5c:ef:24:c3:92 (ED25519)
}}}
The SSH host key is also listed at: https://people.ee.ethz.ch/
  2021-11-01 06:30:: System is back online and HOME directories are accessible again for all D-ITET users.
  2021-11-01 05:50:: HOME Server will be taken offline to start the repair of a filesystem inconsistency.
Line 59: Line 50:
  2018-01-31 11:00:: The host `mira` holding the cronbox and login service will be upgraded to Debian 9 Stretch on 2018-02-05 at 06:10.
<<Anchor(2021-07-05-linux-printing)>>
== Linux printing affected by PrintNightmare vulnerability patch ==
'''Status:''' {{attachment:Status/green.gif}}
  2021-07-05 13:00:: ID resolved the issue
  2021-07-05 09:41:: Workaround: Use [[Printing#Platform-independent_printing| platform-independent printing]]
  2021-07-05 09:41:: Authentication for printing fails. A ticket has been opened at the ID service desk.
Line 61: Line 57:
<<Anchor(2018-01-25-upgrade-itetnas02)>>
== Upgrade of Server itetnas02 ==
<<Anchor(2021-04-27-itetmaster01-update)>>
== Downtime of various D-ITET services for server maintenance ==
Line 65: Line 61:
  2018-01-25 07:30:: Upgrade completed.
  2021-04-27 08:30:: Condor is back online, all services restored.
  2021-04-27 08:15:: Matrix/Element Chat services back online.
  2021-04-27 08:00:: Database upgrade done and online.
  2021-04-27 07:30:: Slurm services are back online.
  2021-04-27 07:00:: Base system has been upgraded; the upgrade of the main database services is in progress.
  2021-04-27 06:00:: On 2021-04-27 between 06:00 and 08:30 ISG is going to update a server providing access to various D-ITET services. During the migration the following services will be affected and offline:
   * Matrix/Element Chat services (the instances will be unavailable)
   * IFA/Control Website: Access to the IFA database is blocked
   * Slurm (D-ITET Arton Cluster): It won't be possible to submit new jobs or view Slurm statistics. Already running jobs will not be affected.
   * Condor: Condor clients will be shut down the evening before to avoid running jobs during the migration.
Line 67: Line 72:
  2018-01-24 16:45:: On 2018-01-25 around 06:10 we will upgrade the server `itetnas02`. Several short outages for Fileservices (Samba, NFS) are expected. Services for project accounts and dedicated shares for biwi, ibt, ini and tik are affected.

<<Anchor(2017-11-10-outage-itetnas03)>>
== Outage of Server itetnas03 ==
'''Status:''' {{attachment:Status/green.gif}}
 
  2017-11-15 07:00:: Battery unit replaced

  2017-11-10 07:20:: Server is back online but without battery unit. We will need to shut down `itetnas03` again once the problem is isolated and can be fixed.
 
  2017-11-10 06:15:: The server itetnas03 is down due to hardware problems (A battery replacement caused controller problems). ISG and the hardware vendor are currently working to get this problem solved.

<<Anchor(2017-11-07-CifsHome)>>
== User Home accessibility ==
<<Anchor(2021-03-31-network-disruption)>>
== Network disruption affecting several ISG.EE services ==
Line 83: Line 76:
  2017-11-08 06:25:: Informatikdienste have reverted a change which caused the problems accessing all users' HOME directories via the CIFS (SAMBA) protocol.
  2021-03-31 09:30:: The configuration error was found. The configuration change will be deployed on '''2021-04-01 around 06:15''' and a short network interruption of about 1 min is expected.
  2021-03-31 08:00:: The ID networking team has rolled back a deployed configuration, pending further investigation/analysis.
  2021-03-31 07:30:: There is currently a disruption affecting a VPZ with servers managed by ISG.EE. The ID networking team is investigating the issue. Several ISG.EE services are affected or malfunctioning as a result, in particular the FindYourData service.
Line 85: Line 80:
  2017-11-07 08:00:: All users' HOME are currently not accessible by CIFS (SAMBA) protocol. NFS access is still available.

<<Anchor(2017-10-24-ibtnas02)>>
== Outage of Server ibtnas02 ==
<<Anchor(2021-03-11-mira-upgrade)>>
== login.ee.ethz.ch: downtime for server upgrade ==
Line 91: Line 84:
  2017-10-31 08:00:: Upgrade successfully completed
  2021-03-11 06:30:: Upgrade completed and service is up and running again.
  2021-03-11 06:00:: The server servicing login.ee.ethz.ch will be upgraded to a new OS version (Debian buster). During the time of the update logins might not be possible.
Line 93: Line 87:
  2017-10-30 16:50:: The server will be upgraded to a new OS release on 2017-10-31 starting around 06:15. Short outages of Samba and NFS services are expected.
<<Anchor(2020-07-11-storage-downtime)>>
== Planned project/archive storage downtime and client reboot ==
'''Status:''' {{attachment:Status/green.gif}}
Line 95: Line 91:
  2017-10-25 10:00:: ibtnas02 now serves all partitions but the problem is not yet identified
  2020-07-11 12:00:: Migration has been completed, all services are back to operational state.
Line 97: Line 93:
  2017-10-24 15:00:: The server ibtnas02 is up again (partition data-08 is not available)
  2020-07-11 08:00:: Migration started, services are shut down
Line 99: Line 95:
  2017-10-24 12:50:: The server ibtnas02 is down again

  2017-10-24 09:30:: The server ibtnas02 is back online

  2017-10-24 08:00:: The server ibtnas02 is down due to hardware problems

<<Anchor(2017-10-21-itetnas03)>>
== Outage of Server itetnas03 ==
'''Status:''' {{attachment:Status/green.gif}}
  
  2017-10-23 18:15:: Data are also accessible via NFS.

  2017-10-23 09:30:: The server is up. Data are accessible via Samba. NFS file service is still down.

  2017-10-21 15:00:: The server itetnas03 is down due to hardware problems
  2020-07-11 08:00-12:00:: Start of planned maintenance work. Project/archive storage services (known under the names "ocean", "bluebay", "lagoon" and "benderstor") will not be available. ISG-managed Linux clients will be rebooted.
Line 117: Line 99:
<<Anchor(2017-10-18-outage-etz-d-96-2)>>
== Outage of Servers in Serverroom ETZ/D/96.2 ==
'''Status:''' {{attachment:Status/green.gif}}
  
  2017-10-20 13:45:: All racks in ETZ/D/96.2 are working again (cooling problem solved).

  2017-10-20 10:00:: The technician will arrive at 13:00. Some servers are running, but without water cooling, so any rack might shut down at any time if the air cooling is not sufficient. This will most probably happen again while the technician is working in the room (i.e. this afternoon).

  2017-10-19 18:30:: The cooling engineer could not fix the problem, so some servers are still offline. Another technician will try to fix the cooling system tomorrow morning.

  2017-10-18 14:00:: Cooling system is still not working correctly, we only selectively powered on a couple of compute machines.

  2017-10-18 12:50:: The problem has been localized and repaired. We need to wait for the circuit to cool down.

  2017-10-18 10:30:: Outage of most racks in ETZ/D/96.2 (cooling problem). Most compute servers are offline.


<<Anchor(2017-05-13-outage-etz-d-96-2)>>
== Outage Servers in Serverroom ETZ/D/96.2 ==
<<Anchor(2020-06-04-svnsrv-upgrade)>>
== svn.ee.ethz.ch downtime for server upgrade ==
Line 138: Line 103:
  2017-05-13 20:00:: Outage of some racks in ETZ/D/96.2. Several compute servers offline.
  2017-05-13 23:59:: Most of the servers are back online.
  2017-05-15 08:45:: Status of remaining servers verified. All back online.
  2020-06-04 07:05:: Webservices for managing SVN repositories are enabled.
  2020-06-04 06:15:: System upgrade is done and access to the SVN repositories via the `svn` and `https` transport protocols is back online.
  2020-06-04 06:00:: The server servicing the SVN repositories will be upgraded to a new operating system version. During this timeframe outages for access to the SVN repositories are expected.
Line 142: Line 107:
<<Anchor(2017-03-24-cronbox-login-ssh-keys)>>
== Cronbox/Login Server migration: new SSH host key ==
<<Anchor(2020-05-17-cluster-abuse)>>
== European HPC cluster abuse ==
'''Status:''' {{attachment:Status/green.gif}}<<BR>>
Recently European HPC clusters have been attacked and abused for mining purposes. The D-ITET Slurm and SGE clusters have not been compromised. We are monitoring the situation closely.
  2020-05-17 08:30:: No successful login from known attacker IP addresses could be found, and none of the files that would indicate a compromise have been found on our file systems.
  2020-05-16 14:30:: No unusual cluster job activity was observed.

<<Anchor(2020-05-04-itetnas04-upgrade)>>
== D-ITET Netscratch downtime for server upgrade ==
Line 146: Line 118:
  2017-03-24 17:00:: The cronbox and login server has moved to a new host. A new SSH host key has been generated:
  {{{
4096 MD5:fc:a8:00:5b:64:90:86:a1:fb:49:75:ef:55:58:90:b3 (RSA)
4096 SHA256:v48HAAAjr+avnPAESdQzazSriKYZeTGGtIPKfoE8Dg0 (RSA)
}}}
The SSH host key is also listed at: https://people.ee.ethz.ch/
  2020-05-04 06:00:: Server upgrade has been completed.
  2020-05-04 06:00:: The server servicing the D-ITET Netscratch service will be upgraded to a new operating system version. During this timeframe outages of the NFS service are expected.
Line 153: Line 121:
  Remember:: '''Always''' verify the fingerprint of an SSH host key before accepting it.
<<Anchor(2020-04-07-network-interuption)>>
== Network outage ETx router ==
'''Status:''' {{attachment:Status/green.gif}}
  2020-04-07 05:30:: There was an issue on the router `rou-etx`. The ID networking team tackled and solved the issue. There was an interruption of about 10 min for the ETx networking zone, affecting almost all systems maintained by ISG.EE.
Line 155: Line 126:
<<Anchor(2017-01-07-Mailsystem migration)>>
== EE Mailsystem migration ==
'''STATUS:''' {{attachment:Status/green.gif}} '''Mailsystem up'''
<<Anchor(2020-04-06-mira-maintenance)>>
== login.ee.ethz.ch: Reboot for maintenance ==
'''Status:''' {{attachment:Status/green.gif}}
  2020-04-06 05:35:: System behind `login.ee.ethz.ch` has been rebooted for maintenance and to increase available resources.
Line 159: Line 131:
  2017-01-08 15:00:: The new mailsystem is now started. In case of unexpected problems we will stop it again to prevent data loss and to analyze the problem.
See the [[RemoteAccess|information on accessing D-ITET resources remotely]]. To distribute the load better, users are encouraged to use the VPN service whenever possible.
Line 161: Line 133:
  2017-01-07 24:00:: Not all test cases could be performed. We now plan to enable the new system around noon.
<<Anchor(2020-02-18-nostro-maintenance)>>
== itet-stor (FindYourData) Server maintenance: Reconfiguration of VM parameters ==
'''Status:''' {{attachment:Status/green.gif}}
Line 163: Line 137:
  2017-01-07 20:45:: Old Mailserver Configuration migrated, starting the mailserver testing
  2020-02-18 19:03:: System is up and running again.
  2020-02-18 19:00:: Scheduled downtime for the [[Workstations/FindYourData|itet-stor/FindYourData service]] due to maintenance work on the underlying server.
Line 165: Line 140:
  2017-01-07 14:00:: User mailbox data migrated, starting mailserver configuration migration
<<Anchor(2020-01-20-nostro-os-upgrade)>>
== itet-stor (FindYourData) Server migration: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}
Line 167: Line 144:
  2017-01-07 07:00:: All mail services are stopped. Mailbox data copy started.
  2020-01-20 07:15:: OS upgrade done, there were short interruptions to the [[Workstations/FindYourData|itet-stor/FindYourData service]].
  2020-01-20 06:00:: We will update the server servicing the [[Workstations/FindYourData|FindYourData service]] from Debian jessie 8 to Debian stretch 9. There will be short downtimes accessing this service during the update.
Line 169: Line 147:
<<Anchor(2016-09-12-network-outage)>>
== Networkoutage ETH ==
'''STATUS:''' {{attachment:Status/green.gif}}

  2016-02-09 08:20:: ETH-wide network outage due to hardware problems in the firewall infrastructure. In any case, please reboot your computer before continuing.

  2016-02-09 12:35:: Network is back online and services are being recovered. Due to the hardware failure 53 network zones were affected. The problem got localized and resolved.

  2016-02-09 14:25:: Our systems should be all back to normal. In case you experience any problem please contact support via mailto:support@ee.ethz.ch.

<<Anchor(2016-02-10-maintenance-polaris)>>
== Maintenance login.ee.ethz.ch and cronbox.ee.ethz.ch service ==
'''STATUS:''' {{attachment:Status/green.gif}}

  2016-02-10 06:05:: The server for the [[Services/Cronjob|cronbox]] and login service is currently being updated from Debian Wheezy to Debian Jessie. The services will be temporarily unavailable.

  2016-02-10 12:00:: Server update is done.
Line 189: Line 150:
[[Status/Archive/2010|2010]]
[[Status/Archive/2011|2011]]
[[Status/Archive/2012|2012]]
[[Status/Archive/2013|2013]]
[[Status/Archive/2014|2014]]
Line 190: Line 156:
[[Status/Archive/2014|2014]]
[[Status/Archive/2013|2013]]
[[Status/Archive/2012|2012]]
[[Status/Archive/2011|2011]]
[[Status/Archive/2010|2010]]
[[Status/Archive/2016|2016]]
[[Status/Archive/2017|2017]]
[[Status/Archive/2018|2018]]
[[Status/Archive/2019|2019]]

General Information

Status-Key

Status/green.gif

Resolved

Status/orange.gif

Still working but with some errors

Status/red.gif

Pending

Current status reports

Planned ISG.EE Mailsystem downtime because of IMAP server platform switch

Status: Status/red.gif

2022-07-16 07:00
On 2022-07-16 between 07:00 and 12:00, ISG will migrate the IMAP server to another system. During the migration, email will be inaccessible for the following domains:
  • ee.ethz.ch
  • vision.ee.ethz.ch
  • isi.ee.ethz.ch
  • mwe.ee.ethz.ch
  • mins.ee.ethz.ch
  • nari.ee.ethz.ch

NetScratch Server Filesystem Maintenance

Status: Status/green.gif

2022-07-10 10:55

Filesystem for NetScratch is online and in read-write mode.

2022-07-10 09:30
The first run of checks is complete, and we are proceeding with the next steps to bring the filesystem back online.
2022-07-09 13:00

The NetScratch filesystem will be put into read-only mode for maintenance.

Unreachable ee.ethz.ch Email recipients over ID Exchange Mailserver

Status: Status/green.gif

2022-06-16 16:45
Configuration issue has been resolved.
2022-06-16 15:00
Emails with ee.ethz.ch recipients sent over the ID Exchange Server do not reach destination. ID Exchange Admins are working on fixing the problem.

HOME Server maintenance to repair filesystem inconsistency

Status: Status/green.gif

2021-11-01 06:30
System is back online and HOME directories are accessible again for all D-ITET users.
2021-11-01 05:50
HOME Server will be taken offline to start the repair of a filesystem inconsistency.

Linux printing affected by PrintNightmare vulnerability patch

Status: Status/green.gif

2021-07-05 13:00
ID resolved the issue
2021-07-05 09:41

Workaround: Use platform-independent printing

2021-07-05 09:41
Authentication for printing fails. A ticket has been opened at the ID service desk.

Downtime of various D-ITET services for server maintenance

Status: Status/green.gif

2021-04-27 08:30
Condor is back online, all services restored.
2021-04-27 08:15
Matrix/Element Chat services back online.
2021-04-27 08:00
Database upgrade done and online.
2021-04-27 07:30
Slurm services are back online.
2021-04-27 07:00
Base system has been upgraded; the upgrade of the main database services is in progress.
2021-04-27 06:00
On 2021-04-27 between 06:00 and 08:30 ISG is going to update a server providing access to various D-ITET services. During the migration the following services will be affected and offline:
  • Matrix/Element Chat services (the instances will be unavailable)
  • IFA/Control Website: Access to the IFA database is blocked
  • Slurm (D-ITET Arton Cluster): It won't be possible to submit new jobs or view Slurm statistics. Already running jobs will not be affected.
  • Condor: Condor clients will be shut down the evening before to avoid running jobs during the migration.

Network disruption affecting several ISG.EE services

Status: Status/green.gif

2021-03-31 09:30

The configuration error was found. The configuration change will be deployed on 2021-04-01 around 06:15 and a short network interruption of about 1 min is expected.

2021-03-31 08:00
The ID networking team has rolled back a deployed configuration, pending further investigation/analysis.
2021-03-31 07:30

There is currently a disruption affecting a VPZ with servers managed by ISG.EE. The ID networking team is investigating the issue. Several ISG.EE services are affected or malfunctioning as a result, in particular the FindYourData service.

login.ee.ethz.ch: downtime for server upgrade

Status: Status/green.gif

2021-03-11 06:30
Upgrade completed and service is up and running again.
2021-03-11 06:00
The server servicing login.ee.ethz.ch will be upgraded to a new OS version (Debian buster). During the time of the update logins might not be possible.

Planned project/archive storage downtime and client reboot

Status: Status/green.gif

2020-07-11 12:00
Migration has been completed, all services are back to operational state.
2020-07-11 08:00
Migration started, services are shut down
2020-07-11 08:00-12:00
Start of planned maintenance work. Project/archive storage services (known under the names "ocean", "bluebay", "lagoon" and "benderstor") will not be available. ISG-managed Linux clients will be rebooted.

svn.ee.ethz.ch downtime for server upgrade

Status: Status/green.gif

2020-06-04 07:05
Webservices for managing SVN repositories are enabled.
2020-06-04 06:15

System upgrade is done and access to the SVN repositories via the svn and https transport protocols is back online.

2020-06-04 06:00
The server servicing the SVN repositories will be upgraded to a new operating system version. During this timeframe outages for access to the SVN repositories are expected.

European HPC cluster abuse

Status: Status/green.gif
Recently European HPC clusters have been attacked and abused for mining purposes. The D-ITET Slurm and SGE clusters have not been compromised. We are monitoring the situation closely.

2020-05-17 08:30
No successful login from known attacker IP addresses could be found, and none of the files that would indicate a compromise have been found on our file systems.
2020-05-16 14:30
No unusual cluster job activity was observed.

D-ITET Netscratch downtime for server upgrade

Status: Status/green.gif

2020-05-04 06:00
Server upgrade has been completed.
2020-05-04 06:00
The server servicing the D-ITET Netscratch service will be upgraded to a new operating system version. During this timeframe outages of the NFS service are expected.

Network outage ETx router

Status: Status/green.gif

2020-04-07 05:30

There was an issue on the router rou-etx. The ID networking team tackled and solved the issue. There was an interruption of about 10 min for the ETx networking zone, affecting almost all systems maintained by ISG.EE.

login.ee.ethz.ch: Reboot for maintenance

Status: Status/green.gif

2020-04-06 05:35

System behind login.ee.ethz.ch has been rebooted for maintenance and to increase available resources.

See the information on accessing D-ITET resources remotely. To distribute the load better, users are encouraged to use the VPN service whenever possible.

itet-stor (FindYourData) Server maintenance: Reconfiguration of VM parameters

Status: Status/green.gif

2020-02-18 19:03
System is up and running again.
2020-02-18 19:00

Scheduled downtime for the itet-stor/FindYourData service due to maintenance work on the underlying server.

itet-stor (FindYourData) Server migration: New operating system version

Status: Status/green.gif

2020-01-20 07:15

OS upgrade done, there were short interruptions to the itet-stor/FindYourData service.

2020-01-20 06:00

We will update the server servicing the FindYourData service from Debian jessie 8 to Debian stretch 9. There will be short downtimes accessing this service during the update.

Archived status reports

2010 2011 2012 2013 2014 2015 2016 2017 2018 2019


CategoryEDUC

Status (last edited 2023-10-16 11:24:17 by alders)