Differences between revisions 275 and 612 (spanning 337 versions)
Revision 275 as of 2013-11-18 11:45:00
Size: 40545
Editor: bonaccos
Comment:
Revision 612 as of 2020-08-31 11:36:39
Size: 26889
Editor: alders
Comment:
#rev 2018-08-27 mreimers
 * For notifications and announcements of central IT services managed by ID, please visit https://www.ethz.ch/services/de/it-services/service-desk.html
 * For a detailed status overview of central IT services managed by ID, please visit https://ueberwachung.ethz.ch
||<style="border: medium none;"> {{attachment:green.gif}} ||<style="border: medium none;">Resolved ||
||<style="border: medium none;"> {{attachment:orange.gif}} ||<style="border: medium none;">Still working but with some errors ||
||<style="border-width: medium medium 1px; border-top: medium none rgb(85, 136, 238); border-left: medium none rgb(85, 136, 238); border-right: medium none rgb(85, 136, 238); border-color: rgb(85, 136, 238);"> {{attachment:red.gif}} ||<style="border-width: medium medium 1px; border-top: medium none rgb(85, 136, 238); border-left: medium none rgb(85, 136, 238); border-right: medium none rgb(85, 136, 238); border-color: rgb(85, 136, 238);">Pending ||

<<Anchor(2013-11-18-mysql-server-problem)>>

= MySQL Server downtime =
'''STATUS:''' {{attachment:green.gif}}

 2013-11-18: 12:00:: We are experiencing problems with the MySQL database cluster. All websites using the MySQL instance on remi.ee.ethz.ch are also affected.

 2013-11-18: 12:45:: The system is working again.

----

<<Anchor(2013-11-12-oenone-nfs-problems)>>

= Server oenone NFS problems =
'''STATUS:''' {{attachment:green.gif}}

 2013-11-12: 04:00:: NFS-related problems occurred on server oenone again. The home directories of all users from '''AUT''', '''BIWI''', '''IKT''' and '''VAW''' were affected.

 2013-11-12: 07:40:: The system is back online.
----

<<Anchor(2013-10-16-oenone-nfs-problems)>>

= Server oenone NFS problems =
'''STATUS:''' {{attachment:green.gif}}

 2013-10-16: 02:00:: Starting at around 02:00 in the morning we had NFS-related problems on server oenone. The home directories of all users from '''AUT''', '''BIWI''', '''IKT''' and '''VAW''' were affected. Webpages of these institutes may also have been affected.

 2013-10-16: 03:00:: Server oenone recovered on its own and is running normally again.

----

<<Anchor(2013-09-28-solaris-server-patching)>>

= Maintenance work on D-ITET's central IT infrastructure =
'''STATUS:''' {{attachment:green.gif}}

 2013-09-28: 14:30:: All services are back online

 2013-09-28: 10:00-16:00:: To keep our systems up to date with the newest software and security releases, we need to update our servers on a regular basis. For this reason we are going to install the latest Oracle patches on our main servers. The servers will be rebooted during this maintenance.

To prevent data corruption/loss please do the following:
 * save all open files
 * close all running applications
 * logout from all ITET systems (Linux, Windows)
 * shutdown your personal PC/Desktop
 * do not establish any connection from outside

More details:
 * servers concerned: DRWHO, TARDIS, OENONE, SPITFIRE, YOSEMITE
 * webpages hosted on these systems are NOT available
 * NO mail access, NO outgoing mails (incoming mails WON'T get lost)

----

<<Anchor(2013-09-05-oenone-servercrash)>>

= Server oenone crash =
'''STATUS:''' {{attachment:green.gif}}

 2013-09-04: 21:30:: The oenone NFS server crashed, affecting users of '''AUT''', '''BIWI''', '''IKT''' and '''VAW''', including the home directories and webpages of these institutes.

 2013-09-05: 04:00:: We had to reset the server to bring it back up.

----

<<Anchor(2013-06-15-oenone-servercrash)>>

= Server oenone crash =
'''STATUS:''' {{attachment:green.gif}}

 2013-06-15: 08:30:: The oenone NFS server crashed, affecting users of '''AUT''', '''BIWI''', '''IKT''' and '''VAW''', including the home directories and webpages of these institutes.

 2013-06-15: 11:45:: We had to reset the machine because the service providing the home directories was unresponsive.

----

<<Anchor(2013-05-16-VPP-Poster-Printer-ETZSPEZ)>>

= ETZSPEZ - HP 6100 encountered print quality issues =
'''STATUS:''' {{attachment:green.gif}}

 2013-05-30 12:00:: The printer has been fixed.

 2013-05-30 09:00:: A technician is repairing the printer right now. The printer will be offline for the next 4 hours.

 2013-05-16 12:00:: An external technician has been informed and will fix the printer in the room ETZ J66 (ETZSPEZ) as soon as possible.

----

<<Anchor(2013-04-23-ID-NAS-problems)>>

= The Informatikdienste have problems with their NAS =
'''STATUS:''' {{attachment:green.gif}}

 2013-04-23 14:42:: As a result of the problems with the ID NAS, the VPP printers are not working properly at the moment.

 2013-04-24 07:00:: All ID Services should be back to normal.

----

<<Anchor(2013-03-06-network-outage-ETH)>>

= Outage Network Infrastructure ETH =
'''STATUS:''' {{attachment:green.gif}}

 2013-03-06 11:55:: The Informatikdienste again have a global problem with the network infrastructure.
 2013-03-06 13:00:: All services should be back. We were informed about the cause by the [[https://www1.ethz.ch/id/servicedesk/sysstat/index|Informatikdienste]]. A hardware load balancer in the RZ crashed and was in an undefined state; therefore the failover to HCI did not work.

----

<<Anchor(2013-02-23-solaris-server-patching)>>

= Maintenance work on D-ITET's central IT infrastructure =
'''STATUS:''' {{attachment:green.gif}}


 2013-02-23: 16:00:: All systems are back online.

 2013-02-23: 10:00-14:00:: To keep our systems up to date with the newest software and security releases, we need to update our servers on a regular basis. For this reason we are going to install the latest Oracle patches on our main servers. The servers will be rebooted during this maintenance.

To prevent data corruption/loss please do the following:
 * save all open files
 * close all running applications
 * logout from all ITET systems (Linux, Solaris, Windows)
 * shutdown your personal PC/Desktop
 * do not establish any connection from outside

More details:
 * servers concerned: DRWHO, TARDIS, OENONE, SPITFIRE, YOSEMITE, MALINA
 * webpages hosted on these systems are NOT available
 * NO mail access, NO outgoing mails (incoming mails WON'T get lost)

----

<<Anchor(2013-02-15-network-outage-ETH)>>

= Outage Network Infrastructure ETH =
'''STATUS:''' {{attachment:green.gif}}

 2013-02-15 08:05:: The Informatikdienste again have a global problem with the network infrastructure.

 2013-02-15 09:10:: The network is coming back, but problems are still present.

 2013-02-15 09:20:: The network is still not 100% recovered.

 2013-02-15 09:45:: The network is completely down again. We still have no update from the Informatikdienste on what is going on.

 2013-02-15 10:45:: The network is still not stable.

 2013-02-15 12:10:: The network is returning to normal; we are working on restoring the services.

 2013-02-15 12:30:: Most of our services are back online.

 2013-02-15 14:00:: We were informed about the cause by the [[https://www1.ethz.ch/id/servicedesk/sysstat/index|Informatikdienste]]. A virtual firewall that was no longer needed was deleted by the network team at ID. As a consequence, the interfaces of all virtual firewalls went down, taking down a large part of the ETH network. A reload of the firewall hardware did not help either, so they had to reinstall all appliances. See the status website for an explanation in German.
----

<<Anchor(2013-02-14-dns-problems)>>

= DNS Outage Network ETH =
'''STATUS:''' {{attachment:green.gif}}

 2013-02-14 14:35:: Currently DNS at ETH is down. ID will keep us informed. This affects all services.

 2013-02-14 14:45:: Services are stabilizing.

----

<<Anchor(2013-02-12-dns-problems)>>

= DNS Outage Network ETH =
'''STATUS:''' {{attachment:green.gif}}

 2013-02-12 16:30:: Currently DNS at ETH is down. ID will keep us informed. This affects all services.

 2013-02-12 18:50:: Services are stabilizing.

 2013-02-12 19:30:: Not all DNS names have been restored yet. However, most services affecting D-ITET customers should work again. More information will be published on the [[https://www1.ethz.ch/id/servicedesk/sysstat/index|Informatikdienste status page]]. If you experience specific problems, please contact us at support@ee.ethz.ch.

----

<<Anchor(2013-02-07-oenone-servercrash)>>

= Server oenone crash =
'''STATUS:''' {{attachment:green.gif}}

 2013-02-06: 23:00:: The oenone NFS server crashed, affecting users of '''AUT''', '''BIWI''', '''IKT''' and '''VAW''', including the home directories and webpages of these institutes.

 2013-02-07: 07:10:: We had to reset the machine because the service providing the home directories was unresponsive.

----

<<Anchor(2013-02-05-people-webserver)>>

= Hanging people.ee.ethz.ch Webserver =
'''STATUS:''' {{attachment:green.gif}}

 2013-02-05: 06:00:: The webserver for people.ee.ethz.ch was hanging this morning and could no longer be pinged.

 2013-02-05: 07:00:: We reset the webserver and are now investigating the issue. It is probably kernel- and hardware-related.

----

<<Anchor(2013-01-14-oenone-servercrash)>>

= Server oenone crash =
'''STATUS:''' {{attachment:green.gif}}

 2013-01-14: 21:28:: oenone crashed, affecting users of '''AUT''', '''BIWI''', '''IKT''' and '''VAW'''.

 2013-01-14 21:52:: oenone rebooted and is running again.

----
<<Anchor(2012-11-15-Webserver Outage)>>

= Webserver Outage =
'''STATUS:''' {{attachment:green.gif}}

 2012-11-15: 14:30:: The erroneous deletion of some Apache configuration files led to outages of the webservers '''oenone''' and '''yosemite''' today between 13:30 and 14:15.

----

<<Anchor(2012-11-13-VPP Service Outage)>>

= VPP Service Outage =
'''STATUS:''' {{attachment:green.gif}}

 2012-11-13: 07:30:: VPP announces a print service outage of about 15 minutes. This should resolve the latest VPP service issues.

----

<<Anchor(2012-11-09-Email Service Outage)>>

= Email Service Outage =
'''STATUS:''' {{attachment:green.gif}}

 2012-11-09: 10:05:: Emails cannot be sent or received. ISG.EE is working to resolve this problem as soon as possible.
 2012-11-09: 12:20:: We are still investigating what caused the outage of our mail server.
 2012-11-09: 15:30:: The amavis daemon (a high-performance interface between the mailer (MTA) and content checkers such as virus scanners) stopped working because its temporary directories were removed. It is not clear what removed these directories. We are still investigating, but in the meantime the mailserver is up and running. '''NO EMAILS WERE LOST''', but mails were sent/received with a delay of 1-2 hours.

----

<<Anchor(2012-11-02-VPP Outage)>>

= VPP Outage =
'''STATUS:''' {{attachment:green.gif}}

 2012-11-02: 16:00:: Jobs sent to VPP printers do not get printed. We are investigating this problem together with VPP.
 2012-11-05: 07:00:: Services are up and running again.

----

<<Anchor(2012-10-25-Network Outage)>>

= Complete Network Outage =
'''STATUS:''' {{attachment:green.gif}}

 2012-10-25: 07:05 - 09:15:: Complete network outage. The cause is still unclear, but it might be a side effect of the router hardware replacement (due to defective hardware) announced yesterday.
 2012-10-25: 09:15:: The Informatikdienste have posted a statement on their [[https://www1.ethz.ch/id/servicedesk/index|status page]]
----

<<Anchor(2012-10-16-Network Outage)>>

= Complete Network Outage =
'''STATUS:''' {{attachment:green.gif}}

 2012-10-16: 20:20 - 20:40 / 22:10 - 23:20 :: The complete network outage could be a side effect of a central router upgrade by the Informatikdienste. We are investigating it.
 2012-10-17: 08:52 :: The Informatikdienste have posted a statement on their [[https://www1.ethz.ch/id/servicedesk/index|status page]]

----

<<Anchor(2012-10-16-VPP Plotter at ETZSPEZ J66)>>

= Outage of the poster printer at ETZ J 66 / ETZSPEZ - 16 October 2012 =
'''STATUS:''' {{attachment:green.gif}}

 2012-10-16: 08:30 - 12:30 :: The HP 6100 plotter at ETZ J66 will be under maintenance and will not be available during that time.

----

<<Anchor(2012-10-10-lists.ee.ethz.ch_downtime)>>

= Mailinglist Downtime (lists.ee.ethz.ch) =
'''STATUS:''' {{attachment:green.gif}}

 2012-10-10: 09:00 - 17:00 :: Due to the migration of our mailing list software to sympa, we will take down the [[http://lists.ee.ethz.ch|lists.ee.ethz.ch]] website. Mails sent to lists.ee.ethz.ch will be put into a HOLD queue and delivered to the mailing lists after the migration. '''No mails will be lost!'''

 2012-10-10: 17:04 :: Mailing lists converted. Services are up and running.
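The hold-and-release pattern described above can be sketched as follows, assuming a Postfix MTA (the MTA in use is not named in this announcement, so treat this purely as an illustration):

```
# Hypothetical Postfix hold/release sequence:
postsuper -h ALL   # before the migration: move queued mail to the hold queue
postsuper -H ALL   # after the migration: release held mail for delivery
```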

----

<<Anchor(2012-09-08-Ldap-Migration)>>

= LDAP Migration =
'''STATUS:''' {{attachment:green.gif}}

 2012-09-08: 10:00 - 17:00 :: No systems or services will be available during the migration.
 
 2012-09-08: 17:30 :: All systems are back online

----

<<Anchor(2012-08-17-Network-Problem-ETH)>>

= Network outage ETH =
'''STATUS:''' {{attachment:green.gif}}

 2012-08-17: 16:30:: There currently seem to be problems at the network level within the ETH network. No further information is available yet.

 2012-08-17: 17:15:: Systems are back online.

----

<<Anchor(2012-06-28-power-outage)>>

= Power outage at ETH =
'''STATUS:''' {{attachment:green.gif}}

 2012-06-28: 19:17:: ETH had a power outage affecting many services. The D-ITET infrastructure was partially affected too.

 2012-06-29: 08:00:: We are currently working on resolving the outstanding issues and bringing back online services which are still down.

 2012-06-29: 09:30:: Further information is now available on the ID status website https://www1.ethz.ch/id/servicedesk/sysstat/index.

 2012-06-29: 10:05:: The cause was a transformer fire in the main building, which caused a power outage in the computer centres from 19:30 to 22:00.

----

<<Anchor(2012-06-20-oenone-servercrash)>>

= Server oenone crash =
'''STATUS:''' {{attachment:green.gif}}

 2012-06-20: 00:00:: oenone crashed, affecting users of '''AUT''', '''BIWI''', '''COLLEGIUM''', '''IKT''' and '''VAW'''.

 2012-06-05: 00:30:: oenone rebooted and is running again.


(!) We are still investigating what caused the crash and will report further information here.

----

<<Anchor(2012-06-05-oenone-server-problem)>>

= oenone: hanging lockd affecting some User homes and webpages =
'''STATUS:''' {{attachment:green.gif}}

 2012-06-05: 21:37:: oenone is unresponsive, affecting users of '''AUT''', '''BIWI''', '''COLLEGIUM''', '''IKT''' and '''VAW'''.

 2012-06-05: 00:00:: oenone was rebooted and is running again.

----

<<Anchor(2012-06-05-alumni-mailserver-problem)>>

= Outage on 2/3 of alumni.ethz.ch mailservers =
'''STATUS:''' {{attachment:green.gif}}

 2012-06-04: 16:00:: Our mailserver cannot deliver emails to alumni.ethz.ch addresses. Reason: the receiving servers return temporary errors: "451 unable to verify user". It looks like something is misconfigured there.
 2012-06-05: 10:00:: It looks like all alumni servers hosted at tophost.ch have a problem. We have created a temporary transport map that delivers the mails to the genotec.ch alumni mailserver (the only one that works). As long as you use our mailserver to send emails to alumni addresses, they will be delivered immediately.
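A transport map of the kind mentioned above can be sketched as follows, assuming a Postfix MTA (the MTA and the relay hostname below are illustrative assumptions, not the actual configuration used):

```
# Hypothetical /etc/postfix/transport entry routing all alumni.ethz.ch
# mail to a single working relay (hostname is a placeholder):
alumni.ethz.ch    smtp:[relay.example.org]

# Activate with:  postmap /etc/postfix/transport
# and in main.cf: transport_maps = hash:/etc/postfix/transport
```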
----
<<Anchor(2012-05-13-oenone-server-problem)>>

= oenone: hanging lockd affecting some User homes and webpages =
'''STATUS:''' {{attachment:green.gif}}

 2012-05-13: 22:50:: oenone is unresponsive. Affecting users of '''AUT''', '''BIWI''', '''COLLEGIUM''', '''IKT''' and '''VAW'''.

 2012-05-14: 00:15:: oenone was rebooted and is running again.

----
<<Anchor(2012-05-08-drwho-server-problem)>>

= drwho: Main server outage affecting 64bit diskless Linux clients =
'''STATUS:''' {{attachment:green.gif}}

 2012-05-08: 11:50:: We are currently experiencing problems with one of our main servers. All 64bit diskless clients are affected. We are working on the problem. Some svn repositories might also be affected.

 2012-05-08: 13:15:: The system is going back to normal but needs some time to fully recover.

 2012-05-08: 13:45:: The system should be back to normal. We are still working on recovering some individual hosts.

----


<<Anchor(2012-03-13-solaris-server-patching)>>

= Maintenance work on D-ITET's central IT infrastructure =
'''STATUS:''' {{attachment:green.gif}}

 2012-03-13: 21:30 :: All systems are back online.

 2012-03-13: 19:00 - 22:00 :: To keep our systems up to date with the newest software and security releases, we need to update our servers on a regular basis. For this reason we are going to install the latest Oracle patches on our main servers. The servers will be rebooted during this maintenance.

To prevent data corruption/loss please do the following:
 * save all open files
 * close all running applications
 * logout from all ITET systems (Linux, Solaris, Windows, Mac OS X)
 * shutdown your personal PC/Desktop
 * do not establish any connection from outside

More details:
 * servers concerned: DRWHO, TARDIS, OENONE, SPITFIRE, YOSEMITE, MALINA
 * webpages hosted on these systems are NOT available
 * NO mail access, NO outgoing mails (incoming mails WON'T get lost)

----


<<Anchor(2012-02-16-zaan-maintenance)>>

= Maintenance downtime of Server behind ipp2vpp.ee.ethz.ch (printing/licenses/dhcp for selfmanaged hosts) =
'''STATUS:''' {{attachment:green.gif}}

On '''2012-02-16''' around '''8:00 AM''' we are scheduling a maintenance downtime of zaan (serving printing, the license server and the DHCP server for D-ITET). The downtime will be as short as possible; we plan to have it down for around 45 minutes at most.

 '''2012-02-16 08:45:''' Server is up and running again.

----

<<Anchor(2012-02-13-oenone-crash)>>

= oenone home server crash =
'''STATUS:''' {{attachment:green.gif}}

 2012-02-13:: Tonight at around 22:40 '''oenone''', one of our home servers, crashed. Users with homes on oenone were affected: '''BIWI''', '''VAW''', '''Collegium Helveticum''', '''Control''', '''ISI''', '''IKT'''. The system was up again at 00:40.

We apologize for the inconvenience caused and are investigating the problem.


----
<<Anchor(2012-01-31-oenone-emergency-reboot)>>

= Emergency reboot of ITET's server OENONE =
'''STATUS:''' {{attachment:green.gif}}

 2012-01-31: 21:00:: Emergency reboot of server OENONE due to an unresponsive storage area. Users concerned: '''BIWI, VAW, Collegium Helveticum, Control, ISI, IKT, EEH'''

 2012-01-31: 21:45:: Server oenone is up again and all services are running.

----
<<Anchor(2012-01-23-linux-reboots)>>

= Emergency reboots of all Linux Clients and Servers =
'''STATUS:''' {{attachment:green.gif}}

 2012-01-23: 15:00:: Due to a critical issue we were forced to reboot all affected hosts. We are sorry for the short notice and the inconvenience caused.

 2012-01-23: 20:45:: All clients have been rebooted.

----
<<Anchor(2011-12-17-tardis-oenone-reboot)>>

= Maintenance work on D-ITET's home server TARDIS and OENONE =
'''STATUS:''' {{attachment:green.gif}}

 2011-12-17: 11:00 AM:: Successful reboot of TARDIS and OENONE.
 
 2011-12-17 10:00AM - 11:00AM:: During the last installation of Oracle patches, a bug was introduced in the automount daemon, causing high CPU load on systems with a high number of auto-mounted file systems. We have investigated this problem together with Oracle. A bug fix is now available, but it requires a server reboot. For this reason we are going to reboot TARDIS and OENONE during the time window above.

To prevent data corruption/loss please do the following:
 * save all open files
 * close all running applications
 * logout from all ITET systems (Linux, Solaris, Windows, Mac OS X)
 * shutdown your personal PC/Desktop
 * do not establish any connection from outside


More details:
 * servers concerned: TARDIS, OENONE
 * webpages hosted on these systems are NOT available
 * NO mail access, NO outgoing mails (incoming mails WON'T get lost)

----
<<Anchor(2011-11-29-oenone-crash)>>

= oenone home server crash =
'''STATUS:''' {{attachment:green.gif}}

 2011-11-29:: Tonight at around 21:30 '''oenone''', one of our home servers, crashed. Users with homes on oenone were affected: '''BIWI''', '''VAW''', '''Collegium Helveticum''', '''Control''', '''ISI''', '''IKT'''. The system was up again at 23:00.

We apologize for the inconvenience caused and are investigating the problem.


----

<<Anchor(2011-11-14-power-outage)>>

= Power outage affecting compute clusters =
'''STATUS:''' {{attachment:green.gif}}
 2011-11-14:: Due to a power outage, some racks containing our compute clusters went down.

 2011-11-14 08:00:: All compute clusters should be up and running again.

----

<<Anchor(2011-11-12-solaris-server-patching)>>

= Maintenance work on D-ITET's central IT infrastructure =
'''STATUS:''' {{attachment:green.gif}}

 2011-11-12: 05:50 PM:: Final reboot of TARDIS successfully completed.

 2011-11-12: 04:00 PM:: The final reboot of tardis is still outstanding due to a broken disk in TARDIS's internal RAID (boot device). The broken disk has been successfully replaced, but the RAID is still syncing. The reboot is postponed until the sync process has finished and the reboot can be carried out safely, so be prepared for a short interruption today or tomorrow.

 2011-11-12: 04:00 PM:: All systems are back online.

 2011-11-12: 10:00 AM - 04:00 PM:: To keep our systems up to date with the newest software and security releases, we need to update our servers on a regular basis. For this reason we are going to upgrade specific storage software and install the latest Oracle patches on our main servers. The servers will be rebooted multiple times during this maintenance.

To prevent data corruption/loss please do the following:
 * save all open files
 * close all running applications
 * logout from all ITET systems (Linux, Solaris, Windows, Mac OS X)
 * shutdown your personal PC/Desktop
 * do not establish any connection from outside

More details:
 * servers concerned: DRWHO, TARDIS, OENONE, SPITFIRE, YOSEMITE, MALINA
 * webpages hosted on these systems are NOT available
 * NO mail access, NO outgoing mails (incoming mails WON'T get lost)

----


<<Anchor(2011-10-14-cronbox-migration)>>
= Migration of cronbox.ee.ethz.ch to Debian Squeeze =
'''STATUS:''' {{attachment:green.gif}}
 2011-10-14: 07:30:: We plan to migrate the server behind cronbox.ee.ethz.ch to the new version of the Debian operating system. The expected downtime is from 07:30 until around 10:00. The affected services are '''cron''' and '''ssh''' logins to the machine.
 2011-10-14: 11:00:: Migration completed.

----

<<Anchor(2011-09-30-matlab-license-server-down)>>
= Matlab License Server down =
'''STATUS:''' {{attachment:green.gif}}
 2011-09-30: 07:00:: Currently the Informatikdienste server providing the license service is down. You can track the current status on their [[https://www.komcenter.ethz.ch/home/idServices/show|ID-Service Status]] page under Lizenzen -> 1965@vnava.

 2011-09-30: 08:45:: License server from Informatikdienste is now up again.

 2011-09-30: 15:30:: The license server lic-matlab.ethz.ch is unavailable again. Since it is outside our control, we cannot estimate when it will work again.

 2011-09-30: 16:00:: lic-matlab.ethz.ch is available again.

----

<<Anchor(2011-08-30-nfs-problems-on-oenone)>>
= NFS outage on oenone =
'''STATUS:''' {{attachment:green.gif}}
 2011-08-30: 02:00 AM - 08:30 AM:: During the night at around 02:00, the NFS services on '''oenone''' crashed. Users with homes on oenone were affected: '''BIWI''', '''VAW''', '''Collegium Helveticum''', '''Control''', '''IBT''', '''IKT'''. The crash also affected all webservices which depend on oenone, and the mailserver (at least for those having their home directory on oenone). No mails were lost, as we put them into a hold queue!

 Update: 08:30:: oenone is now up and running again. All mails in the hold queue are now being gradually delivered. The webservices are also all available again.

We apologize for the inconvenience caused and are investigating the problem.

----

<<Anchor(2011-08-23-admin-ch-connection-problems)>>
= Connection problems to all admin.ch servers =
'''STATUS:''' {{attachment:green.gif}}
 2011-08-23: 03:00 PM:: All admin.ch websites and mailservers are reachable again.
 2011-08-23: 01:25 PM:: Currently all traffic to the admin.ch servers is disrupted. This includes the websites as well as email connections. Your sent mails are not lost, as our server keeps them until it can connect to the destination.

----

<<Anchor(2011-08-23-solaris-server-patching)>>

= Maintenance work on D-ITET's central IT infrastructure =
'''STATUS:''' {{attachment:green.gif}}

 2011-08-23: CANCELLATION:: The maintenance was cancelled due to an unresolved error in the Solaris operating system introduced by the latest patch set.

 2011-08-23: 7:00 PM - 10:00 PM:: To keep our systems up to date with the newest software and security releases, we need to update our servers on a regular basis. For this reason we are going to patch and reboot some of our Solaris servers.

To prevent data corruption/loss please do the following:
 * save all open files
 * close all running applications
 * logout from all ITET systems (Linux, Solaris, Windows, Mac OS X)
 * shutdown your personal PC/Desktop
 * do not establish any connection from outside

More details:
 * servers concerned: DRWHO, TARDIS, OENONE, SPITFIRE, YOSEMITE, MALINA
 * webpages hosted on these systems are NOT available
 * NO mail access, NO outgoing mails (incoming mails WON'T get lost)
----

<<Anchor(2011-08-19-switch-routing-loop)>>

= Routing problems on switch.ch network =
'''STATUS:''' {{attachment:green.gif}}
 2011-08-19: 05:00 PM:: Routing problems solved.
 2011-08-19: 02:22 PM:: Because of a routing problem on the switch.ch network, all traffic to http://www.virginia.edu and their mailserver is disrupted.

----

<<Anchor(2011-08-08-drwho-problems)>>

= Outage drwho.ee.ethz.ch =
'''STATUS:''' {{attachment:green.gif}}

 2011-08-13 09:00:: Since around 01:00 we have been experiencing problems on one of our main servers, affecting most of the services.

 2011-08-13 11:00:: All services back to normal.

----

<<Anchor(2011-08-08-dns_outage)>>

= ETH wide DNS outage =
'''STATUS:''' {{attachment:green.gif}}

 2011-08-08 10:00:: ETH-wide DNS outage. All services using name resolution do not work. Our mailserver rejects all incoming messages with {{{450 4.3.2 Service currently unavailable}}}. Properly configured mailservers will retry the delivery later, so no mail is lost.

 2011-08-08 10:20:: DNS works again. All services are up and running.
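The retry behaviour mentioned above follows from SMTP's reply-code classes: 4xx replies are transient failures, so a conforming sender queues the message and tries again later, while 5xx replies are permanent rejections and the message bounces. A minimal sketch of the distinction:

```python
def is_temporary_failure(code: int) -> bool:
    """SMTP reply codes in the 4xx range are transient: the sending
    server should keep the message queued and retry delivery later.
    5xx codes are permanent failures and the message bounces."""
    return 400 <= code <= 499

print(is_temporary_failure(450))  # True: senders retry, so no mail is lost
print(is_temporary_failure(550))  # False: permanent rejection
```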


----

<<Anchor(2011-06-22-colombo04)>>

= colombo04 not available =
'''STATUS:''' {{attachment:green.gif}}

 2011-06-23 11:00:: The power supply of colombo04 has been replaced. colombo04 is up and ready.

 2011-06-22 07:00:: The power supply of colombo04 broke. We are now waiting for a replacement from Oracle.

----

<<Anchor(2011-06-06-biwinas01)>>

= biwinas01 not available =
'''STATUS:''' {{attachment:green.gif}}

 2011-06-08 08:00:: biwinas01 is now back online.

 2011-06-07 15:00:: The hardware supplier returned the server today. They had to replace the following hardware:
 * 1 CPU
 * 1 power supply
 * 1 fan

We will test the server now and bring it back online as soon as possible.

 2011-06-06 15:00:: biwinas01 is currently out of order due to a hardware failure. It might take several days until biwinas01 is back online.

----
<<Anchor(2011-06-06-ifhlux11)>>

= ifhlux11 not available =
'''STATUS:''' {{attachment:green.gif}}

 2011-06-08 11:20:: ifhlux11 is now back online.

 2011-06-07 15:00:: We are in contact with the supplier. Unfortunately, the reason for the crash of ifhlux11 is not known yet.

 2011-06-06 15:00:: ifhlux11 is currently out of order due to a hardware failure. It might take several days until ifhlux11 is back online.

----
<<Anchor(2011-06-03-Cooling-System-Replacement)>>

= Coming soon: Outage of several IT services due to cooling system replacement =

'''STATUS:''' {{attachment:green.gif}}

 2011-06-03 15:00:: All servers are again up and running with the exceptions of '''biwinas01''' and '''ifhlux11''' (see above).

'''UPCOMING: 2011-06-03 09:00 - 2011-06-06 09:00'''

The refrigeration supply in the ETZ building will be replaced. One server room
has to be shut down completely (ETZ/D/96.2; all compute servers of the institutes).
The other server room (ETZ/F/66) will get an emergency cooling system. This will,
however, not allow the cooling of all currently running servers within that room.
Consequently, we will shut down as many servers as possible.

'''Basic services like email, user homes and other network shares, printing and login to
Windows or Linux workstations are not expected to be affected by this construction work.'''
In general, if you are not familiar with the server names or service terms below, you
should not be affected at all.

'''These services will not be available during the construction work times:'''
 * Shut down 2011-06-03 at '''09:00'''. Back online 2011-06-06 at 09:00:
  * Remote-Desktop-Access to the Windows Terminal Servers
   * quinn
   * sivi
   * vega7
  * All institute NAS servers (no access via Samba, link in home, ssh, etc.)
   * biwinas01-03
   * hamam01
   * ifenas01
   * ibtnas01
  * Most IFE compute servers
   * bernstein
   * coltrane
   * dylan
   * haydn
   * marley
   * mozart
  * Publication databases on sato (IFA)

 * Shut down 2011-06-03 at '''16:00'''. Back online 2011-06-06 at 09:00:
  * Computer rooms for students
   * ETZ D 61.1
   * ETZ D 61.2
   * ETZ D 96
  * All institute compute servers
   * autserv*
   * bender*
   * biwilux*
   * casseri*
   * colombo*
   * IFE compute servers cash and elvis
   * ifhlux*
   * nariwork*
   * tik*x
   * vierzack*

----
<<Anchor(2011-05-26-Firewall-Problem)>>

= Firewall Problem: some Services not reachable =
'''STATUS:''' {{attachment:green.gif}}

'''2011-05-26 07:15 - 2011-05-26 16:00'''

Due to problems with the firewall hardware, some services on the D-ITET servers were not reachable.

----
<<Anchor(2011-05-19-polaris-and-zaan-reboot)>>

= Maintenance reboot of Servers behind login.ee.ethz.ch and ipp2vpp.ee.ethz.ch (printing/licenses) =
'''STATUS:''' {{attachment:green.gif}}

On '''2011-05-19''' around '''7:00AM''' we will perform a maintenance reboot of polaris (serving login.ee.ethz.ch) and zaan (serving printing at D-ITET). During this downtime it will not be possible to print via samba or cups.

----
<<Anchor(2011-05-09-galen-reboot)>>

= Maintenance reboot of Server behind people.ee.ethz.ch =
'''STATUS:''' {{attachment:green.gif}}

'''2011-05-10 07:45'''

On '''2011-05-10''' around '''7:00 AM''' we will perform a maintenance reboot of galen, the server serving your personal homepage on people.ee.ethz.ch.

----

<<Anchor(2011-05-02-horde-outage)>>

= Horde Webmail outage =
'''STATUS:''' {{attachment:green.gif}}

'''2011-05-02 23:49 - 2011-05-03 02:27'''

The [[https://email.ee.ethz.ch|Horde Webmail Client]] had to be taken down for security reasons. It was not clear whether someone had used a zero-day exploit or a phished account to send spam through our server. It turned out that the attackers had used a phished account.

/!\ '''Please Remember: We at ISG.EE will NEVER ask you for your password.'''

----

<<Anchor(2011-03-01-short-smtp-outage)>>

= Short SMTP outage =
'''STATUS:''' {{attachment:green.gif}}

 2011-02-28 14:49 - 15:04:: As a result of an LDAP failure, we had to stop the mail server for 15 minutes to prevent the rejection of incoming emails. While the main server was down, our two backup MX hosts collected incoming emails.

----

<<Anchor(2011-02-28-zhadum-crash)>>

= Windows Terminal Server zhadum out-of-operation =
'''STATUS:''' {{attachment:green.gif}}

 2011-02-28 15:00:: Please use the server '''vega7''' from now on. If you had access to zhadum before, you should also be able to access vega7. '''Please use your NETHZ username and password to log in.'''

 2011-02-28 14:00:: The hardware of the departmental Windows terminal server '''zhadum''' is broken. A replacement server should be available soon.

----
<<Anchor(2011-02-22-jabba-upgrade)>>

= Upgrade of Backup Server JABBA =
'''STATUS:''' {{attachment:green.gif}}

 2011-02-22 18:15:: The migration is finished and the new JABBA server is online.

 2011-02-22 7:00 AM - 7:00 PM (approx.):: We are going to upgrade the department's backup server JABBA. This upgrade includes changes to the software as well as the hardware. During the upgrade, no restore requests for lost data can be fulfilled.

'''The complete backup infrastructure and all related services are NOT available during the upgrade.'''

----
<<Anchor(2011-02-07-agilent-license-server-change)>>

= Agilent ADS/ICCAP License Server Change =
'''STATUS:''' {{attachment:green.gif}}

 2011-02-07 10:00:: As announced a week ago, the license server for Agilent ADS and ICCAP software has changed. If your client software still uses the old license server information, please make sure you change the license server to '''lic-agilent.ee.ethz.ch'''. The port stays the same.
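For FlexNet-based clients, the license server is typically selected via an environment variable. A minimal sketch, assuming the generic FlexNet variable `LM_LICENSE_FILE` (your software may use a vendor-specific variable instead) and a placeholder port, since the announcement states the port does not change:

{{{
# Hypothetical sketch: point FlexNet-based tools at the new license server.
# 27000 is only a placeholder port -- keep whatever port your old setting used,
# as the announcement says the port stays the same.
export LM_LICENSE_FILE="27000@lic-agilent.ee.ethz.ch"
echo "$LM_LICENSE_FILE"
}}}

Check your vendor's documentation for the exact variable name your client software reads.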

----
<<Anchor(2011-01-31-svn-firewall)>>

= Subversion server not reachable from outside ETH Zurich =
'''STATUS:''' {{attachment:green.gif}}

 2011-01-31 16:30:: Our subversion server is now available again for users outside of ETH Zurich.

 2011-01-31: 16:15:: We have been told that the central firewall rules cannot be changed right now due to other (completely unrelated) problems. At the moment it is unknown when this will be fixed.

 2011-01-31: 15:30:: Our subversion server svn.ee.ethz.ch is at the moment not accessible through the svn:// protocol from outside ETH Zurich due to a firewall configuration error. We are working on it...

----
<<Anchor(2011-01-26-mailser-problems-after-patching)>>

= Mailserver problems while patching =
'''STATUS:''' {{attachment:green.gif}}

 2011-01-26: 11:00 AM:: During the server upgrade last night, a patch temporarily misconfigured the mail server. The server accepted incoming mails but could not place them into the users' mailboxes. EVERY such mail created a bounce. Because of this, our statement that no mails get lost while we update the servers is no longer fully true: incoming mails that bounced have to be resent by the sender.

----
<<Anchor(2010-12-14-solaris-server-patching)>>

= Solaris Server Patching =
'''STATUS:''' {{attachment:green.gif}}

 2011-01-25: 10:30 PM:: All servers and services are back online

 2011-01-25: 7:00 PM - 10:00 PM:: To keep our systems up to date with the newest software and security releases, we need to update our servers on a regular basis. For this reason we are going to patch and reboot some of our Solaris servers.

Servers concerned: '''drwho''', '''tardis''', '''oenone''', '''spitfire''', '''yosemite''', '''malina'''.

To prevent data corruption/loss please do the following:

 * All diskless clients (Linux): please logout and shutdown all DL clients
 * All Windows systems: please logout and shutdown all Windows clients
 * All Mac systems: please logout and shutdown all Mac clients
 * All user homes: please logout from these servers
 * No NFS or SAMBA access to user homes
 * No mail access, no outgoing mails (incoming mails WON'T get lost)
 * Webpages hosted on these systems are unavailable

----
<<Anchor(2010-12-19-delayed-email-delivery)>>

= Delayed Email delivery =
'''STATUS:''' {{attachment:green.gif}}

 2011-01-19: 11:00 AM - 4:30 PM:: As a result of a faulty [[http://lurker.clamav.net/thread/20110119.125839.2b4ce0e1.en.html|ClamAV signature file]], every email containing a PDF file was marked as infected. Before we could resend the quarantined emails we had to fix the issue. No mail was lost and everything was resent.

 Update: 2011-01-20: 10:33 AM:: The ClamAV signatures have been updated and tested. Everything is working as it should.

----
<<Anchor(2010-12-07-reboot-yosemite)>>

= Maintenance Reboot of Solaris Server Yosemite =
'''STATUS:''' {{attachment:green.gif}}

 2010-12-07: 7:30 AM:: Server yosemite has been rebooted successfully. All services are available.

 2010-12-07: 7:00 AM:: Due to a shortage of available memory we are forced to reboot the Solaris server yosemite. Downtime approx. 30 minutes.

----
<<Anchor(2010-11-26-servers-down)>>

= cooling water system outage on clusters =
'''STATUS:''' {{attachment:green.gif}}

 2010-11-26: 5:00 PM:: Host autserv02 is running as well. All hosts can be used.

 2010-11-26: 4:40 PM:: Server racks are cooled again; all hosts except autserv02 are running and can be used.

 2010-11-26: 4:00 PM:: Server racks are still down. --> Update follows at 5 PM or earlier

 2010-11-26: 3:10 PM:: One of the cooling water pumps installed in ETZ/D/96.2 does not work correctly. This forced some of the racks in this server room to shut down in order to protect the servers from thermal damage. '''Clusters from IFH, IBT, BIWI, TIK, IKT and VAW are affected.''' The facility management is working on solving the problem. --> Update follows at 4 PM

----
<<Anchor(2010-11-23-email-phishing-attack)>>

= email phishing attack =
'''STATUS:''' {{attachment:green.gif}}

 2010-11-23:: Yesterday between 18:20 and 19:50, about 320 phishing mails were sent to different users at D-ITET. The mails pretend to come from ''IT Support Group'' and carry the subject ''ISG.EE Webmail Alert''. The mail claims that ''spammers'' have compromised ''the'' ISG.EE webmail account and asks you to provide your '''username, password''' and an '''alternate email'''. Please remember that the ISG.EE team will '''NEVER ask you for your password!''' If you have replied to this phishing mail, please contact us '''immediately''' at support@ee.ethz.ch so that we can plan the next steps with you to keep your account safe.

----
<<Anchor(2010-11-17-oenone-crash)>>

= oenone home server crash =
'''STATUS:''' {{attachment:green.gif}}

 2010-11-17:: Tonight at around 00:15, '''oenone''', one of our home servers, crashed. Users with homes on oenone were affected; these are '''BIWI''', '''VAW''', '''Collegium Helveticum''', '''Control''', '''IBT''', '''IKT'''. The server is now checking its filesystems and coming up again.

We are sorry for the inconvenience caused and are investigating the problem.

 Update: 08:00:: oenone is now up and running again.
 Update: 2010-11-18 07:30:: We opened a support case at Sun/Oracle for this server.

----
<<Anchor(2010-11-16-servers-down)>>

= cooling water system outage for some clusters =
'''STATUS:''' {{attachment:green.gif}}

 2010-11-16:: Last Friday evening, one of the cooling water pumps installed in ETZ/D/96.2 stopped working correctly. This forced some of the racks in this server room to shut down in order to protect the servers from thermal damage. '''All clusters from IFH, IBT, BIWI, TIK, IKT and VAW were affected.'''

The facility management is working on solving the problem.

The servers are currently (08:35) down again. Even if they come up again, please do not use them for long-running computations, as we do not yet know when the technician will have solved the issue.

 Update: 2010-11-17 08:25:: The rack systems are now running with only one cooling water pump. A new pump has been ordered from the rack company.
 Update: 2010-11-18 16:00:: The planned replacement of the broken pump will take place on 25.11 or 26.11.
||<style="border: medium none;"> {{attachment:Status/green.gif}} ||<style="border: medium none;">Resolved ||
||<style="border: medium none;"> {{attachment:Status/orange.gif}} ||<style="border: medium none;">Still working but with some errors ||
||<style="border-width: medium medium 1px; border-top: medium none rgb(85, 136, 238); border-left: medium none rgb(85, 136, 238); border-right: medium none rgb(85, 136, 238); border-color: rgb(85, 136, 238);"> {{attachment:Status/red.gif}} ||<style="border-width: medium medium 1px; border-top: medium none rgb(85, 136, 238); border-left: medium none rgb(85, 136, 238); border-right: medium none rgb(85, 136, 238); border-color: rgb(85, 136, 238);">Pending ||

= Current status reports =

<<Anchor(2020-07-11-storage-downtime)>>
== Planned project/ archive storage downtime and client reboot ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-07-11 12:00:: Migration has been completed, all services are back to operational state.

  2020-07-11 08:00:: Migration started, services are shut down.

  2020-07-11 8:00-12:00:: Start of planned maintenance work. Project/archive storage services (known under the names "ocean", "bluebay", "lagoon" and "benderstor") will not be available. ISG-managed Linux clients will be rebooted.



<<Anchor(2020-06-04-svnsrv-upgrade)>>
== svn.ee.ethz.ch downtime for server upgrade ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-06-04 07:05:: Webservices for managing SVN repositories are enabled.
  2020-06-04 06:15:: The system upgrade is done and access to the SVN repositories via the `svn` and `https` transport protocols is back online.
  2020-06-04 06:00:: The server serving the SVN repositories will be upgraded to a new operating system version. During this timeframe, outages of access to the SVN repositories are expected.

<<Anchor(2020-05-17-cluster-abuse)>>
== European HPC cluster abuse ==
'''Status:''' {{attachment:Status/green.gif}}<<BR>>
Recently, European HPC clusters have been attacked and abused for cryptocurrency mining. The D-ITET Slurm and SGE clusters have not been compromised. We are monitoring the situation closely.
  2020-05-17 08:30:: No successful login from known attacker IP addresses could be determined, and none of the files indicating a compromise have been found on our file systems.
  2020-05-16 14:30:: No unusual cluster job activity was observed.

<<Anchor(2020-05-04-itetnas04-upgrade)>>
== D-ITET Netscratch downtime for server upgrade ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-05-04 06:00:: Server upgrade has been completed.
  2020-05-04 06:00:: The server serving the D-ITET Netscratch service will be upgraded to a new operating system version. During this timeframe, outages of the NFS service are expected.

<<Anchor(2020-04-07-network-interuption)>>
== Network outage ETx router ==
'''Status:''' {{attachment:Status/green.gif}}
  2020-04-07 05:30:: There was an issue on the router `rou-etx`. The ID networking team tracked down and solved the issue. There was an interruption of about 10 minutes for the ETx networking zone, affecting almost all ISG.EE-maintained systems.

<<Anchor(2020-04-06-mira-maintenance)>>
== login.ee.ethz.ch: Reboot for maintenance ==
'''Status:''' {{attachment:Status/green.gif}}
  2020-04-06 05:35:: The system behind `login.ee.ethz.ch` has been rebooted for maintenance and to increase the available resources.

See the [[RemoteAccess|information on accessing D-ITET resources remotely]]. To distribute the load better, users are encouraged to use the VPN service whenever possible.

<<Anchor(2020-02-18-nostro-maintenance)>>
== itet-stor (FindYourData) Server maintenance: Reconfiguration of VM parameters ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-02-18 19:03:: System again up and running.
  2020-02-18 19:00:: Scheduled downtime for the [[Workstations/FindYourData|itet-stor/FindYourData service]] due to maintenance work on the underlying server.

<<Anchor(2020-01-20-nostro-os-upgrade)>>
== itet-stor (FindYourData) Server migration: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-01-20 07:15:: OS upgrade done, there were short interruptions to the [[Workstations/FindYourData|itet-stor/FindYourData service]].
  2020-01-20 06:00:: We will update the server servicing the [[Workstations/FindYourData|FindYourData service]] from Debian jessie 8 to Debian stretch 9. There will be short downtimes accessing this service during the update.

<<Anchor(2019-10-01-maintenance-project-storage)>>
== Project Storage maintenance ==
'''Status:''' {{attachment:Status/green.gif}}
  2019-10-01 06:00:: The needed maintenance was performed and the storage services (NFS and Samba) are back online.
  2019-10-01 05:50:: We will perform HW maintenance of the system holding the project storage. NFS and Samba services to access the storage will be offline for 10-15min between 05:50 and 06:15.

<<Anchor(2019-09-17-power-outage)>>
== Power outage ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-09-17 12:11:: Power reestablished.
  2019-09-17 10:29:: Power outage in ET area. For up-to-date information please visit https://ethz.ch/services/de/it-services/service-desk.html#meldungen

<<Anchor(2019-07-31-outage-project-storage)>>
== Outage of Project Storage ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-09-17 12:11:: After power was re-established following the outage, the storage system failed again until 12:30.
  2019-09-05 15:30:: Project homes on bluebay were unavailable until 19:10 due to another hardware failure.
  2019-08-28 07:00:: While the service is running stably, we are still in contact with the cluster filesystem developers to further track down the issues seen in the last weeks.
  2019-08-22 06:00:: Planned reboot of the second system in the cluster and the related management node.
  2019-08-21 06:00:: Planned reboot of the first system (expect a downtime of 5-10 minutes).
  2019-08-18 03:45:: The full sync of the missing storage target finally succeeded; full redundancy is regained for now. We are still in contact with the developers of the cluster filesystem to further analyse the problem.
  2019-08-14 08:00:: Full redundancy has still not been regained. We are also in contact with the developers of the cluster filesystem to analyse the current problems.
  2019-08-12 07:00:: The sync of the last storage target is ongoing and has reached 82%.
  2019-08-08 18:00:: Heavy load on the storage system while the resync was still running caused high access latencies. Several services were unavailable or responded very slowly until about 22:30. The situation is normalizing, but until the full resync succeeds there is potential for further outages.
  2019-08-08 15:55:: The storage system is operational again and an initial high load phase due to resynchronization is finished.
  2019-08-08 14:28:: There are again problems with the storage servers...
  2019-08-07 16:30:: Due to errors while re-syncing the remaining target we needed to restart a full sync on this last target. Unfortunately this means that redundancy in the storage system will not yet be regained.
  2019-08-07 06:55:: High I/O load on the system during further syncing caused accessing issues on the projects between 06:15 and 06:55. There is still an issue with one target which needs a re-sync which might cause further access outages.
  2019-08-06 10:50:: Re-syncing is still ongoing, one storage target is missing, redundancy of the underlying storage system is not yet regained.
  2019-08-05 13:00:: Re-syncing of the broken targets is still ongoing, thus redundancy of the storage system is not yet regained.
  2019-08-02 17:00:: NFS throughput has normalized, causes are investigated. Recurrences have to be expected.
  2019-08-02 15:50:: High load on the storage server slowed NFS connections down to the point of timeouts.
  2019-08-02 15:05:: The synchronization from the healthy system 2 to system 1 has again been started. Once this synchronization is complete (in roughly 3 days), the system is finally redundant again.
  2019-08-02 12:45:: The hardware has been fixed. The storage services (NFS/Samba) have been powered on.
  2019-08-02 11:50:: The failing hardware component is currently being replaced by the hardware supplier.
  2019-08-02 10:00:: The failing hardware of system 1 has been identified. Availability of hardware replacement parts and successful data consistency check will determine the availability of the storage services.
  2019-08-02 08:30:: The replaced hardware in system 1 is again faulty, while system 2 seems (hardware-wise) OK. Due to data inconsistency, the storage service cannot provide its services.
  2019-08-01 16:30:: Before system 1 had synchronized all data from system 2, the latter system went down. From this point on, project and archive storage was no longer available.
  2019-07-31 22:00:: A high load from one of the backup processes led to an NFS service interruption. After postponing that backup process, NFS recovered.
  2019-07-31 21:30:: The hardware failure led to two corrupt storage targets on system 1. After reinitializing the affected targets, the synchronization of data from the healthy system 2 started.
  2019-07-31 15:45:: Service is restored, hardware exchange was successful. Restoring full redundancy is pending.
  2019-07-31 15:00:: Project homes on bluebay are unavailable due to replacement of faulty hardware part. Estimated downtime of ~1h.
  2019-07-31 07:40:: Service is restored, still running with reduced redundancy
  2019-07-31 06:20:: Project homes on bluebay are unavailable due to deactivation of faulty hardware part
  2019-07-31 02:40:: Storage system 2 took over. Service is restored but running with reduced redundancy.
  2019-07-31 02:30:: Project homes on bluebay are unavailable due to a hardware failure on system 1


<<Anchor(2019-07-22-outage-net-scratch)>>
== Outage net_scratch service ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-08-28 07:00:: The old server is now decommissioned. We will re-bootstrap the service on new hardware in the coming weeks and re-announce the service when it is ready to be used.

  2019-08-12 07:00:: Read-only access to partially recoverable data has been removed again as planned. We are planning to re-instantiate the service on new hardware.

  2019-07-26 12:00:: Read-only access to partially recoverable data has been made available on `login.ee.ethz.ch` under `/mnt`. This volume will be available until August 9th.

  2019-07-23 08:00:: We are in contact with the hardware vendor about the further steps to take with the respective RAID controller and to double-check whether there is an issue with it. The data on `net_scratch` is most likely lost.

  2019-07-22 15:35:: We are still trying to recover the data, but the filesystem was corrupted by a hardware hiccup in the underlying RAID system. Affected by this is [[Services/NetScratch|/itet-stor/<username>/net_scratch]].

  2019-07-22 12:00:: Filesystem errors for the device holding the `net_scratch` service were reported. The server is currently offline due to severe errors on the filesystem.

<<Anchor(2019-07-19-ptrace-disable)>>
== Disabling ptrace on managed Debian/GNU Linux computers ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-07-22 09:00::
   New images are rolled out. As clients reboot they will automatically pick up this new release with a fixed Linux kernel.

  2019-07-21 15:00::
   New images are available. On its next reboot, a managed client will boot with a patched kernel and the mitigation disabled.

  2019-07-19 13:00::
   A current Linux kernel bug allows any unprivileged user to gain root access. A proof of concept code snippet that exploits this vulnerability is publicly available.

   To protect our systems we temporarily disabled {{{ptrace(2)}}} on all managed Linux systems. All software depending on {{{ptrace(2)}}} will completely or at least partially fail. A prominent example is the GNU Debugger {{{gdb}}}.

   A patched Linux kernel will come soon. Once this new kernel is running, we will enable {{{ptrace}}} again.
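As background: one common mechanism to restrict {{{ptrace(2)}}} system-wide on Linux is the Yama LSM sysctl; whether this is the exact mechanism used on the managed systems here is an assumption on our part. A sketch for inspecting the current restriction level on a client:

{{{
# Print the Yama ptrace restriction level, if the kernel exposes it:
#   0 = classic ptrace semantics, 1 = restricted to descendants,
#   2 = admin-only attach, 3 = ptrace attach fully disabled
if [ -r /proc/sys/kernel/yama/ptrace_scope ]; then
    cat /proc/sys/kernel/yama/ptrace_scope
else
    echo "Yama ptrace_scope not available on this kernel"
fi
}}}

A value of 2 or 3 would explain tools like {{{gdb}}} failing to attach to running processes.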

<<Anchor(2019-07-18-outage-project-storage)>>
== Outage of Project Storage ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-07-18 18:45:: Service is back to normal
  2019-07-18 17:00:: Project homes on bluebay are unavailable

<<Anchor(2019-04-24-Power-Outage-Services-down)>>
== D-ITET services outages ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-04-24 14:18:: Virtualisation cluster back with full redundancy. User-facing virtual machines were already back online at 07:30.
  2019-04-24 13:00:: An issue between the redundant switches caused the network problem, leaving the cluster in an inconsistent state and rebooting all virtual machines. The networking people are investigating the issue between the switches further.
  2019-04-24 09:00:: Further analysis is ongoing, but the health status of the virtualisation cluster was affected, leading to resets of all hosted virtual machines.
  2019-04-24 07:30:: We are bringing services back to normal.
  2019-04-24 07:00:: A planned outage in the ETZ building caused stability issues on several D-ITET services, in particular the HOME and mail services.

<<Anchor(2019-02-25-verdi-repair-filesystem-inconsitency)>>
== Downtime HOME server: Repair filesystem inconsistency ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-02-25 06:18:: System is back online.
  2019-02-25 06:00:: We identified a filesystem usage accounting discrepancy on one filesystem on the HOME server, requiring us to take the server down and repair the underlying filesystem. The home storage is the default storage location for personal accounts on computers managed by ISG.EE.

<<Anchor(2019-01-25-D-ITET-RDS-Maintenance)>>
== One of two RDS servers is not reachable ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-01-25 23:50:: Maintenance issues have been resolved. All RDS servers are up and running now.
  2019-01-25 14:00:: The RDS maintenance window has ended, but one server still has pending updates. Logins are not allowed until this issue has been fixed.


<<Anchor(2018-11-06-D-ITET-itetmaster01)>>
== Upgrade license, database and distributed computing management server itetmaster01 ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-11-12 12:00:: PostgreSQL services are online.
  2018-11-12 09:31:: Arton-Cluster and Condor Grid updates are finished.
  2018-11-12 07:30:: Arton-Cluster updated (final checks pending)
  2018-11-12 07:00:: System and license services upgraded. Still pending: Arton, Condor and PostgreSQL Upgrades.
  2018-11-12 06:00 - 12:00:: Planned upgrade of `itetmaster01`

<<Anchor(2018-11-08-beegfs-storage-maintenance)>>
== Maintenance Project Storage ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-11-08 07:00:: Services back online (some recovering slowly)
  2018-11-08 06:15:: Starting downtime for project storage due to an important maintenance on master node.

<<Anchor(2018-11-01-D-ITET-RDS-UNSTABLE)>>
== D-ITET RDS frontend is difficult to reach due to AD name resolution issues ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-11-02 08:00:: RDS is working normally. Protective measures have been put into place to ensure the AD name is updated correctly.
  2018-11-01 09:45:: worli.d.ethz.ch cannot be resolved by the AD name service. DNS works fine, but the AD-DNS synchronization seems to be in an unstable state. We are in contact with the responsible team of the central IT services.

 * WORKAROUND:
  * D-ITET users: Connect directly to vela.ee.ethz.ch
  * IFE users: Connect directly to neon.ee.ethz.ch


<<Anchor(2018-08-10-jabba-tape-library-problems)>>
== Jabba Tape Library HW Problems ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-08-17 07:00:: '''Update:''' All tape library issues have been solved. All backup and archive services are back online

  2018-08-15 12:00:: '''Update:''' There are still issues with the tape library. It will be down for at least another day.

  2018-08-13 11:30:: '''Update:''' The failed tape drive has been replaced. The management PC is still not working. Due to the very old hardware, the supply of spare parts takes longer than usual. The tape library remains offline at least until tomorrow afternoon.

  2018-08-10 10:00:: Due to a failed tape drive and a defective management PC, Jabba's tape library is not working. The Jabba server itself is not affected. We are in contact with IBM and hope the problem will be fixed by next Monday.

What does this mean for you:
  * BACKUP
    * Store new data: New data is stored in the SAM-FS cache area first but cannot be written to tape afterwards, i.e. the backup process cannot be completed yet; it will start automatically as soon as the tape library works again.
    * Access existing data: Only data still available in the SAM-FS cache area can be accessed. There is NO ACCESS to data located on tape only.

  * ARCHIVE
    * Store new data: New data is stored in the SAM-FS cache area first. In a second step the data is copied to an archive disk, but the second copy to tape will fail, i.e. the archive process can only be completed halfway. The copy to tape will start automatically as soon as the tape library works again.
    * Access existing data: There is no limitation in accessing archive data.

<<Anchor(2018-06-16-D-ITET-mail-server-downtime)>>
== D-ITET mail server downtime: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-06-16 08:24:: System is up and running, all tests done. Everything should work as intended. If you find errors, please contact us under <support@ee.ethz.ch>
  2018-06-16 08:08:: System is up and running, we are performing some final tests before releasing access with little delay as previously announced.
  2018-06-16 07:11:: Everything works as planned currently
  2018-06-16 06:00:: Due to a planned operating system update, the D-ITET mail server will be unavailable today, June 16, 2018 between 06:00 and 08:00.

<<Anchor(2018-04-24-major-outage-incident)>>
== Major outage virtualization cluster/networking switch ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-04-24 08:56:: Sending of emails is restored. Incoming mail should not be lost for any properly behaving sending mail server, since the issue caused a temporary error notification to the sending server, which should in turn retry delivering the email later with some delay.
  2018-04-24 07:45:: Bringing back online most important services, including home service; issue being investigated.
  2018-04-24 06:29:: Major outage of the networking/virtualization cluster, taking down important D-ITET services (home server, parts of the mail system, Linux clients).

<<Anchor(2018-04-06-jabba-maintenance)>>
== Jabba Maintenance ==
'''Status:''' {{attachment:Status/green.gif}}
  2018-04-06 08:10:: Jabba is back online
  2018-04-06 07:00:: Jabba is offline due to maintenance work

<<Anchor(2018-03-10-d-itet-storage-migration)>>
== D-ITET Storage Migration ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-03-10 15:00:: Migration of user homes completed.
  2018-03-10 14:15:: User homes migrated, access is unblocked again, some post-migration tasks still pending.
  2018-03-10 10:00:: D-ITET user homes will be migrated from the ID storage to the D-ITET storage. During the entire migration, access to the user homes of the affected users is blocked. Affected users are informed directly by email.

<<Anchor(2018-02-12-svnsrv-os-upgrade)>>
== svn.ee.ethz.ch Server migration: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-02-12 08:55:: Server upgrade has been completed and all services up and running again.

  2018-02-12 06:15:: Start updating server from Debian Wheezy 7 to Debian Stretch 9. Downtimes for `https://svn.ee.ethz.ch`, `svn://svn.ee.ethz.ch` and `https://svnmgr.ee.ethz.ch`.

<<Anchor(2018-02-05-cronbox-os-upgrade)>>
== Cronbox/Login Server migration: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-02-05 07:00:: The host `mira` has been upgraded to Debian 9 Stretch. SSH Host keys fingerprints for RSA and ED25519 are:
  {{{
4096 MD5:fc:a8:00:5b:64:90:86:a1:fb:49:75:ef:55:58:90:b3 (RSA)
4096 SHA256:v48HAAAjr+avnPAESdQzazSriKYZeTGGtIPKfoE8Dg0 (RSA)
256 SHA256:SgvaiZyIgzujLJdbtRij5VGUOXm/IuAs3MkMYtGZNhc (ED25519)
256 MD5:3b:b0:1a:8a:ea:0a:e5:ea:bb:9e:bb:5c:ef:24:c3:92 (ED25519)
}}}
The SSH host key is as well listed on: https://people.ee.ethz.ch/

  2018-01-31 11:00:: The host `mira` holding the cronbox and login service will be upgraded to Debian 9 Stretch on 2018-02-05 at 06:10.

<<Anchor(2018-01-25-upgrade-itetnas02)>>
== Upgrade of Server itetnas02 ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-01-25 07:30:: Upgrade completed.

  2018-01-24 16:45:: On 2018-01-25 around 06:10 we will upgrade the server `itetnas02`. Several short outages for Fileservices (Samba, NFS) are expected. Services for project accounts and dedicated shares for biwi, ibt, ini and tik are affected.

<<Anchor(2017-11-10-outage-itetnas03)>>
== Outage of Server itetnas03 ==
'''Status:''' {{attachment:Status/green.gif}}

  2017-11-15 07:00:: Battery unit replaced

  2017-11-10 07:20:: The server is back online but without its battery unit. We will need to shut down `itetnas03` again once the problem is isolated and can be fixed.

  2017-11-10 06:15:: The server itetnas03 is down due to hardware problems (A battery replacement caused controller problems). ISG and the hardware vendor are currently working to get this problem solved.

<<Anchor(2017-11-07-CifsHome)>>
== User Home accessibility ==
'''Status:''' {{attachment:Status/green.gif}}

  2017-11-08 06:25:: Informatikdienste have reverted a change which caused the problems accessing all users' HOMEs via the CIFS (SAMBA) protocol.

  2017-11-07 08:00:: All users' HOMEs are currently not accessible via the CIFS (SAMBA) protocol. NFS access is still available.

<<Anchor(2017-10-24-ibtnas02)>>
== Outage of Server ibtnas02 ==
'''Status:''' {{attachment:Status/green.gif}}

  2017-10-31 08:00:: Upgrade successfully completed

  2017-10-30 16:50:: The server will be upgraded to a new OS release on 2017-10-31, starting around 06:15. Short outages of the Samba and NFS services are to be expected.

  2017-10-25 10:00:: ibtnas02 now serves all partitions but the problem is not yet identified

  2017-10-24 15:00:: The server ibtnas02 is up again (partition data-08 is not available)

  2017-10-24 12:50:: The server ibtnas02 is down again

  2017-10-24 09:30:: The server ibtnas02 is back online

  2017-10-24 08:00:: The server ibtnas02 is down due to hardware problems

<<Anchor(2017-10-21-itetnas03)>>
== Outage of Server itetnas03 ==
'''Status:''' {{attachment:Status/green.gif}}

  2017-10-23 18:15:: Data are also accessible via NFS.

  2017-10-23 9:30:: The server is up. Data are accessible via Samba. NFS file service is still down.

  2017-10-21 15:00:: The server itetnas03 is down due to hardware problems



<<Anchor(2017-10-18-outage-etz-d-96-2)>>
== Outage of Servers in Serverroom ETZ/D/96.2 ==
'''Status:''' {{attachment:Status/green.gif}}

  2017-10-20 13:45:: All racks in ETZ/D/96.2 are working again (cooling problem solved).

  2017-10-20 10:00:: The technician will arrive at 13:00. Some servers are running, but without water cooling, so any rack might shut down at any time if the air cooling is not sufficient. This will most probably happen again while the technician is working in the room (i.e. this afternoon).

  2017-10-19 18:30:: The cooling engineer could not fix the problem, so some servers are still offline. Another technician will try to fix the cooling system tomorrow morning.

  2017-10-18 14:00:: Cooling system is still not working correctly, we only selectively powered on a couple of compute machines.

  2017-10-18 12:50:: The problem has been localized and repaired. We need to wait for the circuit to cool down.

  2017-10-18 10:30:: Outage of most racks in ETZ/D/96.2 (cooling problem) . Most compute servers are offline.


<<Anchor(2017-05-13-outage-etz-d-96-2)>>
== Outage Servers in Serverroom ETZ/D/96.2 ==
'''Status:''' {{attachment:Status/green.gif}}

  2017-05-13 20:00:: Outage of some racks in ETZ/D/96.2. Several compute servers offline.
  2017-05-13 23:59:: Most of the servers are back online.
  2017-05-15 08:45:: Status of remaining servers verified. All back online.

<<Anchor(2017-03-24-cronbox-login-ssh-keys)>>
== Cronbox/Login Server migration: new SSH host key ==
'''Status:''' {{attachment:Status/green.gif}}

  2017-03-24 17:00:: The cronbox and login server has moved to a new host. A new SSH host key has been generated:
  {{{
4096 MD5:fc:a8:00:5b:64:90:86:a1:fb:49:75:ef:55:58:90:b3 (RSA)
4096 SHA256:v48HAAAjr+avnPAESdQzazSriKYZeTGGtIPKfoE8Dg0 (RSA)
}}}
The SSH host key is also listed on: https://people.ee.ethz.ch/

  Remember:: '''Always''' verify the fingerprint of an SSH host key before accepting it.
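As a sketch of that verification step: the fingerprint of a remote host key can be obtained and compared before the first interactive login. The hostname below is only an example; `ssh-keyscan` and `ssh-keygen -l` are standard OpenSSH tools.

```shell
# Fetch the RSA host key of a server and print its SHA256 fingerprint
# (example hostname; compare the output with the published value):
ssh-keyscan -t rsa login.ee.ethz.ch 2>/dev/null | ssh-keygen -lf -

# On the server itself, the operator derives the published fingerprint
# from the public host key file:
ssh-keygen -lf /etc/ssh/ssh_host_rsa_key.pub
```

Only accept the host key in your SSH client if the fingerprint it shows matches the published one character for character.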

<<Anchor(2017-01-07-Mailsystem migration)>>
== EE Mailsystem migration ==
'''STATUS:''' {{attachment:Status/green.gif}} '''Mailsystem up'''

  2017-01-08 15:00:: The new mail system is now started. In case of unexpected problems we will stop it again to prevent data loss and to analyze the problem.

  2017-01-07 24:00:: Not all test cases could be performed. We now plan to enable the new system around noon.

  2017-01-07 20:45:: Old mail server configuration migrated; starting the mail server testing.

  2017-01-07 14:00:: User mailbox data migrated; starting the mail server configuration migration.

  2017-01-07 07:00:: All mail services are stopped. Mailbox data copy started.


= Archived status reports =

[[Status/Archive/2015|2015]]
[[Status/Archive/2014|2014]]
[[Status/Archive/2013|2013]]
[[Status/Archive/2012|2012]]
[[Status/Archive/2011|2011]]
[[Status/Archive/2010|2010]]

= Current status reports =

<<Anchor(2020-07-11-project-storage-downtime)>>
== Planned project/archive storage downtime and client reboot ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-07-11 12:00:: Migration has been completed, all services are back to operational state.
  2020-07-11 08:00:: Migration started, services are shut down.
  2020-07-11 8:00-12:00:: Start of planned maintenance work. Project/archive storage services (known under the names "ocean", "bluebay", "lagoon" and "benderstor") will not be available. ISG-managed Linux clients will be rebooted.

<<Anchor(2020-06-04-svn-upgrade)>>
== svn.ee.ethz.ch downtime for server upgrade ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-06-04 07:05:: Webservices for managing SVN repositories are enabled.
  2020-06-04 06:15:: System upgrade is done and access to the SVN repositories via the svn and https transport protocols is back online.
  2020-06-04 06:00:: The server servicing the SVN repositories will be upgraded to a new operating system version. During this timeframe outages of access to the SVN repositories are expected.

<<Anchor(2020-05-16-hpc-cluster-abuse)>>
== European HPC cluster abuse ==
'''Status:''' {{attachment:Status/green.gif}}

Recently European HPC clusters have been attacked and abused for mining purposes. The D-ITET Slurm and SGE clusters have not been compromised. We are monitoring the situation closely.

  2020-05-17 08:30:: No successful login from known attacker IP addresses could be determined, and none of the files indicating a compromise were found on our file systems.
  2020-05-16 14:30:: No unusual cluster job activity was observed.

<<Anchor(2020-05-04-netscratch-upgrade)>>
== D-ITET Netscratch downtime for server upgrade ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-05-04 06:00:: Server upgrade has been completed.
  2020-05-04 06:00:: The server servicing the D-ITET Netscratch service will be upgraded to a new operating system version. During this timeframe outages of the NFS service are to be expected.

<<Anchor(2020-04-07-etx-router-outage)>>
== Network outage ETx router ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-04-07 05:30:: There was an issue on the router rou-etx. The ID networking team tracked down and solved the issue. There was an interruption of about 10 minutes for the ETx networking zone, affecting almost all ISG.EE maintained systems.

<<Anchor(2020-04-06-login-reboot)>>
== login.ee.ethz.ch: Reboot for maintenance ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-04-06 05:35:: The system behind login.ee.ethz.ch has been rebooted for maintenance and to increase available resources.

See the information on accessing D-ITET resources remotely. To distribute the load better, users are encouraged to use the VPN service whenever possible.

<<Anchor(2020-02-18-itet-stor-maintenance)>>
== itet-stor (FindYourData) Server maintenance: Reconfiguration of VM parameters ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-02-18 19:03:: System is up and running again.
  2020-02-18 19:00:: Scheduled downtime for the itet-stor/FindYourData service due to maintenance work on the underlying server.

<<Anchor(2020-01-20-itet-stor-migration)>>
== itet-stor (FindYourData) Server migration: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2020-01-20 07:15:: OS upgrade done; there were short interruptions of the itet-stor/FindYourData service.
  2020-01-20 06:00:: We will update the server servicing the FindYourData service from Debian 8 jessie to Debian 9 stretch. There will be short downtimes accessing this service during the update.

<<Anchor(2019-10-01-project-storage-maintenance)>>
== Project Storage maintenance ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-10-01 06:00:: The needed maintenance was performed and the storage services (NFS and Samba) are back online.
  2019-10-01 05:50:: We will perform hardware maintenance on the system holding the project storage. NFS and Samba services to access the storage will be offline for 10-15 minutes between 05:50 and 06:15.

<<Anchor(2019-09-17-power-outage)>>
== Power outage ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-09-17 12:11:: Power reestablished.
  2019-09-17 10:29:: Power outage in the ET area. For up-to-date information please visit https://ethz.ch/services/de/it-services/service-desk.html#meldungen

<<Anchor(2019-07-31-project-storage-outage)>>
== Outage of Project Storage ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-09-17 12:11:: Just as power was reestablished after the outage, there was another failure of the storage system, lasting until 12:30.
  2019-09-05 15:30:: Project homes on bluebay are unavailable until 19:10 due to another hardware failure.
  2019-08-28 07:00:: While the service is running stably, we are still in ongoing communication with the cluster filesystem developers to further track down the issues seen in the last weeks.
  2019-08-22 06:00:: Planned reboot of the second system in the cluster and the related management node.
  2019-08-21 06:00:: Planned reboot of the first system (expect a downtime of 5-10 minutes).
  2019-08-18 03:45:: Full sync of the missing storage target finally succeeded. Full redundancy is regained for now. We are still in contact with the developers of the cluster filesystem to further analyse the problem.
  2019-08-14 08:00:: Full redundancy is still not regained. We are also in contact with the developers of the cluster filesystem to analyse the current problems.
  2019-08-12 07:00:: Sync of the last storage target is ongoing and has reached 82%.
  2019-08-08 18:00:: Heavy load on the storage system, while the resync still needed to be performed, caused high latencies on access. Several services were unavailable or responded very sluggishly until about 22:30. The situation is normalizing, but until the full re-sync succeeds there is potential for further outages.
  2019-08-08 15:55:: The storage system is operational again and an initial high load phase due to resynchronization is finished.
  2019-08-08 14:28:: There are again problems with the storage servers.
  2019-08-07 16:30:: Due to errors while re-syncing the remaining target we needed to restart a full sync on this last target. Unfortunately this means that redundancy in the storage system is not yet regained.
  2019-08-07 06:55:: High I/O load on the system during further syncing caused access issues on the projects between 06:15 and 06:55. There is still an issue with one target which needs a re-sync, which might cause further access outages.
  2019-08-06 10:50:: Re-syncing is still ongoing; one storage target is missing, and redundancy of the underlying storage system is not yet regained.
  2019-08-05 13:00:: Re-syncing of the broken targets is still ongoing, thus redundancy of the storage system is not yet regained.
  2019-08-02 17:00:: NFS throughput has normalized; the causes are being investigated. Recurrences have to be expected.
  2019-08-02 15:50:: High load on the storage server slowed NFS connections down to the point of timeouts.
  2019-08-02 15:05:: The synchronization from the healthy system 2 to system 1 has been started again. Once this synchronization is complete (in roughly 3 days), the system will finally be redundant again.
  2019-08-02 12:45:: The hardware has been fixed. The storage services (NFS/Samba) have been powered on.
  2019-08-02 11:50:: The failing hardware component is currently being replaced by the hardware supplier.
  2019-08-02 10:00:: The failing hardware of system 1 has been identified. Availability of hardware replacement parts and a successful data consistency check will determine the availability of the storage services.
  2019-08-02 08:30:: The replaced hardware in system 1 is again faulty, while system 2 seems (hardware-wise) OK. Due to data inconsistency, the storage service can not provide its services.
  2019-08-01 16:30:: While system 1 has not yet synchronized all data from system 2, the latter system goes down. Starting from here, project and archive storage is not available anymore.
  2019-07-31 22:00:: A high load of one of the backup processes leads to an NFS service interruption. After postponing that backup process, NFS recovers.
  2019-07-31 21:30:: The hardware failure led to two corrupt storage targets on system 1. After reinitializing the affected targets, the synchronization of data from the healthy system 2 starts.
  2019-07-31 15:45:: Service is restored; the hardware exchange was successful. Restoring full redundancy is pending.
  2019-07-31 15:00:: Project homes on bluebay are unavailable due to replacement of a faulty hardware part. Estimated downtime of ~1h.
  2019-07-31 07:40:: Service is restored, still running with reduced redundancy.
  2019-07-31 06:20:: Project homes on bluebay are unavailable due to deactivation of a faulty hardware part.
  2019-07-31 02:40:: Storage system 2 took over. Service is restored but running with reduced redundancy.
  2019-07-31 02:30:: Project homes on bluebay are unavailable due to a hardware failure on system 1.

<<Anchor(2019-07-22-net-scratch-outage)>>
== Outage net_scratch service ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-08-28 07:00:: The old server is now decommissioned. We will re-bootstrap the service on new hardware in the next weeks and re-announce the service when it is ready to be used.
  2019-08-12 07:00:: Read-only access to partially recoverable data has been removed again as planned. We are planning to re-instantiate the service on new hardware.
  2019-07-26 12:00:: Read-only access to partially recoverable data has been made available on login.ee.ethz.ch under /mnt. This volume will be available until August 9th.
  2019-07-23 08:00:: We are in contact with the hardware vendor to determine further steps for the respective RAID controller and to double-check whether there is an issue with it. Data on net_scratch is most likely lost.
  2019-07-22 15:35:: We are still trying to recover the data, but the filesystem was corrupted due to a hardware hiccup on the underlying RAID system. Affected by this is /itet-stor/<username>/net_scratch.
  2019-07-22 12:00:: Filesystem errors for the device holding the net_scratch service were reported. The server is currently offline due to severe errors on the filesystem.

<<Anchor(2019-07-19-ptrace-disabled)>>
== Disabling ptrace on managed Debian/GNU Linux computers ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-07-22 09:00:: New images are rolled out. As clients reboot they will automatically pick up this new release with a fixed Linux kernel.
  2019-07-21 15:00:: New images are available. On its next reboot, a managed client will boot with a patched kernel and the mitigation disabled.
  2019-07-19 13:00:: A current Linux kernel bug allows any unprivileged user to gain root access. A proof-of-concept code snippet that exploits this vulnerability is publicly available.
  :: To protect our systems we have temporarily disabled ptrace(2) on all managed Linux systems. All software depending on ptrace(2) will completely or at least partially fail. A prominent example is the GNU debugger gdb.
  :: A patched Linux kernel will come soon. Once this new kernel is running, we will enable ptrace again.
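The entry above does not say which mechanism was used to disable ptrace(2); one common way to restrict it system-wide on Linux is the Yama LSM sysctl, sketched here purely for illustration:

```shell
# kernel.yama.ptrace_scope values (Yama LSM):
#   0 = classic ptrace semantics: any process may attach to other
#       processes of the same user
#   1 = only a direct ancestor (or a declared tracer) may attach
#   2 = attaching requires CAP_SYS_PTRACE
#   3 = ptrace attach is disabled entirely; this cannot be relaxed
#       again without a reboot
cat /proc/sys/kernel/yama/ptrace_scope      # show the current setting
sudo sysctl -w kernel.yama.ptrace_scope=3   # disable ptrace attach
```

With scope 3 in effect, tools such as gdb and strace fail to attach to running processes, which matches the breakage described above.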

<<Anchor(2019-07-18-project-storage-outage)>>
== Outage of Project Storage ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-07-18 18:45:: Service is back to normal.
  2019-07-18 17:00:: Project homes on bluebay are unavailable.

<<Anchor(2019-04-24-ditet-services-outages)>>
== D-ITET services outages ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-04-24 14:18:: Virtualisation cluster is back with full redundancy. User-affecting virtual machines were back online already at 07:30.
  2019-04-24 13:00:: An issue between the redundant switches caused the network problem, leaving the cluster in an inconsistent state and rebooting all virtual machines. The networking team is investigating the issue between the switches further.
  2019-04-24 09:00:: Further analysis is ongoing, but the health status of the virtualisation cluster was affected, leading to resets of all hosted virtual machines.
  2019-04-24 07:30:: We are bringing services back to normal.
  2019-04-24 07:00:: A planned outage in the ETZ building caused stability issues on several D-ITET services, in particular the HOME and mail services.

<<Anchor(2019-02-25-home-server-downtime)>>
== Downtime HOME server: Repair filesystem inconsistency ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-02-25 06:18:: System is back online.
  2019-02-25 06:00:: We identified a filesystem usage accounting discrepancy on one filesystem on the HOME server, requiring us to take down the server and run a repair of the underlying filesystem. The home storage is the default storage location for personal accounts on computers managed by ISG.EE.

<<Anchor(2019-01-25-rds-server)>>
== One of two RDS servers is not reachable ==
'''Status:''' {{attachment:Status/green.gif}}

  2019-01-25 23:50:: Maintenance issues have been resolved. All RDS servers are up and running now.
  2019-01-25 14:00:: The RDS maintenance window has ended, but one server still has pending updates. Logins are not allowed until this issue has been fixed.

<<Anchor(2018-11-12-itetmaster01-upgrade)>>
== Upgrade license, database and distributed computing management server itetmaster01 ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-11-12 12:00:: PostgreSQL services are online.
  2018-11-12 09:31:: Arton cluster and Condor grid updates are finished.
  2018-11-12 07:30:: Arton cluster updated (final checks pending).
  2018-11-12 07:00:: System and license services upgraded. Still pending: Arton, Condor and PostgreSQL upgrades.
  2018-11-12 06:00 - 12:00:: Planned upgrade of itetmaster01.

<<Anchor(2018-11-08-project-storage-maintenance)>>
== Maintenance Project Storage ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-11-08 07:00:: Services are back online (some recovering slowly).
  2018-11-08 06:15:: Starting downtime for project storage due to important maintenance on the master node.

<<Anchor(2018-11-01-rds-ad-name-resolution)>>
== D-ITET RDS frontend is difficult to reach due to AD name resolution issues ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-11-02 08:00:: RDS is working normally. Protective measures were put into place to ensure the AD name is updated correctly.
  2018-11-01 09:45:: worli.d.ethz.ch can not be resolved by the AD name service. DNS works fine, but AD-DNS synchronization seems to be in an unstable state. We are in contact with the responsible team of central IT services.
  Workaround::
   * D-ITET users: Connect directly to vela.ee.ethz.ch
   * IFE users: Connect directly to neon.ee.ethz.ch

<<Anchor(2018-08-10-jabba-tape-library)>>
== Jabba Tape Library HW Problems ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-08-17 07:00:: Update: All tape library issues have been solved. All backup and archive services are back online.
  2018-08-15 12:00:: Update: There are still issues with the tape library. It will be down for at least another day.
  2018-08-13 11:30:: Update: The failed tape drive has been replaced. The management PC is still not working. Due to the very old hardware, the supply of spare parts takes longer than usual. The tape library remains offline at least until tomorrow afternoon.
  2018-08-10 10:00:: Due to a failed tape drive and a defective management PC, Jabba's tape library is not working. The Jabba server itself is not affected. We are in contact with IBM and we hope the problem will be fixed by next Monday.

What this means for you:

 * BACKUP
  * Store new data: New data is stored in a SAM-FS cache area first, but can not be written to tape afterwards, i.e. the backup process can not currently be completed; it will be started automatically as soon as the tape library works again.
  * Access existing data: Only access to data still available in the SAM-FS cache area is possible. NO ACCESS to data located on tape only.
 * ARCHIVE
  * Store new data: New data is stored in a SAM-FS cache area first. In a second step, data is copied to an archive disk, but the second copy to tape will fail, i.e. the archive process can only be completed by half. The copy to tape will be started automatically as soon as the tape library works again.
  * Access existing data: There is no limitation in access to archive data.

<<Anchor(2018-06-16-mail-server-downtime)>>
== D-ITET mail server downtime: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-06-16 08:24:: System is up and running, all tests done. Everything should work as intended. If you find errors, please contact us at <support@ee.ethz.ch>.
  2018-06-16 08:08:: System is up and running; we are performing some final tests before releasing access with a little delay, as previously announced.
  2018-06-16 07:11:: Everything is currently going as planned.
  2018-06-16 06:00:: Due to a planned operating system update, the D-ITET mail server will be unavailable today, June 16, 2018, between 06:00 and 08:00.

<<Anchor(2018-04-24-virtualization-outage)>>
== Major outage virtualization cluster/networking switch ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-04-24 08:56:: Sending of emails is restored again. Incoming mail should not be lost for any properly behaving sending mail server, since the issue caused a temporary error notification to the sending server, which should in turn retry submitting the email correctly later on with some delay.
  2018-04-24 07:45:: Bringing the most important services back online, including the home service; the issue is being investigated.
  2018-04-24 06:29:: Major outage of the networking/virtualization cluster taking down important D-ITET services (home server, partially the mail system, Linux clients).

<<Anchor(2018-04-06-jabba-maintenance)>>
== Jabba Maintenance ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-04-06 08:10:: Jabba is back online.
  2018-04-06 07:00:: Jabba is offline due to maintenance work.

<<Anchor(2018-03-10-storage-migration)>>
== D-ITET Storage Migration ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-03-10 15:00:: Migration of user homes completed.
  2018-03-10 14:15:: User homes migrated, access is unblocked again, some post-migration tasks still pending.
  2018-03-10 10:00:: D-ITET user homes will be migrated from ID Storage to D-ITET Storage. During the whole migration time, access to the user homes is blocked for the affected users. Affected users are informed directly by email.

<<Anchor(2018-02-12-svn-migration)>>
== svn.ee.ethz.ch Server migration: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-02-12 08:55:: Server upgrade has been completed and all services are up and running again.
  2018-02-12 06:15:: Start updating the server from Debian 7 Wheezy to Debian 9 Stretch. Downtimes for https://svn.ee.ethz.ch, svn://svn.ee.ethz.ch and https://svnmgr.ee.ethz.ch.

<<Anchor(2018-02-05-cronbox-login-migration)>>
== Cronbox/Login Server migration: New operating system version ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-02-05 07:00:: The host mira has been upgraded to Debian 9 Stretch. The SSH host key fingerprints for RSA and ED25519 are:
  {{{
4096 MD5:fc:a8:00:5b:64:90:86:a1:fb:49:75:ef:55:58:90:b3 (RSA)
4096 SHA256:v48HAAAjr+avnPAESdQzazSriKYZeTGGtIPKfoE8Dg0 (RSA)
256 SHA256:SgvaiZyIgzujLJdbtRij5VGUOXm/IuAs3MkMYtGZNhc (ED25519)
256 MD5:3b:b0:1a:8a:ea:0a:e5:ea:bb:9e:bb:5c:ef:24:c3:92 (ED25519)
}}}
  The SSH host keys are also listed on: https://people.ee.ethz.ch/

  2018-01-31 11:00:: The host mira holding the cronbox and login service will be upgraded to Debian 9 Stretch on 2018-02-05 at 06:10.

<<Anchor(2018-01-25-itetnas02-upgrade)>>
== Upgrade of Server itetnas02 ==
'''Status:''' {{attachment:Status/green.gif}}

  2018-01-25 07:30:: Upgrade completed.
  2018-01-24 16:45:: On 2018-01-25 around 06:10 we will upgrade the server itetnas02. Several short outages of file services (Samba, NFS) are expected. Services for project accounts and dedicated shares for biwi, ibt, ini and tik are affected.

<<Anchor(2017-11-10-itetnas03-outage)>>
== Outage of Server itetnas03 ==
'''Status:''' {{attachment:Status/green.gif}}

  2017-11-15 07:00:: Battery unit replaced.
  2017-11-10 07:20:: The server is back online but without its battery unit. We will need to shut down itetnas03 again once the problem is isolated and can be fixed.
  2017-11-10 06:15:: The server itetnas03 is down due to hardware problems (a battery replacement caused controller problems). ISG and the hardware vendor are currently working to get this problem solved.

<<Anchor(2017-11-07-user-home-accessibility)>>
== User Home accessibility ==
'''Status:''' {{attachment:Status/green.gif}}

  2017-11-08 06:25:: Informatikdienste have reverted a change which caused the problems accessing all users' HOME via the CIFS (Samba) protocol.
  2017-11-07 08:00:: All users' HOMEs are currently not accessible via the CIFS (Samba) protocol. NFS access is still available.


CategoryEDUC

Status (last edited 2023-10-16 11:24:17 by alders)