Koumbit Network Status


Wednesday, May 20, 2009

Maintenance window, 20-5-2009 13:00-0400

Who is affected

All the users of the shared hosting service, emails and websites alike.

When this will happen

May 20th, between 13:00 and 14:00 EDT (UTC-4).

What will happen

The main database server will be replaced with a more powerful machine.

A new file server will be put on line.

Why

The current database server has been the main performance bottleneck since February and we have tried numerous times to replace it to improve performance of the hosting cluster. We are hoping this will be the final operation required for at least a few months.

The new file server aims to reduce our dependency on the main server, which currently handles every function except serving web pages, including file service. By moving file service to a dedicated server, we will ensure better redundancy and scalability. Since the new server also supports hot-swapping hard drives, hardware replacements will be easier and will not require any downtime.

How

We will take the whole cluster down for at least 30 minutes, between 13h00 and 13h30 (UTC-4). We hope to complete both operations within those 30 minutes, but if we run into problems we may extend the work, up to a maximum of 1h. All services should therefore be back to normal (and faster!) by 14h00 (UTC-4).

If there's any modification to that timeline, an update will be posted, as usual, on http://offline.koumbit.net/.

Koumbit members can see the details here: https://wiki.koumbit.net/RapportsIntervention/2009-05-20

I object!

If this intervention is too problematic for you or your organisation, please let us know beforehand to see if we can arrange otherwise.

Monday, April 13, 2009

Maintenance window wed apr 15 at 20:00

Who is affected

All the users of the shared hosting service, emails and websites alike.

When

April 15th 2009, between 20h00 and 20h30 EDT (UTC-4).

What will happen

During the maintenance window, the main server will be shut down for a short period of time (10 minutes) to replace a hard drive that is showing signs of weakness.

A new database server will also be put online later during the night.

Why

We want to act proactively to remove any chance of a disk failure requiring an emergency intervention.

Additionally, the new database server will improve general performance.

How

The main server will be shut down to replace the drive, which should cause about 10 minutes of downtime. New memory and a second CPU will be installed in the new database server, which will then be put online during the remainder of the night; this should cause only a minor, limited outage.

I object!

If this operation is too problematic for you or your organisation, please let us know within 24h so that we can arrange a workaround.

Monday, March 30, 2009

SSL certificates renewal and MySQL server upgrade

SSL certificate changes and problems

We recently had to renew our SSL certificates for the koumbit.net domain, since it has already been a year since we bought the wildcard certificate for *.koumbit.net. Since that portion of the infrastructure is not completely automated (i.e. not provisioned through Puppet), we forgot to configure some services, mainly email, which were only configured on Saturday, March 28th. As a result, you may have gotten a warning on that day, and that day only.

Otherwise, connections to the *.koumbit.net domains should not generate a warning in most modern browsers. I repeat: if your browser is generating an error on an SSL-encrypted connection to our domains, it is likely to be a man-in-the-middle attack.

If you need to verify the certificate, you can rely on these fingerprints, signed with my personal PGP key.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

SHA1 Fingerprint=91:D5:7D:CA:5C:24:84:E6:F9:EC:8F:E3:55:19:A4:A4:E9:50:3E:D1
MD5 Fingerprint=60:D0:AD:42:EC:5C:CD:75:BA:77:9C:63:B8:F2:7C:06
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAknRCZcACgkQWGBzs0AjcC+7iQCgjaRdDaIMoIgrVURTR0x8FwQ9
CFgAn0q7Buo19n3EGjUPVSqNs5qfW0rh
=pwsu
-----END PGP SIGNATURE-----
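
To compare the certificate a server presents against the fingerprints above, something like the following Python sketch could be used. It is only an illustration: the helper names are made up for this example, the host name in the usage note is an assumption, and it relies solely on the standard `ssl` and `hashlib` modules.

```python
import hashlib
import ssl


def format_fingerprint(der_bytes, algorithm="sha1"):
    """Return the colon-separated, upper-case hex digest of a DER certificate."""
    digest = hashlib.new(algorithm, der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def server_fingerprint(host, port=443, algorithm="sha1"):
    """Fetch the certificate presented by host:port and fingerprint it.

    Needs network access. The certificate is fetched without validation,
    which is intended here: it is compared by hand with the signed
    fingerprint instead of being trusted through a certificate authority.
    """
    pem = ssl.get_server_certificate((host, port))
    der = ssl.PEM_cert_to_DER_cert(pem)
    return format_fingerprint(der, algorithm)
```

Calling `server_fingerprint("www.koumbit.net")` (an assumed host name) should return the SHA1 line above for as long as the certificate announced in this post is the one in service; pass `algorithm="md5"` to compare against the MD5 line.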

MySQL server upgrade

In other news, we have also noticed the recent slowdown of our shared hosting services and we are working on it. In the coming weeks, we will install a new database server with 12GB of RAM and two dual-core processors. Since the CPU is backordered, there are additional delays in the delivery of the hardware. We will send another announcement when the server is put into production, as this will create a small outage.

Tuesday, March 17, 2009

Minor problems fixed (voicemail and mails from web)

We have had some problems with voicemail recently, so if you have been getting "voicemail full" messages when calling the office, those should be fixed.

We also had a crash on an unmonitored mail server that caused serious delays (5-14 days) in the delivery of emails sent from one of the web servers in the load balancer. There are around 3 thousand such messages that are slowly being delivered as we speak.

The monitoring system will be fixed to warn us properly of those crashes in the future.

Saturday, February 28, 2009

MySQL maintenance window Wednesday night

Who is affected

All the sites hosted on our shared hosting servers.

When

Maintenance window:

  • Begins: 2009-03-04 23:59:59 EST
  • Ends: 2009-03-05 00:30:00 EST
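
For readers in other time zones, the window above can be converted to UTC with a few lines of Python. This is just a sketch: it assumes EST is a fixed UTC-5 offset, and the variable names are ours.

```python
from datetime import datetime, timedelta, timezone

# EST taken as a fixed UTC-5 offset (assumption: no DST rules applied).
EST = timezone(timedelta(hours=-5), "EST")

begins = datetime(2009, 3, 4, 23, 59, 59, tzinfo=EST)
ends = datetime(2009, 3, 5, 0, 30, 0, tzinfo=EST)

# Convert both ends of the window to UTC and compute its length.
begins_utc = begins.astimezone(timezone.utc)
ends_utc = ends.astimezone(timezone.utc)
duration = ends - begins

print("Begins (UTC):", begins_utc.strftime("%Y-%m-%d %H:%M:%S"))
print("Ends (UTC):  ", ends_utc.strftime("%Y-%m-%d %H:%M:%S"))
print("Duration:    ", duration)
```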

What will happen

During the planned maintenance window, MySQL services will be slower than usual while the secondary server takes over from the primary one. This will mainly affect websites, which will all see a slowdown, and possibly even complete outages.

Why

The main objective of the intervention is to test the capacity of the main server when idle, without any traffic, to compare it against the new server we are in the process of configuring as a replacement.

We also wish to test the capacity of the secondary server and the abilities of the sysadmin team to be able to proceed quickly with such an intervention, without being pressed by an actual emergency.

How

To proceed with those tests, we will turn off the main server and redirect all traffic to the secondary server. Since that server has less capacity, a substantial performance hit will be noticeable on our main servers.

The details of the operation are available on this page:

https://wiki.koumbit.net/RapportsIntervention/2009-02-04

If the operation takes longer than expected, we will announce it on http://offline.koumbit.net/

I object!

If this operation is too problematic for you or your organisation, please let us know within 24h to see if we can take appropriate workarounds.

Friday, January 16, 2009

Secondary server online, returning to regular performance

Who's affected

Users of the shared hosting service.

When

Jan 15th 2009 19:39EST

What happened

The secondary server was put back online.

Why

On January 1st, that server (hesiode.koumbit.net) was knocked completely offline by a power surge following a power outage. While the main server took over and the load balancing service hid the outage, this greatly affected the performance of websites and hosting services in general.

How

The server was returned to our provider, which repaired the problem and sent the server back.

Thursday, January 8, 2009

maintenance window jan 9th between 14:00 and 16:00

Who is affected

All hosting services will be temporarily turned off as the servers will be rebooted. This will also affect virtual server users.

When

The operations will take place on January 9th 2009, between 14:00 and 16:00 EST. The server reboots should be limited to the period between 14:00 and 14:30 EST.

What will happen

The following servers will be rebooted: homere.koumbit.net, metis.koumbit.net, alexandria.koumbit.net, demeter.koumbit.net, marius.koumbit.net, romulus.koumbit.net and raymond.fqccl.org

The following server will be removed: hesiode.koumbit.net.

The following servers will be put online: lgm.koumbit.net, sw4-canix2.koumbit.net

Why

Some servers will be rebooted to apply security upgrades to the Linux kernel. The secondary web server (hesiode.koumbit.net) will be removed from the cabinet to be replaced because it was damaged by the January 1st power failure. A new server will also be put online for a client (lgm.koumbit.net). Finally, new equipment will be put in place so that the new cabinet can accommodate new servers.

That new cabinet is necessary to respond properly to our growth.

How

Details of the operations are available to Koumbit members in the page: https://wiki.koumbit.net/RapportsIntervention/2009-01-09

I object!

If this intervention is too problematic for you or your organisation, please let us know within 24h to see if we can arrange otherwise.

300th account and new cabinet

Accounts per month

We have just welcomed our 300th account today! This symbolic step comes at a turning point in the history of our hosting services, as we are getting ready to open our second cabinet to deploy new servers. We're still having some delays in the deployment of our redundant infrastructure roadmap, but we will soon hire more personnel, which should help us move this forward faster.

Tuesday, December 16, 2008

Air conditioning failure in main cabinet nov 16 2008

At 7:15 this morning, all core services went down due to an air conditioning unit failure in the datacenter. That unit failed about an hour ago, which raised the temperature of all server units in the datacenter and caused a cascading outage. Our provider is aware of the issue and is working on it right now; more updates to follow.

Update (7:57): I have just been informed of a "30 minutes" ETA from the remote team, hang in there.

Update (8:03): all services have been brought back up, sorry for the inconvenience.

Wednesday, December 3, 2008

phpMyAdmin upgraded to 3.1.0

We have upgraded phpMyAdmin to the 3.1.0 version which fixes the "mbstring" issues that you have reported many times. Please report any problem to support@koumbit.org.

Sunday, November 9, 2008

Hosting outage Saturday November 8, 2008: electric problem, database time problem

A fuse in the cabinet where most of the Koumbit hosting servers are located was overloaded and failed during the night of November 8, 2008. Some of the servers were not available between 23h15 and 0h30, followed by other minor disruptions between 0h30 and 2h15.

Following this, the main web server of the shared hosting accounts did not recover its clock correctly and was displaying 1970. This caused a few problems on some sites running content management systems (such as SPIP and Drupal). The problem was noticed and fixed Sunday around 11h00.

Koumbit is about to open a second cabinet in a new point of presence. This is part of our 2008 architectural plan to increase redundancy and to deal with the growth of the demand. This will allow us, amongst other benefits, to avoid this type of outage, since the main shared hosting servers will be redundant between the two cabinets.

For questions or comments, you can comment on the sysadmin blog (offline.koumbit.net) or write to us at support@koumbit.org.

Thank you for your understanding.

Update, 16:47EST: it's the webserver and not the database server that had a clock problem.

Sunday, November 2, 2008

Defective disc replacement on the MySQL server

Who's affected?

All virtual servers will be affected, as well as all sites hosted on the shared server that use MySQL. That means most of the sites hosted by Koumbit.

When?

* Anticipated START TIME: Sunday, November 2nd 2008 at 16:00:00 EST
* Anticipated END TIME: Sunday, November 2nd 2008 at 17:00:00 EST

What will happen?

The database server will be temporarily stopped in order to replace a defective disc.

Why?

One of the components of our RAID array showed a defect last night. No data was lost, but it is important to replace the defective component to prevent any data loss.

How?

A technician (Antoine) will visit the data center to carry out the replacement.

Wednesday, October 15, 2008

New webserver ready for testing, alternc 0.9.9 online

Who's affected

This notice affects all web developers maintaining sites on the shared hosting services.

Starting next week, all users are also affected.

When

Monday, October 20th at 13h.

What will happen

A new web server has been put online and has successfully passed a series of internal tests. We now welcome all web developers and other technically capable people to test the new web server during the week.

Next monday, the new server will be added to the load balancing setup.

Why

The new server will ensure a better service continuity and a faster response.

How

When an outage occurs on a server, because of an overload or for any other reason, the second server will take over (the delay is currently set to 5 seconds). Even when there is no overload, both servers will share the load, greatly improving overall performance.

To test the new server immediately, all interested testers should modify their "hosts" files by following the instructions in the page below:
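
As a rough illustration of that behaviour (and not our actual load balancer configuration), the Python sketch below alternates requests between two servers and skips one that has stopped answering its health checks. The server names, the `Backend` class and the `make_picker` function are made up for the example, and the 5-second delay is modelled, as an assumption, as the time a server may stay silent before it is considered down.

```python
import itertools
from dataclasses import dataclass

FAILOVER_DELAY = 5.0  # seconds; the delay quoted above


@dataclass
class Backend:
    name: str
    last_seen_ok: float  # timestamp of the last successful health check

    def is_up(self, now):
        # Considered down once it has been silent longer than the delay.
        return now - self.last_seen_ok <= FAILOVER_DELAY


def make_picker(backends):
    """Return a function that picks the next live backend, round-robin."""
    rotation = itertools.cycle(backends)

    def pick_backend(now):
        # Try each backend at most once per request.
        for _ in range(len(backends)):
            candidate = next(rotation)
            if candidate.is_up(now):
                return candidate
        return None  # every backend is down

    return pick_backend


# Hypothetical usage: both servers answered their health check at t=100.
main = Backend("main-web", last_seen_ok=100.0)
second = Backend("second-web", last_seen_ok=100.0)
pick = make_picker([main, second])
print(pick(101.0).name, pick(101.0).name)  # load is shared between both
```

If one server stops answering, its `last_seen_ok` stops advancing and, once the delay has passed, every request goes to the other server.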

https://wiki.koumbit.net/DnsWithHostsFile

The IP address of the new server is the following: 209.44.112.96
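
For reference, a hosts file entry is simply the IP address followed by the host names that should resolve to it. The short Python sketch below builds such a line; `example.org` is a placeholder for your own domain and the function name is ours. The wiki page above remains the reference for where the file lives on your system.

```python
NEW_SERVER_IP = "209.44.112.96"  # address announced above


def hosts_entry(ip, hostnames):
    """Build one hosts-file line: the IP, a tab, then space-separated names."""
    if not hostnames:
        raise ValueError("at least one host name is required")
    return ip + "\t" + " ".join(hostnames)


# example.org is a placeholder; use the domain(s) of your own site.
print(hosts_entry(NEW_SERVER_IP, ["example.org", "www.example.org"]))
```

Remember to remove the line from your hosts file once you are done testing.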

Please notify us of any anomaly at support@koumbit.org, mentioning that you believe the problem is related to the new server and including your configuration.

Other announcements

We would like to take this opportunity to highlight the release of AlternC 0.9.9, which fixes many bugs in the control panel and makes deployment on multiple servers easy.

Additionally, note that the announcements sent to the mailing list are now marked with the language of the message. You can therefore filter the announcements you want to receive on the following page:

https://listes.koumbit.net/cgi-bin/mailman/options/hag-koumbit.org

I object!

If this intervention is too problematic for you or your organisation, please let us know within 24h to see if we can make other arrangements.

Monday, October 6, 2008

Network outage at main datacenter

We had a complete outage between 7:43 and 7:52. Between 7:56 and 8:43, we had around 50% packet loss, and that packet loss has now returned. There isn't much we can do, as we depend on our upstream provider to resolve the situation.

Update (9:19): situation back to normal again. It seems that our provider had stopped announcing its addresses to Teleglobe, its main bandwidth provider.

Update (12:00): situation has returned to normal during the morning. It seems our upstream provider was victim of a large-scale distributed denial of service attack.

Thursday, September 25, 2008

Security reboots and new webserver online on September 30th

When

September 30th between 14:40 and 15:00, EDT (-0400).

What will happen

The servers will be rebooted for a security update. Furthermore, a new physical server will be added to the LoadBalancing configuration.

Why

The Linux kernel has had multiple security vulnerabilities recently and we therefore need to upgrade to the newer kernels.

As for the load balancer, the goal is to resolve the recent reliability problems and allow for an easier maintenance of the services.

How

See the complete report (in French). Servers will be rebooted one after the other between 14:30 and 15:00. This will affect all virtual servers as well as the shared hosting, with each outage lasting around 90 seconds.

The new server (hesiode.koumbit.net) will be put online but will not be activated before a new test period, since it may break when displaying certain sites.

I object!

If this operation is too problematic for you or your organisation, please let us know within 24h to see if we can arrange otherwise.

Tuesday, September 16, 2008

recursive DNS service outage today

When

Today, September 16th, between 17:45 and 18:15, EDT (-0400).

What will happen

The server hosting one of the virtual servers resolving DNS for the cabinet (209.44.112.71, recurse2.koumbit.net) will be replaced, causing a short outage of around 30 minutes for this service. The other server (209.44.112.70, recurse.koumbit.net) should continue to provide regular service, and we therefore believe that this will have minimal impact on the infrastructure.
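
The reason a single resolver outage should go mostly unnoticed is that clients normally try the next resolver in their list when one does not answer. Here is a minimal Python sketch of that logic, with the actual DNS query left as a pluggable function (`query` is a hypothetical callable supplied by the caller, not a real library API):

```python
RESOLVERS = ["209.44.112.70", "209.44.112.71"]  # recurse, recurse2


def resolve_with_fallback(name, resolvers, query):
    """Ask each resolver in turn and return the first answer obtained.

    `query(resolver, name)` is supplied by the caller and is expected to
    return an answer or raise OSError (for example on a timeout).
    """
    errors = []
    for resolver in resolvers:
        try:
            return query(resolver, name)
        except OSError as exc:
            # Remember the failure and move on to the next resolver.
            errors.append((resolver, exc))
    raise OSError(f"no resolver answered for {name}: {errors}")
```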

Why

The server (remus.koumbit.net) is approaching end of life and needs to be replaced. It will be transformed into a massive backup server (alexandria.koumbit.net).

How

Koumbit members can read the details of the operational report. Note that remus.koumbit.net will now be named metis.koumbit.net. We will also put a new web node online, named hesiode.koumbit.net.

Thursday, September 11, 2008

MySQL server outage

The main MySQL server (demeter) has suffered a major outage. Apologies for the unusual outage.

It lasted from 13:23 to 14:08. All web and mail services were affected, but no mail should have been lost. The problem was related to another server crash. Koumbit members can read the full report.

Thursday, August 28, 2008

New DNS server: ns3.koumbit.net

What is happening

We are adding a new server to our list of DNS servers. The new server is already functional for all the shared hosting domains.

The new address of the server is: 209.172.53.230

Who is affected

All the users managing their domains themselves (as technical contact) have to add NS3.KOUMBIT.NET to their DNS configuration. This will ensure that you will not suffer any outage when we switch NS2.KOUMBIT.NET providers.

All domains managed by Koumbit have been properly modified today. If we are the technical contact for your domain, you do not have any action to take today.

You can verify the contacts for your domains through this web page:

http://www.gandi.net/whois

When

The changes have already started. The server has been in production since today. The "Glue Records" have been updated today, as have all the domains for which we are the technical contact.

Why

The secondary DNS server NS2.KOUMBIT.NET is hosted on a network link with less than desirable latency, which degrades our quality of service. We therefore want to migrate this server to another provider, but this move may create an outage. We are therefore creating a new DNS server that will provide us with another redundancy layer.

I object

If this intervention is too problematic for you or your organisation, please let us know within 24h to see if we can arrange otherwise.

Wednesday, August 27, 2008

new webserver in the cluster

I have just added a new web server to the load balancing setup. It is currently configured to answer only when the main server goes down (as opposed to sharing the load with it). This should get rid of the "503 Service unavailable" messages that we were regularly seeing on the web server these days.

There may be issues with some sites related to that change. We have tested a few sites (a Drupal and a Tikiwiki) and things seem to be running fine, but if you see weird behaviour, please tell us the exact time at which it was encountered so we can diagnose the problem.

Note that this does not yet improve performance in the cluster, but merely improves reliability. We will shortly deploy a dedicated server that should improve performance as well.

switch replacement complete, new statistics URLs

Yesterday's maintenance is now complete and the new switch is in place. Your machine has very likely changed ports. You can see the new configuration on the MRTG page:

http://log.koumbit.net/mrtg/

Most of you should be at ports above 36.

The statistics from the old switch are still available here:

http://log.koumbit.net/mrtg.pre-sw3/

Sorry for the trouble.
