Koumbit Network Status


Thursday, January 8, 2009

Maintenance window Jan 9th between 14:00 and 16:00

Who is affected

All hosting services will be temporarily turned off as the servers will be rebooted. This will also affect virtual server users.

When

The operations will take place on January 9th 2009, between 14:00 and 16:00 EST. The server reboots should be limited to the period between 14:00 and 14:30 EST.

What will happen

The following servers will be rebooted: homere.koumbit.net, metis.koumbit.net, alexandria.koumbit.net, demeter.koumbit.net, marius.koumbit.net, romulus.koumbit.net and raymond.fqccl.org

The following server will be removed: hesiode.koumbit.net.

The following servers will be put online: lgm.koumbit.net, sw4-canix2.koumbit.net

Why

Some servers will be rebooted to apply security upgrades to the Linux kernel. The secondary web server (hesiode.koumbit.net) will be removed from the cabinet to be replaced because it was damaged by the January 1st power failure. A new server will also be put online for a client (lgm.koumbit.net). Finally, new equipment will be put into place so that the new cabinet can accommodate new servers.

That new cabinet is necessary to respond properly to our growth.

How

Details of the operations are available to Koumbit members in the page: https://wiki.koumbit.net/RapportsIntervention/2009-01-09

I object!

If this intervention is too problematic for you or your organisation, please let us know within 24h to see if we can make other arrangements.

300th account and new cabinet

Accounts per month

We have just welcomed our 300th account today! This symbolic step comes at a turning point in the history of our hosting services, as we are getting ready to open our second cabinet to deploy new servers. We are still experiencing some delays in the deployment of our redundant infrastructure roadmap, but we will soon hire more personnel, which should help speed this up.

Tuesday, December 16, 2008

Air conditioning failure in main cabinet Nov 16 2008

At 7:15 this morning, all core services went down due to an air conditioning unit failure in the datacenter. That unit failed about an hour ago, which raised the temperature of all server units in the datacenter and caused a cascading outage. Our provider is aware of the issue and is working on it right now; more updates to follow.

Update (7:57): I have just been informed of a "30 minutes" ETA from the remote team, hang in there.

Update (8:03): all services have been brought back up, sorry for the inconvenience.

Wednesday, December 3, 2008

phpMyAdmin upgraded to 3.1.0

We have upgraded phpMyAdmin to version 3.1.0, which fixes the "mbstring" issues that you have reported many times. Please report any problems to support@koumbit.org.

Sunday, November 9, 2008

Hosting outage Saturday November 8, 2008: electric problem, database time problem

A fuse in the cabinet where most of the Koumbit hosting servers are located was overloaded and failed during the night of November 8, 2008. Some of the servers were unavailable between 23h15 and 0h30, followed by other minor disruptions between 0h30 and 2h15.

Following this, the main web server for the shared hosting accounts did not recover its time correctly and was displaying 1970. This caused a few problems on some sites running content management systems (such as Spip and Drupal). The problem was noticed and fixed Sunday around 11h00.
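A date in 1970 is what a Unix system typically reports when its clock has fallen back to timestamp 0 (the Unix epoch). As a minimal sketch, assuming a hypothetical sanity threshold of January 1, 2008, a check like the following could flag such a clock early. The function name and the threshold are illustrative only and are not part of Koumbit's tooling.

```python
from datetime import datetime, timezone

# Hypothetical threshold: any clock reading before this is treated as unset.
EARLIEST_PLAUSIBLE = datetime(2008, 1, 1, tzinfo=timezone.utc)

def clock_looks_reset(now: datetime) -> bool:
    """Return True if the given time is implausibly far in the past."""
    return now < EARLIEST_PLAUSIBLE

# Unix timestamp 0 corresponds to January 1, 1970 (UTC).
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch.year, clock_looks_reset(epoch))
```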

Koumbit is about to open a second cabinet in a new point of presence. This is part of our 2008 architectural plan to increase redundancy and to deal with the growth in demand. This will allow us, amongst other benefits, to avoid this type of outage, since the main shared hosting servers will be redundant between the two cabinets.

For questions or comments, you can comment on the sysadmin blog (offline.koumbit.net) or write to us at support@koumbit.org.

Thank you for your understanding.

Update, 16:47EST: it's the webserver and not the database server that had a clock problem.

Sunday, November 2, 2008

Defective disc replacement on the MySQL server

Who's affected?

All virtual servers will be affected, as well as all sites hosted on the shared server that use MySQL. This means most of the sites hosted by Koumbit.

When?

* Anticipated START TIME: Sunday, November 2nd 2008 at 16:00:00 EST
* Anticipated END TIME: Sunday, November 2nd 2008 at 17:00:00 EST

What will happen?

The database server will be temporarily stopped in order to replace a defective disc.

Why?

One of the components of our RAID array showed a defect last night. No data was lost, but it is important to replace the defective component to prevent any possible data loss.

How?

A technician (Antoine) will visit the data center to carry out the replacement.

Wednesday, October 15, 2008

New webserver ready for testing, AlternC 0.9.9 online

Who's affected

This notice affects all web developers maintaining sites on the shared hosting services.

Starting next week, all users will also be affected.

When

Monday, October 20th at 13h.

What will happen

A new web server has been put online and has successfully passed a series of internal tests. We now welcome all web developers and other technically capable people to test the new web server during the week.

Next Monday, the new server will be added to the load balancing setup.

Why

The new server will ensure better service continuity and faster response times.

How

When an outage occurs on one server, because of an overload or for any other reason, the second server will take over (the delay is currently set to 5 seconds). Even outside of overload conditions, both servers will share the load, greatly improving overall performance.
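The behaviour described here can be sketched roughly as follows. This is only an illustration of load sharing with failover, not the actual load balancer configuration: the server names, the health-check timestamps and the alternating selection rule are assumptions. Only the 5-second delay comes from this notice.

```python
FAILOVER_DELAY = 5.0  # seconds, the delay quoted in this notice

def available_servers(last_ok, now, delay=FAILOVER_DELAY):
    """Return the servers whose last successful health check is recent enough."""
    return [name for name, seen in last_ok.items() if now - seen <= delay]

def pick_server(last_ok, now, request_number):
    """Share requests across healthy servers; fall back to whichever remains."""
    healthy = sorted(available_servers(last_ok, now))
    if not healthy:
        return None  # no server is left to take over
    return healthy[request_number % len(healthy)]

# Hypothetical server names and health-check times (in seconds).
last_ok = {"web1": 100.0, "web2": 100.0}
print([pick_server(last_ok, 101.0, n) for n in range(4)])  # both servers alternate

last_ok["web2"] = 107.0  # web1 has not answered since t=100
print([pick_server(last_ok, 107.0, n) for n in range(4)])  # web2 handles everything
```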

To test the new server immediately, all interested testers should modify their "hosts" files by following the instructions in the page below:

https://wiki.koumbit.net/DnsWithHostsFile

The IP address of the new server is the following: 209.44.112.96
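The wiki page above remains the authoritative set of instructions. As a rough illustration only, a hosts-file entry is a line containing an IP address followed by the names that should resolve to it. The sketch below builds such a line for the new server; `example.org` is a placeholder and `hosts_line` is a hypothetical helper.

```python
NEW_SERVER_IP = "209.44.112.96"  # address given in this announcement

def hosts_line(ip, hostnames):
    """Build one hosts-file line: the IP followed by the names mapped to it."""
    if not hostnames:
        raise ValueError("at least one hostname is required")
    return ip + "\t" + " ".join(hostnames)

# example.org is a placeholder; substitute the domains of the site under test.
print(hosts_line(NEW_SERVER_IP, ["example.org", "www.example.org"]))
```

The resulting line would be added to the hosts file (typically `/etc/hosts` on Linux and macOS) and removed again once testing is finished.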

Please notify us of any anomaly at support@koumbit.org, mentioning that you believe the problem is related to the new server and including your configuration.

Other announcements

We would like to take this opportunity to highlight the release of AlternC 0.9.9, which fixes many bugs in the control panel and makes deployment on multiple servers easy.

Additionally, note that the announcements sent to the mailing list are now marked with the language of the message. You can therefore filter the announcements you want to receive on the following page:

https://listes.koumbit.net/cgi-bin/mailman/options/hag-koumbit.org

I object!

If this intervention is too problematic for you or your organisation, please let us know within 24h to see if we can make other arrangements.

Monday, October 6, 2008

Network outage at main datacenter

We had a complete outage between 7:43 and 7:52. Between 7:56 and 8:43, we had around 50% packet loss, and that situation has now returned. There isn't much we can do, as we depend on our upstream provider to resolve the situation.

Update (9:19): the situation is back to normal again. It seems that our provider had stopped announcing its addresses to Teleglobe, its main bandwidth provider.

Update (12:00): the situation returned to normal during the morning. It seems our upstream provider was the victim of a large-scale distributed denial of service attack.

Thursday, September 25, 2008

Security reboots and new webserver online on September 30th

When

September 30th between 14:40 and 15:00, EDT (-0400).

What will happen

The servers will be rebooted for a security update. Furthermore, a new physical server will be added to the LoadBalancing configuration.

Why

The Linux kernel has had multiple security vulnerabilities recently, and we therefore need to upgrade to the newer kernels.

As for the load balancer, the goal is to resolve the recent reliability problems and make maintenance of the services easier.

How

See the complete report (fr). Servers will be rebooted one after the other between 14:30 and 15:00. This will affect all virtual servers as well as the shared hosting, with each outage lasting around 90 seconds.

The new server (hesiode.koumbit.net) will be put online but will not be activated before a new test period, as it is possible that the new server will break when displaying certain sites.

I object!

If this operation is too problematic for you or your organisation, please let us know within 24h to see if we can make other arrangements.

Tuesday, September 16, 2008

recursive DNS service outage today

When

Today, September 16th, between 17:45 and 18:15, EDT (-0400).

What will happen

The server hosting one of the virtual servers resolving DNS for the cabinet (209.44.112.71, recurse2.koumbit.net) will be replaced, causing a short outage of around 30 minutes for this service. The other server (209.44.112.70, recurse.koumbit.net) should continue to provide regular service, and we therefore believe that this will have minimal impact on the infrastructure.

Why

The server (remus.koumbit.net) is approaching end of life and needs to be replaced. It will be transformed into a massive backup server (alexandria.koumbit.net).

How

Koumbit members can read the details of the operational report. Note that remus.koumbit.net will now be named metis.koumbit.net. We will also put a new web node online, named hesiode.koumbit.net.

Thursday, September 11, 2008

MySQL server outage

The main MySQL server (demeter) has suffered a major outage. We apologize for this unusual outage.

It lasted from 13:23 to 14:08. All web and mail services were affected, but no mail should have been lost. The problem was related to another server crash. Koumbit members can read the full report.

Thursday, August 28, 2008

New DNS server: ns3.koumbit.net

What is happening

We are adding a new server to our list of DNS servers. The new server is already functional for all the shared hosting domains.

The address of the new server is: 209.172.53.230

Who is affected

All the users managing their domains themselves (as technical contact) have to add NS3.KOUMBIT.NET to their DNS configuration. This will ensure that you will not suffer any outage when we switch NS2.KOUMBIT.NET providers.

All domains managed by Koumbit have been properly modified today. If we are the technical contact for your domain, you do not have any action to take today.

You can verify the contacts for your domains through this web page:

http://www.gandi.net/whois
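For those managing their own domains, a quick way to confirm the change is to compare the domain's name server list against NS3.KOUMBIT.NET. The following is a minimal sketch, assuming the name server names have already been copied from a whois or DNS lookup; the `has_ns3` helper and the sample lists are hypothetical.

```python
REQUIRED = "ns3.koumbit.net"

def normalize(name):
    """Lower-case a DNS name and drop the trailing root dot, if any."""
    return name.strip().rstrip(".").lower()

def has_ns3(nameservers):
    """Return True if ns3.koumbit.net is among the given name server names."""
    return REQUIRED in {normalize(ns) for ns in nameservers}

# Hypothetical name server lists, as might be copied from a lookup.
print(has_ns3(["NS2.KOUMBIT.NET."]))
print(has_ns3(["ns2.koumbit.net", "NS3.KOUMBIT.NET."]))
```

DNS names are case-insensitive and lookups often print a trailing dot, which is why the comparison normalizes both before checking.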

When

The changes have already started. The server has been in production since today. The "Glue Records" have been updated today, as have all the domains for which we are the technical contact.

Why

The secondary DNS server NS2.KOUMBIT.NET is hosted on a network link with less than desirable latency, which degrades our quality of service. We therefore want to migrate this server to another provider, but this move may create an outage. We are therefore creating a new DNS server that will provide us with another redundancy layer.

I object

If this intervention is too problematic for you or your organisation, please let us know within 24h to see if we can make other arrangements.

Wednesday, August 27, 2008

new webserver in the cluster

I have just added a new web server to the load balancing setup. It is currently configured to answer only when the main server goes down (as opposed to sharing the load with it). This should get rid of the "503 Service unavailable" messages that we were regularly seeing on the web server these days.

There may be issues with some sites related to that change. We have tested a few sites (a Drupal and a Tikiwiki) and things seem to be running fine, but if you see weird behaviour, please tell us the exact time at which it was encountered so we can diagnose the problem.

Note that this does not yet improve performance in the cluster, but merely improves reliability. We will shortly deploy a dedicated server that should improve performance as well.

switch replacement complete, new statistics URLs

The maintenance yesterday is now complete and the new switch is in place. Your machine has very likely changed ports. You can see the new configuration on the MRTG page:

http://log.koumbit.net/mrtg/

Most of you should be at ports above 36.

The statistics from the old switch are still available here:

http://log.koumbit.net/mrtg.pre-sw3/

Sorry for the trouble.

Wednesday, August 20, 2008

Intervention on August 26th

Who is affected

All the machines, virtual servers or not, and services hosted in the main cabinet. This includes hosting and email services.

When

August 26th 2008, between 19h00 and 21h00 EDT (-0400). Outages described below will occur between 20h00 and 21h00, EDT. However, we hope to limit those outages to 30 minutes (so between 20h00 and 20h30).

What will happen

The main switch will be replaced. This will cause short network outages for each of the hosted servers.

Why

The current switch is full and shows signs of weakness. We prefer to replace it before a complete outage occurs.

How

See the intervention report (fr). Note that this intervention will begin with a general outage affecting all servers when the core router is replugged. Then every machine will be replugged one by one, which should cause a few minutes of outage for each machine. It is also possible that this procedure fails to work properly and that we go forward with a quick and dirty unplugging and replugging of everything.

I object!

If this intervention is too problematic for you or your organisation, please let us know within 24h to see if we can make other arrangements.

Also note that the sysadmin blog will be updated if the intervention is changed in any way or if we experience problems or delays.

Starting to use categories to classify articles by language

We are starting to use Dotclear's categories to isolate the content by language in this blog. The main page will contain content from both languages. To see the content in your language, use:

This also applies to RSS feeds:

Posts will all be translated from now on, or there will at least be a pointer in place of the missing translation.