STDS (Sudden Tbred Death Syndrome)

A few days ago a question was posted in the MHW forums asking about STDS (sudden Tbred death syndrome), a failure similar to SNDS (sudden Northwood death syndrome) that sometimes occurs when overclocking and overvolting. The answer was so detailed and informative that I asked the author if I could reprint it here.

Question: "ELiTE KiLLaH"

I've read a few reports of STDS (sudden tbred death syndrome) similar to SNDS (sudden northwood death syndrome) but have yet to find a definitive source.

Answer: "Talon"

What you are seeing in both cases, Thoroughbred and Northwood, is simply the reality of today’s scaled CMOS technology. Rigors of semiconductor device physics aside, suffice it to say that CPUs operate by using an electric field to control current. For a given gate voltage, the strength of this field is inversely proportional to the thickness of the dielectric material between the gate electrode and the channel region of the transistor, through which current flow is induced or prevented. As we shrink our transistors it remains necessary to maintain proper control over the channel. This is achieved by reducing the thickness of the dielectric layer, usually by a factor approaching that by which we scale the transistor length.
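
To put rough numbers on that relationship, here is a minimal Python sketch that treats the gate stack as an ideal parallel-plate capacitor, so the oxide field is just gate voltage divided by oxide thickness. The parallel-plate simplification, the function name, and the 40 A "older process" oxide are assumptions made for illustration; 20 A and 1.5V are roughly what a 130nm part uses.

```python
# Sketch: oxide field for a given gate voltage and oxide thickness.
# Assumes an ideal parallel-plate gate stack (E = V / t_ox); real devices
# have flat-band and depletion effects that this ignores.

ANGSTROM_IN_CM = 1e-8  # 1 angstrom = 1e-10 m = 1e-8 cm


def oxide_field_mv_per_cm(gate_voltage, oxide_thickness_angstrom):
    """Return the oxide field in MV/cm for the given voltage and thickness."""
    thickness_cm = oxide_thickness_angstrom * ANGSTROM_IN_CM
    return gate_voltage / thickness_cm / 1e6


# Hypothetical comparison: same 1.5 V applied across a 40 A and a 20 A oxide.
thick = oxide_field_mv_per_cm(1.5, 40)
thin = oxide_field_mv_per_cm(1.5, 20)
print(f"40 A oxide at 1.5 V: {thick:.2f} MV/cm")
print(f"20 A oxide at 1.5 V: {thin:.2f} MV/cm")
print(f"field ratio: {thin / thick:.2f}")
```

Halving the oxide at the same voltage doubles the field, which is why the supply voltage has to come down as the oxide gets thinner.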

The de facto dielectric material used in all of today’s silicon-based semiconductor devices is a highly uniform layer of silicon dioxide. In modern 130nm CMOS processes the thickness of this film is less than 20 angstroms (A). An angstrom is 1 x 10^-10 meters; put another way, the entire oxide thickness is equivalent to about 5-6 atomic monolayers. Needless to say, manufacturing a film so thin with less than +/- 0.5A of variability across a 12" wafer is extraordinarily difficult. But that's another story in itself.

Now the fundamental property of a dielectric is that an individual atom or molecule of the material is able to form a dipole that opposes an applied electric field. Different atoms and molecules accomplish this in a number of different ways, but I’ll leave that to an electromagnetic fields course. When you have a monolayer of these atoms or molecules, each with its dipole moment opposing the applied electric field, the material is said to exhibit a level of permittivity. The relative permittivity of a material is expressed as a factor k. You'll often hear reference to low-k and high-k materials for use in semiconductors.

There are several mechanisms by which you can damage this critical layer, nearly all of which are direct results of increased voltage and are exacerbated by the increased temperature that accompanies it. If the applied electric field becomes great enough, it can overcome the opposition of the dipoles, resulting in permanent damage to the atomic bonds. This field strength is known as the dielectric breakdown strength, and for perfect oxide it is taken to be about 10MV/cm. This sounds like a lot, but if we do the math for a 20A oxide (10MV/cm x 2 x 10^-7 cm) we get a breakdown voltage of 2 volts. Generally the BIOS will not allow you to raise the voltage high enough to exceed this value, but some people have been known to use various hardware modifications to get around BIOS limitations. This breakdown limit is why, as processors are produced using smaller transistors and thus thinner oxides, the voltage at which they operate is reduced. Of course there are non-idealities that confound this number in real life, but theoretically the limit of applied gate voltage should be around this number.
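
The unit conversion behind that 2 volt figure is easy to get wrong, so here it is as a small Python sketch. It assumes the ideal case only (breakdown voltage = breakdown field x oxide thickness, with none of the real-world non-idealities); the function name is made up for illustration, and the 40 A and 12 A thicknesses are hypothetical values included to show the trend.

```python
# Sketch: ideal oxide breakdown voltage, V_bd = E_bd * t_ox.
# Ignores defects, non-uniformity, and every other real-world non-ideality.

ANGSTROM_IN_CM = 1e-8  # 1 angstrom = 1e-10 m = 1e-8 cm


def breakdown_voltage(breakdown_field_mv_per_cm, oxide_thickness_angstrom):
    """Return the ideal breakdown voltage in volts."""
    field_v_per_cm = breakdown_field_mv_per_cm * 1e6
    thickness_cm = oxide_thickness_angstrom * ANGSTROM_IN_CM
    return field_v_per_cm * thickness_cm


# 10 MV/cm is the ideal-oxide figure; 40 A and 12 A are hypothetical oxides.
for thickness in (40, 20, 12):
    volts = breakdown_voltage(10, thickness)
    print(f"{thickness} A oxide: about {volts:.2f} V")
```

The 20 A case reproduces the 2 volt limit, and the thinner hypothetical oxide shows how quickly the headroom shrinks with each process generation.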

In addition to catastrophic failure there is also a factor known as time-dependent dielectric breakdown. Even when exposed to field strengths below the dielectric breakdown strength, the oxide can still fail given sufficient time. As you increase the voltage of your processor you are generally increasing both the voltage applied to the gate and the drain/source bias. This increases both the lateral and vertical electric fields in the channel. In the NMOS portion of your CMOS transistors, electrons are the charge carriers that travel through the channel. Exposed to these increased fields, an electron is accelerated to a high velocity by the lateral field. If the vertical field is great enough it can sweep the electron up into the oxide, where it injects itself. This is known as channel hot electron (CHE) injection. There are two proposed mechanisms by which these injected electrons contribute to the breakdown of the oxide. One is that the electrons gain sufficient kinetic energy to generate electron-hole pairs within the oxide. Some of these holes become trapped within the oxide and enable conduction through the ordinarily insulating material. The other is that the energetic electrons actually break the silicon-oxygen bonds and form defects that trap charge, which can eventually give rise to a conductive path through the oxide. This process is accelerated at higher temperatures.

Additionally, at a high enough gate voltage the device will be in strong enough inversion that the conduction band at the substrate/oxide interface forms a triangular quantum well. Free electrons residing in the conduction band basically bounce back and forth within this well until they are able to tunnel directly through the physical oxide. This is known as Fowler-Nordheim tunneling and is actually the principle by which Flash memory works. The transit of electrons through the oxide, however, is detrimental and leads to eventual failure. This mechanism is known as cold carrier injection.

Both CHE injection and FN tunneling lead to oxide reliability problems. Your processor is designed to operate reliably at specified voltages and temperatures for a period of approximately 10 years. Any time you change these values you are directly impacting the long-term reliability of the part. People are operating Northwoods at 2V, which is higher than P4s were specified to run at even when they were being built on the old 0.18um process. There was a reason why Intel dropped the core voltage down to 1.5V when it moved to 0.13um.

Now another issue that is impacting the reliability of devices at 130nm and below is electromigration. The move to copper interconnects greatly reduced the bulk electromigration seen with aluminum. However, copper can suffer from surface electromigration. Surface electromigration refers to the directed motion of atoms at the surface of the copper "wires" that connect the transistors and gates. It is caused by an electric current in the bulk of the material. As the line width of metallic interconnects becomes comparable to or smaller than the grain size of the film, voids can form. These voids migrate in the direction of the current and eventually collapse into a slit near a 90-degree turn, which disconnects the conductor. As you increase the voltage to your processor and overclock it, you increase the currents passed through the copper interconnects. These interconnects are intentionally designed to be as small as possible in order to pack the miles and miles of wiring it takes to connect the 40+ million transistors into the small die of a modern CPU. Because line widths are narrower than ever before, they are more susceptible to surface electromigration.
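
For a rough feel of how current and temperature combine, the sketch below uses Black's equation, the standard textbook model for electromigration lifetime: MTTF = A * J^-n * exp(Ea / kT). This model is not part of the answer above; it is brought in here purely for illustration. The current exponent (n = 2), the activation energy (0.9 eV), and the overclock numbers are all assumed values, not measurements for any real processor.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def relative_em_lifetime(current_ratio, temp_stock_c, temp_oc_c,
                         current_exponent=2.0, activation_energy_ev=0.9):
    """Electromigration lifetime when overclocked, relative to stock.

    Based on Black's equation, MTTF = A * J**-n * exp(Ea / (k * T)).
    The prefactor A cancels in the ratio. The default n and Ea are
    assumed, illustrative values.
    """
    t_stock_k = temp_stock_c + 273.15
    t_oc_k = temp_oc_c + 273.15
    # More current density shortens life as J**-n.
    current_term = current_ratio ** -current_exponent
    # Higher temperature shortens life through the Arrhenius term.
    thermal_term = math.exp(
        activation_energy_ev / BOLTZMANN_EV_PER_K
        * (1.0 / t_oc_k - 1.0 / t_stock_k)
    )
    return current_term * thermal_term


# Hypothetical overclock: 30% more interconnect current, 50 C rising to 65 C.
ratio = relative_em_lifetime(1.3, 50, 65)
print(f"lifetime relative to stock: {ratio:.2f}")
```

Whatever the exact parameters, the two effects multiply: more current and more heat each shorten interconnect life, and an overclock with extra voltage delivers both at once.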

So the short answer to your question is yes: voltage and temperature are responsible. These issues will only become more pronounced as integrated circuits move to 90nm technology, because the margins that older processes enjoyed are no longer attainable.

