Google OCS Apollo: A $3 Billion Revolution in Data Center Networking

       Networking is a critical component of any data center, especially given the rise of large language models that demand enormous network resources. That makes networking an obvious target for Google’s infrastructure optimization. Last year, at conferences like OFC and SIGCOMM, Google unveiled its own Jupiter networking stack, which covers everything from in-house switches to customizable, reconfigurable software.
       This custom network stack has saved Google at least $3 billion compared to the industry-standard approach used by competitors like Amazon and Microsoft. Beyond lower costs, Google has also seen better network performance and lower latency. The stack was first deployed five years ago and is now used in most of Google’s data centers. It is a key component in training Google’s modern large language models, including PaLM.
       Before we dive into how this proprietary network stack works, let’s briefly discuss its purpose and its impact on the industry. First, Google claims that its proprietary network can increase throughput by 30%, reduce power consumption by 40%, reduce capital expenditures by 30%, reduce flow completion time by 10%, and reduce overall network downtime by 50x.
       Most importantly, it allows the data center network to be upgraded incrementally. Google’s custom switches also eliminate the need to purchase Broadcom network switches at the backbone (spine) level.
       Traditional networks use a “Clos” topology, also known as a “spine-leaf” architecture, to connect all the servers and their racks in a data center. This “spine-leaf” architecture consists of a spine, leaves, and compute nodes. A compute node is a server rack filled with central processing units (CPUs), graphics processing units (GPUs), FPGAs, storage systems, and/or application-specific integrated circuits (ASICs). The compute node is connected to leaf or top-of-rack switches, which in turn are connected to the spine through various aggregation layers.
       Traditionally, electronic packet switches (EPS) are used in the backbone (spine) layer. These are the familiar network switches sold by major vendors including Broadcom, Cisco, Marvell, and Nvidia. However, EPSs consume a significant amount of energy. Moreover, network speeds double every two to three years. Although each doubling reduces energy consumption per bit, it also requires replacing the existing backbone EPSs. Consequently, each new generation of Broadcom Tomahawk switches brings significant capital expenditures.
       Google launched a project called Apollo to eliminate the backbone layer from its data center network, thereby removing the power consumption and capital costs associated with that switching layer. Moreover, the project is not limited to the backbone layer. It will be improved and possibly applied to lower layers of the network.
       The Apollo program aimed to replace the EPS in the traditional Clos architecture with optical circuit switches (OCS). The first generation of these optical switches was called Palomar. These OCSs replaced the backbone layer of the old Clos architecture. Instead of repeatedly converting signals from electrical to optical and back at the backbone level, an OCS is an all-optical interconnect: it uses mirrors to redirect incoming light beams, which are already encoded with data, from the source port to the destination port.
       To use an analogy, OCSs are like railroad switches. They may have multiple tracks, but a train can only move along one specific track/route at a time. To change the train’s route, you need to manually change the direction of the track.
       Using Google’s network as an example, suppose one part of the data center connected via port 7 wants to talk to another part connected via port 4, but port 7 is currently configured to connect to port 11. The OCS must then reposition its mirrors so that port 7 can talk to port 4. With a traditional EPS, nothing needs to be reconfigured, since all ports are always connected through the electrical switch.
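       The port example above can be sketched as a toy model. This is only an illustration, not Google’s control software: the class name `OpticalCircuitSwitch` and its methods are hypothetical, and the model treats each circuit as a symmetric pairing of two ports while ignoring mirror physics and the seconds-long switching delay.

```python
class OpticalCircuitSwitch:
    """Toy OCS (hypothetical): a one-to-one pairing of ports, no packet inspection."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        self.peer: dict[int, int] = {}  # port -> port its light is currently steered to

    def connect(self, a: int, b: int) -> None:
        """Re-point the mirrors so that ports a and b form a circuit."""
        for port in (a, b):
            if not 1 <= port <= self.num_ports:
                raise ValueError(f"port {port} is out of range")
        if a == b:
            raise ValueError("a port cannot be connected to itself")
        # Tear down any circuit that a or b was already part of.
        for port in (a, b):
            old_peer = self.peer.pop(port, None)
            if old_peer is not None:
                self.peer.pop(old_peer, None)
        self.peer[a] = b
        self.peer[b] = a

    def forward(self, in_port: int):
        """Return the port where light entering in_port exits, or None if dark."""
        return self.peer.get(in_port)


ocs = OpticalCircuitSwitch(num_ports=136)
ocs.connect(7, 11)   # initial configuration: port 7 talks to port 11
ocs.connect(7, 4)    # reconfigure: port 7 now talks to port 4
print(ocs.forward(7), ocs.forward(4), ocs.forward(11))
```

       Note that `forward` never looks at what the light carries. The path is decided entirely by the configuration that was set beforehand, and the port that lost its circuit (port 11 here) stays dark until it is paired again.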
       Google uses these optical switches in a direct-connect architecture, connecting leaf switches to one another directly through what is effectively a patch panel. This is not packet switching, but essentially an optical cross-connect.
       Using the train analogy, it is a large railway station with several incoming and outgoing tracks. Any incoming track can be connected to any outgoing track, but the station has to be reconfigured to do so.
       Note that in a typical network architecture, each packet carries a header. The electrical switch decodes this header to determine where to send the packet. In Google’s optical circuit switches (OCS), no packets are decoded. The path a packet will follow is fixed before it even reaches the OCS. If you want to change which ports communicate, you need to know the traffic flows and where they are headed ahead of time. You need to know where the train is going before it even arrives at the station.
       Overall, OCS is a “set it and forget it” solution, as it takes seconds to move a mirror and reconfigure the path through the OCS. Compared to a traditional EPS, this is incredibly slow. The train needs to be routed before it arrives.
       Google’s OCS is not a drop-in replacement for EPS, and the network design must explicitly account for the mirror reconfiguration time.
       While lack of flexibility and compatibility are the main drawbacks of OCS, it also has many advantages. Google highlights three main ones: data rate and wavelength independence, low latency, and significant energy savings.
       Data rate and wavelength independence are important for two main reasons. First, it allows compatibility with any switch and any optical technology. This means that if you need to connect a switch with 100G transceivers to a switch with 800G transceivers, OCS will be able to establish the connection without any problems, since it simply forwards optical signals, rather than forwarding packets.
       Once OCS is set up, you can upgrade your switches and fiber network to faster generations without replacing the network backbone. OCS has a longer lifespan than traditional EPS.
       In a traditional EPS, optical fibers plug into the switch. The incoming optical signals are converted into electrical signals using photodetectors and transimpedance amplifiers (TIAs), then processed and retimed by a digital signal processor (DSP). They then travel across a printed circuit board to a standard switch chip, where the packets are decoded and routed. The packets are then re-encoded and sent back through the same chain in reverse. Each step introduces additional latency.
       OCS’s low latency is achieved because OCS does not need to decode packets; all it needs to do is reflect incoming light from the source port to the destination port.
       This brings us to the third and most important advantage: energy consumption. Each stage of a traditional EPS consumes a certain amount of energy. An OCS skips these conversion and processing stages, so it avoids that energy cost.
       OCS had four major drawbacks, all of which Google claims to have addressed: high initial costs, high insertion loss, long setup times, and lack of support.
       The high initial cost is something Google can absorb over the long run. Because OCS can handle any bandwidth, these OCSs do not need to be replaced when Google migrates to leaf switches with 1.6 and 3.2 Tb transceivers, which offsets the initial cost. Google estimates that because OCSs can be reused across multiple upgrade cycles, the total CAPEX is approximately 70% of that of standard EPS.
       If Google expects the OCS to last three upgrade cycles, the initial capital cost of the OCS would be about 3.5 times that of an EPS (with the EPS average selling price rising at each upgrade). If Google expects these OCSs to last four generations, the difference in initial capital cost would be almost 6 times!
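       One way to see how the 70% lifetime figure and the 3.5x and 6x upfront figures can fit together is a back-of-the-envelope sketch. The article does not give the underlying model, so everything here is an assumption made for illustration: the EPS is bought again in every generation, its price grows by a hypothetical factor of 1.5 per generation, and the OCS is bought once for 70% of the cumulative EPS spend.

```python
def upfront_capex_ratio(
    generations: int,
    asp_growth: float = 1.5,         # ASSUMPTION: EPS price growth per generation
    total_cost_ratio: float = 0.70,  # from the article: OCS total CAPEX ~70% of EPS
) -> float:
    """OCS upfront cost as a multiple of one first-generation EPS purchase."""
    # The EPS is replaced every generation, each time at a higher price.
    eps_lifetime_spend = sum(asp_growth ** g for g in range(generations))
    # The OCS is bought once, so its upfront cost is also its lifetime cost.
    return total_cost_ratio * eps_lifetime_spend


for n in (3, 4):
    print(f"{n} generations: OCS upfront is about {upfront_capex_ratio(n):.2f}x one EPS")
```

       With these assumed inputs the ratios come out at roughly 3.3x for three generations and 5.7x for four, in the same range as the figures quoted above. A different price growth assumption shifts the result, so this should be read as a plausibility check rather than Google’s own calculation.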
       Insertion loss is another major drawback of OCS technology. Insertion loss is the signal power lost when an optical signal passes from one transmission medium to another, such as from a laser to a silicon photonic chip or from an optical fiber to a photodetector. It is measured in decibels (dB), and lower is better. For example, if a device introduces an insertion loss of 3 dB, the output signal power will be about half the input power.
       The higher the insertion loss, the weaker the signal, which can lead to unreliable data transmission. While typical insertion loss is around 6 dB, Google says it has reduced its worst-case insertion loss below that figure.
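       The decibel arithmetic behind the 3 dB example is the standard conversion between a loss in dB and a power ratio. A minimal sketch (the function name is ours, chosen for illustration):

```python
def remaining_power_fraction(insertion_loss_db: float) -> float:
    """Fraction of input optical power that survives a given insertion loss."""
    # dB is a logarithmic power ratio: loss_dB = -10 * log10(P_out / P_in)
    return 10 ** (-insertion_loss_db / 10)


for loss_db in (1, 2, 3, 6):
    fraction = remaining_power_fraction(loss_db)
    print(f"{loss_db} dB loss leaves {fraction:.1%} of the input power")
```

       Because the scale is logarithmic, every additional 3 dB roughly halves the power again, so a 6 dB loss leaves only about a quarter of the original signal.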
       Another major problem was reconfiguration time. Reconfiguring a mirror to a different route took a matter of seconds. Google solved this problem by analyzing its network traffic in detail.
       They argue that creating a network for worst-case traffic is overkill, and that by planning for network traffic, long mirror reconfiguration times can be avoided.
       Google solved the problem of the lack of direct support by redesigning its network to support OCS. The company put “a decade of design and manufacturing experience” into Jupiter, and the Apollo project represents a significant step forward in reducing the overall cost of these large-scale network systems. It’s a secret sauce that Google is reluctant to reveal publicly, but after describing the hardware, we can share some details.
       Initially, the Apollo program adopted a vendor solution for OCS, and Huawei has also deployed solutions from the same vendor for various use cases in its network.
       Due to the difficulties in maintaining the reliability and quality of this solution at scale, a decision was made to develop the OCS system in-house.
       The heart of the Palomar OCS is the Palomar MEMS mirror package, which contains 176 individually controlled micromirrors. However, since 40 micromirrors were turned off for cost reasons, only 136 of them were actually used.
       MEMS (microelectromechanical systems) are miniature mechanical and electromechanical devices with embedded electronics manufactured using micromachining technologies.
       The principle of operation of these mirrors is that the incoming signal light with a wavelength of 1310 nm (O-band) is combined with another light beam with a wavelength of 850 nm through a dichroic beam splitter. The dichroic beam splitter transmits light with a wavelength of 850 nm, while reflecting light with a wavelength of 1310 nm. The dichroic beam splitter is an inclined mirror with a coating that transmits light of certain wavelengths and reflects others.
       In the case of Palomar OCS, the wavelength of light that needs to be transmitted is 850 nm, and the wavelength of light that needs to be reflected is 1310 nm. Since it is impossible to reflect or transmit 100% of the light, some light is lost in the separation process, but more than 90% of the light can be transmitted to the desired location.
       From there, the combined light is reflected off the MEMS array and into a second dichroic beam splitter, which returns the 1310 nm light back into the MEMS array while simultaneously passing the 850 nm light into a camera that monitors the alignment of the MEMS array. Alignment of the MEMS array is critical because even a small deviation from alignment can result in interrupted data transmission.
       To maintain alignment of the MEMS array, two 850 nm beams are used. So when the 1310 nm beam reflects off the MEMS array a second time, it combines with the 850 nm beam. When the combined beam reaches the final dichroic beam splitter, it separates the 1310 nm beam and sends it to the output port.
       To reduce the number of OCS ports and fiber optic cables by half, thereby simplifying the system, Palomar used an optical circulator to implement bidirectional communication. An optical circulator is a three-port device in which the input signal from port 1 is routed to port 2, and the input signal from port 2 is routed to port 3. This turns a standard duplex transceiver into a bidirectional transceiver.
       This makes the system more sensitive to return loss and crosstalk. Return loss concerns light that is reflected back toward the source, for example at the end of a fiber optic cable. When light passes from an optical fiber into another medium (such as air), the change in refractive index reflects part of the signal. Strong back-reflections can prevent the laser light from being transmitted correctly.
       Crosstalk is interference between two channels that results in signal degradation and increased noise. So Google abandoned its previous use of erbium-doped fiber amplifiers, which were limited to wavelengths between 1530 and 1565 nm (C-band), and instead used optical coatings and upgraded the optics to operate in the 1310 nm (O-band) wavelength range. This upgrade also reduced the back-reflection and crosstalk of the system.
       Google used wavelength division multiplexing (WDM) optical transceivers for Apollo. WDM is a technology that takes multiple optical signals and transmits them at different wavelengths over a single optical fiber. The first generation of Apollo was based on the 40 Gbps standard. This standard (CWDM4 MSA) was adopted across the industry, leading to standardization and mass production of optical components. The only unique feature of this solution is the MEMS-based switch.
       As part of Project Apollo, Google developed a non-blocking 136×136 optical switch that is backwards and forwards compatible with any bandwidth and wavelength currently in use or planned for use in Google data centers. According to Google, this switch consumes only 108 watts. By comparison, a standard 136-port EPS switch consumes about 3,000 watts.
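       To put the two power figures side by side, here is a small calculation using only the numbers quoted above (108 W for the OCS, about 3,000 W for a comparable EPS, 136 ports each). The per-port split is our own simplification and assumes power divides evenly across ports.

```python
OCS_WATTS = 108    # Google's figure for the 136x136 Palomar OCS
EPS_WATTS = 3000   # article's approximate figure for a 136-port EPS
PORTS = 136


def watts_per_port(total_watts: float, ports: int = PORTS) -> float:
    """Average power per port, assuming an even split (a simplification)."""
    return total_watts / ports


saving = 1 - OCS_WATTS / EPS_WATTS
print(f"OCS: {watts_per_port(OCS_WATTS):.2f} W per port")
print(f"EPS: {watts_per_port(EPS_WATTS):.2f} W per port")
print(f"Switch-level power saving: {saving:.1%}")
```

       This compares the switching elements alone. The 40% network-wide power reduction cited earlier is a smaller number because transceivers and the remaining electrical switches still draw power.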
       So, despite some of the shortcomings of OCS, the advantages of the solution created by Google outweigh these shortcomings. Over the past five years, “tens of thousands of OCS with 136×136 ports (with eight spare ports) have been produced and deployed.” Google has created a system that works well.
       In the future, Google is considering OCS with a higher port count to achieve greater horizontal scalability and higher switching speeds, which will facilitate wider adoption of OCS at the network layer. Wider adoption could have a significant negative impact on Broadcom, the leader in hyperscale network switches. In addition, Google said it will continue to improve reliability and reduce insertion and return losses.
       Google is also researching piezoelectric switching technology to replace existing MEMS-based systems, as piezoelectric systems have significant advantages over MEMS systems in terms of insertion and return loss. Switching speeds can also be faster. Google also shared its research comparing MEMS, robotic, piezoelectric, guided-wave, and wavelength-switching technologies. The comparison covers relative cost, port count, switching time, insertion loss, control voltage, and latching, which we will discuss below.
       OCS is also used in all Google TPUv4 and TPUv5 systems and is a key component of their superior performance/TCO.
       Below we present a number of documents, images, and articles related to OCS. They include traffic data, images of the switches, and examples of how Google analyzes traffic and implements OCS. They also demonstrate the network performance benefits, not just the cost savings (which were the main topic above).

Post time: Aug-21-2025