The CSS Handbook: a handy guide to CSS for developers

CSS, short for Cascading Style Sheets, is one of the main building blocks of the Web. Its history goes back to the '90s, and along with HTML it has changed a lot since its humble beginnings.

As I’ve been creating websites since before CSS existed, I have seen its evolution.

CSS is an amazing tool, and in the last few years it has grown a lot, introducing many fantastic features like CSS Grid, Flexbox and CSS Custom Properties.

This handbook is aimed at a vast audience.

First, the beginner. I explain CSS from zero in a succinct but comprehensive way, so you can use this book to learn CSS from the basics.

Then, the professional. CSS is often treated as a secondary skill, especially by JavaScript developers: CSS is not a "real" programming language, the thinking goes, so programmers shouldn't bother learning it properly. I wrote this book for you, too.

Next, the person who has known CSS for a few years but hasn't had the opportunity to learn its newest features. We'll talk extensively about the new features of CSS, the ones that are going to build the web of the next decade.

CSS has improved a lot in the past few years and it’s evolving fast.

Even if you don’t write CSS for a living, knowing how CSS works can help save you some headaches when you need to understand it from time to time, for example while tweaking a web page.

Thank you for getting this ebook. My goal with it is to give you a quick yet comprehensive overview of CSS.


You can reach me via email or on Twitter: @flaviocopes.

My website is

SQL Fiddle

About SQL Fiddle
A tool for easy online testing and sharing of database problems and their solutions.
Who should I contact for help/feedback?
There are two ways you can get in contact:

Email:
Twitter: @sqlfiddle
What am I supposed to do here?
If you do not know SQL or basic database concepts, this site is not going to be very useful to you. However, if you are a database developer, there are a few different use-cases of SQL Fiddle intended for you:

You want help with a tricky query, and you’d like to post a question to a Q/A site like StackOverflow. Build a representative database (schema and data) and post a link to it in your question. Unique URLs for each database (and each query) will be generated as you use the site; just copy and paste the URL that you want to share, and it will be available for anyone who wants to take a look. They will then be able to use your DDL and your SQL as a starting point for answering your question. When they have something they’d like to share with you, they can then send you a link back to their query.

You want to compare and contrast SQL statements in different database back-ends. SQL Fiddle easily lets you switch which database provider (MySQL, PostgreSQL, MS SQL Server, Oracle, and SQLite) your queries run against. This will allow you to quickly evaluate query porting efforts, or language options available in each environment.

You do not have a particular database platform readily available, but you would like to see what a given query would look like in that environment. Using SQL Fiddle, you don’t need to bother spinning up a whole installation for your evaluation; just create your database and queries here!

How does it work?
The Schema DDL that is provided is used to generate a private database on the fly. If anything is changed in your DDL (even a single space!), then you will be prompted to generate a new schema and will be operating in a new database.

All SQL queries are run within a transaction that gets immediately rolled-back after the SQL executes. This is so that the underlying database structure does not change from query to query, which makes it possible to share anonymously online with any number of users (each of whom may be writing queries in the same shared database, potentially modifying the structure and thus — if not for the rollback — each other’s results).
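The run-then-roll-back pattern described above can be sketched with Python's sqlite3 standard module (a simplified model of the behavior, not SQL Fiddle's actual backend):

```python
import sqlite3

# The schema DDL is applied once and committed: this is the "private database".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")
conn.commit()

# Each user query runs inside a transaction that is rolled back afterwards,
# so the shared schema and data never actually change between queries.
def run_query(sql):
    cur = conn.execute(sql)
    rows = cur.fetchall()
    conn.rollback()  # undo any modifications the query made
    return rows

run_query("DELETE FROM users")               # a destructive query...
remaining = run_query("SELECT COUNT(*) FROM users")
print(remaining)  # [(2,)] -- the delete was rolled back; both rows survive
```

This is why several users can fiddle against the same shared database without stepping on each other's results.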

As you create schemas and write queries, unique URLs that refer to your particular schema and query will be visible in your address bar. You can share these with anyone, and they will be able to see what you’ve done so far. You will also be able to use your normal browser functions like ‘back’, ‘forward’, and ‘reload’, and you will see the various stages of your work, as you would expect.

What differences are there between the various database options?
Aside from the differences inherent in the various databases, there are a few things worth pointing out about their implementation on SQL Fiddle.

MySQL only supports queries which read from the schema (selects, basically). This is necessary due to some limitations in MySQL that make it impossible for me to ensure a consistent schema while various people are fiddling with it. The other database options allow the full range of queries that the back-end supports.

SQLite runs in the browser; see below for more details.

What’s up with that [ ; ] button under each panel?
This obscure little button determines how the queries in each of the panels get broken up before they are sent off to the database. It pops open a dropdown that lists different “query terminators.” A query terminator is used as a flag to indicate (when present at the end of a line) that the current statement has ended. The terminator does not get sent to the database; instead, it merely indicates how I should parse the text before I execute the query.

Oftentimes, you won’t need to touch this button; the main value of this feature is in defining stored procedures. Within a stored procedure’s body definition, you will often want to end a line with a semicolon. Since my default query terminator is also a semicolon, there is no obvious way for me to see that your stored procedure’s semicolon isn’t actually the end of the query. Left with the semicolon terminator, I would break up your procedure definition into incorrect parts, and errors would certainly result. Changing the query terminator to something other than a semicolon avoids this problem.
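The splitting behavior can be illustrated with a toy parser (hypothetical code, not SQL Fiddle's actual implementation): statements are split only where the chosen terminator appears at the end of a line, so a procedure body full of semicolons survives intact when the terminator is changed to something like //.

```python
def split_statements(text, terminator=";"):
    """Split a script into statements wherever a line ends with the
    terminator; the terminator itself is never sent to the database."""
    statements, current = [], []
    for line in text.splitlines():
        stripped = line.rstrip()
        if stripped.endswith(terminator):
            current.append(stripped[: -len(terminator)])
            statements.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    if any(l.strip() for l in current):  # trailing statement without terminator
        statements.append("\n".join(current).strip())
    return statements

script = """CREATE PROCEDURE greet()
BEGIN
  SELECT 'hi';
  SELECT 'bye';
END//
SELECT 1//"""

# Two statements; the semicolons inside the body are left alone.
print(split_statements(script, terminator="//"))
```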

Why are there two strange-looking options for SQLite?
SQLite is something of a special case amongst the various database types I support. I could have implemented it the same way as the others, with a backend host doing the query execution, but what fun is that? SQLite’s “lite” nature allowed for some interesting alternatives.

First, I found the very neat project SQL.js, which is an implementation of the engine translated into JavaScript. This means that instead of using my servers (and my limited memory), I could offload the work onto your browser! Great for me, but unfortunately SQL.js does have a few drawbacks. One is that it taxes the browser a bit when it is first loaded into memory. The other is that it doesn’t work in all browsers (so far I’ve seen it fail in IE9 and mobile Safari).

The other option is “WebSQL.” This option makes use of the SQLite implementation that a few browsers come with built-in (I’ve seen it work in Chrome and Safari; supposedly Opera supports this too). This feature was considered part of the W3C working draft for HTML5, but they deprecated it in favor of IndexedDB. Despite this, a few browsers (particularly mobile browsers) still have it available, so I figured that this would be a useful feature to grab onto. The advantage over SQL.js is that it is quite a bit faster to load the schema and run the queries. The disadvantage is that it isn’t widely supported, and likely not long for this world.

Together, these two options allow SQLite to run within almost any decent browser (*cough* except IE *cough*). If someone links you to a SQLite fiddle that your browser doesn’t support, just switch over to the other option and build it using that one. If neither works, then get a better browser.

Who built this site, and why?
SQL Fiddle was built by Jake Feasel, a web developer originally from Anchorage, Alaska and now living in Vancouver, WA. He started developing the site around the middle of January, 2012.

He had been having fun answering questions on StackOverflow, particularly related to a few main categories: ColdFusion, jQuery, and SQL.

He found JS Fiddle to be a great tool for answering JavaScript / jQuery questions, but he also found that there was nothing available that offered similar functionality for SQL questions. So, that was his inspiration to build this site. Basically, he built this site as a tool for developers like him to be more effective in assisting other developers.

How is the site paid for?
ZZZ Projects started paying for the hosting in 2017 before taking ownership in 2018. We have great plans to make SQL Fiddle even more user-friendly, and we welcome any contributions.

Source Code
If you are interested in the fine details of the code behind SQL Fiddle and exactly how it is deployed, it is all available on GitHub.

What platform is it running on?
This site uses many different technologies. The primary ones used to provide the core service, in order from client to server are these:

RequireJS (js): JavaScript module loader and code optimizer.
CodeMirror (js): For browser-based SQL editing with text highlighting.
Bootstrap (css): Twitter’s CSS framework (v2).
LESS (css): CSS pre-processor.
Backbone.js (js): MV* JavaScript framework.
Handlebars.js (js): JavaScript templating engine.
Lodash.js (js): Functional programming library for JavaScript.
Date.format.js (js): Date formatting JavaScript library.
jQuery (js): AJAX, plus misc JS goodness. (Also jQuery plugins Block UI and Cookie.)
html-query-plan (js): XSLT for building rich query plans for SQL Server.
Varnish (backend): Content-caching reverse proxy.
Vert.x (backend): Open-source Java-based application server.
PostgreSQL (db): Among others, of course, but PG is the central database host for this platform.
Grunt (devops): JavaScript task runner, config and frontend build automation.
Maven (devops): Dependency management, backend build automation.
Docker (devops): VM management.
Amazon AWS (hosting): Cloud hosting provider.
GitHub (devops, hosting): Git repository, collaboration environment.
This list doesn’t include the stacks used to run the database engines. Those are pretty standard installs of the various products. For example, I’m running a Windows 2008 VPS running SQL Server 2014 and Oracle, and various Docker images running the others.

The ASRock DeskMini A300 Review: An Affordable DIY AMD Ryzen mini-PC
by Ganesh T S on April 26, 2019 8:00 AM EST

Small form-factor (SFF) machines have emerged as a major growth segment in the desktop PC market. Performance per watt is an important metric for such systems, and Intel has pretty much been the only game in town for such computers, given that AMD platforms prior to the launch of Ryzen could barely compete on that metric. The NUC (UCFF) and mini-STX (5×5) were introduced by Intel as the standard motherboard sizes for the SFF market, and we have previously seen AMD-based NUC-like platforms (namely, the Zotac ZBOX CA320 nano back in 2014, and the Compulab fitlet-XA10-LAN in 2016).

Not to be left out entirely, however, AMD’s vendors are finally starting to dip their toes back in to the mini-PC market with Ryzen-based systems. Earlier this year, ASRock became the first vendor to announce an AMD-based mini-STX system – the DeskMini A300. So for today’s review we’re delving deep into the performance and features of the DeskMini A300, and seeing how it stacks up against other contemporary SFF PCs.

Introduction and Platform Analysis
ASRock’s DeskMini series is a family of barebones systems in the mini-STX (140mm x 147mm motherboard / 1.92L chassis) and micro-STX (188mm x 147mm motherboard / 2.7L chassis) form-factors. Here, ‘barebones’ differs slightly from the NUC terminology. While the NUCs just require the user to plug in RAM and storage, the mini-STX and micro-STX boards are socketed. This gives users a choice of CPU to install, making it similar in more respects to a typical DIY build.

The DeskMini A300 that we are looking at today is a mini-STX machine capable of supporting AMD AM4 processors with integrated graphics. The board uses the AMD A300 chipset, and supports both Ryzen-based Raven Ridge APUs and the older Bulldozer-based Bristol Ridge APUs with a TDP of up to 65W.

There are multiple versions of the DeskMini A300 available, depending on the optional components that are bundled. The product page mentions the DeskMini A300 and the A300W, with the latter’s accessory pack including an Intel AC-3168 Wi-Fi kit. On the Overview page, however, a number of optional components are mentioned – an AMD APU cooler (for up to 65W, with dimensions of 77mm x 68mm x 39mm and speeds between 1950 and 3500 RPM), a VESA mount kit, an M.2 Wi-Fi kit, and a USB 2.0 cable to put the dual USB-port slots on the top / side of the chassis to use.

It must be noted that the chassis design only allows for coolers up to 46mm in height – this means that the Wraith coolers (Stealth @ 54mm, Spire @ 71mm, and the Max @ 85mm) are all unsupported. Users might be better off with the optional cooler that ASRock advertises for use with the DeskMini A300.

Overall, our barebones review sample came with the optional cooler in the package. ASRock also provided us with an AMD Ryzen 5 2400G APU to install in the system. We completed the build with a 500GB Western Digital WD Blue SN500 NVMe SSD and a 16GB G.Skill Ripjaws DDR4-3000 SODIMM kit.

The specifications of our DeskMini A300 review configuration are summarized in the table below.

ASRock DeskMini A300 Specifications
Processor Ryzen 5 2400G
AMD Zen, 4C/8T, 3.6 (3.9) GHz
2MB+4MB L2+L3, 65 W TDP
Memory G.Skill Ripjaws F4-3000C16D-16GRS DDR4 SODIMM
16-18-18-43 @ 3000 MHz
2×8 GB
Graphics Radeon RX Vega 11 Graphics
Disk Drive(s) Western Digital WD SN500
(500 GB; M.2 2280 PCIe 3.0 x2 NVMe SSD; SanDisk 64L 3D TLC)
Networking Realtek RTL8168 (MAC) / RTL8111 (PHY) Gigabit Ethernet controller
Audio 3.5mm Headphone / Microphone Jack
Capable of 5.1/7.1 digital output with HD audio bitstreaming (HDMI)
Miscellaneous I/O Ports 1x USB 2.0
2x USB 3.0 Type-A, 1x USB 3.1 Gen 1 Type-C
Operating System Retail unit is barebones, but we installed Windows 10 Enterprise x64
Pricing $150 (barebones)
$465 (as configured, no OS)
Full Specifications ASRock DeskMini A300 Barebones Specifications
Thanks to Western Digital and G.Skill for the build components.
Similar to the other DeskMini systems, the A300 is equipped with two DDR4 SO-DIMM slots (supporting DDR4-2400 with Bristol Ridge APUs, and DDR4-2933 with Raven Ridge). There are two M.2 2280 slots on board (one on the same side as the CPU socket, and another on the underside). This is in contrast to the Intel-based DeskMini 310 board which comes with just a single M.2 slot. The two M.2 slots are PCIe 3.0 x4. However, if the Athlon 2xxGE series APUs are used, the second slot operates in PCIe 3.0 x2 mode.

Other features are similar to the DeskMini 310 – two SATA ports and space in the chassis for the installation of two 2.5″ SATA drives, a Realtek ALC233 audio codec chip to support a headphone / microphone audio-jack, two USB 3.0 Type-A ports, one USB 3.1 Gen 1 Type-C port, and a single USB 2.0 Type-A port. The 120W (19V @ 6.32A) power adapter is external. The LAN port is backed by a Realtek RTL8168/8111H controller compared to the Intel I219V in the DeskMini 310.

The package includes the drivers on a CD (a USB key, even read-only, is much more preferable), a quick installation guide, screws to install the storage drives, rubber feet to raise the chassis when it is placed vertically, a couple of SATA cables, and a geo-specific power cord.

In addition to the extra M.2 2280 NVMe SSD slot, the DeskMini A300 scores over the DeskMini 310 by sporting a native HDMI 2.0a display output. Note that HDMI display output support on Intel processors is restricted to HDMI 1.4a. Vendors wanting to implement an HDMI 2.0a port in their system have been forced to place an LSPCon on board to convert one of the DisplayPort 1.2 outputs from the processor to HDMI 2.0a, which results in increased board costs. Since the target market for the DeskMini 310 could make do with a single 4Kp60 output using the DisplayPort port, ASRock didn’t bother to place an LSPCon on that board. The DeskMini A300 supports simultaneous dual 4Kp60 displays using the DisplayPort and HDMI ports in the rear. Triple display output is also supported, but the D-Sub port can support only a 2048 x 1536 resolution at the maximum.

Gallery: ASRock DeskMini A300

The gallery above takes us around the chassis design and the board features. Without the Wi-Fi antenna pigtails to worry about, it was a breeze to draw out the board from the chassis and install the components.

The DeskMini A300 comes with an AMD A300 Promontory chipset. It is the most basic offering from AMD in the AM4 lineup. Overclocking is not supported. There are no USB 3.1 Gen 2 ports, and StoreMI (storage acceleration using a combination of PCIe and SATA drives) is also not supported. From the AIDA64 system report, we see that the second M.2 2280 port (on the underside of the board) is enabled by the x2 / x4 NVMe link from the processor. The remaining 12 free PCIe lanes from the Ryzen 5 2400G are configured as two x4 links for the M.2 slots on the top side (Wi-Fi and storage). The remaining x4 link is used in a x1 configuration for the Realtek LAN controller. All the rest of the I/O ports (USB and SATA) are direct passthrough from the SoC portion of the Ryzen 5 2400G.

Moving on to the BIOS features, the use of the A300 chipset rules out any overclocking of the Ryzen processor itself. Upon boot up, our configuration came up with the G.Skill SODIMMs in DDR4-2400 mode. The BIOS allowed us to load the available XMP profile (DDR4-3000), and a simple saving of the change followed by a power cycle resulted in the DRAM configured for 3000 MHz operation.

Gallery: ASRock DeskMini A300 BIOS Features

Our review sample shipped with the BIOS v1.2. Prior to benchmarking, we upgraded to the recommended version, 3.40. Screenshots from both BIOS versions can be seen in the gallery above.

In the table below, we have an overview of the various systems that we are comparing the ASRock DeskMini A300 against. Note that they may not belong to the same market segment. The relevant configuration details of the machines are provided so that readers have an understanding of why some benchmark numbers are skewed for or against the ASRock DeskMini A300 when we come to those sections.

Comparative PC Configurations: ASRock DeskMini A300
CPU: AMD Ryzen 5 2400G
GPU: AMD Radeon RX Vega 11 Graphics
RAM: G.Skill Ripjaws F4-3000C16D-16GRS DDR4 SODIMM, 16-18-18-43 @ 3000 MHz, 2×8 GB
Storage: Western Digital WD Blue WDS500G1B0C (500 GB; M.2 2280 PCIe 3.0 x2; SanDisk 64L 3D TLC)
Wi-Fi: N/A
Price (in USD, when built): $150 (barebones), $465 (as configured, no OS)
The rest of the review will deal with performance benchmarks – both artificial and real-world workloads, performance for home-theater PC duties, and an evaluation of the thermal design under stressful workloads.


Comments
Thvash – Friday, April 26, 2019 – link
For some reason 4K HDR, VP9 Profile 2 is not accelerated at all under Windows, while GPU claims to support it, no such issues under Linux
Smell This – Friday, April 26, 2019 – link
**Microsoft removed the in-built HEVC decoding capabilities of Windows 10 in the 2017 Fall Creators Update, and replaced it with an extension that had to be downloaded from the Microsoft Store. Without the extension, playback is restricted to 1080p non-HDR streams encoded in H.264. In addition to the decoding capabilities, the system also needs to support PlayReady 3.0 DRM.**

Another drive-by DRM borking by WIntel …
ganeshts – Friday, April 26, 2019 – link
Actually, it is OK with Kodi (XBMC) and Microsoft Edge / VideoUI app on Windows. It is only VLC and LAV Video Decoder having issues.
DigitalFreak – Friday, April 26, 2019 – link
”The hardware itself is actually rather capable (as noted above), but the current state of the Radeon drivers holds it back.”

Same old story that’s been going on for a decade or more with ATI/AMD.
Irata – Friday, April 26, 2019 – link
Some more power consumption numbers: (A300 vs. A310)

Idle power: 81%
Max power consumption (stressing CPU+GPU): 131%.

But this gives us:

– Gaming performance: no numbers for the A310; however, the A300 has an average gaming performance of 204% vs. Bean Canyon (using the fps shown as default) at 126% of its power consumption, so again it is more power efficient.

Cinebench multi-threaded rendering: 137% of the A310’s performance @ (using the max power consumption as a guideline) 131% of the power consumption.

Note: It would be nice to show the power consumption for all benchmarks, i.e. gaming, 7-zip, cinebench….
Irata – Friday, April 26, 2019 – link
I found this a bit odd:

“For traditional office and business workloads, it gets the job done; and while it’s not particularly energy efficient, the upfront cost itself is lower”.

Looking at the BAPCo SYSmark overall power consumption numbers, the DeskMini A300 and 310 have basically identical numbers (32.26 vs. 31.62 Wh). Seeing that the performance delta is not considerable, I find this statement a bit odd. And these are BAPCo SYSmark numbers, which need to be taken with a rock of salt.
davie887 – Friday, April 26, 2019 – link
Intel CAN’T be shown in anything other than their best light.

Anyone who questions them has to prepare for the consequences 😀
BigMamaInHouse – Friday, April 26, 2019 – link
Comparing 2400G with Real iGPU vs $431 i7-8559U – I’d say it performs Great!.
Irata – Friday, April 26, 2019 – link
To be more specific, on the “productivity benchmark”, the A300 has 89% of the A310’s score with 86% of its power consumption, so for office type tasks, it is actually a bit more efficient.
ganeshts – Friday, April 26, 2019 – link
Ah, the pitfalls of saying ProdA scores X% of ProdB in metric M at Y% in metric N, when M and N are not linearly correlated!

Extending it the same way, if I were to build the DeskMini 310 system with the same original review components at the current prices, I am going to splurge : 162 (DeskMini 310 board with Wi-Fi compared to DeskMini A300 without Wi-Fi) + 139 (Core i3-8100) + 76 (DDR4-2400 2x8GB SODIMM) + 78 (PCIe 3.0 x4 240GB NVMe SSD – Corsair Force MP510) = $455 ; Let me look up the table for the DeskMini A300 cost without Wi-Fi – tada, it is $465 – oh oh oh!!!! Does the lower upfront cost for the AMD system (as claimed in the article in the same BAPCo section) evaporate into thin air? No!

The reason is that when you are looking at SYSmark 2018 scores and SYSmark 2018 energy consumption numbers, you compare against systems that score approximately the same in those particular metrics.

For the overall SYSmark 2018 scores, the DeskMini A300 is approximately the same as the Baby Canyon NUC – then, let us look at the energy consumption numbers for those two – the Baby Canyon consumes less energy.

For the energy consumption numbers, the A300 and 310 are approximately in the same ball park – and there, you see the 310 with a higher score.

As for accusations that ‘Intel CAN”T be shown in anything other than their best light’ – take a chill pill – the PCMark 8 numbers back up SYSmark 2018. And, in the gaming section, we show that AMD outperforms the best that Intel can offer. As an impartial reviewer, my aim is to present the facts as-is and provide my analysis – if you come with pre-conceived notions that one product / vendor is better than the other, then, no amount of facts will convince you otherwise.


Linux Networking

leandromoreira / linux-network-performance-parameters

Linux network queues overview
Fitting the sysctl variables into the Linux network flow
Ingress – they’re coming
Egress – they’re leaving
What, Why and How – network and sysctl parameters
Ring Buffer – rx,tx
Interrupt Coalescence (IC) – rx-usecs, tx-usecs, rx-frames, tx-frames (hardware IRQ)
Interrupt Coalescing (soft IRQ) and Ingress QDisc
Egress QDisc – txqueuelen and default_qdisc
TCP Read and Write Buffers/Queues
Honorable mentions – TCP FSM and congestion algorithm
Network tools
Sometimes people are looking for sysctl cargo-cult values that bring high throughput and low latency with no trade-offs and that work on every occasion. That’s not realistic, although we can say that newer kernel versions are very well tuned by default. In fact, you might hurt performance if you mess with the defaults.

This brief tutorial shows where some of the most used and quoted sysctl/network parameters are located in the Linux network flow. It was heavily inspired by the illustrated guide to the Linux networking stack and many of Marek Majkowski’s posts.

Feel free to send corrections and suggestions! 🙂
Linux network queues overview
(diagram: Linux network queues)

Fitting the sysctl variables into the Linux network flow
Ingress – they’re coming
Packets arrive at the NIC
NIC will verify MAC (if not on promiscuous mode) and FCS and decide to drop or to continue
NIC will DMA packets at RAM, in a region previously prepared (mapped) by the driver
NIC will enqueue references to the packets at receive ring buffer queue rx until rx-usecs timeout or rx-frames
NIC will raise a hard IRQ
CPU will run the IRQ handler that runs the driver’s code
Driver will schedule a NAPI, clear the hard IRQ and return
Driver raises a soft IRQ (NET_RX_SOFTIRQ)
NAPI will poll data from the receive ring buffer until netdev_budget_usecs timeout or netdev_budget and dev_weight packets
Linux will also allocate memory to sk_buff
Linux fills in the metadata: protocol, interface, sets the MAC header, removes the ethernet header
Linux will pass the skb to the kernel stack (netif_receive_skb)
It will set the network header, clone skb to taps (i.e. tcpdump) and pass it to tc ingress
Packets are handed to a qdisc sized netdev_max_backlog with its algorithm defined by default_qdisc
It calls ip_rcv and packets are handed to IP
It calls netfilter (PREROUTING)
It looks at the routing table, if forwarding or local
If it’s local it calls netfilter (LOCAL_IN)
It calls the L4 protocol (for instance tcp_v4_rcv)
It finds the right socket
It goes to the tcp finite state machine
The packet is enqueued to the receive buffer, sized according to the tcp_rmem rules
If tcp_moderate_rcvbuf is enabled kernel will auto-tune the receive buffer
Kernel will signal that there is data available to apps (epoll or any polling system)
Application wakes up and reads the data
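The final steps (kernel signals readability, application wakes up and reads) are what user space sees through epoll or any polling system. A minimal sketch with Python's selectors module, using a socket pair to stand in for a real network connection:

```python
import selectors
import socket

# A connected socket pair stands in for a NIC-backed connection.
reader, writer = socket.socketpair()
reader.setblocking(False)

sel = selectors.DefaultSelector()   # epoll on Linux
sel.register(reader, selectors.EVENT_READ)

writer.send(b"hello")               # data lands in reader's receive buffer

# The kernel reports the socket as readable; the app wakes and reads.
events = sel.select(timeout=1)
data = b""
for key, mask in events:
    data = key.fileobj.recv(4096)

print(data)  # b'hello'
sel.close()
reader.close()
writer.close()
```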
Egress – they’re leaving
Application sends message (sendmsg or other)
TCP send message allocates skb_buff
It enqueues skb to the socket write buffer of tcp_wmem size
Builds the TCP header (src and dst port, checksum)
Calls L3 handler (in this case ipv4 on tcp_write_xmit and tcp_transmit_skb)
L3 (ip_queue_xmit) does its work: build ip header and call netfilter (LOCAL_OUT)
Calls output route action
Calls netfilter (POST_ROUTING)
Fragment the packet (ip_output)
Calls L2 send function (dev_queue_xmit)
Feeds the output (QDisc) queue of txqueuelen length with its algorithm default_qdisc
The driver code enqueues the packets at the ring buffer tx
The driver will do a soft IRQ (NET_TX_SOFTIRQ) after tx-usecs timeout or tx-frames
Re-enable hard IRQ to NIC
Driver will map all the packets (to be sent) to some DMA’ed region
NIC fetches the packets (via DMA) from RAM to transmit
After the transmission NIC will raise a hard IRQ to signal its completion
The driver will handle this IRQ (turn it off)
And schedule (soft IRQ) the NAPI poll system
NAPI will handle the receive packets signaling and free the RAM
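The socket write buffer from the egress steps above can be observed from user space: with a non-blocking socket, send() succeeds until the kernel write queue fills. A small sketch using a Unix socket pair (whose buffer is governed by SO_SNDBUF rather than tcp_wmem, but the queueing behavior is the same idea):

```python
import socket

# The peer never reads, so the sender's write buffer fills up.
sender, peer = socket.socketpair()
sender.setblocking(False)

sent = 0
try:
    while True:
        sent += sender.send(b"x" * 4096)
except BlockingIOError:
    # Write buffer is full; a blocking socket would sleep here until
    # the kernel drains data toward the wire (or the peer reads).
    pass

print(f"buffered {sent} bytes before the write queue filled")
sender.close()
peer.close()
```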
What, Why and How – network and sysctl parameters
Ring Buffer – rx,tx
What – the driver receive/send queues: one or multiple queues with a fixed size, usually implemented as a FIFO, located in RAM
Why – a buffer to smoothly accept bursts of connections without dropping them; you might need to increase these queues when you see drops or overruns, i.e. more packets are coming than the kernel is able to consume; the side effect might be increased latency.
Check command: ethtool -g ethX
Change command: ethtool -G ethX rx value tx value
How to monitor: ethtool -S ethX | grep -e "err" -e "drop" -e "over" -e "miss" -e "timeout" -e "reset" -e "restar" -e "collis" -e "over" | grep -v "\: 0"
Interrupt Coalescence (IC) – rx-usecs, tx-usecs, rx-frames, tx-frames (hardware IRQ)
What – number of microseconds/frames to wait before raising a hardIRQ, from the NIC perspective it’ll DMA data packets until this timeout/number of frames
Why – reduce CPU usage and hard IRQs; might increase throughput at the cost of latency.
Check command: ethtool -c ethX
Change command: ethtool -C ethX rx-usecs value tx-usecs value
How to monitor: cat /proc/interrupts
Interrupt Coalescing (soft IRQ) and Ingress QDisc
What – maximum number of microseconds in one NAPI polling cycle. Polling will exit when either netdev_budget_usecs have elapsed during the poll cycle or the number of packets processed reaches netdev_budget.
Why – instead of reacting to tons of softIRQ, the driver keeps polling data; keep an eye on dropped (# of packets that were dropped because netdev_max_backlog was exceeded) and squeezed (# of times ksoftirq ran out of netdev_budget or time slice with work remaining).
Check command: sysctl net.core.netdev_budget_usecs
Change command: sysctl -w net.core.netdev_budget_usecs value
How to monitor: cat /proc/net/softnet_stat; or a better tool
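The dropped and squeezed counters mentioned above live in /proc/net/softnet_stat as hex columns, one row per CPU. A small parser (per the kernel's documented layout, the first three columns are processed, dropped, and time_squeeze):

```python
def parse_softnet_stat(text):
    """Parse /proc/net/softnet_stat: one row per CPU, hex columns.
    Column 0 = packets processed, 1 = dropped (netdev_max_backlog
    exceeded), 2 = time_squeeze (budget/time ran out with work left).
    Real files have more columns; we only need the first three."""
    stats = []
    for cpu, line in enumerate(text.splitlines()):
        cols = [int(c, 16) for c in line.split()]
        stats.append({"cpu": cpu, "processed": cols[0],
                      "dropped": cols[1], "squeezed": cols[2]})
    return stats

sample = "0000272d 00000000 00000001 00000000\n000034a1 00000002 00000000 00000000"
for row in parse_softnet_stat(sample):
    print(row)

# On a Linux box:
# parse_softnet_stat(open("/proc/net/softnet_stat").read())
```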
What – netdev_budget is the maximum number of packets taken from all interfaces in one polling cycle (NAPI poll). In one polling cycle interfaces which are registered to polling are probed in a round-robin manner. Also, a polling cycle may not exceed netdev_budget_usecs microseconds, even if netdev_budget has not been exhausted.
Check command: sysctl net.core.netdev_budget
Change command: sysctl -w net.core.netdev_budget value
How to monitor: cat /proc/net/softnet_stat; or a better tool
What – dev_weight is the maximum number of packets that kernel can handle on a NAPI interrupt, it’s a Per-CPU variable. For drivers that support LRO or GRO_HW, a hardware aggregated packet is counted as one packet in this.
Check command: sysctl net.core.dev_weight
Change command: sysctl -w net.core.dev_weight value
How to monitor: cat /proc/net/softnet_stat; or a better tool
What – netdev_max_backlog is the maximum number of packets, queued on the INPUT side (the ingress qdisc), when the interface receives packets faster than kernel can process them.
Check command: sysctl net.core.netdev_max_backlog
Change command: sysctl -w net.core.netdev_max_backlog value
How to monitor: cat /proc/net/softnet_stat; or a better tool
Egress QDisc – txqueuelen and default_qdisc
What – txqueuelen is the maximum number of packets, queued on the OUTPUT side.
Why – a buffer/queue to face connection burst and also to apply tc (traffic control).
Check command: ifconfig ethX
Change command: ifconfig ethX txqueuelen value
How to monitor: ip -s link
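Since ifconfig is considered legacy, the same check/change can be done with iproute2. A sketch, with eth0 as a placeholder interface name and 10000 as an illustrative value, not a recommendation:

```shell
# Show the current txqueuelen (printed as "qlen" in the first output line)
ip link show dev eth0
# Change it (requires root)
ip link set dev eth0 txqueuelen 10000
```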
What – default_qdisc is the default queuing discipline to use for network devices.
Why – different workloads need different queuing disciplines for traffic control; the qdisc is also the main weapon against bufferbloat.
Check command: sysctl net.core.default_qdisc
Change command: sysctl -w net.core.default_qdisc=<value>
How to monitor: tc -s qdisc ls dev ethX
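As a concrete example, fq_codel is a common anti-bufferbloat choice for default_qdisc. A sketch (requires root; eth0 is a placeholder, and the new default only affects qdiscs created afterwards):

```shell
# Make fq_codel the system-wide default queuing discipline
sysctl -w net.core.default_qdisc=fq_codel
# Existing interfaces keep their old qdisc; replace it explicitly
tc qdisc replace dev eth0 root fq_codel
# Verify and watch the per-qdisc counters
tc -s qdisc ls dev eth0
```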
TCP Read and Write Buffers/Queues
What – tcp_rmem – min (size used under memory pressure), default (initial size), max (maximum size) – size of receive buffer used by TCP sockets.
Why – this is the kernel-side receive buffer for TCP sockets; understanding its sizing and auto-tuning can help a lot.
Check command: sysctl net.ipv4.tcp_rmem
Change command: sysctl -w net.ipv4.tcp_rmem="min default max"; when changing the default value, remember to restart your user-space app (i.e. your web server, nginx, etc.)
How to monitor: cat /proc/net/sockstat
What – tcp_wmem – min (size used under memory pressure), default (initial size), max (maximum size) – size of send buffer used by TCP sockets.
Check command: sysctl net.ipv4.tcp_wmem
Change command: sysctl -w net.ipv4.tcp_wmem="min default max"; when changing the default value, remember to restart your user-space app (i.e. your web server, nginx, etc.)
How to monitor: cat /proc/net/sockstat
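A common way to choose the max value for tcp_rmem/tcp_wmem is the bandwidth-delay product (BDP) of the path. A sketch with assumed figures (1 Gbit/s link, 30 ms RTT; both are placeholders for your own measurements):

```shell
# BDP in bytes = bandwidth (bit/s) / 8 * RTT (s)
bw_bits=1000000000   # assumed: 1 Gbit/s link
rtt_ms=30            # assumed: 30 ms round-trip time
bdp=$(( bw_bits / 8 * rtt_ms / 1000 ))
echo "BDP = $bdp bytes"
# A max of at least the BDP lets one connection fill the pipe, e.g.:
#   sysctl -w net.ipv4.tcp_rmem="4096 87380 $bdp"
```

With these figures the BDP comes out to 3,750,000 bytes (about 3.6 MiB).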
What – tcp_moderate_rcvbuf – if set, TCP performs receive buffer auto-tuning, attempting to automatically size the buffer.
Check command: sysctl net.ipv4.tcp_moderate_rcvbuf
Change command: sysctl -w net.ipv4.tcp_moderate_rcvbuf=<value>
How to monitor: cat /proc/net/sockstat
Honorable mentions – TCP FSM and congestion algorithm
sysctl net.core.somaxconn – provides an upper limit on the value of the backlog parameter passed to the listen() function, known in userspace as SOMAXCONN. If you change this value, you should also change your application to a compatible value (e.g. the nginx backlog setting).
cat /proc/sys/net/ipv4/tcp_fin_timeout – this specifies the number of seconds to wait for a final FIN packet before the socket is forcibly closed. This is strictly a violation of the TCP specification but required to prevent denial-of-service attacks.
cat /proc/sys/net/ipv4/tcp_available_congestion_control – shows the available congestion control choices that are registered.
cat /proc/sys/net/ipv4/tcp_congestion_control – sets the congestion control algorithm to be used for new connections.
cat /proc/sys/net/ipv4/tcp_max_syn_backlog – sets the maximum number of queued connection requests which have still not received an acknowledgment from the connecting client; if this number is exceeded, the kernel will begin dropping requests.
cat /proc/sys/net/ipv4/tcp_syncookies – enables/disables syn cookies, useful for protecting against syn flood attacks.
cat /proc/sys/net/ipv4/tcp_slow_start_after_idle – enables/disables resetting of the congestion window after an idle period (TCP slow start after idle).
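Tying the first of those together: somaxconn and the application's listen() backlog must be raised in tandem, otherwise the smaller of the two wins. A sketch (requires root; 4096 is an illustrative value):

```shell
# Kernel-side ceiling for the accept queue
sysctl -w net.core.somaxconn=4096
# The application must ask for it too; for nginx that is the
# backlog parameter of the listen directive, e.g.:
#   listen 80 backlog=4096;
```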
How to monitor:

netstat -atn | awk '/tcp/ {print $6}' | sort | uniq -c – summary by state
ss -neopt state time-wait | wc -l – counters by a specific state: established, syn-sent, syn-recv, fin-wait-1, fin-wait-2, time-wait, closed, close-wait, last-ack, listening, closing
netstat -st – tcp stats summary
nstat -a – human-friendly tcp stats summary
cat /proc/net/sockstat – summarized socket stats
cat /proc/net/tcp – detailed stats, see each field meaning at the kernel docs
cat /proc/net/netstat – ListenOverflows and ListenDrops are important fields to keep an eye on
cat /proc/net/netstat | awk '(f==0) { i=1; while (i<=NF) { n[i] = $i; i++ }; f=1; next } (f==1) { i=2; while (i<=NF) { printf "%s = %d\n", n[i], $i; i++ }; f=0 }' | grep -v "= 0" – a human-readable /proc/net/netstat
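As a complement to the per-state counters above, all states can be counted in one pass. A sketch using ss (the awk part works on any captured output):

```shell
# Count sockets per TCP state; skip the header line of `ss -tan`
ss -tan | awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }'
```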

Network tools for testing and monitoring
iperf3 – network throughput
vegeta – HTTP load testing tool
netdata – system for distributed real-time performance and health monitoring

The Challenger Disaster: A Case of Subjective Engineering
Posted 28 Jan 2016 | 18:41 GMT
Illustration: Barry Ross
Editor’s Note: Today is the 30th anniversary of the loss of the space shuttle Challenger, which was destroyed 73 seconds into its flight, killing all on board. To mark the anniversary, IEEE Spectrum is republishing this seminal article, which first appeared in June 1989 as part of a special report on risk. The article has been widely cited both in histories of the space program and in analyses of engineering risk management.

“Statistics don’t count for anything,” declared Will Willoughby, the National Aeronautics and Space Administration’s former head of reliability and safety during the Apollo moon landing program. “They have no place in engineering anywhere.” Now director of reliability management and quality assurance for the U.S. Navy, Washington, D.C., he still holds that risk is minimized not by statistical test programs, but by “attention taken in design, where it belongs.” His design-oriented view prevailed in NASA in the 1970s, when the space shuttle was designed and built by many of the engineers who had worked on the Apollo program.

“The real value of probabilistic risk analysis is in understanding the system and its vulnerabilities,” said Benjamin Buchbinder, manager of NASA’s two-year-old risk management program. He maintains that probabilistic risk analysis can go beyond design-oriented qualitative techniques in looking at the interactions of subsystems, ascertaining the effects of human activity and environmental conditions, and detecting common-cause failures.

NASA started experimenting with this program in response to the Jan. 28, 1986, Challenger accident that killed seven astronauts. The program’s goals are to establish a policy on risk management and to conduct risk assessments independent of normal engineering analyses. But progress is slow, because past official policy favored “engineering judgment” over “probability numbers,” resulting in NASA’s failure to collect the type of statistical test and flight data useful for quantitative risk assessment.

This Catch-22 – the agency lacks appropriate statistical data because it did not believe in the technique requiring the data, so it did not gather the relevant data – is one example of how an organization’s underlying culture and explicit policy can affect the overall reliability of the projects it undertakes.

External forces such as politics further shape an organization’s response. Whereas the Apollo program was widely supported by the President and the U.S. Congress and had all the money it needed, the shuttle program was strongly criticized and underbudgeted from the beginning. Political pressures, coupled with the lack of hard numerical data, led to differences of more than three orders of magnitude in the few quantitative estimates of a shuttle launch failure that NASA was required by law to conduct.

Some observers still worry that, despite NASA’s late adoption of quantitative risk assessment, its internal culture and its fear of political opposition may be pushing it to repeat dangerous errors of the shuttle program in the new space station program.

Basic Facts
System: National Space Transportation System (NSTS)—the space shuttle

Risk assessments conducted during design and operation: preliminary hazards analysis; failure modes and effects analysis with critical items list; various safety assessments, all qualitative at the system level, but with quantitative analyses conducted for specific subsystems.

Worst failure: In the January 1986 Challenger accident, primary and secondary O-rings in the field joint of the right solid-fuel rocket booster were burnt through by hot gases.

Consequences: loss of $3 billion vehicle and crew.

Predictability: long history of erosion in O-rings, not envisaged in the original design.

Causes: inadequate original design (booster joint rotated farther open than intended); faulty judgment (managers decided to launch despite record low temperatures and ice on launch pad); possible unanticipated external events (severe wind shear may have been a contributing factor).

Lessons learned: in design, to use probabilistic risk assessment more in evaluating and assigning priorities to risks; in operation, to establish certain launch commit criteria that cannot be waived by anyone.

Other outcomes: redesign of booster joint and other shuttle subsystems that also had a high level of risk or unanticipated failures; reassessment of critical items.

NASA’s preference for a design approach to reliability to the exclusion of quantitative risk analysis was strengthened by a negative early brush with the field. According to Haggai Cohen, who during the Apollo days was NASA’s deputy chief engineer, NASA contracted with General Electric Co. in Daytona Beach, Fla., to do a “full numerical PRA [probabilistic risk assessment]” to assess the likelihood of success in landing a man on the moon and returning him safely to earth. The GE study indicated the chance of success was “less than 5 percent.” When the NASA Administrator was presented with the results, he felt that if made public, “the numbers could do irreparable harm, and he disbanded the effort,” Cohen said. “We studiously stayed away from [numerical risk assessment] as a result.”

“That’s when we threw all that garbage out and got down to work,” Willoughby agreed. The study’s proponents, he said, contended “‘you build up confidence by statistical test programs.’ We said, ‘No, go fly a kite, we’ll build up confidence by design.’ Testing gives you only a snapshot under particular conditions. Reality may not give you the same set of circumstances, and you can be lulled into a false sense of security or insecurity.”

As a result, NASA adopted qualitative failure modes and effects analysis (FMEA) as its principal means of identifying design features whose worst-case failure could lead to a catastrophe. The worst cases were ranked as Criticality 1 if they threatened the life of the crew members or the existence of the vehicle; Criticality 2 if they threatened the mission; and Criticality 3 for anything less. An R designated a redundant system [see “How NASA Determined Shuttle Risk”]. Quantitative techniques were limited to calculating the probability of the occurrence of an individual failure mode “if we had to present a rationale on how to live with a single failure point,” Cohen explained.

Illustration: Barry Ross
About 1700 design changes were made in the components and subsystems of the space shuttle between the Challenger accident of January 1986 and the launch of the next shuttle, Discovery, in September 1988. Some areas of the shuttle, however, still present significant risk. (Source: National Aeronautics and Space Administration)
The politics of risk

By the late 1960s and early 1970s the space shuttle was being portrayed as a reusable airliner capable of carrying 15-ton payloads into orbit and 5-ton payloads back to earth. Shuttle astronauts would wear shirtsleeves during takeoff and landing instead of the bulky spacesuits of the Gemini and Apollo days. And eventually the shuttle would carry just plain folks: non-astronaut scientists, politicians, schoolteachers, and journalists.

NASA documents show that the airline vision also applied to risk. For example, in the 1969 NASA Space Shuttle Task Group Report, the authors wrote: “It is desirable that the vehicle configuration provide for crew/passenger safety in a manner and to the degree as provided in present day commercial jet aircraft.”

Statistically an airliner is the least risky form of transportation, which implies high reliability. And in the early 1970s, when President Richard M. Nixon, Congress, and the Office of Management and Budget (OMB) were all skeptical of the shuttle, proving high reliability was crucial to the program’s continued funding.

OMB even directed NASA to hire an outside contractor to do an economic analysis of how the shuttle compared with other launch systems for cost-effectiveness, observed John M. Logsdon, director of the graduate program in science, technology, and public policy at George Washington University in Washington, D.C. “No previous space programme had been subject to independent professional economic evaluation,” Logsdon wrote in the journal Space Policy in May 1986. “It forced NASA into a belief that it had to propose a Shuttle that could launch all foreseeable payloads … [and] would be less expensive than alternative launch systems” and that, indeed, would supplant all expendable rockets. It also was politically necessary to show that the shuttle would be cheap and routine, rather than large and risky, with respect to both technology and cost, Logsdon pointed out.

Amid such political unpopularity, which threatened the program’s very existence, “some NASA people began to confuse desire with reality,” said Adelbert Tischler, retired NASA director of launch vehicles and propulsion. “One result was to assess risk in terms of what was thought acceptable without regard for verifying the assessment.” He added: “Note that under such circumstances real risk management is shut out.”

‘Disregarding data’

By the early 1980s many figures were being quoted for the overall risk to the shuttle, with estimates of a catastrophic failure ranging from less than 1 chance in 100 to 1 chance in 100,000. “The higher figures [1 in 100] come from working engineers, and the very low figures [1 in 100,000] from management,” wrote physicist Richard P. Feynman in his appendix “Personal Observations on Reliability of Shuttle” to the 1986 Report of the Presidential Commission on the Space Shuttle Challenger Accident.

The probabilities originated in a series of quantitative risk assessments NASA was required to conduct by the Interagency Nuclear Safety Review Panel (INSRP), in anticipation of the launch of the Galileo spacecraft on its voyage to Jupiter, originally scheduled for the early 1980s. Galileo was powered by a plutonium-fueled radioisotope thermoelectric generator, and Presidential Directive/NSC-25 ruled that either the U.S. President or the director of the office of science and technology policy must examine the safety of any launch of nuclear material before approving it. The INSRP (which consisted of representatives of NASA as the launching agency, the Department of Energy, which manages nuclear devices, and the Department of Defense, whose Air Force manages range safety at launch) was charged with ascertaining the quantitative risks of a catastrophic launch dispersing the radioactive poison into the atmosphere. There were a number of studies because the upper stage for boosting Galileo into interplanetary space was reconfigured several times.

The first study was conducted by the J. H. Wiggins Co. of Redondo Beach, Calif., and published in three volumes between 1979 and 1982. It put the overall risk of losing a shuttle with its spacecraft payload during launch at between 1 chance in 1000 and 1 in 10,000. The greatest risk was posed by the solid-fuel rocket boosters (SRBs). The Wiggins author noted that the history of other solid-fuel rockets showed them as undergoing catastrophic launches somewhere between 1 time in 59 and 1 time in 34, but that the study’s contract overseers, the Space Shuttle Range Safety Ad Hoc Committee, made an “engineering judgment” and “decided that a reduction in the failure probability estimate was warranted for the Space Shuttle SRBs” because “the historical data includes motors developed 10 to 20 years ago.” The Ad Hoc Committee therefore “decided to assume a failure probability of 1 × 10⁻³ for each SRB.” In addition, the Wiggins author pointed out, “it was decided by the Ad-Hoc Committee that a second probability should be considered… which is one order of magnitude less” or 1 in 10,000, “justified due to unique improvements made in the design and manufacturing process used for these motors to achieve man rating.”

In 1983 a second study was conducted by Teledyne Energy Systems Inc., Timonium, Md., for the Air Force Weapons Laboratory at Kirtland Air Force Base, N.M. It described the Wiggins analysis as consisting of “an interesting presentation of launch data from several Navy, Air Force, and NASA missile programs and the disregarding of that data and arbitrary assignment of risk levels apparently per sponsor direction” with “no quantitative justification at all.” After reanalyzing the data, the Teledyne authors concluded that the boosters’ track record “suggest[s] a failure rate of around one-in-a-hundred.”

When risk analysis isn’t

NASA conducted its own internal safety analysis for Galileo, which was published in 1985 by the Johnson Space Center. The Johnson authors went through failure mode worksheets assigning probability levels. A fracture in the solid-rocket motor case or case joints – similar to the accident that destroyed Challenger – was assigned a probability level of 2, which a separate table defined as corresponding to a chance of 1 in 100,000 and described as “remote,” or “so unlikely, it can be assumed that this hazard will not be experienced.”

The Johnson authors’ value of 1 in 100,000 implied, as Feynman spelled out, that “one could put a Shuttle up each day for 300 years expecting to lose only one.” Yet even after the Challenger accident, NASA’s chief engineer Milton Silveira, in a hearing on the Galileo thermonuclear generator held March 4, 1986, before the U.S. House of Representatives Committee on Science and Technology, said: “We think that using a number like 10 to the minus 3, as suggested, is probably a little pessimistic.” In his view, the actual risk “would be 10 to the minus 5, and that is our design objective.” When asked how the number was deduced, Silveira replied, “We came to those probabilities based on engineering judgment in review of the design rather than taking a statistical data base, because we didn’t feel we had that.”

After the Challenger accident, the 1986 presidential commission learned the O-rings in the field joints of the shuttle’s solid-fuel rocket boosters had a history of damage correlated with low air temperature at launch. So the commission repeatedly asked the witnesses it called to hearings why systematic temperature-correlation data had been unavailable before launch.

NASA’s “management methodology” for collection of data and determination of risk was laid out in NASA’s 1985 safety analysis for Galileo. The Johnson space center authors explained: “Early in the program it was decided not to use reliability (or probability) numbers in the design of the Shuttle” because the magnitude of testing required to statistically verify the numerical predictions “is not considered practical.” Furthermore, they noted, “experience has shown that with the safety, reliability, and quality assurance requirements imposed on manned spaceflight contractors, standard failure rate data are pessimistic.”

“In lieu of using probability numbers, the NSTS [National Space Transportation System] relies on engineering judgment using rigid and well-documented design, configuration, safety, reliability, and quality assurance controls,” the Johnson authors continued. This outlook determined the data NASA managers required engineers to collect. For example, no “lapsed-time indicators” were kept on shuttle components, subsystems, and systems, although “a fairly accurate estimate of time and/or cycles could be derived,” the Johnson authors added.

One reason was economic. According to George Rodney, NASA’s associate administrator of safety, reliability, maintainability and quality assurance, it is not hard to get time and cycle data, “but it’s expensive and a big bookkeeping problem.”

Another reason was NASA’s “normal program development: you don’t continue to take data; you certify the components and get on with it,” said Rodney’s deputy, James Ehl. “People think that since we’ve flown 28 times, then we have 28 times as much data, but we don’t. We have maybe three or four tests from the first development flights.”

In addition, Rodney noted, “For everyone in NASA that’s a big PRA [probabilistic risk assessment] seller, I can find you 10 that are equally convinced that PRA is oversold… [They] are so dubious of its importance that they won’t convince themselves that the end product is worthwhile.”

Risk and the organizational culture

One reason NASA has so strongly resisted probabilistic risk analysis may be the fact that “PRA runs against all traditions of engineering, where you handle reliability by safety factors,” said Elisabeth Paté-Cornell, associate professor in the department of industrial engineering and engineering management at Stanford University in California, who is now studying organizational factors and risk assessment in NASA. In addition, with NASA’s strong pride in design, PRA may be “perceived as an insult to their capabilities, that the system they’ve designed is not 100 percent perfect and absolutely safe,” she added. Thus, the character of an organization influences the reliability and failure of the systems it builds because its structure, policy, and culture determine the priorities, incentives, and communication paths for the engineers and managers doing the work, she said.

“Part of the problem is getting the engineers to understand that they are using subjective methods for determining risk, because they don’t like to admit that,” said Ray A. Williamson, senior associate at the U.S. Congress Office of Technology Assessment in Washington, D.C. “Yet they talk in terms of sounding objective and fool themselves into thinking they are being objective.”

“It’s not that simple,” Buchbinder said. “A probabilistic way of thinking is not something that most people are attuned to. We don’t know what will happen precisely each time. We can only say what is likely to happen a certain percentage of the time.” Unless engineers and managers become familiar with probability theory, they don’t know what to make of “large uncertainties that represent the state of current knowledge,” he said. “And that is no comfort to the poor decision-maker who wants a simple answer to the question, ‘Is this system safe enough?’”

As an example of how the “mindset” in the agency is now changing in favor of “a willingness to explore other things,” Buchbinder cited the new risk management program, the workshops it has been holding to train engineers and others in quantitative risk assessment techniques, and a new management instruction policy that requires NASA to “provide disciplined and documented management of risks throughout program life cycles.”

Hidden risks to the space station

NASA is now at work on its big project for the 1990s: a space station, projected to cost $30 billion and to be assembled in orbit, 220 nautical miles above the earth, from modules carried aloft in some two dozen shuttle launches. A National Research Council committee evaluated the space station program and concluded in a study in September 1987: “If the probability of damaging an Orbiter beyond repair on any single Shuttle flight is 1 percent – the demonstrated rate is now one loss in 25 launches, or 4 percent – the probability of losing an Orbiter before [the space station’s first phase] is complete is about 60 percent.”

The probability is within the right order of magnitude, to judge by the latest INSRP-mandated study completed in December for Buchbinder’s group in NASA by Planning Research Corp., McLean, Va. The study, which reevaluates the risk of the long-delayed launch of the Galileo spacecraft on its voyage to Jupiter, now scheduled for later this year, estimates the chance of losing a shuttle from launch through payload deployment at 1 in 78, or between 1 and 2 percent, with an uncertainty of a factor of 2.

Those figures frighten some observers because of the dire consequences of losing part of the space station. “The space station has no redundancy – no backup parts,” said Jerry Grey, director of science and technology policy for the American Institute of Aeronautics and Astronautics in Washington, D.C.

The worst case would be loss of the shuttle carrying the logistics module, which is needed for reboost, Grey pointed out. The space station’s orbit will subject it to atmospheric drag such that, if not periodically boosted higher, it will drift downward and within eight months plunge back to earth and be destroyed, as was the Skylab space station in July 1979. “If you lost the shuttle with the logistics module, you don’t have a spare, and you can’t build one in eight months,” Grey said, “so you may lose not only that one payload, but also whatever was put up there earlier.”

Why are there no backup parts? “Politically the space station is under fire [from the U.S. Congress] all the time because NASA hasn’t done an adequate job of justifying it,” said Grey. “NASA is apprehensive that Congress might cancel the entire program” – and so is trying to trim costs as much as possible.

Grey estimated that spares of the crucial modules might add another 10 percent to the space station’s cost. “But NASA is not willing to go to bat for that extra because they’re unwilling to take the political risk,” he said – a replay, he fears, of NASA’s response to the political negativism over the shuttle in the 1970s.

The NRC space station committee warned: “It is dangerous and misleading to assume there will be no losses and thus fail to plan for such events.”

“Let’s face it, space is a risky business,” commented former Apollo safety officer Cohen. “I always considered every launch a barely controlled explosion.”

“The real problem is: whatever the numbers are, acceptance of that risk and planning for it is what needs to be done,” Grey said. He fears that “NASA doesn’t do that yet.”

In addition to the sources named in the text, the authors would like to acknowledge the information and insights afforded by the following: E. William Colglazier (director of the Energy, Environment and Resources Center at the University of Tennessee, Knoxville) and Robert K. Weatherwax (president of Sierra Energy & Risk Assessment Inc., Roseville, Calif.), the two authors of the 1983 Teledyne/Air Force Weapons Laboratory study; Larry Crawford, director of reliability and trends analysis at NASA headquarters in Washington, D.C.; Joseph R. Fragola, vice president, Science Applications International Corp., New York City; Byron Peter Leonard, president, L Systems Inc., El Segundo, Calif.; George E. Mueller, former NASA associate administrator for manned spaceflight; and Marcia Smith, specialist in aerospace policy, Congressional Research Service, Washington, D.C.

This article first appeared in print in June 1989 as part of the special report “Managing Risk In Large Complex Systems” under the title “The space shuttle: a case of subjective engineering.”

How NASA Determined Shuttle Risk
At the start of the space shuttle’s design, the National Aeronautics and Space Administration defined risk as “the chance (qualitative) of loss of personnel capability, loss of system, or damage to or loss of equipment or property.” NASA accordingly relied on several techniques for determining reliability and potential design problems, concluded the U.S. National Research Council’s Committee on Shuttle Criticality Review and Hazard Analysis Audit in its January 1988 report Post-Challenger Evaluation of Space Shuttle Risk Assessment and Management. But, the report noted, the analyses did “not address the relative probabilities of a particular hazardous condition arising from failure modes, human errors, or external situations,” so did not measure risk.

A failure modes and effects analysis (FMEA) was the heart of NASA’s effort to ensure reliability, the NRC report noted. An FMEA, carried out by the contractor building each shuttle element or subsystem, was performed on all flight hardware and on ground support equipment that interfaced with flight hardware. Its chief purpose was to identify hardware critical to the performance and safety of the mission.

Items that did not meet certain design, reliability, and safety requirements specified by NASA’s top management, and whose failure could threaten the loss of crew, vehicle, or mission, made up a critical items list (CIL).

Although the FMEA/CIL was first viewed as a design tool, NASA now uses it during operations and management as well, to analyze problems, assess whether corrective actions are effective, identify where and when inspection and maintenance are needed, and reveal trends in failures.

Second, NASA conducted hazards analyses, performed jointly by shuttle engineers and by NASA’s safety and operations organizations. They made use of the FMEA/CIL, various design reviews, safety analyses, and other studies. They considered not only the failure modes identified in the FMEA, but also other threats posed by the mission activities, crew-machine interfaces, and the environment. After hazards and their causes were identified, NASA engineers and managers had to make one of three decisions: to eliminate the cause of each hazard, to control the cause if it could not be eliminated, or to accept the hazards that could not be controlled.

NASA also conducted an element interface functional analysis (EIFA) to look at the shuttle more nearly as a complete system. Both the FMEA and the hazards analyses concentrated only on individual elements of the shuttle: the space shuttle’s main engines in the orbiter, the rest of the orbiter, the external tank, and the solid-fuel rocket boosters. The EIFA assessed hazards at the mating of the elements.

Also to examine the shuttle as a system, NASA conducted a one-time critical functions assessment in 1978, which searched for multiple and cascading failures. The information from all these studies fed one way into an overall mission safety assessment.

The NRC committee had several criticisms. In practice, the FMEA was the sole basis for some engineering change decisions and all engineering waivers and rationales for retaining certain high-risk design features. However, the NRC report noted, hazard analyses for some important, high-risk subsystems “were not updated for years at a time even though design changes had occurred or dangerous failures were experienced.” On one procedural flow chart, the report noted, “the ‘Hazard Analysis As Required’ is a dead-end box with inputs but no output with respect to waiver approval decisions.”

The NRC committee concluded that “the isolation of the hazard analysis within NASA’s risk assessment and management process to date can be seen as reflecting the past weakness of the entire safety organization.” –T.E.B. and K.E.

How the Boeing 737 Max Disaster Looks to a Software Developer
Posted 18 Apr 2019 | 19:49 GMT
The views expressed here are solely those of the author and do not represent positions of IEEE Spectrum or the IEEE.

Photo: Jemal Countess/Getty Images
This is part of the wreckage of Ethiopian Airlines Flight ET302, a Boeing 737 Max airliner that crashed on 11 March in Bishoftu, Ethiopia, killing all 157 passengers and crew.
I have been a pilot for 30 years, a software developer for more than 40. I have written extensively about both aviation and software engineering. Now it’s time for me to write about both together.

The Boeing 737 Max has been in the news because of two crashes, practically back to back and involving brand new airplanes. In an industry that relies more than anything on the appearance of total control, total safety, these two crashes pose as close to an existential risk as you can get. Though airliner passenger death rates have fallen over the decades, that achievement is no reason for complacency.

The 737 first appeared in 1967, when I was 3 years old. Back then it was a smallish aircraft with smallish engines and relatively simple systems. Airlines (especially Southwest) loved it because of its simplicity, reliability, and flexibility. Not to mention the fact that it could be flown by a two-person cockpit crew—as opposed to the three or four of previous airliners—which made it a significant cost saver. Over the years, market and technological forces pushed the 737 into ever-larger versions with increasing electronic and mechanical complexity. This is not, by any means, unique to the 737. Airliners constitute enormous capital investments both for the industries that make them and the customers who buy them, and they all go through a similar growth process.

Most of those market and technical forces are on the side of economics, not safety. They work as allies to relentlessly drive down what the industry calls “seat-mile costs”—the cost of flying a seat from one point to another.

Much had to do with the engines themselves. The principle of Carnot efficiency dictates that the larger and hotter you can make any heat engine, the more efficient it becomes. That’s as true for jet engines as it is for chainsaw engines.

It’s as simple as that. The most effective way to make an engine use less fuel per unit of power produced is to make it larger. That’s why the Lycoming O-360 engine in my Cessna has pistons the size of dinner plates. That’s why marine diesel engines stand three stories tall. And that’s why Boeing wanted to put the huge CFM International LEAP engine in its latest version of the 737.
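The physics behind this can be stated in one line. A heat engine working between a hot source at absolute temperature T_h and a cold sink at T_c has a maximum (Carnot) efficiency of

```latex
\eta_{\max} = 1 - \frac{T_c}{T_h}
```

so the hotter you can run the combustion side relative to the outside air, the closer you get to the theoretical limit. (The size advantage is a related but distinct effect: a bigger fan accelerates more air more gently, which improves propulsive efficiency.)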

There was just one little problem: The original 737 had (by today’s standards) tiny little engines, which easily cleared the ground beneath the wings. As the 737 grew and was fitted with bigger engines, the clearance between the engines and the ground started to get a little…um, tight.

By substituting a larger engine, Boeing changed the intrinsic aerodynamic nature of the 737 airliner.
Various hacks (as we would call them in the software industry) were developed. One of the most noticeable to the public was changing the shape of the engine intakes from circular to oval, the better to clear the ground.

With the 737 Max, the situation became critical. The engines on the original 737 had a fan diameter (that of the intake blades on the engine) of just 100 centimeters (40 inches); those planned for the 737 Max have 176 cm. That’s a centerline difference of well over 30 cm (a foot), and you couldn’t “ovalize” the intake enough to hang the new engines beneath the wing without scraping the ground.

The solution was to extend the engine up and well in front of the wing. However, doing so also meant that the centerline of the engine’s thrust changed. Now, when the pilots applied power to the engine, the aircraft would have a significant propensity to “pitch up,” or raise its nose.

The angle of attack is the angle between the wings and the airflow over the wings. Think of sticking your hand out of a car window on the highway. If your hand is level, you have a low angle of attack; if your hand is pitched up, you have a high angle of attack. When the angle of attack is great enough, the wing enters what’s called an aerodynamic stall. You can feel the same thing with your hand out the window: As you rotate your hand, your arm wants to move up like a wing more and more until you stall your hand, at which point your arm wants to flop down on the car door.
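The hand-out-the-window picture can be captured in a toy model. This is a sketch with made-up numbers, not real 737 aerodynamics: lift rises roughly linearly with angle of attack until a critical angle, past which the wing stalls and lift collapses.

```python
# Toy lift-curve model. All constants are illustrative assumptions,
# not real aerodynamic data for any aircraft.
CRITICAL_AOA_DEG = 15.0   # assumed stall angle for this sketch
LIFT_SLOPE = 0.1          # assumed lift coefficient gained per degree

def lift_coefficient(aoa_deg: float) -> float:
    """Lift rises ~linearly with angle of attack, then collapses past the stall."""
    if aoa_deg <= CRITICAL_AOA_DEG:
        return LIFT_SLOPE * aoa_deg
    # Past the critical angle the airflow separates and lift drops off sharply.
    return max(0.0, LIFT_SLOPE * CRITICAL_AOA_DEG - 0.3 * (aoa_deg - CRITICAL_AOA_DEG))

# More angle of attack means more lift -- until, suddenly, it doesn't.
assert lift_coefficient(10) > lift_coefficient(5)
assert lift_coefficient(20) < lift_coefficient(15)
```

Your arm rising against the highway wind is the left branch of this function; your arm flopping onto the car door is the right branch.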

This propensity to pitch up with power application thereby increased the risk that the airplane could stall when the pilots “punched it” (as my son likes to say). It’s particularly likely to happen if the airplane is flying slowly.

Worse still, because the engine nacelles are so far in front of the wing and so large, a power increase causes them to produce lift, particularly at high angles of attack. So the nacelles make a bad problem worse.

I’ll say it again: In the 737 Max, the engine nacelles themselves can, at high angles of attack, work as a wing and produce lift. And the lift they produce is well ahead of the wing’s center of lift, meaning the nacelles will cause the 737 Max at a high angle of attack to go to a higher angle of attack. This is aerodynamic malpractice of the worst kind.

Pitch changes with power changes are common in aircraft. Even my little Cessna pitches up a bit when power is applied. Pilots train for this problem and are used to it. Nevertheless, there are limits to what safety regulators will allow and to what pilots will put up with.

Pitch changes with increasing angle of attack, however, are quite another thing. An airplane approaching an aerodynamic stall cannot, under any circumstances, have a tendency to go further into the stall. This is called “dynamic instability,” and the only airplanes that exhibit that characteristic—fighter jets—are also fitted with ejection seats.

Everyone in the aviation community wants an airplane that flies as simply and as naturally as possible. That means that conditions should not change markedly, there should be no significant roll, no significant pitch change, no nothing when the pilot is adding power, lowering the flaps, or extending the landing gear.

The airframe, the hardware, should get it right the first time and not need a lot of added bells and whistles to fly predictably. This has been an aviation canon from the day the Wright brothers first flew at Kitty Hawk.

Apparently the 737 Max pitched up a bit too much for comfort on power application as well as at already-high angles of attack. It violated that most ancient of aviation canons and probably violated the certification criteria of the U.S. Federal Aviation Administration. But instead of going back to the drawing board and getting the airframe hardware right (more on that below), Boeing relied on something called the “Maneuvering Characteristics Augmentation System,” or MCAS.

Boeing’s solution to its hardware problem was software.

I will leave a discussion of the corporatization of the aviation lexicon for another article, but let’s just say another term might be the “Cheap way to prevent a stall when the pilots punch it,” or CWTPASWTPPI, system. Hmm. Perhaps MCAS is better, after all.

MCAS is certainly much less expensive than extensively modifying the airframe to accommodate the larger engines. Such an airframe modification would have meant things like longer landing gear (which might not then fit in the fuselage when retracted), more wing dihedral (upward bend), and so forth. All of those hardware changes would be horribly expensive.

“Everything about the design and manufacture of the Max was done to preserve the myth that ‘it’s just a 737.’ Recertifying it as a new aircraft would have taken years and millions of dollars. In fact, the pilot licensed to fly the 737 in 1967 is still licensed to fly all subsequent versions of the 737.”
—Feedback on an earlier draft of this article from a 737 pilot for a major airline
What’s worse, those changes could be extensive enough to require not only that the FAA recertify the 737 but that Boeing build an entirely new aircraft. Now we’re talking real money, both for the manufacturer as well as the manufacturer’s customers.

That’s because the major selling point of the 737 Max is that it is just a 737, and any pilot who has flown other 737s can fly a 737 Max without expensive training, without recertification, without another type of rating. Airlines—Southwest is a prominent example—tend to go for one “standard” airplane. They want to have one airplane that all their pilots can fly because that makes both pilots and airplanes fungible, maximizing flexibility and minimizing costs.

It all comes down to money, and in this case, MCAS was the way for both Boeing and its customers to keep the money flowing in the right direction. The necessity to insist that the 737 Max was no different in flying characteristics, no different in systems, from any other 737 was the key to the 737 Max’s fleet fungibility. That’s probably also the reason why the documentation about the MCAS system was kept on the down-low.

Put in a change with too much visibility, particularly a change to the aircraft’s operating handbook or to pilot training, and someone—probably a pilot—would have piped up and said, “Hey. This doesn’t look like a 737 anymore.” And then the money would flow the wrong way.

As I explained, you can do your own angle-of-attack experiments just by putting your hand out a car door window and rotating it. It turns out that sophisticated aircraft have what is essentially the mechanical equivalent of a hand out the window: the angle-of-attack sensor.

You may have noticed this sensor when boarding a plane. There are usually two of them, one on either side of the plane, and usually just below the pilot’s windows. Don’t confuse them with the pitot tubes (we’ll get to those later). The angle-of-attack sensors look like wind vanes, whereas the pitot tubes look like, well, tubes.

Angle-of-attack sensors look like wind vanes because that’s exactly what they are. They are mechanical hands designed to rotate in response to changes in that angle of attack.

The pitot tubes measure how much the air is “pressing” against the airplane, whereas the angle-of-attack sensors measure what direction that air is coming from. Because they measure air pressure, the pitot tubes are used to determine the aircraft’s speed through the air. The angle-of-attack sensors measure the aircraft’s direction relative to that air.

There are two sets of angle-of-attack sensors and two sets of pitot tubes, one set on either side of the fuselage. Normal usage is to have the set on the pilot’s side feed the instruments on the pilot’s side and the set on the copilot’s side feed the instruments on the copilot’s side. That gives a state of natural redundancy in instrumentation that can be easily cross-checked by either pilot. If the copilot thinks his airspeed indicator is acting up, he can look over to the pilot’s airspeed indicator and see if it agrees. If not, both pilot and copilot engage in a bit of triage to determine which instrument is profane and which is sacred.
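That triage is simple enough to sketch in code. The following is a hypothetical illustration of the cross-check, not anything from a real avionics system: compare the two sides, and if they disagree beyond some tolerance, trust neither reading on its own.

```python
def cross_check(pilot_reading: float, copilot_reading: float,
                tolerance: float = 5.0) -> bool:
    """Return True if two independent sensor readings agree within tolerance.

    Hypothetical triage logic: when the sides disagree, neither reading can
    be trusted by itself, and the crew must dig deeper to find the bad one.
    """
    return abs(pilot_reading - copilot_reading) <= tolerance

# Airspeed in knots from each side's pitot-static system (made-up values).
assert cross_check(250.0, 252.0)        # sides agree: either one is usable
assert not cross_check(250.0, 180.0)    # sides disagree: time for triage
```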

Long ago there was a joke that in the future planes would fly themselves, and the only thing in the cockpit would be a pilot and a dog. The pilot’s job was to make the passengers comfortable that someone was up front. The dog’s job was to bite the pilot if he tried to touch anything.

On the 737, Boeing not only included the requisite redundancy in instrumentation and sensors, it also included redundant flight computers—one on the pilot’s side, the other on the copilot’s side. The flight computers do a lot of things, but their main job is to fly the plane when commanded to do so and to make sure the human pilots don’t do anything wrong when they’re flying it. The latter is called “envelope protection.”

Let’s just call it what it is: the bitey dog.

Let’s review what the MCAS does: It pushes the nose of the plane down when the system thinks the plane might exceed its angle-of-attack limits; it does so to avoid an aerodynamic stall. Boeing put MCAS into the 737 Max because the larger engines and their placement make a stall more likely in a 737 Max than in previous 737 models.

When MCAS senses that the angle of attack is too high, it commands the aircraft’s trim system (the system that makes the plane go up or down) to lower the nose. It also does something else: It pushes the pilot’s control columns (the things the pilots pull or push on to raise or lower the aircraft’s nose) downward.

In the 737 Max, like most modern airliners and most modern cars, everything is monitored by computer, if not directly controlled by computer. In many cases, there are no actual mechanical connections (cables, push tubes, hydraulic lines) between the pilot’s controls and the things on the wings, rudder, and so forth that actually make the plane move. And, even where there are mechanical connections, it’s up to the computer to determine if the pilots are engaged in good decision making (that’s the bitey dog again).

But it’s also important that the pilots get physical feedback about what is going on. In the old days, when cables connected the pilot’s controls to the flying surfaces, you had to pull up, hard, if the airplane was trimmed to descend. You had to push, hard, if the airplane was trimmed to ascend. With computer oversight there is a loss of natural sense in the controls. In the 737 Max, there is no real “natural feel.”

True, the 737 does employ redundant hydraulic systems, and those systems do link the pilot’s movement of the controls to the action of the ailerons and other parts of the airplane. But those hydraulic systems are powerful, and they do not give the pilot direct feedback from the aerodynamic forces that are acting on the ailerons. There is only an artificial feel, a feeling that the computer wants the pilots to feel. And sometimes, it doesn’t feel so great.

When the flight computer trims the airplane to descend, because the MCAS system thinks it’s about to stall, a set of motors and jacks push the pilot’s control columns forward. It turns out that the flight management computer can put a lot of force into that column—indeed, so much force that a human pilot can quickly become exhausted trying to pull the column back, trying to tell the computer that this really, really should not be happening.

The antistall system depended crucially on sensors that are installed on each side of the airliner—but the system consulted only the sensor on one side.
Indeed, not letting the pilot regain control by pulling back on the column was an explicit design decision. Because if the pilots could pull up the nose when MCAS said it should go down, why have MCAS at all?

MCAS is implemented in the flight management computer, even at times when the autopilot is turned off, when the pilots think they are flying the plane. In a fight between the flight management computer and human pilots over who is in charge, the computer will bite humans until they give up and (literally) die.

Finally, there’s the need to keep the very existence of the MCAS system on the hush-hush lest someone say, “Hey, this isn’t your father’s 737,” and bank accounts start to suffer.

The flight management computer is a computer. What that means is that it’s not full of aluminum bits, cables, fuel lines, or all the other accoutrements of aviation. It’s full of lines of code. And that’s where things get dangerous.

Those lines of code were no doubt created by people at the direction of managers. Neither such coders nor their managers are as in touch with the particular culture and mores of the aviation world as much as the people who are down on the factory floor, riveting wings on, designing control yokes, and fitting landing gears. Those people have decades of institutional memory about what has worked in the past and what has not worked. Software people do not.

In the 737 Max, only one of the flight management computers is active at a time—either the pilot’s computer or the copilot’s computer. And the active computer takes inputs only from the sensors on its own side of the aircraft.

When the two computers disagree, the solution for the humans in the cockpit is to look across the control panel to see what the other instruments are saying and then sort it out. In the Boeing system, the flight management computer does not “look across” at the other instruments. It believes only the instruments on its side. It doesn’t go old-school. It’s modern. It’s software.

This means that if a particular angle-of-attack sensor goes haywire—which happens all the time in a machine that alternates from one extreme environment to another, vibrating and shaking all the way—the flight management computer just believes it.
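As a sketch of the design flaw described here (hypothetical code with assumed thresholds, not Boeing’s implementation): the active computer consults only the angle-of-attack vane on its own side, so a single failed sensor is enough to command nose-down trim.

```python
STALL_AOA_DEG = 14.0  # assumed activation threshold for this sketch

def mcas_command(own_side_aoa_deg: float) -> str:
    """Single-sensor logic as the article describes it: the active computer
    reads only its own side's vane -- no cross-check, no sanity check."""
    if own_side_aoa_deg > STALL_AOA_DEG:
        return "trim nose down"   # believed stall: push the nose over
    return "no action"

# A healthy flight at a normal angle of attack:
assert mcas_command(5.0) == "no action"
# A failed vane pegged at a nonsense value is believed without question:
assert mcas_command(74.5) == "trim nose down"
```

Nothing in this logic can distinguish a genuine impending stall from one broken wind vane.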

It gets even worse. There are several other instruments that can be used to determine things like angle of attack, either directly or indirectly, such as the pitot tubes, the artificial horizons, etc. All of these things would be cross-checked by a human pilot to quickly diagnose a faulty angle-of-attack sensor.

In a pinch, a human pilot could just look out the windshield to confirm visually and directly that, no, the aircraft is not pitched up dangerously. That’s the ultimate check and should go directly to the pilot’s ultimate sovereignty. Unfortunately, the current implementation of MCAS denies that sovereignty. It denies the pilots the ability to respond to what’s before their own eyes.

Like someone with narcissistic personality disorder, MCAS gaslights the pilots. And it turns out badly for everyone. “Raise the nose, HAL.” “I’m sorry, Dave, I’m afraid I can’t do that.”

In the MCAS system, the flight management computer is blind to any other evidence that it is wrong, including what the pilot sees with his own eyes and what he does when he desperately tries to pull back on the robotic control columns that are biting him, and his passengers, to death.

In the old days, the FAA had armies of aviation engineers in its employ. Those FAA employees worked side by side with the airplane manufacturers to determine that an airplane was safe and could be certified as airworthy.

As airplanes became more complex and the gulf between what the FAA could pay and what an aircraft manufacturer could pay grew larger, more and more of those engineers migrated from the public to the private sector. Soon the FAA had no in-house ability to determine if a particular airplane’s design and manufacture were safe. So the FAA said to the airplane manufacturers, “Why don’t you just have your people tell us if your designs are safe?”

The airplane manufacturers said, “Sounds good to us.” The FAA said, “And say hi to Joe, we miss him.”

Thus was born the concept of the “Designated Engineering Representative,” or DER. DERs are people in the employ of the airplane manufacturers, the engine manufacturers, and the software developers who certify to the FAA that it’s all good.

Now this is not quite as sinister a conflict of interest as it sounds. It is in nobody’s interest that airplanes crash. The industry absolutely relies on the public trust, and every crash is an existential threat to the industry. No manufacturer is going to employ DERs that just pencil-whip the paperwork. On the other hand, though, after a long day and after the assurance of some software folks, they might just take their word that things will be okay.

It is astounding that no one who wrote the MCAS software for the 737 Max seems even to have raised the possibility of using multiple inputs, including the opposite angle-of-attack sensor, in the computer’s determination of an impending stall. As a lifetime member of the software development fraternity, I don’t know what toxic combination of inexperience, hubris, or lack of cultural understanding led to this mistake.

But I do know that it’s indicative of a much deeper problem. The people who wrote the code for the original MCAS system were obviously terribly far out of their league and did not know it. How can they implement a software fix, much less give us any comfort that the rest of the flight management software is reliable?

So Boeing produced a dynamically unstable airframe, the 737 Max. That is big strike No. 1. Boeing then tried to mask the 737’s dynamic instability with a software system. Big strike No. 2. Finally, the software relied on systems known for their propensity to fail (angle-of-attack indicators) and did not appear to include even rudimentary provisions to cross-check the outputs of the angle-of-attack sensor against other sensors, or even the other angle-of-attack sensor. Big strike No. 3.

None of the above should have passed muster. None of the above should have passed the “OK” pencil of the most junior engineering staff, much less a DER.

That’s not a big strike. That’s a political, social, economic, and technical sin.

It just so happens that, during the timeframe between the first 737 Max crash and the most recent one, I’d had the occasion to upgrade and install a brand-new digital autopilot in my own aircraft. I own a 1979 Cessna 172, the most common aircraft in history, at least by production numbers. Its original certification predates the 737’s by about a decade (1955 versus 1967).

My new autopilot consists of several very modern components, including redundant flight computers (dual Garmin G5s) and a sophisticated communication “bus” (a Controller Area Network bus) that lets all the various components talk to one another, irrespective of where they are located in my plane. A CAN bus derives from automotive “drive by wire” technology but is otherwise very similar in purpose and form to the various ARINC buses that connect the components in the 737 Max.

My autopilot also includes electric pitch trim. That means it can make the same types of configuration changes to my 172 that the flight computers and MCAS system make to the 737 Max. During the installation, after the first 737 Max crash, I remember remarking to a friend that it was not lost on me that I was potentially adding a hazard similar to the one that brought down the Lion Air flight.

Finally, my new autopilot also implements “envelope protection,” the envelope being the graph of the performance limitations of an aircraft. If my Cessna is not being flown by the autopilot, the system nonetheless constantly monitors the airplane to make sure that I am not about to stall it, roll it inverted, or a whole host of other things. Yes, it has its own “bitey dog” mode.

As you can see, the similarities between my US $20,000 autopilot and the multimillion-dollar autopilot in every 737 are direct, tangible, and relevant. What, then, are the differences?

For starters, the installation of my autopilot required paperwork in the form of a “Supplemental Type Certificate,” or STC. It means that the autopilot manufacturer and the FAA both agreed that my 1979 Cessna 172 with its (Garmin) autopilot was so significantly different from what the airplane was when it rolled off the assembly line that it was no longer the same Cessna 172. It was a different aircraft altogether.

In addition to now carrying a new (supplemental) aircraft-type certificate (and certification), my 172 required a very large amount of new paperwork to be carried in the plane, in the form of revisions and addenda to the aircraft operating manual. As you can guess, most of those addenda revolved around the autopilot system.

Of particular note in that documentation, which must be studied and understood by anyone who flies the plane, are various explanations of the autopilot system, including its command of the trim control system and its envelope protections.

There are instructions on how to detect when the system malfunctions and how to disable the system, immediately. Disabling the system means pulling the autopilot circuit breaker; instructions on how to do that are strewn throughout the documentation, repeatedly. Every pilot who flies my plane becomes intimately aware that it is not the same as any other 172.

This is a big difference between what pilots who want to fly my plane are told and what pilots stepping into a 737 Max are (or were) told.

Another difference is between the autopilots in my system and that in the 737 Max. All of the CAN bus–interconnected components constantly do the kind of instrument cross-check that human pilots do and that, apparently, the MCAS system in the 737 Max does not. For example, the autopilot itself has a self-contained attitude platform that checks the attitude information coming from the G5 flight computers. If there is a disagreement, the system simply goes off-line and alerts the pilot that she is now flying manually. It doesn’t point the airplane’s nose at the ground, thinking it’s about to stall.

Perhaps the biggest difference is in the amount of physical force it takes for the pilot to override the computers in the two planes. In my 172, there are still cables linking the controls to the flying surfaces. The computer has to press on the same things that I have to press on—and its strength is nowhere near as great as mine. So even if, say, the computer thought that my plane was about to stall when it wasn’t, I can easily overcome the computer.

In my Cessna, humans still win a battle of the wills every time. That used to be a design philosophy of every Boeing aircraft, as well, and one they used against their archrival Airbus, which had a different philosophy. But it seems that with the 737 Max, Boeing has changed philosophies about human/machine interaction as quietly as they’ve changed their aircraft operating manuals.

The 737 Max saga teaches us not only about the limits of technology and the risks of complexity, it teaches us about our real priorities. Today, safety doesn’t come first—money comes first, and safety’s only utility in that regard is in helping to keep the money coming. The problem is getting worse because our devices are increasingly dominated by something that’s all too easy to manipulate: software.

Hardware defects, whether they are engines placed in the wrong place on a plane or O-rings that turn brittle when cold, are notoriously hard to fix. And by hard, I mean expensive. Software defects, on the other hand, are easy and cheap to fix. All you need to do is post an update and push out a patch. What’s more, we’ve trained consumers to consider this normal, whether it’s an update to my desktop operating systems or the patches that get posted automatically to my Tesla while I sleep.

Back in the 1990s, I wrote an article comparing the relative complexity of the Pentium processors of that era, expressed as the number of transistors on the chip, to the complexity of the Windows operating system, expressed as the number of lines of code. I found that the complexity of the Pentium processors and the contemporaneous Windows operating system was roughly equal.

That was the time when early Pentiums were affected by what was known as the FDIV bug. It affected only a tiny fraction of Pentium users. Windows was also affected by similar defects, also affecting only fractions of its users.

But the effects on the companies were quite different. Where Windows addressed its small defects with periodic software updates, in 1994 Intel recalled the (slightly) defective processors. It cost the company $475 million—more than $800 million in today’s money.

I believe the relative ease—not to mention the lack of tangible cost—of software updates has created a cultural laziness within the software engineering community. Moreover, because more and more of the hardware that we create is monitored and controlled by software, that cultural laziness is now creeping into hardware engineering—like building airliners. Less thought is now given to getting a design correct and simple up front because it’s so easy to fix what you didn’t get right later.

Every time a software update gets pushed to my Tesla, to the Garmin flight computers in my Cessna, to my Nest thermostat, and to the TVs in my house, I’m reminded that none of those things were complete when they left the factory—because their builders realized they didn’t have to be complete. The job could be done at any time in the future with a software update.

“I’m a software developer turned network engineer and have written airliner avionics software in the past. It was interesting how many hoops we had to jump through to get an add-on board for the computer certified, while software certifications were nil (other than “cannot run on Windows,” “must be written in C++”). This was, admittedly, nearly 10 years ago, and I hope that things have changed since.”
—Anonymous, personal correspondence
Boeing is in the process of rolling out a set of software updates to the 737 Max flight control system, including MCAS. I don’t know, but I suspect that those updates will center on two things:

Having the software “cross-check” system indicators, just as a human pilot would. Meaning, if one angle-of-attack indicator says the plane’s about to stall, but the other one says it’s not so, at least hold off judgment about pushing the nose down into the dirt and maybe let a pilot or two know you’re getting conflicting signals. 

Backing off on the “shoot first, ask questions later” design philosophy—meaning, looking at multiple inputs. 
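Those two suspected changes amount to something like the following sketch. This is my guess at the logic, with assumed thresholds, not Boeing’s actual update: require both vanes to agree before trimming, and surface any disagreement to the pilots instead of acting on it.

```python
STALL_AOA_DEG = 14.0    # assumed activation threshold for this sketch
AGREE_TOLERANCE = 5.0   # assumed allowable disagreement between the two vanes

def mcas_command_v2(left_aoa: float, right_aoa: float) -> str:
    """Cross-checked logic: act only when both sensors tell the same story."""
    if abs(left_aoa - right_aoa) > AGREE_TOLERANCE:
        # Conflicting signals: hold off judgment and let the pilots know.
        return "disengage and alert crew"
    if min(left_aoa, right_aoa) > STALL_AOA_DEG:
        return "trim nose down"   # both sides agree: stall risk is plausible
    return "no action"

assert mcas_command_v2(5.0, 5.5) == "no action"
assert mcas_command_v2(16.0, 16.5) == "trim nose down"
assert mcas_command_v2(74.5, 5.0) == "disengage and alert crew"  # one vane haywire
```

The difference between this and the single-sensor version is exactly the cross-check a human pilot performs by glancing across the cockpit.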

For the life of me, I do not know why those two basic aviation design considerations, bedrocks of a mind-set that has served the industry so well until now, were not part of the original MCAS design. And, when they were not, I do not know or understand what part of the DER process failed to catch the fundamental design defect.

But I suspect that it all has to do with the same thing that brought us from Boeing’s initial desire to put larger engines on the 737 and to avoid having to internalize the cost of those larger engines—in other words, to do what every child is taught is impossible: get a free lunch.

The emphasis on simplicity comes from the work of Charles Perrow, a sociologist at Yale University whose 1984 book, Normal Accidents: Living With High-Risk Technologies, tells it all in the very title. Perrow argues that system failure is a normal outcome in any system that is very complex and whose components are “tightly bound”—meaning that the behavior of one component immediately controls the behavior of another. Though such failures may seem to stem from one or another faulty part or practice, they must be seen as inherent in the system itself. They are “normal” failures.

Nowhere is this problem more acutely felt than in systems designed to augment or improve safety. Every increment, every increase in complexity, ultimately leads to decreasing rates of return and, finally, to negative returns. Trying to patch and then repatch such a system in an attempt to make it safer can end up making it less safe.

This is the root of the old engineering axiom “Keep it simple, stupid” (KISS) and its aviation-specific counterpart: “Simplify, then add lightness.”

The original FAA Eisenhower-era certification requirement was a testament to simplicity: Planes should not exhibit significant pitch changes with changes in engine power. That requirement was written when there was a direct connection between the controls in the pilot’s hands and the flying surfaces on the airplane. Because of that, the requirement—when written—rightly imposed a discipline of simplicity on the design of the airframe itself. Now software stands between man and machine, and no one seems to know exactly what is going on. Things have become too complex to understand.

I cannot get the parallels between the 737 Max and the space shuttle Challenger out of my head. The Challenger accident, another textbook case study in normal failure, came about not because people didn’t follow the rules but because they did. In the Challenger case, the rules said that they had to have prelaunch conferences to ascertain flight readiness. It didn’t say that a significant input to those conferences couldn’t be the political considerations of delaying a launch. The inputs were weighed, the process was followed, and a majority consensus was to launch. And seven people died.

In the 737 Max case, the rules were also followed. The rules said you couldn’t have a large pitch-up on power change and that an employee of the manufacturer, a DER, could sign off on whatever you came up with to prevent a pitch change on power change. The rules didn’t say that the DER couldn’t take the business considerations into the decision-making process. And 346 people are dead.

It is likely that MCAS, originally added in the spirit of increasing safety, has now killed more people than it could have ever saved. It doesn’t need to be “fixed” with more complexity, more software. It needs to be removed altogether.

An earlier version of this article was cited in EE Times.

About the Author
Gregory Travis is a writer, a software executive, a pilot, and an aircraft owner. In 1977, at the age of 13, he wrote Note, one of the first social media platforms, and he has logged more than 2,000 hours of flying time, ranging from gliders to a Boeing 757 (in a full-motion simulator).


Hardening SSH with 2fa

lizthegrey / attributes.rb
default['sshd']['sshd_config']['AuthenticationMethods'] = 'publickey,keyboard-interactive:pam'
default['sshd']['sshd_config']['ChallengeResponseAuthentication'] = 'yes'
default['sshd']['sshd_config']['PasswordAuthentication'] = 'no'
Hi! I’m Liz, a Developer Advocate at Honeycomb, and I spent my first weeks at the company doing security hardening of our infrastructure. I’d like to share what I learned with you, so that you can benefit from my reading of dozens of scattered pages of documentation and my ruling out of numerous dead ends.

Why you should take security and usability seriously
Developers and administrators have historically used SSH keys to provide authentication between hosts. By adding passphrase encryption, the private keys become resistant to theft when at rest. But what about when in use? Unfortunately, the usability challenge of re-entering the passphrase on every connection meant that engineers began caching keys unencrypted in the memory of their workstations and, worse yet, forwarding the agent to allow remote hosts to use the cached keys without further confirmation. The recent breach at Matrix underscores how dangerous it is to allow authenticated sessions to propagate across hosts and environments without a human in the loop.

Thus, we need solutions that prevent key theft from the systems we connect to, while maintaining ease of use. Two-factor authentication stops malicious automated propagation in its tracks by having a second factor protect use of our keys. There are two primary ways of preventing an attacker from misusing our credentials: a separate device that uses a shared secret to generate numerical codes we transfer out of band and enter alongside our key, or a separate device that performs all the cryptography for us, and only when we physically authorize it.
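To make the shared-secret approach concrete, here is a minimal sketch of the TOTP computation (RFC 6238) that authenticator apps and oathtool perform. This is an illustration, not the code any particular app runs:

```python
import hashlib
import hmac
import struct

def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HMAC the time counter, then truncate."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation (RFC 4226, section 5.3)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: ASCII secret "12345678901234567890" at T=59 yields "287082"
print(totp(b"12345678901234567890", 59))
```

Both sides derive the same code from the shared secret and the current time, which is why clock skew matters (and why the PAM configuration later allows a time-slip window).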

Google, where I previously worked, employs short-lived SSH certificates issued by a central piece of infrastructure, stored on secure hardware tokens. But this is a serious change to developer workflow, and requires extensive infrastructure to set up. What will work for a majority of developers who are used to simply loading their SSH key into the agent at the start of their login session and SSHing everywhere?

Design considerations & threat models
I’m assuming that you have a publicly exposed bastion host for each environment that intermediates accesses to the rest of each environment’s VPC, and use SSH keys to authenticate from laptops to the bastion and from the bastion to each VM/container in the VPC. If you don’t yet have a bastion host and a VPC, start there!

It was important to me to make Honeycomb safe from compromise, even if malicious worm-like code were executed on a developer’s laptop while SSH keys were unlocked, or if a developer accidentally forwarded an SSH agent to a hostile remote system. I also thought it important to build on existing work to disk encrypt all endpoints by ensuring the loss of physical control over a phone or hardware token could not itself grant production access. However, I consider it out of scope to prevent active local intervention and session hijacking (since someone who controls your active console or keyboard has you pretty well pwned).

I’m also assuming you have a mix of operating systems, hardware, and preferences about carrying dongles vs. wanting to use phones for second factor, etc.

How to get started!
First, start by enabling numerical time-based one-time passwords (TOTP) for SSH authentication. Is it perfect? No: a malicious host could impersonate the real bastion (if strict host checking isn’t on), intercept your OTP, and then use it to authenticate to the real bastion. But it’s better than being wormed or compromised because you forgot to take basic measures against even a passive adversary.

Server-side setup
You’ll want a root shell open just in case, and the following snippets added to your Chef cookbooks (from this gist):

attributes/default.rb (from attributes.rb)
recipes/default.rb (copy from recipe.rb)
Okay, now we can set this running on our hosts… and go through the client setup for ourselves at least.

Client-side setup
Now, each user authenticating needs a shared key to be present, encrypted, in SSM (or equivalent for your choice of cloud provider). Have each user install an OTP app such as Google Authenticator, Authy, Duo, or Lastpass, then do the following on their laptop:

Install dependencies:

brew install oath-toolkit OR apt install oathtool openssl

Generate a random base16 string to use as your key:

➜ openssl rand -hex 10
##### ^^^ that's an example output used here - don't use it!
Convert it and put it into a phone-based authenticator app: Run oathtool -v [key] to convert it to the format (“Base32 secret”) that mobile authenticators use.

➜ oathtool -v 22ea2966afefd82660e1
Hex secret: 22ea2966afefd82660e1
Base32 secret: ELVCSZVP57MCMYHB
… more stuff down here we don’t need
For 1Password, add a one-time password and enter the “Base32 secret” output from oathtool -v [key].
For Duo, select “other” and use the Base32 secret.
For Authy, click “Enter key manually” and use the Base32 secret.
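If you want to sanity-check the conversion yourself, the hex-to-Base32 translation that oathtool -v performs is straightforward to reproduce. A sketch using Python’s standard library, with the example key from above:

```python
import base64

# The 10 random bytes from `openssl rand -hex 10`, as shown above (example only).
hex_key = "22ea2966afefd82660e1"

# Decode the hex string to raw bytes, then re-encode as Base32,
# the same conversion `oathtool -v` prints as "Base32 secret".
base32_key = base64.b32encode(bytes.fromhex(hex_key)).decode()
print(base32_key)  # ELVCSZVP57MCMYHB
```

Ten bytes is 80 bits, which divides evenly into Base32’s 5-bit groups, so there is no `=` padding to strip.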
Verify that generated codes are correct: Run oathtool --totp [key] and check that it returns the same value as your authenticator application.

➜ oathtool --totp 22ea2966afefd82660e1
Store our key into the cloud secrets manager: Run aws ssm put-parameter --name /2fa/totp/$USER --value [key] --type SecureString --key-id alias/parameter_store_key to put your key into SSM Parameter Store. $USER should be the same as the username you use when you log in to a bastion. If you are updating the key instead of pushing it for the first time, add the --overwrite flag to the end of the command.

➜ aws ssm put-parameter --name /2fa/totp/lizf --value 22ea2966afefd82660e1 --type SecureString --key-id alias/parameter_store_key
“Version”: 1
Log in for the first time: Now, when we ssh to the bastion host, the SSH agent can only be trampolined to other hosts within the VPC; any attempt from the outside to use the forwarded agent (or loaded in-memory keys) programmatically to access a bastion will fail, because no TOTP from the separate mobile device was provided.

Let’s check that we’re asking for TOTPs:

➜ ssh -A bastion
Enter passphrase for key '[snip]':
One-time password (OATH) for '[user]':
Welcome to Ubuntu 18.04.1 LTS…
Now there’s a value proposition for hardware auth…
People might get sick and tired of entering a numerical OTP every time they have to log into the bastion! It’s almost like the old days of passphrase-encrypted SSH keys that motivated us to use agents! So let’s leverage this inherent laziness to get people more, rather than less, secure!

Server-side setup
Change the beginning of files/sshd in your Chef module to begin as follows:

auth required
auth optional

# If it's a hardware or secure enclave SSH key, no need for a numerical OTP.
auth sufficient pam_ssh_agent_auth.so file=/etc/2fa_token_keys

# Check a TOTP code as a second resort, using a time slip of +/- 150 seconds.
auth sufficient pam_oath.so usersfile=/etc/users.oath digits=6 window=5

# People without OTPs will need to add an OTP secret to AWS SSM and wait an hour.
auth requisite

And add the following additional lines to recipes/default.rb (a note to the nervous: my source modifications to openssh-server and libpam-ssh-agent-auth are available from Launchpad):

apt_repository 'openssl-pam-bindings' do
  uri ''
end

packages = %w{ openssh-server libpam-ssh-agent-auth }
packages.each do |p|
  package p do
    action :upgrade
  end
end

service 'sshd' do
  subscribes :reload, 'package[openssh-server]'
end
Now you’ll need to use Chef to populate /etc/2fa_token_keys with keys that you know are generated and stored securely (e.g. using one of the below methods). I don’t know how you maintain your lists of ssh key mappings to users, nor how you add ssh keys to your ~/.ssh/authorized_keys files, so I can’t provide general advice.

Mac client setup
People with Touch Bar Macs should use Touch ID to authenticate logins, as they’ll have their laptop and their fingers with them anyway. sekey lets us support this.

Install the binary:

brew cask install sekey

Add to ~/.ssh/config on your local machine:

IdentityAgent ~/.sekey/ssh-agent.ssh

Generate a key and export it:

sekey --generate-keypair "bastion key"
sekey --export-key $(sekey --list-keys | grep "bastion key" | grep --only-matching -E '[a-f0-9]{40}')
And then store the resulting key to /etc/2fa_token_keys and ~/.ssh/authorized_keys in Chef.

Krypton setup for iOS and Android
Instead of generating OTPs and sending them over manually with our fingers, our mobile devices can securely store our SSH keys and only remotely authorize usage (and send the signed challenge to the remote server) if a human presses a button on the phone.

This is the theory behind Krypton, and it is even more secure than a TOTP app so long as you supply appropriate parameters to force hardware coprocessor storage (NIST P-256 for iOS, and 3072-bit RSA for Android, on new enough devices). Make sure people use screen locks!

Follow the instructions here: and then supply the generated key to both ~/.ssh/authorized_keys and /etc/2fa_token_keys in your Chef automation, and you won’t be prompted for a TOTP.

YubiKey hardware token & Linux/ChromeOS client setup
Initial per-YubiKey setup
Follow these instructions from a Linux host to set up a basic working hardened YubiKey SSH key:

Install Dependencies

sudo apt-add-repository ppa:yubico/stable && sudo apt-get update
sudo apt-get install gpg yubikey-manager-qt pinentry-curses scdaemon pcscd
echo "reader-port Yubico YubiKey" > .gnupg/scdaemon.conf
Hardening to prevent a rogue host from authenticating without your permission

ykman openpgp touch sig on
ykman openpgp touch aut on
ykman openpgp touch enc on
Hardening in case your security key is stolen

gpg --change-pin

The default user PIN is 123456 and the default admin PIN is 12345678; change both of them to something more secure (they can be the same PIN).

Generate a random 24-byte hex-encoded reset key and save it somewhere, GPG-encrypted with your normal daily-use keys (ykman-gui can generate a 24-byte string for you in “PIV → Configure PINs → Change Management Key”).
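If you don’t have ykman-gui handy, any cryptographically secure random source will do for the 24-byte value. For example, a one-liner with Python’s standard library:

```python
import secrets

# 24 random bytes, hex-encoded: a 48-character string suitable as a reset key.
print(secrets.token_hex(24))
```

`openssl rand -hex 24` on the command line produces an equivalent value.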

Generating the keys:

gpg --card-edit
Input 4096 for all three modes (you’ll need to enter the admin and user pins)
Don’t back up the stubs when prompted.
Enter your full name and email address; make sure you leave a comment (e.g. desk computer) so you know which stub key is which in your GPG keyring.
Wait a minute, then enter the user PIN one more time, then wait about 5-10 minutes for the generation process to complete. It will print the UID of the master key before returning you to the card-edit prompt.

gpg --export-ssh-key UID_of_master_key

This will print out the ssh pubkey string you’ll need to add to the remote ~/.ssh/authorized_keys and /etc/2fa_token_keys in Chef.

Usage for authentication
Once the per-key setup is done, the configured YubiKey can be used from a Linux machine configured like so: gpg --with-keygrip -K

Save the keygrip of the master key you just generated to .gnupg/sshcontrol

Ensure that you have gpg-agent configured correctly: set the curses pinentry. Why? So you don’t randomly get X passphrase/passcode prompts all over the place (especially remotely):

Edit ~/.gnupg/gpg-agent.conf to contain:

pinentry-program /usr/bin/pinentry-curses
You’ll update your ~/.bashrc to contain the following lines:

export SSH_AUTH_SOCK=$(gpgconf --list-dirs agent-ssh-socket)
export GPG_TTY=$(tty)
gpg-connect-agent updatestartuptty /bye >/dev/null
Run ssh-add -l to confirm you see your key in the list (it’ll show 4096 SHA256:… cardno:… (RSA) in the listing).

When you ssh from a terminal into a bastion (remember to ssh -A for agent forwarding!), it’ll prompt in the terminal that you most recently opened for your PIN on initial usage of the key. You’ll complete that step, then tap your key to confirm. You’re in!

Install these two Chrome apps: the Secure Shell App and the Smart Card Connector.
Then open the Secure Shell App (this won’t work yet from the Crostini Terminal app because Crostini doesn’t have USB pass-through yet, although it’s coming in Chrome 75!)

Within the secure shell app’s configuration screen for the bastion host:

relay server option: --ssh-agent=gsc
ssh option: -A
You’ll then enter the user PIN when prompted, and tap the security key to confirm when logging into the bastion.

Further reading
The folks at have written some fantastic blogs on securing SSH that go beyond the basic hardening I recommend here.

Hope this helps! Send me a Twitter DM (@lizthegrey) or email ( if you have improvements to suggest!

name "bastion"
description "special hardening for bastions"
version "0.0.1"

depends "aws"
depends "sshd"
package 'libpam-oath' do
  action :upgrade
end

aws_ssm_parameter_store 'getOTPsecrets' do
  path '/2fa/totp/' # or your own choice of SSM path.
  recursive true
  with_decryption true
  return_key 'totp_secrets'
  action :get_parameters_by_path
  # No need for aws_access_key and aws_secret_access_key due to implicit EC2 grant.
  sensitive true
end

# Populate the oath file.
template '/etc/users.oath' do
  source 'users.oath.erb'
  owner 'root'
  group 'root'
  mode '0600'
  variables(:users => lazy { node.run_state['totp_secrets'] })
  sensitive true
end

cookbook_file '/etc/pam.d/sshd' do
  source 'sshd'
  owner 'root'
  group 'root'
  mode '0644'
  action :create
end

# Force ssh to consult PAM as well as using SSH keys for primary auth.
include_recipe 'sshd'
auth required
auth optional

# Check a TOTP code, using a time slip of +/- 150 seconds.
auth sufficient pam_oath.so usersfile=/etc/users.oath digits=6 window=5

# People without OTPs will need to add an OTP secret to AWS SSM and wait an hour.
auth requisite

account required pam_nologin.so
@include common-account

session [success=ok ignore=ignore module_unknown=ignore default=bad] pam_selinux.so close
session required pam_loginuid.so
session optional pam_keyinit.so force revoke
@include common-session

session optional pam_motd.so motd=/run/motd.dynamic
session optional pam_motd.so noupdate
session optional pam_mail.so standard noenv
session required pam_limits.so
session required pam_env.so
session required pam_env.so user_readenv=1 envfile=/etc/default/locale
session [success=ok ignore=ignore module_unknown=ignore default=bad] pam_selinux.so open
@include common-password
# To update this file, generate a SSM parameter in /2fa/totp.
# See this URL for examples:

HOTP/T30 –

commented 1 day ago
I see you like Hardening

commented 1 day ago
I prefer to use (and have my devs use) ed25519 keys instead of RSA.
I do it both on kryptonco and on laptops

commented about 21 hours ago
This is fantastic, thank you for writing it!

I’d be interested to see how this could be adapted to work with FreeIPA. (Identity management via ldap, Kerberos, sssd and a certificate authority)

commented about 21 hours ago
Thank you for sharing your knowledge and expertise for free. People like you make the world a better place.

commented about 21 hours ago
I would just use gpg hardware keys with SSH.

Then you don’t have to mess with 2 factor.

Winkster commented about 17 hours ago • edited about 17 hours ago
This is awesome, Liz, thanks so much. Very thorough and thoughtful post. All angles considered. You rock.

commented about 16 hours ago
I’m a little annoyed this entire thing assumes we have Chef.

Requirements should be stated at the beginning, or at the least, you should not launch directly into “add blah into Chef” but instead “First you’ll need Chef and to add blah into it”.

It’s presumptuous and rude to assume everybody reading has the requirements; you should instead explicitly state them.

dirtypants commented about 15 hours ago • edited about 15 hours ago
Thank you for this write-up, it’s a lot of great information. I don’t think it is either presumptuous or rude to document the actual common and normal software you used, though some may be offended by your use of code. I must admit I’m left confused why anyone would have a problem with someone posting on github the thing they did with common software.

robertkraig commented about 14 hours ago • edited about 14 hours ago
I’d love to see a video posted on youtube or attached in gif in fast-forward mode to see the whole process work / look.

commented about 13 hours ago
Great write-up, thanks for sharing!

relaxdiego commented about 11 hours ago • edited about 7 hours ago
Area man says it’s presumptuous and rude to share what you know for free on the Interwebz. In other news, research shows that open source is bad for the world because of the multitude of languages and frameworks in use. “It’s too gosh darn rude and offensive!” according to one bystander.

commented about 7 hours ago
This is awesome, thanks a lot for sharing!

commented about 4 hours ago
Just wanted to thank you, this is VERY valuable nowadays.

commented about 1 hour ago
Unless I’m misreading, you should delete the top comment for offensiveness. Whether you do or not, please delete this comment after you read it. Thank you.


Part 1: Java to native using GraalVM
Written by Roy van Rijn ( on Sep 20, 2018 09:14:13

One of the most amazing projects I’ve learned about this year is GraalVM.

I’ve learned about this project during Devoxx Poland (a Polish developer conference) at a talk by Oleg Šelajev. If you’re curious about everything GraalVM has to offer, not just the native Java compilation, please watch his video.

GraalVM is a universal/polyglot virtual machine. This means GraalVM can run programs written in:

Python 3
JVM-based languages (such as Java, Scala, Kotlin)
LLVM-based languages (such as C, C++).
In short: Graal is very powerful.

There is also the possibility to mix and match languages using Graal. Do you want to make a nice graph in R from your Java code? No problem. Do you want to call some fast C code from Python? Go ahead.

Installing GraalVM
In this blogpost, though, we’ll look at another powerful thing Graal can do: native-image compilation.

Instead of explaining what it is, let’s just go ahead, install GraalVM and try it out.

To install GraalVM, download the archive, unpack it, update your PATH, and you’re ready to go. When you look in the /bin directory of Graal you’ll see the following programs:

Here we recognise some usual commands, such as ‘javac’ and ‘java’. And if everything is set up correctly you’ll see:

$ java -version
openjdk version "1.8.0_172"
OpenJDK Runtime Environment (build 1.8.0_172-20180626105433.graaluser.jdk8u-src-tar-g-b11)
GraalVM 1.0.0-rc6 (build 25.71-b01-internal-jvmci-0.48, mixed mode)
Hello World with native-image
Next up, let’s create a “Hello World” application in Java:

public class HelloWorld {
    public static void main(String... args) {
        System.out.println("Hello World");
    }
}
And just like your normal JDK, we can compile and run this code in the Graal virtual machine:

$ javac HelloWorld.java
$ java HelloWorld
Hello World
But the real power of Graal becomes clear when we use a third command: native-image

This command takes your Java class(es) and turns them into an actual program, a standalone binary executable, without any virtual machine! The commands you pass to native-image are very similar to what you would pass to java. In this case we have the classpath and the Main class:

$ native-image -cp . HelloWorld
Build on Server(pid: 63941, port: 60051)*
[helloworld:63941] classlist: 1,236.06 ms
[helloworld:63941] (cap): 1,885.61 ms
[helloworld:63941] setup: 2,758.47 ms
[helloworld:63941] (typeflow): 3,031.39 ms
[helloworld:63941] (objects): 2,136.63 ms
[helloworld:63941] (features): 46.04 ms
[helloworld:63941] analysis: 5,304.17 ms
[helloworld:63941] universe: 205.46 ms
[helloworld:63941] (parse): 640.12 ms
[helloworld:63941] (inline): 1,155.06 ms
[helloworld:63941] (compile): 3,436.76 ms
[helloworld:63941] compile: 5,594.76 ms
[helloworld:63941] image: 749.82 ms
[helloworld:63941] write: 653.29 ms
[helloworld:63941] [total]: 16,753.87 ms
$ ls -ltr
-rw-r--r--  1 royvanrijn  wheel      119 Sep 20 09:36 HelloWorld.java
-rw-r--r--  1 royvanrijn  wheel      425 Sep 20 09:38 HelloWorld.class
-rwxr-xr-x 1 royvanrijn wheel 5596400 Sep 20 09:41 helloworld
$ ./helloworld
Hello World
Now we have an executable that prints “Hello World”, without any JVM in between, in just 5.6 MB. Sure, for this example 5 MB isn’t that small, but it is much smaller than having to package and install an entire JVM (400+ MB)!

Docker and native-image
So what else can we do? Well, because the resulting program is a binary, we can put it into a Docker image without ANY overhead. To do this we’ll need two different Dockerfiles: the first is used to compile the program for Linux (instead of macOS or Windows), the second is the ‘host’ Dockerfile, used to run our program.

Here is the first Dockerfile:

FROM ubuntu

RUN apt-get update && \
apt-get -y install gcc libc6-dev zlib1g-dev curl bash && \
rm -rf /var/lib/apt/lists/*

# Latest version of GraalVM (at the time of writing)
ENV GRAAL_VERSION 1.0.0-rc6
ENV GRAAL_FILENAME graalvm-ce-${GRAAL_VERSION}-linux-amd64.tar.gz

# Download GraalVM

# Untar and move the files we need:
RUN tar -zxvf /tmp/${GRAAL_FILENAME} -C /tmp \
&& mv /tmp/graalvm-ce-${GRAAL_VERSION} /usr/lib/graalvm

RUN rm -rf /tmp/*

# Create a volume to which we can mount to build:
VOLUME /project
WORKDIR /project

# And finally, run native-image
ENTRYPOINT ["/usr/lib/graalvm/bin/native-image"]
This image can be created as follows:

$ docker build -t royvanrijn/graal-native-image:latest .
Using this image we can create a different kind of executable. Let’s create our application using the just created docker image:

$ docker run -it \
  -v /Projects/graal-example/helloworld/:/project --rm \
  royvanrijn/graal-native-image:latest \
  --static -cp . HelloWorld -H:Name=app

Build on Server(pid: 11, port: 40905)*
[app:11] classlist: 3,244.85 ms
[app:11] (cap): 1,023.94 ms
[app:11] setup: 1,986.81 ms
[app:11] (typeflow): 4,285.18 ms
[app:11] (objects): 2,008.19 ms
[app:11] (features): 57.07 ms
[app:11] analysis: 6,446.49 ms
[app:11] universe: 255.45 ms
[app:11] (parse): 926.85 ms
[app:11] (inline): 1,496.69 ms
[app:11] (compile): 4,953.85 ms
[app:11] compile: 7,689.47 ms
[app:11] image: 806.53 ms
[app:11] write: 573.77 ms
[app:11] [total]: 21,160.90 ms
$ ls -ltr app
-rwxr-xr-x 1 royvanrijn wheel 6766144 Sep 20 10:11 app
$ ./app
-bash: ./app: cannot execute binary file
This results in an executable ‘app’, but one I can’t start on my MacBook, because it is a statically linked Ubuntu executable. So what do all these commands mean? Well, let’s break it down:

The first part is just running Docker:
docker run -it

Next we map my directory containing the class files to the volume /project in the Docker image:
-v /Projects/graal-example/helloworld/:/project --rm

This is the Docker image we want to run, the one we just created:

And finally we have the commands we pass to native-image inside the Docker image
We start with --static, which causes the created binary to be a statically linked executable.

We have the class path and Main class:
-cp . HelloWorld

And finally we tell native-image to name the resulting executable ‘app’:
-H:Name=app
But we can do something cool with it using the following, surprisingly empty, Dockerfile:

FROM scratch
COPY app /app
CMD ["/app"]
We start with the emptiest Docker image you can have, scratch; we copy in our app executable and run it. Now we can build our helloworld image:

$ docker build -t royvanrijn/graal-helloworld:latest .
Sending build context to Docker daemon 34.11MB
Step 1/3 : FROM scratch
Step 2/3 : COPY app /app
 ---> f0894b299e8f
Removing intermediate container 37182de1ef68
 ---> 49ff43413c7a
Step 3/3 : CMD ["/app"]
 ---> Running in ea69a913d243
Removing intermediate container ea69a913d243
 ---> ab33b4d59de3
Successfully built ab33b4d59de3
Successfully tagged royvanrijn/graal-helloworld:latest

$ docker images
royvanrijn/graal-helloworld latest ab33b4d59de3 5 seconds ago 6.77MB
We’ve now turned our Java application into a very small Docker image with a size of just 6.77MB!

In the next blogpost Part 2 we’ll take a look at Java applications larger than just HelloWorld. How will GraalVM’s native-image handle those applications, and what are the limitations we’ll run into?
