Hardening Perl’s Hash Function
Yves Orton
Wed 06 November 2013
In 2003 the Perl development community was made aware of an algorithmic complexity attack on Perl’s hash table implementation[1]. This attack was similar to reports over the last few years of attacks on other languages and packages, such as the Java, Ruby and Python hash implementations.

The basic idea of this attack is to precompute a set of keys which would hash to the same value, and thus the same storage bucket. These keys would then be fed (as a batch) to a target which would then have to compare each key against each previously stored key before inserting the new key, effectively turning the hash into a linked list, and changing the performance profile for inserting each item from O(1) (amortized) to O(N). This means that the practice of loading arguments such as GET/POST parameters into hashes provided a vector for denial of service attacks on many HTTP-based applications.
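
To make the cost concrete, here is a toy, pure-Perl model of a chained hash table. It is illustrative only, not Perl’s real C implementation; the bucket count, the precomputed hash value of 42 and the key names are all invented for the sketch. When every key lands in the same bucket, each insert has to compare against every previously stored key, so the per-insert cost degrades from O(1) to O(N):

use strict;
use warnings;

# Eight buckets, each holding a chain (array) of keys.
my @buckets = map { [] } 1 .. 8;

sub toy_insert {
    my ($hash_value, $key) = @_;
    my $chain = $buckets[ $hash_value % @buckets ];
    for my $entry (@$chain) {        # compare against every key already stored
        return if $entry eq $key;    # key already present
    }
    push @$chain, $key;
}

# An attacker supplies keys that all share one precomputed hash value,
# so they all collapse into a single chain:
toy_insert(42, "key$_") for 1 .. 1000;
printf "longest chain: %d entries\n", scalar @{ $buckets[42 % 8] };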

As a response to this, Perl implemented a mechanism by which it would detect long chains of entries within a bucket and trigger a “hash split”. This meant it would double the number of buckets and then redistribute the previously stored keys as required into the newly added buckets. If, after this hash split, the chain was still unacceptably long, Perl would cause the hash to go into a special mode (REHASH mode) where it uses a per-process random hash seed for its hash function. Switching a normal hash to this special mode would cause Perl to allocate a new bucket array, recalculate all of the previously stored keys using the random seed and redistribute the keys from the old bucket array into the new one. This mitigated the attack by remapping the previously colliding keys into a well distributed set of randomly chosen new buckets.
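
The bucket doubling itself is easy to observe: on the perls discussed here, a non-empty hash evaluated in scalar context reports “used/total” buckets (newer perls, 5.26 onward, report the key count instead), which is the same behaviour the vulnerability-check one-liner further below relies on. A small loop shows the bucket array doubling from 8 to 16 to 32 and so on as keys are added:

use strict;
use warnings;

my %h;
for my $n (1 .. 40) {
    $h{$n} = 1;
    # Before 5.26, scalar %h stringifies as "used/total" buckets.
    printf "%2d keys => buckets: %s\n", $n, scalar %h;
}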

At this point the Perl community thought we had put the subject of hash collision attacks behind us, and for nearly 10 years we heard little more on the subject.

Memory Exhaustion Attack On REHASH Mechanism
Over the years the subject of changing our hash function would occasionally come up. For instance, Jarkko made a number of comments that there were faster hash functions, and in response I did a bit of research into the subject, but little came of this work.

In 2012 this changed. I was working on several projects that made heavy use of Perl’s hash function, and I decided to invest some effort to see if other hash functions would provide performance improvements. At the same time other people in the Perl community were becoming interested, partly due to my work and partly due to the publicity from the multi-collision attacks on Python’s and Ruby’s hash functions (MurmurHash and CityHash). I did not actually notice that publicity until after I had pushed patches to switch Perl to use MurmurHash as its default hash, something that was reverted very quickly.

In restructuring the Perl hash implementation so it was easier to test different hash functions, I became well acquainted with the finer details of the implementation of the REHASH mechanism. Frankly it got in the way and I wanted to remove it outright. While arguing about whether it could be replaced with a conceptually simpler mechanism I discovered that the defenses put in place in 2003 were not as strong as had been previously believed. In fact they provided a whole new and, arguably, more dangerous attack vector than the original attack they were meant to mitigate. This resulted in the perl5 security team announcing CVE-2013-1667, and the release of security patches for all major Perl versions since 5.8.x.

The problem was that the REHASH mechanism allowed an attacker to create a set of keys which would cause Perl to repeatedly double the size of the hash table, but never trigger the use of the randomized hash seed. With relatively few keys the attacker could make Perl allocate a bucket array with up to 2^32 hash buckets, or as many as memory would allow. Even if the attack did not consume all the memory on the box there would be serious performance consequences as Perl remapped the keys into ever increasing bucket arrays. Even on fast 2013 hardware, counting from 0 to 2^32 takes a while!

This issue affected all versions of Perl from 5.8.2 to 5.16.2. It does not affect Perl 5.18. For those interested the security patches for these versions are as follows:

maint-5.8: 2674b61957c26a4924831d5110afa454ae7ae5a6
maint-5.10: f14269908e5f8b4cab4b55643d7dd9de577e7918
maint-5.12: 9d83adcdf9ab3c1ac7d54d76f3944e57278f0e70
maint-5.14: d59e31fc729d8a39a774f03bc6bc457029a7aef2
maint-5.16: 6e79fe5714a72b1ef86dc890ff60746cdd19f854
At this time most Perl installations should be security-patched. Additionally, official Perl maintenance releases 5.16.3 and 5.14.4 were published. But if you would like to know whether you are vulnerable you can try the following program:

perl -le'@h{qw(a h k r ad ao as ax ay bs ck cm cz ej fz hm ia ih is
iz jk kx lg lv lw nj oj pr ql rk sk td tz vy yc yw zj zu aad acp
acq adm ajy alu apb apx asa asm atf axi ayl bbq bcs bdp bhs bml)}
=(); print %h=~/128/ && "not ","ok # perl $]"'
The keys in the one-liner are chosen to collide under the unseeded hash function; on an unpatched perl they trigger an extra hash split, so the hash reports 128 buckets and the check prints “not ok”. The following are statistics generated by the time program for the full attack (not the one-liner above) against a Perl 5.16 with and without the fix applied, on a laptop with 8GB of RAM (identical/zero lines omitted):

Without the fix patch (0ff9bbd11bcf0c048e5b3e4b893c52206692eed2):

User time (seconds): 62.02
System time (seconds): 1.57
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:04.01
Maximum resident set size (kbytes): 8404752
Minor (reclaiming a frame) page faults: 1049666
Involuntary context switches: 8946
With the fix patch (f1220d61455253b170e81427c9d0357831ca0fac) applied:

User time (seconds): 0.05
System time (seconds): 0.00
Percent of CPU this job got: 56%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.09
Maximum resident set size (kbytes): 16912
Minor (reclaiming a frame) page faults: 1110
Involuntary context switches: 3209
But this doesn’t explain all of the changes in Perl 5.18
The observant reader will have realized that if we could patch older Perls to be robust to CVE-2013-1667, then we could also have patched 5.18 and avoided the problems that were caused by changing Perl’s hash function. The reason we went even further than those maintenance patches is that we found out that Perl had further, if less readily exploitable, vulnerabilities to attack, and we wanted to do our best to fix them all.

This part of the story starts with Ruslan Zakirov posting a report to the perl5-security mailing list. The report outlined the basis of a key discovery attack on Perl’s hash function. At first the Perl security team was not entirely convinced, but he then followed up with more code that demonstrated that his attack was sound. This development meant that the choice of a random seed would not make Perl’s hash function robust to attack. An attacker could relatively efficiently determine the seed, and then use that knowledge to construct a set of attack keys that could be used to attack the hash function.

Nicholas Clark then ramped things up a notch further and did some in-depth analysis on the attack and the issues involved. At the same time, so did Ruslan and I. The conclusion of this analysis was that the attack exploited multiple vulnerabilities in how Perl’s hash data structure worked and that the response would similarly require a multi-pronged approach.

Changes to the One-At-A-Time function
The first vulnerability was that Bob Jenkins’ One-At-A-Time hash, which Perl used, does not “mix” the seed together with the hashed data well enough for short keys. This allows an attacker to mount a key discovery attack by using small sets of short keys and the order they were stored in to probe the “seed” and eventually expose enough bits of the seed that a final collision attack could be mounted.

We addressed this issue by making Perl append a four byte, randomly chosen suffix to every string it hashed. This means that we always “mix” the seed at least 4 times, and we mix it with something that the attacker cannot know. This effectively doubles the number of bits used for “secret” state, and ensures that short keys do not “leak” information about the original seed. The reason we use a suffix is that adding a prefix is the same as starting with a different initial seed state, so it does not add any extra security. A suffix modifies the final state after the user input is provided and increases the search space an attacker must consider.

Related to this change was that the original One-At-A-Time function was potentially vulnerable to multi-collision attacks. An attacker could precalculate one or more suffixes such that

H(x) == H( concat(x, suffix) )
which would then allow an attacker to trivially construct an infinite set of keys which would always collide into the same bucket. We hardened the hash by mixing the length of the key into the seed. We believe that this more or less eliminates the possibility of a multi-collision attack, as it means that the seed used to calculate H( concat(x, suffix) ) would not be the same seed as H( concat(x, suffix, suffix) ). Cryptographers are invited to prove us wrong.
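
As a rough illustration of the two measures just described, here is a pure-Perl sketch of a Jenkins one-at-a-time style hash that mixes the key length into the seed and hashes a secret random suffix after the key. It is a simplified model of the idea, not the actual C code in Perl’s hv_func.h, and the four-byte suffix size is an assumption of the sketch:

use strict;
use warnings;

my $MASK = 0xFFFFFFFF;            # keep everything to 32 bits

sub hardened_oaat {
    my ($key, $seed, $suffix) = @_;
    my $hash = ($seed ^ length $key) & $MASK;      # mix the key length into the seed
    for my $byte (unpack "C*", $key . $suffix) {   # the secret suffix is hashed last
        $hash = ($hash + $byte)         & $MASK;
        $hash = ($hash + ($hash << 10)) & $MASK;
        $hash ^= $hash >> 6;
    }
    $hash = ($hash + ($hash << 3))  & $MASK;
    $hash ^= $hash >> 11;
    $hash = ($hash + ($hash << 15)) & $MASK;
    return $hash;
}

# Per-process secrets: a 32-bit seed and a 4-byte suffix.
my $seed   = int rand 2**32;
my $suffix = pack "N", int rand 2**32;
printf "%08x\n", hardened_oaat("example", $seed, $suffix);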

Reduce Information Leakage
The second vulnerability was that it is all too easy to leak information about the hash function to an attacker. For instance a web page might accept a set of parameters and respond with information for each of those parameters in the natural key order for the hash. This might provide enough information to mount a key discovery attack.

In order to prevent this information leakage we randomize the element order returned by the keys() and each() functions on the hash. We do this by adding a mask to each hash, and when an insert into the hash occurs we modify the mask in a pseudo-random way. During traversal we iterate from 0 to the k-th bucket and then XOR the iteration value with the mask. The result is that every time a new key is added to the hash the order of keys will change more or less completely. This means that the “natural” key order of the hash exposes almost no useful data to an attacker. Seeing one key in front of another does not tell you anything about which bucket the key was stored in. A nice side effect of this is that we can use this mask to detect people inserting into a hash during each() traversal, which generally indicates a problem in their code and can produce quite surprising results and be very difficult to debug. In Perl 5.18 when this happens we warn the developer about it.
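
Here is a toy model of that traversal trick; the bucket contents and bucket count are invented, and Perl’s real iterator works on its internal bucket array rather than a Perl-level structure. The idea is simply to visit indices 0 through k-1 but XOR each index with a per-hash mask, so a different mask yields a completely different key order:

use strict;
use warnings;

my @buckets = ( ['a'], ['b', 'c'], [], ['d'] );   # 4 buckets, so k = 4
my $mask    = int rand scalar @buckets;           # per-hash mask, perturbed on insert

for my $i (0 .. $#buckets) {
    my $slot = $i ^ $mask;                        # masked visit order
    print "bucket $slot: @{ $buckets[$slot] }\n";
}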

A third vulnerability is related to the case where two keys are to be stored in the same bucket. In this case the order of the keys was predictable: the most recently added key would be “first” out during a keys() or each() traversal. This in itself allows a small amount of data to leak to a potential adversary. By identifying such a case one could find two (or more) strings which had the same least significant bits. By stuffing more keys into the hash and triggering a hash split an attacker could determine that the newly added bit of the hash value was different, or the same, for the two keys. Without the key-order randomization logic mentioned previously the attacker could also determine which of the two had a 1 or 0 in the most significant bit of the used part of the hash value.

While we were not yet able to construct an actual attack based on this information we decided to harden against it anyway. This is done by randomly choosing whether we should insert the colliding key at the top of a bucket chain or if we should insert at the second from top in the chain. Similarly during a hash split we also make such a decision when keys collide while being remapped into the new buckets. The end result is that the order of two keys colliding in a bucket is more or less random, although the order of more than two keys is not.
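
A minimal sketch of that hardening, with a plain Perl array standing in for the bucket chain: a colliding key goes either at the head of the chain or just below it, chosen at random.

use strict;
use warnings;

sub insert_into_chain {
    my ($chain, $key) = @_;
    my $pos = @$chain ? int rand 2 : 0;   # head (0) or second from the top (1)
    splice @$chain, $pos, 0, $key;
}

my @chain;
insert_into_chain(\@chain, $_) for qw(red green blue);
print "@chain\n";   # the relative order of the colliding keys is unpredictable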

People complain. Randomization is good anyway!
We introduced randomization into Perl’s hash function in order to harden it against attack. But we have discovered that this had other positive consequences that we did not foresee.

The first of these initially appeared to be a downside. Perl’s hash function had behaved consistently for more than a decade, and over that time Perl developers inadvertently created dependencies on its key order. Most of the examples of this were found in the test files of CPAN modules: many of us got lazy and “froze” a key order into the test code, for example by embedding the output of a Data::Dumper call into one’s tests. Some of these, however, were real bugs in reputable and well tested modules.
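
A test of the following shape (the %config data here is invented for illustration) passes for years under a fixed key order and then fails intermittently under 5.18; asking Data::Dumper for a deterministic key order fixes it:

use strict;
use warnings;
use Data::Dumper;

my %config = ( host => 'example.com', port => 80 );

# Fragile: the key order inside the dumped string is not guaranteed.
my $got = Dumper(\%config);

# Robust: sort the keys so the output is stable across runs.
$Data::Dumper::Sortkeys = 1;
my $stable = Dumper(\%config);
print $stable;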

By making Perl’s key order random these dependencies on key order became instantly visible and a number of bugs that probably manifested themselves as “heisenbugs” became regular occurrences, and much easier to track down and identify. I estimate that for every two “non-bugs” (in things like test code) that we found, there was one “real bug” that was identified as well. Considering one of these was in the Perl core, and the other was in DBI, I personally consider this to be a good result.

Many people object that randomization like this makes debugging harder. The premise is that it becomes difficult to recreate a bug and thus debug it. I believe that in practice it is the opposite. Randomization like this means that a formerly rare bug becomes common, which in turn makes it much more obvious that it is related to subtle dependencies on key order, and effectively makes such problems much easier to find.
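
And when a key-order-dependent failure does need to be replayed exactly, Perl 5.18 documents the PERL_HASH_SEED and PERL_PERTURB_KEYS environment variables in perlrun; pinning them gives a deterministic key order. The seed value and the script name below are arbitrary examples:

# Run the same script twice with the hash seed pinned and key-order
# perturbation disabled; both runs then see the same key order.
PERL_HASH_SEED=0x12345678 PERL_PERTURB_KEYS=0 perl my_failing_test.t
PERL_HASH_SEED=0x12345678 PERL_PERTURB_KEYS=0 perl my_failing_test.t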

A last benefit of randomizing the hash function is that we can now, at any time, change or replace the hash function that Perl is built with. In fact 5.18 is bundled with multiple hash functions including the presumed cryptographically strong Siphash. Since hash order from now on will be random, we don’t have to worry if we change the function. External code should already be robust to the hash order being unpredictable.

How tangible are attacks on Perl’s hash function?
There has been a lot of discussion on this subject. Obviously erring on the side of caution on security matters is the right course of action. Nevertheless there is a lot of debate on how practical an attack like this is in real production environments where Perl is commonly used, such as web servers. Here are some of the points to keep in mind:

Perl’s hash algorithm uses an array of buckets whose size is always between the number of keys stored in it and a factor of two larger. This means that a hash with 20 keys in it will generally have 32 buckets, a hash with 32 keys will be split into 64 buckets, and so on. This means the more keys are inserted in a hash the less likely further keys will be put into the same bucket. So attacking a hash cannot make non-attack keys slower[2]. An attacker basically only slows their own fetches down, and except as a by-product of resource consumption they will not affect other requests.
For an attack to reach DOS proportions the number of items inserted into the hash would have to be very, very large. On modern CPUs a linked list of thousands to hundreds of thousands of keys would be necessary before there was serious degradation of service. At this point even if the attack was unsuccessful in terms of degrading Perl’s hash algorithm, it would still function effectively as a data-flooding denial of service attack. Therefore, focusing on the hash complexity aspect of the attack seems unwarranted.
Rudimentary out-of-band measures are sufficient to mitigate an attack. Hard restrictions on the number of keys that may be accepted by publicly facing processes are sufficient to prevent an attack from causing any damage. For instance, Apache defaults to restricting the number of parameters it accepts to 512, which effectively hardens it against this type of attack. (This is one reason the attack on the rehash mechanism is so important: a “successful” attack requires relatively few keys.) Similarly, a well designed application would validate the parameters it receives and not put them in a hash unless they were recognized (a sketch of such a cap follows this list).
So long as the hash function chosen is not vulnerable to multi-collision attacks, simple per-process hash seed randomization makes the job of finding an attack key set prohibitively difficult. One must first perform a hash-seed discovery attack, then generate a large set of keys. If restrictions on the number of keys the process will accept are in place, then the keys must be very large before collisions would have a noticeable effect. This also makes the job of finding colliding keys all the more expensive.
In many circumstances, such as a web-service provider, the hosts will be behind load balancers. This either means that every web host uses a different hash seed, making hash seed discovery attacks very difficult, or it requires an attacker to open a very long-running, persistent session with the server they wish to attack, which should be easily preventable via normal monitoring procedures.
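
As a rough sketch of the key-count cap mentioned above (the limit, the function name and the flat name/value list are assumptions of this example, not any particular framework’s API):

use strict;
use warnings;

use constant MAX_PARAMS => 512;   # cap modeled on the Apache default mentioned above

sub params_to_hash {
    my @pairs = @_;               # flat list of name/value pairs from the request
    die "too many parameters\n" if @pairs / 2 > MAX_PARAMS;
    my %params = @pairs;          # only now do the keys reach a hash
    return \%params;
}

my $params = params_to_hash( name => 'value', page => 2 );
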
For all these reasons it appears that hash-complexity attacks in the context of Perl and web hosting environments are of limited interest so long as:

The hash function does not allow multi-collision attacks.
The hash function uses at least per-process hash seed randomization.
The interface to untrusted potential attackers uses simple, hard limits on the number of keys it will accept.
These properties are relatively easy to accomplish without resorting to cryptographically strong hash functions (which are generally slow), or other complicated measures to prevent attacks. As the case of Perl’s rehashing flaw has shown, the cure may be worse than the disease. The code that this CVE exploits for its attack was added to Perl as part of our attempt to defend against the hypothetical attack vector of excessive bucket collisions. We are at this time unaware of any real attack of this form.

Hash Functions For Dynamic Languages
It seems like the bulk of research on hash functions has focused on finding fast, cryptographically secure hash functions for long strings containing binary data. However, dynamic languages like Perl make heavy use of hash functions on small strings, which are often restricted to simple alphanumeric characters: email addresses, identifiers like variable and method names, single-character keys, lists of numeric ids, and similar use cases. Not only are these relatively short strings, they also use restricted sets of bits in their bytes.

So far it appears that Bob Jenkins’ One-At-A-Time-Hash with minor modifications provides an acceptable solution. It seems to have good distribution properties, is reasonably fast for short strings, and – with the hardening measures added in Perl 5.18 – it appears to be robust against attack. Analysis done by the Perl5 Security Team suggests that One-At-A-Time-Hash is intrinsically more secure than MurmurHash. However to my knowledge, there is no peer-reviewed cryptanalysis to prove it.

There seems to be very little research into fast, robust, hash algorithms which are suitable for dynamic languages. Siphash is a notable exception and a step forward, but within the Perl community there seems to be consensus that it is currently too slow, at least in the recommended Siphash-2-4 incarnation. It is also problematic that its current implementation only supports 64 bit architectures. (No doubt this will improve over time, or perhaps even already has.)

Universal hashing also seems promising in theory, but unproven in practice for cases like Perl, where the size of the hash table can be very small, and where the input strings are of variable size, and often selected from restricted character sets. I attempted to implement a number of Universal Hashing-style hash functions for Perl, and was disappointed by extremely poor distributions in the bucket hash for simple tasks like hashing a randomly selected set of integers in text form. This may have been a flaw in my implementation, but it appeared that Universal Hashing does not perform particularly well when the input bytes are not evenly distributed. At the very least further work is required to prove the utility of Universal Hashing in a dynamic language context.

The dynamic/scripting language community needs the academic computing community to provide a better tool box of peer reviewed string hash functions which offer speed, good distribution over restricted character sets and on short strings, and that are sufficiently hardened that in practical deployments they are robust to direct attack. Security is important, but theoretical attacks which require large volumes of key/response exchanges cannot trump requirements such as good distribution properties and acceptable performance characteristics. Perl now makes it relatively easy to add and test new hash functions (see hv_func.h), and would make a nice test bed for those interested in this area of research.

Afterwards
I would like to thank Nicholas Clark, Ruslan Zakirov, and Jarkko Hietaniemi for their contributions which led to the changes in Perl 5.18. Nicholas did a lot of deep analysis and provided the motivation to create my attack proof, Ruslan provided analysis and working key-discovery attack scripts against Perl’s old hash function, and Jarkko motivated me to look into the general subject.

[1] See also: The bug filed against Perl and the original research.

[2] If anything, such an attack might make access to keys that aren’t part of the attack faster: The attack keys cluster in one bucket. That means the other keys are much more likely to spread out among the remaining buckets that now have fewer keys on average than without the attack.

All antidepressants are more effective than placebo at treating acute depression in adults, concludes study

A meta-analysis of 522 trials led by Dr Andrea Cipriani includes the largest amount of unpublished data to date, and finds that antidepressants are more effective than placebo for the short-term treatment of acute depression in adults.

A major study comparing 21 commonly used antidepressants concludes that all are more effective than placebo for the short-term treatment of acute depression in adults, with effectiveness ranging from small to moderate for different drugs.

The international study, published in The Lancet, is a network meta-analysis of 522 double-blind, randomised controlled trials comprising a total of 116,477 participants. The study includes the largest amount of unpublished data to date, and all the data from the study have been made freely available online.

Our study brings together the best available evidence to inform and guide doctors and patients in their treatment decisions. We found that the most commonly used antidepressants are more effective than placebo, with some more effective than others. Our findings are relevant for adults experiencing a first or second episode of depression – the typical population seen in general practice. Antidepressants can be an effective tool to treat major depression, but this does not necessarily mean that antidepressants should always be the first line of treatment. Medication should always be considered alongside other options, such as psychological therapies, where these are available. Patients should be aware of the potential benefits from antidepressants and always speak to the doctors about the most suitable treatment for them individually.
– Dr Andrea Cipriani, University of Oxford Department of Psychiatry
An estimated 350 million people have depression worldwide. The economic burden in the USA alone has been estimated to be more than US$210 billion. Pharmacological and non-pharmacological treatments are available but, because of inadequate resources, antidepressants are used more frequently than psychological interventions. However, there is considerable debate about their effectiveness.

As part of the study, the authors identified all double-blind, randomised controlled trials (RCTs) comparing antidepressants with placebo, or with another antidepressant (head-to-head trials), for the acute treatment (over 8 weeks) of major depression in adults aged 18 years or more. The authors then contacted pharmaceutical companies, original study authors, and regulatory agencies to supplement incomplete reports of the original papers, or provide data for unpublished studies.

The primary outcomes were efficacy (number of patients who responded to treatment, i.e. who had a reduction in depressive symptoms of 50% or more on a validated rating scale over 8 weeks) and acceptability (proportion of patients who withdrew from the study for any reason by week 8).

Overall, 522 double-blind RCTs done between 1979 and 2016 comparing 21 commonly used antidepressants or placebo were included in the meta-analysis, the largest ever in psychiatry. A total of 87,052 participants had been randomly assigned to receive a drug, and 29,425 to receive placebo. The majority of patients had moderate-to-severe depression.

All 21 antidepressants were more effective than placebo, and only one drug (clomipramine) was less acceptable than placebo.

Some antidepressants were more effective than others, with agomelatine, amitriptyline, escitalopram, mirtazapine, paroxetine, venlafaxine, and vortioxetine proving most effective, and fluoxetine, fluvoxamine, reboxetine, and trazodone being the least effective. The majority of the most effective antidepressants are now off patent and available in generic form.

Antidepressants also differed in terms of acceptability, with agomelatine, citalopram, escitalopram, fluoxetine, sertraline, and vortioxetine proving most tolerable, and amitriptyline, clomipramine, duloxetine, fluvoxamine, reboxetine, trazodone, and venlafaxine being the least tolerable.

The authors note that the data included in the meta-analysis cover 8 weeks of treatment, so may not necessarily apply to longer-term antidepressant use. The differences in efficacy and acceptability between different antidepressants were smaller when data from placebo-controlled trials were also considered.

In order to ensure that the trials included in the meta-analysis were comparable, the authors excluded studies with patients who also had bipolar depression, symptoms of psychosis or treatment resistant depression, meaning that the findings may not apply to these patients. “Antidepressants are effective drugs, but, unfortunately, we know that about one third of patients with depression will not respond. With effectiveness ranging from small to moderate for available antidepressants, it’s clear there is still a need to improve treatments further,” adds Dr Cipriani.

409 (78%) of 522 trials were funded by pharmaceutical companies, and the authors retrieved unpublished information for 274 (52%) of the trials included in the meta-analysis. Overall, 46 (9%) trials were rated as high risk of bias, 380 (73%) as moderate, and 96 (18%) as low. The design of the network meta-analysis and the inclusion of unpublished data are intended to reduce the impact of individual study bias as much as possible. Although this study included a significant amount of unpublished data, a certain amount could still not be retrieved.

Antidepressants are routinely used worldwide yet there remains considerable debate about their effectiveness and tolerability. By bringing together published and unpublished data from over 500 double blind randomised controlled trials, this study represents the best currently available evidence base to guide the choice of pharmacological treatment for adults with acute depression. The large amount of data allowed more conclusive inferences and gave the opportunity also to explore potential biases.
– Professor John Ioannidis, from the Departments of Medicine, Health Research and Policy, Biomedical Data Science, and Statistics, Stanford University, USA
The authors note that they did not have access to individual-level data so were only able to analyse group differences. For instance, they could not look at the effectiveness or acceptability of antidepressants in relation to age, sex, severity of symptoms, duration of illness or other individual-level characteristics.

The findings from this study contrast with a similar analysis in children and adolescents, which concluded that fluoxetine was probably the only antidepressant that might reduce depressive symptoms. The authors note that the difference may be because depression in young people is the result of different mechanisms or causes, and note that because of the smaller number of studies in young people there is great uncertainty around the risks and benefits of using any antidepressants for the treatment of depression in children and adolescents.

Read the study in The Lancet.

Read more about Dr Andrea Cipriani.

The Legend of Cliff Young: The 61 Year Old Farmer Who Won the World’s Toughest Race

The legendary story of Cliff Young is already known to many runners. If you aren’t familiar with it, you’re in for a fascinating read.

An Unlikely Competitor
Cliff Young winning the Sydney to Melbourne race

Every year, Australia hosts a 543.7-mile (875-kilometer) endurance race from Sydney to Melbourne. It is considered among the world’s most grueling ultra-marathons. The race takes five days to complete and is normally only attempted by world-class athletes who train specially for the event. These athletes are typically less than 30 years old and backed by large companies such as Nike.

In 1983, a man named Cliff Young showed up at the start of this race. Cliff was 61 years old and wore overalls and work boots. To everyone’s shock, Cliff wasn’t a spectator. He picked up his race number and joined the other runners.

The press and other athletes became curious and questioned Cliff. They told him, “You’re crazy, there’s no way you can finish this race.” To which he replied, “Yes I can. See, I grew up on a farm where we couldn’t afford horses or tractors, and the whole time I was growing up, whenever the storms would roll in, I’d have to go out and round up the sheep. We had 2,000 sheep on 2,000 acres. Sometimes I would have to run those sheep for two or three days. It took a long time, but I’d always catch them. I believe I can run this race.”

When the race started, the pros quickly left Cliff behind. The crowds and television audience were entertained because Cliff didn’t even run properly; he appeared to shuffle. Many even feared for the old farmer’s safety.

The Tortoise and the Hare
Cliff Young waving during the ultra-marathon

All of the professional athletes knew that it took about 5 days to finish the race. In order to compete, one had to run about 18 hours a day and sleep the remaining 6 hours. The thing is, Cliff Young didn’t know that!

When the morning of the second day came, everyone was in for another surprise. Not only was Cliff still in the race, he had continued jogging all night.

Eventually Cliff was asked about his tactics for the rest of the race. To everyone’s disbelief, he claimed he would run straight through to the finish without sleeping.

Cliff kept running. Each night he came a little closer to the leading pack. By the final night, he had surpassed all of the young, world-class athletes. He was the first competitor to cross the finish line and he set a new course record.

When Cliff was awarded the winning prize of $10,000, he said he didn’t know there was a prize and insisted that he did not enter for the money. He ended up giving all of his winnings to several other runners, an act that endeared him to all of Australia.

Continued Inspiration
In the following year, Cliff entered the same race and took 7th place. Not even a displaced hip during the race stopped him.

Cliff came to prominence again in 1997, aged 76, when he attempted to raise money for homeless children by running around Australia’s border. He completed 6,520 kilometers of the 16,000-kilometer run before he had to pull out because his only crew member became ill. Cliff Young passed away in 2003 at age 81.

Today, the “Young-shuffle” has been adopted by ultra-marathon runners because it is considered more energy-efficient. At least three champions of the Sydney to Melbourne race have used the shuffle to win the race. Furthermore, during the Sydney to Melbourne race, modern competitors do not sleep. Winning the race requires runners to go all night as well as all day, just like Cliff Young.

The Beauty of the COBOL Programming Language

Well-written code is a work of art. Always has been, always will be. A programmer pulls a thought pretty much out of nowhere and transforms it into a working idea that can be used by others. It’s abstract expression made real. Computer programming requires a depth of creativity and discipline of logic that is hard to find elsewhere. Maybe architecture and the theoretical sciences come close, but computer programming stands apart. Computer programming is special and I love it!

Thus, I am always interested to learn a new programming language. Every programming language offers a new way to expand my thinking. It’s always time well-spent.

Recently I decided to learn COBOL, for no other reason than the fact that there are a lot of mainframe installations out there and it’s the mainstay language for many of them. Mainframes are critical to the operations of many banks, insurance companies, transportation systems and governmental agencies. Learning COBOL has been on my bucket list for a while. So I took the plunge.

What does the term COBOL stand for? Common Business-Oriented Language
Working with COBOL in a Modern IDE
The first thing I needed in my journey to learn COBOL was an IDE. I am a big supporter of coding in an integrated development environment (IDE). I like being able to write, test and run code all in one place. Also, I find the support features that an IDE provides, such as visual code structure analysis, code completion and inline syntax checking, allow me to program and debug efficiently.

The IDE I found is an open source product, OpenCobolIDE, as shown below in Figure 1.

Figure 1: OpenCobolIDE provides many tools to make COBOL programming easier

OpenCobolIDE allows me to write, compile and run code all in one place, without having to go out to the command line, which is very good because I am the world’s worst typist!

Programming COBOL in the Real World
When it comes time to do real-world, mainframe-based COBOL programming, you will do well to look at the IDEs that IBM provides. The tools are designed to work seamlessly against its Z Systems environment as a set of Eclipse plugins. These tools allow COBOL developers working in the mainframe environment to code, debug, unit test and do problem determination.

Having set up my development environment, the next thing to do was design my first project.

Designing My First COBOL Program
Typically when learning a new language I like to start small and work my way up to more complex tasks. I did the typical Hello World program: you write the code to display the string, Hello World, to standard output. Every developer does one. It’s a trivial program, as shown below in Listing 1.

Listing 1: A Hello World program written in COBOL

While the Hello World program required that I learn about the concept of a DIVISION, which I will discuss in detail below, in the scheme of things I needed more. I wanted to write a program that would force me to learn the following:

How to create and use variables.
How to structure data into a hierarchy.
How to structure code into encapsulated procedures.
How to do some basic arithmetic.
How to accept user input and then do some manipulation around that input.

Thus, I created the program RESEL-WORLD, as shown in Listing 2 below.

Listing 2: The COBOL program RESEL-WORLD accepts and manipulates user input

The user input and program output are shown below in Figure 2.

Figure 2: The RESEL-WORLD COBOL program asks a user for first name, last name and age, then manipulates the input

The RESEL-WORLD program taught me a lot about COBOL and how to program in the language. In the spirit of giving, I am going to share what I learned from writing the program. However, before I delve into the details, please be advised that the information I’m presenting is but a high-level overview. As with any programming language, it takes about a year of consistent coding to become an entry-level professional. My sincere hope is that the information I provide gives the reader a good sense of the language and enough motivation to want to learn more. As I mentioned earlier, learning COBOL is worth the effort.

COBOL is Code in a Structured Document
The most important thing to understand when learning COBOL is that it is very strict in terms of code layout. The layout rules relate to the use of columns and characters. Also, the format uses a hierarchical outline structure. The following sections describe the details of the layout specification.

The COBOL Column Specification
In COBOL a line of code can be no longer than 80 columns in length. You can also think of a column as a character. Columns are segmented into groups, with each group serving a particular purpose. The column groups are as follows:

Columns 1-6 are the group in which a programmer defines a sequence number. A sequence number is similar to a line number.

Column 7 is reserved for special characters. An asterisk (*) starts a comment line, a hyphen (-) indicates line continuation and a slash (/) is a form feed.

Columns 8-11 are called Area A. Area A is the group of columns in which you start a DIVISION, PARAGRAPH or SECTION. We’ll talk more about these in the next section, The COBOL Structural Hierarchy.

Columns 12-72, also known as Area B, are where you write code statements.

Columns 73-80 are reserved for developer use. You can write poetry in there, if you so desire. Just make sure you don’t go past column 80.

Figure 3, below, illustrates the column grouping described above.

Figure 3: Each column in a COBOL file is specified to serve a specific purpose

COBOL is a compiled language. When it’s time to run your code, the compiler will check to make sure that the code layout adheres to the column grouping specification. If there is a violation, the compiler will report an error.

Understanding Performance in Terms of Compilation and Compilers
COBOL is a compiled language, as are others such as Java, C# and C++. Compilation is the process of taking textual source code and converting it into a binary format that the computer can understand.

There’s a lot of variety in the COBOL world when it comes to compilers, particularly when it comes to cost and performance efficiency. For example, how a compiler orchestrates the way numbers are loaded and then computed in memory matters a lot! A good compiler will make code run really fast.

IBM has been in the COBOL business for a long time and has a keen understanding about making compilers that are fast and cost-effective. For example, the z14 System compilers improve efficiency by taking advantage of almost 24 new low-level instructions. As a result, computation speeds improve dramatically.

As you work more with COBOL, you’ll come to appreciate the power that an industry-leading compiler such as the IBM Z Systems compiler brings to the coding experience.

The COBOL Structural Hierarchy
The concept behind the file format of a COBOL program is based on the structure of a document outline, with a single top-level heading followed by subordinate levels. The organizational units that make up the hierarchy are PROGRAM, DIVISION, SECTION, PARAGRAPH, SENTENCE, STATEMENT and CHARACTER. Figure 4 illustrates the hierarchy.

Figure 4: The structural hierarchy of COBOL program entities

PROGRAM is the root level of the COBOL code hierarchy. PROGRAM represents the unit of code that the mainframe job scheduler, the JCL, loads into memory to run. The program is identified by the PROGRAM-ID statement in the IDENTIFICATION DIVISION. The IDENTIFICATION DIVISION is part of the next level of hierarchy that descends from PROGRAM. PROGRAM must contain the IDENTIFICATION DIVISION.

There are other DIVISIONs that can be included. These other DIVISIONs are: ENVIRONMENT DIVISION, DATA DIVISION and PROCEDURE DIVISION, which occur after IDENTIFICATION. Also, these subsequent DIVISIONs must appear in the order defined: ENVIRONMENT, DATA and then PROCEDURE. You can read a detailed description of each DIVISION here. DIVISION names are terminated with a period, for example,

DATA DIVISION.

The next organizational level down from DIVISION is SECTION. (Please see Figure 5.) Each DIVISION will contain SECTIONs that are special to it. For example, the DATA DIVISION can contain a FILE SECTION, a WORKING-STORAGE SECTION and/or a LINKAGE SECTION. You can read the details about DIVISIONs and SECTIONs here.

Figure 5: A COBOL program is structured in a hierarchical manner

A SECTION name is terminated with a period, for example:

WORKING-STORAGE SECTION.

A SECTION contains zero or many PARAGRAPHs. (Typically a SECTION will have at least one PARAGRAPH.) A PARAGRAPH name is terminated with a period, similar to a DIVISION and a SECTION.

A PARAGRAPH contains SENTENCEs or STATEMENTs. A SENTENCE is a group of STATEMENTs. Usually a PARAGRAPH contains one or more STATEMENTs. A STATEMENT is a line of execution. A STATEMENT is made up of CHARACTERs. A CHARACTER can be an alphanumeric symbol or a special character. CHARACTER is at the bottom of the COBOL code format hierarchy.

Variable and Data Types
COBOL allows you to declare variables in a variety of data types. Special to COBOL is the concept of a variable level as represented by a level number. A level number defines a variable in terms of being or having a parent variable. (You’ll see more about this in the section, COBOL Supports Hierarchical Data, later on.)

The following statement, which will be declared in the WORKING-STORAGE SECTION of the DATA DIVISION, is the declaration of a variable, WS-QUANTITY. WS-QUANTITY will hold a numeric value.

01 WS-QUANTITY PIC 9(2) VALUE 12

The expression above declares a variable, WS-QUANTITY, to be a two-digit value with an initial value of 12. Also, the variable is declared to be at level 01, which is the highest level possible. The variable has no parent. The way we know that WS-QUANTITY is a variable for a two-digit value is due to the PIC (Picture) clause. You can think of a PIC clause as a type declaration. Table 1 provides a high-level description of the various PIC clause symbols.

Symbol | Description                                               | Example Declaration | Sample Value
9      | A numeric value; each occurrence of 9 represents a digit | 99 or 9(2)          | 35
a      | Alphabetic                                                | aaa or a(3)         | “Bob”
x      | Alphanumeric                                              | xxxx or x(4)        | “R2D2?”
v      | Implicit decimal*                                         | v(3)                | .175
s      | Sign                                                      | s9(2)               | -76
p      | Assumed decimal*                                          | p9                  | .6
Table 1: Symbols for a PIC Clause

*COBOL has a special way of storing numbers such that the numeric conversion takes place when the characters are loaded into memory. For example, a variable with a PIC of p9(2) will be stored as 56 and loaded into memory as .56. For a more detailed discussion of usage and the PIC clause, go here.

COBOL Supports Hierarchical Data
Storing data in a named group of variables makes programming easier. For example, in the C programming language, we can use a struct to name a group of variables, as shown below in Listing 3.

struct User {
    char first_name[10];
    char last_name[10];
    int user_id;
};
Listing 3: A simple structure in C

One of the really nice things about COBOL is that it allows you to organize data according to a named group of values, similar to a C struct. What’s interesting in a historical sense is that COBOL had this capability well before Kernighan and Ritchie created the C programming language.

The way COBOL accomplishes named data grouping is with a construct called a record. Figure 6 below shows a graphical representation of a record named WS-USER.

Figure 6: The record WS-USER contains the subordinate variables WS-FIRST-NAME, WS-LAST-NAME and WS-AGE

The code snippet in Listing 4, which is from the RESEL-WORLD program at the start of this article, shows the COBOL code that declares the record, WS-USER.

Listing 4: The record WS-USER contains 3 subordinate elements: WS-FIRST-NAME, WS-LAST-NAME and WS-AGE

Notice in Listing 4 above that there are two levels of data variables in play: 01 WS-USER and 05 WS-FIRST-NAME, 05 WS-LAST-NAME, 05 WS-AGE. The variable WS-USER is a parent to the subordinate, level 05 variables WS-FIRST-NAME, WS-LAST-NAME, and WS-AGE.

Assigning variables to a numeric level is special to COBOL. The declaration logic is that a record variable (the root of the record) has a level number of 01. Levels 02 to 49 are used to declare subordinate elements. (You can think of an element as a member of the record.)

You use the reserved word OF to access an element of a record. Listing 5 below shows two statements. The first asks a user to input their first name. The second statement takes the value entered and assigns it to the element, WS-FIRST-NAME of the record, WS-USER.

DISPLAY "What is your first name?".
ACCEPT WS-FIRST-NAME OF WS-USER.
Listing 5: Accessing an element of a record in COBOL

It makes sense that records were part of COBOL from the start. COBOL was intended to be a language used for business applications. Businesses have been organizing values according to named groups since before the customer form was created. COBOL was designed to reflect business needs. It’s interesting that the fundamental needs of businesses in terms of logical data structures have been surprisingly consistent over time.

COBOL Has a Natural Language Syntax
Expressing a statement in COBOL is very similar to the way a person speaks naturally. Listing 6 below shows the statement that adds together the values of the two variables, WS-AGE-DELTA and WS-AGE (from the record WS-USER), and stores the result in the variable, WS-NEW-AGE.

ADD WS-AGE-DELTA WS-AGE OF WS-USER TO WS-NEW-AGE.
Listing 6: Adding two integers, WS-AGE-DELTA and WS-AGE, then storing the sum in the variable WS-NEW-AGE

Notice that the statement in Listing 6 is very close to saying, “Add WS-AGE-DELTA and WS-AGE from WS-USER to WS-NEW-AGE.”

What I find pretty astounding is that many modern programming paradigms try to capture the ease of natural language expression that COBOL has had for years. What comes to mind immediately is the use of natural language expressions in the NodeJS Chai package. Chai is used in NodeJS unit testing to express BDD assertions. Here’s a Chai expression that checks a variable, myVar to assert that it’s a string.

expect(myVar).to.be.a('string');
FAST FACT: Is COBOL a case-sensitive language?
No. COBOL will consider a variable named WS-FIRST-NAME to be the same as one named ws-first-name.

COBOL supports DRY
For a programming language to be really useful, it needs to support DRY. DRY is an acronym for Don’t Repeat Yourself. Most programmers don’t want to write the same code over and over again, nor should they. What you want to do is encapsulate code into a single area of execution that can be called repeatedly. Being able to program by DRY is common to most languages: BASIC, Java, C#, JavaScript and Python, to name a few, and … you guessed it: COBOL.

COBOL supports code segmentation and reuse in a variety of ways. One way is through linkage, calling routines between programs. Another way is to segment code inline under a PARAGRAPH in the PROCEDURE DIVISION and then call that PARAGRAPH. (See Figure 7.)

Figure 7: You can make code reusable in COBOL PARAGRAPHs

The little RESEL-WORLD program I wrote uses the inline PARAGRAPH technique. Before I go into the details of calling code encapsulated in a PARAGRAPH, it’s important that you know that in COBOL the entry point of execution of a program is the first line of code after the declaration of the PROCEDURE DIVISION, as shown below in Listing 7.

Listing 7: COBOL supports reusable code by using PARAGRAPHs to segment execution

Notice the PERFORM statements right after PROCEDURE DIVISION. These four statements are calling PARAGRAPHs. (For those of you familiar with JavaScript, you can think of each PARAGRAPH as a function definition.) Essentially what the code is saying is, “Execute the code at the paragraph named, GET-DATA, then perform the code at CALC-DATA, SHOW-DATA and then FINISH-UP.”

Encapsulating code into STATEMENTs within a PARAGRAPH prevents code from becoming spaghetti that is hard to maintain. Also, the encapsulation enforces the sensibility of DRY. Any language worth its salt needs to support some sort of encapsulation. COBOL meets the need with room to spare.

An Expressive Language for Now and the Future
The more I learn about COBOL, the more I like it. The language continues to evolve to meet the needs of our fast-changing times, with revisions as recent as 2014. Since its inception there have been a dozen enhancements to COBOL including a continuing stream of formal standards.

Today’s COBOL supports modern programming paradigms such as object orientation. The IDEs have grown to keep pace with the demands of modern users. And, given the immense installation base out there, there is a lot of money to be made doing COBOL programming. To quote Wikipedia:

In 2006 and 2012, Computerworld surveys found that over 60% of organizations used COBOL (more than C++ and Visual Basic .NET) and that for half of those, COBOL was used for the majority of their internal software.

What’s even more amazing to me is the clever engineering the language has promoted. COBOL developers addressed problems in the past that still vex many today. We modern developers can fall prey to thinking that before we came along, there was no cool stuff. It’s like the young, aspiring guitarist who thinks he has his chops down and can shred to the top of the heap. Then one day his music teacher gives him a 1936 recording of Django Reinhardt playing guitar, and playing with only two functional fingers on his left hand! At that point the young artist realizes that virtuosity transcends era and that creativity has no bounds. This is what it’s like for me as I learn COBOL. It’s a beautiful, expressive language that was cool then and is very cool now. Learning it is making me appreciate how much amazing thinking went on back then and continues to emerge. There’s a lot of rich opportunity at hand to make great code for mainframes using COBOL. It’s only a matter of mastery, creativity and discovery.

Special Thanks
If I have seen further it is by standing on the shoulders of giants.

—Isaac Newton, Letter to Robert Hooke, February 5, 1675

Learning any programming language is hard work. Years ago, when I wanted to learn a new language, I would get a few books on the topic and then hunker down to spend the hours necessary to absorb the required knowledge. If I had the time, I might take a course. Maybe I had an expert friend I could call up when I hit a wall or needed real-world guidance.

We’ve come a long way since that time. The internet is a game-changer. Online tutorials, videos and interactive, digital books make things a whole lot easier than in earlier times. My experience learning COBOL testifies to the value of these modern benefits. But still, as I’ve been learning, I’ve hit more than one wall that required professional guidance and review. Allan Kielstra and Roland Koo at IBM provided the expertise that made my learning COBOL not only less difficult, but also fun. I am in their debt. Their enthusiasm for COBOL was infectious. Their commitment to technical excellence is inspiring.

Comparison of the usage of Apache vs. Nginx vs. Hiawatha for websites

This report shows the usage statistics of Apache vs. Nginx vs. Hiawatha as web server on the web. See technologies overview for explanations on the methodologies used in the surveys. Our reports are updated daily.

Usage
This diagram shows the percentages of websites using the selected technologies.

How to read the diagram:
Apache is used by 47.3% of all the websites whose web server we know.

Apache: 47.3%
Nginx: 37.0%
Hiawatha: less than 0.1%
W3Techs.com, 21 February 2018
Percentages of websites using various web servers

Usage broken down by ranking
This diagram shows the percentages of websites using the selected technologies broken down by ranking.

How to read the diagram:
Apache is used by 47.3% of all the websites whose web server we know.
Apache is used by 38.8% of all the websites whose web server we know and that rank in the top 1,000,000.

           Overall   Top 1,000,000   Top 100,000   Top 10,000   Top 1,000
Apache     47.3%     38.8%           27.3%         21.8%        16.8%
Nginx      37.0%     44.5%           55.7%         62.8%        57.0%
Hiawatha   0.0%      0.0%            0.0%          0.0%         0.0%

W3Techs.com, 21 February 2018
Percentages of websites using the selected web servers broken down by ranking

Historical trend
This diagram shows the historical trend in the percentage of websites using the selected technologies.
Our dedicated trend survey shows more web server usage and market share trends.

Historical trends in the usage of the selected technologies

Market position
This diagram shows the market position of the selected technologies in terms of popularity and traffic compared to the most popular web servers.
Our dedicated market survey shows more web servers market data.

Market position of the selected technologies

More details
You can find complete breakdown reports of web servers in our web servers market reports.