More Hash Function Tests
Posted on Aug 9, 2016 #code
In the previous post, I wrote about non-crypto hash functions and did some performance tests. Turns out, it’s great to write about stuff! People in the comments, on Twitter, and elsewhere pointed out more things I should test. So here’s a follow-up post.

What is not the focus
This is about algorithms for “hashing some amount of bytes”, for use either in hashtables or for checksum / uniqueness detection. Depending on your situation, there’s a whole family of algorithms for that, and I am focusing on only one: non-cryptographic fast hash functions.

This post is not about cryptographic hashes. Do not read below if you need to hash passwords, sensitive data going through untrusted medium and so on. Use SHA-1, SHA-2, BLAKE2 and friends.
Also, I’m not focusing on algorithms that are designed to prevent possible hashtable Denial-of-Service attacks. If something comes from the other side of the internet and ends up inserted into your hashtable, then to prevent possible worst-case O(N) hashtable behavior you’re probably better off using a hash function that does not have known “hash flooding” attacks. SipHash seems to be popular now.
If you are hashing very small amounts of data of known size (e.g. single integers or two floats or whatever), you should probably use specialized hashing algorithms for those. Here are some integer hash functions, or 2D hashing with Weyl, or perhaps you could take some other algorithm and just specialize its code for your known input size (e.g. xxHash for a single integer).
I am testing 32 and 64 bit hash functions here. If you need larger hashes, quite likely some of these functions might be suitable (e.g. SpookyV2 always produces 128 bit hash).
When testing hash functions, I have not gone to great lengths to get them compiling properly or setting up all the magic flags on all my platforms. If some hash function works wonderfully when compiled on Linux Itanium box with an Intel compiler, that’s great for you, but if it performs poorly on the compilers I happen to use, I will not sing praises for it. Being in the games industry, I care about things like “what happens in Visual Studio”, and “what happens on iOS”, and “what happens on PS4”.
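To illustrate the “specialize for known input size” point above: for a single 64-bit integer you don’t need a general byte-stream hash at all, a few multiply/xor-shift rounds are enough. Here is MurmurHash3’s public-domain 64-bit finalizer (“fmix64”), which is often reused on its own as an integer mixer; the function name here is mine:

```c
#include <stdint.h>

/* MurmurHash3's 64-bit finalizer ("fmix64"): a bijective mixer,
   usable on its own as a hash for a single 64-bit integer.
   Constants are from the public domain MurmurHash3 reference code.
   Note that 0 maps to 0, which is harmless for hashtable use. */
static uint64_t hash_u64(uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}
```

Because every step is invertible, distinct inputs always produce distinct outputs.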

More hash function tests!
I’ve updated my hash testbed on github (use revision 9b59c91cf) to include more algorithms, tweaked the tests a little, and so on.

I checked both “hashing data that is aligned” (16-byte aligned address of data to hash), and unaligned data. Everywhere I tested, there wasn’t a notable performance difference that I could find (but then, I have not tested old ARM CPUs or PowerPC based ones). The only visible effect is that MurmurHash and SpookyHash don’t properly work in asm.js / Emscripten compilation targets, due to their usage of unaligned reads. I’d assume they probably don’t work on some ARM/PowerPC platforms too.

Hash functions tested below:

xxHash32 and xxHash64 – xxHash.
City32 and City64 – CityHash.
Mum – mum-hash.
Farm32 and Farm64 – FarmHash.
SpookyV2-64 – SpookyHash V2.
Murmur2A, Murmur3-32, Murmur3-X64-64 – MurmurHash family.
These are the main functions that are interesting. Because people kept on asking, and because “why not”, I’ve included a bunch of others too:

SipRef – SipHash-2-4 reference implementation. Like mentioned before, this one is supposedly good for avoiding hash flooding attacks.
MD5-32, SHA1-32, CRC32 – simple implementations of well-known hash functions (from the SMHasher test suite). Again, these are mostly not in the category I’m interested in, but they are included to illustrate the performance differences.
FNV-1a, FNV-1amod – FNV hash and modified version.
djb2, SDBM – both from this site.
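For reference, the three simplest functions in the list above, FNV-1a, djb2 and SDBM, are each only a few lines of C. A sketch from their well-known public definitions:

```c
#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a: xor in each byte, then multiply by the FNV prime. */
static uint32_t fnv1a(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* djb2: h = h*33 + c, starting from 5381. */
static uint32_t djb2(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 5381;
    for (size_t i = 0; i < len; i++)
        h = ((h << 5) + h) + p[i];
    return h;
}

/* SDBM: h = c + h*65599, expressed with shifts. */
static uint32_t sdbm(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = p[i] + (h << 6) + (h << 16) - h;
    return h;
}
```

These are trivially inline-able, which is a big part of why they do well on very short inputs (see the small-size table below).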
Performance Results
Windows / Mac
macOS results, compiled with Xcode 7.3 (clang-based) in Release 64 bit build. Late 2013 MacBookPro:

Windows results, compiled with Visual Studio 2015 in Release 64 bit build. Core i7-5820K:


Performance profiles of these are very similar.
xxHash64 wins at larger (0.5KB+) data sizes, followed closely by 64 bit FarmHash and CityHash.
At smaller data sizes (10-500 bytes), FarmHash and CityHash look very good!
mum-hash is much slower on Visual Studio. At first I thought the _MUM_UNALIGNED_ACCESS macro was not handling VS-specific defines (_M_AMD64 and _M_IX86) properly (see PR). Turns out it’s not that; the speed difference comes from _MUM_USE_INT128, which only kicks in on GCC-based compilers. mum-hash would be pretty competitive otherwise.
Windows 32 bit results, compiled with Visual Studio 2015 in Release 32 bit build. Core i7-5820K:

On a 32 bit platform / compilation target, things change quite a bit!

All 64 bit hash functions are slow. For example, xxHash64 drops from 13GB/s to 1GB/s. Use a 32 bit hash function when on a 32 bit target!
xxHash32 wins by a good amount.
Note on FarmHash: the whole idea behind it is that it uses CPU-specific optimizations (which also change the computed hash value). The graphs above use default compilation settings on both macOS and Windows. However, on macOS enabling SSE4.2 instructions in the Xcode settings makes it much faster at larger data sizes:

With SSE4.2 enabled, FarmHash64 handily wins against xxHash64 (17.9GB/s vs 12.8GB/s) for data that is larger than 1KB. However, that requires compiling with SSE4.2, which I can’t afford at my work. Enabling the same options on XboxOne makes it slower 😦 Enabling just FARMHASH_ASSUME_AESNI makes the 32 bit FarmHash faster on XboxOne, but does not affect the performance of the 64 bit hash. FarmHash also does not have any specific optimizations for ARM CPUs, so my verdict on it all is “not worth bothering”; after all, the impressive SSE4.2 speedup only applies to large data sizes.

iPhone SE (A9 CPU) results, compiled with Xcode 7.3 (clang-based) in Release 64 bit build:

xxHash never wins here; CityHash and FarmHash are fastest across the whole range.
iPhone SE 32 bit results:

This is similar to Windows 32 bit: 64 bit hash functions are slow, xxHash32 wins by a good amount.

Xbox One (AMD Jaguar 1.75GHz CPU) results, compiled with Visual Studio 2015 in Release mode:

Similar to iPhone results, xxHash is quite a bit slower than CityHash and FarmHash. xxHash uses 64 bit multiplications heavily, whereas others mostly do shifts and logic ops.
SpookyHash wins at larger data sizes.
JavaScript (asm.js via Emscripten) results, running on late 2013 MacBookPro.

asm.js is for all practical purposes a 32 bit compilation target; 64 bit integer operations are slow.
xxHash32 wins, followed by FarmHash32.
At smaller (<25 bytes) data sizes, simple hashes like FNV-1a, SDBM or djb2 are useful.
Throughput tables
Performance at large data sizes (~4KB), in GB/s:

Hash            Win64  Mac64  iPhoneSE64  XboxOne64  Win32  iPhoneSE32  asmjs-Firefox  asmjs-Chrome
xxHash64 13.2 12.8 5.7 1.5 1.1 1.5 0.3 0.1
City64 12.2 12.2 7.2 3.6 1.6 1.9 0.4 0.1
Mum 4.0 9.5 4.5 0.5 0.7 1.3 0.1 0.1
Farm64 12.3 11.9 8.2 3.3 2.4 2.2 0.4 0.1
SpookyV2-64 12.8 12.5 7.1 4.3 2.6 1.9 — —
xxHash32 6.8 6.6 4.0 1.5 6.7 3.7 2.5 1.4
Murmur3-X64-64 7.1 5.8 2.3 1.2 0.9 0.8 — —
Murmur2A 3.4 3.3 1.7 0.9 3.4 1.8 — —
Murmur3-32 3.1 2.7 1.3 0.5 3.1 1.3 — —
City32 5.1 4.9 2.6 0.9 4.0 2.6 1.1 0.8
Farm32 5.2 4.3 2.6 1.8 4.3 1.9 2.1 1.4
SipRef 1.4 1.6 1.0 0.4 0.3 0.4 0.1 0.0
CRC32 0.5 0.5 0.2 0.2 0.4 0.2 0.4 0.4
MD5-32 0.5 0.3 0.2 0.2 0.4 0.3 0.4 0.4
SHA1-32 0.5 0.5 0.3 0.1 0.4 0.2 0.3 0.2
FNV-1a 0.9 0.8 0.4 0.3 0.9 0.4 0.8 0.8
FNV-1amod 0.9 0.8 0.4 0.3 0.9 0.4 0.8 0.7
djb2 0.9 0.8 0.6 0.4 1.1 0.6 0.8 0.8
SDBM 0.9 0.8 0.4 0.3 0.8 0.4 0.8 0.8
Performance at medium size (128 byte) data, in GB/s:

Hash            Win64  Mac64  iPhoneSE64  XboxOne64  Win32  iPhoneSE32  asmjs-Firefox  asmjs-Chrome
xxHash64 6.6 6.2 2.8 0.9 0.7 0.7 0.2 0.1
City64 7.6 7.6 4.4 1.7 1.1 1.5 0.3 0.1
Mum 3.2 7.6 4.1 0.5 0.6 1.1 0.1 0.0
Farm64 6.6 5.7 3.4 1.4 0.9 1.1 0.3 0.1
SpookyV2-64 3.4 3.2 1.7 1.4 0.7 0.5 — —
xxHash32 5.1 5.3 3.4 1.3 5.1 2.8 2.2 1.4
Murmur3-X64-64 5.3 4.3 2.1 1.0 0.8 0.7 — —
Murmur2A 3.6 3.0 2.1 0.8 3.3 1.7 — —
Murmur3-32 3.1 2.3 1.3 0.4 2.8 1.3 — —
City32 4.0 3.6 2.0 0.7 3.3 1.9 1.0 0.8
Farm32 3.9 3.2 1.9 1.0 3.4 1.6 1.6 1.1
SipRef 1.2 1.3 0.8 0.4 0.3 0.3 0.1 0.0
CRC32 0.5 0.5 0.2 0.2 0.4 0.2 0.4 0.4
MD5-32 0.3 0.2 0.2 0.1 0.3 0.2 0.2 0.2
SHA1-32 0.2 0.2 0.1 0.1 0.1 0.1 0.1 0.1
FNV-1a 0.9 1.0 0.5 0.3 0.9 0.5 0.9 0.9
FNV-1amod 0.9 0.9 0.5 0.3 0.9 0.5 0.9 0.8
djb2 1.0 0.9 0.7 0.4 1.1 0.7 0.9 0.8
SDBM 0.9 0.9 0.5 0.3 0.9 0.5 0.9 0.8
Performance at small size (17 byte) data, in GB/s:

Hash            Win64  Mac64  iPhoneSE64  XboxOne64  Win32  iPhoneSE32  asmjs-Firefox  asmjs-Chrome
xxHash64 2.1 1.8 0.5 0.5 0.4 0.4 0.1 0.0
City64 3.4 3.0 1.5 0.7 0.5 0.7 0.2 0.0
Mum 1.2 2.6 0.9 0.2 0.3 0.5 0.1 0.0
Farm64 3.6 2.6 1.2 0.7 0.6 0.6 0.1 0.0
SpookyV2-64 1.2 1.0 0.6 0.4 0.2 0.1 — —
xxHash32 2.2 2.0 0.7 0.5 1.9 0.8 1.2 0.8
Murmur3-X64-64 1.7 1.3 0.5 0.4 0.3 0.3 — —
Murmur2A 2.4 1.8 1.1 0.5 2.1 1.0 — —
Murmur3-32 2.1 1.5 0.8 0.4 1.8 0.8 — —
City32 2.1 1.9 0.9 0.5 1.7 0.7 0.8 0.7
Farm32 2.5 2.0 0.8 0.5 1.8 0.9 0.9 0.5
SipRef 0.6 0.6 0.3 0.2 0.2 0.1 0.0 0.0
CRC32 0.8 0.7 0.4 0.2 0.6 0.3 0.5 0.4
MD5-32 0.1 0.1 0.0 0.0 0.1 0.0 0.1 0.0
SHA1-32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
FNV-1a 1.3 1.5 1.0 0.3 1.2 0.7 1.4 1.0
FNV-1amod 1.1 1.4 0.9 0.3 1.0 0.7 1.3 0.9
djb2 1.6 1.6 1.1 0.4 1.1 0.7 1.3 0.9
SDBM 1.4 1.3 1.1 0.4 1.2 0.7 1.5 1.0
A note on hash quality
As far as I’m concerned, all the 64 bit hashes are excellent quality.

Most of the 32 bit hashes are pretty good too on the data sets I tested.

SDBM produces more collisions than others on binary-like data (various struct memory dumps, 5742 entries, average length 55 bytes – SDBM had 64 collisions, all the other hashes had zero). You could have a way worse hash function than SDBM of course, but then functions like FNV-1a are about as simple to implement, and seem to behave better on binary data.

A note on hash consistency
Some of the hash functions do not produce identical output on various platforms. Among the ones I tested, mum-hash and FarmHash produced different output depending on compiler and platform used.

It’s very likely that most of the above hash functions produce different output if ran on a big-endian CPU. I did not have any platform like that to test on.

Some of the hash functions depend on unaligned memory reads being possible – most notably Murmur and Spooky. I had to change Spooky to work on ARM 32 bit (define ALLOW_UNALIGNED_READS to zero in the source code). Murmur and Spooky did produce wrong results on asm.js (no crash, just different hash results and way more collisions than expected).
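A common portable fix for the unaligned-read problem is to go through memcpy instead of casting the pointer: compilers turn this into a single load on x86 and ARM64, and it stays correct on targets (asm.js, some ARM) where a casted dereference traps or silently reads the wrong bytes. A minimal sketch:

```c
#include <stdint.h>
#include <string.h>

/* Portable unaligned 32-bit load. The memcpy is optimized away to a
   single mov on platforms that allow unaligned access, and becomes
   byte loads on platforms that don't; either way the result is
   correct. Byte order is the host's; swap afterwards if you need a
   fixed endianness. */
static uint32_t read_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```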

General cross-platform use: CityHash64 on a 64 bit system; xxHash32 on a 32 bit system.

Good performance across various data sizes.
Identical hash results across various little-endian platforms and compilers.
No weird compilation options to tweak, and no performance drops on compilers it is not tuned for.
Best throughput on large data: depends on platform!

Intel CPU: xxHash64 in general, FarmHash64 if you can use SSE4.2, xxHash32 if compiling for 32 bit.
Apple mobile CPU (A9): FarmHash64 for 64 bit, xxHash32 for 32 bit.
Console (XboxOne, AMD Jaguar): SpookyV2.
asm.js: xxHash32.
Best for short strings: FNV-1a.

Super simple to implement, inline-able.
Where exactly it becomes “generally fastest” depends on a platform; around 8 bytes or less on PC, mobile and console; around 20 bytes or less on asm.js.
If your data is fixed size (e.g. one or two integers), look into specialized hashes instead (see above).
Text content © Aras Pranckevičius. Code snippets are public domain, unless specified otherwise.

robertdavidgraham / masscan

MASSCAN: Mass IP port scanner
This is the fastest Internet port scanner. It can scan the entire Internet in under 6 minutes, transmitting 10 million packets per second.

It produces results similar to nmap, the most famous port scanner. Internally, it operates more like scanrand, unicornscan, and ZMap, using asynchronous transmission. The major difference is that it’s faster than these other scanners. In addition, it’s more flexible, allowing arbitrary address ranges and port ranges.

NOTE: masscan uses a custom TCP/IP stack. Anything other than simple port scans will cause conflict with the local TCP/IP stack. This means you need to either use the -S option to use a separate IP address, or configure your operating system to firewall the ports that masscan uses.

This tool is free, but consider funding it here: Bitcoin wallet address: 1MASSCANaHUiyTtR3bJ2sLGuMw5kDBaj4T

On Debian/Ubuntu, it goes something like this:

$ sudo apt-get install git gcc make libpcap-dev
$ git clone
$ cd masscan
$ make
This puts the program in the masscan/bin subdirectory. You’ll have to manually copy it to something like /usr/local/bin if you want to install it elsewhere on the system.

The source consists of a lot of small files, so building goes a lot faster by using the multi-threaded build:

$ make -j
While Linux is the primary target platform, the code runs well on many other systems. Here’s some additional build info:

Windows w/ Visual Studio: use the VS10 project
Windows w/ MinGW: just type make
Windows w/ cygwin: won’t work
Mac OS X w/ Xcode: use the XCode4 project
Mac OS X w/ cmdline: just type make
FreeBSD: type gmake
other: I don’t know, don’t care
To get beyond 2 million packets/second, you need an Intel 10-gbps Ethernet adapter and a special driver known as “PF_RING ZC” from ntop. Masscan doesn’t need to be rebuilt in order to use PF_RING. To use PF_RING, you need to build the following components (installed under /usr/lib/):
pf_ring.ko (their kernel driver)
ixgbe.ko (their version of the Intel 10-gbps Ethernet driver)
You don’t need to build their version of

When Masscan detects that an adapter is named something like zc:enp1s0 instead of something like enp1s0, it’ll automatically switch to PF_RING ZC mode.

Regression testing
The project contains a built-in self-test:

$ make regress
bin/masscan --regress
selftest: success!
This tests a lot of tricky bits of the code. You should do this after building.

Performance testing
To test performance, run something like the following:

$ bin/masscan -p80 --rate 100000000 --router-mac 66-55-44-33-22-11
The bogus --router-mac keeps packets on the local network segments so that they won’t go out to the Internet.

You can also test in “offline” mode, which is how fast the program runs without the transmit overhead:

$ bin/masscan -p80 --rate 100000000 --offline
This second benchmark shows roughly how fast the program would run if it were using PF_RING, which has near zero overhead.

Usage is similar to nmap. To scan a network segment for some ports:

# masscan -p80,8000-8100
This will:

scan the 10.x.x.x subnet, all 16 million addresses
scans port 80 and the range 8000 to 8100, or 102 ports total
print output to stdout, which can be redirected to a file
To see the complete list of options, use the --echo feature. This dumps the current configuration and exits. This output can be used as input back into the program:

# masscan -p80,8000-8100 --echo > xxx.conf
# masscan -c xxx.conf --rate 1000
Banner checking
Masscan can do more than just detect whether ports are open. It can also complete the TCP connection and interact with the application at that port in order to grab simple “banner” information.

The problem with this is that masscan contains its own TCP/IP stack separate from the system you run it on. When the local system receives a SYN-ACK from the probed target, it responds with a RST packet that kills the connection before masscan can grab the banner.

The easiest way to prevent this is to assign masscan a separate IP address. This would look like the following:

# masscan -p80 --banners --source-ip
The address you choose has to be on the local subnet and not otherwise be used by another system.

In some cases, such as WiFi, this isn’t possible. In those cases, you can firewall the port that masscan uses. This prevents the local TCP/IP stack from seeing the packet, but masscan still sees it since it bypasses the local stack. For Linux, this would look like:

# iptables -A INPUT -p tcp --dport 61000 -j DROP
# masscan -p80 --banners --source-port 61000
You probably want to pick ports that don’t conflict with ports Linux might otherwise choose for source-ports. You can see the range Linux uses, and reconfigure that range, by looking in the file:

On the latest version of Kali Linux (2018-August), that range is 32768 to 60999, so you should choose ports either below 32768 or 61000 and above.

Setting an iptables rule only lasts until the next reboot. You need to look up how to save the configuration depending upon your distro, such as using iptables-save and/or iptables-persistent.

On Mac OS X and BSD, there are similar steps. To find out the ranges to avoid, use a command like the following:

# sysctl net.inet.ip.portrange.first net.inet.ip.portrange.last
On FreeBSD and older MacOS, use an ipfw command:

# sudo ipfw add 1 deny tcp from any to any 40000 in
# masscan -p80 --banners --source-port 40000
On newer MacOS and OpenBSD, use the pf packet-filter utility. Edit the file /etc/pf.conf to add a line like the following:

block in proto tcp from any to any port 40000
Then to enable the firewall, run the command:

# pfctl -E
If the firewall is already running, then either reboot or reload the rules with the following command:

# pfctl -f /etc/pf.conf
Windows doesn’t respond with RST packets, so neither of these techniques is necessary. However, masscan is still designed to work best using its own IP address, so you should run it that way when possible, even when it’s not strictly necessary.

The same thing is needed for other checks, such as the --heartbleed check, which is just a form of banner checking.

How to scan the entire Internet
While useful for smaller, internal networks, the program is really designed with the entire Internet in mind. It might look something like this:

# masscan -p0-65535
Scanning the entire Internet is bad. For one thing, parts of the Internet react badly to being scanned. For another thing, some sites track scans and add you to a ban list, which will get you firewalled from useful parts of the Internet. Therefore, you want to exclude a lot of ranges. To blacklist or exclude ranges, you want to use the following syntax:

# masscan -p0-65535 --excludefile exclude.txt
This just prints the results to the command-line. You probably want them saved to a file instead. Therefore, you want something like:

# masscan -p0-65535 -oX scan.xml
This saves the results in an XML file, allowing you to easily dump the results in a database or something.

But this only goes at the default rate of 100 packets/second, which will take forever to scan the Internet. You need to speed it up like so:

# masscan -p0-65535 --max-rate 100000
This increases the rate to 100,000 packets/second, which will scan the entire Internet (minus excludes) in about 10 hours per port (or 655,360 hours if scanning all ports).

The thing to notice about this command-line is that these are all nmap compatible options. In addition, “invisible” options compatible with nmap are also set for you: -sS -Pn -n --randomize-hosts --send-eth. Likewise, the format of the XML file is inspired by nmap. There are, of course, a lot of differences, because the asynchronous nature of the program leads to a fundamentally different approach to the problem.

The above command-line is a bit cumbersome. Instead of putting everything on the command-line, it can be stored in a file instead. The above settings would look like this:

# My Scan
rate = 100000.00
output-format = xml
output-status = all
output-filename = scan.xml
ports = 0-65535
range =
excludefile = exclude.txt
To use this configuration file, use the -c option:

# masscan -c myscan.conf
This also makes things easier when you repeat a scan.

By default, masscan first loads the configuration file /etc/masscan/masscan.conf. Any later configuration parameters override what’s in this default configuration file. That’s where I put my “excludefile” parameter, so that I don’t ever forget it. It just works automatically.

Getting output
By default, masscan produces fairly large text files, but it’s easy to convert them into any other format. There are five supported output formats:

xml: Just use the parameter -oX . Or, use the parameters --output-format xml and --output-filename .

binary: This is the masscan builtin format. It produces much smaller files, so that when I scan the Internet my disk doesn’t fill up. They need to be parsed, though. The command line option --readscan will read binary scan files. Using --readscan with the -oX option will produce an XML version of the results file.

grepable: This is an implementation of the Nmap -oG output that can be easily parsed by command-line tools. Just use the parameter -oG . Or, use the parameters --output-format grepable and --output-filename .

json: This saves the results in JSON format. Just use the parameter -oJ . Or, use the parameters --output-format json and --output-filename .

list: This is a simple list with one host and port pair per line. Just use the parameter -oL . Or, use the parameters --output-format list and --output-filename . The format is:

open tcp 80 XXX.XXX.XXX.XXX 1390380064
Comparison with Nmap
Where reasonable, every effort has been taken to make the program familiar to nmap users, even though it’s fundamentally different. Two important differences are:

no default ports to scan, you must specify -p
target hosts are IP addresses or simple ranges, not DNS names, nor the funky subnet ranges nmap can use (like 10.0.0-255.0-255).
You can think of masscan as having the following settings permanently enabled:

-sS: this does SYN scan only (currently, will change in the future)
-Pn: doesn’t ping hosts first, which is fundamental to the async operation
-n: no DNS resolution happens
--randomize-hosts: scan completely randomized
--send-eth: sends using raw libpcap
If you want a list of additional nmap compatible settings, use the following command:

# masscan --nmap
Transmit rate (IMPORTANT!!)
This program spews out packets very fast. On Windows, or from VMs, it can do 300,000 packets/second. On Linux (no virtualization) it’ll do 1.6 million packets-per-second. That’s fast enough to melt most networks.

Note that it’ll only melt your own network. It randomizes the target IP addresses so that it shouldn’t overwhelm any distant network.

By default, the rate is set to 100 packets/second. To increase the rate to a million, use something like --rate 1000000.

Design
This section describes the major design issues of the program.

Code Layout
The file main.c contains the main() function, as you’d expect. It also contains the transmit_thread() and receive_thread() functions. These functions have been deliberately flattened and heavily commented so that you can read the design of the program simply by stepping line-by-line through each of these.

This is an asynchronous design. In other words, it is to nmap what the nginx web-server is to Apache. It has separate transmit and receive threads that are largely independent from each other. It’s the same sort of design found in scanrand, unicornscan, and ZMap.

Because it’s asynchronous, it runs as fast as the underlying packet transmit allows.

A key difference between Masscan and other scanners is the way it randomizes targets.

The fundamental principle is to have a single index variable that starts at zero and is incremented by one for every probe. In C code, this is expressed as:

for (i = 0; i < range; i++) {
    /* ...probe the target derived from index i... */
}

We have to translate the index into an IP address. Let's say that you want to scan all "private" IP addresses. That would be the table of ranges like:

192.168.0.0/16 (65,536 addresses)
10.0.0.0/8 (16,777,216 addresses)
172.16.0.0/12 (1,048,576 addresses)

In this example, the first 64k indexes are appended to 192.168.x.x to form the target address. Then, the next 16 million are appended to 10.x.x.x. The remaining indexes in the range are applied to 172.16.x.x.

In this example, we only have three ranges. When scanning the entire Internet, we have in practice more than 100 ranges. That's because you have to blacklist or exclude a lot of sub-ranges. This chops up the desired range into hundreds of smaller ranges.

This leads to one of the slowest parts of the code. We transmit 10 million packets per second, and have to convert an index variable to an IP address for each and every probe. We solve this by doing a "binary search" in a small amount of memory. At this packet rate, cache efficiencies start to dominate over algorithm efficiencies. There are a lot of more efficient techniques in theory, but they all require so much memory as to be slower in practice.

We call the function that translates from an index into an IP address the pick() function. In use, it looks like:

for (i = 0; i < range; i++) {
    ip = pick(addresses, i);
    scan(ip);
}
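The text does not show pick() itself, but based on the description above (a binary search over a small table of ranges), it might look roughly like this; the struct layout and field names are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical range table entry: `count` consecutive IPv4 addresses
   starting at `first`. `offset` is the cumulative number of addresses
   in all earlier entries, so the table is sorted by offset. */
struct range {
    uint64_t offset;   /* index of this range's first address */
    uint32_t first;    /* first IPv4 address in the range */
    uint64_t count;    /* number of addresses in the range */
};

/* Binary-search for the entry containing index i, then offset into it.
   This is the O(log n) lookup described in the text; at 10M lookups
   per second the small table stays cache-resident. */
static uint32_t pick(const struct range *r, size_t n, uint64_t i)
{
    size_t lo = 0, hi = n;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (r[mid].offset <= i)
            lo = mid;
        else
            hi = mid;
    }
    return r[lo].first + (uint32_t)(i - r[lo].offset);
}
```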
Masscan supports not only IP address ranges, but also port ranges. This means we need to pick from the index variable both an IP address and a port. This is fairly straightforward:

range = ip_count * port_count;
for (i = 0; i < range; i++) {
    ip = pick(addresses, i / port_count);
    port = pick(ports, i % port_count);
    scan(ip, port);
}
This leads to another expensive part of the code. The division/modulus instructions are around 90 clock cycles, or 30 nanoseconds, on x86 CPUs. When transmitting at a rate of 10 million packets/second, we have only 100 nanoseconds per packet. I see no way to optimize this any better. Luckily, though, two such operations can be executed simultaneously, so doing two of these as shown above is no more expensive than doing one.
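The "two such operations executed simultaneously" point is that a single hardware divide produces both the quotient and the remainder; C's lldiv() makes that pairing explicit. A sketch (not masscan's actual code, and the helper name is mine):

```c
#include <stdlib.h>
#include <stdint.h>

/* Split a linear scan index into an address index and a port index.
   i / port_count and i % port_count come from one hardware divide;
   lldiv() returns both at once. */
static void split_index(uint64_t i, uint64_t port_count,
                        uint64_t *ip_index, uint64_t *port_index)
{
    lldiv_t qr = lldiv((long long)i, (long long)port_count);
    *ip_index = (uint64_t)qr.quot;
    *port_index = (uint64_t)qr.rem;
}
```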

There are actually some easy optimizations for the above performance problems, but they all rely upon i++, the fact that the index variable increases one by one through the scan. Actually, we need to randomize this variable. We need to randomize the order of IP addresses that we scan or we'll blast the heck out of target networks that aren't built for this level of speed. We need to spread our traffic evenly over the target.

The way we randomize is simply by encrypting the index variable. By definition, encryption is random, and creates a 1-to-1 mapping between the original index variable and the output. This means that while we linearly go through the range, the output IP addresses are completely random. In code, this looks like:

range = ip_count * port_count;
for (i = 0; i < range; i++) {
    x = encrypt(i);
    ip = pick(addresses, x / port_count);
    port = pick(ports, x % port_count);
    scan(ip, port);
}
This also has a major cost. Since the range is an unpredictable size instead of a nice even power of 2, we can't use cheap binary techniques like AND (&) and XOR (^). Instead, we have to use expensive operations like MODULUS (%). In my current benchmarks, it's taking 40 nanoseconds to encrypt the variable.
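One standard way to build such an encryption over an arbitrary, non-power-of-2 range is a small Feistel network over the next power-of-two domain, plus "cycle walking": re-encrypt any output that falls outside the range until one lands inside. Masscan's actual cipher (BlackRock, in its source) differs; this is only a sketch of the idea, with a toy round function:

```c
#include <stdint.h>

/* Toy 4-round Feistel permutation over a 2^(2*halfbits) domain,
   keyed by `seed`. NOT masscan's real cipher; just the shape. */
static uint64_t feistel(uint64_t x, uint64_t seed, unsigned halfbits)
{
    uint64_t mask = (1ULL << halfbits) - 1;
    uint64_t l = x >> halfbits, r = x & mask;
    for (int round = 0; round < 4; round++) {
        /* toy round function: multiply-shift keyed by seed and round */
        uint64_t f = (r * 0x9E3779B97F4A7C15ULL + seed + round) >> (64 - halfbits);
        uint64_t tmp = r;
        r = (l ^ f) & mask;
        l = tmp;
    }
    return (l << halfbits) | r;
}

/* Map index i to a permutation of [0, range): pick the smallest
   even-bit power-of-two domain >= range, then cycle-walk until the
   output falls inside the range. Because the Feistel step is a
   bijection, the overall mapping is still 1-to-1. */
static uint64_t encrypt_index(uint64_t i, uint64_t range, uint64_t seed)
{
    unsigned halfbits = 1;
    while (halfbits < 32 && (1ULL << (2 * halfbits)) < range)
        halfbits++;
    uint64_t x = i;
    do {
        x = feistel(x, seed, halfbits);
    } while (x >= range);
    return x;
}
```

The cycle walk terminates because the permutation's cycle containing i must return to i itself, which is inside the range; on average it takes only a couple of extra rounds since the domain is less than 4x the range.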

This architecture allows for lots of cool features. For example, it supports "shards". You can set up 5 machines, each doing a fifth of the scan, or range / shard_count. Shards can be multiple machines, or simply multiple network adapters on the same machine, or even (if you want) multiple IP source addresses on the same network adapter.

Or, you can use a 'seed' or 'key' to the encryption function, so that you get a different order each time you scan, like x = encrypt(seed, i).

We can also pause the scan by exiting out of the program, simply remembering the current value of i, and restart it later. I do that a lot during development. I see something going wrong with my Internet scan, so I hit Ctrl-C to stop the scan, then restart it after I've fixed the bug.

Another feature is retransmits/retries. Packets sometimes get dropped on the Internet, so you can send two packets back-to-back. However, something that drops one packet may drop the immediately following packet. Therefore, you want to send the copy about 1 second apart. This is simple. We already have a 'rate' variable, which is the number of packets-per-second rate we are transmitting at, so the retransmit function is simply to use i + rate as the index. One of these days I'm going to do a study of the Internet, and differentiate "back-to-back", "1 second", "10 second", and "1 minute" retransmits this way in order to see if there is any difference in what gets dropped.

C10 Scalability
The asynchronous technique is known as a solution to the "c10k problem". Masscan is designed for the next level of scalability, the "C10M problem".

The C10M solution is to bypass the kernel. There are three primary kernel bypasses in Masscan:

custom network driver
user-mode TCP stack
user-mode synchronization
Masscan can use the PF_RING DNA driver. This driver DMAs packets directly from user-mode memory to the network driver with zero kernel involvement. That allows software, even with a slow CPU, to transmit packets at the maximum rate the hardware allows. If you put 8 10-gbps network cards in a computer, this means it could transmit at 100-million packets/second.

Masscan has its own built-in TCP stack for grabbing banners from TCP connections. This means it can easily support 10 million concurrent TCP connections, assuming of course that the computer has enough memory.

Masscan has no "mutex". Modern mutexes (aka futexes) are mostly user-mode, but they have two problems. The first problem is that they cause cache-lines to bounce quickly back-and-forth between CPUs. The second is that when there is contention, they'll do a system call into the kernel, which kills performance. Mutexes on the fast path of a program severely limit scalability. Instead, Masscan uses "rings" to synchronize things, such as when the user-mode TCP stack in the receive thread needs to transmit a packet without interfering with the transmit thread.
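A minimal single-producer/single-consumer ring of the kind described, where one thread only ever writes head and the other only ever writes tail so no mutex is needed, might be sketched like this (not masscan's actual ring code):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdatomic.h>

/* SPSC ring: capacity must be a power of two so indexes wrap with a
   cheap AND instead of a modulus. head/tail are free-running counters;
   head - tail is the number of occupied slots. */
#define RING_SIZE 256

struct ring {
    _Atomic uint32_t head;      /* written only by the producer */
    _Atomic uint32_t tail;      /* written only by the consumer */
    void *slots[RING_SIZE];
};

/* Producer side: returns 0 if the ring is full. */
static int ring_push(struct ring *r, void *item)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return 0;                       /* full */
    r->slots[head & (RING_SIZE - 1)] = item;
    /* release: the slot write above becomes visible before the new head */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer side: returns NULL if the ring is empty. */
static void *ring_pop(struct ring *r)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return NULL;                    /* empty */
    void *item = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return item;
}
```

Each counter has exactly one writer, so the only synchronization needed is the acquire/release pairing on the loads and stores; there is no lock to contend on and no system call on the fast path.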

The code runs well on Linux, Windows, and Mac OS X. All the important bits are in standard C (C90). It therefore compiles on Visual Studio with Microsoft's compiler, the Clang/LLVM compiler on Mac OS X, and GCC on Linux.

Windows and Macs aren't tuned for packet transmit, and get only about 300,000 packets-per-second, whereas Linux can do 1,500,000 packets/second. That's probably faster than you want anyway.

Safe code
A bounty is offered for vulnerabilities, see the file for more information.

This project uses safe functions like strcpy_s() instead of unsafe functions like strcpy().

This project has automated unit regression tests (make regress).

A lot of effort has gone into making the input/output look like nmap, which everyone who does port scans is (or should be) familiar with.

This tool was created by Robert Graham (twitter: @ErrataRob).


View POST Data using Chrome Developer Tools
2016-09-19 Web Scraping Andrew B. Collier

When figuring out how to formulate the contents of a POST request it’s often useful to see the “typical” fields submitted directly from a web form.

Open Developer Tools in Chrome. Select the Network tab (at the top).
Submit the form. Watch the magic happening in the Developer Tools console.

Click on the first document listed in the Developer Tools console, then select the Headers tab.

That’s just scratching the surface of the wealth of information available on the Network tab. Read this to find out more.


AlphaZero: Shedding new light on the grand games of chess, shogi and Go


In late 2017 we introduced AlphaZero, a single system that taught itself from scratch how to master the games of chess, shogi (Japanese chess), and Go, beating a world-champion program in each case. We were excited by the preliminary results and thrilled to see the response from members of the chess community, who saw in AlphaZero’s games a ground-breaking, highly dynamic and “unconventional” style of play that differed from any chess playing engine that came before it.

Today, we are delighted to introduce the full evaluation of AlphaZero, published in the journal Science (Open Access version here), which confirms and updates those preliminary results. It describes how AlphaZero quickly learns each game to become the strongest player in history in each, despite starting its training from random play, with no built-in domain knowledge beyond the basic rules of the game.


Chess, a Drosophila of reasoning: Garry Kasparov on AlphaZero
“I can’t disguise my satisfaction that it plays with a very dynamic style, much like my own!”

Garry Kasparov, former World Chess Champion
This ability to learn each game afresh, unconstrained by the norms of human play, results in a distinctive, unorthodox, yet creative and dynamic playing style. Chess Grandmaster Matthew Sadler and Women’s International Master Natasha Regan, who have analysed thousands of AlphaZero’s chess games for their forthcoming book Game Changer (New in Chess, January 2019), say its style is unlike any traditional chess engine. “It’s like discovering the secret notebooks of some great player from the past,” says Matthew.

AlphaZero: Player and potential

Traditional chess engines – including the world computer chess champion Stockfish and IBM’s ground-breaking Deep Blue – rely on thousands of rules and heuristics handcrafted by strong human players that try to account for every eventuality in a game. Shogi programs are also game specific, using similar search engines and algorithms to chess programs.

AlphaZero takes a totally different approach, replacing these hand-crafted rules with a deep neural network and general purpose algorithms that know nothing about the game beyond the basic rules.

AlphaZero – Generality
In chess, AlphaZero first outperformed Stockfish after just 4 hours; in shogi, AlphaZero first outperformed Elmo after 2 hours; and in Go, AlphaZero first outperformed the version of AlphaGo that beat the legendary player Lee Sedol in 2016 after 30 hours. Note: each training step represents 4,096 board positions.
To learn each game, an untrained neural network plays millions of games against itself via a process of trial and error called reinforcement learning. At first, it plays completely randomly, but over time the system learns from wins, losses, and draws to adjust the parameters of the neural network, making it more likely to choose advantageous moves in the future. The amount of training the network needs depends on the style and complexity of the game, taking approximately 9 hours for chess, 12 hours for shogi, and 13 days for Go.
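The training loop can be illustrated in miniature. The sketch below is emphatically not AlphaZero (no neural network, no tree search); it applies the same play-yourself-and-update idea to a toy take-away game, using a simple value table in place of network parameters:

```python
import random

def train(stones=12, episodes=20000, alpha=0.1, eps=0.2, seed=0):
    """Tabular self-play on a toy game: take 1-3 stones per turn, and
    taking the last stone wins. V[s] estimates the win probability for
    the player to move with s stones left. The loop mirrors the idea in
    the text: play against yourself, then nudge the value estimates
    toward the observed outcomes so good moves become more likely.
    """
    rng = random.Random(seed)
    V = {s: 0.5 for s in range(stones + 1)}
    V[0] = 0.0                      # no stones left: the mover has already lost
    for _ in range(episodes):
        s = rng.randint(1, stones)
        while s > 0:
            takes = [k for k in (1, 2, 3) if k <= s]
            if rng.random() < eps:                   # explore a random move
                k = rng.choice(takes)
            else:                                    # exploit: leave the opponent
                k = min(takes, key=lambda m: V[s - m])   # their worst state
            # my value here = 1 - opponent's value in the state I hand over
            V[s] += alpha * ((1 - V[s - k]) - V[s])
            s -= k
    return V
```

After training, the table reflects the game's known theory: positions that are multiples of 4 (lost for the player to move) end up with values near 0, and all other positions near 1.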

“Some of its moves, such as moving the King to the centre of the board, go against shogi theory and – from a human perspective – seem to put AlphaZero in a perilous position. But incredibly it remains in control of the board. Its unique playing style shows us that there are new possibilities for the game.”

Yoshiharu Habu, 9-dan professional, only player in history to hold all seven major shogi titles
The trained network is used to guide a search algorithm – known as Monte-Carlo Tree Search (MCTS) – to select the most promising moves in games. For each move, AlphaZero searches only a small fraction of the positions considered by traditional chess engines. In chess, for example, it searches only 60 thousand positions per second, compared to roughly 60 million for Stockfish.
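For illustration, here is a plain UCT variant of MCTS applied to a toy take-away game (take 1-3 stones; taking the last stone wins). AlphaZero's search is guided by its neural network rather than the random rollouts used here, but the select / expand / rollout / backpropagate skeleton is the same:

```python
import math, random

def legal(s):
    """Legal moves with s stones left: take 1, 2 or 3 stones."""
    return [k for k in (1, 2, 3) if k <= s]

def mcts_move(s0, iters=4000, c=1.4, seed=0):
    """Choose a move from state s0 by UCT search. Reaching 0 means the
    previous mover took the last stone and won."""
    rng = random.Random(seed)
    N, W = {s0: 0}, {s0: 0.0}   # visits / wins, from the mover-at-state's view
    for _ in range(iters):
        s, path = s0, [s0]
        # selection: walk down while every child already has statistics
        while s > 0 and all((s - k) in N for k in legal(s)):
            parent = s
            s = max((s - k for k in legal(s)),
                    key=lambda ch: (1 - W[ch] / N[ch] if N[ch] else 0.5)
                    + c * math.sqrt(math.log(N[parent] + 1) / (N[ch] + 1)))
            path.append(s)
        # expansion: add one unexplored child
        if s > 0:
            s -= rng.choice([k for k in legal(s) if (s - k) not in N])
            N[s], W[s] = 0, 0.0
            path.append(s)
        # rollout: random play to the end of the game
        t, plies = s, 0
        while t > 0:
            t -= rng.choice(legal(t))
            plies += 1
        # the mover at the leaf wins iff they make the last (odd) ply;
        # a leaf of 0 means that mover has already lost
        win = float(plies % 2 == 1) if s > 0 else 0.0
        # backpropagation: alternate perspective back up the path
        for node in reversed(path):
            N[node] += 1
            W[node] += win
            win = 1 - win
    # play the most-visited move, as MCTS implementations typically do
    return max(legal(s0), key=lambda k: N.get(s0 - k, 0))
```

On this toy game the visit counts concentrate on the theoretically winning move, e.g. taking one stone from a pile of five to leave the opponent a lost multiple of four.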

The fully trained systems were tested against the strongest hand-crafted engines for chess (Stockfish) and shogi (Elmo), along with our previous self-taught system AlphaGo Zero, the strongest Go player known.

Each program ran on the hardware for which it was designed. Stockfish and Elmo used 44 CPU cores (as in the TCEC world championship), whereas AlphaZero and AlphaGo Zero used a single machine with 4 first-generation TPUs and 44 CPU cores. A first-generation TPU is roughly similar in inference speed to commodity hardware such as an NVIDIA Titan V GPU, although the architectures are not directly comparable.
All matches were played using time controls of three hours per game, plus an additional 15 seconds for each move.
Download hundreds of AlphaZero’s games
In each evaluation, AlphaZero convincingly beat its opponent:

In chess, AlphaZero defeated the 2016 TCEC (Season 9) world champion Stockfish, winning 155 games and losing just six games out of 1,000. To verify the robustness of AlphaZero, we also played a series of matches that started from common human openings. In each opening, AlphaZero defeated Stockfish. We also played a match that started from the set of opening positions used in the 2016 TCEC world championship, along with a series of additional matches against the most recent development version of Stockfish, and a variant of Stockfish that uses a strong opening book. In all matches, AlphaZero won.
In shogi, AlphaZero defeated the 2017 CSA world champion version of Elmo, winning 91.2% of games.
In Go, AlphaZero defeated AlphaGo Zero, winning 61% of games.

However, it was the style in which AlphaZero plays these games that players may find most fascinating. In chess, for example, AlphaZero independently discovered and played common human motifs, such as openings, king safety and pawn structure, during its self-play training. But, being self-taught and therefore unconstrained by conventional wisdom about the game, it also developed its own intuitions and strategies, adding a new and expansive set of exciting and novel ideas that augment centuries of thinking about chess strategy.

“Chess has been used as a Rosetta Stone of both human and machine cognition for over a century. AlphaZero renews the remarkable connection between an ancient board game and cutting-edge science by doing something extraordinary.”

Garry Kasparov, former World Chess Champion
The first thing that players will notice is AlphaZero’s style, says Matthew Sadler – “the way its pieces swarm around the opponent’s king with purpose and power”. Underpinning that, he says, is AlphaZero’s highly dynamic game play, which maximises the activity and mobility of its own pieces while minimising the activity and mobility of its opponent’s pieces. Counterintuitively, AlphaZero also seems to place less value on “material”, the idea underpinning the modern game that each piece has a value and that a player whose pieces total a greater value holds a material advantage. Instead, AlphaZero is willing to sacrifice material early in a game for gains that will only be recouped in the long term.


Mastering board games: Perspective from Deep Blue co-creator Murray Campbell
“Impressively, it manages to impose its style of play across a very wide range of positions and openings,” says Matthew, who also observes that it plays in a very deliberate style from its first move with a “very human sense of consistent purpose”.

“Traditional engines are exceptionally strong and make few obvious mistakes, but can drift when faced with positions with no concrete and calculable solution,” he says. “It’s precisely in such positions where ‘feeling’, ‘insight’ or ‘intuition’ is required that AlphaZero comes into its own.”

“The implications go far beyond my beloved chessboard… Not only do these self-taught expert machines perform incredibly well, but we can actually learn from the new knowledge they produce.”

Garry Kasparov, former World Chess Champion
This unique ability, not seen in other traditional chess engines, has already been harnessed to give chess fans fresh insight and commentary on the recent World Chess Championship match between Magnus Carlsen and Fabiano Caruana and will be explored further in Game Changer. “It was fascinating to see how AlphaZero’s analysis differed from that of top chess engines and even top grandmaster play,” says Natasha Regan. “AlphaZero could be a powerful teaching tool for the whole community.”

AlphaZero’s teachings echo what we saw when AlphaGo played the legendary champion Lee Sedol in 2016. During the games, AlphaGo played a number of highly inventive winning moves, including move 37 in game two, which overturned hundreds of years of thinking. These moves – and many others – have since been studied by players at all levels including Lee Sedol himself, who said of Move 37: “I thought AlphaGo was based on probability calculation and it was merely a machine. But when I saw this move I changed my mind. Surely AlphaGo is creative.”

As with Go, we are excited about AlphaZero’s creative response to chess, which has been a grand challenge for artificial intelligence since the dawn of the computing age with early pioneers including Babbage, Turing, Shannon, and von Neumann all trying their hand at designing chess programs. But AlphaZero is about more than chess, shogi or Go. To create intelligent systems capable of solving a wide range of real-world problems we need them to be flexible and generalise to new situations. While there has been some progress towards this goal, it remains a major challenge in AI research with systems capable of mastering specific skills to a very high standard, but often failing when presented with even slightly modified tasks.

AlphaZero’s ability to master three different complex games – and potentially any perfect information game – is an important step towards overcoming this problem. It demonstrates that a single algorithm can learn how to discover new knowledge in a range of settings. And, while it is still early days, AlphaZero’s creative insights coupled with the encouraging results we see in other projects such as AlphaFold, give us confidence in our mission to create general purpose learning systems that will one day help us find novel solutions to some of the most important and complex scientific problems.

Read the paper in Science
Download an Open Access version of the paper [PDF]
Read the accompanying Science editorial by Garry Kasparov
Read the accompanying Perspective article in Science by Deep Blue co-creator Murray Campbell
Download the top 20 AlphaZero-Stockfish games chosen by Grandmaster Matthew Sadler [.zip]
Download the top 10 AlphaZero-Elmo games chosen by shogi Master Yoshiharu Habu [.zip]
Download 210 AlphaZero-Stockfish chess games and 100 AlphaZero-Elmo shogi games
Download the accompanying artwork
Find out more about the AlphaZero book Game Changer (New in Chess, January 2019)
This work was done by David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis.