aras-p.info: More Hash Function Tests

aras-p.info: More Hash Function Tests

Home Blog Talks Papers Projects
More Hash Function Tests
Posted on Aug 9, 2016 #code
In the previous post, I wrote about non-crypto hash functions, and did some performance tests. Turns out, it’s great to write about stuff! People at the comments/twitter/internets pointed out more things I should test. So here’s a follow-up post.

What is not the focus
This is about algorithms for “hashing some amount of bytes”, for use either in hashtables or for checksum / uniqueness detection. Depending on your situation, there’s a whole family of algorithms for that, and I am focusing on only one: non-cryptographic fast hash functions.

This post is not about cryptographic hashes. Do not read below if you need to hash passwords, sensitive data going through untrusted medium and so on. Use SHA-1, SHA-2, BLAKE2 and friends.
Also, I’m not focusing on algorithms that are designed to prevent possible hashtable Denial-of-Service attacks. If something comes from the other side of the internet and ends up inserted into your hashtable, then to prevent possible worst-case O(N) hashtable behavior you’re probably off by using a hash function that does not have known “hash flooding” attacks. SipHash seems to be popular now.
If you are hashing very small amounts of data of known size (e.g. single integers or two floats or whatever), you should probably use specialized hashing algorithms for those. Here are some integer hash functions, or 2D hashing with Weyl, or perhaps you could take some other algorithm and just specialize it’s code for your known input size (e.g. xxHash for a single integer).
I am testing 32 and 64 bit hash functions here. If you need larger hashes, quite likely some of these functions might be suitable (e.g. SpookyV2 always produces 128 bit hash).
When testing hash functions, I have not gone to great lengths to get them compiling properly or setting up all the magic flags on all my platforms. If some hash function works wonderfully when compiled on Linux Itanium box with an Intel compiler, that’s great for you, but if it performs poorly on the compilers I happen to use, I will not sing praises for it. Being in the games industry, I care about things like “what happens in Visual Studio”, and “what happens on iOS”, and “what happens on PS4”.

More hash function tests!
I’ve updated my hash testbed on github (use revision 9b59c91cf) to include more algorithms, changed tests a little, etc etc.

I checked both “hashing data that is aligned” (16-byte aligned address of data to hash), and unaligned data. Everywhere I tested, there wasn’t a notable performance difference that I could find (but then, I have not tested old ARM CPUs or PowerPC based ones). The only visible effect is that MurmurHash and SpookyHash don’t properly work in asm.js / Emscripten compilation targets, due to their usage of unaligned reads. I’d assume they probably don’t work on some ARM/PowerPC platforms too.

Hash functions tested below:

xxHash32 and xxHash64 – xxHash.
City32 and City64 – CityHash.
Mum – mum-hash.
Farm32 and Farm64 – FarmHash.
SpookyV2-64 – SpookyHash V2.
Murmur2A, Murmur3-32, Murmur3-X64-64 – MurmurHash family.
These are the main functions that are interesting. Because people kept on asking, and because “why not”, I’ve included a bunch of others too:

SipRef – SipHash-2-4 reference implementation. Like mentioned before, this one is supposedly good for avoiding hash flooding attacks.
MD5-32, SHA1-32, CRC32 – simple implementations of well-known hash functions (from SMHasher test suite). Again, these mostly not in the category I’m interested in, but included to illustrate the performance differences.
FNV-1a, FNV-1amod – FNV hash and modified version.
djb2, SDBM – both from this site.
Performance Results
Windows / Mac
macOS results, compiled with Xcode 7.3 (clang-based) in Release 64 bit build. Late 2013 MacBookPro:

Windows results, compiled with Visual Studio 2015 in Release 64 bit build. Core i7-5820K:

Notes:

Performance profiles of these are very similar.
xxHash64 wins at larger (0.5KB+) data sizes, followed closely by 64 bit FarmHash and CityHash.
At smaller data sizes (10-500 bytes), FarmHash and CityHash look very good!
mum-hash is much slower on Visual Studio. At first I thought it’s _MUM_UNALIGNED_ACCESS macro that was not handling VS-specific defines (_M_AMD64 and _M_IX86) properly (see PR). Turns out it’s not; the speed difference comes from _MUM_USE_INT128 which only kicks in on GCC-based compilers. mum-hash would be pretty competetive otherwise.
Windows 32 bit results, compiled with Visual Studio 2015 in Release 32 bit build. Core i7-5820K:

On a 32 bit platform / compilation target, things change quite a bit!

All 64 bit hash functions are slow. For example, xxHash64 frops from 13GB/s to 1GB/s. Use a 32 bit hash function when on a 32 bit target!
xxHash32 wins by a good amount.
Note on FarmHash – whole idea behind it is that it uses CPU-specific optimizations (that also change the computed hash value). The graphs above are using default compilation settings on both macOS and Windows. However, on macOS enabling SSE4.2 instructions in Xcode settings makes it much faster at larger data sizes:

With SSE4.2 enabled, FarmHash64 handily wins against xxHash64 (17.9GB/s vs 12.8GB/s) for data that is larger than 1KB. However, that requires compiling with SSE4.2, at my work I can’t afford that. Enabling the same options on XboxOne makes it slower 😦 Enabling just FARMHASH_ASSUME_AESNI makes the 32 bit FarmHash faster on XboxOne, but does not affect performance of the 64 bit hash. FarmHash also does not have any specific optimizations for ARM CPUs, so my verdict with it all is “not worth bothering” – afterall the impressive SSE4.2 speedup is only for large data sizes.

Mobile
iPhone SE (A9 CPU) results, compiled with Xcode 7.3 (clang-based) in Release 64 bit build:

xxHash never wins here; CityHash and FarmHash are fastest across the whole range.
iPhone SE 32 bit results:

This is similar to Windows 32 bit: 64 bit hash functions are slow, xxHash32 wins by a good amount.

Console
Xbox One (AMD Jaguar 1.75GHz CPU) results, compiled Visual Studio 2015 in Release mode:

Similar to iPhone results, xxHash is quite a bit slower than CityHash and FarmHash. xxHash uses 64 bit multiplications heavily, whereas others mostly do shifts and logic ops.
SpookyHash wins at larger data sizes.
JavaScript
JavaScript (asm.js via Emscripten) results, running on late 2013 MacBookPro.

asm.js is in all practical sense a 32 bit compilation target; 64 bit integer operations are slow.
xxHash32 wins, followed by FarmHash32.
At smaller (<25 bytes) data sizes, simple hashes like FNV-1a, SDBM or djb2 are useful.
Throughput tables
Performance at large data sizes (~4KB), in GB/s:

Hash 64 bit 32 bit
Win Mac iPhoneSE XboxOne Win iPhoneSE asm.js Firefox asm.js Chrome
xxHash64 13.2 12.8 5.7 1.5 1.1 1.5 0.3 0.1
City64 12.2 12.2 7.2 3.6 1.6 1.9 0.4 0.1
Mum 4.0 9.5 4.5 0.5 0.7 1.3 0.1 0.1
Farm64 12.3 11.9 8.2 3.3 2.4 2.2 0.4 0.1
SpookyV2-64 12.8 12.5 7.1 4.3 2.6 1.9 — —
xxHash32 6.8 6.6 4.0 1.5 6.7 3.7 2.5 1.4
Murmur3-X64-64 7.1 5.8 2.3 1.2 0.9 0.8 — —
Murmur2A 3.4 3.3 1.7 0.9 3.4 1.8 — —
Murmur3-32 3.1 2.7 1.3 0.5 3.1 1.3 — —
City32 5.1 4.9 2.6 0.9 4.0 2.6 1.1 0.8
Farm32 5.2 4.3 2.6 1.8 4.3 1.9 2.1 1.4
SipRef 1.4 1.6 1.0 0.4 0.3 0.4 0.1 0.0
CRC32 0.5 0.5 0.2 0.2 0.4 0.2 0.4 0.4
MD5-32 0.5 0.3 0.2 0.2 0.4 0.3 0.4 0.4
SHA1-32 0.5 0.5 0.3 0.1 0.4 0.2 0.3 0.2
FNV-1a 0.9 0.8 0.4 0.3 0.9 0.4 0.8 0.8
FNV-1amod 0.9 0.8 0.4 0.3 0.9 0.4 0.8 0.7
djb2 0.9 0.8 0.6 0.4 1.1 0.6 0.8 0.8
SDBM 0.9 0.8 0.4 0.3 0.8 0.4 0.8 0.8
Performance at medium size (128 byte) data, in GB/s:

Hash 64 bit 32 bit
Win Mac iPhoneSE XboxOne Win iPhoneSE asm.js Firefox asm.js Chrome
xxHash64 6.6 6.2 2.8 0.9 0.7 0.7 0.2 0.1
City64 7.6 7.6 4.4 1.7 1.1 1.5 0.3 0.1
Mum 3.2 7.6 4.1 0.5 0.6 1.1 0.1 0.0
Farm64 6.6 5.7 3.4 1.4 0.9 1.1 0.3 0.1
SpookyV2-64 3.4 3.2 1.7 1.4 0.7 0.5 — —
xxHash32 5.1 5.3 3.4 1.3 5.1 2.8 2.2 1.4
Murmur3-X64-64 5.3 4.3 2.1 1.0 0.8 0.7 — —
Murmur2A 3.6 3.0 2.1 0.8 3.3 1.7 — —
Murmur3-32 3.1 2.3 1.3 0.4 2.8 1.3 — —
City32 4.0 3.6 2.0 0.7 3.3 1.9 1.0 0.8
Farm32 3.9 3.2 1.9 1.0 3.4 1.6 1.6 1.1
SipRef 1.2 1.3 0.8 0.4 0.3 0.3 0.1 0.0
CRC32 0.5 0.5 0.2 0.2 0.4 0.2 0.4 0.4
MD5-32 0.3 0.2 0.2 0.1 0.3 0.2 0.2 0.2
SHA1-32 0.2 0.2 0.1 0.1 0.1 0.1 0.1 0.1
FNV-1a 0.9 1.0 0.5 0.3 0.9 0.5 0.9 0.9
FNV-1amod 0.9 0.9 0.5 0.3 0.9 0.5 0.9 0.8
djb2 1.0 0.9 0.7 0.4 1.1 0.7 0.9 0.8
SDBM 0.9 0.9 0.5 0.3 0.9 0.5 0.9 0.8
Performance at small size (17 byte) data, in GB/s:

Hash 64 bit 32 bit
Win Mac iPhoneSE XboxOne Win iPhoneSE asm.js Firefox asm.js Chrome
xxHash64 2.1 1.8 0.5 0.5 0.4 0.4 0.1 0.0
City64 3.4 3.0 1.5 0.7 0.5 0.7 0.2 0.0
Mum 1.2 2.6 0.9 0.2 0.3 0.5 0.1 0.0
Farm64 3.6 2.6 1.2 0.7 0.6 0.6 0.1 0.0
SpookyV2-64 1.2 1.0 0.6 0.4 0.2 0.1 — —
xxHash32 2.2 2.0 0.7 0.5 1.9 0.8 1.2 0.8
Murmur3-X64-64 1.7 1.3 0.5 0.4 0.3 0.3 — —
Murmur2A 2.4 1.8 1.1 0.5 2.1 1.0 — —
Murmur3-32 2.1 1.5 0.8 0.4 1.8 0.8 — —
City32 2.1 1.9 0.9 0.5 1.7 0.7 0.8 0.7
Farm32 2.5 2.0 0.8 0.5 1.8 0.9 0.9 0.5
SipRef 0.6 0.6 0.3 0.2 0.2 0.1 0.0 0.0
CRC32 0.8 0.7 0.4 0.2 0.6 0.3 0.5 0.4
MD5-32 0.1 0.1 0.0 0.0 0.1 0.0 0.1 0.0
SHA1-32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
FNV-1a 1.3 1.5 1.0 0.3 1.2 0.7 1.4 1.0
FNV-1amod 1.1 1.4 0.9 0.3 1.0 0.7 1.3 0.9
djb2 1.6 1.6 1.1 0.4 1.1 0.7 1.3 0.9
SDBM 1.4 1.3 1.1 0.4 1.2 0.7 1.5 1.0
A note on hash quality
As far as I’m concerned, all the 64 bit hashes are excellent quality.

Most of the 32 bit hashes are pretty good too on the data sets I tested.

SDBM produces more collisions than others on binary-like data (various struct memory dumps, 5742 entries, average length 55 bytes – SDBM had 64 collisions, all the other hashes had zero). You could have a way worse hash function than SDBM of course, but then functions like FNV-1a are about as simple to implement, and seem to behave better on binary data.

A note on hash consistency
Some of the hash functions do not produce identical output on various platforms. Among the ones I tested, mum-hash and FarmHash produced different output depending on compiler and platform used.

It’s very likely that most of the above hash functions produce different output if ran on a big-endian CPU. I did not have any platform like that to test on.

Some of the hash functions depend on unaligned memory reads being possible – most notably Murmur and Spooky. I had to change Spooky to work on ARM 32 bit (define ALLOW_UNALIGNED_READS to zero in the source code). Murmur and Spooky did produce wrong results on asm.js (no crash, just different hash results and way more collisions than expected).

Conclusions
General cross-platform use: CityHash64 on a 64 bit system; xxHash32 on a 32 bit system.

Good performance across various data sizes.
Identical hash results across various little-endian platforms and compilers.
No weird compilation options to tweak or peformance drops on compilers that it is not tuned for.
Best throughput on large data: depends on platform!

Intel CPU: xxHash64 in general, FarmHash64 if you can use SSE4.2, xxHash32 if compiling for 32 bit.
Apple mobile CPU (A9): FarmHash64 for 64 bit, xxHash32 for 32 bit.
Console (XboxOne, AMD Jaguar): SpookyV2.
asm.js: xxHash32.
Best for short strings: FNV-1a.

Super simple to implement, inline-able.
Where exactly it becomes “generally fastest” depends on a platform; around 8 bytes or less on PC, mobile and console; around 20 bytes or less on asm.js.
If your data is fixed size (e.g. one or two integers), look into specialized hashes instead (see above).
← Older Newer →
Possibly Related Posts
Hash Functions all the way down, from 2016 August
Solving DX9 Half-Pixel Offset, from 2016 April
Backporting Fixes and Shuffling Branches, from 2016 April
Have feedback on this post? Let me know on email, mastodon or twitter.
Recent Posts
"Modern" C++ Lamentations
C++11 way of initializing integers
Pathtracer 17: WebAssembly
SPIR-V Compression: SMOL vs MARK
Pathtracer 16: Burst SIMD Optimization
Random list of Demoscene Demos
Pathtracer 15: Pause & Links
Categories
code (96)
conferences (10)
d3d (20)
demos (24)
devtools (16)
energy (1)
games (18)
giving (3)
gpu (40)
mobile (7)
opengl (19)
papers (12)
personal (12)
politics (2)
random (43)
rant (43)
rendering (45)
travel (5)
uncategorized (36)
unity (56)
vulkan (2)
web (2)
work (59)
Text content © Aras Pranckevičius. Code snippets are public domain, unless specified otherwise.

MASSCAN: Mass IP port scanner

MASSCAN: Mass IP port scanner

robertdavidgraham / masscan
Code Issues 191 Pull requests 48 Projects 0 Wiki Pulse
README.md

MASSCAN: Mass IP port scanner
This is the fastest Internet port scanner. It can scan the entire Internet in under 6 minutes, transmitting 10 million packets per second.

It produces results similar to nmap, the most famous port scanner. Internally, it operates more like scanrand, unicornscan, and ZMap, using asynchronous transmission. The major difference is that it’s faster than these other scanners. In addition, it’s more flexible, allowing arbitrary address ranges and port ranges.

NOTE: masscan uses a custom TCP/IP stack. Anything other than simple port scans will cause conflict with the local TCP/IP stack. This means you need to either use the -S option to use a separate IP address, or configure your operating system to firewall the ports that masscan uses.

This tool is free, but consider funding it here: Bitcoin wallet address: 1MASSCANaHUiyTtR3bJ2sLGuMw5kDBaj4T

Building
On Debian/Ubuntu, it goes something like this:

$ sudo apt-get install git gcc make libpcap-dev
$ git clone https://github.com/robertdavidgraham/masscan
$ cd masscan
$ make
This puts the program in the masscan/bin subdirectory. You’ll have to manually copy it to something like /usr/local/bin if you want to install it elsewhere on the system.

The source consists of a lot of small files, so building goes a lot faster by using the multi-threaded build:

$ make -j
While Linux is the primary target platform, the code runs well on many other systems. Here’s some additional build info:

Windows w/ Visual Studio: use the VS10 project
Windows w/ MingGW: just type make
Windows w/ cygwin: won’t work
Mac OS X /w XCode: use the XCode4 project
Mac OS X /w cmdline: just type make
FreeBSD: type gmake
other: I don’t know, don’t care
PF_RING
To get beyond 2 million packets/second, you need an Intel 10-gbps Ethernet adapter and a special driver known as “PF_RING ZC” from ntop. Masscan doesn’t need to be rebuilt in order to use PF_RING. To use PF_RING, you need to build the following components:

libpfring.so (installed in /usr/lib/libpfring.so)
pf_ring.ko (their kernel driver)
ixgbe.ko (their version of the Intel 10-gbps Ethernet driver)
You don’t need to build their version of libpcap.so.

When Masscan detects that an adapter is named something like zc:enp1s0 instead of something like enp1s0, it’ll automatically switch to PF_RING ZC mode.

Regression testing
The project contains a built-in self-test:

$ make regress
bin/masscan –regress
selftest: success!
This tests a lot of tricky bits of the code. You should do this after building.

Performance testing
To test performance, run something like the following:

$ bin/masscan 0.0.0.0/4 -p80 –rate 100000000 –router-mac 66-55-44-33-22-11
The bogus –router-mac keeps packets on the local network segments so that they won’t go out to the Internet.

You can also test in “offline” mode, which is how fast the program runs without the transmit overhead:

$ bin/masscan 0.0.0.0/4 -p80 –rate 100000000 –offline
This second benchmark shows roughly how fast the program would run if it were using PF_RING, which has near zero overhead.

Usage
Usage is similar to nmap. To scan a network segment for some ports:

# masscan -p80,8000-8100 10.0.0.0/8
This will:

scan the 10.x.x.x subnet, all 16 million addresses
scans port 80 and the range 8000 to 8100, or 102 addresses total
print output to that can be redirected to a file
To see the complete list of options, use the –echo feature. This dumps the current configuration and exits. This output can be used as input back into the program:

# masscan -p80,8000-8100 10.0.0.0/8 –echo > xxx.conf
# masscan -c xxx.conf –rate 1000
Banner checking
Masscan can do more than just detect whether ports are open. It can also complete the TCP connection and interaction with the application at that port in order to grab simple “banner” information.

The problem with this is that masscan contains its own TCP/IP stack separate from the system you run it on. When the local system receives a SYN-ACK from the probed target, it responds with a RST packet that kills the connection before masscan can grab the banner.

The easiest way to prevent this is to assign masscan a separate IP address. This would look like the following:

# masscan 10.0.0.0/8 -p80 –banners –source-ip 192.168.1.200
The address you choose has to be on the local subnet and not otherwise be used by another system.

In some cases, such as WiFi, this isn’t possible. In those cases, you can firewall the port that masscan uses. This prevents the local TCP/IP stack from seeing the packet, but masscan still sees it since it bypasses the local stack. For Linux, this would look like:

# iptables -A INPUT -p tcp –dport 61000 -j DROP
# masscan 10.0.0.0/8 -p80 –banners –source-port 61000
You probably want to pick ports that don’t conflict with ports Linux might otherwise choose for source-ports. You can see the range Linux uses, and reconfigure that range, by looking in the file:

/proc/sys/net/ipv4/ip_local_port_range
On the latest version of Kali Linux (2018-August), that range is 32768 to 60999, so you should choose ports either below 32768 or 61000 and above.

Setting an iptables rule only lasts until the next reboot. You need to lookup how to save the configuration depending upon your distro, such as using iptables-save and/or iptables-persistant.

On Mac OS X and BSD, there are similar steps. To find out the ranges to avoid, use a command like the following:

# sysctl net.inet.ip.portrange.first net.inet.ip.portrange.last
On FreeBSD and older MacOS, use an ipfw command:

# sudo ipfw add 1 deny tcp from any to any 40000 in
# masscan 10.0.0.0/8 -p80 –banners –source-port 40000
On newer MacOS and OpenBSD, use the pf packet-filter utility. Edit the file /etc/pf.conf to add a line like the following:

block in proto tcp from any to any port 40000
Then to enable the firewall, run the command:

# pfctrl -E
If the firewall is already running, then either reboot or reload the rules with the following command:

# pfctl -f /etc/pf.conf
Windows doesn’t respond with RST packets, so neither of these techniques are necessary. However, masscan is still designed to work best using its own IP address, so you should run that way when possible, even when its not strictly necessary.

The same thing is needed for other checks, such as the –heartbleed check, which is just a form of banner checking.

How to scan the entire Internet
While useful for smaller, internal networks, the program is really designed with the entire Internet in mind. It might look something like this:

# masscan 0.0.0.0/0 -p0-65535
Scanning the entire Internet is bad. For one thing, parts of the Internet react badly to being scanned. For another thing, some sites track scans and add you to a ban list, which will get you firewalled from useful parts of the Internet. Therefore, you want to exclude a lot of ranges. To blacklist or exclude ranges, you want to use the following syntax:

# masscan 0.0.0.0/0 -p0-65535 –excludefile exclude.txt
This just prints the results to the command-line. You probably want them saved to a file instead. Therefore, you want something like:

# masscan 0.0.0.0/0 -p0-65535 -oX scan.xml
This saves the results in an XML file, allowing you to easily dump the results in a database or something.

But, this only goes at the default rate of 100 packets/second, which will take forever to scan the Internet. You need to speed it up as so:

# masscan 0.0.0.0/0 -p0-65535 –max-rate 100000
This increases the rate to 100,000 packets/second, which will scan the entire Internet (minus excludes) in about 10 hours per port (or 655,360 hours if scanning all ports).

The thing to notice about this command-line is that these are all nmap compatible options. In addition, “invisible” options compatible with nmap are also set for you: -sS -Pn -n –randomize-hosts –send-eth. Likewise, the format of the XML file is inspired by nmap. There are, of course, a lot of differences, because the asynchronous nature of the program leads to a fundamentally different approach to the problem.

The above command-line is a bit cumbersome. Instead of putting everything on the command-line, it can be stored in a file instead. The above settings would look like this:

# My Scan
rate = 100000.00
output-format = xml
output-status = all
output-filename = scan.xml
ports = 0-65535
range = 0.0.0.0-255.255.255.255
excludefile = exclude.txt
To use this configuration file, use the -c:

# masscan -c myscan.conf
This also makes things easier when you repeat a scan.

By default, masscan first loads the configuration file /etc/masscan/masscan.conf. Any later configuration parameters override what’s in this default configuration file. That’s where I put my “excludefile” parameter, so that I don’t ever forget it. It just works automatically.

Getting output
By default, masscan produces fairly large text files, but it’s easy to convert them into any other format. There are five supported output formats:

xml: Just use the parameter -oX . Or, use the parameters –output-format xml and –output-filename .

binary: This is the masscan builtin format. It produces much smaller files, so that when I scan the Internet my disk doesn’t fill up. They need to be parsed, though. The command line option –readscan will read binary scan files. Using –readscan with the -oX option will produce a XML version of the results file.

grepable: This is an implementation of the Nmap -oG output that can be easily parsed by command-line tools. Just use the parameter -oG . Or, use the parameters –output-format grepable and –output-filename .

json: This saves the results in JSON format. Just use the parameter -oJ . Or, use the parameters –output-format json and –output-filename .

list: This is a simple list with one host and port pair per line. Just use the parameter -oL . Or, use the parameters –output-format list and –output-filename . The format is:

open tcp 80 XXX.XXX.XXX.XXX 1390380064
Comparison with Nmap
Where reasonable, every effort has been taken to make the program familiar to nmap users, even though it’s fundamentally different. Two important differences are:

no default ports to scan, you must specify -p
target hosts are IP addresses or simple ranges, not DNS names, nor the funky subnet ranges nmap can use (like 10.0.0-255.0-255).
You can think of masscan as having the following settings permanently enabled:

-sS: this does SYN scan only (currently, will change in the future)
-Pn: doesn’t ping hosts first, which is fundamental to the async operation
-n: no DNS resolution happens
–randomize-hosts: scan completely randomized
–send-eth: sends using raw libpcap
If you want a list of additional nmap compatible settings, use the following command:

# masscan –nmap
Transmit rate (IMPORTANT!!)
This program spews out packets very fast. On Windows, or from VMs, it can do 300,000 packets/second. On Linux (no virtualization) it’ll do 1.6 million packets-per-second. That’s fast enough to melt most networks.

Note that it’ll only melt your own network. It randomizes the target IP addresses so that it shouldn’t overwhelm any distant network.

By default, the rate is set to 100 packets/second. To increase the rate to a million use something like –rate 1000000.

Design
This section describes the major design issues of the program.

Code Layout
The file main.c contains the main() function, as you’d expect. It also contains the transmit_thread() and receive_thread() functions. These functions have been deliberately flattened and heavily commented so that you can read the design of the program simply by stepping line-by-line through each of these.

Asynchronous
This is an asynchronous design. In other words, it is to nmap what the nginx web-server is to Apache. It has separate transmit and receive threads that are largely independent from each other. It’s the same sort of design found in scanrand, unicornscan, and ZMap.

Because it’s asynchronous, it runs as fast as the underlying packet transmit allows.

Randomization
A key difference between Masscan and other scanners is the way it randomizes targets.

The fundamental principle is to have a single index variable that starts at zero and is incremented by one for every probe. In C code, this is expressed as:

for (i = 0; i < range; i++) {
scan(i);
}
We have to translate the index into an IP address. Let's say that you want to scan all "private" IP addresses. That would be the table of ranges like:

192.168.0.0/16
10.0.0.0/8
172.16.0.0/12
In this example, the first 64k indexes are appended to 192.168.x.x to form the target address. Then, the next 16-million are appended to 10.x.x.x. The remaining indexes in the range are applied to 172.16.x.x.

In this example, we only have three ranges. When scanning the entire Internet, we have in practice more than 100 ranges. That's because you have to blacklist or exclude a lot of sub-ranges. This chops up the desired range into hundreds of smaller ranges.

This leads to one of the slowest parts of the code. We transmit 10 million packets per second, and have to convert an index variable to an IP address for each and every probe. We solve this by doing a "binary search" in a small amount of memory. At this packet rate, cache efficiencies start to dominate over algorithm efficiencies. There are a lot of more efficient techniques in theory, but they all require so much memory as to be slower in practice.

We call the function that translates from an index into an IP address the pick() function. In use, it looks like:

for (i = 0; i < range; i++) {
ip = pick(addresses, i);
scan(ip);
}
Masscan supports not only IP address ranges, but also port ranges. This means we need to pick from the index variable both an IP address and a port. This is fairly straightforward:

range = ip_count * port_count;
for (i = 0; i < range; i++) {
ip = pick(addresses, i / port_count);
port = pick(ports, i % port_count);
scan(ip, port);
}
This leads to another expensive part of the code. The division/modulus instructions are around 90 clock cycles, or 30 nanoseconds, on x86 CPUs. When transmitting at a rate of 10 million packets/second, we have only 100 nanoseconds per packet. I see no way to optimize this any better. Luckily, though, two such operations can be executed simultaneously, so doing two of these as shown above is no more expensive than doing one.

There are actually some easy optimizations for the above performance problems, but they all rely upon i++, the fact that the index variable increases one by one through the scan. Actually, we need to randomize this variable. We need to randomize the order of IP addresses that we scan or we'll blast the heck out of target networks that aren't built for this level of speed. We need to spread our traffic evenly over the target.

The way we randomize is simply by encrypting the index variable. By definition, encryption is random, and creates a 1-to-1 mapping between the original index variable and the output. This means that while we linearly go through the range, the output IP addresses are completely random. In code, this looks like:

range = ip_count * port_count;
for (i = 0; i < range; i++) {
x = encrypt(i);
ip = pick(addresses, x / port_count);
port = pick(ports, x % port_count);
scan(ip, port);
}
This also has a major cost. Since the range is an unpredictable size instead of a nice even power of 2, we can't use cheap binary techniques like AND (&) and XOR (^). Instead, we have to use expensive operations like MODULUS (%). In my current benchmarks, it's taking 40 nanoseconds to encrypt the variable.

This architecture allows for lots of cool features. For example, it supports "shards". You can setup 5 machines each doing a fifth of the scan, or range / shard_count. Shards can be multiple machines, or simply multiple network adapters on the same machine, or even (if you want) multiple IP source addresses on the same network adapter.

Or, you can use a 'seed' or 'key' to the encryption function, so that you get a different order each time you scan, like x = encrypt(seed, i).

We can also pause the scan by exiting out of the program, and simply remembering the current value of i, and restart it later. I do that a lot during development. I see something going wrong with my Internet scan, so I hit to stop the scan, then restart it after I've fixed the bug.

Another feature is retransmits/retries. Packets sometimes get dropped on the Internet, so you can send two packets back-to-back. However, something that drops one packet may drop the immediately following packet. Therefore, you want to send the copy about 1 second apart. This is simple. We already have a 'rate' variable, which is the number of packets-per-second rate we are transmitting at, so the retransmit function is simply to use i + rate as the index. One of these days I'm going to do a study of the Internet, and differentiate "back-to-back", "1 second", "10 second", and "1 minute" retransmits this way in order to see if there is any difference in what gets dropped.

C10 Scalability
The asynchronous technique is known as a solution to the "c10k problem". Masscan is designed for the next level of scalability, the "C10M problem".

The C10M solution is to bypass the kernel. There are three primary kernel bypasses in Masscan:

custom network driver
user-mode TCP stack
user-mode synchronization
Masscan can use the PF_RING DNA driver. This driver DMAs packets directly from user-mode memory to the network driver with zero kernel involvement. That allows software, even with a slow CPU, to transmit packets at the maximum rate the hardware allows. If you put 8 10-gbps network cards in a computer, this means it could transmit at 100-million packets/second.

Masscan has its own built-in TCP stack for grabbing banners from TCP connections. This means it can easily support 10 million concurrent TCP connections, assuming of course that the computer has enough memory.

Masscan has no "mutex". Modern mutexes (aka. futexes) are mostly user-mode, but they have two problems. The first problem is that they cause cache-lines to bounce quickly back-and-forth between CPUs. The second is that when there is contention, they'll do a system call into the kernel, which kills performance. Mutexes on the fast path of a program severely limits scalability. Instead, Masscan uses "rings" to synchronize things, such as when the user-mode TCP stack in the receive thread needs to transmit a packet without interfering with the transmit thread.

Portability
The code runs well on Linux, Windows, and Mac OS X. All the important bits are in standard C (C90). It therefore compiles on Visual Studio with Microsoft's compiler, the Clang/LLVM compiler on Mac OS X, and GCC on Linux.

Windows and Macs aren't tuned for packet transmit, and get only about 300,000 packets-per-second, whereas Linux can do 1,500,000 packets/second. That's probably faster than you want anyway.

Safe code
A bounty is offered for vulnerabilities, see the VULNINFO.md file for more information.

This project uses safe functions like strcpy_s() instead of unsafe functions like strcpy().

This project has automated unit regression tests (make regress).

Compatibility
A lot of effort has gone into making the input/output look like nmap, which everyone who does port scans is (or should be) familiar with.

Authors
This tool created by Robert Graham: email: robert_david_graham@yahoo.com twitter: @ErrataRob

Desktop version