Bcrypt password cracking extremely slow? Not if you are using hundreds of FPGAs!

May 5, 2020

Cracking classic hashes

Moore’s law is the observation that the number of transistors in a dense integrated circuit doubles about every two years. This roughly doubles computing power about every two years as well. Password hashing algorithms typically have a lifetime of many decades. This means that the level of protection of a given password hash algorithm decreases over time: attackers can crack longer and more complex passwords in the same amount of time.

The introduction of powerful Graphics Processing Units (GPUs) and supporting software frameworks revolutionized password cracking. Starting over a decade ago, high performance GPU password cracking tools were published, massively outperforming Central Processing Units (CPUs) that were typically used for password cracking at the time. The speed-up was a factor of 50 to 100 for many algorithms: another big step back for the protection level of your passwords. The impact of the combination of Moore’s law and the introduction of GPUs is massive.

Let us take a closer look at a classic password hashing algorithm that was very popular a decade ago and is still in use today: MD5. The hash rate (the number of password guesses that can be hashed per second) on a decade-old AMD Phenom II X4 965 CPU was about 95M hashes per second. On a recent Nvidia RTX 2080Ti high-end GPU, the hash rate is about 54,000M hashes per second: roughly a factor of 570 faster. Similar speed-ups can be seen for other classic password hashing algorithms like SHA-1 and SHA-2. It is clear that classic password hashing algorithms are losing the battle. And it is clear that the GPU is the weapon of choice for cracking classic password hashes.
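To put these rates in perspective, the sketch below estimates how long an exhaustive search takes at the two MD5 hash rates quoted above. The keyspace (8 characters drawn from a-z and 0-9) and the function name are illustrative assumptions, not part of our measurements.

```python
# Rough exhaustive-search time at the MD5 hash rates quoted in the text.
# The keyspace (8 characters, a-z and 0-9) is an illustrative assumption.

CPU_RATE = 95e6        # hashes/s, AMD Phenom II X4 965 (from the text)
GPU_RATE = 54_000e6    # hashes/s, Nvidia RTX 2080Ti (from the text)


def seconds_to_exhaust(alphabet_size: int, length: int, rate: float) -> float:
    """Worst-case time to try every candidate of exactly `length` characters."""
    return alphabet_size ** length / rate


keyspace = 36 ** 8
cpu_hours = seconds_to_exhaust(36, 8, CPU_RATE) / 3600
gpu_seconds = seconds_to_exhaust(36, 8, GPU_RATE)

print(f"candidates: {keyspace:,}")
print(f"CPU: {cpu_hours:.1f} hours, GPU: {gpu_seconds:.0f} seconds")
print(f"speed-up: {GPU_RATE / CPU_RATE:.0f}x")
```

Under these assumptions, a search that keeps the old CPU busy for a working day finishes in under a minute on the GPU.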

Advanced hashes

Some password hash designers recognized the importance of designing algorithms that could cope with ever increasing computing power. They introduced two new characteristics: a variable iteration count and memory hardness.

Using a variable iteration count is a way to make password cracking more time consuming by requiring repeated hashing. The number of rounds for repeated hashing is configurable. This allows the algorithm to stay resistant to password cracking attacks even if computing power increases significantly: if computing power doubles, the iteration count can also be doubled, keeping the hash rate roughly the same.
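A minimal sketch of the idea, using PBKDF2 from Python's standard library as a stand-in for any iterated hash (PBKDF2 is our choice for illustration and is not one of the algorithms discussed in this post). The helper `scaled_iterations` and the two-year doubling period are assumptions taken from the Moore's law rule of thumb above.

```python
import hashlib


def scaled_iterations(base_iterations: int, years: float,
                      doubling_period: float = 2.0) -> int:
    """Iteration count that keeps cracking cost constant if computing
    power doubles every `doubling_period` years (assumed rule of thumb)."""
    return round(base_iterations * 2 ** (years / doubling_period))


def hash_password(password: bytes, salt: bytes, iterations: int) -> bytes:
    """Iterated hash: the work per guess grows linearly with `iterations`."""
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations)


base = 10_000                               # illustrative starting value
later = scaled_iterations(base, years=10)   # five doublings in ten years

print(base, "->", later)
print(hash_password(b"hunter2", b"example-salt", base).hex())
```

The defender pays the extra cost once per login, while the attacker pays it for every single guess.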

Memory hardness is used to slow down specific hardware platforms by using a relatively large amount of memory to calculate a password hash. If the amount of required memory is higher than the amount of local fast memory (‘level 1 cache’) available to the computing core, the core needs to wait for data from slower memory. This results in a significant drop in performance. For some memory-hard algorithms the required amount of memory is fixed. If the cracking platform has more level 1 cache on board than the algorithm requires, the algorithm’s countermeasure is ineffective. For other algorithms, the memory usage is variable and can simply be set to a value higher than the amount available in the most potent attack platform. This value can be updated over time. So more computing power alone is not enough to accelerate password cracking: the platform also needs enough fast memory to hold the algorithm’s entire working set.

In the last decades, a number of advanced password hashing algorithms were introduced. The most well-known ones are bcrypt (1999), scrypt (2009) and Argon2 (2015). All use a configurable iteration count. The memory usage of bcrypt is fixed; the others also support configurable memory usage.
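As an illustration of configurable memory usage, the sketch below estimates the working set of scrypt for a few parameter choices and compares it with a fast-memory budget. It assumes the commonly used approximation that scrypt needs about 128 × N × r bytes; the 32 kB cache size and the parameter values are hypothetical examples.

```python
# Approximate scrypt working-set size versus available fast memory.
# Assumption: memory is dominated by the 128 * N * r byte scratch array;
# smaller buffers are ignored.


def scrypt_memory_bytes(n: int, r: int) -> int:
    """Approximate memory needed for one scrypt computation."""
    return 128 * n * r


def exceeds_fast_memory(n: int, r: int, fast_memory_bytes: int) -> bool:
    """True if the working set no longer fits in the given fast memory."""
    return scrypt_memory_bytes(n, r) > fast_memory_bytes


L1_DATA_CACHE = 32 * 1024  # hypothetical 32 kB level 1 data cache

for log_n in (4, 8, 14):
    n = 2 ** log_n
    size_kib = scrypt_memory_bytes(n, r=8) // 1024
    too_big = exceeds_fast_memory(n, 8, L1_DATA_CACHE)
    print(f"N=2^{log_n}, r=8: {size_kib} KiB, exceeds cache: {too_big}")
```

Raising N is all it takes to push the working set past whatever fast memory the attacker's hardware offers.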

Meet bcrypt

Scattered Secrets is a password breach notification and prevention service. We continuously collect publicly available hacked databases and try to crack the corresponding passwords. The majority of breached databases we encounter contain classic hashes, but the number of databases that contain advanced hashes is increasing — typically deploying bcrypt hashes. Even though both scrypt and Argon2 are better choices because of the configurable memory usage, it seems that those two are not used on a large scale yet.

Taking a closer look at bcrypt hashes, we see that the configurable iteration count in bcrypt is called the ‘work factor’. The work factors we see in the wild vary between 7 and 14, meaning between 2⁷ = 128 and 2¹⁴ = 16,384 iterations.
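The work factor is stored in the hash string itself, so it can be read without any cracking. The sketch below extracts it and converts it to an iteration count. The regular expression covers the common `$2a$`, `$2b$`, `$2x$` and `$2y$` prefixes, and the example strings are made-up placeholders, not real hashes.

```python
import re

# "$2b$" + two-digit work factor + "$" + 22-char salt + 31-char hash
BCRYPT_RE = re.compile(r"^\$2[abxy]?\$(\d{2})\$[./A-Za-z0-9]{53}$")


def bcrypt_iterations(hash_string: str) -> int:
    """Return the number of iterations (2 ** work factor) of a bcrypt hash."""
    match = BCRYPT_RE.match(hash_string)
    if match is None:
        raise ValueError("not a bcrypt hash string")
    return 2 ** int(match.group(1))


# Placeholder strings with the right shape (not real password hashes).
low = "$2b$07$" + "A" * 22 + "B" * 31
high = "$2a$14$" + "A" * 22 + "B" * 31

print(bcrypt_iterations(low), bcrypt_iterations(high))
```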

Figure 1: bcrypt work factors in breach data

Let us check the hash rates for all real-life work factors on both an AMD EPYC 7401P and an Nvidia RTX 2080Ti, a CPU and a high-end GPU with comparable prices. For completeness, work factor 5 is also included, since this is the de facto standard for benchmarking purposes.

Figure 2: bcrypt hash rates, CPU versus GPU

It is clear that the hash rate on both CPU and GPU is extremely low compared to the 54,000M hashes per second for MD5 on a GPU.

It is also clear that the memory hardness of the 1999 algorithm is still good enough in 2020 to slow GPUs down: the 50 to 100 times advantage over the CPU is completely gone. The reason is obvious if you take a closer look at the internals of bcrypt and the specifications of the heart of the GPU, in this case the Turing TU102. Bcrypt needs over 4 kilobytes of fast memory to run at full speed. The TU102’s 4,352 cores have only 1 kilobyte of level 1 cache available per core, meaning that the cores spend a lot of time waiting for access to slower memory. In contrast, the CPU we used has a level 1 cache of 96 kilobytes (32 kB for data, 64 kB for instructions) per core, so no delays there.
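The "over 4 kilobytes" figure follows from the Blowfish state that bcrypt keeps per hash: an 18-entry P-array plus four S-boxes of 256 entries each, all 32-bit words. The sketch below does the arithmetic and compares it with the cache sizes quoted above; the helper name `fits` is ours.

```python
# Size of the per-hash Blowfish state used by bcrypt, versus level 1 cache.

P_ARRAY_WORDS = 18      # Blowfish P-array entries
SBOX_COUNT = 4          # number of S-boxes
SBOX_ENTRIES = 256      # entries per S-box
WORD_BYTES = 4          # each entry is a 32-bit word

state_bytes = (P_ARRAY_WORDS + SBOX_COUNT * SBOX_ENTRIES) * WORD_BYTES

GPU_L1_PER_CORE = 1 * 1024    # TU102, per core (from the text)
CPU_L1_DATA = 32 * 1024       # EPYC 7401P, per core (from the text)


def fits(state: int, cache: int) -> bool:
    """True if the whole state fits in the given amount of fast memory."""
    return state <= cache


print(f"bcrypt state: {state_bytes} bytes")
print(f"fits in GPU L1 per core: {fits(state_bytes, GPU_L1_PER_CORE)}")
print(f"fits in CPU L1 data:     {fits(state_bytes, CPU_L1_DATA)}")
```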

Conclusion? Cracking bcrypt hashes on a CPU or GPU is not very effective. Anything other than a very basic dictionary attack is infeasible. We need something different.

FPGAs to the rescue

Field Programmable Gate Arrays (FPGAs) are chips that can be programmed for a special application after manufacturing. Unlike a CPU or GPU, FPGAs do not run code. Instead, an FPGA is the code, built in hardware. This means that FPGAs can be programmed as dedicated, optimized password hashing circuitry. Although FPGA clock speeds are significantly lower than those of CPUs and GPUs (typically hundreds of megahertz versus a few gigahertz), the dedicated circuitry runs more efficiently. This is true for both performance per megahertz and performance per watt.

The first commercial FPGA-based crackers were available in the mid 2000s. It took many years before the first free and open source password crackers became available. In 2016, the community-enhanced version of John the Ripper started supporting the ZTEX 1.15y: quad Spartan-6 LX150 FPGA boards that were quite popular for mining cryptocurrency in earlier years. With the release of version 1.9.0-jumbo-1 in 2019, John the Ripper officially added support for 7 hash types including bcrypt. Although the boards, introduced in 2011, do not use the latest generation of FPGAs, they are good enough to run 124 optimized bcrypt cores per FPGA. This results in a high bcrypt hash rate, higher than that of the latest generation of high-end GPUs:

Figure 3: bcrypt hash rates

A single quad FPGA board from 2011 outperforms a latest-generation RTX 2080Ti GPU by a factor of over 4. For Scattered Secrets it was clear that using john with the ZTEX boards was the way forward for bcrypt cracking.

From proof of concept to production v1

The ZTEX 1.15y board is a discontinued product. Although the boards were popular for cryptocurrency mining, availability on the second-hand market was and is limited. Finding boards was a challenge. It took us almost two years of monitoring several online marketplaces before we were able to find boards in large quantities. Once found, buying the devices was challenging as well. Sellers typically want to stay anonymous, so checking trustworthiness is an issue. Hardware escrow? Coming from professional corporate environments the processes used looked… interesting… to us. The only accepted payment method? Cryptocurrency. Testing or pick-up in person? No way. All ingredients of a good scam. No guts, no glory: after transferring Bitcoins to our anonymous friends, we acquired a large number of boards, almost all of which were in working condition.

In early 2019 we built our first FPGA cracker, based on 12 ZTEX boards and an Intel J4005 based PC. This machine was mainly used for testing the concept for large-scale deployment in production.

Figure 4: proof of concept / production v1 setup

We used 12 FPGA boards for a number of practical reasons. First of all, 12 devices can be connected using two off-the-shelf USB hubs, so switching hardware during testing or in case of issues in production is inexpensive and does not require waiting for shipments of exotic products. Secondly, 12 devices can easily be built into an off-the-shelf 4U case by non-experts. Thirdly, power supplies with common specifications can be used.

The initial system design performed surprisingly well: FPGA performance scaled perfectly, system cooling performed above expectation, the controller PC had more than enough computing power to keep all FPGAs busy and the system was stable. After a two-month test period and some minor upgrades, the first v1 system was put into production and has so far been running without noticeable issues.

Figure 5: from zero to (datacenter) hero

Production v2

With the high bcrypt hash rate of the now proven concept, it was clear that more FPGA-based crackers were at the top of our wish list. Preparations started as soon as the first v1 system was in production. To maximize performance per system and to professionalize construction, we contacted an instrument maker. His design simplified the setup and added another 6 FPGA boards per system, for a total of 18 boards / 72 FPGAs.
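For a sense of scale, the sketch below counts the bcrypt cores per system from the figures mentioned in this post: 4 FPGAs per ZTEX 1.15y board and 124 bcrypt cores per FPGA. The function name is ours, and the count assumes every FPGA runs the full 124 cores.

```python
# bcrypt cores per cracker, from the board counts mentioned in the text.

FPGAS_PER_BOARD = 4     # ZTEX 1.15y: quad Spartan-6 LX150
CORES_PER_FPGA = 124    # optimized bcrypt cores per FPGA


def bcrypt_cores(boards: int) -> int:
    """Total number of bcrypt cores in a system with `boards` boards."""
    return boards * FPGAS_PER_BOARD * CORES_PER_FPGA


for name, boards in (("v1", 12), ("v2", 18)):
    fpgas = boards * FPGAS_PER_BOARD
    print(f"{name}: {boards} boards, {fpgas} FPGAs, "
          f"{bcrypt_cores(boards)} bcrypt cores")
```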

Figure 6: simplified design with 6 boards per row (x3 rows)

The hardware setup is very similar to the v1 design: the only major changes are a more powerful and more efficient power supply and two exotic many-port USB hubs that connect all 18 FPGA boards to the controller PC.

Figure 7: the first completed production v2 setup

The number of issues was quite limited. Problems were mainly related to the USB hubs. It seems that some USB hubs require more power than the PC mainboard can provide. Other USB hubs were unreliable: USB connectivity dropped during longer test runs. Finally, specifications of some Chinese manufacturers were incorrect, not providing USB 2.0 (480 Mbps) links as promised, but USB 1.1 (12 Mbps) which is too slow to keep the FPGAs working at full speed. Thorough testing showed that HS8836-based hubs were the most reliable.

After some minor design updates, the first v2 system was ready. Three clones quickly followed. After successful burn-in tests, the first set of four v2 systems was moved to production in late 2019.

Figure 9: a new set of bcrypt v2 crackers ready for production

A picture of a stack of FPGA crackers posted on social media resulted in a number of questions. Most questions are answered above. The most important ones are not: what is the hash rate and what is the power usage of one of the bcrypt crackers? Here you go:

Figure 10: performance and power usage of a v2 cracker

To translate the figures to GPU performance: to match the bcrypt crunching power of a single v2 cracker, you need about 75 to 80 Nvidia RTX-2080Ti GPUs. That is one FPGA-based machine versus a server rack full of GPU-based systems, burning about 25 kilowatts of power! So FPGA-based cracking is not only fast, it is also relatively economical to run.

After running many FPGA-based systems for months in production now, we have cracked an enormous number of bcrypt hashes that were previously practically uncrackable using conventional hardware. This allows us to generate a lot of quality content, allowing our clients to protect their accounts against account takeover with our unique intel. Scattered Secrets ♥ passwords ;) Don’t forget to check if your passwords are breached!

Edit Aug-2020: the original GPU benchmarks were based on hashcat 5.1.0 using OpenCL. Since publication, hashcat’s bcrypt performance has improved significantly and hashcat 6 was introduced, using CUDA. As a result, the bcrypt hash rate for work factor 5 on hashcat 6.1.1 using CUDA on an RTX 2080Ti is now ~53k/s instead of the original ~28k/s.
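To relate these work factor 5 benchmark numbers to the work factors seen in breach data, the sketch below scales a hash rate under the assumption that the rate halves for every increment of the work factor. This ignores fixed per-hash overhead, so treat the output as a rough estimate rather than a measurement; the function name is ours.

```python
# Rough bcrypt hash rate at other work factors, derived from a benchmark
# at a reference work factor. Assumes the cost doubles per increment.


def estimated_rate(rate_at_ref: float, ref_work_factor: int,
                   work_factor: int) -> float:
    """Estimated hashes per second at `work_factor`."""
    return rate_at_ref / 2 ** (work_factor - ref_work_factor)


OLD_RATE = 28_000   # hashes/s at work factor 5, hashcat 5.1.0 (from the text)
NEW_RATE = 53_000   # hashes/s at work factor 5, hashcat 6.1.1 (from the text)

for wf in range(7, 15):
    old = estimated_rate(OLD_RATE, 5, wf)
    new = estimated_rate(NEW_RATE, 5, wf)
    print(f"work factor {wf:2d}: ~{old:8.1f}/s -> ~{new:8.1f}/s")
```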