Homomorphic encryption (HE) is an emerging, computationally intensive application for computer architects. HE allows computations on encrypted data (called ciphertexts). In the machine-learning-as-a-service era, HE is highlighted as an enabler for privacy-preserving cloud computing, as it allows safe offloading of processing private data. The concept of HE was initially suggested in 1978. However, early proposals of HE were either unsafe or supported only one type of HE operations (called somewhat HE), namely homomorphic addition (HAdd) or multiplication (HMult) (e.g., ElGamal and Paillier). Accordingly, the applicability of HE was greatly limited. However, the fully HE (FHE), proposed by Gentry in 2009, made a major breakthrough by supporting both HAdd and HMult.
Popular FHE schemes are mostly based on a computational problem called ring learning with errors (RLWE). It is known to be difficult to solve even for quantum computers and hence also becomes the mathematical foundation of new post-quantum cryptography algorithms (check the standardization effort by NIST on quantum-resistant public-key cryptographic algorithms). Because the RLWE-based FHE schemes are noisy in nature, noise accumulates as we apply a sequence of computations on ciphertexts. The limited number of operations applicable to the data prevents complex applications (e.g., deep-learning models with high accuracy) from exploiting HE. Therefore, FHE supports bootstrapping, a method of initializing noise in ciphertexts, enabling an unbounded number of HAdd and HMult without decryption.
FHE is gaining interest in a number of application areas, such as secure genome analysis, finance, healthcare, private information retrieval, machine learning, and password protection. For example, the Password Monitor of Microsoft Edge notifies users if their saved passwords have been found in a third-party breach. Microsoft Edge exploits HE to perform this password monitoring without learning or storing the users’ passwords.
Computational challenges and variants of FHE
One main barrier to adopting FHE is its high computational and memory overhead. New schemes and new algorithmic optimizations have reduced this overhead and resulted in a speedup of 1,000,000x at least compared to its first HE implementation in 2009 (at least partially nurtured by a series of DARPA programs such as PROCEED and DPRIVE). However, FHE operations still experience tens of thousands of slowdowns compared to unencrypted operations. Attempting to tackle this, prior works have sought hardware solutions to accelerate FHE operations, including CPU extensions, GPU, FPGA, and ASIC.
An FHE scheme encodes a message into a plaintext and encrypts it into a ciphertext. The supported data type of a message depends on a particular FHE scheme. For example, Gentry-Sahai-Waters (GSW), Fastest Homomorphic Encryption in the West (FHEW), and Fast Fully Homomorphic Encryption over the Torus (TFHE) support boolean circuits, Brakerski-Gentry-Vaikuntanathan (BGV) and Brakerski/Fan-Vercauteren (BFV) support modular arithmetic on integers, and Cheon-Kim-Kim-Song (CKKS) supports approximate number arithmetic on real or complex numbers. In this post, we focus on CKKS; however, the computational characteristics analyzed here are applicable to other popular FHE schemes, such as BFV and BGV, as they share similar primitive functions that compose FHE operations (e.g., HAdd and HMult).
Key operations of FHE
A message is a vector or array of complex numbers, and each entry of a message vector is called a slot. Encoding/encrypting a message turns it into a ciphertext, a pair of polynomials each in a cyclotomic polynomial ring. When the degree of each polynomial is N, the array size of the corresponding message can be up to N/2. A ciphertext includes a message, a randomly sampled polynomial, a secret-key polynomial (S), and a small random error polynomial required for an LWE security guarantee.
As each coefficient of a polynomial is a modulo of a large integer Q (up to 1,000s of bits), FHE schemes represent each coefficient using the residue number system (RNS). Using the Chinese remainder theorem, we can represent any coefficient with a unique tuple of residue values when their moduli are pairwise coprime and the product of all these (typically word-sized) moduli are equal to Q. Then, adding/subtracting/multiplying two coefficients is equivalent to performing the same modular operations on each pair of residues, greatly reducing the computational complexity.
CKKS supports HAdd, HMult, and homomorphic rotation (HRot) as its primitive operations. HAdd and HMult correspond to the element-wise addition and multiplication of two messages. The complexity of HAdd is O(N); however, that of HMult is O(N2) if we simply multiply two polynomials. We apply number-theoretic transform (NTT), its FFT variants in particular, on each polynomial to reduce the complexity of multiplication to O(N logN). Still, NTT is computationally heavy and takes a dominant portion of processing time for HE operations (> 80%), especially for relatively small N (< 215). Therefore, representative acceleration works, such as HEAX and F1, focused on accelerating NTT.
Both HMult and HRot produce polynomials only decryptable with keys, which are different from the original secret key S, in the course of computation. To make such polynomials decryptable with S again, we perform key-switching, which multiplies a polynomial with a special public key called the evaluation key (evk). The size of an evk is larger than the size of a ciphertext. Moreover, HRot, which circularly shifts a message by r slots, requires a separate evk for each rotation amount r, incurring a serious memory overhead.
Considerations in applying FHE to applications
The error included in a ciphertext is amplified during FHE ops; in particular, HMult accounts for the dominant portion of this noise growth. The maximum number of HMult applicable to a ciphertext is called the maximum multiplicative level (L). L gets larger as Q, the modulus of a polynomial, increases.
The level of security for an FHE scheme is represented by λ, a parameter measured by the logarithmic time complexity for an attack to deduce the secret key. A sufficiently high λ is required for safety; the standard established by recent HE studies and libraries targets 128-bit or higher security (λ). λ is proportional to N and inversely proportional to log Q. Therefore, to support a more multiplicative level (L), Q should be increased. To keep the security level, N should be increased accordingly, leading to a larger ciphertext.
Because the primitive operations of CKKS are HAdd, HMult, and HRot, performing linear operations on ciphertexts is relatively cheap. However, it is much more expensive to compare ciphertexts. For example, ReLU, typically used for the activation function of deep learning, is a cheap unencrypted operation. However, because it accompanies comparison (with zero), a CPU implementation supporting Convolutional Neural Networks with CKKS approximates ReLU with polynomials consisting of HAdd and HMult, consuming a number of multiplicative levels. Sorting experiences the same issue.
A glimpse of architectural challenges for FHE
Bootstrapping restores the multiplicative level of a ciphertext. Bootstrapping must be commonly performed for the practical usage of HE with a complex sequence of FHE operations. It mainly consists of homomorphic linear transforms and approximate sine evaluation. These can be broken down into hundreds of primitive HE operations and consume ten to twenty multiplicative levels. Therefore, the maximum multiplicative level of a ciphertext must be larger than the level consumed during bootstrapping. Considering that we need to provide 20+ multiplicative levels (L) and 128-bit security level (λ) for bootstrappable CKKS schemes, the degree of a polynomial (N) of a ciphertext exceeds 214, the size of a ciphertext is tens of megabytes, and bootstrapping requires loading GBs of evaluation keys and plaintexts from memory. When an application does not require bootstrapping, the parameters (L, N, λ) can be adjusted just large enough to provide the necessary multiplicative level. Variants of FHE configured not to support bootstrapping are called leveled FHE (LHE).
FHE can be combined with other techniques to provide privacy-preserving computation. Hybrid private inference protocols, such as Gazelle and Delphi, exploit both multi-party computation (MPC) and LHE. Cheetah introduced algorithmic optimization for an HE-based DNN and proposed an accelerator design suitable for this. Instead of bootstrapping, Cheetah uses MPC to mitigate errors during the FHE operation. It sends a ciphertext with error from the clouds back to the clients after performing a linear layer (e.g., convolution), and the clients recrypt it as a fresh ciphertext after performing ReLU and pooling function using a Garbled CIrcuit. In MPC, the network latency/bandwidth from the frequent communication with the client limits the performance, introducing a different challenge compared to FHE.
Comparison is expensive for the FHE schemes supporting arithmetic (e.g., CKKS) but relatively cheap for those supporting boolean circuits (e.g., FHEW and TFHE). Therefore, it is important to find the proper FHE schemes depending on the computational needs of an application. For example, TFHE reported the bootstrapping throughput of around 70 bits per second, whereas a GPU implementation of CKKS reported that of over 2,000,000 bits per second. The working set sizes of the two FHE schemes are orders of magnitude different (megabytes vs. gigabytes) as well, offering huge differences in architectural challenges to overcome.
About the author: Jung Ho Ahn is a professor at Seoul National University. His research interests include bridging the gap between the performance demand of emerging applications and the performance potential of modern and future massively parallel systems.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.