Public Key Infrastructure Automation

Public Key Infrastructure 2 of 3 – Certificate Automation

Following the last post on PKI, we’ll discuss automation of certificate issuance. The two key activities to automate are validation of the requestor and issuance of the certificate.

Validation

Validation isn’t always required. For a private CA whose trust boundary does not extend beyond the internal engineering team, there is little incentive to perform any validation. AWS Private CA is based on this idea: the requestor can claim to be any identity, and the private CA does not perform any validation when issuing the certificate. Nor is there any need to convince an entity outside the trust boundary that the certificate is valid, so validation is optional. For public-facing certificates, however, validation is a must, because we are asking every browser in the world to trust the identity of the certificate requestor. Common validation levels include:

  • Domain Validation (DV): the certificate requestor must demonstrate the right to administratively manage the affected DNS domain.
  • Organization Validation (OV): in addition to the DV criterion, the issuer verifies the actual existence of the requestor’s organization as a legal entity.
  • Extended Validation (EV): the certificate requestor must prove its legal identity to the certificate provider, including manual verification checks by a human. Unlike DV and OV certificates, EV certificates can only be issued by a subset of CAs.

For both OV and EV, a certificate provider publishes its vetting criteria in its certificate policy, and both require human validation of the registrant. At the corporate level, EV certificates are often required for sensitive public-facing workloads (e.g. banking, financial or health information). For non-sensitive public-facing workloads, DV certificates may be sufficient. For non-public-facing workloads such as software testing, teams may use DV certificates or no validation at all, depending on the specific use case. Since I set up PKI for the latter, I’ll focus on DV, the most basic level and the only one that can be fully automated.

DigiCert, a well-known trusted third party, has a detailed page on the differences among DV, OV and EV.

Certificate Automation

Several protocols are commonly used for certificate automation:

  • ACME (Automated Certificate Management Environment): commonly used in web server automation.
  • SCEP (Simple Certificate Enrollment Protocol): commonly used in enterprise environments to manage certificates on network devices such as routers, switches and IP phones.
  • EST (Enrollment over Secure Transport): a more secure alternative to SCEP, suitable for various use cases beyond network devices.
  • CMP (Certificate Management Protocol): a more comprehensive protocol with a wide range of functionality for complex certificate management scenarios.

SCEP is common in the networking industry, while EST and CMP target very specific scenarios. We’ll examine ACME, as it is the most relevant to web services. The biggest advocate of ACME is Let’s Encrypt, a non-profit CA run by the Internet Security Research Group (ISRG) that provisions X.509 certificates at no charge. Let’s Encrypt is the world’s largest CA and aims to secure all websites with HTTPS. It issues only DV certificates, since DV is the validation level that can be fully automated.

The ACME Protocol

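The ACME protocol automates both validation and issuance. The certificate requestor uses an ACME-capable client, and the certificate provider (CA) acts as the ACME server. At a high level, the client registers an account with the server, submits an order for one or more domain names, completes a challenge for each domain to prove control over it, finalizes the order by submitting a CSR, and then downloads the issued certificate.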
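To make the flow more concrete, here is a minimal Python sketch of the statuses an ACME order moves through, following the order state diagram in RFC 8555 (section 7.1.6). It only models the client-visible transitions; account creation, nonces, JWS signing and all network calls are left out, and the step descriptions in `STEP_TRIGGERS` are my own labels rather than protocol fields.

```python
from enum import Enum


class OrderStatus(Enum):
    """Order statuses defined in RFC 8555, section 7.1.6."""
    PENDING = "pending"        # authorizations not yet satisfied
    READY = "ready"            # all authorizations valid, waiting for a CSR
    PROCESSING = "processing"  # CSR received, CA is issuing
    VALID = "valid"            # certificate is available for download
    INVALID = "invalid"        # an authorization failed or an error occurred


# Allowed transitions, following the order state diagram in RFC 8555.
TRANSITIONS = {
    OrderStatus.PENDING: {OrderStatus.READY, OrderStatus.INVALID},
    OrderStatus.READY: {OrderStatus.PROCESSING, OrderStatus.INVALID},
    OrderStatus.PROCESSING: {OrderStatus.VALID, OrderStatus.INVALID},
    OrderStatus.VALID: set(),
    OrderStatus.INVALID: set(),
}

# What drives each happy-path transition (descriptive labels, not protocol fields).
STEP_TRIGGERS = {
    (OrderStatus.PENDING, OrderStatus.READY):
        "client completes a challenge for every authorization",
    (OrderStatus.READY, OrderStatus.PROCESSING):
        "client POSTs a CSR to the order's finalize URL",
    (OrderStatus.PROCESSING, OrderStatus.VALID):
        "CA issues the certificate, client downloads it",
}


def advance(current: OrderStatus, nxt: OrderStatus) -> OrderStatus:
    """Move to the next status, rejecting transitions the RFC does not allow."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {nxt.value}")
    return nxt


def walk(path: list) -> OrderStatus:
    """Replay a sequence of statuses, starting from a new (pending) order."""
    status = OrderStatus.PENDING
    for nxt in path:
        status = advance(status, nxt)
    return status


if __name__ == "__main__":
    status = OrderStatus.PENDING
    for nxt in (OrderStatus.READY, OrderStatus.PROCESSING, OrderStatus.VALID):
        print(f"{status.value} -> {nxt.value}: {STEP_TRIGGERS[(status, nxt)]}")
        status = advance(status, nxt)
```

Any step can also end in `invalid`, for example when a challenge fails, which is why every non-final status lists it as a possible successor.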

I came across a good diagram of the ACME flow in a post from Smallstep, which shows every transaction in detail. As it shows, the delivery (issuance) of certificate material is based on the HTTP POST method. Domain validation follows a challenge-response model. The ACME specification makes the challenge type an extension point, and the most common challenge types are:

  • HTTP-01 (HTTP Challenge): the requestor serves a file at http://<domain>/.well-known/acme-challenge/<token> on port 80, where the token is a random value issued by the CA and the file content (the key authorization) combines the token with a thumbprint of the requestor’s account key. The CA then sends an HTTP GET request to that URL. This is easy to configure because we usually have full control over the web server, but the CA must be able to reach the web server over HTTP.
  • DNS-01 (DNS Challenge): the requestor provisions a TXT record at _acme-challenge.<domain> whose value is a hash of the key authorization. The ACME server does not need to connect to the web server; it only performs a DNS lookup to confirm the challenge. However, the certificate requestor needs the privilege to modify DNS records.
  • TLS-ALPN-01 (TLS ALPN Challenge): ALPN (Application-Layer Protocol Negotiation) is a TLS extension that negotiates the application protocol during the TLS handshake. When the CA connects on port 443 and requests the acme-tls/1 protocol, the requestor’s server presents a self-signed certificate containing the challenge response in a special X.509 certificate extension. This challenge type is useful when a security policy requires the CA to reach the requestor over a TLS connection.
  • DEVICE-ATTEST-01 (Device Attestation Challenge): intended for Apple Managed Device Attestation and other secure zero-touch provisioning (SZTP) applications as part of a mobile device management (MDM) strategy. The certificates identify specific hardware devices via permanent device IDs. They are typically client certificates used for device authentication.
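The first three challenges are all built on the same value, which RFC 8555 calls the key authorization: the CA’s token, a period, and the base64url-encoded SHA-256 thumbprint (RFC 7638) of the account’s public key. The sketch below derives what the requestor publishes for HTTP-01, DNS-01 and TLS-ALPN-01 (RFC 8737) using only the Python standard library. The JWK, token and domain in the usage section are hypothetical placeholders, and a real client would also have to sign its requests and actually serve or publish these values.

```python
import base64
import hashlib
import json

# JWK members included in an RFC 7638 thumbprint, per key type.
REQUIRED_MEMBERS = {
    "EC": ("crv", "kty", "x", "y"),
    "RSA": ("e", "kty", "n"),
    "OKP": ("crv", "kty", "x"),
}


def b64url(data: bytes) -> str:
    """Unpadded base64url encoding, as used throughout ACME."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 JWK thumbprint (RFC 7638) of the account public key."""
    members = {name: jwk[name] for name in REQUIRED_MEMBERS[jwk["kty"]]}
    # Canonical form: required members only, sorted keys, no whitespace.
    canonical = json.dumps(members, sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, account_jwk: dict) -> str:
    """Key authorization (RFC 8555, section 8.1): token '.' thumbprint."""
    return f"{token}.{jwk_thumbprint(account_jwk)}"


def http01_response(domain: str, token: str, account_jwk: dict):
    """URL the CA fetches on port 80, and the body it expects to receive."""
    url = f"http://{domain}/.well-known/acme-challenge/{token}"
    return url, key_authorization(token, account_jwk)


def dns01_record(domain: str, token: str, account_jwk: dict):
    """TXT record name and value: base64url(SHA-256(key authorization))."""
    digest = hashlib.sha256(key_authorization(token, account_jwk).encode("ascii")).digest()
    return f"_acme-challenge.{domain}", b64url(digest)


def tls_alpn01_extension_value(token: str, account_jwk: dict) -> bytes:
    """DER value of the acmeIdentifier extension (RFC 8737):
    an OCTET STRING (tag 0x04, length 32) holding SHA-256(key authorization)."""
    digest = hashlib.sha256(key_authorization(token, account_jwk).encode("ascii")).digest()
    return b"\x04\x20" + digest


if __name__ == "__main__":
    # Hypothetical placeholders: not a real key, token or domain.
    jwk = {"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}
    print(http01_response("test.example.com", "example-token", jwk))
    print(dns01_record("test.example.com", "example-token", jwk))
    print(tls_alpn01_extension_value("example-token", jwk).hex())
```

Binding the challenge response to the account key thumbprint is what prevents someone who merely observes the token from completing the challenge with a different account.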

At the implementation level, Let’s Encrypt runs its public CA on Boulder, which supports three of these challenge types (HTTP-01, DNS-01 and TLS-ALPN-01). You can also use Boulder to host a private CA, but some find it complicated, so you may want to consider the following alternatives:

  • LabCA: based on Boulder, and can be hosted in Docker.
  • Step CA (open source): a simple CA solution from Smallstep.
  • cert-manager: a very popular choice on Kubernetes, which can issue certificates from its own CA issuer as well as request them from ACME servers.
  • HashiCorp Vault: a secret management solution whose certificate management capability includes ACME support.

You may combine different solutions at different levels of the CA hierarchy. For example, use Step CA for the internal root CA and cert-manager for intermediate CAs serving Kubernetes workloads. On the client side, Let’s Encrypt recommends Certbot, but there are many other choices, such as step CLI (from Smallstep) and acme.sh. Let’s Encrypt maintains a list of ACME client implementations.

If we don’t need ACME automation and just want to sign certificates manually, we can use generic tools (e.g. OpenSSL, cfssl, Easy-RSA). These tools can act as both the client (generating a CSR) and the server (signing the CSR), depending on the command and options used.

Renewal and Revocation

Lifecycle management involves renewal and revocation. Renewal essentially means re-issuing a certificate as it approaches its expiration date. In software testing, we often use short-lived certificates so that our test scenarios also cover automated certificate renewal. The requestor is responsible for initiating the renewal and distributing the renewed certificate. With Let’s Encrypt, the renewal process typically challenges the requestor again for validation. In some cases, however, the certificate provider may choose not to perform validation on every renewal; for example, a short-lived certificate may be renewed every week while validation is performed only once a year. Strictly speaking, renewal keeps the website’s private key unchanged, which is what distinguishes it from rekey. If the website’s private key is compromised, we should not simply renew; instead, we should generate a new key pair and request a new certificate.
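The renewal-versus-rekey distinction can be captured in a small decision function. This is a sketch under assumptions: `plan_lifecycle_action` is a hypothetical helper, and its `renew_fraction` threshold (renew once a third of the validity period remains) is an assumed policy knob, not something any particular CA or client mandates.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Action(Enum):
    NONE = "none"    # certificate still comfortably valid
    RENEW = "renew"  # request a new certificate for the same key pair
    REKEY = "rekey"  # generate a new key pair, then request a new certificate


def plan_lifecycle_action(not_before: datetime, not_after: datetime, now: datetime,
                          key_compromised: bool = False,
                          renew_fraction: float = 1 / 3) -> Action:
    """Hypothetical policy: rekey on key compromise; otherwise renew once less than
    `renew_fraction` of the total validity period remains (including after expiry)."""
    if key_compromised:
        return Action.REKEY
    lifetime = not_after - not_before
    remaining = not_after - now
    if remaining <= lifetime * renew_fraction:
        return Action.RENEW
    return Action.NONE


if __name__ == "__main__":
    # A short-lived, 7-day certificate, as might be used in a test environment.
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=7)
    for day in (1, 3, 5, 8):
        action = plan_lifecycle_action(issued, expires, issued + timedelta(days=day))
        print(day, action.value)
```

Checking for compromise first matters: a compromised key should never be carried forward into a renewed certificate, no matter how much validity remains.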

Revocation is a challenging process. There are currently two ways to declare that a certificate should no longer be trusted, CRL and OCSP, and both have drawbacks. A CRL (Certificate Revocation List) lists the certificates that a CA has issued but later revoked, and this list can grow very large. It is not feasible for an application (e.g. a browser) to regularly download the full list from every CA and check each website against the list of its issuing CA. OCSP (Online Certificate Status Protocol) provides a query-based method instead: the application queries the revocation status of a certificate from the CA’s OCSP endpoint. However, OCSP brings its own challenges. The OCSP server is subject to downtime, and the network round trip between the application and the OCSP server adds latency. Many applications simply treat a query timeout as not revoked. To reduce load, applications may cache OCSP responses, which can lead to outdated status. Worse, because each query reveals which certificate, and therefore which website, the user is visiting, a malicious CA can track the websites that users visit.
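The two OCSP weaknesses above, soft-fail on timeout and stale cached answers, can be seen in a small client-side sketch. `RevocationChecker` and its `query` callable are hypothetical stand-ins for a real OCSP request; the one-hour cache TTL and the soft-fail default are assumptions chosen to mirror the behavior described above.

```python
import time
from typing import Callable, Dict, Tuple


class RevocationChecker:
    """Illustrative client-side OCSP-style check with a cache and a fail policy.

    `query` is a hypothetical stand-in for an OCSP request: it takes a
    certificate serial number and returns "good" or "revoked", or raises
    TimeoutError when the responder cannot be reached.
    """

    def __init__(self, query: Callable[[str], str], cache_ttl_seconds: float = 3600.0,
                 hard_fail: bool = False, clock: Callable[[], float] = time.monotonic):
        self.query = query
        self.cache_ttl = cache_ttl_seconds
        self.hard_fail = hard_fail
        self.clock = clock
        self.cache: Dict[str, Tuple[str, float]] = {}  # serial -> (status, fetched_at)

    def is_trusted(self, serial: str) -> bool:
        now = self.clock()
        cached = self.cache.get(serial)
        if cached is not None and now - cached[1] < self.cache_ttl:
            # Fast path, but the answer may be stale if the certificate
            # was revoked after it was cached.
            return cached[0] == "good"
        try:
            status = self.query(serial)
        except TimeoutError:
            # Soft-fail: an unreachable responder is treated as "not revoked".
            # Hard-fail: reject instead, at the cost of availability.
            return not self.hard_fail
        self.cache[serial] = (status, now)
        return status == "good"


if __name__ == "__main__":
    def unreachable(serial: str) -> str:
        raise TimeoutError("OCSP responder did not answer")

    print(RevocationChecker(unreachable).is_trusted("0a1b"))                  # soft-fail
    print(RevocationChecker(unreachable, hard_fail=True).is_trusted("0a1b"))  # hard-fail
```

Switching `hard_fail` on closes the timeout loophole, but then any OCSP outage makes otherwise healthy sites unreachable, which is one reason many clients settle for soft-fail.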

Let’s Encrypt has a page on these challenges, in which it proposes browser-summarized CRLs as an alternative. This is still a recent effort, so we’ll see how it plays out.

Summary

Following the first post on PKI concepts, this post discussed the automation of certificate issuance. In the next post, we’ll go through some labs.