When we need technical information about one of our projects, we want it to exist and to be correct. We likely need it in a hurry, perhaps to diagnose something important. All too often, such documentation is missing, incomplete, inconsistent, or has grown useless over time.
We’ve stored a variety of technical documents in our wiki, Confluence:
- “How-to” documents
- Project history and decisions
- Architectural overviews
- Questions and FAQs
In each case, we realised that our project code moves much faster than the documentation and that this detachment fosters deterioration and divergence. So, what can we do to keep our documentation as fresh as our code?
Those tasked with designing, maintaining, and continuously improving a whole fleet of microservices have a greater challenge: doing all of the above at scale. How can we tell, without meeting after meeting, and without laborious manual effort, that good practice applied to one microservice has been applied to 99 other microservices? That a platform or infrastructure bug has been mitigated across the board? That we’re achieving convergence rather than divergence?
Please Don’t README:
All of our services have a README, but in many cases, this was created at the very start of the project and has not been updated since. A couple of paragraphs of inconsistently formatted text hardly makes this the go-to document for the project. We want it to become the “source of truth”.
So how can we gather together the many sources of information that reveal the true nature of our projects, both individually and collectively?
Let your code do the talking:
At build time, a large number of static analysis checks and validations are performed on our JVM services. A complete view of the project’s bytecode and dependencies provides an opportunity to make our already-written code work for us again: applying custom business and domain logic to filter, aggregate, and organise the results in one place, without any additional work from the developer beyond — at most — adding a couple of extra annotations.
Our existing READMEs became just one of a number of template inputs, injected into a newly generated README at build time.
In order to keep build-time generation honest, reliable, and visible, we ensure that every bit of generation — every validation task — becomes part of this newly generated document. Since the README is source-controlled, it makes data-gathering and document-generation a self-testing loop. We turned runtime-scope information into static/build-time information and made it available during development, not only after deployment.
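The generation loop described above can be sketched roughly as follows. This is a minimal illustration in Python rather than our actual build tooling: the template text, the `render_readme` and `readme_is_stale` helpers, and the sample metadata are all hypothetical.

```python
from string import Template

# Hypothetical template: the hand-written README is one input among several.
README_TEMPLATE = Template(
    "# $name\n\n"
    "$handwritten\n\n"
    "## Generated metadata\n\n"
    "$metadata_rows\n"
)


def render_readme(name: str, handwritten: str, metadata: dict) -> str:
    """Render a README from hand-written text plus build-time metadata."""
    # Sort the keys so the output is deterministic and diffs stay small.
    rows = "\n".join(f"- {key}: {value}" for key, value in sorted(metadata.items()))
    return README_TEMPLATE.substitute(
        name=name, handwritten=handwritten.strip(), metadata_rows=rows
    )


def readme_is_stale(committed: str, generated: str) -> bool:
    """True when the committed README no longer matches what the build generates."""
    return committed != generated


generated = render_readme(
    "example-service",
    "Handles example requests.\n",
    {"parent-pom": "1.2.3", "team": "payments"},
)
print(generated)
```

Because the output is deterministic, a build could fail, or simply show a diff, whenever `readme_is_stale` returns true for the source-controlled copy, which is one way to get the self-testing behaviour.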
Where possible, we also make this project metadata available as Kubernetes ConfigMaps, which provides visibility at the cluster level and lets us aggregate the results into global metrics and visualisations.
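As a rough sketch of what such a ConfigMap might look like, the snippet below builds a manifest as a plain dictionary and prints it as JSON. The naming convention, the `metadata-kind` label, and the sample values are assumptions made for illustration, not our real conventions.

```python
import json


def metadata_configmap(service: str, metadata: dict) -> dict:
    """Build a ConfigMap manifest carrying a service's build-time metadata."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"{service}-metadata",  # hypothetical naming convention
            "labels": {"metadata-kind": "service-readme"},  # hypothetical discovery label
        },
        # ConfigMap data values must be strings, so convert everything.
        "data": {key: str(value) for key, value in metadata.items()},
    }


manifest = metadata_configmap("example-service", {"team": "payments", "java": 17})
print(json.dumps(manifest, indent=2))
```

A shared label of this kind is what would let a cluster-level tool find every service's metadata without knowing the service names in advance.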
Let’s see what types of documentation automation can help with.
Automated API Documentation:
Automated documentation is not new. Out of the box, JavaDoc is generated from our Java and Kotlin code to document our lowest-level APIs in a human-readable way.
The Swagger API and related tools give us generated JSON descriptors for our private APIs, and this can be extended to describe our public APIs, in a more structured way than JavaDoc allows. Adding Swagger2Markup gives us nicely-formatted Markdown files complete with generated example API usage.
The challenge has been to deploy these documents somewhere accessible. We now add a link to the API documentation in our auto-generated READMEs, which is an inexpensive way to make them discoverable, and we can easily deploy them into our Kubernetes cluster as and when these need to be accessible to a wider audience.
APIs are important, but how can we expose more technical, implementation details about our services?
Simple project metadata:
We provide a summary of simple metadata for each service.
We use the Parent POM version as the single best descriptor of a service’s commitment to our technical Release Train: the aspiration and the ability to keep a large fleet of services, all covering distinct functional areas, more-or-less continually technically consistent. So this is one of the most important pieces of metadata to expose.
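Extracting the Parent POM version is mechanically simple. The sketch below reads it from a POM with Python's standard XML parser; the sample POM, including its group and artifact names, is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Maven POM files declare this XML namespace.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def parent_pom_version(pom_xml: str) -> Optional[str]:
    """Return the <parent><version> of a POM, or None if it has no parent."""
    root = ET.fromstring(pom_xml)
    version = root.find("m:parent/m:version", POM_NS)
    if version is None or not version.text:
        return None
    return version.text.strip()


# Hypothetical POM used only to exercise the function.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>com.example</groupId>
    <artifactId>service-parent</artifactId>
    <version>3.14.0</version>
  </parent>
  <artifactId>example-service</artifactId>
</project>"""

print(parent_pom_version(SAMPLE_POM))
```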
A formatted list of developer names can be useful when deciding who to add to Pull Requests, but the rest of the data is most useful when added as Kubernetes deployment metadata. This gives Kubernetes UIs like K9s all they need to help us perform clever filtering, e.g. selecting pods by the name of the owning development team.
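The filtering itself amounts to equality-based label matching. Here is a small sketch of the idea, assuming a hypothetical `team` label and made-up pod names:

```python
def matches(labels: dict, selector: dict) -> bool:
    """Equality-based selection: every selector pair must appear in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical pods with the metadata attached as labels.
pods = [
    {"name": "orders-7d9f", "labels": {"team": "payments", "java": "17"}},
    {"name": "search-5c2a", "labels": {"team": "discovery", "java": "17"}},
    {"name": "billing-1b3e", "labels": {"team": "payments", "java": "11"}},
]

payments_pods = [
    pod["name"] for pod in pods if matches(pod["labels"], {"team": "payments"})
]
print(payments_pods)
```

Against a real cluster, `kubectl get pods -l team=payments` applies the same kind of selector, again assuming the label is called `team`.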
This also gives us:
- Information about Java and Spring Boot versions
- Information about Docker images, Helm resources, custom JVM arguments
- A summary of the data sent to Spring Cloud Config at bootstrap time
Other information is extracted via static analysis. For example, we detect correct usage of our logging API, as well as our Prometheus API, and expose the information to Kubernetes, where a configuration controller aggregates the data across all our microservices to provide a single, aggregated Grafana dashboard of errors.
Each microservice that opts into timing some operations automatically gains both a README entry and a Grafana dashboard with a graph for each timer, at zero cost to the developer.
The generated README also provides:
- Direct links to that service’s log entries in Logz.io
- Direct links to that service’s page in a number of infrastructure UIs
- Direct links to generated database schema diagrams
Errors in one microservice are important to an individual developer or team, and we won’t talk specifically about monitoring or alerting here. At an aggregate level, though, we want an overview of the entire fleet of microservices, and an auto-generated Grafana dashboard can help us visualise it.
While this may not scale to 500 services, it should be good for 50–100. Again, there is no up-front cost for developers; the underlying data is mechanically extracted from the individual services, then deployed into Kubernetes. The only slight cleverness is the discovery of the little packages of metadata and their aggregation into a single graphic.
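The discovery and aggregation step might look something like the sketch below. It is only an outline under several assumptions: the `metadata-kind` label, the `service_errors_total` metric name, and the dashboard structure are hypothetical, and the dashboard is a heavily simplified model rather than a complete Grafana schema.

```python
def discover(configmaps: list, label: str, value: str) -> list:
    """Keep only the ConfigMaps that carry the given discovery label."""
    return [
        cm for cm in configmaps
        if cm.get("metadata", {}).get("labels", {}).get(label) == value
    ]


def errors_dashboard(configmaps: list) -> dict:
    """Aggregate per-service metadata into one simplified dashboard model."""
    services = sorted(cm["data"]["service"] for cm in configmaps)
    return {
        "title": "Fleet errors",
        "panels": [{
            "type": "timeseries",
            "title": "Errors by service",
            # One query per service; the metric name is hypothetical.
            "targets": [
                {"expr": f'sum(rate(service_errors_total{{service="{name}"}}[5m]))',
                 "legendFormat": name}
                for name in services
            ],
        }],
    }


# Hypothetical ConfigMaps as they might be read back from the cluster.
configmaps = [
    {"metadata": {"labels": {"metadata-kind": "errors"}}, "data": {"service": "orders"}},
    {"metadata": {"labels": {"metadata-kind": "errors"}}, "data": {"service": "billing"}},
    {"metadata": {"labels": {}}, "data": {"unrelated": "true"}},
]
dashboard = errors_dashboard(discover(configmaps, "metadata-kind", "errors"))
print(len(dashboard["panels"][0]["targets"]))
```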
We use RabbitMQ for cross-service messaging. With the aid of static analysis and a configuration controller, we can automatically aggregate and marry up the sender and receiver relationships across all of our microservices, providing friendly and customisable visualisations of our Rabbit topology. Completing the circle, we link each README to the relevant generated service and exchange diagrams.
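To give a flavour of the marrying-up step, the sketch below joins per-service sender and receiver lists into a Graphviz DOT graph and flags exchanges that have only one side. It simplifies the topology to the exchange level (no queues or bindings), and the service and exchange names are invented.

```python
def topology_dot(senders: dict, receivers: dict) -> str:
    """Render service -> exchange -> service edges as a Graphviz DOT digraph."""
    lines = ["digraph rabbit {"]
    for service in sorted(senders):
        for exchange in sorted(senders[service]):
            lines.append(f'  "{service}" -> "{exchange}";')
    for service in sorted(receivers):
        for exchange in sorted(receivers[service]):
            lines.append(f'  "{exchange}" -> "{service}";')
    lines.append("}")
    return "\n".join(lines)


def unmatched_exchanges(senders: dict, receivers: dict) -> set:
    """Exchanges that are published to but never consumed, or the reverse."""
    sent = {e for exchanges in senders.values() for e in exchanges}
    received = {e for exchanges in receivers.values() for e in exchanges}
    return sent ^ received


# Hypothetical relationships, as static analysis might report them per service.
senders = {"orders": ["order-events"], "billing": ["invoice-events"]}
receivers = {"email": ["order-events"], "billing": ["order-events"]}

dot = topology_dot(senders, receivers)
print(dot)
print(unmatched_exchanges(senders, receivers))
```

The DOT text can be rendered by any Graphviz-compatible tool, and a non-empty result from `unmatched_exchanges` is the kind of finding that could feed an architecture metric.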
Is this documentation? Yes, this is precisely the kind of diagram a technical architect would want to construct in order to begin to describe the interaction between 50/250/500 microservices and to start proposing refactors. Happily, these diagrams adapt in the light of each refactor and do not require any up-front developer time to create.
A technical architect cannot manage what they cannot measure, and this provides the basis for a variety of “clean architecture” metrics.
Future plans for metadata:
We hope to add some fairly easily generated pieces of metadata shortly; others are more long-term:
- Metrics about the alert rules, defined in code, that our services expose.
- Metrics about our unit/integration test usage, and warnings of violations of any of our in-house testing conventions and standards.
- Links to technical debt tickets and statuses.
- The ability to measure the progress and rollout of cross-team, departmental technical initiatives (e.g. the adoption of a new Java version, infrastructure item, or automated refactor) via auto-generated metrics or dashboards. This may not seem important with 50 microservices, but with 250 or 500, this is more than any team of humans wants to collate by hand.
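Once the metadata is collected in one place, a rollout metric like the last item is little more than a count. A minimal sketch, with an invented fleet and a hypothetical `java` metadata key:

```python
from collections import Counter


def adoption(fleet: dict, key: str, target: str) -> float:
    """Fraction of services whose metadata value for `key` equals `target`."""
    if not fleet:
        return 0.0
    adopted = sum(1 for metadata in fleet.values() if metadata.get(key) == target)
    return adopted / len(fleet)


# Hypothetical fleet metadata, keyed by service name.
fleet = {
    "orders": {"java": "17"},
    "billing": {"java": "17"},
    "search": {"java": "11"},
    "email": {"java": "17"},
}

print(f"Java 17 adoption: {adoption(fleet, 'java', '17'):.0%}")
# Breakdown by version, counting services with no value as "unknown".
print(Counter(metadata.get("java", "unknown") for metadata in fleet.values()))
```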
This covers the technical detail of services, project metadata, and compliance with our conventions. It exposes information that may help diagnose runtime issues, and it documents both internal and public APIs. What are the remaining types of technical documentation, and what good can automation do in each case?
“Question and Answer” documentation:
We used a hosted Q&A service to allow “bottom-up” documentation to emerge for a system or service as people actually use it, which develops in response to the “top-down” documentation prepared in anticipation of how people would use it. Both are of equal value.
We wanted to make sure that this too can be accessed from the “source of truth” that is a project README, and it can, provided questions are tagged with the appropriate project name, ID, or slug. If humans can help create one side of the mapping, from project to questions, then Q&A “topic” pages can provide the reverse mapping from question topics back to projects.
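The two-way mapping can be sketched with a pair of dictionaries. The question records, tags, and project slugs below are invented, and treating every non-project tag as a "topic" is an assumption of this sketch, not a feature of any particular Q&A service.

```python
from collections import defaultdict


def build_mappings(questions: list, projects: set) -> tuple:
    """Return (project -> question titles, topic -> projects) from tagged questions."""
    project_to_questions = defaultdict(list)
    topic_to_projects = defaultdict(set)
    for question in questions:
        tagged_projects = [tag for tag in question["tags"] if tag in projects]
        topics = [tag for tag in question["tags"] if tag not in projects]
        for project in tagged_projects:
            project_to_questions[project].append(question["title"])
            for topic in topics:
                topic_to_projects[topic].add(project)
    return dict(project_to_questions), dict(topic_to_projects)


# Hypothetical questions tagged by humans with a project slug and a topic.
questions = [
    {"title": "Why do retries time out?", "tags": ["orders", "rabbitmq"]},
    {"title": "How do I replay a message?", "tags": ["billing", "rabbitmq"]},
    {"title": "Where are the schema diagrams?", "tags": ["orders", "database"]},
]

by_project, by_topic = build_mappings(questions, {"orders", "billing"})
print(by_project["orders"])
print(sorted(by_topic["rabbitmq"]))
```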
Although we have not yet taken this any further, these two-way mappings can allow “Big Data”-style knowledge solutions to be developed to more intimately relate real-world issues back to questions, then to answers, and back to the relevant developers, projects, and code areas.
Generating-away trivial instructions:
As our processes have become more automated and streamlined over time, there’s much less need to maintain “How-to” documents full of steps and commands.
We no longer need to describe in plaintext, or on a wiki, the process of building an application server or a microservice. Whether you use Ansible, Docker, or equivalents, this list of instructions becomes part of the code. When it is systematically rolled out to a fleet of services and exercised dozens of times per day via Continuous Deployment, you have something that is continuously proven from end to end.
ChatOps has helped us too. Previously manual tasks like creating build and deployment pipelines can be reduced to one or two easily-remembered chat commands, and context-sensitive help makes it pointless to document the procedure manually.
Similarly, scripts that go out of their way to be helpful reduce the need to document workarounds, “gotcha”s, and manual testing procedures. Our scripts should:
- Say exactly what they will do (and won’t do)
- Say exactly what they eventually did (or didn’t)
- Be idempotent: safe to run and re-run, however far down the road you already are
- Clean up after themselves, whether they succeeded or failed in their mission
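The four properties above can be illustrated with one small function. This is a sketch only: `ensure_line` and the file it edits are hypothetical, standing in for whatever a real script changes.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str, dry_run: bool = False) -> str:
    """Append `line` to `path` unless already present, and report what happened."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        # Idempotent: re-running after success changes nothing.
        return f"unchanged: {path.name} already contains {line!r}"
    if dry_run:
        # Say exactly what would be done, without doing it.
        return f"would add {line!r} to {path.name}"
    scratch = Path(tempfile.mkdtemp())
    try:
        # Stage the new content in a scratch directory, then move it into place.
        staged = scratch / path.name
        staged.write_text("\n".join(existing + [line]) + "\n")
        shutil.move(str(staged), str(path))
        return f"added {line!r} to {path.name}"
    finally:
        # Clean up the scratch directory whether the update succeeded or failed.
        shutil.rmtree(scratch, ignore_errors=True)


workdir = Path(tempfile.mkdtemp())
target = workdir / "settings.conf"
print(ensure_line(target, "feature=on", dry_run=True))
print(ensure_line(target, "feature=on"))
print(ensure_line(target, "feature=on"))
shutil.rmtree(workdir)
```

The returned message doubles as the report of what the script eventually did, so the caller never has to inspect the file to find out.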
Going with the flow of the technical frameworks we use, rather than diverging from them, leaves fewer workarounds to document. As previously manual processes become better aligned with formal, established development stages (i.e. compile-time, test-time, application startup or shutdown, Helm deployment), there is simply less need to provide so many potentially divergent sets of individual steps and instructions.
Project history and decisions:
We have recently adopted Architectural Decision Records as a way to bind decisions together with the relevant code changesets, rather than keeping them in a long-form document maintained separately, and to provide a source-controlled history. While we are at the early stages of our usage, we are confident that we can replace “history” and “background” documents that have deteriorated over time with new, generated documents, as well as providing opportunities for cross-linking across projects.
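As an indication of how a generated "history" document might be produced, the sketch below builds an index from decision records. It assumes each record is a Markdown file with a top-level title and a "Status" section, which is a common ADR layout but not necessarily ours; the file names and contents are invented.

```python
import re


def adr_summary(text: str) -> tuple:
    """Extract (title, status) from one decision record written in Markdown."""
    title = re.search(r"^#\s+(.+)$", text, re.MULTILINE)
    status = re.search(r"^##\s+Status\s*\n+\s*(\S.*)$", text, re.MULTILINE)
    return (
        title.group(1).strip() if title else "Untitled",
        status.group(1).strip() if status else "Unknown",
    )


def history_index(adrs: dict) -> str:
    """Build a plain-text decision history from {filename: contents}."""
    lines = ["Decision history", ""]
    for filename in sorted(adrs):
        title, status = adr_summary(adrs[filename])
        lines.append(f"- {title} [{status}] ({filename})")
    return "\n".join(lines)


# Hypothetical records, as they might sit alongside the code.
adrs = {
    "0002-use-rabbitmq.md": "# 2. Use RabbitMQ for messaging\n\n## Status\n\nAccepted\n\n## Context\n\n...\n",
    "0001-record-decisions.md": "# 1. Record architecture decisions\n\n## Status\n\nAccepted\n",
}
print(history_index(adrs))
```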
We’ve demonstrated that it is possible to bring many sources of documentation together in one place, providing a single source of truth that is worth developers’ time to consult. We’ve also shown that a microservice fleet can grow substantially without a corresponding increase in the cost of manually maintaining documentation or searching for information. We’ve linked documentation to discoverability and made it much clearer what we’re deploying, how each part works, and how the individual parts fit into a greater whole.
By Andrew Regan — Technical Architect at Crunch