KCK: Purpose and Introduction

by Fred McDavid

published: Feb. 12, 2018

This series of articles will introduce a pet project of mine, KCK, a self-priming, intelligent cache designed to maximize the performance of data pipelines and data-driven websites. Its goal is to make fast, Python-powered networked systems a bit faster.

For this first article, I’m going to define what I mean by “fast, Python-powered networked systems,” because if you're not running something like the systems I describe, or if you've not put much thought into architecture generally, then my project will likely just add complexity and you'll kick yourself for ever having looked at it. Beware all who pass through these gates, for here be dragons.

So, my idea of a well-formed website rests on a few simple principles:

  • front-end and back-end concerns should be properly separated
  • databases should be properly indexed
  • queries should be optimized

I've built systems this way in Perl, Ruby, and now Python. The front-end servers make HTTP requests to the back-end servers, and the back-end servers talk to the databases and do whatever other heavy lifting needs to happen.

If you structure a site this way, build a proper database schema, and write efficient queries, the system scales very well: the front-end servers can scale independently of the back-end servers, and heavy load on one doesn't drag down the other.

The diagram below shows the basic structure of these systems.

A well-formed web system.

The front-end and back-end logic is separated across two types of HTTP servers, each of which can be scaled independently of the other.

For many web-based business applications, this architecture is enough to handle hundreds or even thousands of simultaneous users.
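
To make the split concrete, here's a minimal sketch of the front-end tier. Flask and requests are just my picks for illustration (any HTTP stack works the same way), and the back-end address is made up:

    # front_end.py: a minimal sketch of the front-end tier (illustrative only)
    import requests
    from flask import Flask

    app = Flask(__name__)
    BACKEND = "http://backend.internal:8000"  # hypothetical back-end address

    @app.route("/dashboard/<int:user_id>")
    def dashboard(user_id):
        # The front end never touches the database; it asks the back end,
        # which owns the schema, the queries, and the heavy lifting.
        resp = requests.get(f"{BACKEND}/api/users/{user_id}/summary", timeout=2)
        resp.raise_for_status()
        return resp.json()  # in a real site this would feed a template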

Because the front-end and back-end servers are horizontally scalable, the database ends up being the final bottleneck. If the system isn't doing much per user, the database is properly indexed, and the queries are efficient, a database like PostgreSQL or MySQL can handle a LOT of traffic.
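
As a toy demonstration of how much indexing matters, the snippet below uses sqlite3 (purely for portability; the same principle holds for PostgreSQL and MySQL) to show the query planner switching from a full table scan to an index lookup:

    # index_demo.py: a toy look at indexed vs. unindexed lookups
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 1000, i * 0.5) for i in range(100_000)],
    )

    query = "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?"

    # Without an index, the planner scans all 100,000 rows.
    print(conn.execute(query, (42,)).fetchall())  # ...SCAN orders...

    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

    # With the index, it does a b-tree search instead.
    print(conn.execute(query, (42,)).fetchall())  # ...SEARCH ... USING INDEX...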

But when the fundamentals have all been properly addressed and response times still exceed performance goals, the options for improving responsiveness, apart from simply buying bigger, faster computers, look like this:

  • use a cache and a cache priming mechanism to replace synchronous, slow data-gathering functions (sketched after this list)
  • where possible, produce parallel versions of slow execution paths
  • find ways to optimize data processing
  • reduce the amount of data being gathered
  • reduce the number of database and/or API queries needed to produce the data product
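
The first option on that list is the one KCK is aimed at, and it's worth seeing in miniature. Here's a bare-bones, KCK-free sketch (all the names are mine, invented for illustration) of a cache that a background worker keeps primed, so requests rarely pay for the slow fetch:

    # primed_cache.py: a bare-bones cache with a background refresh worker
    import threading
    import time

    CACHE = {}  # key -> (value, fetched_at)

    def slow_fetch(key):
        time.sleep(2)  # stand-in for an expensive query or API call
        return f"data for {key}"

    def refresh_worker(keys, interval=30):
        # Re-primes every registered key on a fixed cadence,
        # so readers almost always hit a warm entry.
        while True:
            for key in keys:
                CACHE[key] = (slow_fetch(key), time.time())
            time.sleep(interval)

    def get(key):
        # Readers pay the slow path only if the primer hasn't run yet.
        if key not in CACHE:
            CACHE[key] = (slow_fetch(key), time.time())
        return CACHE[key][0]

    threading.Thread(target=refresh_worker,
                     args=(["report:daily", "dashboard:summary"],),
                     daemon=True).start()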

KCK was built to give web systems vastly more scalability at greatly reduced cost. But underperforming reporting and billing systems, or any other systems whose performance is ultimately a function of how fast the database can answer queries, have similar options available even if their architecture differs from the web system discussed here, and KCK may well provide performance benefits in those contexts as well.

KCK is designed to be a flexible and powerful solution for each of these options:

  • it has a high-performance cache with background refresh workers, both of which can scale horizontally as needed
  • its cache+primer+refresh-worker model encourages building data pipelines as simple, logical units that are highly parallelizable
  • it makes it easy to assemble as much data as possible in advance, and it provides a mechanism for updating cache entries in place where that's feasible
  • it makes stale-data tolerance adjustable on a per-entry basis (see the toy sketch after this list)
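
KCK's actual interface will be spelled out in later posts; purely to make that last point concrete, here's a toy of my own (not KCK code, and not its API) showing what per-entry stale-data tolerance means in practice:

    # staleness_demo.py: per-entry stale-data tolerance, in miniature
    import time

    class TolerantCache:
        def __init__(self):
            self._store = {}  # key -> (value, fetched_at, max_staleness)

        def put(self, key, value, max_staleness):
            self._store[key] = (value, time.time(), max_staleness)

        def get(self, key, refetch):
            value, fetched_at, max_staleness = self._store[key]
            if time.time() - fetched_at > max_staleness:
                # Older than this entry's own tolerance: refresh it in place.
                value = refetch(key)
                self.put(key, value, max_staleness)
            return value

    cache = TolerantCache()
    cache.put("quote:AAPL", 189.30, max_staleness=5)  # quotes go stale fast
    cache.put("report:2017", {"total": 1000}, max_staleness=86_400)  # a day is fine

A fresh stock quote and a year-end report shouldn't be held to the same freshness standard, and a per-entry knob lets each pipeline say so.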

The diagram below shows how KCK fits into a system’s architecture.

The same well-formed web system, now with KCK.

The front-end and back-end structure is unchanged; KCK adds a self-priming cache layer that serves data without waiting on slow queries.

Again, KCK is a self-priming, intelligent cache designed to maximize the performance of data pipelines and data-driven websites.

I’ve been looking forward to sharing the MVP, which I loosely defined as the point where KCK can both reduce operational cost and improve performance by an order of magnitude, and where I have the Jupyter notebooks to prove it.

I'm close. Future posts will spell it all out. More to come.

Curious folk should have a look at the repo and/or get in touch.

Feedback welcomed, contributors championed.