Content Tracker

This was a fun and very ambitious project that grew out of Priceonomics' own desire to create a repeatable process for content marketing.

The nutshell version is that it's a voracious data-ingesting machine that continuously queries Google Analytics, Facebook, Twitter, and others on behalf of its users to get current inbound links, tweets, and shares. It reports new information as it arrives via Slack and surfaces statistics about what it has learned on the website.
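
As a rough sketch of that loop's shape (the function names, webhook URL, and polling interval below are illustrative placeholders, not the production code):

```python
import time
import requests

# Illustrative placeholders -- the real endpoints, credentials, and intervals differ.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"
POLL_INTERVAL_SECONDS = 60

# Last observed share count per article URL (the real system persists this state).
last_known = {}

def fetch_share_count(article_url):
    """Stand-in for the real Google Analytics / Facebook / Twitter queries."""
    return 0  # stub value; the production code calls each platform's API

def notify(message):
    """Post a message to Slack via an incoming webhook."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

def poll(articles):
    for url in articles:
        current = fetch_share_count(url)
        previous = last_known.get(url, 0)
        if current > previous:
            notify(f"{url} picked up {current - previous} new shares ({current} total)")
        last_known[url] = current

if __name__ == "__main__":
    tracked = ["https://example.com/some-article"]
    while True:
        poll(tracked)
        time.sleep(POLL_INTERVAL_SECONDS)
```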

Design Criteria

The main design constraint for Tracker is to keep the information current. If someone with a lot of followers tweets about something one of the customers has written, the customer should be made aware immediately. If they make the front page of Reddit and go from 500 to 5,000 page views, the customer should hear about it.

Implementation

Our team built the system such that the web servers and the background queues autoscale, so the big trick ended up being the data pipeline.

We carefully managed database access and made sure the vast majority of the website touched only the cache. From there, we simply had to ensure that the cached data was always fresh. Figuring out how to keep the cache fresh with as few resources as possible occupied a significant chunk of our development effort.
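
A minimal sketch of that split, assuming a Redis-style cache in front of the database (the key names, TTL, and data-access helper below are assumptions, not the actual code):

```python
import json

import redis  # assumes a Redis-style cache; the real backend isn't specified

cache = redis.Redis(host="localhost", port=6379)

def get_article_stats(article_id):
    """Web-facing read path: touches only the cache, never the database."""
    raw = cache.get(f"article:{article_id}:stats")
    return json.loads(raw) if raw is not None else None

def refresh_article_stats(article_id, db):
    """Background refresh: recompute from the database and rewrite the cache entry.

    As long as this runs before the key expires, web requests never have to
    fall through to the database.
    """
    stats = db.load_stats(article_id)  # hypothetical data-access helper
    cache.set(f"article:{article_id}:stats", json.dumps(stats), ex=3600)
```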

Highlights

The system expects component failures and deals with them gracefully.

Deploys are 99% automated and involve no user-facing downtime.

Data is managed such that most outages can be corrected within a few minutes.

The system is highly optimized. It requests as little as possible from Google Analytics and the other data sources, and updates throughout the data pipeline use the fewest operations necessary to keep the related pieces of data in sync (see the sketch at the end of this section).

At one point the system ran on more than 60 very capable AWS servers; after months of refactoring code and rethinking assumptions, it now runs on a tenth as many servers and a much less expensive VM for the database.
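
As a concrete illustration of the optimization point above, each sync can ask an upstream API only for data newer than the last successful pull. The cursor storage and client interfaces below are assumptions, not the actual implementation:

```python
from datetime import datetime, timezone

def sync_source(source, store):
    """Pull only what changed since the last successful sync for one data source.

    `source` and `store` are hypothetical interfaces standing in for the real
    Google Analytics / Facebook / Twitter clients and the backing database.
    """
    cursor = store.get_cursor(source.name)    # timestamp of the last successful pull
    new_rows = source.fetch_since(cursor)     # request only the delta, never a full export
    if new_rows:
        store.upsert(source.name, new_rows)   # one batched write keeps related data in sync
    store.set_cursor(source.name, datetime.now(timezone.utc))
```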