Devops and Security Vodcast: Code Quality & Helpful Tools
&yet’s ops and security guys hash it out in this latest vodcast.
Nathan LaFreniere talks about what’s in his devops toolkit, his code deployment process, how ops can help maintain code quality, and his new documentation library, ape.
Adam Baldwin discusses his new Node.js header security library for express, helmet, a few headers that most apps should be including by default now, and some random bits about realtime security.
Fortunately for you, this particular cut doesn’t include Adam singing Russian Unicorn, but it does feature a yeti and Adam doing what he would consider dancing.
Please let us know what you would like to hear about in the future regarding ops and security.
Credits:
“Talent”: Nathan (left) and @adam_baldwin (right).
Video filmed and produced by the awesome Ms. Mel.
filed under devops, node.js, process, qa, security, and vodcast
posted February 17, 2012 by Adam Baldwin
Redis Reliability for Realtime Apps
The Problem
When I was at FOSDEM last weekend, I talked to several people who couldn’t believe that I would use Redis as a primary database in single page webapps. When mentioning that on Twitter, someone said, “Redis really only works if it’s acceptable to lose data after a crash.”
For starters, read http://redis.io/topics/persistence. What makes Redis different from other databases in terms of reliability is that a command can return “OK” before the data is written to disk (I’ll get to this). Beyond that, it is easy to take snapshots, compress append-only log files, and configure fsync behavior in Redis. There are tests for dealing with disk access suddenly cut off while writing, and steps are taken to prevent this from causing corruption. In addition, you have `redis-check-aof` for dealing with log file corruption.
Note that because you have fine-tuned control over how fsync works, you don’t have to rely on the operating system to make sure that operations are written to disk.
No Really, What Was the Problem Again?
Since commands can fail in any database, client libraries wait for OKs, Errors, and Timeouts to deal with data reliability. Every database-backed application has to deal with potential errors. The difference is that we expect the pattern to be command-result based, when in fact, we can take a more asynchronous approach with Redis.
Asynchronous reliability
The real difference is that Redis will return an OK as long as it was written to RAM (see Antirez’s clarification in the comments) while other databases tend to send OK only after the data is written to disk. We can still match (and exceed) the reliability of other databases easily enough with a very simple check that you may be doing anyway without realizing it. When sending any command or atomic group of commands to Redis in the context of a single page app, I always send some sort of `PUBLISH` at the end. This publish bubbles back up to update the user clients as well as inform any other interested party (separate cluster processes for example) about what is going on in the database application. If the client application doesn’t get an update corresponding to a user action within a certain amount of time, then it knows the command didn’t complete and can let the user know. Beyond this, we can write to a Redis *master* and `SUBSCRIBE` to publishes on a Redis *slave*! Now the client application can know that the data has been saved on more than one server; that sounds pretty reliable to me.
Using this information, the client application can intelligently deal with user action reliability all the way to the slave, and inform users with a simple error, resubmit their action without prompting, or request that the server do some sort of reliability check (in or out of context of the user action), etc.
tl;dr
- Single page app sends a command
- Application server runs an atomic action on Redis *master*.
- Redis master syncs to Redis *slave*
- `PUBLISH` at the end of said atomic action routes to application server from Redis *slave*.
- `PUBLISH` routes to single page app that sent the command, and thus the client application knows that said atomic action succeeded on two servers.
- If the client application hasn’t heard a published confirmation, the client can deal with this as an error however it deems appropriate.
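The steps above can be sketched in a few lines of Python. This is only an in-memory simulation under stated assumptions: `FakeRedis`, `atomic_set_and_publish`, and `send_user_action` are hypothetical names invented for illustration, not part of Redis or any client library, and a real app would use an actual Redis client with a subscription open on the slave.

```python
import queue


class FakeRedis:
    """In-memory stand-in for one Redis server (hypothetical, not a real client)."""

    def __init__(self):
        self.data = {}
        self.subscribers = []    # callbacks invoked on every publish
        self.replica = None      # another FakeRedis that receives our writes
        self.replicating = True  # set False to simulate a broken replication link

    def atomic_set_and_publish(self, key, value, channel, message):
        # The write and its PUBLISH are applied together, as one atomic group.
        self.data[key] = value
        for callback in self.subscribers:
            callback(channel, message)
        if self.replica is not None and self.replicating:
            # The replica replays the same group, so its subscribers only
            # hear about writes that actually reached it.
            self.replica.atomic_set_and_publish(key, value, channel, message)


def send_user_action(master, confirmations, action_id, key, value, timeout=0.1):
    """Write to the master, then wait for the matching publish from the slave."""
    master.atomic_set_and_publish(key, value, "updates", action_id)
    try:
        while True:
            _channel, confirmed_id = confirmations.get(timeout=timeout)
            if confirmed_id == action_id:
                return True   # the action is known to exist on both servers
    except queue.Empty:
        return False          # no confirmation in time: treat as an error


master, slave = FakeRedis(), FakeRedis()
master.replica = slave
confirmations = queue.Queue()
slave.subscribers.append(lambda channel, message: confirmations.put((channel, message)))

print(send_user_action(master, confirmations, "action-1", "task:1", "done"))
master.replicating = False  # simulate the slave falling out of sync
print(send_user_action(master, confirmations, "action-2", "task:2", "done"))
```

The first call is confirmed because the publish arrives by way of the slave; the second times out because the write never left the master. What to do on a timeout (show an error, resubmit, or ask the server to check) is left to the application.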
Further Thoughts
Data retention, reliability, scaling, and high availability are all related concepts, but not the same thing. This post specifically deals with data retention. There are existing strategies and efforts for the other related problems that aren’t covered in this post.
If data retention is your primary need from a database, I recommend giving Riak a look. I believe in picking your database based on your primary needs. With Riak, commands can wait for X number of servers in the cluster to agree on a result, and while we can do something similar on the application level with Redis, Riak comes with this baked in.
David Search commented while reviewing this post, “Most people don’t realize that a fsync doesn’t actually guarantee data is written these days either (depending on the disk type/hardware raid setup/etc).” This further strengthens the concept of confirming that data exists on multiple servers, either asynchronously as this blog post outlines, or synchronously like with Riak.
About Nathan Fritz
Nathan Fritz aka @fritzy works at &yet as the Chief Architect. He is currently working on a book called “Redis Theory and Patterns.”
If you’re building a single page app, keep in mind that &yet offers consulting, training and development services. Send Fritzy an email (nathan@andyet.net) and tell us what we can do to help.
Update: Comment From Antirez
Antirez chimed in in the comments to correct this post.
“actually, it is much better than that ;)
Redis with AOF enabled returns OK only *after* the data was written on disk. Specifically (sometimes just transmitted to the OS via write() syscall, sometimes after also fsync() was called, depending on the configuration).
1) It returns OK when aof fsync mode is set to ‘no’, after the write(2) syscall is performed. But in this mode no fsync() is called.
2) It returns OK when aof fsync mode is set to ‘everysec’ (the default) after write(2) syscall is performed. With the exception of a really busy disk that has still a fsync operation pending after one second. In that case, it logs the incident on disk and forces the buffer to be flushed on disk blocking if at least another second passes and still the fsync is pending.
3) It returns OK both after write(2) and fsync(2) if the fsync mode is ‘always’, but in that setup it is extremely slow: only worth it for really special applications.
Redis persistence is not less reliable compared to other databases, it is actually more reliable in most of the cases because Redis writes in an append-only mode, so there are no crashed tables, no strange corruptions possible.”
filed under architecture, ops, realtime, and redis
posted February 9, 2012 by Nathan Fritz
Adam Baldwin and Nathan LaFreniere are yetis.
Security expert and dev/ops badass join the &yet team January 1
Because we are huge fans of human namespace collisions and amazing people, we’re adding two new members to our team: Adam Baldwin and Nathan LaFreniere, both in transition from nGenuity, the security company Adam Baldwin co-founded and built into a well-respected consultancy that has advised the likes of GitHub, AirBNB, and LastPass on security.
We have relied on Adam and Nathan’s services through nGenuity to inform, improve, and check our development process, validating and invalidating our team’s work and process, providing education and correction along the way. We are thrilled to be able to bring these resources to bear with greater influence, while providing Adam Baldwin with the authority to improve areas in need of such.
Adam Baldwin
Adam Baldwin has served as &yet’s most essential advisor since our first year, providing me with confidence in venturing more into development as an addition to my initial web design freelance business, playing “panoptic debugger” when I struggled with it, helping us establish good policy and process as we built our team, improving our system operations, and always, always, bludgeoning us about the head regarding security.
It really can’t be expressed how much respect I and our team at &yet have for Adam and his work.
He’s uncovered Basecamp vulnerabilities that encouraged 37Signals to change their policies for handling reported vulnerabilities, found huge holes in Sprint/Verizon MiFi (that made for one of the most hilarious stories I’ve been a part of), published vulnerabilities *twice* to root Rackspace, shared research to uberhackers at DEFCON, and has provided security advice for a number of first-class web apps, including ones you’re using today and conceivably right now.
Adam Baldwin will be joining our team at &yet as CSO—it’s a double title: Chief of Software Operations and Chief Security Officer.
Adam will be adding his security consultancy, alongside &yet’s other consulting services, but will also be overseeing our team’s software processes, something he has informed, shaped, and helped externally verify since, I think, before most of our team was born.
On a personal note (a longer version of which is here), I must say it’s a real joy to be able to welcome one of my best friends into helping lead a business he helped build as much as anyone on our team.
Nathan LaFreniere
As excited as I am personally to add Adam Baldwin, our dev team is even more thrilled about adding Nathan, whose services we have become well accustomed to relying on in our contract with nGenuity and in a large project where we’ve served a mutual customer.
Nathan is a multitalented dev/ops badass well-versed in automated deployment tools.
He solves operations problems with a combination of experience, innovation, and willingness to learn new tools and approaches.
He’s already gained a significant depth of experience building custom production systems for Node.js, including some tools we’ve come to rely on heavily for &bang.
Nathan’s passion for well-architected, smoothly running, and meticulously monitored servers has helped our developers sleep at night, very literally.
I know getting the luxury of having a huge amount of Nathan’s time at our developers’ disposal sounds to them like diving into a pool of soft kittens who don’t mind you diving on them and aren’t hurt at all by it either oh and they’re declawed and maybe wear dentures but took them out.
So that’s what we have for you today.
We think you’re gonna love it.
filed under new hires, ops, and security
posted December 16, 2011 by Adam Brault
Realtime web app architecture with Thoonk: a series of tubes, not tables
Now you’re thinking with feeds!
When I look at a single-page webapp, all I see are feeds; I don’t even see the UI anymore. I just see lists of items that I care about. Some of which only I have access to and some of which other groups have access to. I can change, delete, re-position, and add to the items on these feeds and they’ll propagate to the people and entities that have access to them (even if it is just me on another device or at a later date).
I’ve seen it this way for years, but I haven’t grokked it enough to articulate what I was seeing until now.
What Thoonk Is
Thoonk is a series of higher-level objects built on Redis that send publish, edit, delete, and position events when they are changed. These objects are feeds for making real-time applications and feed services.
What is a Thoonk feed?
A Thoonk feed is a list of indexed data objects that are limited by topic and by what a single entity might subscribe to. An RSS/ATOM feed qualifies. What makes a Thoonk feed different from a table? A table is limited to a topic, but lacks single entity interest limitations. A Thoonk feed isn’t just a message broker, it’s a database-store that sends out events when the data changes.
Let’s use &bang as an example. Each team-member has a list of tasks. In a relational database we might have a table that looks like this:
team_member_tasks
id | team_id | member_id | description | complete bool | etc.
Whenever a user renders their list, I would query that list, limiting by a specific user and a specific team.
If we converted this table, without changing it, into a Thoonk feed, then we would only be able to subscribe to ALL tasks and not just the tasks of a particular team or member. So, instead, a Thoonk feed might look like:
team:<team_id>:member:<member_id>:tasks
{description: "", completed: false, etc, etc}
Now when the user wants a rendered list of tasks, I can do one index look-up rather than three, and I am able to subscribe to changes on the specific team member’s tasks, or even to team:353:member:*:tasks to subscribe to all of that team’s tasks.
[Note: I suppose you could arrange a relational database this way, but it wouldn’t really be able to take advantage of SQL, nor could you subscribe to the table to get changes.]
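As a rough illustration of the naming scheme, here is a small Python sketch that builds feed names and picks out the ones a wildcard subscription would cover. The helper names `tasks_feed` and `matching_feeds` are made up for this example, and `fnmatchcase` from the standard library is only an approximation of glob-style channel matching, not the matcher Redis or Thoonk actually uses.

```python
from fnmatch import fnmatchcase


def tasks_feed(team_id, member_id):
    """Build the per-member feed name used in the example above."""
    return f"team:{team_id}:member:{member_id}:tasks"


def matching_feeds(pattern, feed_names):
    """Return the feed names a glob-style pattern subscription would cover."""
    return [name for name in feed_names if fnmatchcase(name, pattern)]


feeds = [tasks_feed(353, 1), tasks_feed(353, 2), tasks_feed(400, 1)]

# One member's tasks: an exact name, so a single look-up.
one_member = matching_feeds(tasks_feed(353, 2), feeds)

# Every member of team 353: one wildcard in the member position.
whole_team = matching_feeds("team:353:member:*:tasks", feeds)
print(one_member, whole_team)
```

Putting the team and member ids in the name is what makes both the single look-up and the team-wide subscription possible without filtering rows.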
It’s Feeds All the Way Up
If I use Thoonk subscribe-able feeds as my data-storage engine, life gets so much easier. When a user logs in, I can subscribe contextualized callbacks just for them to the feeds of data that they have access to read from. This way, if their data changes for any reason, by any process, by any server, it can bubble all the way up to the user without having to run any queries. I can also subscribe separate processes that can automatically scrub, pre-index, cull, or any number of tasks to any Thoonk feed a particular process cares about. I can use processes in mixed languages to provide monitoring and additional APIs to the feeds.
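To make the idea of a data store that announces its own changes concrete, here is a minimal in-memory sketch in Python. The `Feed` class and its method names are assumptions for illustration and are not Thoonk’s API; the real thing keeps its data in Redis and delivers events over pubsub rather than direct callbacks.

```python
class Feed:
    """In-memory stand-in for a subscribable feed (hypothetical API, not Thoonk's)."""

    def __init__(self, name):
        self.name = name
        self.items = {}      # item id -> item data
        self.order = []      # item ids in feed order
        self.callbacks = []  # subscribers notified on every change

    def subscribe(self, callback):
        self.callbacks.append(callback)

    def _emit(self, event, item_id, payload=None):
        for callback in self.callbacks:
            callback(self.name, event, item_id, payload)

    def publish(self, item_id, data):
        # Publishing an existing id counts as an edit.
        if item_id in self.items:
            event = "edit"
        else:
            event = "publish"
            self.order.append(item_id)
        self.items[item_id] = data
        self._emit(event, item_id, data)

    def retract(self, item_id):
        del self.items[item_id]
        self.order.remove(item_id)
        self._emit("delete", item_id)

    def reposition(self, item_id, index):
        self.order.remove(item_id)
        self.order.insert(index, item_id)
        self._emit("position", item_id, index)


log = []
tasks = Feed("team:353:member:7:tasks")
# A per-user callback; a real one might push the change down a websocket.
tasks.subscribe(lambda feed, event, item_id, payload: log.append((event, item_id)))
tasks.publish("t1", {"description": "write post", "completed": False})
tasks.publish("t1", {"description": "write post", "completed": True})
print(log)
```

Because every change goes through the feed, any number of subscribers (user sessions, indexers, monitors) see it without polling or re-querying.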
But What About Writes?
Let’s not think in terms of writes. Writes are just changes to feed items (publishing, editing, deleting, repositioning) that write the data to RAM/disk and inform any subscribers of the change. Let’s instead think in terms of user-actions. A user-action (such as delegating a task to another user in &bang) needs ACL and may affect multiple feeds in a single call. If we defer user-actions to jobs (a special kind of Thoonk feed), we can easily isolate, scale, share, and distribute the business-logic involved in dealing with a user-action.
What Are Thoonk Jobs?
Thoonk Jobs are items that represent business-logic needing to be done reliably, a single time, by any available worker. Jobs are consumed as fast as a worker-pool can consume them. A job feed is a list of job items, each of which is in one of three states: available, in-flight, or stalled. Available jobs are taken and placed in an in-flight set while they are being processed. When the job is done, the job is removed from the in-flight set, and its item is deleted. If the worker fails to complete the job (either because of an error, distaste, or a monitoring process deciding that the job has timed out), the job may be placed back on the available list or in the stalled set.
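The job lifecycle can be sketched as a small in-memory state machine. This Python example is an illustration only: the `JobFeed` class, its method names, and the `max_retries` policy are assumptions, not Thoonk’s implementation, which keeps these structures in Redis so that many workers can share them.

```python
from collections import deque


class JobFeed:
    """In-memory sketch of a job feed (hypothetical API, not Thoonk's)."""

    def __init__(self, max_retries=2):
        self.available = deque()  # job ids waiting for a worker
        self.in_flight = set()    # job ids currently being processed
        self.stalled = set()      # job ids parked for an admin to inspect
        self.payloads = {}        # job id -> job data
        self.failures = {}        # job id -> failed attempts so far
        self.max_retries = max_retries
        self._next_id = 0

    def put(self, payload):
        self._next_id += 1
        job_id = self._next_id
        self.payloads[job_id] = payload
        self.failures[job_id] = 0
        self.available.append(job_id)
        return job_id

    def get(self):
        # Raises IndexError when no job is available.
        job_id = self.available.popleft()
        self.in_flight.add(job_id)
        return job_id, self.payloads[job_id]

    def finish(self, job_id):
        # Done: leave the in-flight set and delete the item.
        self.in_flight.remove(job_id)
        del self.payloads[job_id]
        del self.failures[job_id]

    def fail(self, job_id):
        # Retry until the limit is exceeded, then stall.
        self.in_flight.remove(job_id)
        self.failures[job_id] += 1
        if self.failures[job_id] > self.max_retries:
            self.stalled.add(job_id)
        else:
            self.available.append(job_id)

    def retry(self, job_id):
        # An admin puts a stalled job back in circulation.
        self.stalled.remove(job_id)
        self.failures[job_id] = 0
        self.available.append(job_id)
```

A worker would loop on `get()`, call `finish()` on success and `fail()` on error; a separate monitor could call `fail()` for jobs that have been in flight too long.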
Why use Thoonk Jobs for User-Actions?
- User-actions that fail for some reason can be retried (you can also limit the # of retries).
- The work can be distributed across processes and servers.
- User-actions can burst much faster than the workers can handle them.
- A user-action that ultimately fails can be stalled, where an admin is informed to investigate and potentially edit and/or retry when the issue that caused it has been resolved or to test said resolution.
- Any process in any language can contribute jobs (and get results from them) without having to re-implement the business logic or ACL.
The Last One is a Doozy
Scaling, reliability, monitoring and all of that is nice, but being able to build your application out rather than up is, I believe, the greatest reason for this approach. &bang is written in node.js, but if I have a favorite library for implementing a REST interface or an XMPP interface written in Python or Ruby (or any other language), I can quickly put that together and add it as a process. In fact, I can pretty much add any piece of functionality as a process without having to reload the rest of the application server, and really isolate a feature as its own process. User-actions from this process can be published to Thoonk Job feeds without having to worry about request validation or ACL since that is handled by the worker itself.
Rather than having a very large, complex application, I can have a series of very small processes that automatically cluster and are informed of changes in areas of their specific concerns.
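Here is a hedged sketch of what “handled by the worker itself” could look like for one user-action. Everything in it is hypothetical: the `handle_delegate` function, the payload fields, and the plain dictionaries standing in for the ACL and the feeds are invented for illustration and do not reflect &bang’s actual code.

```python
def handle_delegate(job, acl, feeds):
    """Validate, authorize, then apply a 'delegate task' user-action.

    job   -- payload pulled from a job feed, submitted by any process
    acl   -- team id -> set of users allowed to change that team's tasks
    feeds -- (team id, member id) -> {task id: task data}
    """
    required = {"actor", "team", "task_id", "from_member", "to_member"}
    if not required.issubset(job):
        raise ValueError("malformed job payload")
    if job["actor"] not in acl.get(job["team"], set()):
        raise PermissionError("actor may not change this team's tasks")

    # One user-action touches two feeds: remove from one, add to the other.
    source = feeds[(job["team"], job["from_member"])]
    task = source.pop(job["task_id"])
    feeds.setdefault((job["team"], job["to_member"]), {})[job["task_id"]] = task


acl = {353: {"fritzy"}}
feeds = {(353, 1): {"t1": {"description": "ship it", "completed": False}}}
handle_delegate(
    {"actor": "fritzy", "team": 353, "task_id": "t1", "from_member": 1, "to_member": 2},
    acl,
    feeds,
)
print(feeds)
```

Since the checks live in the worker, a REST front end in one language and an XMPP front end in another can both submit the same job without duplicating them.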
Scaling Beyond Redis
Our testing indicates that Redis will not be a choke point until we have nearly 100,000 active users. The plan to scale beyond that is to shard &bang by teams. A quick look-up will tell us which server a team resides on, and users and processes can subscribe callbacks to connections on those servers. In that way, we can run many Redis servers, and theoretically scale horizontally. High-availability is handled by a slave for each shard and a gossip protocol for promoting slaves.
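The team look-up could be as simple as a directory that maps each team to a server. The Python sketch below is an assumption-laden illustration: the `ShardDirectory` class, the least-loaded assignment policy, and the host names are all hypothetical, and in practice the mapping would itself need to live somewhere durable.

```python
from collections import Counter


class ShardDirectory:
    """Maps each team to the Redis server that holds its feeds (hypothetical)."""

    def __init__(self, shard_addresses):
        self.shard_addresses = list(shard_addresses)
        self.team_to_shard = {}

    def assign(self, team_id):
        # New teams go to the shard with the fewest teams; existing teams stay put.
        if team_id not in self.team_to_shard:
            load = Counter(self.team_to_shard.values())
            self.team_to_shard[team_id] = min(
                self.shard_addresses, key=lambda address: load[address]
            )
        return self.team_to_shard[team_id]

    def lookup(self, team_id):
        # The quick look-up: which server should this team's traffic go to?
        return self.team_to_shard[team_id]


directory = ShardDirectory(["redis-a:6379", "redis-b:6379"])
for team in (353, 400, 512):
    directory.assign(team)
print(directory.lookup(400))
```

Sharding by team works because feeds never span teams, so every subscription and job for a team can be served by a single Redis server.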
Conflict Resolution and Missed Updates
Henrik’s recent post spawned a couple of questions about conflict resolution. First I’ll give a deflection, and then I’ll give a real answer.
&bang doesn’t yet need conflict resolution. None of the writes are actually done on the client as they are all RPC calls which go into a job queue. Then the workers validate the payload, check the ACL, and update some feeds, at which point the data bubbles back up to the client. The feed updates are atomic, and happen quite quickly. Also, two users being able to edit the same item only comes up with delegated tasks, in which case the most recent edit wins.
Ok, now the real answer. Thoonk is going to have revision history and incrementing revision numbers for 1.0. Each historical item is the same as the publish/edit/delete/reposition updates that are sent via pubsub. When a user change job is done, the client can send its current revision numbers for the feeds involved, and thus conflicts on an edit can be detected. The historical data should be enough data to facilitate some form of conflict resolution (determined by the application implementer). The revision numbers can also bubble up to the client, so the client can detect missing updates and ask for a replay from a given revision number.
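Since these features are still planned, the following Python sketch is speculative: `RevisionedFeed`, `Conflict`, and `missed_updates` are hypothetical names showing one way revision numbers could support conflict detection and replay, not the design Thoonk will ship.

```python
class Conflict(Exception):
    """Raised when an edit was based on a stale revision of an item."""


class RevisionedFeed:
    """Hypothetical feed with revision history; not Thoonk's actual API."""

    def __init__(self):
        self.revision = 0
        self.items = {}
        self.history = []  # (revision, event, item_id, data), oldest first

    def _record(self, event, item_id, data):
        self.revision += 1
        self.items[item_id] = data
        self.history.append((self.revision, event, item_id, data))
        return self.revision

    def publish(self, item_id, data):
        return self._record("publish", item_id, data)

    def edit(self, item_id, data, client_revision):
        # Conflict if the item changed after the revision the client last saw.
        for revision, _event, changed_id, _data in self.history:
            if changed_id == item_id and revision > client_revision:
                raise Conflict(f"{item_id} changed at revision {revision}")
        return self._record("edit", item_id, data)

    def replay(self, since_revision):
        # Everything a client missed after the revision it last saw.
        return [entry for entry in self.history if entry[0] > since_revision]


def missed_updates(feed, last_seen, incoming_revision):
    """Client side: a gap in revision numbers means updates were missed."""
    if incoming_revision > last_seen + 1:
        return feed.replay(last_seen)
    return []
```

How a detected conflict gets resolved (reject, merge, or last edit wins) stays an application decision; the history only supplies the data needed to make it.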
Currently we’re punting on missed items. Anytime the &bang user is disconnected, the app is disabled and refreshed when it is able to reconnect. A more elaborate solution using the new Thoonk features I just listed is probably coming and perhaps some real offline-mode support with local “dirty” changes that get resolved when you come back online.
All Combined
Using Thoonk, we were able to make &bang scale to tens of thousands of active users on a single server, burst user-activity beyond our choke-points, isolate user-action business-logic and ACL, automatically cluster to more servers and processes, choose any language with a Redis client library for individual features and interfaces, bubble data changes all the way up to the user regardless of the source of change, provide an easy way of iterating, and generally create a kick-ass, realtime, single-page webapp.
Can I Use Thoonk Now?
Thoonk.js and Thoonk.py are MIT licensed, and free to use. While we are using Thoonk.js in production and it is stable there, the API is not final. Currently I’m moving the feed logic to Redis Lua scripts, which will be officially supported in Redis 2.6 with an RC1 promised for this December. I plan to be ready for that. The Lua scripting will give us performance gains, and remove unnecessary extra logic to keep publish/edit/delete/reposition commands atomic, but most importantly it will allow us to share the core code with all implementations of Thoonk, allowing us to easily add and support more languages. As mentioned previously, as I do the Redis Lua scripting, I’ll be adding revision history and revision numbers to feeds, which will facilitate conflict detection and replay of missed events.
That said, feel free to comment, contribute, steal, or abuse the project in the meantime. A 1.0 release will indicate API stability, and I will encourage its use in production at that point. I will soon be breaking out the Lua scripts to their own git repo for easy implementation.
If you want to keep an eye on what we’re doing, follow me @fritzy and @andyet on twitter. Also be sure to check out &bang for getting stuff done with your team.
If you’re building a single page app, keep in mind that &yet offers consulting, training and development services. Shoot Henrik an email (henrik@andyet.net) and tell us what we can do to help.
filed under andbang, architecture, javascript, nodejs, realtime, and thoonk
posted November 18, 2011 by Nathan Fritz