We loved what former Turntable.fm VP of Technology Joseph Perla wrote about scaling a startup, and the discussion his post generated on Hacker News. Syndicated with Joseph's permission. Read the original here. Please post your questions and thoughts in the comments below.
By Joseph Perla
[ @jperla ]
These are case studies.
I will talk about my last two startups, where I used a number of techniques to build the products quickly and scale them up. The techniques I used to architect them for scale are quite simple, but someone who is not familiar with building systems may be interested in learning how to build his or her own scalable site.
This is based on the outline of a paper by Lampson ("Hints for Computer System Design"), with more modern web-based examples from the following companies:
Labmeeting was a search engine for biomedical literature and a social network for scientists. http://www.crunchbase.com/company/labmeeting.
Turntable.fm (Stickybits, Inc) is a social music website. http://www.turntable.fm.
Keep it simple.
We built APIs before making the website or mobile apps at Stickybits. That means that the design of data access, security, and data flow happened long before the first interfaces were created. Simplicity in a small interface is key, with well-defined, single-purpose functions coming from each module and submodule of the API. The whole front-end interface uses fewer than 30 methods, spread across 5 modules of the API, and nothing else.
Get it right.
From day one, we built automated tests into Turntable to catch any conceivable or subtle bugs that we might introduce during development. We knew in advance that it would be a complex, dynamic site with hard-to-reproduce state. This made it all the more important that each simple method in the API performed exactly as it needed to, both in the edge cases and in the normal case. We had individual function tests, module-level tests, and full integration tests that automatically started a full chat server and tested real requests. The tests were run on every commit, and no bugs were allowed to persist before we wrote new code.
Don't hide power
You can see the docs and use this library yourself at the Pebbles introduction.
Use procedure arguments to provide flexibility in an interface
We created a system for filtering through news articles. The system has many basic parameters that can be passed that are very simple, but the parameters are simply procedures. Therefore, if someone had a special complicated need, they could write their own function that returned a boolean value of whether to filter the news and pass that through the interface.
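As a rough illustration of passing a procedure through the interface, here is a minimal Python sketch. The names `filter_news` and `long_science_piece` and the article fields are hypothetical, not the original system's API.

```python
from typing import Callable, Iterable

Article = dict  # hypothetical article record, e.g. {"title": ..., "source": ...}


def filter_news(articles: Iterable[Article],
                keep: Callable[[Article], bool] = lambda article: True) -> list:
    """Return the articles for which the caller-supplied procedure returns True."""
    return [article for article in articles if keep(article)]


def long_science_piece(article: Article) -> bool:
    # A special, more complicated need written by the caller, not by the library.
    return article["source"] == "science" and len(article["title"]) > 20


articles = [
    {"title": "Short note", "source": "science"},
    {"title": "A long title about protein folding", "source": "science"},
    {"title": "A long title about quarterly earnings", "source": "finance"},
]
selected = filter_news(articles, keep=long_science_piece)
```

The interface stays tiny (one function, one optional argument), while the caller keeps full control over what counts as worth keeping.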
Leave it to the client
The interface at Turntable is very simple, and we expect the client to perform complicated manipulations of the many elements of the interface and keep track of all those states. This allowed the backend to be developed very quickly, although it meant that frontends, like an iPhone app, take a little longer to develop.
Keep basic interfaces stable. Keep a place to stand if you do have to change interfaces.
The APIs of Stickybits and Turntable were versioned from the very beginning. They can thus offer full compatibility with previous functionality, while enhancements and changes can be made in newer versions.
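One simple way to version an API is to key each handler by version as well as by method name. The sketch below assumes that layout; the method name `room.info`, the response fields, and the `call_api` dispatcher are all made up for illustration and are not the real Turntable or Stickybits API.

```python
def room_info_v1(room: dict) -> dict:
    # Original response shape, kept frozen so old clients keep working.
    return {"name": room["name"]}


def room_info_v2(room: dict) -> dict:
    # A newer version adds a field without touching v1.
    return {"name": room["name"], "listener_count": len(room["listeners"])}


# Hypothetical routing table: (version, method name) -> handler.
ROUTES = {
    ("v1", "room.info"): room_info_v1,
    ("v2", "room.info"): room_info_v2,
}


def call_api(version: str, method: str, payload: dict) -> dict:
    try:
        handler = ROUTES[(version, method)]
    except KeyError:
        raise LookupError(f"unknown API method {version}/{method}") from None
    return handler(payload)


room = {"name": "Indie While You Work", "listeners": ["alice", "bob"]}
old_response = call_api("v1", "room.info", room)
new_response = call_api("v2", "room.info", room)
```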
Making implementations work
Plan to throw one away.
Many of the routines in the initial prototypes were written very quickly and with an eye to throwing them out once in full production mode. For example, the Random Room feature on Turntable literally pulls every single room into the main memory of the Python process and then chooses a room randomly from there, because Mongo doesn't have a random function. A fully optimized version would be a little more complicated, but many routines were designed this way, with an eye to throwing the inner part of the function out and rewriting it once the bottlenecks were identified.
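A throwaway version of Random Room might look like the sketch below. The `fetch_all_rooms` callable is a hypothetical stand-in for the database query; the point is that the function's signature can stay fixed while its inner part is rewritten later.

```python
import random


def random_room(fetch_all_rooms, rng=random):
    """Brute-force version: load every room into memory, then pick one."""
    # This inner part is the piece intended to be thrown away and rewritten
    # once it shows up as a bottleneck.
    rooms = list(fetch_all_rooms())
    if not rooms:
        return None
    return rng.choice(rooms)


all_rooms = ["Chillout Tent", "Coding Soundtrack", "Indie While You Work"]
picked = random_room(lambda: all_rooms)
```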
Keep secrets of the implementation
We built Turntable up from separate, siloed modules. While this decreased performance a bit, it allowed the modules to operate independently and with maximum flexibility to respond to changes in the requirements of the interface. For example, the Rooms manager knew nothing about how users were stored or queried. The User API could store users in memory, on disk, in MongoDB, or halfway across the world. Rooms only knew that it could call the same external API methods used to look up a user or set of users.
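The separation between Rooms and Users can be sketched with a small interface. The class and method names below (`UserStore`, `get_users`, `RoomManager`) are assumptions for illustration, not the original code.

```python
from typing import Protocol


class UserStore(Protocol):
    """The only thing the rooms code is allowed to know about users."""

    def get_users(self, user_ids: list) -> list: ...


class InMemoryUserStore:
    # One possible backend; a MongoDB or remote-service backend would expose
    # the same method and could be swapped in without changing RoomManager.
    def __init__(self, users: dict):
        self._users = dict(users)

    def get_users(self, user_ids: list) -> list:
        return [self._users[uid] for uid in user_ids if uid in self._users]


class RoomManager:
    def __init__(self, users: UserStore):
        self._users = users
        self._members = {}  # room name -> list of user ids

    def join(self, room: str, user_id: str) -> None:
        self._members.setdefault(room, []).append(user_id)

    def listeners(self, room: str) -> list:
        return self._users.get_users(self._members.get(room, []))


store = InMemoryUserStore({"u1": {"name": "alice"}, "u2": {"name": "bob"}})
rooms = RoomManager(store)
rooms.join("Coding Soundtrack", "u1")
rooms.join("Coding Soundtrack", "u2")
```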
Use a good idea again instead of generalizing it
At Labmeeting, we had to extract author names from PDFs. We realized that we could do decently well at extracting the names using machine learning techniques, but never perfectly. However, by indexing a gazette, a complete database of every possible PDF, we could simply make some guesses (possibly using machine learning) and then look up those guesses in the gazette to see if there was a match. It becomes a problem of efficient enumeration. We didn't generalize the idea; we used it again in a slightly different context. Each PDF has a scientific abstract with various complicated terms from biology and physics. We wanted to identify those important terms to allow further exploration. Again, some indicators could point us in the right direction, but we did not get everything. So we crawled Wikipedia to compile a gazette of biological terms, then simply picked out the terms in each abstract that appear in the gazette, excluding very frequent words like DNA. This was again highly accurate. We linked these extracted entities to Wikipedia to provide further information for the curious.
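The gazette lookup for abstract terms can be sketched as a set membership test. This is a simplified illustration that handles only single-word terms; the gazette contents, the stoplist, and the tokenizing regex are assumptions.

```python
import re

STOPLIST = {"dna"}  # very frequent terms that are too common to be useful


def extract_terms(abstract: str, gazette: set, stoplist: set = STOPLIST) -> list:
    """Return gazette terms found in the abstract, in order of first appearance."""
    words = re.findall(r"[A-Za-z][A-Za-z0-9-]*", abstract.lower())
    found = []
    for word in words:
        if word in gazette and word not in stoplist and word not in found:
            found.append(word)
    return found


gazette = {"ribosome", "dna", "kinase"}
abstract = "The ribosome binds DNA near a kinase; the ribosome then stalls."
terms = extract_terms(abstract, gazette)
```

No model has to decide whether a word is a biological term; the gazette already enumerates the valid answers, so the work reduces to lookups.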
Handle all the cases
Handle normal and worst cases separately as a rule
At Labmeeting, we analyzed PDFs to extract the title, publication date, and other information. The special case of an encrypted, unparseable PDF from which no text can be extracted went straight to a separate method. This case could possibly be handled by a more general-purpose text extraction algorithm that happens to produce the right answer for it, but it is more straightforward to handle it separately. Anyone reading the code could see it plainly, rather than having to think through the special case inside more complicated parsing code.
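In code, the separation can be as plain as an early branch. The sketch below uses hypothetical names and a dict standing in for a PDF; the real parsing logic was of course far more involved.

```python
def handle_unparseable(pdf: dict) -> dict:
    # Worst case: nothing can be extracted, so record that fact and stop.
    return {"title": None, "needs_manual_entry": True}


def parse_text(text: str) -> dict:
    # Normal case (greatly simplified): treat the first non-empty line as the title.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return {"title": lines[0] if lines else None, "needs_manual_entry": False}


def extract_metadata(pdf: dict) -> dict:
    # The special case is visible at the top instead of buried in the parser.
    if pdf.get("encrypted") or not pdf.get("text"):
        return handle_unparseable(pdf)
    return parse_text(pdf["text"])


locked = extract_metadata({"encrypted": True, "text": ""})
normal = extract_metadata({"encrypted": False, "text": "\nA Paper Title\nBody text"})
```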
Split resources in a fixed way if in doubt
At Labmeeting, we put the database index on a separate machine from the Solr search index. We had millions of search queries coming into the search system, and we didn't want those queries to slow down the database, and thus the normal operation of the site. Writes take much longer than reads, and are more important for logged-in users. External users of the site using the search engine just hit the index, performing exclusively reads on the index. This allowed us to scale up the search index independently from the database.
Use static analysis if you can
At Stickybits, before every commit, I had a version of PyFlakes run on all of my new code. PyFlakes is a static analysis tool for Python that finds common errors that can be detected before run time. For example, PyFlakes can find an improper number of arguments to a function call and references to variable names that are not in scope (like typos). Static analysis finds a lot of bugs that might otherwise appear in production only rarely, in edge cases. It is most useful in a dynamic language like Python, which doesn't have many of the safety features of a statically typed language.
Dynamic translation from a convenient representation to one that can be quickly interpreted
At Stickybits, we filtered through Twitter comments that were automatically added to certain barcodes based on the name of the product attached to the barcodes. The Twitter comments were very noisy and usually ridiculous or nonsensical. We created a small library of methods in Python that defined a mini domain-specific functional language for filtering tweet content, to separate the wheat from the chaff. It was still fully Python, but we strictly used a small library of functions that described the parameters and passed information to one another, all interpreted by one master function. It compiled to Python bytecode, of course, so it was fast enough to filter tweets in real time.
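A mini filtering language of this kind can be built from a handful of combinators. The sketch below is only an assumption about what such a library could look like; none of the function names come from the original code.

```python
def contains(word):
    word = word.lower()
    return lambda tweet: word in tweet.lower()


def min_words(count):
    return lambda tweet: len(tweet.split()) >= count


def all_of(*rules):
    return lambda tweet: all(rule(tweet) for rule in rules)


def any_of(*rules):
    return lambda tweet: any(rule(tweet) for rule in rules)


def negate(rule):
    return lambda tweet: not rule(tweet)


def run_filter(rule, tweets):
    """The single master function that applies a composed rule to tweets."""
    return [tweet for tweet in tweets if rule(tweet)]


# A rule reads like a small declarative program, but it is plain Python.
rule = all_of(contains("coffee"), min_words(4), negate(contains("lol")))
tweets = [
    "coffee lol lol lol",
    "This coffee tastes surprisingly good",
    "coffee!",
]
kept = run_filter(rule, tweets)
```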
Cache answers to expensive computations
Obviously, we did this all of the time at Labmeeting. For example, when we showed an individual paper, we performed a document similarity search to find "Related Papers" that a scientist might also want to read. The vector computation and search for this is quite expensive, so we cached the results for a month. Another example: we had to open a PDF file containing a research publication, perform text extraction, and then run an information extraction step on the text to determine the title, authors, publication date, and other information. This is a difficult problem that involves searching a gazette of 30 million documents and querying the PubMed database at least once. Once this process was completed for a PDF, we saved the result to the paper metadata so that we would not have to calculate it again for that PDF. The flip side of caching is that, for quickly changing data, one needs to be careful about cache invalidation.
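A cache with an expiry time captures both halves of this: reuse of expensive results and a bound on staleness. This is a minimal in-process sketch; the class name `TTLCache` and the injectable clock are assumptions, and a production system would more likely use an external cache.

```python
import time

ONE_MONTH_SECONDS = 30 * 24 * 60 * 60


class TTLCache:
    """Cache values for a fixed time-to-live, recomputing after expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (time stored, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = compute()        # the expensive step runs only on a miss
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key):
        # Explicit invalidation for data that changes before the TTL runs out.
        self._entries.pop(key, None)


related_cache = TTLCache(ONE_MONTH_SECONDS)
related = related_cache.get_or_compute("paper-1", lambda: ["paper-7", "paper-9"])
```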
When in doubt, use brute force
We wanted to get the first version of Turntable finished very quickly. There are many ways to optimize a system to improve performance, but they come at the cost of decreasing modularity, making more assumptions, and, most directly costly, developer time. The first implementation of the room search algorithm for Turntable used a brute-force linear search: it pulled the name of every room and then searched each one for the substring. This takes a few minutes to implement and does not require a complicated, separately hosted index. Moreover, for the small number of rooms used during testing, it ends up being much faster than doing a network query to a search index!
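The brute-force search really is only a few lines. In this sketch, `search_rooms` and the case-insensitive matching are assumptions; the original may have matched differently.

```python
def search_rooms(room_names, query):
    """Linear scan: check every room name for the query substring."""
    needle = query.lower()
    return [name for name in room_names if needle in name.lower()]


room_names = ["Coding Soundtrack", "Indie While You Work", "Chillout Tent"]
matches = search_rooms(room_names, "co")
```

The cost grows linearly with the number of rooms, which is acceptable until the room count makes it a measurable bottleneck.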
Compute in background when possible
After a user uploads a song to Turntable, a process must go through the song, analyze it and normalize it, perhaps convert it to mp3, extract the metadata, and deduplicate it. This process can take a while, so we keep it from blocking the web server by pushing it onto a queue. When the job completes, a message is sent back through the chat server to the user, which adds the song to the person's queue. We used a similar process for PDF analysis at Labmeeting.
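The pattern can be sketched with the standard library's `queue` and `threading` modules. This is a single-process stand-in, assuming a worker thread in place of a real job queue and a plain list in place of the chat server; `process_song` and `handle_upload` are hypothetical names.

```python
import queue
import threading

jobs = queue.Queue()
notifications = []  # stand-in for messages pushed back through the chat server


def process_song(song):
    # Placeholder for analysis, normalization, conversion, metadata, dedup.
    return {"title": song["filename"].rsplit(".", 1)[0], "owner": song["owner"]}


def worker():
    while True:
        song = jobs.get()
        if song is None:  # sentinel value: shut the worker down
            jobs.task_done()
            break
        notifications.append(process_song(song))
        jobs.task_done()


def handle_upload(song):
    # The web request only enqueues the work and returns immediately.
    jobs.put(song)
    return "accepted"


thread = threading.Thread(target=worker, daemon=True)
thread.start()
status = handle_upload({"filename": "track.wav", "owner": "alice"})
jobs.join()      # wait here only so the example output is deterministic
jobs.put(None)
thread.join()
```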
Use Batch Processing if possible
The analytics system we used at Stickybits used batch processing to load or reload the whole dataset of user interactions into the data warehouse. Once the data was housed there, of course, the system only loaded newly updated values incrementally, in small blocks of 10. If we had loaded the original data in blocks of 10, however, it would have taken hours to load all of the previous data. Batch processing speeds up this initial process.
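The effect of block size is easy to see by counting writes. The sketch below assumes that each write to the warehouse carries a fixed overhead, which is why fewer, larger writes win for the initial load; `chunked` and `load` are hypothetical helpers.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


def load(rows, write_block, block_size):
    """Write rows in blocks and return how many writes were issued."""
    writes = 0
    for block in chunked(rows, block_size):
        write_block(block)
        writes += 1
    return writes


warehouse = []
incremental_writes = load(range(1000), warehouse.extend, block_size=10)
batch_writes = load(range(1000), warehouse.extend, block_size=1000)
```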
Shed load to control demand.
Turntable currently still has a limit on the number of people that can fit into a room. More than 200 people in a room starts to degrade performance. Python is not the fastest language, and the number of message deliveries scales as O(N^2) if each of N users sends a message, since every message goes to all N users in the room. The entire backend uses Python and originally used a synchronous version of pymongo with blocking behavior. Limiting room size allows each room to be hosted on separate resources, with the main bottleneck being these real-time large-room interactions.
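A room cap and the quadratic fan-out behind it can be sketched as follows. The `Room` class and `RoomFull` exception are hypothetical, and the default capacity of 200 is only borrowed from the figure mentioned above.

```python
class RoomFull(Exception):
    pass


class Room:
    def __init__(self, capacity=200):
        self.capacity = capacity
        self.members = set()

    def join(self, user_id):
        # Shed load: refuse new members rather than degrade the room for everyone.
        if user_id not in self.members and len(self.members) >= self.capacity:
            raise RoomFull(f"room is limited to {self.capacity} people")
        self.members.add(user_id)


def deliveries_per_round(n_users):
    # If each of N users sends one message and every message is delivered
    # to all N users, the server performs N * N deliveries.
    return n_users * n_users


at_limit = deliveries_per_round(200)
doubled = deliveries_per_round(400)
```

Doubling the room size quadruples the deliveries, which is why a hard cap per room is a simpler control than trying to optimize the fan-out itself.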