How tweets end up being news. System Design of Twitter (or 𝕏).


Twitter is a micro-blogging platform that has been around for almost as long as Facebook, yet it has stayed relevant without completely changing its character. It is still pretty much the same app for the average user, and in this article I am going to talk about how it manages that from a design perspective.

Before reading the rest of the article, I want readers to know that this is not the exact architecture of Twitter (the code is open source and anyone can view it). System Design is more about discussing the tradeoffs of one design solution over another.

Functional Requirements

So first, let's establish what we are aiming to build. Any person can make an account on Twitter and follow other people. Most people don't have a lot of followers, including me (this is your sign to follow me), but some people are quite popular, which means everyone would like to see what they are tweeting.

From this, we can derive the three most fundamental functions of Twitter :

  1. Users can follow other users.

  2. Users can post content (or tweet).

  3. Users get a news feed based on who they are following and what they are interested in.
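The first two requirements map naturally onto a relational schema. Here is a minimal sketch in Python using an in-memory SQLite database; the table and column names are my own, not Twitter's actual schema, and the feed query is the naive unranked version:

```python
import sqlite3

# Illustrative schema for requirements 1 and 2: users, follow edges, tweets.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users   (user_id INTEGER PRIMARY KEY, handle TEXT);
CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
CREATE TABLE tweets  (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
""")

db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
db.execute("INSERT INTO follows VALUES (1, 2)")           # alice follows bob
db.execute("INSERT INTO tweets VALUES (10, 2, 'hello')")  # bob tweets

# Requirement 3, in its simplest form: tweets from everyone the user follows.
feed = db.execute("""
    SELECT t.text FROM tweets t
    JOIN follows f ON f.followee_id = t.user_id
    WHERE f.follower_id = ?
""", (1,)).fetchall()
print(feed)  # [('hello',)]
```

The real feed is ranked rather than a plain join, which is exactly where the complexity discussed below comes in.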

The first two requirements are pretty basic and easy to implement. But the third one, where we have to create a curated news feed for every user, is going to be challenging, as we need to decide how to rank tweets and do that for every single user. So there's a machine learning algorithm involved there. We aren't going to dive deep into what that algorithm looks like.

This can look complicated at first, but if you are a student just preparing for interviews, there's no need to worry about the algorithm: interviewers are looking for something simple, even though that sounds counterintuitive.

Since the platform has now started allowing regular users to post media as well, we are going to look at how that works and we will include that in our design as well.

Capacity Estimation

As of October 2023 (which is when I am writing this), Twitter has approximately 500 million monthly active users. Assuming around 40% of them use the platform daily we get to around 200 million daily active users (this is something you should clarify with your interviewer). Assuming 25% of them tweet at least once a day (again something to clarify) we get to an estimate of 50 million tweets a day. An average user will read 100 tweets in a day (CLARIFY CLARIFY CLARIFY).

Most tweets are just text. Including the metadata, we get to a figure close to 1KB of data per tweet. But some tweets include media, as we discussed earlier, which can be reasonably large: close to 10MB or more. Averaging the two cases, we will assume 1MB of data per tweet.

Doing the math for the data read and written per day:

1MB X 100 tweets X 200 million daily active users = 20PB (petabytes) of data read every day.

1MB X 50 million daily tweets = 50TB (terabytes) of data written every day.
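The arithmetic above can be sanity-checked in a few lines. All the inputs are the estimates from this section (which, again, you should clarify with your interviewer):

```python
MB, TB, PB = 10**6, 10**12, 10**15

daily_active_users = 200_000_000  # 40% of 500M monthly active users
tweets_per_day     = 50_000_000   # 25% of DAU tweeting once a day
reads_per_user     = 100          # tweets an average user reads per day
avg_tweet_size     = 1 * MB       # averaged over text-only and media tweets

daily_read  = avg_tweet_size * reads_per_user * daily_active_users
daily_write = avg_tweet_size * tweets_per_day

print(daily_read / PB, "PB read per day")    # 20.0 PB read per day
print(daily_write / TB, "TB written per day")  # 50.0 TB written per day
```

Note the read volume is roughly 400x the write volume, which is what makes this a read-heavy system.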

High-Level Design

Since we know this is going to be a read-heavy system, we start off with a load balancer to distribute our requests across different servers. We are using a relational database for storing our tweets and a graph database for storing our user collection, so we can track follower relationships easily.

It's natural to question my choice of a relational database over a NoSQL one, which everyone seems to love for some reason. Tweets involve a lot of join operations, and those are quite fast on relational setups. As for scaling the database, that is more manageable with an RDBMS than it sounds, as we will see in the sharding discussion below.

A read-heavy system can perform much better with a cache so let's use one here.
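The standard way to use a cache here is the cache-aside pattern: check the cache first, fall back to the database on a miss, then populate the cache. A minimal sketch, where plain dicts stand in for the tweets table and for something like Redis or Memcached:

```python
db    = {42: "my first tweet"}  # stand-in for the tweets table
cache = {}                      # stand-in for the cache layer

def get_tweet(tweet_id):
    if tweet_id in cache:      # cache hit: no database round trip
        return cache[tweet_id]
    tweet = db[tweet_id]       # cache miss: read from the database
    cache[tweet_id] = tweet    # populate the cache for next time
    return tweet

get_tweet(42)         # first call misses and loads from the db
print(get_tweet(42))  # second call is served from the cache
```

A real deployment would also set an eviction policy (e.g. LRU) and a TTL, but the read path looks the same.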

We also need to store media, which requires an object storage solution, since our database is not the best option for storing media. We can use a service like AWS S3 or Google Cloud Storage.

We know that tweets including media are usually the ones more likely to go viral, and this virality is often tied to geography: a tweet that goes viral in India most likely originated in India, so serving all of that media through our app servers from object storage would put a lot of extra load on them. We can opt for a Content Delivery Network (or CDN), which is a geographically distributed group of servers that caches content close to end users. A CDN allows for the quick transfer of assets needed for loading Internet content, including HTML pages, JavaScript files, stylesheets, images, and videos.

Design Details

Since we are using a CDN for handling media requests we need a tweet request to be made up of two separate requests - one for the text content and the other for the media content.

We also need to ensure authentication, and we have two options: an authentication token passed along with the request, or passing the user ID in the request itself.

createTweet(text, media, userId);

getFeed(userId); is also going to be another API we create for our client. Selecting which tweets to show on a person's home page (or explore page) is an important decision, and we will leave that to a ranking algorithm, which may rank a tweet based on its proximity to the user's recent activity and how many of the people they follow have interacted with it. Retweets usually find a spot; likes and comments also impact a tweet's likelihood of being featured in someone's feed.
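To make the ranking idea concrete, here is a toy scoring function over the signals just mentioned. The weights and the time decay are entirely made up for illustration; a real system would learn them from data rather than hand-tune them:

```python
import math

def score(tweet, now):
    # Hypothetical hand-tuned weights over the signals in the text:
    # retweets, likes, comments, and interactions by people the user follows.
    engagement = (2.0 * tweet["retweets"]
                  + 1.0 * tweet["likes"]
                  + 1.5 * tweet["comments"]
                  + 3.0 * tweet["followed_interactions"])
    age_hours = (now - tweet["created_at"]) / 3600
    return engagement * math.exp(-age_hours / 24)  # newer tweets decay less

tweets = [
    {"id": 1, "retweets": 5, "likes": 50, "comments": 3,
     "followed_interactions": 0, "created_at": 0},
    {"id": 2, "retweets": 0, "likes": 10, "comments": 1,
     "followed_interactions": 4, "created_at": 80_000},
]
feed = sorted(tweets, key=lambda t: score(t, now=86_400), reverse=True)
print([t["id"] for t in feed])  # [1, 2]
```

getFeed(userId) would run something like this over a candidate set of tweets and return the top N.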

Database Scaling

As I mentioned before, I have decided to go with a relational database for our design. Relational databases are quite easy to scale with the help of sharding. The word "shard" means "a small part of a whole", so sharding means dividing a larger part into smaller parts. In a DBMS, sharding is a type of database partitioning in which a large database is divided into smaller partitions spread across different nodes. These shards are not only smaller but also faster, and hence more easily manageable.

Which parameter we use for sharding is crucial. The obvious choice here is tweetId, but it's not the most efficient one. Just like in the real world, social circles online can get quite cliquey, and tweets shared within a Twitter circle can be largely irrelevant to users outside it. Since a user's reads mostly touch the people they follow, a better key to partition our database on is userId.
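Routing by userId can be as simple as hashing the ID onto a fixed number of shards. A minimal sketch (the shard count is illustrative; real deployments size it from capacity planning, and often use consistent hashing so shards can be added without remapping everything):

```python
import hashlib

NUM_SHARDS = 4  # illustrative; chosen via capacity planning in practice

def shard_for(user_id):
    # Hash the userId so users spread evenly across shards. All of one
    # user's tweets land on the same shard, keeping their reads local.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for(12345))  # always routes this user to the same shard
```

The tradeoff versus sharding by tweetId is that a single very popular user makes their shard hot, which is why real systems layer caching and replication on top.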

We have a read-heavy system, so we can save costs with read-only replicas. We can replicate data from the write partitions to these replicas asynchronously, as temporary inconsistency is not a dealbreaker for something like Twitter.
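The read/write split amounts to a small routing rule in front of the database: writes go to the primary, reads round-robin across replicas. A sketch, with strings standing in for actual connection handles:

```python
import itertools

primary  = "primary"
replicas = ["replica-1", "replica-2"]
_rr = itertools.cycle(replicas)  # round-robin across read replicas

def route(query):
    # Writes must hit the primary; reads can tolerate replica lag,
    # which is the asynchronous-replication tradeoff described above.
    if query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
        return primary
    return next(_rr)

print(route("INSERT INTO tweets VALUES (1, 2, 'hi')"))  # primary
print(route("SELECT * FROM tweets"))                    # replica-1
print(route("SELECT * FROM tweets"))                    # replica-2
```

Managed databases (e.g. read replicas on AWS RDS) do this routing for you, but the principle is the same.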

Enjoyed this article? Follow me on Twitter (or 𝕏).
