Building a social inbox has never been simpler. MongoDB can scale write-intensive, robust data back ends on demand. A case in point is Mongologue, a PHP framework on MongoDB developed by Suyati: a lightweight micro-blogging clone built on MongoDB. Join our webinar to understand how to use Mongologue to build the back end of a micro-blogging network.
In this webinar we would be discussing the following:
Thank you for joining our webinar on Building Social Inbox with MongoDB. I think we'll wait a few more minutes before we start, to give last-minute participants some more time to log on.
Good morning to everyone present here!
Welcome to our webinar on Building Social Inbox with MongoDB. I'm Muktha Ashok, and I'll be your moderator for today's webinar.
How to use the GoToWebinar interface:
Before we begin, let's review a few pointers on how we can interact during the webinar. We are now looking at the attendee interface of the webinar. The viewer window on the left lets you see everything that we'll be sharing on our screen, and the control panel on the right is how you can participate in today's event.
We would also like to hear from you, so please do ask questions and share your comments during the webinar. Now, this option is located on the left tab. We’ll be reviewing your questions as they come in and will take them up in the Q & A session towards the end. Also, the webinar is being recorded and everyone will receive an email with a link to view the recording of today’s event. Finally, I want to ask you to please complete a very short feedback form that would be online after the webinar. It won't take long and your feedback is very important to us.
About Suyati Technologies:
As we move on to the webinar, I would like to introduce you to our company, Suyati Technologies. Suyati focuses on developing niche IT solutions and services including CMS, CRM and e-commerce. We are an Ektron-featured implementation partner, a Microsoft Gold partner, and a Salesforce AppExchange partner with extensive experience in .NET, open source and mobile app technologies.
Speaking about our expertise in FOSS technologies, we have 5+ years of experience in the design, development, and deployment of complex open source solutions. We have delivered a gamut of successful projects, including a conference app, a job placement portal, and an amusement park management system.
Featured speaker for today:
Now at this point, I would like to introduce you to our featured speaker, Krishnanunni, a technical architect at Suyati Technologies. His focus areas include real-time analysis of web data and scalable web applications. He is a frequent speaker at free and open source meets, and he has conducted workshops on mining data from Twitter and on rapid application development using Python. Please feel free to post questions, follow us, or even tag us. You can use the hashtags and handles that you see on the screen.
Agenda for this webinar:
Now I will walk you very quickly through the agenda of the day.
We'll start off with an introduction to MongoDB, and then speak about social inboxes and the different types of social inboxes, followed by the challenges in building them. Then we'll cover patterns for building social inboxes and implement a write-intensive social inbox. Finally, we'll conclude with an introduction to Mongologue.
Now, I would like to pass on to Krishnanunni.
I'll quickly start my session with an introduction to MongoDB. Before that, I would like to thank the moderator for the very generous introduction she gave me.
So, about MongoDB: MongoDB derives its name from the word "humongous". As the name signifies, MongoDB is all about handling large volumes of data. It's oriented towards performance, availability, and automatic scaling. Those are the key points that make MongoDB a spectacular choice if you want to write an application that needs to be scalable and needs to work with a lot of data. You can find MongoDB for all platforms at mongodb.org/downloads
It is released as open source under the GNU AGPL license. You can also try out MongoDB online at try.mongodb.org. You can find extensive documentation on MongoDB on their website: docs.mongodb.org/manual
There is also a lot of community support on all the major platforms, user groups on Google, etc.
MongoDB does not work like traditional relational databases. It stores all its data in the form of documents, which are structured to look like JSON objects even though they are stored as BSON, or binary JSON. You can see the example here: one document about a user, where we have a name, age, status, groups, etc. They are all field/value pairs that go into a JSON-like document. We can also have a field whose value is a document embedded within another document. So MongoDB has no restriction. It's all left to your imagination. There is no fixed schema for how the documents should be structured or what goes into a collection. It all depends on the data that you have. You put it in as documents as and how you need it.
MongoDB stores all documents in collections, which are logical groups of documents. You can relate them to tables in relational databases. If you relate it that way, then the collections would be tables, the documents would be records, and each entry within a document would be a field. So a collection can be a group of users, a group of letters, a group of anything that can be logically grouped together. Again, there is no restriction that each document in a collection should be of the same format. They can differ in format, again as and how the application requires.
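To make the document model concrete, here is a minimal sketch in plain Python, where a dict stands in for a document and a list stands in for a collection. The field values, the embedded `address` document, and the second user are illustrative assumptions, not the data shown on the webinar slide.

```python
# A user document: field/value pairs, one of which is an embedded document.
user = {
    "name": "sue",
    "age": 26,
    "status": "A",
    "groups": ["news", "sports"],
    "address": {"city": "Kochi", "country": "India"},  # embedded document (assumed)
}

# A collection is a logical group of documents. Nothing forces the
# documents to share a format: the second one has fewer fields.
users = [
    user,
    {"name": "john", "age": 24},
]

# Relational analogy: collection ~ table, document ~ record, key ~ field.
names = [doc["name"] for doc in users]
# Missing fields must be handled by the application, since there is no schema.
cities = [doc.get("address", {}).get("city") for doc in users]

print(names)
print(cities)
```

The second lookup shows the price of schema flexibility: the reading code has to tolerate documents that lack a field.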
Topic for today:
Now, coming to today's major topic: building social inboxes. I would like to quickly speak about what a social inbox is. We've all seen social inboxes. We use examples of them every day: our email, Twitter, the Facebook newsfeed and so on are all social inboxes. Basically, it is an inbox where the posts of all the people in your social graph, all the data that they pump into the system, show up. It can be a Twitter newsfeed. It can be your email. It can be your Skype chat. It can be anything. So now you see a lot of applications that revolve around social inboxes. There are a lot of social networking applications and mobile apps available nowadays. This session is about how to build a very scalable and very quick implementation of a social inbox using MongoDB.
There are two major considerations when building social inboxes:
One is the write and read requests that go into creating a social inbox.
The other is the sheer volume of data that is pumped into the inbox. Since all of us use Facebook and Twitter, you can imagine the kind of volume that social networks have to deal with.
So performance and volume are major concerns while designing a social inbox.
We also need to keep a very real-time user experience. That means if a friend of yours creates a post in the United States of America, you need to get it in China as soon as possible. So the real-time experience has to be maintained. These are the major considerations when you create a social inbox. Let's examine what a user would do in a system that has a social inbox, let's say any social media. He creates posts, he reads timelines, he follows other users, he gets followed by other users, he needs to read his posts, etc.
All of this constitutes a lot of data, and a lot of operations where you move data from one place to another so that it is available to everybody. And again, keep in mind that it has to be done in as close to real time as possible without affecting the performance of the system at all.
More details on Inbox:
So for an inbox we have, like I said, real-time reads and writes. We also have sorting of posts: the most recent posts go at the very top, or you might want to sort by the friends who posted them, so there's a lot of sorting going on. There's also a lot of paging and delivery. Let's say you have 100 friends and all of them post at the same time. You cannot have all that data pumped into your inbox at once. You need to page it so that it is readable for the user, keeping in mind that somebody is reading the inbox and interacting with it in real time. So you need to be able to page it and deliver it in a timely manner.
Just to give you an idea of what we are talking about, an average tweet on Twitter makes it to about 254 inboxes. That means that on a network like Twitter, a user has on average 254 followers who need to be able to read each post. So any tweet that you make will be delivered to 254 inboxes at the same time. That is just to give an idea of the scale that we'll be working with.
The myth of normalized data:
In a situation like this, we'll have to throw out all our legacy ideas about data: how normalized data will help us, how duplication is a bad thing, etc. We'll have to throw them out the window to make all this performance possible. We need to introduce a lot of duplication into the data for performance. That means you need multiple copies of the same data so that you can deliver them to all the inboxes at the same time, in real time. And we also need a lot of parallel copies for scalability.
So that means we also need horizontal scalability within the database to make it extremely scalable and high-performing.
So some schema designs:
There are some techniques that MongoDB talks about, and that have been debated online, for designing a schema for such an inbox. There are three of them:
One is fan-out on read, one is fan-out on write, and the third is fan-out on write with buckets.
Ideally, fan-out on write with buckets is the best way to handle this, but it's also a very complicated way, so we'll not go there. We'll keep our implementation around fan-out on write.
With fan-out on read, the data is kept in one place, and all the people who read a particular post have to come to that place, fetch it, and read it.
The good thing about this is that sending a message is very efficient, because you keep it in one place. There's no fan-out. The read is the worst. With that implementation, every user has to find one particular record out of all the records there are, which might be on any server of your system. Since there is only one copy, all queries have to be routed to that one copy. So the performance for reading the post is the worst. The data size is the best, because the message is stored only once. But let's check fan-out on write.
With fan-out on write, when you create a post, a copy is kept for each user, for each inbox. That makes the read very efficient, because each inbox has easy availability of that particular post. The write is also efficient, because it can go to a single shard, one instance of MongoDB that holds a small subset of the whole database. We'll talk about shards in a bit more detail later.
So the write mechanism is also good. The data size, unfortunately, is the worst, because there is one copy for each recipient. Unfortunately, that is one of the things that we'll have to live with for high performance. For all social inboxes, read and write performance are very important, so we have to go with an implementation that gives more or less the same performance for reading and writing. That is the reason we have chosen fan-out on write.
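The trade-off between the two designs can be sketched in a few lines of plain Python. This is only an illustration with in-memory lists in place of MongoDB collections; the `followers` and `following` maps and the function names are hypothetical.

```python
# Hypothetical social graph: Doc has two followers.
followers = {"doc": ["john", "james"]}            # author -> followers
following = {"john": ["doc"], "james": ["doc"]}   # reader -> authors followed

# Fan-out on read: the post is stored once; every reader has to
# filter the shared collection at read time.
posts = [{"from": "doc", "message": "Hello"}]

def read_timeline_fan_out_on_read(reader):
    return [p for p in posts if p["from"] in following.get(reader, [])]

# Fan-out on write: one copy per recipient is written at post time,
# so a read is a direct match on the "to" key.
inbox = []

def send_fan_out_on_write(post):
    for follower in followers.get(post["from"], []):
        inbox.append({**post, "to": follower})

def read_timeline_fan_out_on_write(reader):
    return [p for p in inbox if p["to"] == reader]

send_fan_out_on_write(posts[0])
print(len(posts), "stored copy with fan-out on read")
print(len(inbox), "stored copies with fan-out on write")
```

Both reads return the same message for John; the difference is where the work and the storage go. With the average of 254 followers mentioned earlier, fan-out on write would store 254 copies of each post.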
How to get started:
So let’s get started with some code.
Inserting users to a collection
We are going to create a user called John. John is 24 years old. We will add John to one of our collections. If you don't already have the collection, Mongo will create one for you. Once the collection is created, the user is inserted into it.
We then try to find all users in that particular collection. We have one object in our users collection. Mongo assigns an object ID to that particular document, so you can see a very long hexadecimal value there (an ObjectId is displayed as 24 hexadecimal characters).
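The commands shown during the demo are not part of this transcript. As a rough stand-in, the following plain-Python sketch mimics the behaviour described: the collection is created on first insert, and each document receives an identifier. The `insert` and `find` helpers are hypothetical, and the random 24-character hex string only imitates the look of a real ObjectId.

```python
import secrets

db = {}  # collection name -> list of documents

def insert(collection, document):
    """Insert a document, creating the collection if it does not exist yet."""
    stored = {"_id": secrets.token_hex(12), **document}  # stand-in for an ObjectId
    db.setdefault(collection, []).append(stored)
    return stored["_id"]

def find(collection, query=None):
    """Return every document whose fields match all key/value pairs in query."""
    query = query or {}
    return [
        doc for doc in db.get(collection, [])
        if all(doc.get(key) == value for key, value in query.items())
    ]

insert("users", {"name": "John", "age": 24})
print(find("users"))
```

With the PyMongo driver, the corresponding calls on a real collection would be `insert_one` and `find`.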
Let's talk about what our database schema would be for a social inbox. Like I said, a user creates posts, reads timelines, follows other users, and gets followed by users. This is just a reminder of all the things that a user does. To structure a user document we need specific information: we need to know the name of the user, or the handle of the user. We also need to know who all the followers are, and also who he's following.
So let’s create another user:
Creating a user, and more
Let’s say that John is following Doc.
So, we've created that user document. Now let's try and save this. So that user was successfully saved. Let's see all the users in that collection right now. As you can see, we have 2 objects in the collection, 2 documents in the users collection. One is John and the other is Doc.
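A user document along these lines might look as follows. This is a sketch in plain Python; the `followers` and `following` field names and the `follow` helper are assumptions based on the description, not the exact documents from the demo.

```python
# Two user documents; the shapes differ slightly (Doc has no age),
# which MongoDB allows.
users = [
    {"name": "John", "age": 24, "followers": [], "following": []},
    {"name": "Doc", "followers": [], "following": []},
]

def follow(users, follower, followee):
    """Record a follow relationship on both user documents."""
    by_name = {user["name"]: user for user in users}
    if followee not in by_name[follower]["following"]:
        by_name[follower]["following"].append(followee)
    if follower not in by_name[followee]["followers"]:
        by_name[followee]["followers"].append(follower)

follow(users, "John", "Doc")  # John is following Doc
for user in users:
    print(user)
```

Keeping the follower list inside the author's document is what later lets a post be fanned out without a join.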
Mongo has no restrictions on what the structure of the document should be.
Now let’s think about a post. What makes a post?
We need to know at least who created the post. That would be the 'from' field. You need to know on what date and at what time this particular post was created, so you have a 'date' field.
We would also like to add a message: "Hello".
So let’s create a post:
And we will save the newly created post into the posts collection. That has been successfully inserted now. Let's try and retrieve it.
We can now see that the post is created. It was created on this particular date; we have the timestamp there with the date and the time. We can also see the message.
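A post with these three fields can be sketched like this, again with a plain Python list standing in for the posts collection. The use of a UTC timestamp is an assumption.

```python
from datetime import datetime, timezone

posts = []  # stands in for the posts collection

post = {
    "from": "John",                      # who created the post
    "date": datetime.now(timezone.utc),  # when it was created
    "message": "Hello",
}
posts.append(post)  # "save" the post

# Retrieve it again by author.
found = [p for p in posts if p["from"] == "John"]
print(found[0]["date"].isoformat(), found[0]["message"])
```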
So we've just created a post. From this post, let's try and retrieve the user who created it.
You can see how the query is formed here. We use the db object to find the one user in the users collection whose name is the name of the person who created the post.
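The query from the demo is not reproduced in the transcript. The idea, looking up one user by the post's 'from' value, could be sketched as below; `find_one` here is a hypothetical in-memory helper, although PyMongo collections offer a method of the same name that takes a similar filter.

```python
users = [{"name": "John", "age": 24}, {"name": "Doc"}]
post = {"from": "John", "message": "Hello"}

def find_one(collection, query):
    """Return the first document matching all key/value pairs, or None."""
    for document in collection:
        if all(document.get(key) == value for key, value in query.items()):
            return document
    return None

# The post only stores the author's name; the full user document
# is fetched with a second lookup.
author = find_one(users, {"name": post["from"]})
print(author)
```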
John is the person who created that particular post, so we retrieve him from the users collection. Now, when we actually create a post, we need to put a copy of it into each follower's inbox.
So let's say that Doc has created a new post, "Hello John". Doc and John follow each other. We use a loop to iterate through each follower Doc has, and for each one we set another value on a copy of the post we have created: the follower's name. For that we need another collection called 'inbox'. Let's fill John's inbox: we create a copy of the post Doc created, add another key to it called 'to', and say that John is the person who has to receive this particular post.
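The loop described here is the core of fan-out on write. A minimal sketch, assuming the user documents carry a `followers` list and using a Python list as the inbox collection (the `publish` function name is hypothetical):

```python
from datetime import datetime, timezone

users = {
    "John": {"name": "John", "followers": ["Doc"], "following": ["Doc"]},
    "Doc": {"name": "Doc", "followers": ["John"], "following": ["John"]},
}
inbox = []  # stands in for the inbox collection

def publish(author, message):
    """Create a post and write one copy per follower into the inbox."""
    post = {"from": author, "date": datetime.now(timezone.utc), "message": message}
    for follower in users[author]["followers"]:
        copy = dict(post)      # a separate copy for each recipient
        copy["to"] = follower  # mark who has to receive it
        inbox.append(copy)
    return post

publish("Doc", "Hello John")
print(inbox)
```

Each copy is an independent document, so one recipient's copy can later be stored or removed without touching the others.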
We can see 'from', 'date', 'message' and 'to'; then we just insert the copy into our inbox collection. All the messages that are marked to John can now be retrieved as John's messages. If you have multiple posts in that inbox, you can retrieve all of them in one go. So each time John tries to read his inbox, we can easily retrieve the messages that are marked to John without going through each post and running any other condition against it. Everything that is marked to John, which is a direct operation, can be retrieved immediately in one go and moved to John's timeline. We can also sort the posts by the date on which they were sent so that the latest ones are retrieved first. We can limit them: we can retrieve, say, 5 or 10 of them, and then page using the last retrieved one.
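Reading, sorting, limiting and paging can be sketched together. The sample data, the fixed start date and the `read_inbox` helper are all illustrative assumptions; paging by the last retrieved date also assumes that no two posts for a recipient share the same timestamp.

```python
from datetime import datetime, timedelta, timezone

start = datetime(2014, 1, 1, tzinfo=timezone.utc)  # arbitrary date for the sample
inbox = [
    {"from": "Doc", "to": "John", "message": f"post {i}",
     "date": start + timedelta(minutes=i)}
    for i in range(7)
]
inbox.append({"from": "John", "to": "Doc", "message": "other", "date": start})

def read_inbox(recipient, limit, before=None):
    """Return up to `limit` posts for the recipient, newest first.

    `before` is the date of the last post on the previous page.
    """
    matches = [
        post for post in inbox
        if post["to"] == recipient and (before is None or post["date"] < before)
    ]
    matches.sort(key=lambda post: post["date"], reverse=True)
    return matches[:limit]

page1 = read_inbox("John", limit=5)
page2 = read_inbox("John", limit=5, before=page1[-1]["date"])
print([post["message"] for post in page1])
print([post["message"] for post in page2])
```

With PyMongo, the same read would typically chain `find` with a filter on 'to', `sort` on 'date' in descending order, and `limit`.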
Let's talk about shards:
So you might be wondering how this would actually improve performance, because we have only one collection where all the inboxes are going. Even in a relational database I could have just one field that says 'to' and I'd get almost the same performance. But this is where the automatic scaling capabilities of MongoDB come into play. MongoDB has the capability of automatic horizontal scaling: we can create shards of the same database, splitting the data on certain keys. So let's say we have the inbox collection; we can split it across multiple servers, or shards, as MongoDB calls them.
So let's say we shard the collection. In our case it would be the inbox collection, and you would shard it based on to whom each post was sent. You can split it on the 'to' field, the 'from' field, or even the 'date' field; you can split it on any one of the fields that you want. Just make sure that the key is available in the query, so that Mongo automatically searches only in the relevant shard and does not have to look through all the information in the rest of the shards. So if you split according to whom the post was marked, all of John's posts will be on one shard, and Mongo would know: that is the shard I have to look at. I don't have to look at the other shards, where the posts of all the other users are.
There is no filtering across shards; you go directly to that shard, making the entire search much more efficient and much quicker. That's how we leverage the automatic scaling capabilities of MongoDB and look at just one shard. This is a capability we have used in a lot of scenarios, because you can specify the shard key, the key on which all the data is split up. If you're splitting the inbox collection, all the posts that have a particular key value are moved to one shard. If we have multiple values of the same key, let's say 'to' is marked to John, to Doc, and to James, the three of them can be moved onto three separate shards: all of John's posts on one shard, all of Doc's posts on a second shard, and all of James's posts on a third shard. When a query comes in, Mongo automatically knows which shard to look at. The developer or the database administrator does not need to code this; the application does not need any logic to look at a particular shard. It is maintained automatically by the MongoDB server, which is what makes Mongo an awesome choice for this kind of application.
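To illustrate why a shard key lets a query touch only one shard, here is a toy model in plain Python. It is not how MongoDB is implemented: MongoDB splits data into chunks by key ranges or hashed keys and balances them itself, whereas this sketch simply hashes the 'to' value with CRC32 over three lists. All names are hypothetical.

```python
import zlib

NUM_SHARDS = 3
shards = [[] for _ in range(NUM_SHARDS)]  # each list stands in for one shard

def shard_for(key):
    """Map a shard-key value to a shard index (toy hash routing)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

def insert(document):
    # The 'to' field is the shard key, so all of one user's
    # inbox entries land on the same shard.
    shards[shard_for(document["to"])].append(document)

def find_for(recipient):
    # Only the one shard that can hold this recipient is searched.
    shard = shards[shard_for(recipient)]
    return [doc for doc in shard if doc["to"] == recipient]

for recipient in ("John", "Doc", "James", "John"):
    insert({"from": "someone", "to": recipient, "message": "Hello"})

print(len(find_for("John")))
```

In this toy, two recipients may hash to the same shard; the point is only that every post for a given recipient lives on exactly one shard, so a query that includes the shard key never needs to visit the others. In a real deployment the routing is done by MongoDB itself once the collection has been sharded, for example with `sh.shardCollection` in the mongo shell.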
To go further, a lot of other things would also be required to build all the features a social media app would have: groups, following, other media in the posts, comments, likes, etc. This is where one of the frameworks that we built comes into the picture: Mongologue, which is publicly available right now on GitHub. It is a framework that we've built in PHP which helps you set up your own micro-blogging site easily using MongoDB. All the hard work of sharding, managing shards, the collections, the schema, the write implementation, etc. is already done by Mongologue, and it provides you models that you can easily use.
For all the PHP developers out there who use Composer, you can add Mongologue to your application by adding it to your Composer file: you just specify Suyati/mongologue and the development version, since right now it is under development, and it will automatically come into your application. You initialize Mongologue using the Mongologue factory: you create the factory, then you create Mongologue, and you specify where your MongoDB server is, so all the sharding etc. can happen on the server. You just create shards, and Mongologue automatically knows how to work with them. Mongologue takes care of it all by itself; the code does not have to know. You specify which database is being used, and you just go ahead and add users.
You can have handles, e-mails, first name, last name and any other information you want; you just register that particular user on Mongologue. Mongologue provides all the models, like the user model, and if you go up you can see the post model. When you create a post, you have to give who created the post, the time at which it was created, and any message. If there is any category to be assigned or any other data you want, you can add it, and you can also attach other files: images, resources, links, videos, anything you want. The Mongologue post model handles that. You just create the post, and Mongologue automatically figures out all the followers to whom this particular post should be delivered and moves it into those users' inboxes. You can create groups, have users join groups, and follow other groups.
Mongologue is 100% unit tested, so if you look at the unit tests you can find the complete specifications. We also have integration tests. The tests are publicly available on GitHub as open source.
It is also available through Composer. It's released publicly and, again, it's under Travis CI, so you can always check the health of the package.
1. Which are some of the players who use MongoDB?
MongoDB recently has been used by a lot of big enterprises in the industry. CNN uses MongoDB i.e. CNN.com. Foursquare and IBIBO also use MongoDB.
2. Should I use MongoDB for banking solutions?
Not really recommended, because banking is transactional in nature and is better suited to a relational kind of data structure. MongoDB is mostly used for different types of data that change and need a lot of flexibility, like in social media.
3. Does increasing the nodes on Mongo improve the read/write?
Not really; adding nodes by itself will not improve the read and write, and there are other frameworks that do that. Increasing the number of shards on MongoDB, as has been illustrated, would really help the write and read.
4. What is MongoDB written in?
MongoDB is written in C++, and it has drivers/libraries in all the major languages: Python, PHP, C#/.NET, etc.
5. Can we convert the relational data to MongoDB?
Yes there are tools that do that. Query is one tool that does that. There are also other tools available.
6. How about associating with the ASP.NET project?
It would be a great idea, but the first thing that you need to figure out is the nature of the operations that you have to perform on the data. As mentioned earlier, banking solutions are very relational, and such applications may not be best suited to MongoDB. But for applications like social networks, recommendation engines, data stores, data harvesting, etc., it would be the best option with an ASP.NET project.
7. Why would you choose MongoDB?
Mongo has high availability and its performance is very high. Its blazing-fast speed is one of its best features. From an application design point of view, it supports automatic scaling.
8. Does MongoDB provide transaction/locking?
No, it's very lightweight in that way, so it's extremely quick.
9. How can I migrate my existing MySQL Database?
We have tools that are available on the website.
10. When do you use MongoDB over SQL?
When we have something document-oriented, that is, when we have different kinds of data and a lot of unstructured data coming in, that's one indication that you would be better off with something like MongoDB.
11. Do I need a DBA?
No, you don't need a DBA. MongoDB projects do not require one; having said that, you can still have people monitoring the availability of the DB.
Krishnanunni is working as a Technical Analyst at Suyati Technologies. His focus areas include real-time analysis of Big Data and Scalable Web Applications. He is a frequent speaker at Free and Open Source Software (FOSS) meets, and in the past, he has conducted workshops on mining data from Twitter using CouchDB and Node.js and rapid application development using Python and Qt.