You can read part 1 here.
Designing a Booking Service
In part 2 we will start designing a booking service, iterate over it, and discuss all the steps. Let’s start with some requirements. We need a back office to create hotels. Each hotel has a list of rooms. Each room has a capacity, a price, and a reservation list with the date intervals when it is booked. A hotel has an address, a city, a country, and a list of facilities like a pool, a gym, children’s playgrounds, etc. A back-office user can create and edit hotel data, create new rooms, and manage room availability, including booking cancellation. A customer can book rooms and cancel reservations.
Creating a relational database for this model is pretty straightforward: a table per entity and a foreign key for each 1-n relation. We implement the commands using database transactions, to ensure there are no race conditions between client bookings and/or back-office edits. Our architecture is a very simple monolithic app.
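A minimal sketch of that transactional booking command, using SQLite as a stand-in for the real database; the schema and column names are illustrative, not from the original:

```python
import sqlite3

# Hypothetical minimal schema: one table per entity, a foreign key per 1-n relation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hotel   (id INTEGER PRIMARY KEY, name TEXT, city TEXT, country TEXT);
CREATE TABLE room    (id INTEGER PRIMARY KEY, hotel_id INTEGER REFERENCES hotel(id),
                      capacity INTEGER, price REAL);
CREATE TABLE booking (id INTEGER PRIMARY KEY, room_id INTEGER REFERENCES room(id),
                      check_in TEXT, check_out TEXT);
""")

def book_room(conn, room_id, check_in, check_out):
    """Insert a booking only if the interval is free; the transaction is
    what prevents races between concurrent clients and back-office edits."""
    with conn:  # BEGIN ... COMMIT, or ROLLBACK on exception
        overlapping = conn.execute(
            "SELECT COUNT(*) FROM booking "
            "WHERE room_id = ? AND check_in < ? AND check_out > ?",
            (room_id, check_out, check_in)).fetchone()[0]
        if overlapping:
            raise ValueError("room already booked for that interval")
        conn.execute(
            "INSERT INTO booking (room_id, check_in, check_out) VALUES (?, ?, ?)",
            (room_id, check_in, check_out))
```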
Of course this application gets a lot of traction, and soon the database has hundreds of thousands of hotels, a few million rooms, and several million bookings. Meanwhile we need to implement full-text search with a fancy auto-complete suggester. Luckily we know Solr, which provides everything we need to implement a rocking search API that even includes a suggester out of the box. We only need to configure the Solr farm, define the document schema, and integrate the data. We create a new component to do the integration job; we call it the integrator.
Now we just need to develop the integrator to get data from A to B. We create a query that scans the entire database, converts the result into the Solr document format, and inserts all the data. We run it once, but it takes an hour or more to complete. We find that acceptable as a one-time data bootstrap, but now we need some way to capture the new changes that happen in SQL and propagate them to Solr; it is not acceptable to re-run the full scan every time. We can tweak the integrator with tricks like performing interval runs and adding an indexed update-date field to every entity, which allows fetching only the data that changed since the last run.
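An incremental run could look roughly like this: a sketch assuming an indexed `updated_at` column and a `push_to_solr` callable standing in for the real Solr client, all names illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotel (id INTEGER PRIMARY KEY, name TEXT, "
             "city TEXT, updated_at TEXT)")
conn.execute("CREATE INDEX idx_hotel_updated ON hotel(updated_at)")

def incremental_run(conn, last_run, push_to_solr):
    """Fetch only the rows changed since the previous run, convert them to
    the Solr document format, and push them. push_to_solr is a stand-in
    for the real Solr client call."""
    rows = conn.execute(
        "SELECT id, name, city, updated_at FROM hotel "
        "WHERE updated_at > ? ORDER BY updated_at", (last_run,)).fetchall()
    docs = [{"id": r[0], "name_t": r[1], "city_s": r[2]} for r in rows]
    if docs:
        push_to_solr(docs)
    # The newest change we saw becomes the watermark for the next run.
    return rows[-1][3] if rows else last_run
```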
Eventually we need to develop an API to calculate the filter counters for each available hotel facility. A query that performs n aggregations to count each filter over any arbitrary query is not easy to implement, especially when we need to aggregate millions of entries n times and respond in near real time. We decide to develop a custom service with an in-memory bitmap index to solve it.
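The idea behind the bitmap index can be sketched in a few lines: one big integer per facility acts as a bitset over hotel ids, so each facet count over an arbitrary candidate set is just a bitwise AND plus a popcount (the class and method names are illustrative):

```python
class BitmapIndex:
    """Toy in-memory bitmap index: one bitset (a Python int) per facility."""

    def __init__(self):
        self.facility_bits = {}  # facility name -> bitset of hotel ids

    def add(self, hotel_id, facilities):
        for f in facilities:
            self.facility_bits[f] = self.facility_bits.get(f, 0) | (1 << hotel_id)

    def facet_counts(self, candidate_bits):
        # For each facility: how many hotels in the candidate set have it?
        return {f: bin(bits & candidate_bits).count("1")
                for f, bits in self.facility_bits.items()}
```

A real implementation would use compressed bitmaps (e.g. roaring bitmaps), but the AND-then-popcount core is the same.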
We extend the integrator to export the data to the new service. We find that only a subset of the data is relevant for the bitmap index, so the original query that finds ALL changes is not ideal. We build a bundle of sophisticated abstractions in the integrator to specify scanning strategies, ensure the right indexes are created, and support multiple output models and endpoints, and everything starts to become freaking complex. The SQL database doesn’t stop evolving: new tables are created, new columns are added, and the integrator needs constant updates. The data doesn’t stop growing either; performance starts to become an issue, queries get slower, and the database suffers page fragmentation that needs maintenance. Several teams now depend on the same database to integrate data, and coordinating all that effort is a pain in the ass. At some point we decide to change the architecture and introduce a message broker to mitigate these issues.
With the message broker, the app publishes a message when a new update is available and the integrator becomes reactive. Now we have a new component in the infrastructure to maintain and a new problem to solve: ensuring the update in the SQL database AND the message publication are atomic. One option is to save the messages in the same SQL database, in the same transaction that updates the state. If the publication succeeds, the messages are deleted afterwards; if the connection to the broker fails, publication can be retried later from the messages kept in SQL. We also change the integrator to work in two modes: reacting to notifications, but also supporting a bootstrap mode to set up new consumers.
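This is the transactional outbox pattern; a minimal sketch, again with SQLite and illustrative names, and `send_to_broker` standing in for the real broker client:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE room   (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT);
""")

def update_and_publish(conn, room_id, status, payload):
    # One transaction covers both writes, so the state change and the
    # message are atomic: either both happen or neither does.
    with conn:
        conn.execute("UPDATE room SET status = ? WHERE id = ?", (status, room_id))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))

def relay_outbox(conn, send_to_broker):
    # Separate step: push pending rows to the broker and delete only on
    # success; a broker failure just means the rows are retried later.
    pending = conn.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in pending:
        send_to_broker(payload)
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
```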
The system has more components, but now the application is reactive, we have the guarantee that no message publication is lost, and we can extend the integrator when a new data model is needed. But there is still a little problem we didn’t solve. We can have two concurrent operations over the same room (imagine a booking immediately followed by a booking cancellation); because the broker message is published AFTER the data is changed in the database, the messages may be delivered out of order:
To solve this we have three options. The first is to use only the latest version of the entity: notification events merely trigger an update in the integrator, which always fetches the latest available version of the data. The second is to avoid the concurrency altogether by having only one worker processing commands for the same entity (and also only one worker on the consumer side). The third is to put a version number in every message and implement some logic in the subscriber to re-order the messages. The re-ordering needs some kind of buffering mechanism to hold newer messages while the previous ones have not arrived, or to query the source (the SQL database) when an out-of-order message is received; the latter would require the messages to be kept in the outbox until consumed.
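The buffering variant of the third option can be sketched like this: each message carries a per-entity version number, and the subscriber holds newer messages until the missing earlier ones arrive (names are illustrative):

```python
class ReorderingSubscriber:
    """Buffers out-of-order messages per entity and releases them to the
    handler strictly in version order."""

    def __init__(self, handler):
        self.handler = handler
        self.expected = {}  # entity id -> next version to deliver
        self.buffer = {}    # entity id -> {version: message}

    def receive(self, entity_id, version, message):
        self.buffer.setdefault(entity_id, {})[version] = message
        nxt = self.expected.get(entity_id, 1)
        # Drain every consecutive version we already hold.
        while nxt in self.buffer[entity_id]:
            self.handler(self.buffer[entity_id].pop(nxt))
            nxt += 1
        self.expected[entity_id] = nxt
```

A production version would also need a timeout or a fallback query to the source, so a lost message cannot stall the buffer forever.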
With the re-ordering issue solved we have all the guarantees we need: messages are not lost, they are delivered in order, the integrator is reactive, and we can plug in new models if needed. The business does not stop, and there are new requirements. Some hotels get almost fully booked for some weeks, and the business wants to feature those in the auto-complete text box. A hotel is considered featured for a given week if at least 90% of its rooms are booked. An email should automatically be sent to the hotel manager when the hotel becomes featured.
To implement it, the Hotel entity is extended to include a list of featured weeks. When a booking is created, we calculate whether the hotel reaches 90% occupation for that week. If so, a featured week is added to the list. A new component to handle the emails is added to the architecture, and a new Solr collection just for featured hotels is created.
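The occupation check itself is simple; a sketch with illustrative names (the 90% threshold is the one from the requirement):

```python
def on_booking_created(hotel, week, booked_room_ids):
    """Called when a booking is created: if the hotel reaches 90%
    occupation for the given week, record the featured week. The return
    value tells the caller whether to trigger the manager email."""
    occupation = len(set(booked_room_ids)) / len(hotel["rooms"])
    if occupation >= 0.9 and week not in hotel["featured_weeks"]:
        hotel["featured_weeks"].append(week)
        return True  # hotel just became featured
    return False
```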
This new requirement brings a subtle difference to what was being done until now. So far the entity data was simply replaced in Solr whenever a new version was available. With this requirement we need to reason about the nature of the change: we need to find out what kind of change happened, and if it has to do with a featured hotel, it is the featured collection that needs to be updated. The same goes for the email, which should be sent only if the hotel just became featured. If we use notification events (generic events that only trigger updates), the integrator needs some intelligence to understand what happened, such as comparing the current version of the Solr data with the latest version in SQL. If we use domain events with proper semantics it gets easier, because the event already carries meaning.
A domain event like HotelBecameFeatured also needs to carry the data relevant to the change, in this case the week when the hotel crossed 90% occupation. In practice we end up with two different write models: the relational model with the current state, and an event model with the list of changes pending dispatch to the broker. Since we already have to write the events to the outbox queue, it is eventually decided to keep them forever, for a very good reason: it is great for debugging. As a bonus, it gives a historical view of the state of the system and allows temporal queries if needed.
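Such an event could be a small immutable record; only the event name comes from the text, the fields are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class HotelBecameFeatured:
    """A domain event carries semantics plus the data relevant to the
    change, unlike a generic 'entity X changed' notification."""
    hotel_id: int
    week_start: date   # the week where occupation crossed 90%
    occupation: float

# Keeping every dispatched event instead of deleting it turns the outbox
# into a permanent log: useful for debugging, history, and temporal queries.
event_log = []

def append_event(event):
    event_log.append(event)
```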
We got to a reasonable solution that allows reactive behavior, some level of decoupling between producer and consumers, and all the guarantees the system needs. But this is far from the final version. Part 3 will present the final solution, based on Event Sourcing, and discuss its advantages over this one. It will also provide some guidelines for implementing Event Sourcing systems.