Monday, August 27, 2018

Client Models vs Data Transfer Objects vs Data Access Objects vs entities vs whatever you want to call your data classes

I have been working in software development for a while now, but one of the most distracting aspects of the business is that it reinvents itself almost every year. Yet, as opposed to computer science, which uses principles to build a strategy for understanding and using computers and derive other principles, in business things are mostly defined by fashion. The same concept appears again and again under different names, different languages, different frameworks, with a fanatic user base that is ready to defend it to the death from other, often very similar, fads.

That is why most of the time junior developers do not understand basic data object concepts and where they apply, try to solve a problem that has been solved before, fail, then start hating their own work. Am I the definite authority on data objects? Hell, no! I am the guy that always tries to reinvent the wheel. But I've tried so many times in this area alone that I think I've gotten some ideas right. Let me walk you through them.

The mapping to Hell


One of the most common mistakes that I've seen is the Database first vs Code first debate getting out of hand and somehow infecting an entire application. It somehow assumes there is a direct mapping between what the code wants to do and what the database does. Which is wrong, by the way. This kind of thinking is often coupled with an Object Relational Mapping framework that wants to abstract database access. When doing something simple, like a demo for your favorite ORM, everything works perfectly: The blog has a Blog entity and it has Post children and they are all saved in the database without you having to write a single line of SQL. You can even used simple objects (POCOs, POJOs and other "plain" objects that will bite you in the ass later on), with no methods attached. That immediately leads one to think they can just use the data access objects retrieved from the database as data transfer objects in their APIs, which then doesn't work in a myriad of ways. Attempting to solve every single little issue as it comes up leads to spaghetti code and a complete disintegration of responsibility separation. In the end, you just get to hate your code. And it is all for nothing.

The database first or code first approaches work just fine, but only in regards to the data layer alone. The concepts do not spread to other parts of your application. And even if I have my pet peeves with ORMs, I am not against them in principle. It is difficult not to blame them, though, when I see so many developers trying to translate the ease of working with a database through one such framework to their entire application. When Entity Framework first appeared, for example, all entity objects had to inherit from DbEntity. It was impossible to serialize such an object or use it without having to haul a lot of the EF codebase with it. After a few iterations, the framework knew how to use POCOs, simple objects holding nothing but data and no logic. EF didn't advocate using those objects to pass data around your entire application, but it made it seem easy.

What was the root problem of all that? Thinking that database access and Data Access Objects (DAOs) have a simple and direct bidirectional mapping to Data Transfer Objects (DTOs). DTOs, though, just transfer data from process to process. You might want to generalize that a database is a process and your application another and therefore DAOs are DTOs and you might even be right, but the domain for these objects is different from the domain of DTOs used to transfer data for business purposes. In both cases the idea is to limit costly database access or API calls and get everything you need in one call. Let me explain.

DAOs should be used to get data from the database in an efficient manner. For example you want to get all the posts written by a specific author inside a certain time interval. You want to have the blog entities that are associated with those posts, as well as their metadata like author biography, blog description, linked Facebook account, etc. It wouldn't do to write separate queries for each of these and then combine them in code later. API DTOs, on the other hand, are used to transfer data through the HTTP interface and even if they contain most of the data retrieved from the database, it is not the same thing.

Quick example. Let's say I want to get the posts written in the last year on a blog platform (multiple blogs) and containing a certain word. I need that because I want to display a list of titles that appear as links with the image of the author next to them. In order to get that image, I need the user account for the post author and, if it has no associated image, try to get the image from any linked social media accounts. In this case, the DAO is an array of Post objects that each have properties of type Blog and Author, with Author having a further property holding an array of SocialAccount. Of course, SocialAccount is an abstract class or interface which is implemented differently by persisted entities like FacebookAccount and TwitterAccount. In contrast, what I want as a DTO for the API that gives me the data for the list is an array of objects holding a title, a link URL and an image URL. While there is a mapping between certain properties of certain DAO objects to the properties of DTO objects, it is tiny. The reverse mapping is ridiculous.

Here are some issues that could be encountered by someone who decides they want to use the same objects for database access and API data transfer:
  • serialization issues: Post entities have a Blog property that contains a list of Post entities. When serializing the object in JSON format for API transfer a circular reference is found
  • information leaks: clients using the API get the Facebook and Twitter information of the author of the post, their real name or even the username and password for the blog platform
  • resource waste: each Post entity is way bigger than what was actually requested and the data transferred becomes huge
  • pointless code: since the data objects were not designed with a graphical interface in mind, more client code is required to clean the title name or use the correct image format

Solutions that become their own problems


Using automatic mapping between client models and DTOs is a common one. It hides the use of members from static analysis (think trying to find all the places where a certain property is used in code and missing the "magical" spots where automatic conversion is done), wastes resources by digging through class metadata to understand what it should map and how (reflection is great, but it's slow) and in the cases where someone wants a more complex mapping relationship you get to write what is essentially code in yet another language (the mapper DSL) in order to get things right. It also either introduces unnecessary bidirectional relationships or becomes unbalanced when the conversion from DAO to DTO is different from the one in the other direction.

Another is to just create your own mapping, which takes the form of orphan private methods that convert from one entity to another and become hard to maintain code when you want to add another property to your entity and forget to handle it in all that disparate code. Some try to fix this by inheriting DAOs from DTOs or the other way around. Ugh! A very common solution that is always in search of a non existent problem is adding innumerable code layers. Mapper objects appear that need to have access to both types of data objects and that are really just a wishful thinking brainless replacement of the business logic.

What if the only way to get the image for the author is to access the Facebook API? How does that change your mapping declaration?

Let me repeat this: the business data objects that you use in your user facing application are in a completely different domain (serve a different purpose) than the data objects you use in the internal workings of your code. And there are never just two layers in a modern application. This can also be traced to overengineering and reliance on heaps of frameworks, but it's something you can rarely get away from.

Don't get me wrong, it is impossible to not have code that converts from one object to another, but as opposed to writing the mapping code before you even need to use it, try to write it where you actually have need of it. It's easier to write a conversion method from Post with Blog and Author and Social Account to Title with two links and use it right where you display the data to the user than to imagine an entire fantasy realm when one entity is the reflection of the other. The useful mapping between database and business objects is the role of the business layer. You might see it as a transformation layer that uses algorithms and external services to make the conversion, but get your head out of your code and remember that when you started all of this, you wanted to create an app with a real life purpose. Technically, it is a conversion layer. Logically, it is your software, with a responsibility towards real life needs.

Example number 2: the N-tier application.


You have the Data Access Layer (DAL) which uses a set of entities to communicate with the database. Each such entity is mapped to a database table. This mapping is necessary for the working of the ORM you are using to access said database. The value of the DAL is that you can always replace it with another project that implements the same interfaces and uses the same entities. In order to abstract the specific database you use, or maybe just the communication implementation, you need two projects: the entity and interface project and the actual data code that uses those entities and accesses a specific database.

Then you have the business layer. This is not facing the user in any way, it it just an abstraction of the application logic. It uses the entities of the business domain and, smart as you are, you have built an entire different set of entities that don't (and shouldn't) care how data is accessed or stored in the database. The temptation to map them in a single place is great, since you've designed the database with the business logic in mind anyway, but you've read this blog post that told you it's wrong, so you stick to it and just work it manually. But you make one mistake: you assume that if you have a project for interfaces and DAO entities, you can put the interfaces and entities for the business layer there as well. There are business services that use the business data objects, translate DAOs into business data objects and so on. It all feels natural.

Then you have the user interface layer. The users are what matters here. What they see, what they should see, what they can do is all in this layer. You have yet another set of entities and interfaces, but this is mostly designed with the APIs and the clients that will use it in mind. There are more similarities between TypeScript entities in the client code and this layer's DTOs than between them and the business layer objects. You might even use a mapping software like Protobuff to define your entities for multiple programming languages and that's a good thing, because you actually want to transfer an object to the same type of object, just in another programming language.

It works great, it makes you proud, you get promoted, people come to your for advice, but then you need to build an automated background process to do some stuff using directly the business layer. Now you need to reference the entity and interfaces project in your new background process project and you can't do it without bringing information from both business and data layer. You are pressed for time so you add code that uses business services for their logic but also database entities for accessing data which doesn't have implemented code in the business layer.

Your background process now bypasses the business layer for some little things, which breaks the consistency of the business layer, which now cannot be trusted. Every bit of code you write needs to have extra checks, just in case something broke the underlying data. You business code is more and more concerned with data consistency than business logic. It is the equivalent of writing code assuming the data you use is being changed by a hostile adversary, paranoid code. The worst thing is that it spends resources on data, which should not be its focus. This would not have happened if your data entities were invisible to the new service you've built.

Client vs server


Traditionally, there has always been a difference between client and server code. The server was doing most of the work while the client was lean and fit, usually just displaying the user interface. Yet, as computers have become more powerful and the server client architecture ubiquitous, the focus has shifted to lean server code that just provides necessary data and complex logic on the client machine. If just a few years ago using Model-View-Controller frameworks on the server, with custom DSLs to embed data in web pages rendered there, was state of the art, now the server is supposed to provide data APIs only, while the MVC frameworks have moved exclusively on the browser that consumes those APIs.

In fact, a significant piece of code has jumped the network barrier, but the architecture remained essentially the same. The tiers are still there, the separation, as it was, is still there. The fact that the business logic resides now partly or even completely on the web browser code is irrelevant to this post. However, this shift explains why some of the problems described above happen more often even if they are as old as me. It has to do with the slim API concept.

When someone is designing an application and decides to go all modern with a complex framework on the client consuming a very simple data API on the browser they remember what they were doing a few years back (or what they were taught in school then): the UI, the business logic, the data layer. Since business and UI reside on the browser, it feels as if the API has taken over for the data layer. Ergo, the entities that are sent as JSON through HTTP are something like DAOs, right? And if they are not, surely they can be mapped to them.

The reality of the thing is that the business logic did not cross all the way to the client, but is now spread in both realms. Physically, instead of a three-tier app you now have a four tier app, with an extra business layer on the client. But since both business layers share the same domain, the interfaces and entities from one are the interfaces and entities from the other. They can be safely mapped because they are the same. They are two physical layers, but still just one logical business layer. Think authentication: it has to be done on the server, but most of the logic of your application had moved on the client, where the user interacts directly with it (using their own computer resources). The same business layer spread over client and server alike.

The way


What I advocate is that layers are defined by their purpose in life and the entities that pass between them (transfer objects) are defined even stricter by the necessary interaction between those domains. Somehow, through wishful thinking, framework worshiping or other reasons, people end up with Frankenstein objects that want to perforate these layers and be neither here nor there. Think about it: two different object classes, sharing a definition and a mapping are actually the same object, only mutated across several domains and suffering for it. Think the ending of The Fly 2! The very definition of separation means that you should be able to build an N-layered application with N developers, each handling their own layer. The only thing they should care about are the specific layer entities and interdomain interfaces. By necessity, these should reside in projects that are separated by said layers and used by only the ones that communicate to one other.

I know it is hard. We choose to develop software, not because it is easy, but because it should be! Stop cutting corners and do it right! I know it is difficult to write code that looks essentially the same. The Post entity might have the same form for the data layer and the business layer, but they are not the same in spirit! Once the application evolves into something more complex, the classes will look less like twins and more like Cold War agents from different sides of the Iron Curtain. I know it is difficult to consider that a .NET class should have a one-to-one mapping not to another .NET class, but a Javascript object or whatever the client is written in. It is right, though, because they are the same in how you use them.

If you approach software development considering that each layer of the application is actually a separate project, not physically, but logically as well, with its own management, roadmap, development, you will not get more work, but less. Instead of a single mammoth application that you need to understand completely, you have three or more projects that are only marginally connected. Even if you are the same person, your role as a data access layer developer is different from your role when working on the business layer.

I hope that my thoughts have come through clear and that this post will help you not make the same mistakes I did and see so many others do. Please let me know where you think I was wrong or if you have anything to add.

0 comments: