Software archeology, or data archeology, is the term I use for the process of extracting meaning and logic from the data an old system has stored in its database.
Traditional enterprise systems have at least two layers: a relational database that persists the data over time, and a GUI layer that displays and processes it. When you need to export data from the old system, it is unlikely to have an "Export all" button that hands you everything. It is much more likely that you will need to either build new export logic in the old application or access the data directly in the database.
Building new logic in the old system is often a problem, since you are migrating away from it for a reason. Usually that reason is that the old system is difficult to maintain, so starting a new development effort inside it is not an attractive option.
It is often much more convenient to access the data directly from the relational tables. The problem with doing so is that you will miss a lot of the logic buried in the application layer itself.
Common examples are delete-marked rows that the application knows have been superseded by a newer version with the correct data, calculations performed at runtime in the application layer, implicit foreign key relationships, and enumerations stored as plain integers.
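As a rough sketch of what this means in practice, the export below filters out delete-marked rows and translates an integer enumeration back into the labels the users actually saw. The ORDERS table, its IS_DELETED flag, and the STATUS codes are all hypothetical, and sqlite3 just stands in for whatever driver your old database actually needs:

```python
import sqlite3

# Hypothetical mapping discovered by comparing the GUI with the raw data:
# the application shows these labels, the table only stores the integer.
STATUS_LABELS = {0: "Draft", 1: "Confirmed", 2: "Cancelled"}

def export_orders(conn):
    """Yield only the rows the old application would actually show."""
    rows = conn.execute(
        # Skip delete-marked rows: the application hides them,
        # but they are still physically present in the table.
        "SELECT ORDER_ID, CUSTOMER_ID, STATUS, TOTAL "
        "FROM ORDERS WHERE IS_DELETED = 0"
    )
    for order_id, customer_id, status, total in rows:
        yield {
            "order_id": order_id,
            "customer_id": customer_id,
            # Translate the stored integer into the label the users know.
            "status": STATUS_LABELS.get(status, f"UNKNOWN({status})"),
            "total": total,
        }

# Usage (sqlite3 stands in for the real database driver):
# conn = sqlite3.connect("legacy_copy.db")
# for record in export_orders(conn):
#     print(record)
```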
Despite the risk of misinterpreting the data, my recommendation is still to implement the export by reading directly from the database.
You will, however, need to mitigate the risk you introduce. Expect to spend a lot of time on exploration, comparing screenshots of the system with data in the tables to establish which tables, columns, and rows are needed to represent the data you wish to migrate. In the best of worlds, you have access to authoritative reports from the old system that cover most of the data that needs to be exported. These reports can then be used to reconcile and verify that your interpretations are correct.
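One way to do that reconciliation, sketched below with hypothetical figures and the export function from above, is to compute the same aggregates from your export that the report prints and compare them:

```python
def reconcile(exported_records, report_totals):
    """Compare aggregates computed from the export with figures taken
    from an authoritative report in the old system."""
    computed = {}
    for record in exported_records:
        computed[record["status"]] = computed.get(record["status"], 0) + record["total"]

    mismatches = []
    for status, reported in report_totals.items():
        exported = computed.get(status, 0)
        if abs(exported - reported) > 0.01:
            mismatches.append((status, reported, exported))
    return mismatches

# report_totals would be copied from e.g. a printed month-end report:
# reconcile(export_orders(conn), {"Confirmed": 125000.00, "Cancelled": 8300.00})
```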
During the investigation, you will probably discover that only a fraction of the data in the tables is actually relevant. Unless the old system had a very strict database administrator, there will be plenty of data that is now irrelevant. Some columns might be deprecated but never dropped; others exist only because someone thought they would be needed but never got around to actually storing data in them. There will also likely be a naming gap between entity names in the database and labels in the user interface. Such naming gaps should be identified as soon as possible, as they are the root cause of many specification errors.
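A quick way to find those never-used columns is to profile how empty each column really is. The sketch below uses sqlite's PRAGMA table_info to list the columns; on another engine you would query its system catalog instead. Since table and column names are interpolated straight into the SQL, only ever point it at a copy of the database you control:

```python
def profile_columns(conn, table):
    """Print how empty each column in a table really is."""
    # PRAGMA table_info is sqlite-specific; on other engines read the
    # system catalog (e.g. information_schema.columns) instead.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for col in columns:
        empty = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL OR {col} = ''"
        ).fetchone()[0]
        print(f"{table}.{col}: {100.0 * empty / max(total, 1):.1f}% empty")
```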
If you have access to the code of the old system you will be able to "cheat" and look at the way the system uses the data, but even that is only partially helpful: the code you see is only the latest version, while the database contains data stored by many versions of the code throughout the lifetime of the system.
Another way to explore how the database is used is to run a database profiler, which captures the exact SQL statements executed against the database server. By comparing that SQL with the data on the screen, it is easy to establish the relationship between the data in the database and the data as the user sees it.
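If the profiler can save the captured statements as plain text, even a crude script can summarize which tables a given screen reads. The sketch below is only an approximation (a regular expression will not parse every SQL dialect correctly), and the one-statement-per-line trace format is an assumption about how you exported the capture:

```python
import re
from collections import Counter

def tables_touched(trace_file):
    """Tally which tables the captured SQL statements read, as a rough map
    from a screen in the old application to the tables behind it."""
    # Assumes the profiler trace was saved as plain text, one statement per line.
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+([\w.\[\]]+)", re.IGNORECASE)
    counts = Counter()
    with open(trace_file, encoding="utf-8") as f:
        for statement in f:
            counts.update(name.strip("[]") for name in pattern.findall(statement))
    return counts

# Open one screen, capture the SQL it fires, save the trace, then:
# print(tables_touched("orders_screen.trace.txt"))
```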
Fortunately, most systems worth migrating were originally designed with a proper database schema: primary keys on all tables and, if you are lucky, explicit foreign key relationships. Standard IT practice a few decades back focused heavily on database design, so most systems in need of migration today will be reasonably well organized.
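Where explicit foreign keys do exist, the database catalog will tell you about them. The sketch below assumes the ANSI-style information_schema views found in, for example, SQL Server and PostgreSQL, and a standard DB-API connection:

```python
FOREIGN_KEY_SQL = """
SELECT kcu.table_name, kcu.column_name,
       ccu.table_name AS referenced_table, ccu.column_name AS referenced_column
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage AS ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
"""

def dump_foreign_keys(conn):
    """Print the declared foreign keys as a starting point for the entity model."""
    cur = conn.cursor()
    cur.execute(FOREIGN_KEY_SQL)
    for table, column, ref_table, ref_column in cur.fetchall():
        print(f"{table}.{column} -> {ref_table}.{ref_column}")
```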
The future of migration
I cannot help thinking, however, about the challenges software archeology will face when the systems we create today need to be examined. The relational model is slowly being phased out in favor of NoSQL technologies. It has become good practice to rely on transparent persistence and not worry about how data is retrieved into objects. Not that these techniques are wrong; we simply have less experience migrating these types of software services.
The focus today is heavily on functionality and agility, and less on building a rigid foundation. What will your software look like as a ruin?