Continuous Integration – Revisited

I know this isn't the first article I've written on Continuous Integration, but a recent project involving not one but three developers has helped me further understand Continuous Integration: how it works, and some positives and potential pitfalls that I want to pass along to you.

Why is Continuous Integration Important?

I'll be honest: when I first started working on Kentico sites, content repositories were more of an afterthought.  It started with small one-to-two-month Kentico sites, where we may or may not have stored the site in a repository (there was very little customization).  Then as time went on, I started working on larger projects, ones that did more and more customization, and we kept the files in a repository, but it was still just me building the site, so there was no need for collaboration.  When collaboration was needed, it was often fine to just have two Kentico sites share the same database.

But as we take on much larger, multi-developer projects, we need a solid method for building the site together: sharing changes, resolving conflicts, and storing that data.  This is where I'm finding Continuous Integration shines, as it handles all of this fluidly.

How does Continuous Integration Work?

First, I want to do a deep dive into how this tool operates behind the scenes, which will help the behaviors we discuss later make more sense.

The CIRepository

The first piece of the puzzle is the App_Data\CIRepository folder.  It contains a file representation of each database object tracked by CI.  Under the CIRepository folder there is a global-level folder and a folder per site, then the "class" folders (one for each CMS class), and inside them a file that is the XML-serialized version of that object's row in the database.  The file is named after the object's code name, or its GUID if no code name is present.  And if the object contains binding data (say, a user setting's UserID reference field), the file also stores the code name/GUID of the referenced object, so other CI instances can resolve the proper ID on their side.
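To make that concrete, here is a rough sketch of what the folder can look like (folder and file names here are illustrative; the exact layout varies by site and Kentico version):

```
App_Data/CIRepository/
  @global/                         <- global objects
    cms.user/
      administrator.xml            <- named after the object's code name
    cms.userrole/
      administrator_cmseditor.xml  <- binding: built from the referenced code names
  mysite/                          <- site-scoped objects
    cms.culture/
      ...
```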

The CI_MetaFileData Table

The second piece is the CI_MetaFileData table.  It has only two fields, FileLocation and FileHash, and acts as your local instance's "memory" of what it already has in the database.  When a new item is created, not only is a file created in App_Data\CIRepository, but an entry is also added to this table pointing to that file's location, along with the file's hash.  The hash is generated from the file's content, so if the content changes, the hash changes too.
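The hash behavior is easy to picture in a few lines of code.  This is only a sketch: Kentico doesn't document the exact hash algorithm here, so SHA-256 is an assumption.

```python
# Sketch of CI's change detection: the hash is derived purely from the
# file's content, so any content change produces a different hash.
import hashlib

def file_hash(content: str) -> str:
    """Hash a serialized object's content (SHA-256 is assumed, not confirmed)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

original = "<cms.user><UserName>admin</UserName></cms.user>"
modified = "<cms.user><UserName>administrator</UserName></cms.user>"

assert file_hash(original) == file_hash(original)   # same content, same hash
assert file_hash(original) != file_hash(modified)   # changed content, new hash
```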

The CI Restore

The last piece is the ContinuousIntegration.exe tool's -r (restore) command.  This handles the heavy lifting of interpreting changes in the file system (from other developers) and incorporating them into your own database.  Here is how it does it:
  1. It scans the files in the CIRepository
    1. If it finds a matching FileLocation in the CI_MetaFileData and the hash of that file is the same, nothing has changed and it ignores the file.
    2. If it finds a matching FileLocation in the CI_MetaFileData but the hash is different, the object has changed, so it deserializes the XML representation of the object and saves the changes to the database.
    3. If it doesn't find a matching FileLocation, it treats the file as a new object, creates that object, and then adds the FileLocation and FileHash to the CI_MetaFileData.
  2. It scans the CI_MetaFileData
    1. If it finds a FileLocation whose file no longer exists, it considers that object deleted, so it finds the object in your database and deletes it.
While I'm uncertain of the order of steps 1 and 2, and there seems to be some magic that processes referenced classes before the classes that depend on them (Users and Roles before UserRoles, for example), for the most part this is how it operates.
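The two scans above can be sketched in a few lines.  This is a simplification: plain dicts stand in for the repository files and the CI_MetaFileData table, and the dependency ordering between classes is ignored.

```python
# Minimal sketch of the restore pass. repo_files maps file location -> hash
# (the CIRepository on disk); meta maps file location -> hash (CI_MetaFileData).
def restore(repo_files: dict, meta: dict) -> list:
    """Compare repository files against local metadata; return the actions taken."""
    actions = []
    # Pass 1: scan the repository files
    for location, new_hash in repo_files.items():
        old_hash = meta.get(location)
        if old_hash is None:
            actions.append(("create", location))   # unknown file -> new object
            meta[location] = new_hash
        elif old_hash != new_hash:
            actions.append(("update", location))   # known file, changed content
            meta[location] = new_hash
        # equal hashes -> no change, the file is ignored
    # Pass 2: scan the metadata for files that disappeared
    for location in list(meta):
        if location not in repo_files:
            actions.append(("delete", location))   # file gone -> object deleted
            del meta[location]
    return actions

meta = {"Global/cms.user/admin.xml": "h1", "Global/cms.role/editor.xml": "h2"}
files = {"Global/cms.user/admin.xml": "h1-changed", "Global/cms.user/jane.xml": "h3"}
print(restore(files, meta))
```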

Custom Module Classes

It's important to note that Continuous Integration is disabled by default on new module classes.  So when you generate new module classes, I would enable CI right away on the applicable classes, so any changes are tracked.  And, as with most things, you need to use Kentico's API when inserting, deleting, or modifying objects; otherwise they won't end up in the CI repository (unless you do a full serialization through the Continuous Integration UI in Kentico).

Pitfalls – Watch out below!

Now that you understand a bit about how it works, here are some things you must watch out for, along with some workarounds.

#1 – Careful about LARGE amounts of Data!

Since anything tracked by Continuous Integration creates both a file and a database record (with a file hash), you should exercise caution with extremely large sets of data.  These usually come in the form of binding tables.
To put a number on what a "large" set of data is, I'm talking about anywhere from roughly 10,000 objects on up.  We were importing a large amount of CI-tracked data, about 20,000 objects.  Each of those objects had multiple bindings to another object, so the binding class had about 200,000 entries in it.  Yikes!  That meant 200,000 .xml files were being created.  The result?  Visual Studio couldn't load the project because there were simply too many files in it, and Git had a FIT about it too.

Another thing that occurred with larger data sets was an apparent "freeze" when running a CI restore.  After the restore processes all the files, it must insert the hashes into the database table and finish its processing.  This can take a while when there are a lot of changes, but the command prompt currently gives you no status information during this particular operation.  Normally it takes less than a minute, but with hundreds of thousands of objects it can take upwards of 15 minutes to an hour or more.  This caught us off guard, and it wasn't until we did some testing that we found it wasn't frozen at all; it was just finishing its tasks.

Solution: Don’t track it!

Sometimes you just have to say NO to tracking some objects.  Disabling Continuous Integration on these objects (and, if you have already built the XML files in the CI repository, deleting them) will remedy the situation.  You can do this by turning CI off on your custom objects, and by using the repository.config to exclude other objects.
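For example, excluding object types is done in the repository.config file inside the CIRepository folder.  A sketch, with hypothetical custom class names (check the CI documentation for your Kentico version for the exact element names):

```xml
<RepositoryConfiguration>
  <!-- An empty IncludedObjectTypes element means all supported types are tracked -->
  <IncludedObjectTypes />
  <!-- Exclude the high-volume custom classes (names here are hypothetical) -->
  <ExcludedObjectTypes>
    <ObjectType>acme.importeditem</ObjectType>
    <ObjectType>acme.importeditembinding</ObjectType>
  </ExcludedObjectTypes>
</RepositoryConfiguration>
```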

Just keep in mind that if CI isn't tracking it and it's not in your repository, your fellow devs won't be able to get your data either, so you have to sync things manually.  This can be done as follows:

  1. Everyone pushes their changes
  2. You pull and restore everyone's changes
    1. Your files and database are now the "master"
  3. Perform any imports you need so your site has the untracked data that everyone needs
  4. Back up your database and share it with your other developers so they can restore it
  5. Now that everyone has the untracked data, everyone should be good to go
Luckily this shouldn't happen often, and it's not too hard to restore a database, but it seems to be the only workaround for this.

#2 – Binding Objects need GUIDs

This is another potential pitfall.  A binding table normally needs only its own row ID and two integer fields that reference the object IDs it binds.  However, that doesn't give CI enough to build its serialized file.  Remember me saying that CI names the serialized file after the code name, and failing that, the GUID?  That's why you need at least a GUID on your binding table, so CI has a file name.  You will get an error if you try to enable CI on a class with no GUID or code name.  The GUID doesn't really do anything else; it's only there so CI can make a file name for the object.

Solution: Add a GUID

The solution is simple: add a unique identifier (GUID) field to your binding class, and make sure that field is set as the GUID column in the binding class's InfoObject type information.

Now some may rightly say, "But that's adding unneeded bloat to my fast, sleek binding table!" and you are right: a GUID (a SQL Server uniqueidentifier) adds 16 bytes of data per record.  If speed is very important, you may want to create an index on just the two ID reference fields, which should bring performance back to where it was before.
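For example, on SQL Server (table and column names here are hypothetical), a narrow index covering only the two reference columns looks like this:

```sql
-- The GUID column exists only so CI can name the file; queries that join
-- through the binding table can use this narrow two-column index instead.
CREATE NONCLUSTERED INDEX IX_Acme_ItemCategory_Bindings
    ON Acme_ItemCategory (ItemID, CategoryID);
```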

#3 – Switching Connection Strings on Module Classes doesn’t track well.

Before I explain the issue, it may be best to explain why we would put a different connection string on our module classes in the first place.  In some cases, when we create custom modules, we want to make them CMS-agnostic, so that if someone ever decides to go with another platform, the module data lives in its own database.  Although why anyone would want to use a CMS other than Kentico is beyond me.

We started our current project this way; however, since having data in a separate database causes problems when you try to join across two databases, we decided to move the module classes back into the main Kentico database.

Sadly, when we updated the module class's connection string, Kentico didn't rebuild the table in the new database, so some fields in the database did not match what was in Kentico.  This isn't really an issue with CI itself; it's more about this rarely used feature of custom modules.

Solution – Change and Save

To get Kentico to re-touch the database, we ended up taking a more manual approach.  We went through the fields in the database, and if any didn't match the class, we changed the class field (to trigger a table update), then changed it back.  For tables that were missing entirely, adding a new field and then deleting it will also force the update.

I'm not sure how many other Kentico developers are even using the module class's connection string to point to another database, so this may be an issue only we encounter.

Other Thoughts: Using GIT on Visual Studio Online

Lastly, while I have an existing blog article on how to work with Visual Studio Online's version of Team Foundation Server, the project I'm currently on uses Git.  And I must say that with Visual Studio 2017 the Git interface is MUCH better, and it finally lets you resolve merge conflicts just as easily as you could with TFS.  Since that was the only major reason to use TFS over Git, and it is now resolved, I will say that I now prefer Git over TFS.

Git does require a slightly different thought process.  You want to push smaller changes more frequently, you must push all your changes (not just some), and you must fetch, pull, and resolve any conflicts before you can push your code.  You also cannot fetch and pull anything that conflicts with your local work until you first stage or commit your files.  It's a little weird, but it doesn't take long to get the hang of.


Hopefully this helps you decide to give the CI repository a try on your next project, and the few things we've learned along the way will save you from scratching your head over the same issues.  Remember, this is just one of several possible team development models, and it's not always the best fit, but in many cases I think it will be.

Happy coding everyone!