We collect data in various ways depending on how we are going to use it later. For typical “online” scenarios we use so-called OLTP - Online Transactional Processing. This means that we focus on providing the best performance, while our transactions typically modify only a limited number of entities. We also have a large number of parallel transactions. Think of any online application that handles interactive user input.
However, “off-line” scenarios have different requirements. These are typically related to bookkeeping tasks like updating the database, recalculating values, or generating reports. These queries take much longer to complete, and they are not time-sensitive. It’s okay to run them during the night and to wait a couple of hours for the results. We also don’t focus on database normalization, which helps us avoid the performance cost of joining tables. There are dedicated designs for “off-line” scenarios, like the snowflake schema, and there is no need to use normal forms.
Yet another important term in this area is OLAP - Online Analytical Processing. These technologies organize multiple databases to perform analysis for reporting and enterprise management. They need to deal with multiple data sources that serve similar purposes but are implemented differently under the hood. This may lead to issues that are hard to find and can easily go unnoticed.
Another difference is data ingestion, which works completely differently for warehouses. We typically don’t insert records one by one or within fine-grained transactions. Instead, we load them in bulk by importing batches of records that are later added to the table as partitions. We want to avoid locks, performance degradation, and sometimes even consistency checks to make the process faster.
Extract-Transform-Load
Importing data from multiple sources is called Extract-Transform-Load (ETL for short). This process consists of three phases.
The first phase is called Extract. We need to connect to a remote data source to get the records. This can be a plain database connection over a well-known protocol, downloading files from storage such as an FTP server, or calling a web service that exposes data as XML or JSON. There are multiple ways to do that, and they depend on the implementation details of the data source. Sometimes it is fast enough to query records one by one; sometimes we prefer to export the data to a dump file which we then load in a different place.
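As a rough sketch of the Extract phase, assuming a web service that returns a JSON array (the URL, field names, and staging table are made up for this example):

```python
import json
import sqlite3
from urllib.request import urlopen

# Hypothetical endpoint and staging database, used only for illustration.
SOURCE_URL = "https://example.com/api/orders"
STAGING_DB = "staging.db"

def extract_orders():
    # Pull the raw records from the remote data source (JSON over HTTP).
    with urlopen(SOURCE_URL) as response:
        records = json.loads(response.read().decode("utf-8"))

    # Persist the raw, untransformed payload to a local staging area
    # so the Transform phase can work on a stable snapshot.
    with sqlite3.connect(STAGING_DB) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (payload TEXT)")
        conn.executemany(
            "INSERT INTO raw_orders (payload) VALUES (?)",
            [(json.dumps(r),) for r in records],
        )

if __name__ == "__main__":
    extract_orders()
```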
The second phase - Transform - aims to translate data from the source format to the destination one. This includes changing schemas by joining tables, adding columns, or renaming properties. We also need to transform values between different representations. For instance, we may need to translate dates from MM/DD to ddmm form. We need to handle missing values, placeholders, encoding schemes, and various other things. All of that needs to be done to unify all data sources, so we can load them later into the data warehouse. We also need to handle duplicates, missing records, and broken entries.
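A minimal sketch of a Transform step, assuming the source uses MM/DD/YYYY dates and markers like 'N/A' for missing values (the field names and the marker list are assumptions for this example):

```python
from datetime import datetime

# Placeholder markers the hypothetical sources use for a missing field.
MISSING_MARKERS = {"", "N/A", "NULL", "NONE"}

def transform_record(raw: dict) -> dict:
    """Map a raw source record onto the destination schema."""
    # Rename source properties to the destination column names.
    record = {
        "customer_name": raw.get("custName"),
        "order_date": raw.get("orderDate"),
    }

    # Normalize missing-value placeholders to None.
    for key, value in record.items():
        if isinstance(value, str) and value.strip().upper() in MISSING_MARKERS:
            record[key] = None

    # Convert MM/DD/YYYY dates to the ISO format used by the warehouse.
    if record["order_date"] is not None:
        record["order_date"] = (
            datetime.strptime(record["order_date"], "%m/%d/%Y").date().isoformat()
        )
    return record

print(transform_record({"custName": "N/A", "orderDate": "07/25/2024"}))
# {'customer_name': None, 'order_date': '2024-07-25'}
```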
The last phase, called Load, focuses on loading the data into the data warehouse. Since we integrate multiple data sources, we may need to load data in bulk to achieve decent performance. Loading needs to happen when the database is not overloaded, so we typically do it outside of working hours. There are other aspects to consider, like errors, retries, and archiving.
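As a hedged sketch, assuming a PostgreSQL warehouse and the psycopg2 driver (the DSN, table name, and CSV path are placeholders), a bulk load could use COPY instead of row-by-row inserts:

```python
import psycopg2

# Placeholder connection settings and file path, for illustration only.
DSN = "dbname=warehouse user=etl"
CSV_PATH = "/tmp/orders_batch.csv"

def load_batch():
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        with open(CSV_PATH, "r", encoding="utf-8") as f:
            # COPY loads the whole batch in one operation instead of
            # issuing one INSERT per row, which keeps locking and
            # transaction overhead low.
            cur.copy_expert(
                "COPY orders (customer_name, order_date) FROM STDIN WITH CSV HEADER",
                f,
            )
    conn.close()
```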
What can go wrong and how to avoid that?
There are multiple things that can go wrong when dealing with heterogeneous databases. We’ll cover some of them in this section and think about how to make sure we avoid them.
General considerations
The problem with heterogeneous data sources lies mostly in identifying issues during the Transform phase of ETL. The problem is that they can easily go unnoticed. That’s because the values are not “wrong” per se, but they change their meaning in the process.
Another issue is scale. It’s feasible to manually verify a thousand rows and check that everything is correct. However, it becomes impossible once we run the process at a large scale and need to verify millions of rows. Since the process doesn’t fail, issues may stay hidden in production for a long time.
Let’s go through some categories of issues.
Dates and timestamps
Dates and timestamps can easily become hard to manage and reason about. The first thing to notice is the lack of a standardized date format. There are multiple formats out there, like DD.MM.YYYY, MM.DD.YYYY, ddMMyyyy, MM/DD, and others. The issue here is that we may incorrectly change the format on the fly. For instance, a data source stores dates in MM/DD format, but we later process them as DD/MM. When we run an automated process we may notice that July 25 (07/25) cannot be represented in DD/MM format, so we get an empty value, but we may miss that this problem doesn’t apply to February 4 (02/04), which silently becomes April 2.
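The ambiguity is easy to reproduce. The sketch below parses the same strings first as MM/DD and then as DD/MM:

```python
from datetime import datetime

def parse(value: str, fmt: str):
    try:
        return datetime.strptime(value, fmt).strftime("%B %d")
    except ValueError:
        return None  # the value cannot be represented in this format

for value in ("07/25", "02/04"):
    print(value, "as MM/DD:", parse(value, "%m/%d"), "| as DD/MM:", parse(value, "%d/%m"))

# 07/25 as MM/DD: July 25     | as DD/MM: None      <- loud failure
# 02/04 as MM/DD: February 04 | as DD/MM: April 02  <- silent change of meaning
```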
A similar issue applies to timestamps because they need to include a time zone specification. A data source may store timestamps in local time, but the destination database may use UTC. If we don’t pay attention, we may miss that the timestamps are transformed incorrectly. This is even more probable when the local time zone we work with is close to UTC, because the differences are small enough to overlook.
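A short sketch of the time zone pitfall, assuming the source stores naive local timestamps in the Europe/Warsaw zone (the zone is chosen only for illustration):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A naive timestamp as stored by the hypothetical source system.
naive = datetime(2024, 7, 25, 10, 0, 0)

# Wrong: treating the local timestamp as if it were already UTC.
assumed_utc = naive.replace(tzinfo=timezone.utc)

# Right: attach the source's local zone first, then convert to UTC.
converted = naive.replace(tzinfo=ZoneInfo("Europe/Warsaw")).astimezone(timezone.utc)

print(assumed_utc.isoformat())  # 2024-07-25T10:00:00+00:00
print(converted.isoformat())    # 2024-07-25T08:00:00+00:00 (two hours apart)
```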
How to avoid that? Unit tests may work well, but we would need to think about all the possible edge cases, so that won’t be enough in practice.
We can analyze the histogram. Just count the dates in a given month in the source dataset, and then compare it with the destination one. If the difference is significant, then something went wrong along the way.
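A minimal sketch of such a check, assuming both sides can be exported as lists of ISO date strings:

```python
from collections import Counter

def monthly_histogram(dates):
    """Count rows per YYYY-MM bucket, given ISO date strings."""
    return Counter(d[:7] for d in dates)

def compare(source_dates, destination_dates, tolerance=0.01):
    source = monthly_histogram(source_dates)
    destination = monthly_histogram(destination_dates)
    for month in sorted(set(source) | set(destination)):
        src, dst = source[month], destination[month]
        # Flag months where the destination count drifts from the source.
        if src == 0 or abs(src - dst) / src > tolerance:
            print(f"Suspicious month {month}: source={src}, destination={dst}")
```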
Yet another issue is with “tricky” dates like February 29. These dates may get lost, so it’s good to verify them manually.
Decimal separator
Numbers can pose multiple challenges. In some locales it is common to write bigger numbers with separators. For instance, one thousand two hundred thirty four can be written as 1234 or 1.234 or 1,234. Not only may a separator be present, but the character used for it differs: some languages prefer dots, some go with commas, and yet others use spaces.
Now it’s clear why errors may occur. We may read 1,234 intended as one and a fractional part (a decimal comma), but use it as if it were 1,234 meaning a bit more than one thousand (a thousands separator). Again, this may easily go unnoticed because both values are technically “correct”, but they have different meanings.
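The same string really does parse to two different numbers depending on the locale. A sketch using Python’s locale module (the locale names are platform-dependent, so treat them as an assumption):

```python
import locale

value = "1,234"

# Locale where the comma is a thousands separator.
locale.setlocale(locale.LC_NUMERIC, "en_US.UTF-8")
print(locale.atof(value))  # 1234.0

# Locale where the comma is the decimal separator.
locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")
print(locale.atof(value))  # 1.234
```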
How to avoid that? Again, unit tests will be of some help if we’re aware of the issue. It may also be necessary to analyze the data distribution. Just plot the distribution, or calculate some summary metrics (like the average or the sum).
Missing values
How do we store the absence of a value in a database? NULL is one option, but there are others as well. The string ‘NULL’ or ‘NONE’ could be a solution, ‘N/A’ or ‘Not applicable’ as well, or maybe just an empty string ‘’.
This gets even more interesting with various languages. For example, “n'est pas applicable” is “Not applicable” in French. Similarly, “Nie dotyczy” is the same in Polish. All these values should be considered the same in the target system.
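One defensive option is to normalize every known marker to a single representation during the Transform phase. A sketch with a few markers (the list itself is an assumption and has to be extended per data source and language):

```python
# Markers treated as "no value"; extend this set per data source and language.
MISSING_MARKERS = {
    "", "null", "none", "n/a", "not applicable",
    "n'est pas applicable",  # French
    "nie dotyczy",           # Polish
}

def normalize_missing(value):
    """Return None for any known missing-value marker, the value otherwise."""
    if value is None:
        return None
    if value.strip().lower() in MISSING_MARKERS:
        return None
    return value

print(normalize_missing("Nie dotyczy"))  # None
print(normalize_missing("42"))           # '42'
```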
However, the problem is not only on the input side. It can occur on the output side as well. The ETL system may emit empty values when it cannot parse a field for whatever reason. If that’s the case, then the target database will contain far more empty values than the source systems.
How to avoid that? Check distributions, learn other languages, and make sure that you clearly mark fields that can be empty.
Field length
Another issue is the length of a field. Your source dataset may contain very long values that get truncated when loaded into the destination system. This can also happen when we handle various encodings. For instance, EBCDIC encoding had a special character ZŁ which represented the Polish currency (just like $ represents the dollar). However, there is no single character for ZŁ in Unicode. If your target system allows just one character to indicate the currency, then you may get an error.
How to avoid that? Run some analysis of the data size distribution. Make sure that your columns do not get silently truncated, but rather throw an error.
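A sketch of such a check, comparing incoming values against the destination column limit (the limit of 255 characters is just an example):

```python
# Hypothetical destination column limit, used for illustration.
MAX_LENGTH = 255

def check_lengths(values, max_length=MAX_LENGTH):
    """Report values that would not fit into the destination column."""
    too_long = [v for v in values if len(v) > max_length]
    if too_long:
        # Fail loudly instead of letting the database truncate silently.
        raise ValueError(f"{len(too_long)} value(s) exceed {max_length} characters")
    return values
```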
Character escaping
Character escaping can be very hard to reason about because it’s not enough to take a look at the data to understand if the problem is there. For instance, take the following JSON:
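```
{"comment": "She replied: \"see you tomorrow\""}
```

(The value above is made up purely to illustrate the kind of payload you might find in a staging table.)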
Is this a proper JSON? It is if you escape the characters (so you use backslash before each double-quote character). It is not if you don’t escape the characters. However, can you tell just by looking at the value?
To solve this problem we need to see the logic that operates on the data. We need to go and check the source code of all the systems that may use the value. This may get even harder when we take the JSON from above and escape it again:
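```
"{\"comment\": \"She replied: \\\"see you tomorrow\\\"\"}"
```

(This is the illustrative value from before serialized once more as a JSON string: every quote is escaped again and every backslash doubled, which makes the intended value even harder to guess.)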
Regular expressions
Regular expressions are a standard feature of multiple programming languages, databases, and systems. However, there are differences between them. Let’s take these expressions:
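For illustration, the commonly used named-group syntaxes look roughly like this (note that some engines accept more than one of these forms):

```
(?P<name>x)    # Python
(?<name>x)     # C# / .NET
(?'name'x)     # C# / .NET (alternative syntax)
(?<name>x)     # Perl 5.10+
```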
They all do the same thing - capture the character x and put it in a group named “name”. However, the actual syntax for naming a group differs between languages. The first example is from Python, the next two come from C#, and the last one is from Perl.
If you just reuse expressions from other platforms, you may run into situations where the expressions work (i.e. don’t throw errors) but return different results.
How to avoid that? We need to understand the code and the internals of the systems we use. There is no mechanical solution that we can apply here.
Heterogeneous distributed databases
Some heterogeneous distributed databases make things even harder because of how they handle connectors. For instance, HBase or Spark (or other systems in the Hadoop ecosystem) can be accessed from various technologies like Scala, Java, or Python. However, the code running in that environment often needs to be sent between nodes in some serialized form. For instance, Spark serializes the closures it ships to executors, and can use Kryo for serializing data.
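As an illustrative sketch, this is roughly how the serializer is selected in PySpark (the spark.serializer key and the KryoSerializer class are standard Spark settings; the rest of the snippet is made up for the example):

```python
from pyspark.sql import SparkSession

# Request Kryo for data serialization; closures shipped to executors
# are still handled by the JVM's standard serialization machinery.
spark = (
    SparkSession.builder
    .appName("etl-example")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Cached or shuffled data may need to be serialized and shipped between
# nodes, which is where serialization problems tend to surface.
df = spark.range(1000).cache()
print(df.count())
```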
Problems may arise when things do not get cached properly and work differently on a single node versus multiple nodes. It’s not uncommon for cached datasets in Spark to fail during processing because of serialization issues.
Summary
Ingesting data from multiple data sources can be very hard. The biggest issue is not running the process, but making sure that we get correct output. Issues can go unnoticed for months, and the systems that use the transformed data may appear to keep working correctly. However, the quality of your solutions may silently degrade. For instance, if you run your machine learning algorithm on data that is invalid and has a completely different distribution, then your model may learn something different and fail to perform in production.