Background


CompareNetworks, Inc. provides media and web platform products for global life science and healthcare companies. One such web platform, an online marketplace named biocompare.com, contains millions of products (such as antibodies and biomolecules) targeted at life scientists worldwide. CompareNetworks had accumulated a large volume of data related to the products listed on biocompare.com, including academic articles (drawn from providers such as PLOS ONE, Elsevier and Open Access), citations and product matching data.

Problem


However, this data was scattered across disparate data sources (MySQL, Solr, Elastic), which made it difficult to analyze and draw insights from, drastically limiting its value.

Solution


Calcey created a data lake, integrated all the data sources into it with appropriate relationships, and built a mechanism to update the data incrementally whenever a source changes. Calcey also provided a querying mechanism based on Amazon Athena, so that the client can run custom queries on top of the data lake to understand how to utilize and monetize the available data. AWS Glue was chosen as the ETL platform, and Amazon S3 was used as the data storage platform, with the data stored in the Apache Parquet file format.
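To make the idea of an incremental update concrete, here is a minimal sketch in plain Python. It is not Calcey's implementation and does not use AWS Glue; it only illustrates one common approach, a watermark-based upsert, under the assumption that every source row carries a unique key and a last-modified timestamp. The names Record, incremental_update, id, name and updated_at are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Record:
    """A hypothetical source row; the field names are illustrative only."""
    id: int
    name: str
    updated_at: datetime


def incremental_update(
    lake: Dict[int, Record],
    source_rows: Iterable[Record],
    watermark: datetime,
) -> Tuple[Dict[int, Record], datetime]:
    """Upsert rows modified after `watermark`; return the new lake and watermark."""
    merged = dict(lake)  # copy, so the caller's snapshot is left untouched
    new_watermark = watermark
    for row in source_rows:
        if row.updated_at > watermark:
            # Insert a new row or overwrite the stale copy with the same key.
            merged[row.id] = row
            new_watermark = max(new_watermark, row.updated_at)
    return merged, new_watermark


# Example run with made-up data: one changed row and one new row.
lake = {1: Record(1, "antibody A", datetime(2020, 1, 1))}
source = [
    Record(1, "antibody A (renamed)", datetime(2020, 2, 1)),
    Record(2, "biomolecule B", datetime(2020, 2, 5)),
]
lake, watermark = incremental_update(lake, source, datetime(2020, 1, 15))
```

Only rows changed since the previous run are touched, so each run does work proportional to the number of changes rather than to the size of the lake. This sketch does not handle rows deleted at the source; a real pipeline would need a separate strategy for those.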

Impact


Data stores that were previously isolated are now aggregated in a single data lake. The data is also kept up to date by the incremental update mechanism that Calcey introduced. CompareNetworks can now analyze its data easily, which paves the way to productize and monetize it to its advantage.