August 16, 2023

From MongoDB to Google Data Studio

At CodeStax, we are always rapidly developing and deploying apps in the cloud. Sometimes the entire cycle from concept to production happens within a span of two weeks. To enable this kind of fast-paced prototyping, we usually go with serverless apps backed by a NoSQL database; Amazon's DynamoDB and MongoDB are the ones we default to, depending on our needs. While NoSQL neatly fits our requirements of building highly scalable apps with a flexible data model and low latency, it falls woefully short when it comes to analytics.

Enter Google Data Studio!!

Data Studio allows you to quickly import data and build interactive reports and dashboards. Data-intensive applications of yesteryear were designed as separate OLTP (transactional) and OLAP (analytical) systems, where the transactional system would routinely dump data into the analytical system, which management would later use to slice and dice the data. In a similar vein, we began using Google Data Studio as our de facto reporting solution. Add to this the fact that Google Analytics (for web and apps) and other Google products (like Search Console, Ads, etc.) can push data directly into Data Studio, and you really have a 360-degree view of your web apps. Did I also mention that it's free?

But there is no default connector from MongoDB to Google Data Studio, so we were left to our own devices. Our original solution was to simply take a dump of the relevant collections from MongoDB, flatten it, and push it into a bunch of Google Sheets, which were then linked directly to Data Studio. However, we quickly realized that for large datasets this approach would be very slow.

Enter Google BigQuery!!

BigQuery is Google's data warehouse solution. We created a simple Python module that pulls data from MongoDB and pushes it into BigQuery, which in turn feeds the Data Studio dashboard automatically. Deploy this Python script on AWS Lambda (or any other serverless cloud solution) with an EventBridge rule triggering it on whatever schedule you need, and you have yourself a cost-effective, easy-to-deploy solution. Most of this is just boilerplate code glued together and is a testament to the power of the cloud services available to us. With minimal fuss and a few Google searches, we were able to create a world-class reporting system.

LIBRARIES

We use Python for this project, although it should be just as simple on any of the other supported platforms. PyMongo is the library for connecting to MongoDB, and the Google Cloud client libraries cover BigQuery and Google Cloud Storage. The python-dotenv package loads our credentials from a .env file.
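
As a rough sketch, the dependency list and imports look something like this (the package names in the comment are the standard PyPI ones):

```python
# Install the assumed dependencies with:
#   pip install pymongo google-cloud-bigquery google-cloud-storage python-dotenv
from pymongo import MongoClient      # MongoDB driver
from google.cloud import bigquery    # BigQuery client
from google.cloud import storage     # Cloud Storage client, used for staging CSV files
from dotenv import load_dotenv       # reads variables from a .env file into the environment
```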

CODE

Now that our environment is set up, let's get down to the nuts and bolts of the solution. To access MongoDB you will need credentials, and since this is part of an ETL pipeline it's advisable to create a separate user with read-only access. You can store the credentials directly in the .env file, or keep them in a separate JSON file that the .env merely points to. Either way, it's good practice to make sure the .env file and any other credentials are not part of your repo; how to manage that is outside the scope of this article. In this project we store the credentials in a JSON file.
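
A minimal sketch of that setup follows; the MONGO_CREDENTIALS_FILE variable and the JSON keys are names we use for illustration, not anything mandated by a library:

```python
import json
import os

from dotenv import load_dotenv

load_dotenv()  # pick up MONGO_CREDENTIALS_FILE (and the rest) from the .env file

# The JSON key file holds the read-only user's details, e.g.
# {"user": "...", "password": "...", "host": "...", "database": "..."}
with open(os.environ["MONGO_CREDENTIALS_FILE"]) as f:
    mongo_creds = json.load(f)

mongo_uri = (
    f"mongodb+srv://{mongo_creds['user']}:{mongo_creds['password']}"
    f"@{mongo_creds['host']}/{mongo_creds['database']}"
)
```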

MongoDB - Code

This is fairly straightforward code. The export function exports your query results to CSV or JSON. The CSV path is slightly more elaborate, as MongoDB data is notorious for not having a fixed schema: when you flatten it out, you will probably find documents in the same collection having different keys, which need to be managed (different rows ending up with different columns).
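
The original module is not reproduced here, so the snippet below is only a sketch of what such a mongoConnection class could look like; it assumes the documents are already flat (no nested objects) and builds the CSV header from the union of all keys it sees:

```python
import csv
import json

from pymongo import MongoClient


class mongoConnection:
    """Thin wrapper around a (read-only) MongoDB connection."""

    def __init__(self, uri, database):
        self.client = MongoClient(uri)
        self.db = self.client[database]

    def export(self, collection, query=None, out_file="export.csv", fmt="csv"):
        """Run `query` against `collection` and dump the results to CSV or JSON."""
        docs = list(self.db[collection].find(query or {}, {"_id": 0}))

        if fmt == "json":
            with open(out_file, "w") as f:
                json.dump(docs, f, default=str)
            return out_file

        # Different documents can have different keys, so the header is the
        # union of every key seen; missing values are written as blanks.
        fieldnames = sorted({key for doc in docs for key in doc})
        with open(out_file, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
            writer.writeheader()
            writer.writerows(docs)
        return out_file
```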

BigQuery - Code

Google allows you to stream data into BigQuery, although this is discouraged for batch workloads like ours. What we illustrate instead is creating a CSV file in a Google Cloud Storage bucket and passing its link to a BigQuery helper function that loads the data into the table. BigQuery tables conform to SQL standards, so you should make sure that the data in your file (which is essentially all strings) strictly conforms to the schema. The code is fairly self-explanatory.
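
Again, in place of the original module, here is a hedged sketch of such a bigQuery helper built on the google-cloud-storage and google-cloud-bigquery clients; the class and method names, the exports/ prefix, and the WRITE_TRUNCATE behaviour are our own choices:

```python
from google.cloud import bigquery, storage


class bigQuery:
    """Loads a local CSV into BigQuery by staging it in a Cloud Storage bucket."""

    def __init__(self, project, dataset, bucket_name):
        # Both clients pick up GOOGLE_APPLICATION_CREDENTIALS automatically.
        self.bq = bigquery.Client(project=project)
        self.bucket = storage.Client(project=project).bucket(bucket_name)
        self.dataset = dataset

    def load_csv(self, local_path, table_name, schema=None):
        # 1. Stage the file in the bucket.
        blob = self.bucket.blob(f"exports/{table_name}.csv")
        blob.upload_from_filename(local_path)
        gcs_uri = f"gs://{self.bucket.name}/{blob.name}"

        # 2. Kick off a load job from the staged file.
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=schema is None,   # let BigQuery infer types if no schema is given
            schema=schema,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        )
        table_id = f"{self.bq.project}.{self.dataset}.{table_name}"
        load_job = self.bq.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
        load_job.result()  # wait for the job to finish and surface any errors
        return table_id
```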

A few notes on the GCP environment setup.

We need to create a GCP project, and inside it a BigQuery dataset under which our tables will live. Then we create a service account that is given write access to the project, download its credentials (a JSON key file), and point to its location in our .env file via GOOGLE_APPLICATION_CREDENTIALS. Finally, we create a Cloud Storage bucket and give the service account read and write access to it. And we are all set.

Our .env file: we store the dataset details under BIG_QUERY_DETAILS.
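
For reference, here is a sketch of how we read it back; the idea of storing BIG_QUERY_DETAILS as a small JSON blob, and the other variable names shown in the comment, are assumptions for illustration:

```python
import json
import os

from dotenv import load_dotenv

load_dotenv()

# Illustrative .env contents:
#   GOOGLE_APPLICATION_CREDENTIALS=./keys/bq-service-account.json
#   MONGO_CREDENTIALS_FILE=./keys/mongo-readonly.json
#   BIG_QUERY_DETAILS={"project": "my-project", "dataset": "reporting", "bucket": "my-export-bucket"}

big_query_details = json.loads(os.environ["BIG_QUERY_DETAILS"])
```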

We can now write a Python script which imports mongoConnection and bigQuery (the classes from the code above) and applies our business logic there.
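
A sketch of what that glue script might look like; the module paths, collection names, and env variables other than BIG_QUERY_DETAILS and GOOGLE_APPLICATION_CREDENTIALS are placeholders:

```python
import json
import os

from dotenv import load_dotenv

from mongo_connection import mongoConnection   # the class sketched in the MongoDB section
from big_query import bigQuery                 # the class sketched in the BigQuery section


def run_etl():
    load_dotenv()
    details = json.loads(os.environ["BIG_QUERY_DETAILS"])
    with open(os.environ["MONGO_CREDENTIALS_FILE"]) as f:
        creds = json.load(f)

    mongo = mongoConnection(
        f"mongodb+srv://{creds['user']}:{creds['password']}@{creds['host']}",
        creds["database"],
    )
    bq = bigQuery(details["project"], details["dataset"], details["bucket"])

    # Business logic lives here: pick the collections you report on,
    # export each one to CSV, and load it into its BigQuery table.
    for collection in ["orders", "customers"]:
        csv_path = mongo.export(collection, out_file=f"/tmp/{collection}.csv")
        bq.load_csv(csv_path, table_name=collection)


if __name__ == "__main__":
    run_etl()
```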

WHERE TO NEXT

Typically, ETL pipelines extract data from the transactional database and push it into the reporting solution at particular intervals. This frequency depends on the reporting requirements of the application (how quickly you need to respond to market dynamics). At Trinom we usually set this to once a day (around 1 AM). To do this we create an AWS Lambda (or a serverless service from a cloud provider of your choice) which is triggered by an EventBridge rule. For the BigQuery upload, we can also consider using the Parquet format (which is much smaller) or Avro (which is very strict about schema). We also write functions to verify the schema of the file before the upload starts. What we have shown above is a quick way to cobble something together which can evolve based on our needs.
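
As an illustration, the Lambda entry point can stay tiny; the module name is ours, and an EventBridge schedule rule such as cron(0 1 * * ? *) would invoke it daily at 1 AM UTC:

```python
# lambda_function.py
from etl import run_etl   # the glue script sketched above (illustrative module name)


def lambda_handler(event, context):
    """Triggered on a schedule by an EventBridge rule; runs the MongoDB -> BigQuery sync."""
    run_etl()
    return {"status": "ok"}
```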
