When we think about data storage, the first thing that comes to mind is a regular database. This can be any of the popular options like MySQL, SQL Server, Postgres, or Vertica, but I've noticed that not many people have interacted with the service Google provides for the same purpose: Google BigQuery. Maybe that's because of the pricing, but many companies are moving to cloud services, and BigQuery seems to be a great fit for them.
In this post I'd like to demonstrate, in a few steps, how to build a sync job that describes a Salesforce instance, takes a few of its objects, and creates a full schema for those objects (tables) in a Google BigQuery dataset. With the schema created, we can then push data from Salesforce into BigQuery and see it in our Google Cloud Console project.
To connect to Salesforce and Google BigQuery, there are a few prerequisites:
- If you don’t have a Salesforce instance, you can create a developer one here.
- On the Salesforce side you will need a username, password, and security token (you can follow this process to get one).
- A developer instance contains a few records, but if you need more data, loading some will give the sync process more information to move over.
GCP (Google Cloud Platform)
- You can sign up here for free. Google gives you $300 of credit for 90 days to test the product (similar to Azure). If you already have a Google account, you can use it for this.
Creating a new project in GCP and setting up our service account key.
Once you sign up for your account on GCP, click on the New Project option and enter a project name; in this example I chose mulesoft.
Once the project is created, go to the menu on the left and select IAM & Admin > Service Accounts.
Now we can create our service account.
“A service account is a special type of Google account intended to represent a non-human user that needs to authenticate and be authorized to access data in Google APIs. Typically, service accounts are used in scenarios such as: Running workloads on virtual machines”
At the top of the page you should see the option to create one; you just need to specify a name and click Create and Continue.
The next step is to set the permissions: from the role dropdown, select BigQuery Admin.
Once the account is created, select Manage Keys from the three-dot menu on the right.
Then we can create a new key; a JSON key is enough for this case. The key is downloaded automatically to your computer (keep this JSON file somewhere safe; you will need it later).
Dataset in Big Query
Datasets are top-level containers that are used to organize and control access to your tables and views. A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.
From the left menu we can search for BigQuery and click on it.
That takes us to the BigQuery console; now we can click on the three-dot menu and select the Create dataset option.
Now we just need to set the name to salesforce and click Create Dataset.
Setting up our Mule application.
Since this is a sync job we don’t need an API specification, but this pattern can also fit scenarios where another application needs to consume specific endpoints or operations.
Let’s open Anypoint Studio (in my case on a Mac) and use the default template. We are going to create six flows:
- Sync. This flow just triggers the process.
- DescribeInstance. This flow calls the describe operation on the Salesforce connector to retrieve information about every object in the Salesforce instance, and loops over the objects we are going to use so the rest of the job can process them.
- DescribeIndividualSalesforceObject. Describes a specific Salesforce object; it captures the fields and field types (STRING, EMAIL, ID, REFERENCE, etc.) and is in charge of building a payload that BigQuery will recognize, so the table can be created in GBQ.
- BigQueryCreateTable. This flow is only in charge of creating the table in BigQuery from the Salesforce object name and fields.
- QuerySalesforceObject. Dynamically queries the Salesforce object and pulls the data (here we limit the output to 100 records; at a bigger scale this should of course be done as a batch process).
- InsertDataIntoBigQuery. This flow only pushes the prepared data into BigQuery.
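Outside of Mule, the orchestration of these flows can be sketched in plain Python. All the function names here are hypothetical stand-ins for the Mule flows, not real connector calls:

```python
# Sketch of the six-flow pipeline; each callable stands in for a Mule flow.

OBJECTS_TO_SYNC = ["Account", "Contact", "Project__c"]

def describe_instance(all_objects):
    # DescribeInstance: keep only the objects we care about
    return [o for o in all_objects if o["name"] in OBJECTS_TO_SYNC]

def sync(all_objects, describe_object, create_table, query_object, insert_rows):
    # Sync: trigger the whole process, one object at a time
    for obj in describe_instance(all_objects):
        schema = describe_object(obj["name"])     # DescribeIndividualSalesforceObject
        create_table(obj["name"], schema)         # BigQueryCreateTable
        rows = query_object(obj["name"], schema)  # QuerySalesforceObject
        insert_rows(obj["name"], rows)            # InsertDataIntoBigQuery
```

The point of the sketch is just the ordering: each object is described, its table is created, and only then is its data queried and inserted.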
Now let’s grab the JSON key generated by Google and copy the file into the src/main/resources folder. The key lets us authenticate against our project and execute the operations.
Import the Google Big Query connector.
From Exchange we can search for “Big Query” and we should see the connector listed.
Then we can use the “Add to project” option, and the operations appear in the Palette.
As I mentioned, this flow is only in charge of triggering the whole application, so we only need a Scheduler component and a Flow Reference to the DescribeInstance flow.
This flow describes the whole Salesforce instance using the Describe Global operation. The next step is a DataWeave transform that filters the result down to only the objects we are interested in; in this case I’m pulling three: Account, Contact, and a custom object called Project__c. I also left a few more attributes in the transformation so that we only pull objects we are actually able to query.
Finally, we loop over these three objects, and a flow reference calls the other flows to continue the process.
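The DataWeave filter itself isn't reproduced here, but the equivalent logic is easy to express in Python, assuming the Describe Global result exposes each object's `name` and `queryable` attributes (which Salesforce's describe response does include):

```python
# Keep only the objects we want to sync, and only if Salesforce
# reports them as queryable.

WANTED = {"Account", "Contact", "Project__c"}

def filter_objects(describe_result):
    """describe_result: list of dicts from Describe Global,
    each with at least 'name' and 'queryable' keys."""
    return [o["name"] for o in describe_result
            if o["name"] in WANTED and o.get("queryable")]
```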
This flow takes the name of the Salesforce object and describes it; the connector only asks for the object name. Then we have a pretty interesting DataWeave transformation.
Salesforce data types don’t map one-to-one to BigQuery types, so we need a little trick to create the schema in BigQuery seamlessly. I’ve created a small function that converts fields like ID, REFERENCE, TEXTAREA, PHONE, ADDRESS, PICKLIST, and EMAIL to STRING (these references and values are really nothing more than text), maps DOUBLE and CURRENCY to FLOAT, and changes INT fields to INTEGER.
Finally, because location fields are a bit tricky and we can’t do much with them through the API, I’m removing all location fields.
The output of this is the actual schema we will use to create the table in Google BigQuery.
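The DataWeave function isn't shown here, but the same type-conversion and filtering logic can be sketched in Python (the field names in the example are illustrative):

```python
# Map Salesforce field types to BigQuery column types, drop LOCATION
# fields entirely, and emit a BigQuery-style schema definition.

STRING_TYPES = {"ID", "REFERENCE", "TEXTAREA", "PHONE",
                "ADDRESS", "PICKLIST", "EMAIL"}

def to_bigquery_type(sf_type):
    t = sf_type.upper()
    if t in STRING_TYPES:
        return "STRING"
    if t in ("DOUBLE", "CURRENCY"):
        return "FLOAT"
    if t == "INT":
        return "INTEGER"
    return t  # STRING, BOOLEAN, DATE, DATETIME, ... pass through as-is

def to_schema(fields):
    """fields: list of (name, salesforceType) pairs from the describe call."""
    return [{"name": name, "type": to_bigquery_type(t)}
            for name, t in fields
            if t.upper() != "LOCATION"]
```

Running `to_schema` over the describe output gives exactly the kind of field list BigQuery's table-creation operation expects.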
This flow allows us to create the table in BigQuery; we only need to specify Table, Dataset and Table Fields.
This flow queries the object in Salesforce and then maps the data dynamically to prepare the payload for BigQuery.
The query is built from the variable “salesforceFields”, the same field list we collected when we described the object, using this script.
Finally, I’m limiting the result to only 100 records.
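The query-building script isn't reproduced here, but composing a dynamic SOQL statement from the collected field list boils down to this (a hypothetical helper, not the actual DW script):

```python
def build_soql(object_name, field_names, limit=100):
    """Compose a dynamic SOQL query from the fields collected
    during the describe step (the 'salesforceFields' variable)."""
    return "SELECT {} FROM {} LIMIT {}".format(
        ", ".join(field_names), object_name, limit)
```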
The next step is to take the Salesforce result data and map it dynamically using this script:
Thanks so much to Alexandra Martinez for the insights on the utilities for DW 2.0! (https://github.com/alexandramartinez/DataWeave-scripts/blob/main/utilities/utilities.dwl )
This last script maps the records, using each key as the field name and each value as the field value; the value needs to be replaced with a Date for the strings that are actually dates or datetimes. I consider this the best script in this app.
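The actual script uses DataWeave 2.0 utilities; a rough Python equivalent of the idea (detect date/datetime-looking strings and coerce them, leave every other value alone) could look like this:

```python
import re

# ISO-style patterns for date and datetime strings as Salesforce returns them
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
DATETIME_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")

def coerce_value(value):
    # Only strings that look like datetimes or dates are rewritten
    if isinstance(value, str):
        if DATETIME_RE.match(value):
            # Normalize to "YYYY-MM-DD HH:MM:SS"
            return value[:19].replace("T", " ")
        if DATE_RE.match(value):
            return value
    return value

def map_record(record):
    """Map a Salesforce record: keys become field names, values are
    coerced where the string is actually a date or datetime."""
    return {k: coerce_value(v) for k, v in record.items()}
```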
This flow just inserts the data we prepared; we only need to specify the table ID, the dataset ID, and the row data.
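The connector builds the request for us, but for reference, BigQuery's streaming-insert REST API (tabledata.insertAll) expects each row wrapped in a `json` key; building that payload from our mapped records is trivial:

```python
def to_insert_all_payload(records):
    """Wrap mapped records in the body shape the BigQuery
    tabledata.insertAll REST API expects."""
    return {"rows": [{"json": rec} for rec in records]}
```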
Running our Mule application.
Now we can run our application and see the new tables and the data in Google BigQuery.
On GCP I can see the tables I selected created:
If we open any of them, we can look at the schema to verify that all the fields are there.
Finally, we can query the table in the console or click the Preview option to check that the data is there.
I think this is a fairly common request in the integration space, and many tweaks can be added if you are thinking about big migrations or jobs that eventually require tables to be created automatically from Salesforce into GCP.
If you’d like to try it, I created this GitHub repository. I hope this was useful, and I’m open to hearing about any enhancements or scenarios.