Trifacta—data standardization

Johnny A. Uelmen, Jr.
Andrew Clark
John Palmer
Jared Kohler
Landon C. Van Dyke
Russanne Low
Connor D. Mapes
Ryan M. Carney

Within the Azure cloud, raw data are imported into Trifacta, an application designed for wrangling large, raw datasets. Each dataset is cleaned and restructured to match the Open Geospatial Consortium (OGC) SensorThings format, which enables cross-source data analysis [18]. Every change made to the data structure is tracked and recorded to document data provenance: within Trifacta, each transformation applied to bring the raw data to the OGC standard is documented in a “recipe.” Each recipe contains customized formatting steps, including text and number conversion, creation of new fields, and calculations; the recipes can be found in Additional file 1. The dataset produced by each recipe is exported in .json format to an Azure Blob storage container called “Derived Data.” Each recipe is scheduled to run daily, after the raw data have been copied into the Azure cloud via Azure Data Factory. Because each dataset follows the OGC standard, all data are stored as nested objects in .json format. To enable ingestion into the Esri ArcGIS® Online service, each dataset is run through one additional recipe that outputs the data as a flat .json dataset.
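The final flattening step can be illustrated outside Trifacta. The following is a minimal Python sketch, not the authors' recipe (the actual recipes are in Additional file 1): it assumes a hypothetical SensorThings-style observation with nested objects and collapses it into a single-level record by joining key names with an underscore. The sample record, its values, the `flatten` helper, and the underscore separator are all illustrative assumptions.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Recursively collapse nested dicts (and lists, by index) into one flat dict."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Nested object: descend and prefix its keys with the parent key.
            flat.update(flatten(value, new_key, sep))
        elif isinstance(value, list):
            # List: give each element its own indexed key.
            for i, item in enumerate(value):
                indexed_key = f"{new_key}{sep}{i}"
                if isinstance(item, dict):
                    flat.update(flatten(item, indexed_key, sep))
                else:
                    flat[indexed_key] = item
        else:
            flat[new_key] = value
    return flat


# Hypothetical nested record loosely modeled on a SensorThings Observation.
nested = {
    "phenomenonTime": "2021-06-01T12:00:00Z",
    "result": 3,
    "Datastream": {
        "name": "example datastream",
        "ObservedProperty": {"name": "example property"},
    },
    "FeatureOfInterest": {
        "feature": {"type": "Point", "coordinates": [-82.41, 28.06]},
    },
}

flat = flatten(nested)
print(json.dumps(flat, indent=2))
```

Joining parent and child keys keeps every field name unique in the flat output, which suits tools that expect one attribute per column; the nested version remains the authoritative OGC-formatted copy.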
