What is the DataBank?

The DataBank is the central data repository of the DataSphere Lab. It is accessible to lab members, including Desautels professors, students, and industry professionals. It draws on the collective knowledge of Desautels faculty and serves as the backbone of data-driven exploration and innovation, giving the analytics community at the DataSphere Lab more convenient access to a wide range of data.

What the DataBank Offers

The DataBank provides access to both public and private datasets across multiple disciplines.

  1. Benefits for Academics
    The DataBank offers datasets used by Desautels professors for research, along with supplementary data that enhances related studies.

  2. Benefits for Industry Professionals
    The DataBank provides rare, high-value datasets that support deep dives into academic research, paired with industry-specific data for thorough cross-industry and academic analyses.

  3. Additional Features for Industry Clients
    The DataBank equips industry clients with a ready-to-use data analysis environment, supports Proof of Concept (PoC) projects to tackle business challenges, facilitates rapid data iteration, and validates the feasibility of solutions.

Preliminary Results: Investigating Unique Datasets

Interviewed: 8 professors in Operations Management, Marketing, Information Systems, and Finance

Identified: 34 datasets across Retail, Entertainment, Blockchain, Social Services, and Fashion

Extracted, transformed, and loaded: 4 datasets
  1. New York Police Department & Arrest Data
  2. Netflix Movies & TV Shows
  3. Uber Fare Datasets
  4. Amazon Review Data (2018)


How it has been built

High level infrastructure

Our DataBank architecture is designed to aggregate a rich and diverse collection of data sources and datasets from various data providers. Data analysts catalog these sources first.

Data arrives in multiple formats: structured data such as CSV, JSON, and Parquet files, SQL databases, and APIs, and unstructured data such as PDF documents. These diverse data types are channeled through connectors into Azure Data Factory and integrated into the DataBank using Microsoft Azure's cloud services. Once processed, data is stored in Azure Storage in a tiered structure that reflects its processing stage.
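As a rough illustration of the structured versus unstructured split described above, the sketch below sorts file-based sources by extension and builds a landing path for raw data. The extension sets, the function names, and the path layout are assumptions made for this example; they are not the DataBank's actual configuration, and non-file sources such as APIs are out of scope here.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the broad categories named in the text.
STRUCTURED_EXTENSIONS = {".csv", ".json", ".parquet", ".sql"}
UNSTRUCTURED_EXTENSIONS = {".pdf"}


def classify_source(path: str) -> str:
    """Return 'structured', 'unstructured', or 'unknown' for a source file path."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in STRUCTURED_EXTENSIONS:
        return "structured"
    if suffix in UNSTRUCTURED_EXTENSIONS:
        return "unstructured"
    return "unknown"


def raw_landing_path(provider: str, path: str) -> str:
    """Build a hypothetical landing path for raw data (layout is an assumption)."""
    name = PurePosixPath(path).name
    return f"bronze/{provider}/{classify_source(path)}/{name}"
```

Keeping the category in the path is one possible design choice: it lets later processing steps pick the right parser without opening the file.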

Data stored in the DataBank are classified into three stages:


  1. Bronze
    Raw data

  2. Silver
    Processed data after Extract, Transform, Load (ETL) processes

  3. Gold
    Enriched data post-Machine Learning (ML)


Most of the DataBank's data is at the silver stage, ready for descriptive and/or real-time analytics, with a few datasets still at the bronze stage. Datasets that are processed through ML at a client's request are considered gold data.
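The three stages can be thought of as a simple function of a dataset's processing history. The following is a minimal sketch under stated assumptions: the `Dataset` class and its `etl_done` and `ml_enriched` flags are hypothetical names introduced for illustration, and the rule that gold requires both ETL and ML is an assumption, not a documented DataBank policy.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    BRONZE = "bronze"  # raw data
    SILVER = "silver"  # processed data after ETL
    GOLD = "gold"      # enriched data after ML


@dataclass
class Dataset:
    name: str
    etl_done: bool = False     # hypothetical flag: ETL has been run
    ml_enriched: bool = False  # hypothetical flag: ML enrichment has been run


def stage_of(dataset: Dataset) -> Stage:
    """Map a dataset's processing history to its storage stage."""
    # Assumption: a dataset only counts as gold once it is both ETL-processed and ML-enriched.
    if dataset.etl_done and dataset.ml_enriched:
        return Stage.GOLD
    if dataset.etl_done:
        return Stage.SILVER
    return Stage.BRONZE
```

For example, `stage_of(Dataset("example", etl_done=True))` returns `Stage.SILVER`, matching the stage where most of the data is described as sitting.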

How it will be used
  1. Users interact with stored datasets through various technologies and interfaces, facilitated by a suite of Microsoft Azure services. The data flows through Azure Synapse for analytics and Azure Machine Learning for ML processes, with additional tools such as PyTorch, PyCharm, Python, and TensorFlow for further development.
  2. For data querying and management, services like Azure Data Studio and serverless SQL can be used, while Azure Static Web Apps and APIs are used to present the data, possibly via web apps using HTML/CSS.
  3. Data visualization and analysis are performed through Power BI and Tableau, enabling users like project managers, sales and marketing professionals, developers, data scientists, and data analysts to interpret and utilize the data effectively.
  4. Lastly, OpenAI and a ChatGPT-like interface provide another method of interaction, so that users can access and query the datasets conversationally. This option suits users with less technical background.
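To give a feel for the SQL-based querying mentioned in the list above, here is a small sketch of a descriptive query over a processed table. It uses Python's built-in `sqlite3` module as a local stand-in; it does not call Azure serverless SQL or any Azure API. The table name `silver_fares`, its columns, and the rows are all made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for a SQL endpoint; this is NOT the Azure serverless SQL API.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE silver_fares (ride_id INTEGER, passengers INTEGER, fare REAL)"
)
# Toy rows invented for this example only; they are not DataBank data.
conn.executemany(
    "INSERT INTO silver_fares VALUES (?, ?, ?)",
    [(1, 1, 10.0), (2, 2, 20.0), (3, 1, 30.0)],
)


def average_fare_by_passengers(connection: sqlite3.Connection) -> list:
    """Return (passengers, average fare) rows, a typical descriptive query."""
    query = (
        "SELECT passengers, AVG(fare) FROM silver_fares "
        "GROUP BY passengers ORDER BY passengers"
    )
    return connection.execute(query).fetchall()
```

The same `GROUP BY` pattern would carry over to most SQL engines, though connection setup and dialect details differ from service to service.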