Azure Data Fundamentals

Core Data Concepts – Part 1:

Data can be classified into three types: structured, semi-structured, and unstructured.

Structured Data:

Structured data has a fixed schema, is represented in table format, and is typically stored in a database.

Image showing how structured data is represented in tables
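As a rough illustration of a fixed schema, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and sample rows are assumptions made up for this example. Every row must supply the same columns, which is what makes the data structured.

```python
import sqlite3


def build_customer_table():
    """Create an in-memory table with a fixed schema and two sample rows."""
    conn = sqlite3.connect(":memory:")
    # The schema is declared once; every row must fit these columns.
    conn.execute(
        "CREATE TABLE customer ("
        " id INTEGER PRIMARY KEY,"
        " first_name TEXT NOT NULL,"
        " last_name TEXT NOT NULL,"
        " city TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO customer (id, first_name, last_name, city)"
        " VALUES (?, ?, ?, ?)",
        [(1, "Joe", "Jones", "New York"), (2, "Samir", "Nadoy", "Seattle")],
    )
    conn.commit()
    return conn


conn = build_customer_table()
rows = conn.execute(
    "SELECT first_name, city FROM customer ORDER BY id"
).fetchall()
print(rows)  # [('Joe', 'New York'), ('Samir', 'Seattle')]
```

Inserting a row without a required column (for example, omitting city) would raise an error here, in contrast to the semi-structured case.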

Semi-structured Data:

Semi-structured data has some structure but allows variation between entity instances. For example, some instances have a value for an attribute that others lack, as in the sample below, where only the second customer's address has a unit. One common format is JSON.

// Customer 1
{
  "firstName": "Joe",
  "lastName": "Jones",
  "address":
  {
    "streetAddress": "1 Main St.",
    "city": "New York",
    "state": "NY",
    "postalCode": "10099"
  },
  "contact":
  [
    {
      "type": "home",
      "number": "555 123-1234"
    },
    {
      "type": "email",
      "address": "joe@litware.com"
    }
  ]
}

// Customer 2
{
  "firstName": "Samir",
  "lastName": "Nadoy",
  "address":
  {
    "streetAddress": "123 Elm Pl.",
    "unit": "500",
    "city": "Seattle",
    "state": "WA",
    "postalCode": "98999"
  },
  "contact":
  [
    {
      "type": "email",
      "address": "samir@northwind.com"
    }
  ]
}
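Code that reads semi-structured data has to tolerate missing attributes. The sketch below is one possible way to do that in Python, using a trimmed-down copy of the two customers above. The format_address helper and the "Unit" label are hypothetical names chosen for this example.

```python
import json

# Trimmed copy of the two sample customers; only the second has a "unit".
customers_json = """
[
  {"firstName": "Joe",
   "address": {"streetAddress": "1 Main St.", "city": "New York"}},
  {"firstName": "Samir",
   "address": {"streetAddress": "123 Elm Pl.", "unit": "500", "city": "Seattle"}}
]
"""


def format_address(customer):
    """Build a one-line address, including the unit only when it exists."""
    address = customer["address"]
    street = address["streetAddress"]
    unit = address.get("unit")  # returns None when the attribute is absent
    if unit is not None:
        street = f"{street} Unit {unit}"
    return f"{street}, {address['city']}"


customers = json.loads(customers_json)
lines = [format_address(c) for c in customers]
print(lines)  # ['1 Main St., New York', '123 Elm Pl. Unit 500, Seattle']
```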

Unstructured Data:

Unstructured data, such as documents, images, videos, and binary files, has no specific structure.

Data in all of these forms is stored either in a file store or in a database.

The file format used to store data depends on factors such as the type of data, the applications and services that use the data, and the need for the data to be human readable.

File Formats:

Delimited text files store structured data in a human-readable format that is widely used by applications and services. Example: CSV.
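A minimal sketch of reading delimited text with Python's standard csv module follows. The column names and values are invented for this example; the first line of the text acts as the header that names each field.

```python
import csv
import io

# A small comma-delimited document; the first line names the fields.
csv_text = (
    "FirstName,LastName,Email\n"
    "Joe,Jones,joe@litware.com\n"
    "Samir,Nadoy,samir@northwind.com\n"
)

# DictReader maps each data line to the header names.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

for record in records:
    print(record["FirstName"], record["Email"])
```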

JavaScript Object Notation (JSON) is a flexible format suitable for both structured and semi-structured data. JSON uses a hierarchical document schema to define data entities.

XML (Extensible Markup Language) is a human-readable format. It has largely been superseded by the less verbose JSON format.
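To make the verbosity difference concrete, the sketch below serializes the same small record as JSON and as XML using only the Python standard library. The record and the customer element name are assumptions for this example, and the size gap will vary with the shape of the data.

```python
import json
import xml.etree.ElementTree as ET

customer = {"firstName": "Joe", "lastName": "Jones"}

# JSON: one key/value pair per attribute.
as_json = json.dumps(customer)

# XML: every attribute needs an opening and a closing tag.
root = ET.Element("customer")
for key, value in customer.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
print(len(as_json), len(as_xml))
```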

Binary Large Object (BLOB): data stored in binary format, such as images, video, audio, and application-specific documents.

Optimized File Formats:

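Human-readable formats such as delimited text, JSON, and XML are not optimized for storage space or processing of structured and semi-structured data. Optimized file formats, which support compression, indexing, and efficient storage and processing, include Avro, ORC, and Parquet.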

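Avro:

Avro is a row-based format created by Apache. Each record contains a header that describes the structure of the data in the record, followed by the data itself. The header is stored as JSON and the data is stored in binary form. Avro is good for compressing data and for reducing storage and network bandwidth requirements.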
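The idea of a JSON header followed by binary data can be sketched with the standard library alone. The layout below is a toy format invented for this example and is NOT the actual Avro encoding; the schema, the length prefixes, and the encode_record and decode_record helpers are all assumptions. Real Avro files would normally be read and written with an Avro library.

```python
import json
import struct

# Hypothetical schema describing the fields stored in the binary part.
schema = {
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
}


def encode_record(record_id, name):
    """Pack a JSON header and a binary body into one byte string."""
    header = json.dumps(schema).encode("utf-8")
    name_bytes = name.encode("utf-8")
    # Body: 4-byte int id, 2-byte name length, then the name bytes.
    body = struct.pack("<i", record_id)
    body += struct.pack("<H", len(name_bytes)) + name_bytes
    # A 2-byte length prefix tells the reader where the header ends.
    return struct.pack("<H", len(header)) + header + body


def decode_record(blob):
    """Read the JSON header, then decode the binary body."""
    (header_len,) = struct.unpack_from("<H", blob, 0)
    header = json.loads(blob[2:2 + header_len].decode("utf-8"))
    offset = 2 + header_len
    (record_id,) = struct.unpack_from("<i", blob, offset)
    (name_len,) = struct.unpack_from("<H", blob, offset + 4)
    start = offset + 6
    name = blob[start:start + name_len].decode("utf-8")
    return header, {"id": record_id, "name": name}


blob = encode_record(1, "Joe")
header, record = decode_record(blob)
print(header["name"], record)  # Customer {'id': 1, 'name': 'Joe'}
```

The binary body carries no field names at all, which is where the space saving comes from; the header is what lets a reader interpret it.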

ORC:

Hortonworks developed the Optimized Row Columnar (ORC) format, which organizes data into columns rather than rows, to optimize read and write operations in Apache Hive. An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns.

Parquet:

Parquet is a columnar data format created by Cloudera and Twitter. A Parquet file contains row groups. Data for each column is stored together in the same row group, and each row group contains one or more chunks of data. A Parquet file also includes metadata that describes the set of rows found in each chunk. The format supports very efficient compression and encoding schemes.
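The row group idea can be sketched in plain Python. This is a conceptual model only, not the Parquet file format: the sample rows, the group size, and the to_row_groups and sum_column helpers are assumptions made for this example. It shows why a columnar layout helps when a query needs only one column.

```python
# Sample rows as an application might produce them (row-oriented).
rows = [
    {"id": 1, "city": "New York", "amount": 10.0},
    {"id": 2, "city": "Seattle", "amount": 25.5},
    {"id": 3, "city": "Seattle", "amount": 7.25},
    {"id": 4, "city": "New York", "amount": 3.0},
]


def to_row_groups(rows, group_size):
    """Split rows into groups and store each column's values together."""
    groups = []
    for start in range(0, len(rows), group_size):
        batch = rows[start:start + group_size]
        # Within a group, values of the same column sit next to each other.
        columns = {name: [r[name] for r in batch] for name in batch[0]}
        groups.append({"num_rows": len(batch), "columns": columns})
    return groups


def sum_column(groups, name):
    """Aggregate one column without touching the other columns."""
    return sum(sum(g["columns"][name]) for g in groups)


groups = to_row_groups(rows, group_size=2)
total = sum_column(groups, "amount")
print(groups[0]["columns"]["city"])  # ['New York', 'Seattle']
print(total)  # 45.75
```

Storing similar values side by side is also what tends to make compression and encoding effective in columnar formats.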

References:

https://docs.microsoft.com/en-us/learn/paths/azure-data-fundamentals-explore-core-data-concepts/

Copyright © Claypot Technologies 2021