Is AWS Glue Data Catalog schema-on-read or schema-on-write?
2024

AWS Glue Data Catalog follows a “schema-on-read” approach rather than “schema-on-write.”
Explanation:
- Schema-on-write: In this approach, the schema is defined and enforced when data is written to a database or storage system. If the data does not conform to the predefined schema, an error is returned, and the data is rejected. This is typical for traditional relational databases.
- Schema-on-read: In this approach, the schema is applied when the data is read. The data is stored without strict schema enforcement at the time of writing, and the schema is applied (and often inferred) only when the data is accessed or queried. This is common in data lakes, such as those built on top of Amazon S3 and other distributed storage systems, where the data format can be flexible.
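The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not how any particular database or AWS service is implemented: the `SCHEMA` dictionary, the list-based "stores", and all three function names are hypothetical.

```python
import json

# Hypothetical schema used only for this illustration.
SCHEMA = {"id": int, "name": str}


def write_with_schema(store, record):
    """Schema-on-write: validate before storing and reject bad records."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"field {field!r} is not of type {expected.__name__}")
    store.append(json.dumps(record))


def write_raw(store, record):
    """Schema-on-read: store whatever arrives, with no validation."""
    store.append(json.dumps(record))


def read_with_schema(store):
    """Apply the schema only while reading, casting each field on the way out."""
    for line in store:
        raw = json.loads(line)
        yield {field: cast(raw[field]) for field, cast in SCHEMA.items()}
```

With these definitions, a record such as `{"id": "2", "name": "b"}` raises a `ValueError` in `write_with_schema`, while `write_raw` accepts it and `read_with_schema` casts `id` to the integer `2` at read time.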
AWS Glue Data Catalog:
AWS Glue follows a schema-on-read model because:
- It infers the schema by reading data that is already stored (e.g., through Glue Crawlers) and records the result in the Data Catalog as table metadata.
- No strict schema enforcement occurs when writing the data to S3 or other storage systems. If the data has mixed types, Glue tries to infer a schema that can handle the diversity of data (e.g., using the STRING type for mixed data types).
- Glue Crawlers and jobs do not validate or enforce schema consistency when the data is written. The schema is only applied when the data is processed or queried.
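The fallback to a string type for mixed data can be illustrated with a toy inference routine. This is a simplified sketch and not the actual logic of Glue Crawlers or their classifiers, which handle many more formats and types. The function names `infer_type` and `infer_schema` are hypothetical, and the rule that widens mixed integers and floats to `double` is an assumption made for the example.

```python
def infer_type(values):
    """Return a Hive-style type name for one column's sampled values."""
    kinds = set()
    for value in values:
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            kinds.add("boolean")
        elif isinstance(value, int):
            kinds.add("bigint")
        elif isinstance(value, float):
            kinds.add("double")
        else:
            kinds.add("string")
    if len(kinds) == 1:
        return kinds.pop()
    # Assumed widening rule: mixed integers and floats become double.
    if kinds == {"bigint", "double"}:
        return "double"
    # Any other mixture falls back to string, which can represent every value.
    return "string"


def infer_schema(records):
    """Infer one type per column from already-stored records."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(values) for name, values in columns.items()}
```

Nothing in this sketch runs when the records are written. Inference happens afterwards, over data that is already in storage, which is the point the list above makes about Glue.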
Thus, Glue is an example of a schema-on-read system where the schema is flexible and inferred upon data access rather than being enforced at the time of data writing.