Browsing:

Category: Data Eng

Reading Large Datasets in Chunks with Pandas and PySpark using CSV

S3, especially when dealing with 3 GB or more of data. Using Pandas to process data in chunks is a very effective method for handling large files without overwhelming your system’s memory. Here’s how to manage the process step-by-step: 1. Read more…
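As a rough illustration of the chunked approach this excerpt describes, here is a minimal sketch and not the post's own steps, which are cut off above. It assumes the CSV is already reachable as a local path or file-like object. The function name `sum_column_in_chunks`, the column name `amount`, and the tiny in-memory sample are hypothetical placeholders.

```python
import io

import pandas as pd


def sum_column_in_chunks(csv_source, column, chunksize=100_000):
    """Sum one numeric column while holding only one chunk in memory at a time."""
    total = 0.0
    rows = 0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += chunk[column].sum()
        rows += len(chunk)
    return total, rows


# Tiny in-memory CSV standing in for a multi-gigabyte file (hypothetical data).
sample = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,4.5\n4,5.0\n5,10.0\n")
total, rows = sum_column_in_chunks(sample, "amount", chunksize=2)
print(total, rows)
```

Peak memory is bounded by the chunk size rather than the file size, so a suitable `chunksize` depends on row width and available RAM.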


Data engineering challenges

When data engineers work with an ETL pipeline involving Python, AWS services like S3 (data lake), AWS Glue (Data Catalog), Athena (for querying), and Amazon Redshift (final data storage), they face several common challenges. Below are these challenges and their Read more…


Building better security and data governance without AWS Lake Formation

Building better security and data governance without AWS Lake Formation can still be effectively achieved by leveraging other AWS services and best practices. Here’s how you can ensure robust security and governance for your data lake in S3, while using Read more…


Data engineering principles

Data engineering principles are foundational concepts and best practices used to design, build, and maintain scalable, reliable, and efficient data systems. These principles ensure that data pipelines and infrastructures are robust and can support the growing and changing needs of Read more…


Data-lake optimization (AWS)

Optimizing a data lake on AWS S3 with Athena and Glue Data Catalog involves several best practices to ensure data is efficiently stored, queried, and managed. The combination of S3 for storage, Glue for data cataloging, and Athena for querying Read more…


In PostgreSQL, table partitioning

In PostgreSQL, partitioning means dividing a large table into smaller, more manageable pieces called partitions, which together act as a single logical table. These partitions are physically separate from each other on disk, which helps optimize performance Read more…


Table partitioning in Redshift

Amazon Redshift does not support native table partitioning in the same way as databases like PostgreSQL or Oracle. However, Redshift provides alternative methods that achieve performance benefits similar to partitioning, primarily through sort keys and distribution styles. These allow for Read more…


Amazon Redshift materialized views

Amazon Redshift materialized views do not refresh automatically by default when the underlying data changes. Unless a view was created with the AUTO REFRESH YES option, you need to refresh it manually to keep the data up to date. How to Refresh a Materialized View To refresh the data in a Read more…