Data Eng – Page 2 – Data Engineer Labs

Reading Large Datasets in Chunks with Pandas and PySpark using CSV

2024

S3, especially when dealing with 3 GB or more of data. Using Pandas to process data in chunks is a very effective method for handling large files without overwhelming your system’s memory. Here’s how to manage the process step-by-step: 1. Read more…
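A minimal sketch of the chunked-reading idea mentioned in this excerpt, using the `chunksize` parameter of pandas `read_csv`. The helper name `sum_column_in_chunks`, the `amount` column, and the in-memory sample data are illustrative assumptions, not taken from the full post; a real S3 file would be passed in as a path or file-like object instead.

```python
import io

import pandas as pd


def sum_column_in_chunks(csv_source, column, chunksize=100_000):
    """Stream a CSV in fixed-size chunks and accumulate a column total.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    total = 0.0
    rows = 0
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += chunk[column].sum()
        rows += len(chunk)
    return total, rows


# Small in-memory stand-in for a large file (assumed sample data).
sample = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,4.5\n")
total, rows = sum_column_in_chunks(sample, "amount", chunksize=2)
```

The chunk size trades memory for overhead: larger chunks mean fewer iterations but a higher peak memory footprint.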

Data Eng

Data engineering challenges

2024

When data engineers work with an ETL pipeline involving Python, AWS services like S3 (data lake), AWS Glue (Data Catalog), Athena (for querying), and Amazon Redshift (final data storage), they face several common challenges. Below are these challenges and their Read more…

Data Eng

Building better security and data governance without AWS Lake Formation

2024

Building better security and data governance without AWS Lake Formation can still be effectively achieved by leveraging other AWS services and best practices. Here’s how you can ensure robust security and governance for your data lake in S3, while using Read more…

Data Eng

Data engineering principles

Data engineering principles are foundational concepts and best practices used to design, build, and maintain scalable, reliable, and efficient data systems. These principles ensure that data pipelines and infrastructures are robust and can support the growing and changing needs of Read more…

Data Eng

Data-lake optimization (AWS)

2024

Optimizing a data lake on AWS S3 with Athena and Glue Data Catalog involves several best practices to ensure data is efficiently stored, queried, and managed. The combination of S3 for storage, Glue for data cataloging, and Athena for querying Read more…

Data Eng, SQL

In PostgreSQL, table partitioning

In PostgreSQL, partitioning means dividing a large table into smaller, more manageable pieces called partitions, which together act as a single logical table. These partitions are physically separate from each other on disk, which helps optimize performance Read more…

Data Eng

Table partitioning in Redshift

Amazon Redshift does not support native table partitioning in the same way as databases like PostgreSQL or Oracle. However, Redshift provides alternative ways to achieve performance benefits similar to partitioning, primarily through sort keys and distribution styles. These allow for Read more…

Data Eng

Amazon Redshift materialized views

Amazon Redshift materialized views do not refresh automatically by default when the underlying data changes. Unless a view is created with AUTO REFRESH YES, you need to refresh it manually to keep the data up to date. How to Refresh a Materialized View: To refresh the data in a Read more…
