Introduction to Building Batch Data Pipelines

Batch Pipelines Overview

  • Definition: Pipelines that process a fixed amount of data and terminate.

    • Example: Daily processing of financial transactions to balance books and store reconciled data.
  • ETL vs ELT:

    • ELT (Extract, Load, Transform):

      • Load raw data into a target system and perform transformations on the fly as needed.
      • Suitable when transformations depend on user needs or view requirements (e.g., raw data accessible through views).
    • ETL (Extract, Transform, Load):

      • Extract data, transform it in an intermediate system, then load into a target system (e.g., Dataflow to BigQuery).
      • Preferred for complex transformations or data quality improvements before loading.
  • Use Cases:

    • EL (Extract and Load):

      • Use when data is already clean and ready to load as-is (e.g., log files in Cloud Storage).
      • Suitable for one-time batch loads and scheduled, recurring loads (a minimal load-job sketch follows this list).
    • ELT (Extract, Load, Transform):

      • Start with loading as EL, then transform data on-demand or based on evolving requirements.
      • Useful when the required transformations are not known up front, e.g., exploring raw JSON API responses directly in BigQuery.
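
For illustration, a minimal EL-style batch load of already-clean log files from Cloud Storage into BigQuery using the BigQuery Python client; the bucket, project, dataset, and table names below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; the source files are assumed to be clean, newline-delimited JSON.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/logs/*.json",            # EL: no transformation step
    "example-project.example_dataset.raw_logs",
    job_config=job_config,
)
load_job.result()  # Wait for the batch load to finish.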

Quality Considerations in Data Pipelines

  • Data Quality Dimensions:

    • Validity: Data conforms to business rules (e.g., transaction amount validation).
    • Accuracy: Data reflects objective truth (e.g., precise financial calculations).
    • Completeness: All required data is processed (e.g., no missing records).
    • Consistency: The same fact agrees across different sources and systems (e.g., matching totals, uniform formats).
    • Uniformity: Values within the same column use the same units and meaning.
  • Methods to Address Quality Issues:

    • BigQuery Capabilities:
      • Use views to filter out invalid data rows (e.g., negative quantities).
      • Aggregate and filter with the HAVING clause for completeness checks.
      • Handle NULLs and blanks with COUNTIF and IF so they don't skew computations (see the sketch at the end of this section).
  • Ensuring Data Integrity:

    • Verification and Backfilling:
      • Verify file integrity with checksums (e.g., MD5).
      • Address consistency issues such as duplicates with COUNT and COUNT DISTINCT.
      • Clean data using string functions to remove extra characters or format inconsistencies.
  • Documentation and Best Practices:

    • Document SQL queries, transformations, and data units clearly to maintain uniformity.
    • Use SQL functions like CAST and FORMAT to manage data type changes and indicate units clearly.
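
As a rough sketch of the BigQuery techniques above (a view that filters invalid rows, and COUNTIF/IF for NULL handling), using the BigQuery Python client; the dataset, table, and column names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Validity: expose only rows that pass the business rule (no negative quantities).
client.query("""
    CREATE OR REPLACE VIEW example_dataset.valid_orders AS
    SELECT * FROM example_dataset.raw_orders
    WHERE quantity >= 0
""").result()

# Completeness/accuracy: count missing prices and keep NULLs from skewing an average.
results = client.query("""
    SELECT
      COUNTIF(price IS NULL) AS missing_prices,
      AVG(IF(price IS NULL, 0, price)) AS avg_price_null_as_zero
    FROM example_dataset.raw_orders
""").result()

for row in results:
    print(dict(row))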

Shortcomings of ELT and Use Cases for ETL

  • Complex Transformations Beyond SQL:

    • Examples include:
      • Translating text (e.g., Spanish to English using external APIs).
      • Analyzing complex customer action streams over time windows.
    • SQL may not handle these transformations efficiently or feasibly.
  • Continuous Data Quality Control and Transformation:

    • Use ETL when:
      • Raw data requires extensive quality control or enrichment before loading into BigQuery.
      • Transformations are too complex or cannot be expressed in SQL.
      • Continuous streaming data needs processing and integration (e.g., using Dataflow for real-time updates).
  • Integration with CI/CD Systems:

    • ETL is suitable for integrating with continuous integration and continuous delivery (CI/CD) systems.
      • Allows for unit testing of pipeline components and scheduled launches of Dataflow pipelines.

Google Cloud ETL Tools

  • Dataflow:

    • Fully managed, serverless data processing service based on Apache Beam.
    • Supports both batch and streaming data processing pipelines.
    • Ideal for transforming and enriching data before loading into BigQuery.
  • Dataproc:

    • Managed service for running Apache Hadoop and Apache Spark jobs.
    • Cost-effective for complex batch processing, querying, and machine learning tasks.
    • Offers features like autoscaling and integrates seamlessly with BigQuery.
  • Data Fusion:

    • Cloud-native enterprise data integration service for building and managing data pipelines.
    • Provides a visual interface for non-programmers to create pipelines.
    • Supports transformation, cleanup, and ensuring data consistency across various sources.

Considerations for Choosing ETL

  • Specific Needs that ETL Addresses:

    • Low Latency and High Throughput:

      • BigQuery offers low latency (100 ms with BI Engine) and high throughput (up to 1 million rows per second).
      • For even stricter latency and throughput requirements, consider Cloud Bigtable.
    • Reusing Existing Spark Pipelines:

      • If your organization already uses Hadoop and Spark extensively, leveraging Dataproc may be more productive.
    • Visual Pipeline Building:

      • Data Fusion provides a drag-and-drop interface for visual pipeline creation, suitable for non-technical users.

Metadata Management and Data Governance

  • Importance of Data Lineage and Metadata:
    • Data Catalog provides:
      • Metadata management for cataloging data assets across multiple systems.
      • Access controls and integration with Cloud Data Loss Prevention API for data classification.
      • Unified data discovery and tagging capabilities for organizing and governing data assets effectively.

Executing Spark on Dataproc

Dataproc: Google Cloud’s managed Hadoop and Spark service; this section focuses on running Apache Spark.

The Hadoop Ecosystem

  • Historical Context:
    • Pre-2006: Big data relied on large databases; storage was cheap, processing was expensive.
    • 2006 Onward: Distributed processing of big data became practical with Hadoop.
  • Hadoop Components:
    • HDFS (Hadoop Distributed File System): Stores data on cluster machines.
    • MapReduce: Distributed data processing framework.
    • Related Tools: Hive, Pig, Spark, Presto.
  • Apache Hadoop:
    • Open-source software for distributed processing across clusters.
    • Main File System: HDFS.
  • Apache Spark:
    • High-performance analytics engine.
    • In-memory processing: Up to 100 times faster than Hadoop jobs.
    • Data Abstractions: Resilient Distributed Datasets (RDDs), DataFrames.
  • Cloud vs. On-premises:
    • On-premises Hadoop: Limited by physical resources.
    • Cloud Dataproc: Overcomes tuning and utilization issues of OSS Hadoop.
  • Dataproc Benefits:
    • Managed hardware/configuration: No need for physical hardware management.
    • Version management: Simplifies updating open-source tools.
    • Flexible job configuration: Create multiple clusters for specific tasks.

Running Hadoop on Dataproc

  • Advantages of Dataproc:
    • Low Cost: Priced at one cent per virtual CPU in your cluster per hour, on top of the underlying Compute Engine resources.
    • Speed: Clusters start, scale, and shut down in 90 seconds or less.
    • Resizable Clusters: Quick scaling with various VM types and disk sizes.
    • Open-source Ecosystem: Frequent updates to Spark, Hadoop, Pig, and Hive.
    • Integration: Works seamlessly with Cloud Storage, BigQuery, Cloud Bigtable.
    • Management: Easy interaction through cloud console, SDK, REST API.
    • Versioning: Switch between different versions of Spark, Hadoop, and other tools.
    • High Availability: Multiple primary nodes, job restart on failure.
    • Developer Tools: Cloud console, SDK, RESTful APIs, SSH access.
    • Customization: Pre-configured optional components, initialization actions.

Dataproc Cluster Architecture

  • Cluster Components:
    • Primary Nodes: Run the HDFS NameNode, the YARN ResourceManager, and job drivers.
    • Worker Nodes: Can be part of a managed instance group.
    • Preemptible Secondary Workers: Lower cost but can be preempted anytime.
  • Storage Options:
    • Persistent Disks: Standard storage method.
    • Cloud Storage: Recommended over native HDFS for durability.
    • Cloud Bigtable/BigQuery: Alternatives for HBase and analytical workloads.
  • Cluster Lifecycle:
    • Setup: Create clusters via the Cloud console, command line, YAML files, Terraform, or the REST API (a client-library sketch follows this list).
    • Configuration: Specify region, zone, primary node, worker nodes.
    • Initialization: Customize with initialization scripts and metadata.
    • Optimization: Use preemptible VMs and custom machine types to balance cost and performance.
    • Utilization: Submit jobs through console, command line, REST API, or orchestration services.
    • Monitoring: Cloud Monitoring for job and cluster metrics, alert policies for incidents.
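
A minimal sketch of programmatic cluster creation with the Dataproc Python client library (one of the setup options listed above); the project ID, region, cluster name, and machine types are placeholders.

from google.cloud import dataproc_v1

project_id = "example-project"   # placeholder
region = "us-central1"           # placeholder

# Use the regional endpoint for the Dataproc Cluster Controller API.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "ephemeral-etl-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# create_cluster returns a long-running operation; result() blocks until the cluster is ready.
operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")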

Key Features of Dataproc

  • Low Cost: Pay-per-use with second-by-second billing.
  • Fast Cluster Operations: 90 seconds or less to start, scale, or shut down.
  • Flexible and Resizable: Variety of VM types, disk sizes, and networking options.
  • Integration with Google Cloud: Seamless interaction with other Google Cloud services.
  • Ease of Use: Minimal learning curve, supports existing Hadoop tools.
  • High Availability and Reliability: Multiple primary nodes, restartable jobs.
  • Developer-Friendly: Multiple ways to manage clusters, customizable via scripts and metadata.

Using Google Cloud Storage Instead of HDFS

Network Evolution and Data Proximity

  • Original Network Speeds: Slow, necessitating data to be close to the processor.
  • Modern Network Speeds: Petabit networking enables independent storage and compute.
  • On-Premises Hadoop Clusters: Require local storage on cluster disks because the same servers handle both compute and storage.

Cloud Migration and HDFS

  • Lift-and-Shift: Moving Hadoop workloads to Dataproc with HDFS requires no code changes but is a short-term solution.
  • Long-Term Issues with HDFS in Cloud:
    • Block Size: Performance is tied to the server hardware hosting the disks; storage is not elastic.
    • Data Locality: Keeping data on persistent disks attached to the cluster limits flexibility.
    • Replication: HDFS keeps three replicas of each block for durability, roughly tripling storage overhead.

Advantages of Cloud Storage over HDFS

  • Network Efficiency: Google’s Jupiter network fabric offers over one petabit per second of bisection bandwidth.
  • Elastic Storage: Cloud Storage scales independently of compute resources.
  • Cost Efficiency: Pay only for what you use, unlike HDFS which requires overprovisioning.
  • Integration: Cloud Storage integrates seamlessly with various Google Cloud services.

Considerations for Using Cloud Storage

  • Latency: High throughput but higher latency than local HDFS; best suited to large, bulk operations rather than many small reads and writes.
  • Object Storage Nature:
    • Renaming objects is expensive due to immutability.
    • Cannot append to objects.
  • Output Committers: New object store-oriented committers help mitigate directory rename issues.

Data Transfer and Management

  • DistCp Tool: The standard tool for copying data into Cloud Storage; prefer a push model when you already know which data you need to move.
  • Data Management Continuum:
    • Pre-2006: Big databases for big data.
    • 2006: Distributed processing with Hadoop.
    • 2010: BigQuery launch.
    • 2015: Google’s Dataproc for managed Hadoop and Spark clusters.

Ephemeral Clusters

  • Ephemeral Model: Utilize clusters as temporary resources, reducing costs by not paying for idle compute capacity.
  • Workflow Example: Keep initial input and final output in Cloud Storage, using local HDFS only for intermediate data if needed (see the PySpark sketch below).
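
A rough PySpark sketch of that workflow: read input from and write output to Cloud Storage so the cluster itself can be deleted as soon as the job finishes; the bucket paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-wordcount").getOrCreate()

# The Cloud Storage connector on Dataproc resolves gs:// paths, so no HDFS copy is needed.
lines = spark.read.text("gs://example-bucket/input/*.txt")
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count()

# The final output also lands in Cloud Storage, surviving cluster deletion.
counts.write.mode("overwrite").csv("gs://example-bucket/output/wordcount/")

spark.stop()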

Cluster Optimization

  • Data Locality: Ensure Cloud Storage bucket and Dataproc region are physically close.
  • Network Configuration: Avoid bottlenecks by ensuring traffic isn’t funneled through limited gateways.
  • Input File Management:
    • Reduce the number of input files (<10,000) and partitions (<50,000) if possible.
    • Adjust fs.gs.block.size for larger datasets.

Persistent Disks and Virtual Machines

  • Disk Size: Choose appropriate persistent disk size to avoid performance limitations.
  • VM Allocation: Prototype and benchmark with real data to determine optimal VM size.
  • Job-Scoped Clusters: Customize clusters for specific tasks to optimize resource use.

Local HDFS vs. Cloud Storage

  • Local HDFS: Better for jobs requiring frequent metadata operations, small file sizes, or heavy I/O workloads with low latency needs.
  • Cloud Storage: Ideal for the initial and final data source in pipelines, reducing disk requirements and saving costs.

Autoscaling and Templates

  • Workflow Templates: Automate cluster creation, job submission, and cluster deletion.
  • Autoscaling: Scale clusters based on YARN metrics, providing flexible capacity and improving resource utilization.
  • Initial Workers: Set the number of initial workers to ensure basic capacity from the start.

Monitoring and Logging

  • Cloud Logging: Consolidates logs for easy error diagnosis and monitoring.
  • Custom Dashboards: Use Cloud Monitoring to visualize CPU, disk, and network usage, and YARN resources.
  • Application Logs: Retrieve Spark job logs from the driver output, Cloud Console, or Cloud Storage bucket.

Serverless Data Processing with Dataflow

  • Dataflow Overview:

    • Serverless: Dataflow is serverless, eliminating the need to manage clusters.
    • Scaling: Autoscales each pipeline step independently (fine-grained scaling), unlike Dataproc, which scales at the cluster level.
    • Unified Processing: Supports both batch and stream processing with the same code.
  • Choosing Between Dataflow and Dataproc:

    • Recommendation: Use Dataflow for new pipelines due to serverless benefits.
    • Existing Hadoop Pipelines: Consider Dataproc for migration and modernization.
    • Decision Factors: Depends on existing technology stack and operational preferences.
  • Unified Programming with Apache Beam:

    • Historical Context: Batch processing (1940s) vs. stream processing (1970s).
    • Apache Beam: Unifies batch and stream processing concepts.
    • Core Concepts: PTransforms, PCollections, pipelines, and pipeline runners.
  • Immutable Data and Distributed Processing:

    • PCollections: Immutable, distributed data abstractions in Beam and Dataflow.
    • Serialization: Elements are serialized as byte strings for efficient network transfer.
    • Processing Model: Each transform produces a new PCollection, enabling distributed processing.

Why Customers Value Dataflow

  • Efficient Execution and Optimization:

    • Pipeline Execution: Dataflow manages job execution from data source to sink.
    • Optimization: Optimizes job graph, fuses transforms, and dynamically rebalances work units.
    • Resource Management: On-demand deployment of compute and storage resources per job.
  • Reliability and Fault Tolerance:

    • Continuous Optimization: Ongoing optimization and rebalancing of resources.
    • Fault Tolerance: Handles late data arrivals, with automatic restarts and monitoring.
    • Auto-Scaling: Scales resources dynamically during job execution, improving job efficiency.
  • Operational Efficiency and Integration:

    • Managed Service: Fully managed and auto-configured service, reducing operational overhead.
    • Integration: Acts as a central component connecting various Google Cloud services.
    • Use Cases: Supports diverse use cases from BigQuery to Cloud SQL seamlessly.

Building Dataflow Pipelines in Code

Constructing a Dataflow Pipeline

  • Pipeline Construction:

    • Python Syntax: Use pipe symbols (|) to chain PTransforms from an input PCollection.
    • Java Syntax: Use .apply() method instead of pipe symbols.
  • Branching in Pipelines:

    • Multiple Transform Paths: Apply different transforms to the same input PCollection.
    • Named Outputs: Store the results in separate PCollection variables (e.g., pcollection_out1, pcollection_out2).
  • Pipeline Initialization and Termination:

    • Pipeline Start: Begin with a source that produces the initial PCollection (e.g., beam.io.ReadFromText).
    • Pipeline End: Terminate with a sink operation (e.g., writing to a text file in Cloud Storage).
    • Context Management: Use a Python with statement to manage the pipeline's execution context (a minimal sketch follows this list).
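
A minimal Apache Beam (Python) sketch of the pattern described above; the Cloud Storage paths are placeholders.

import apache_beam as beam

# The with statement manages the pipeline's execution context (the pipeline runs on exit).
with beam.Pipeline() as pipeline:
    lines = pipeline | "Read" >> beam.io.ReadFromText("gs://example-bucket/input.txt")
    lengths = lines | "Length" >> beam.Map(len)
    lengths | "Write" >> beam.io.WriteToText("gs://example-bucket/output/lengths")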

Key Considerations in Pipeline Design

  • Data Sources and Sinks:

    • Reading Data: Utilize beam.io to read from various sources like text files, Pub/Sub, or BigQuery.
    • Writing Data: Specify sinks such as BigQuery tables using beam.io.WriteToBigQuery.
  • Transforming Data with PTransforms:

    • Mapping Data: Use Map to apply a function to each element in the PCollection.
    • Flat Mapping: Apply FlatMap to handle one-to-many relationships, flattening results.
    • ParDo Transformation: Custom per-element processing; the processing code must be serializable and thread-compatible.
    • DoFn Class: Defines the distributed processing function; keep it serializable and ideally idempotent (a short sketch follows this list).
  • Handling Multiple Outputs:

    • Branching and Outputs: Output different types or formats from a single PCollection using ParDo.
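
A short Beam (Python) sketch contrasting Map, FlatMap, and ParDo with a DoFn, applying several transforms to the same input PCollection; the input data is an inline example.

import apache_beam as beam

class SplitWords(beam.DoFn):
    # A DoFn must be serializable; process() yields zero or more outputs per element.
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as pipeline:
    lines = pipeline | beam.Create(["to be or not to be"])

    upper = lines | "Map" >> beam.Map(str.upper)                    # one output per element
    words = lines | "FlatMap" >> beam.FlatMap(lambda l: l.split())  # one-to-many, flattened
    also_words = lines | "ParDo" >> beam.ParDo(SplitWords())        # custom per-element logic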

Aggregate with GroupByKey and Combine

  • Dataflow Model: Utilizes a shuffle phase after the map phase to group together like keys.

    • Operates on a PCollection of key-value pairs (tuples).
    • Groups by common key, returning a key-value pair where the value is a group of values.
  • Example: Finding zip codes associated with a city.

    • Key-Value Pair Creation: Create pairs and group by key.
    • Data Skew Issue:
      • Small Scale: Manageable with 1 million items.
      • Large Scale: 1 billion items can cause performance issues.
      • High Cardinality: Performance concerns in queries with billions of records.
  • Performance Concern:

    • Grouping by key causes unbalanced workload among workers.
    • Example:
      • 1 million X values on one worker.
      • 1,000 Y values on another worker.
    • Inefficiency: Idle worker waiting for the other to complete.
  • Dataflow Optimization:

    • Design applications to divide work into aggregation steps.
    • Push grouping towards the end of the processing pipeline.
  • CoGroupByKey: Similar to GroupByKey but groups results across multiple PCollections by key.

  • Combine: Used to aggregate values in PCollections.

    • Combine Variants:
      • Combine.globally(): Reduces a PCollection to a single value.
      • Combine.perKey(): Similar to GroupByKey but combines values using a specified function.
    • Prebuilt Combine Functions: For common operations like sum, min, and max.
    • Custom Combine Function: Create a subclass of CombineFn for complex operations.
      • Four Operations:
        1. Create Accumulator: New local accumulator.
        2. Add Input: Add input to the accumulator.
        3. Merge Accumulators: Merge multiple accumulators.
        4. Extract Output: Produce final result from the accumulator.
    • Efficiency: Combine is typically faster than GroupByKey followed by manual aggregation, because combining can be partially applied in parallel before the shuffle.
    • Custom Combine Class: The combining function must be commutative and associative (see the sketch after this list).
  • Flatten: Merges multiple PCollection objects into a single PCollection, similar to SQL UNION.

  • Partition: Splits a single PCollection into smaller collections.

    • Use Case: Different processing for different quartiles.
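
A Beam (Python) sketch of the aggregation transforms above: GroupByKey, Combine per key and globally, and a custom CombineFn implementing the four operations for an average; the key-value data is an inline example.

import apache_beam as beam

class AverageFn(beam.CombineFn):
    # The four operations of a custom combine function.
    def create_accumulator(self):
        return (0.0, 0)                      # (running sum, count)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return total + value, count + 1

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")

with beam.Pipeline() as pipeline:
    sales = pipeline | beam.Create([("NY", 10), ("CA", 20), ("NY", 5)])

    grouped = sales | beam.GroupByKey()        # ('NY', [10, 5]), ('CA', [20])
    totals = sales | beam.CombinePerKey(sum)   # ('NY', 15), ('CA', 20)
    average = (
        sales
        | beam.Map(lambda kv: kv[1])
        | beam.CombineGlobally(AverageFn())    # a single value for the whole PCollection
    )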

Side Inputs and Windows of Data

  • Side Inputs: Additional inputs to a ParDo transform.
    • Use Case: Inject additional data at runtime.
    • Example:
      • Compute average word length.
      • Use it as a side input to determine whether each word is longer or shorter than average (see the sketch after this list).
  • Dataflow Windows:
    • Global Window: Default, not useful for unbounded PCollections (streaming data).
    • Time-based Windows:
      • Useful for processing streaming data.
      • Example: Sliding Windows - beam.WindowInto(beam.window.SlidingWindows(60, 30)).
        • Capture 60 seconds of data, start new window every 30 seconds.
      • Fixed Windows: Example - group sales by day.
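
A Beam (Python) sketch of a side input and a sliding window, following the example above; the word list is an inline example.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    words = pipeline | beam.Create(["data", "pipeline", "beam"])

    # Side input: a single value computed from another PCollection,
    # injected into the main transform at runtime.
    avg_len = (
        words
        | beam.Map(len)
        | beam.CombineGlobally(beam.combiners.MeanCombineFn())
    )
    longer_than_avg = words | beam.Filter(
        lambda word, avg: len(word) > avg,
        avg=beam.pvalue.AsSingleton(avg_len),
    )

    # Windowing (most useful on timestamped / streaming data):
    # 60-second windows, with a new window starting every 30 seconds.
    windowed = words | beam.WindowInto(beam.window.SlidingWindows(60, 30))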

Creating and Re-using Pipeline Templates

  • Dataflow Templates: Enable non-developers to execute Dataflow jobs.
    • Workflow:
      • Developer creates pipeline with Dataflow SDK (Java/Python).
      • Separate development from execution activities.
    • Benefits:
      • Simplifies scheduling batch jobs.
      • Allows deployment from environments like App Engine, Cloud Functions.
  • Custom Templates:
    • Value Providers: Defer argument parsing so parameter values can be supplied when the template is executed.
    • Runtime Parameters: Convert compile-time parameters into runtime parameters that template users can set (a minimal sketch follows this list).
    • Metadata File: Describes template parameters and functions for downstream users.
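
A minimal sketch of a templatable pipeline that exposes a runtime parameter through a ValueProvider argument; the option name and Cloud Storage paths are hypothetical, and staging the template still requires the usual Dataflow template options.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # A runtime parameter: its value is provided when the template is executed,
        # not when the template is compiled and staged.
        parser.add_value_provider_argument(
            "--input_path",
            type=str,
            help="Cloud Storage path to read from",
        )

options = PipelineOptions()
template_options = options.view_as(TemplateOptions)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | beam.io.ReadFromText(template_options.input_path)   # accepts a ValueProvider
        | beam.io.WriteToText("gs://example-bucket/template-output")
    )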

Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

  • Cloud Data Fusion provides a graphical user interface and APIs to build, deploy, and manage data integration pipelines efficiently.
  • Users: Developers, Data Scientists, Business Analysts
    • Developers: Cleanse, match, remove duplicates, blend, transform, partition, transfer, standardize, automate, and monitor data.
    • Data Scientists: Visually build integration pipelines, test, debug, and deploy applications.
    • Business Analysts: Run at scale, operationalize pipelines, inspect rich integration metadata.
  • Benefits:
    • Integration: Connects with a variety of systems (legacy, modern, relational databases, file systems, cloud services, NoSQL, etc.).
    • Productivity: Centralizes data from various sources (e.g., BigQuery, Cloud Spanner).
    • Reduced Complexity: Visual interface for building pipelines, code-free transformations, reusable templates.
    • Flexibility: Supports on-prem and cloud environments, interoperable with open-source software (CDAP).
    • Extensibility: Template pipelines, create conditional triggers, manage and create plugins, custom compute profiles.

Components of Cloud Data Fusion

  • User Interface Components:
    • Wrangler UI: Explore datasets visually, build pipelines with no code.
    • Data Pipeline UI: Draw pipelines on a canvas.
    • Control Center: Manage applications, artifacts, and datasets.
    • Pipeline Section: Developer studio, preview, export, schedule jobs, connector, function palette, navigation.
    • Integration Metadata: Search, add tags and properties, see data lineage.
    • Hub: Available plugins, sample use cases, prebuilt pipelines.
    • Administration: Management (services, metrics) and configuration (namespace, compute profiles, preferences, system artifacts, REST client).

Building a Pipeline

  • Pipeline Representation: Visual series of stages in a Directed Acyclic Graph (DAG).
    • Nodes: Different types (e.g., read data from Cloud Storage, parse CSV, join data, write to a data sink).
    • Canvas: Area for creating and chaining nodes.
    • Mini Map: Navigate large pipelines.
    • Pipeline Actions Tool Bar: Save, run, and manage pipelines.
    • Templates and Plugins: Use pre-existing resources to avoid starting from scratch.
    • Preview Mode: Test and debug pipelines before deployment.
    • Monitoring: Track health, data throughput, processing time, and anomalies.
    • Tags Feature: Organize pipelines for quick access.
    • Concurrency: Set maximum number of concurrent runs to optimize processing.

Exploring Data Using Wrangler

  • Wrangler UI: For exploring and analyzing new datasets visually.
    • Connections: Add and manage connections to data sources (e.g., Google Cloud Storage, BigQuery).
    • Data Exploration: Inspect rows and columns, view sample insights.
    • Data Transformation: Create calculated fields, drop columns, filter rows using directives to form a transformation recipe.
    • Pipeline Creation: Convert transformations into pipelines for regular execution.

Example Pipeline

  • Twitter Data Ingestion:
    • Ingest data from Twitter and Google Cloud.
    • Parse tweets.
    • Load into various data sinks.
  • Health Monitoring:
    • View start time, duration, and summary of pipeline runs.
    • Data throughput at each node.
    • Inputs, outputs, and errors per node.
  • Streaming Data Pipelines: Future modules will cover streaming data pipelines.

Key Metrics and Features

  • Metrics: Records out per second, average processing time, max processing time.
  • Automation: Set pipelines to run automatically at intervals.
  • Field Lineage Tracking: Track transformations applied to data fields.

Example: Campaign Field Lineage

  • Track every transformation before and after the field.
  • View the lineage of operations between datasets.
  • Identify time of last change and involved input fields.

Summary

  • Cloud Data Fusion: Efficient, flexible, and extensible tool for building and managing data pipelines.
  • User Interfaces: Wrangler UI for visual exploration, Data Pipeline UI for pipeline creation.
  • Building Pipelines: Use DAGs, nodes, canvas, templates, and plugins for streamlined pipeline creation.
  • Wrangler for Exploration: Visual data exploration and transformation using directives.
  • Monitoring and Automation: Track metrics, automate runs, and manage concurrency for optimized processing.
  • Field Lineage Tracking: Detailed tracking of data transformations for comprehensive data management.

Orchestrating Work Between Google Cloud Services with Cloud Composer

Overview

  • Orchestration: Managing multiple Google Cloud services (e.g., data fusion pipelines, ML models) in a specified order.
  • Cloud Composer: Fully managed Apache Airflow environment for workflow orchestration.

Apache Airflow Environment

  • Launching Cloud Composer: Can be done via command line or Google Cloud web UI.
  • Environment Variables: Edited at Apache Airflow instance level, not Cloud Composer level.
  • Components:
    • Airflow Web Server: Access via Cloud Composer to monitor and interact with workflows.
    • DAGs Folder: A Cloud Storage bucket that stores the Python DAG files.

DAGs and Operators

  • DAGs (Directed Acyclic Graphs):
    • Represent workflows in Airflow.
    • Consist of tasks invoking predefined operators.
  • Operators:
    • BigQuery Operators: Used for data querying and related tasks.
    • Vertex AI Operators: Used for retraining and deploying ML models.
    • Common Operators: Atomic tasks, often one operator per task.

Sample DAG File

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from datetime import datetime

def process_data():
    ## Your data processing code here
    pass

default_args = {
    'start_date': datetime(2023, 1, 1),
}

dag = DAG(
    'example_dag',
    default_args=default_args,
    schedule_interval='@daily',
)

start = DummyOperator(task_id='start', dag=dag)
process = PythonOperator(task_id='process', python_callable=process_data, dag=dag)
end = DummyOperator(task_id='end', dag=dag)

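# Declare task ordering: start, then process, then end.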
start >> process >> end

Workflow Scheduling

Periodic Scheduling

  • Description: Set schedule (e.g., daily at 6 a.m.).
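
In Airflow, a "daily at 6 a.m." schedule is typically expressed as a cron string on the DAG; a minimal sketch (the DAG id and start date are placeholders):

from airflow import DAG
from datetime import datetime

dag = DAG(
    'daily_6am_dag',
    default_args={'start_date': datetime(2023, 1, 1)},
    schedule_interval='0 6 * * *',   # run every day at 06:00
)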

Event-Driven Scheduling

  • Description: Triggered by events (e.g., new CSV files arriving in Cloud Storage).

Event-Driven Example

  • Cloud Function: Watches for new files in a Cloud Storage bucket and, in response, launches a templated Dataflow job (the same pattern can be used to trigger a Composer DAG).
const { google } = require("googleapis");
const dataflow = google.dataflow("v1b3");
const PROJECT_ID = "your-project-id";
const LOCATION = "your-location";
const TEMPLATE_PATH = "gs://your-template-path";
const BUCKET_NAME = "your-bucket-name";

exports.triggerDAG = async (event, context) => {
  // Authenticate with the Cloud Function's default service account (Application Default Credentials).
  const auth = new google.auth.GoogleAuth({
    scopes: ["https://www.googleapis.com/auth/cloud-platform"],
  });
  google.options({ auth: await auth.getClient() });

  const file = event.name;
  const metadata = {
    projectId: PROJECT_ID,
    location: LOCATION,
    templatePath: TEMPLATE_PATH,
    gcsPath: `gs://${BUCKET_NAME}/${file}`,
  };

  const request = {
    projectId: metadata.projectId,
    location: metadata.location,
    gcsPath: metadata.templatePath,
    requestBody: {
      parameters: {
        inputFile: metadata.gcsPath,
      },
    },
  };

  try {
    const response = await dataflow.projects.locations.templates.launch(
      request
    );
    console.log(`Job launched successfully: ${response.data.job.id}`);
  } catch (err) {
    console.error(`Error launching job: ${err}`);
  }
};

Monitoring and Logging

Monitoring DAG Runs

  • Check status: Monitor DAGs for success, running, or failure states and troubleshoot accordingly.
  • Logs: Available for each task and overall workflow.

Google Cloud Logs

  • Usage: Diagnose errors with Cloud Functions and other services.

Troubleshooting Example

  • Error Detection: Use logs to identify and correct issues (e.g., missing output bucket).
  • Case Sensitivity: Ensure correct naming in Cloud Functions.