Airflow Lecture 4 - Writing DAGs and Practicing with Operators

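Introduction

The week 4 class covers how to write actual Airflow DAG code. It first looks at transaction handling, then compares writing DAGs with PythonOperator against writing them with the Task Decorator. It also covers core Airflow features such as XCom, Connections, and Variables.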

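Transaction Handling

TRANSACTION

A transaction groups SQL statements that must execute atomically so that they are processed as a single unit of work.
The default Transaction Isolation Level is Read Committed (this is the PostgreSQL default; other databases may use a different one).
In Python, a transaction is typically wrapped in try/except, with a raise at the end of the except block so that the failure is not silently swallowed.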
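
The try/except pattern with a trailing raise can be sketched as follows. This is a minimal illustration that uses the standard library's sqlite3 in place of Redshift so it runs anywhere; the full_refresh function, the table layout, and the sample rows are assumptions made for this example.

```python
import sqlite3


def full_refresh(conn, rows):
    """Replace the table contents atomically: every row loads, or none do."""
    cur = conn.cursor()
    try:
        cur.execute("BEGIN;")
        cur.execute("DELETE FROM name_gender;")
        for name, gender in rows:
            if gender not in ("M", "F"):
                raise ValueError(f"unexpected gender: {gender!r}")
            cur.execute("INSERT INTO name_gender VALUES (?, ?);", (name, gender))
        cur.execute("COMMIT;")
    except Exception:
        cur.execute("ROLLBACK;")
        raise  # re-raise so the caller sees the failure


# isolation_level=None puts sqlite3 in autocommit mode, so BEGIN/COMMIT are explicit
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE name_gender (name TEXT, gender TEXT);")
full_refresh(conn, [("Keeyong", "M"), ("Claire", "F")])

try:
    full_refresh(conn, [("Broken", "?")])  # fails after the DELETE has run
except ValueError:
    pass  # the DELETE was rolled back together with the failed insert

rows = conn.execute("SELECT name, gender FROM name_gender ORDER BY name;").fetchall()
```

Because the failed second call is rolled back, the two rows from the first call are still in the table, and the re-raised exception is what would let a scheduler mark the task as failed.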

Writing Airflow DAGs in Practice

Basic DAG Code (PythonOperator Style)

An Airflow pipeline is structured as DAG >> tasks >> Operators. Writing a DAG starts with creating a DAG object.

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
 
# Declare the DAG object
dag = DAG( 
    dag_id = 'HelloWorld',
    start_date = datetime(2022,6,14),
    catchup=False,
    tags=['example'],
    schedule = '0 2 * * *')
 
# Declare the functions to run
def print_hello():
    print("hello!")
    return "hello!"
 
def print_goodbye():
    print("goodbye!")
    return "goodbye!"
 
# Declare the operators
print_hello = PythonOperator(
    task_id = 'print_hello',
    #python_callable param points to the function you want to run 
    python_callable = print_hello,
    #dag param points to the DAG that this task is a part of
    dag = dag)
 
print_goodbye = PythonOperator(
    task_id = 'print_goodbye',
    python_callable = print_goodbye,
    dag = dag)
 
# Set the execution order of the tasks in the DAG
print_hello >> print_goodbye
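
To see why a chain such as a >> b >> c works, here is a toy sketch of the >> operator. ToyTask and run_order are hypothetical names invented for this illustration and are not Airflow classes; the only point is that returning the right-hand operand from __rshift__ allows chaining.

```python
class ToyTask:
    """Hypothetical stand-in for an operator; not Airflow's real class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # Register `other` as a downstream task of `self`
        self.downstream.append(other)
        return other  # returning the right-hand task lets a >> b >> c chain


def run_order(task):
    """Walk the downstream links depth-first and collect task ids."""
    order = [task.task_id]
    for child in task.downstream:
        order.extend(run_order(child))
    return order


extract = ToyTask("extract")
transform = ToyTask("transform")
load = ToyTask("load")
extract >> transform >> load
```

Calling run_order(extract) then walks the links created by the chain, visiting extract, transform, and load in that order.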

A DAG can be tested from the CLI as follows.

docker exec -it data-engineering-batch13-airflow-scheduler-1 sh
 
airflow dags list
 
airflow tasks list HelloWorld
 
airflow tasks test HelloWorld print_hello 2023-07-01

Task Decorator Style

The @task decorator makes DAG code more concise.

from airflow import DAG
from airflow.decorators import task
from datetime import datetime
 
@task
def print_hello():
    print("hello!")
    return "hello!"
 
@task
def print_goodbye():
    print("goodbye!")
    return "goodbye!"
 
with DAG(
    dag_id = 'HelloWorld_v2',
    start_date = datetime(2022,5,5),
    catchup=False,
    tags=['example'],
    schedule = '0 2 * * *'
) as dag:
 
    # Assign the tasks to the DAG in order
    print_hello() >> print_goodbye()

A Real ETL DAG Example (name_gender CSV)

Below is a basic ETL DAG that reads a CSV file and loads it into Redshift.

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import requests
import logging
import psycopg2
 
def get_Redshift_connection():
    host = "learnde.cduaw970ssvt.ap-northeast-2.redshift.amazonaws.com"
    user = "lsyeup1206"  # use your own ID
    password = "..."  # use your own password
    port = 5439
    dbname = "dev"
    conn = psycopg2.connect(f"dbname={dbname} user={user} host={host} password={password} port={port}")
    conn.set_session(autocommit=True)
    return conn.cursor()
 
 
def extract(url):
    logging.info("Extract started")
    f = requests.get(url)
    logging.info("Extract done")
    return (f.text)
 
 
def transform(text):
    logging.info("Transform started")	
    lines = text.strip().split("\n")[1:]  # skip the first line (header)
    records = []
    for l in lines:
        (name, gender) = l.split(",")  # l = "Keeyong,M" -> ['Keeyong', 'M']
        records.append([name, gender])
    logging.info("Transform ended")
    return records
 
 
def load(records):
    logging.info("load started")
    """
    records = [
      [ "Keeyong", "M" ],
      [ "Claire", "F" ],
      ...
    ]
    """
    schema = "keeyong"
    # Wrapping the SQL in BEGIN and END (COMMIT) makes it a single transaction
    cur = get_Redshift_connection()
    try:
        cur.execute("BEGIN;")
        # Run DELETE FROM first -> this makes the load a FULL REFRESH
        cur.execute(f"DELETE FROM {schema}.name_gender;")
        for r in records:
            name = r[0]
            gender = r[1]
            print(name, "-", gender)
            sql = f"INSERT INTO {schema}.name_gender VALUES ('{name}', '{gender}')"
            cur.execute(sql)
        cur.execute("COMMIT;")   # cur.execute("END;")
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        cur.execute("ROLLBACK;")
        raise  # re-raise so Airflow marks the task as failed
    logging.info("load done")
 
 
def etl():
    link = "https://s3-geospatial.s3-us-west-2.amazonaws.com/name_gender.csv"
    data = extract(link)
    lines = transform(data)
    load(lines)
 
 
dag_second_assignment = DAG(
    dag_id = 'name_gender',
    catchup = False,
    start_date = datetime(2023,4,6),  # the DAG will not run if this date is in the future
    schedule = '0 2 * * *')  # adjust as needed
 
task = PythonOperator(
    task_id = 'perform_etl',
    python_callable = etl,
    dag = dag_second_assignment)
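
The load step above builds each INSERT with an f-string, which breaks as soon as a value contains a single quote. A common alternative is a parameterized query. The sketch below uses the standard library's sqlite3 and its ? placeholder so it runs without a database server; psycopg2 follows the same idea with %s placeholders. The sample rows are made up for the illustration.

```python
import sqlite3

# Made-up sample rows; the second name contains a single quote on purpose
records = [["Keeyong", "M"], ["O'Neil", "F"]]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE name_gender (name TEXT, gender TEXT);")

# The driver binds each value itself, so the apostrophe in O'Neil is harmless
conn.executemany("INSERT INTO name_gender VALUES (?, ?);", records)
conn.commit()

loaded = conn.execute("SELECT name FROM name_gender ORDER BY name;").fetchall()
```

With string formatting, the second row would have produced invalid SQL; with placeholders, both rows load unchanged.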

Improvement 1: Using PostgresHook, XCom, and Variable

This version manages the connection with PostgresHook and passes data between tasks with xcom_pull.

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.models import Variable
from airflow.providers.postgres.hooks.postgres import PostgresHook
 
from datetime import datetime
from datetime import timedelta
from plugins import slack
 
import requests
import logging
import psycopg2
 
 
 
def get_Redshift_connection(autocommit=True):
    hook = PostgresHook(postgres_conn_id='redshift_dev_db')
    conn = hook.get_conn()
    conn.autocommit = autocommit
    return conn.cursor()
 
 
def extract(**context):
    link = context["params"]["url"]
    task_instance = context['task_instance']
    execution_date = context['execution_date']
 
    logging.info(execution_date)
    f = requests.get(link)
    return (f.text)
 
 
def transform(**context):
    logging.info("Transform started")
    # xcom_pull reads the return value of the earlier extract task by its task id.
    # XCom values are stored in the Airflow metadata DB (Postgres in this setup), so large
    # data cannot be passed this way; pass a path to the data (e.g. on S3) instead.
    text = context["task_instance"].xcom_pull(key="return_value", task_ids="extract")
    if text is None:
        raise ValueError("No XCom value was returned by the extract task")
    lines = text.strip().split("\n")[1:]  # skip the first line (header)
    records = []
    for l in lines:
        (name, gender) = l.split(",")  # l = "Keeyong,M" -> ['Keeyong', 'M']
        records.append([name, gender])
    logging.info("Transform ended")
    return records
 
 
def load(**context):
    logging.info("load started")    
    schema = context["params"]["schema"]
    table = context["params"]["table"]
    
    records = context["task_instance"].xcom_pull(key="return_value", task_ids="transform")    
    """
    records = [
      [ "Keeyong", "M" ],
      [ "Claire", "F" ],
      ...
    ]
    """
    # Wrapping the SQL in BEGIN and END (COMMIT) makes it a single transaction
    cur = get_Redshift_connection()
    try:
        cur.execute("BEGIN;")
        # Run DELETE FROM first -> this makes the load a FULL REFRESH
        cur.execute(f"DELETE FROM {schema}.{table};")
        for r in records:
            name = r[0]
            gender = r[1]
            print(name, "-", gender)
            sql = f"INSERT INTO {schema}.{table} VALUES ('{name}', '{gender}')"
            cur.execute(sql)
        cur.execute("COMMIT;")   # cur.execute("END;")
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error Msg", error)
        cur.execute("ROLLBACK;")
        raise
    logging.info("load done")
 
 
dag = DAG(
    dag_id = 'name_gender_v4',
    start_date = datetime(2023,4,6),  # the DAG will not run if this date is in the future
    schedule = '0 2 * * *',  # adjust as needed
    max_active_runs = 1,
    catchup = False,
    default_args = {
        'retries': 1,
        'retry_delay': timedelta(minutes=3),
        'on_failure_callback': slack.on_failure_callback,
    }
)
 
 
extract = PythonOperator(
    task_id = 'extract',
    python_callable = extract,
    params = {
        'url':  Variable.get("csv_url")
    },
    dag = dag)
 
transform = PythonOperator(
    task_id = 'transform',
    python_callable = transform,
    params = { 
    },  
    dag = dag)
 
load = PythonOperator(
    task_id = 'load',
    python_callable = load,
    params = {
        'schema': 'keeyong',   ## change to your own schema
        'table': 'name_gender'
    },
    dag = dag)
 
extract >> transform >> load
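
The comment in transform notes that XCom is not suited to large payloads and that a path should be passed instead. The sketch below illustrates that idea without Airflow: a local temporary directory stands in for S3, and extract_to_file and transform_from_file are hypothetical helper names chosen for this example.

```python
import csv
import os
import tempfile


def extract_to_file(text, directory):
    """Write the raw payload to a file and return only its path."""
    path = os.path.join(directory, "name_gender.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return path  # a short string like this is what would travel through XCom


def transform_from_file(path):
    """Read the payload back from the path handed over by the previous step."""
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [row for row in reader if row]


with tempfile.TemporaryDirectory() as tmp:
    path = extract_to_file("name,gender\nKeeyong,M\nClaire,F\n", tmp)
    records = transform_from_file(path)
```

Only the path crosses the task boundary, so the size of the value stored in the metadata DB stays small no matter how large the file is.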

Improvement 2: Combining the Task Decorator with PostgresHook

from airflow import DAG
from airflow.models import Variable
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.decorators import task
 
from datetime import datetime
from datetime import timedelta
 
import requests
import logging
 
 
def get_Redshift_connection(autocommit=True):
    hook = PostgresHook(postgres_conn_id='redshift_dev_db') # Airflow Connection ID
    conn = hook.get_conn()
    conn.autocommit = autocommit
    return conn.cursor()
 
 
@task
def extract(url):
    logging.info(datetime.utcnow())
    f = requests.get(url)
    return f.text
 
 
@task
def transform(text):
    lines = text.strip().split("\n")[1:]  # skip the first line (header)
    records = []
    for l in lines:
        (name, gender) = l.split(",")  # l = "Keeyong,M" -> ['Keeyong', 'M']
        records.append([name, gender])
    logging.info("Transform ended")
    return records
 
 
@task
def load(schema, table, records):
    logging.info("load started")    
    cur = get_Redshift_connection()   
    """
    records = [
      [ "Keeyong", "M" ],
      [ "Claire", "F" ],
      ...
    ]
    """
    # Wrapping the SQL in BEGIN and END (COMMIT) makes it a single transaction
    try:
        cur.execute("BEGIN;")
        # Run DELETE FROM first -> this makes the load a FULL REFRESH
        cur.execute(f"DELETE FROM {schema}.{table};")
        for r in records:
            name = r[0]
            gender = r[1]
            print(name, "-", gender)
            sql = f"INSERT INTO {schema}.{table} VALUES ('{name}', '{gender}')"
            cur.execute(sql)
        cur.execute("COMMIT;")   # cur.execute("END;")
    except Exception as error:  # psycopg2 is not imported in this file
        print(error)
        cur.execute("ROLLBACK;")
        raise  # re-raise so Airflow marks the task as failed
    logging.info("load done")
 
 
with DAG(
    dag_id='namegender_v5',
    start_date=datetime(2022, 10, 6),  # the DAG will not run if this date is in the future
    schedule='0 2 * * *',  # adjust as needed
    max_active_runs=1,
    catchup=False,
    default_args={
        'retries': 1,
        'retry_delay': timedelta(minutes=3),
        # 'on_failure_callback': slack.on_failure_callback,
    }
) as dag:
 
    url = Variable.get("csv_url")
    schema = 'keeyong'   ## change to your own schema
    table = 'name_gender'
 
    lines = transform(extract(url))
    load(schema, table, lines)

Yahoo Finance API DAG

from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from datetime import datetime
from pandas import Timestamp
 
import yfinance as yf
import pandas as pd
import logging
 
 
def get_Redshift_connection(autocommit=True):
    hook = PostgresHook(postgres_conn_id='redshift_dev_db')
    conn = hook.get_conn()
    conn.autocommit = autocommit
    return conn.cursor()
 
 
@task
def get_historical_prices(symbol):
    ticker = yf.Ticker(symbol)
    data = ticker.history()
    records = []
 
    for index, row in data.iterrows():
        date = index.strftime('%Y-%m-%d %H:%M:%S')
 
        records.append([date, row["Open"], row["High"], row["Low"], row["Close"], row["Volume"]])
 
    return records
 
@task
def load(schema, table, records):
    logging.info("load started")
    cur = get_Redshift_connection()
    try:
        cur.execute("BEGIN;")
        cur.execute(f"DROP TABLE IF EXISTS {schema}.{table};")
        cur.execute(f"""
CREATE TABLE {schema}.{table} (
    date date,
    "open" float,
    high float,
    low float,
    close float,
    volume bigint
);""")
        # Dropping and recreating the table above makes this load a FULL REFRESH
        for r in records:
            sql = f"INSERT INTO {schema}.{table} VALUES ('{r[0]}', {r[1]}, {r[2]}, {r[3]}, {r[4]}, {r[5]});"
            print(sql)
            cur.execute(sql)
        cur.execute("COMMIT;")   # cur.execute("END;")
    except Exception as error:
        print(error)
        cur.execute("ROLLBACK;")
        raise
 
    logging.info("load done")
 
 
with DAG(
    dag_id = 'UpdateSymbol',
    start_date = datetime(2023,5,30),
    catchup=False,
    tags=['API'],
    schedule = '0 10 * * *'
) as dag:
 
    results = get_historical_prices("AAPL")
    load("keeyong", "stock_info", results)

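How to install yfinance in the Docker environment: add it to the pip additional requirements setting (_PIP_ADDITIONAL_REQUIREMENTS) in the docker-compose YAML file. The YAML uses the :- syntax there, which supplies a default value when the environment variable is not set.
When installing the yfinance module with pip3 inside Docker, it must also be installed on the worker node.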

Connections and Variables

Connections and Variables are managed in the Airflow UI.

Key airflow.cfg Settings

 
1. Where is the DAGs folder configured?
    [core]
    # The folder where your airflow pipelines live, most likely a
    # subfolder in a code repository. This path must be absolute.
    dags_folder = /opt/airflow/dags
 
2. After a new DAG is added to the DAGs folder, when does Airflow actually pick it up? What is the name of the key that sets this scan interval?
    
    # How often (in seconds) to scan the DAGs directory for new files. Default to 5 minutes.
    dag_dir_list_interval = 300
 
3. Which section of this file needs to change in order to control Airflow from outside through its API?
 
    https://airflow.apache.org/docs/apache-airflow/stable/security/api.html
 
    => The [api] section.
    enable_experimental_api = False -> True
    # Comma separated list of auth backends to authenticate users of the API. See
    # https://airflow.apache.org/docs/apache-airflow/stable/security/api.html for possible values.
    # ("airflow.api.auth.backend.default" allows all requests for historic reasons)
    auth_backends = airflow.api.auth.backend.session -> airflow.api.auth.backend.default
        - 1) Open the airflow.cfg file and set the auth_backends option to airflow.api.auth.backend.default.
        - 2) Also set the enable_experimental_api option to True. Save the file and restart the Airflow webserver.
 
4. For a Variable's value to be encrypted (masked in the UI), its name must contain certain words. What are those words?
    https://medium.com/@sukul.teradata/apache-airflow-variables-summary-2281fdf18846
    => password, secret, passwd, authorization, api_key, apikey, access_token
 
5. After this configuration file is modified, what must be done for the changes to take effect?
    => Restart the scheduler node and the webserver node.
 
6. Which key is used to encrypt the contents of the Metadata DB?
    # Secret key to save connection passwords in the db
    fernet_key =
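
The keyword rule from question 4 can be pictured with a small check. This is only a simplified illustration built from the keyword list quoted above, not Airflow's actual implementation; looks_sensitive is a hypothetical helper, and the case-insensitive substring match is an assumption.

```python
# Keywords taken from the list quoted in question 4
SENSITIVE_KEYWORDS = (
    "password", "secret", "passwd", "authorization",
    "api_key", "apikey", "access_token",
)


def looks_sensitive(variable_name):
    """Return True if the variable name contains one of the keywords."""
    name = variable_name.lower()
    return any(keyword in name for keyword in SENSITIVE_KEYWORDS)
```

Under this rule a Variable named open_weather_api_key would be treated as sensitive, while csv_url (the Variable used by the DAGs above) would not.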

Writing a DAG That Uses the World Countries API

 
from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from datetime import datetime
from pandas import Timestamp
 
import pandas as pd
import logging
import requests
import json
 
 
 
def get_Redshift_connection(autocommit=True):
    hook = PostgresHook(postgres_conn_id='redshift_dev_db')
    conn = hook.get_conn()
    conn.autocommit = autocommit
    return conn.cursor()
 
@task
def get_countries_info(url):
    response = requests.get(url)
    rest_countries_list = json.loads(response.text)
    records = []
    
    for country in rest_countries_list:
        print(country['name']['official'], country['population'], country['area'] ,end = "\n")
        records.append([country['name']['official'], country['population'], country['area']])
    return records
 
 
def _create_table(cur, schema, table, drop_first):
    if drop_first:
        cur.execute(f"DROP TABLE IF EXISTS {schema}.{table};")
    cur.execute(f"""
        CREATE TABLE IF NOT EXISTS {schema}.{table} (
            country varchar(255),
            population varchar(255),
            area varchar(255)
        );""")
 
 
@task
def load(schema, table, records):
    logging.info("load started")
    cur = get_Redshift_connection()
    try:
        cur.execute("BEGIN;")
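        # Create the target table if it does not exist; needed the first time the table is created
        _create_table(cur, schema, table, False)
        # Copy the target table into a temporary table
        cur.execute(f"CREATE TEMP TABLE t AS SELECT * FROM {schema}.{table};")
        for r in records:
            # Parameterized query: the driver quotes the values, so names containing ' are safe
            sql = "INSERT INTO t VALUES (%s, %s, %s);"
            print(sql, r)
            cur.execute(sql, (str(r[0]), str(r[1]), str(r[2])))
 
        # Drop and recreate the target table
        _create_table(cur, schema, table, True)
        # Copy the temporary table's contents into the target table
        cur.execute(f"INSERT INTO {schema}.{table} SELECT DISTINCT * FROM t;")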
        cur.execute("COMMIT;")   # cur.execute("END;")
    except Exception as error:
        print(error)
        cur.execute("ROLLBACK;") 
        raise
    logging.info("load done")
 
 
with DAG(
    dag_id = 'CountryInfo_v1',
    start_date = datetime(2023,5,30),
    catchup=False,
    tags=['API'],
    schedule = '30 6 * * 6'  # runs every Saturday at 06:30 UTC
) as dag:
 
    source_url = 'https://restcountries.com/v3.1/all'
    results = get_countries_info(source_url)
    load("lsyeup1206", "country_info", results)
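
The load task above avoids duplicates by staging rows in a temporary table and copying back only distinct rows. The sketch below replays that pattern with the standard library's sqlite3 so it can be run locally. It is an approximation under stated assumptions: the function name and sample row are made up, the column types are simplified, and it empties the target table with DELETE where the DAG drops and recreates it.

```python
import sqlite3


def load_without_duplicates(conn, records):
    """Stage rows in a temp table, then rebuild the target from distinct rows."""
    cur = conn.cursor()
    try:
        cur.execute("BEGIN;")
        cur.execute(
            "CREATE TABLE IF NOT EXISTS country_info "
            "(country TEXT, population INTEGER, area REAL);"
        )
        # Copy the current contents into a temporary table, then add the new rows
        cur.execute("CREATE TEMP TABLE t AS SELECT * FROM country_info;")
        cur.executemany("INSERT INTO t VALUES (?, ?, ?);", records)
        # Empty the target and refill it with de-duplicated rows
        cur.execute("DELETE FROM country_info;")
        cur.execute("INSERT INTO country_info SELECT DISTINCT * FROM t;")
        cur.execute("DROP TABLE t;")
        cur.execute("COMMIT;")
    except Exception:
        cur.execute("ROLLBACK;")
        raise


conn = sqlite3.connect(":memory:", isolation_level=None)
sample = [("Republic of Korea", 51780579, 100210.0)]  # illustrative values only
load_without_duplicates(conn, sample)
load_without_duplicates(conn, sample)  # a second run must not add a duplicate
count = conn.execute("SELECT COUNT(*) FROM country_info;").fetchone()[0]
```

Running the load twice with the same input leaves a single row, which is the property that makes the weekly schedule safe to re-run.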

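Wrapping Up

Moving from the PythonOperator style to the Task Decorator style makes the code more concise. XCom can pass data between tasks, but for large data the usual approach is to pass a path (for example on S3) instead. Understanding the key settings in airflow.cfg makes it easier to manage Airflow effectively in a production environment.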
