Apache Airflow is an orchestration platform that enables the development, scheduling, and monitoring of workflows. At Shopify, we've been running Airflow in production for over two years for a variety of workflows, including data extractions, machine learning model training, Apache Iceberg table maintenance, and DBT-powered data modeling. At the time of writing, we are running Airflow 2.2 on Kubernetes, using the Celery executor and MySQL 8.

Shopify's usage of Airflow has scaled dramatically over the past two years. In our largest environment, we run over 10,000 DAGs representing a large variety of workloads. This environment averages over 400 tasks running at any given moment and over 150,000 runs executed per day. As adoption increases within Shopify, the load on our Airflow deployments will only grow. As a result of this rapid growth, we have encountered a few challenges, including slow file access, insufficient control over DAG (directed acyclic graph) capabilities, irregular levels of traffic, and resource contention between workloads.

Below we'll share some of the lessons we learned and the solutions we built in order to run Airflow at scale.

File Access Can Be Slow When Using Cloud Storage

Fast file access is critical to the performance and integrity of an Airflow environment. A well-defined strategy for file access ensures that the scheduler can process DAG files quickly and keep your jobs up to date.

Airflow keeps its internal representation of its workflows up to date by repeatedly scanning and reparsing all the files in the configured DAG directory. These files must be scanned often in order to maintain consistency between the on-disk source of truth for each workload and its in-database representation.
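To make that parsing loop concrete, here's a minimal sketch of the kind of file that lives in the DAG directory (the DAG id, schedule, and task below are hypothetical, purely for illustration). Every scan re-executes the module from top to bottom, so any expensive top-level work, such as calls out to cloud storage, directly slows the scheduler's scan cycle:

```python
# dags/example_dag.py -- an illustrative (hypothetical) DAG file. The
# scheduler re-executes this module on every scan of the DAG directory.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Anything at module top level runs on *every* parse, so keep it cheap:
# avoid network calls, heavy imports, or reads from remote storage here.
with DAG(
    dag_id="example_dag",            # hypothetical name, for illustration
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",      # Airflow 2.2-era parameter name
    catchup=False,
) as dag:
    BashOperator(
        task_id="say_hello",
        bash_command="echo hello",
    )
```

How aggressively Airflow rescans is tunable: the [scheduler] min_file_process_interval and dag_dir_list_interval settings in airflow.cfg control how often individual files are reparsed and how often the DAG directory itself is re-listed. Lowering them keeps jobs fresher at the cost of extra parsing load.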