``E`` module ("expressions")
================================

The expressions module in ``etl_toolkit`` is imported as ``E``.
It provides functions that generate common but complex pyspark columns for use in data transformations.
Treat these functions as core building blocks that simplify code for case statements, conditional logic, regex parsing, datetime parsing, and more.

.. note::
    ETL toolkit expressions are functions that return a single pyspark ``Column``.
    Use these when selecting or filtering columns on a pyspark ``DataFrame``.

.. code-block:: python
    :linenos:
    :caption: Using expressions from the etl_toolkit

    from etl_toolkit import E, F

    df = (
        spark.table("yd_production.etl_toolkit_static.states")
        .where(E.any(
            F.col("state_name").like("New%"),
            F.col("region") == "Northeast",
        ))
    )

--------------------------
Aggregate Expressions
--------------------------

These expressions can be used to simplify writing complex aggregation expressions.

.. caution::
    Aggregate expressions must be passed into a ``.agg`` method call when using ``DataFrame.groupBy``.
    Use these functions when performing a group-by transformation.

.. autoapifunction:: etl_toolkit.E.sum_if

.. autoapifunction:: etl_toolkit.E.avg_if

--------------------------
Boolean Expressions
--------------------------

These expressions can be used to simplify writing complex and/or conditions.

.. tip::
    It is highly recommended to use these boolean expressions where applicable to make code for complex filters, joins, and enrichment columns as easy to read as possible.
    These functions also make it easier to add new logic because you don't have to worry about the parentheses and order of operations that come with built-in pyspark boolean operators.

.. autoapifunction:: etl_toolkit.E.any

.. autoapifunction:: etl_toolkit.E.all

.. autoapifunction:: etl_toolkit.E.between

--------------------------
Calculation Expressions
--------------------------

These expressions can be used to simplify writing complex numerical calculations like growth rates.

.. autoapifunction:: etl_toolkit.E.growth_rate_by_lag

--------------------------
ID Expressions
--------------------------

These expressions are used to generate primary keys or ID fields when developing datasets.

.. autoapifunction:: etl_toolkit.E.uuid5

--------------------------
Mapping Expressions
--------------------------

These expressions can be used to simplify writing complex case statement expressions.

.. tip::
    It is highly recommended to use these mapping expressions where applicable to make enrichment columns as easy to read as possible.
    These functions also make it easier to dynamically build mapping columns where a long list of conditions needs to be considered or complex boolean conditions need to be met.
    These conditions work well with the boolean expressions found in the ETL toolkit.

.. autofunction:: etl_toolkit.E.chain_assigns

.. autoapifunction:: etl_toolkit.E.chain_cases

--------------------------
Normalization Expressions
--------------------------

These expressions are used to standardize the format of various column types.
This is a common cleaning technique that makes columns easier to analyze by removing edge cases related to formatting inconsistencies.

.. autoapifunction:: etl_toolkit.E.normalize_text

.. autoapifunction:: etl_toolkit.E.try_cast

--------------------------
Regex Expressions
--------------------------

These expressions can be used to simplify writing complex regex logic.

.. autoapifunction:: etl_toolkit.E.rlike_any

.. autoapifunction:: etl_toolkit.E.rlike_all

--------------------------
Time Expressions
--------------------------

These expressions can be used to simplify writing complex datetime logic.

.. autoapifunction:: etl_toolkit.E.normalize_date

.. autoapifunction:: etl_toolkit.E.normalize_timestamp

.. autoapifunction:: etl_toolkit.E.parse_all_dates

.. autoapifunction:: etl_toolkit.E.parse_date_range

.. autoapifunction:: etl_toolkit.E.date_trunc

.. autoapifunction:: etl_toolkit.E.date_end

.. autoapifunction:: etl_toolkit.E.next_complete_period

.. autoapifunction:: etl_toolkit.E.quarter_label
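
As a rough illustration of the period arithmetic that time expressions like these typically wrap, the sketch below computes a quarter start, a quarter end, and a quarter label using only the Python standard library. It is not the ``etl_toolkit`` implementation and does not use pyspark: the helper names ``quarter_start``, ``quarter_end``, and ``label_quarter`` are hypothetical, and the ``2024Q2`` label format is an assumption made for this example only.

```python
from datetime import date, timedelta


def quarter_start(d: date) -> date:
    """Truncate a date to the first day of its calendar quarter (hypothetical helper)."""
    first_month = 3 * ((d.month - 1) // 3) + 1
    return date(d.year, first_month, 1)


def quarter_end(d: date) -> date:
    """Return the last day of the calendar quarter containing d (hypothetical helper)."""
    start = quarter_start(d)
    # Find the first day of the following quarter, then step back one day.
    if start.month == 10:
        next_start = date(start.year + 1, 1, 1)
    else:
        next_start = date(start.year, start.month + 3, 1)
    return next_start - timedelta(days=1)


def label_quarter(d: date) -> str:
    """Build a label such as '2024Q2'; this format is assumed, not the toolkit's."""
    return f"{d.year}Q{(d.month - 1) // 3 + 1}"


if __name__ == "__main__":
    sample = date(2024, 5, 17)
    print(quarter_start(sample), quarter_end(sample), label_quarter(sample))
```

Refer to the generated function documentation above for the actual signatures and return types of the toolkit's time expressions, which operate on pyspark columns and not on Python ``date`` objects.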