``A`` module ("analyses")
================================

The analyses module in the ``etl_toolkit`` is imported as ``A``. This module contains functions that accept a dataframe, apply complex transformations, and return new dataframes. Use these functions as the larger or routine components of pipelines when possible. The transformations include deduping, complex filtering, enrichment, and aggregations.

.. tip::
   It is highly recommended to use these functions to cut down on large pieces of repetitive logic. Steps like parsing, deduplication, and generating KPIs can be greatly simplified using these functions, and they make the intention of a transformation clearer to a teammate or reviewer.

.. tip::
   Many analysis functions that perform filtering or aggregation have a **QA mode**: when enabled, the output dataframe includes additional columns and/or records that make it easier to investigate the transformation or the underlying data. Functions that have a QA mode include a ``qa`` boolean argument. See each function's documentation for details on whether it supports QA mode.

--------------------------
Calculation Analyses
--------------------------

Suite of functions that perform common calculations to enrich existing dataframes, including lagged values or percent of total. It is recommended to use these functions rather than implementing the calculations in native pyspark, as they apply special techniques to be performant and are easier to read.

.. autoapifunction:: etl_toolkit.A.add_lag_columns

.. autoapifunction:: etl_toolkit.A.add_percent_of_total_columns

--------------------------
Card Analyses
--------------------------

Suite of functions that perform common adjustments on dataframes derived from card (Skywalker, Yoda, etc.) datasets, including lag adjustments, weighting, and paneling.
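The kind of lag adjustment referred to here can be sketched in plain Python. This is a conceptual illustration only, not the toolkit API: the spend series, column meanings, and the 7-day comparison window are all assumptions.

```python
from datetime import date, timedelta

# Hypothetical daily spend series; in the toolkit these adjustments
# operate on pyspark dataframes, but a dict keeps the sketch small.
start = date(2024, 1, 1)  # a Monday
spend = {}
for i in range(14):
    day = start + timedelta(days=i)
    spend[day] = 110.0 if day.weekday() >= 5 else 100.0  # weekend bump

def day_of_week_lag_ratio(series, day, lag_days=7):
    """Compare a day's value to the same weekday `lag_days` earlier,
    so weekday seasonality cancels out of the growth reading."""
    prior = day - timedelta(days=lag_days)
    if prior not in series:
        return None
    return series[day] / series[prior]

# Saturday vs. prior Saturday: the weekend bump cancels, ratio is 1.0
ratio = day_of_week_lag_ratio(spend, date(2024, 1, 13))
```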
It is recommended to use these functions rather than implementing the logic manually, as they are common enhancements that normalize the data trends typically found in these datasets.

.. autoapifunction:: etl_toolkit.A.add_card_date_adjustment

.. autoapifunction:: etl_toolkit.A.add_card_day_of_week_lag_adjustment

.. autoapifunction:: etl_toolkit.A.add_card_paneling

.. autoapifunction:: etl_toolkit.A.add_card_paneling_reweighted

.. autoapifunction:: etl_toolkit.A.get_card_panels

.. autoapifunction:: etl_toolkit.A.source_card_transactions

.. autoapifunction:: etl_toolkit.A.card_coverage_metrics

.. autoapifunction:: etl_toolkit.A.coverage_metric

--------------------------
E-Receipt Analyses
--------------------------

.. autoapifunction:: etl_toolkit.A.source_ereceipts

--------------------------
Dedupe Analyses
--------------------------

Suite of functions to handle deduplication operations on dataframes. The returned output is a dataframe with unique rows based on some condition. These functions can be easier to read and write than handling deduplication in native pyspark.

.. autoapifunction:: etl_toolkit.analyses.dedupe.dedupe_by_condition

.. autoapifunction:: etl_toolkit.analyses.dedupe.dedupe_by_row_number

--------------------------
Index Analyses
--------------------------

Suite of functions to generate index metrics that can reveal the trajectory of a company KPI without directly benchmarking to it. These index columns can be tricky to calculate in native pyspark, so these functions help standardize how they are calculated.

.. autoapifunction:: etl_toolkit.A.index_from_rolling_panel

--------------------------
Investor Standard Analyses
--------------------------

Suite of functions to generate common analyses used for investor research. It is recommended to use these functions, as they follow specific standards for research reports and integrate with standard visualizations for charts to simplify publishing workflows.
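As a conceptual illustration of a mix analysis (not the toolkit API, whose actual signatures are in the entries for this section), a revenue mix is each bucket's share of total revenue for a period. The bucket labels and amounts below are assumptions.

```python
# Hypothetical revenue by income bucket for one period.
revenue_by_bucket = {"<50k": 20.0, "50-100k": 30.0, "100k+": 50.0}

def revenue_mix(revenue):
    """Each bucket's fractional share of total revenue."""
    total = sum(revenue.values())
    return {bucket: amount / total for bucket, amount in revenue.items()}

mix = revenue_mix(revenue_by_bucket)
# The shares sum to 1.0, e.g. the "100k+" bucket carries half of revenue.
```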
.. autoapifunction:: etl_toolkit.A.revenue_mix_by_income_bucket

----------------------------
Investor Reporting Analyses
----------------------------

Suite of functions specific to the investor reporting process. It is recommended to use these functions, as they follow specific standards for customer-facing assets and reduce time spent on otherwise manual processes.

.. autoapifunction:: etl_toolkit.A.earnings_results_from_gsheet

.. autoapifunction:: etl_toolkit.A.earnings_rankings_from_gsheet

.. autoapifunction:: etl_toolkit.A.earnings_results_with_backtests

.. autofunction:: etl_toolkit.A.backtest_configuration

.. autofunction:: etl_toolkit.A.add_unified_consensus_column

--------------------------
Ordering Analyses
--------------------------

Suite of functions to handle re-arranging dataframe columns and rows for common scenarios. These are helpful to use as they are optimized for performance and/or avoid duplicating columns.

.. autoapifunction:: etl_toolkit.A.shift_columns

--------------------------
Parser Analyses
--------------------------

Suite of functions to filter and extract important values from string columns to enrich dataframes. These are powerful functions for simplifying and organizing the complex regex or conditional logic that is usually the critical first step of product pipelines to filter out unrelated data.

.. autoapifunction:: etl_toolkit.A.parse_records

.. autofunction:: etl_toolkit.A.parser

--------------------------
Scalar Analyses
--------------------------

Suite of functions to generate scalars (i.e. python literal values) for common pyspark operations. These can be useful to retrieve values into python from dataframes and re-use them in pipeline code.

.. note::
   Unlike other analyses functions in the toolkit, these scalar functions return python literal values (ex: int, float, etc.) instead of dataframes.
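The idea can be sketched in plain Python: collapse a column of tabular data to a single python literal so it can drive later pipeline logic (in pyspark this is typically an ``agg`` followed by a ``collect``). The rows and helper below are illustrative assumptions, not the toolkit's signature.

```python
# Hypothetical rows; in practice this would be a pyspark dataframe.
rows = [
    {"date": "2024-01-01", "value": 10},
    {"date": "2024-01-02", "value": 25},
    {"date": "2024-01-03", "value": 15},
]

def get_scalar(rows, column, agg):
    """Collapse one column of the rows to a single python literal."""
    return agg(row[column] for row in rows)

latest_date = get_scalar(rows, "date", max)   # a str, not a dataframe
total_value = get_scalar(rows, "value", sum)  # an int, not a dataframe
```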
.. autoapifunction:: etl_toolkit.A.get_aggregates

--------------------------
Time Analyses
--------------------------

Suite of functions to manage date and timestamp operations. Use these functions for filling date ranges and grossing up data to higher periodicities.

.. autoapifunction:: etl_toolkit.A.fill_periods

.. autoapifunction:: etl_toolkit.A.periods

--------------------------
Comparison Analyses
--------------------------

Suite of functions to compare DataFrames by schema or row content. These are useful for data validation, testing pipeline outputs, and ensuring data quality. The functions help identify differences between two DataFrames either at the schema level or by comparing individual rows.

.. autoapifunction:: etl_toolkit.A.get_schema_comparison

.. autoapifunction:: etl_toolkit.A.get_rows_comparison

----------------------------------
Investor Standard Metrics Analyses
----------------------------------

Suite of functions specific to the standard metrics experience. These are used to generate unified KPI analyses and downstream dashboard and feed tables for a metric.

.. autoapifunction:: etl_toolkit.A.standard_metric_unified_kpi

.. autoapifunction:: etl_toolkit.A.standard_metric_unified_kpi_derived

.. autoapifunction:: etl_toolkit.A.standard_metric_data_download

.. autoapifunction:: etl_toolkit.A.standard_metric_live_feed

.. autoapifunction:: etl_toolkit.A.standard_metric_feed

.. autoapifunction:: etl_toolkit.A.standard_metric_daily_growth

.. autoapifunction:: etl_toolkit.A.standard_metric_quarter_month_pivot

.. autoapifunction:: etl_toolkit.A.standard_metric_trailing_day_pivot

.. autoapifunction:: etl_toolkit.A.standard_metric_net_adds

.. autoapifunction:: etl_toolkit.A.standard_metric_weekly_qtd_progress

.. autoapifunction:: etl_toolkit.A.standard_metric_monthly_qtd_progress

.. autoapifunction:: etl_toolkit.A.standard_metric_daily_qtd_progress

.. autoapifunction:: etl_toolkit.A.standard_metric_half_year_progress
.. autoapifunction:: etl_toolkit.A.standard_metric_ui_metadata

--------------------------
Calendar Analyses
--------------------------

Suite of functions and classes to generate and manage calendar data with support for custom fiscal periods, holidays, and business day calculations. These calendar utilities standardize period definitions and holiday handling across an organization.

.. autoapifunction:: etl_toolkit.A.calendar

.. autoapifunction:: etl_toolkit.A.Holiday

.. autoapifunction:: etl_toolkit.A.Period
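As a plain-Python sketch of the kind of business-day logic these utilities standardize (not the ``calendar``, ``Holiday``, or ``Period`` APIs themselves; the holiday set below is an assumption rather than an organizational definition):

```python
from datetime import date, timedelta

# Hypothetical holiday set; real pipelines would source these from the
# organization's holiday definitions rather than hard-coding them.
holidays = {date(2024, 1, 1)}  # New Year's Day

def business_days(start, end, holidays=frozenset()):
    """All weekdays in [start, end] that are not holidays."""
    day, out = start, []
    while day <= end:
        if day.weekday() < 5 and day not in holidays:
            out.append(day)
        day += timedelta(days=1)
    return out

days = business_days(date(2024, 1, 1), date(2024, 1, 7), holidays)
# Jan 1 is a holiday and Jan 6-7 fall on a weekend, leaving Jan 2-5.
```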