Metaflow

Overview

Metaflow is a framework, created by Netflix, for building and running ML workflows.

This integration lets users apply decorators to Metaflow steps and flows to automatically log parameters and artifacts to W&B.

Decorating a step will enable or disable logging for certain types within that step.
Decorating the flow will enable or disable logging for every step in the flow.

Quickstart

Install W&B and login

Notebook:

!pip install -Uqqq metaflow fastcore wandb

import wandb
wandb.login()

Command Line:

pip install -Uqqq metaflow fastcore wandb
wandb login

Decorate your flows and steps

You can decorate individual steps, the whole flow, or both.

Step

Decorating a step enables or disables logging for certain types within that step.

In this example, all datasets and models in start are logged:

import pandas as pd
import torch
import wandb
from metaflow import FlowSpec, step
from wandb.integration.metaflow import wandb_log

class WandbExampleFlow(FlowSpec):
    @wandb_log(datasets=True, models=True, settings=wandb.Settings(...))
    @step
    def start(self):
        self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
        self.model_file = torch.load(...)  # nn.Module    -> upload as model
        self.next(self.transform)

Flow

Decorating a flow is equivalent to decorating all of its constituent steps with a default.

In this case, all steps in WandbExampleFlow log datasets and models by default, the same as decorating each step with @wandb_log(datasets=True, models=True):

import pandas as pd
import torch
from metaflow import FlowSpec, step
from wandb.integration.metaflow import wandb_log

@wandb_log(datasets=True, models=True)  # decorate all @step
class WandbExampleFlow(FlowSpec):
    @step
    def start(self):
        self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
        self.model_file = torch.load(...)  # nn.Module    -> upload as model
        self.next(self.transform)

Flow and Steps

Decorating the flow is equivalent to decorating all steps with a default. If you also decorate a step with its own @wandb_log, that step-level decoration overrides the flow-level one.

In the example below, start and mid log datasets and models, but end does not log datasets or models.

import pandas as pd
import torch
from metaflow import FlowSpec, step
from wandb.integration.metaflow import wandb_log

@wandb_log(datasets=True, models=True)  # same as decorating start and mid
class WandbExampleFlow(FlowSpec):
    # this step will log datasets and models
    @step
    def start(self):
        self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
        self.model_file = torch.load(...)  # nn.Module    -> upload as model
        self.next(self.mid)

    # this step will also log datasets and models
    @step
    def mid(self):
        self.raw_df = pd.read_csv(...)     # pd.DataFrame -> upload as dataset
        self.model_file = torch.load(...)  # nn.Module    -> upload as model
        self.next(self.end)

    # this step is overridden and will NOT log datasets or models
    @wandb_log(datasets=False, models=False)
    @step
    def end(self):
        self.raw_df = pd.read_csv(...)
        self.model_file = torch.load(...)
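
The override rule can be illustrated with a toy decorator in plain Python. This is only a sketch of the behavior described here, not the integration's own code: toy_log, ToyFlow, and the log_flags attribute are hypothetical names, and the sketch assumes that a step-level decoration replaces the flow-level default entirely.

```python
import inspect


def toy_log(target=None, *, datasets=False, models=False):
    """Toy stand-in for a logging decorator; works on functions and classes."""

    def decorate(obj):
        if inspect.isclass(obj):
            # Flow level: apply the same flags to every method that was
            # not already decorated at the step level.
            for name, attr in list(vars(obj).items()):
                if inspect.isfunction(attr) and not hasattr(attr, "log_flags"):
                    setattr(obj, name, decorate(attr))
            return obj
        # Step level: record the flags on the function itself.
        obj.log_flags = {"datasets": datasets, "models": models}
        return obj

    return decorate if target is None else decorate(target)


@toy_log(datasets=True, models=True)
class ToyFlow:
    def start(self): ...

    def mid(self): ...

    @toy_log(datasets=False, models=False)  # step-level override
    def end(self): ...


flags = {name: getattr(ToyFlow, name).log_flags for name in ("start", "mid", "end")}
print(flags)
```

Here start and mid pick up the flow-level flags, while end keeps the flags from its own decorator.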

Where is my data? Can I access it programmatically?

You can access the information we've captured in three ways: inside the original Python process being logged, using the wandb client library; in the web app UI; or programmatically, using our Public API.

Parameters are saved to W&B's config and can be found in the Overview tab. Variables logged as datasets, models, and others are saved to W&B Artifacts and can be found in the Artifacts tab. Base Python types are saved to W&B's summary dict and can be found in the Overview tab. See our guide to the Public API for details on retrieving this information programmatically.

Here's a cheatsheet:

Data	Client library	UI
`Parameter(...)`	`wandb.config`	Overview tab, Config
`datasets`, `models`, `others`	`wandb.use_artifact("{var_name}:latest")`	Artifacts tab
Base Python types (`dict`, `list`, `str`, etc.)	`wandb.summary`	Overview tab, Summary
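
For access through the Public API, a minimal sketch might look like the following. The run path is a placeholder of the form "entity/project/run_id", and artifact_ref and fetch_run_data are hypothetical helper names introduced for illustration. Calling fetch_run_data requires wandb to be installed and an authenticated session.

```python
def artifact_ref(var_name, alias="latest"):
    """Build an artifact reference such as 'raw_df:latest'."""
    return f"{var_name}:{alias}"


def fetch_run_data(run_path):
    """Read config, summary, and logged artifact names for one run.

    run_path is a placeholder of the form "entity/project/run_id".
    """
    import wandb  # imported lazily so artifact_ref works without wandb installed

    run = wandb.Api().run(run_path)
    return {
        "config": run.config,    # Parameter(...) values
        "summary": run.summary,  # base Python types
        "artifacts": [a.name for a in run.logged_artifacts()],  # datasets, models, others
    }


print(artifact_ref("raw_df"))
```

The lazy import keeps the name-building helper usable on its own; only fetch_run_data touches the network.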

`wandb_log` kwargs

kwarg	Options
`datasets`	`True` (log instance variables that are a dataset) or `False`
`models`	`True` (log instance variables that are a model) or `False`
`others`	`True` (log anything else that is serializable as a pickle) or `False`
`settings`	`wandb.Settings(...)` (specify your own `wandb` settings for this step or flow) or `None` (equivalent to passing `wandb.Settings()`)

By default, if `settings.run_group` is `None`, it is set to `{flow_name}/{run_id}`, and if `settings.run_job_type` is `None`, it is set to `{run_job_type}/{step_name}`.
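
The run_group default can be sketched in plain Python as a "first non-None value wins" rule. This is an illustration only, not the integration's code: coalesce and default_run_group are hypothetical helper names, and the flow name and run id below are made-up values.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


def default_run_group(run_group, flow_name, run_id):
    """Keep an explicit run_group; otherwise fall back to '{flow_name}/{run_id}'."""
    return coalesce(run_group, f"{flow_name}/{run_id}")


print(default_run_group(None, "WandbExampleFlow", "1234"))
print(default_run_group("my-group", "WandbExampleFlow", "1234"))
```

An explicitly set group is left untouched; only a missing one is filled in from the flow name and run id.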

Frequently Asked Questions

What exactly do you log? Do you log all instance and local variables?

wandb_log only logs instance variables. Local variables are NEVER logged. This is useful to avoid logging unnecessary data.

Which data types get logged?

We currently support these types:

Logging Setting	Type
default (always on)	`dict`, `list`, `set`, `str`, `int`, `float`, `bool`
`datasets`	`pd.DataFrame`, `pathlib.Path`
`models`	`nn.Module`, `sklearn.base.BaseEstimator`
`others`	Anything that is pickle-able and JSON serializable

Examples of logging behavior

Kind of Variable	Behavior	Example	Data Type
Instance	Auto-logged	`self.accuracy`	`float`
Instance	Logged if `datasets=True`	`self.df`	`pd.DataFrame`
Instance	Not logged if `datasets=False`	`self.df`	`pd.DataFrame`
Local	Never logged	`accuracy`	`float`
Local	Never logged	`df`	`pd.DataFrame`
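
The rules in these tables can be mimicked with a small standard-library sketch. It is a toy model of the described behavior, not the integration's implementation: destination, ToyModel, and ToyStep are hypothetical names, pathlib.Path is the only dataset type used (pd.DataFrame is left out to avoid extra dependencies), and the order of the type checks is an assumption.

```python
import pickle
from pathlib import Path

BASE_TYPES = (dict, list, set, str, int, float, bool)
DATASET_TYPES = (Path,)  # pd.DataFrame would also belong here


class ToyModel:
    """Hypothetical stand-in for nn.Module or sklearn.base.BaseEstimator."""


def destination(value, datasets=False, models=False, others=False):
    """Return where a value would be logged under the rules above, or None."""
    if isinstance(value, DATASET_TYPES):
        return "artifact (dataset)" if datasets else None
    if isinstance(value, ToyModel):
        return "artifact (model)" if models else None
    if isinstance(value, BASE_TYPES):
        return "summary"  # always on
    if others:
        try:
            pickle.dumps(value)
        except Exception:
            return None
        return "artifact (other)"
    return None


class ToyStep:
    def run(self):
        self.accuracy = 0.9                # instance variable, base type
        self.data_path = Path("data.csv")  # instance variable, dataset type
        scratch = [1, 2, 3]                # local variable: never inspected
        return scratch


toy_step = ToyStep()
toy_step.run()
# Only instance variables are visible through vars(); locals are gone.
plan = {name: destination(v, datasets=True) for name, v in vars(toy_step).items()}
print(plan)
```

Because only vars(toy_step) is inspected, the local scratch list never appears in the plan, which mirrors the "Local: never logged" rows.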

Does this track artifact lineage?

Yes! If you have an artifact that is an output of step A and an input to step B, we automatically construct the lineage DAG for you.

For an example of this behavior, see this notebook and its corresponding W&B Artifacts page.
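
As a toy model of the idea, lineage edges can be derived by matching each artifact's producing step with the steps that consume it. The step and artifact names below are made up, and this sketch says nothing about how W&B stores lineage internally.

```python
from collections import defaultdict

# Hypothetical record of which step wrote and which step read each artifact.
writes = [("start", "raw_df"), ("transform", "clean_df")]
reads = [("transform", "raw_df"), ("train", "clean_df")]

# Map each artifact to the step that produced it.
producer = {artifact: step for step, artifact in writes}

# Build edges: producing step -> [(artifact, consuming step), ...]
edges = defaultdict(list)
for step, artifact in reads:
    edges[producer[artifact]].append((artifact, step))
lineage = dict(edges)

print(lineage)
```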
