tensorflowonspark.dfutil module

A collection of utility functions for loading/saving TensorFlow TFRecords files as Spark DataFrames.

fromTFExample(iter, binary_features=[])[source]

mapPartitions function to convert an RDD of serialized tf.train.Example bytestrings into an RDD of Rows.

Note: TensorFlow represents both strings and binary types as tf.train.BytesList, and we need to disambiguate these for the corresponding Spark DataFrame types (StringType and BinaryType), so we require a “hint” from the caller in the binary_features argument.

Args:
iter

the RDD partition iterator

binary_features

a list of tf.train.Example features which are expected to be binary/bytearrays.

Returns:

An array/iterator of DataFrame Rows with features converted into columns.
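The disambiguation described in the note above can be sketched in pure Python (illustrative only; the function and feature names here are hypothetical, not the library's internals):

```python
# Illustrative sketch: how a tf.train.BytesList value could be decoded
# using the binary_features hint. Not the library's actual implementation.
def decode_bytes_feature(name, raw, binary_features=[]):
    """Return a bytearray (Spark BinaryType) for hinted features,
    otherwise a decoded string (Spark StringType)."""
    if name in binary_features:
        return bytearray(raw)       # keep raw bytes
    return raw.decode("utf-8")      # treat as text

print(decode_bytes_feature("label", b"cat"))                  # a str
print(decode_bytes_feature("image", b"\x00\xff", ["image"]))  # a bytearray
```

Without the hint, a feature like "image" would be (incorrectly) decoded as UTF-8 text.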

infer_schema(example, binary_features=[])[source]

Given a tf.train.Example, infer the Spark DataFrame schema (StructFields).

Note: TensorFlow represents both strings and binary types as tf.train.BytesList, and we need to disambiguate these for the corresponding Spark DataFrame types (StringType and BinaryType), so we require a “hint” from the caller in the binary_features argument.

Args:
example

a tf.train.Example

binary_features

a list of tf.train.Example features which are expected to be binary/bytearrays.

Returns:

A DataFrame StructType schema
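A minimal sketch of the kind of mapping infer_schema performs (the feature-kind strings and Spark type names below are illustrative, not the library's exact internals):

```python
# Illustrative sketch: map a tf.train feature kind to a Spark SQL type name.
# int64_list and float_list are unambiguous; bytes_list needs the hint.
def spark_type_for(name, feature_kind, binary_features=[]):
    if feature_kind == "int64_list":
        return "LongType"
    if feature_kind == "float_list":
        return "FloatType"
    # bytes_list is ambiguous between text and raw bytes
    return "BinaryType" if name in binary_features else "StringType"

print(spark_type_for("id", "int64_list"))             # LongType
print(spark_type_for("image", "bytes_list", ["image"]))  # BinaryType
```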

isLoadedDF(df)[source]

Returns True if the input DataFrame was produced by the loadTFRecords() method.

This is primarily used by the Spark ML Pipelines APIs.

Args:
df

Spark DataFrame
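One plausible way such provenance tracking can work, sketched under the assumption that loadTFRecords() records each DataFrame it produces (illustrative; not necessarily the library's exact mechanism):

```python
# Sketch: a registry of DataFrames produced by loadTFRecords, so that
# isLoadedDF reduces to a membership test. Purely illustrative.
_loaded = {}

def _register(df, input_dir):
    _loaded[id(df)] = input_dir   # remember where this DataFrame came from

def isLoadedDF(df):
    return id(df) in _loaded

df = object()                     # stand-in for a real Spark DataFrame
_register(df, "/data/tfrecords")
print(isLoadedDF(df))             # True
print(isLoadedDF(object()))      # False
```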

loadTFRecords(sc, input_dir, binary_features=[])[source]

Load TFRecords from disk into a Spark DataFrame.

This will attempt to automatically convert the tf.train.Example features into Spark DataFrame columns of equivalent types.

Note: TensorFlow represents both strings and binary types as tf.train.BytesList, and we need to disambiguate these for the corresponding Spark DataFrame types (StringType and BinaryType), so we require a “hint” from the caller in the binary_features argument.

Args:
sc

SparkContext

input_dir

location of TFRecords on disk.

binary_features

a list of tf.train.Example features which are expected to be binary/bytearrays.

Returns:

A Spark DataFrame mirroring the tf.train.Example schema.
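A hedged usage sketch (the path and the "image_raw" feature name are hypothetical; this assumes a live SparkContext with TensorFlowOnSpark installed, so it is wrapped in a function rather than run directly):

```python
# Usage sketch, not executed here: load TFRecords into a DataFrame,
# hinting that the hypothetical "image_raw" feature holds raw bytes.
def load_images(sc, input_dir):
    from tensorflowonspark import dfutil
    # BytesList features named in binary_features become BinaryType columns;
    # all other BytesList features are decoded as StringType.
    return dfutil.loadTFRecords(sc, input_dir, binary_features=["image_raw"])
```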

saveAsTFRecords(df, output_dir)[source]

Save a Spark DataFrame as TFRecords.

This will convert the DataFrame rows to TFRecords prior to saving.

Args:
df

Spark DataFrame

output_dir

Path to save TFRecords
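A hedged round-trip sketch (illustrative; assumes a live SparkContext and a writable output_dir, so it is only defined, not run):

```python
# Usage sketch, not executed here: save a DataFrame as TFRecords, then
# read it back. Column values are coerced into tf.train.Example features
# on the way out, and inferred back into DataFrame columns on the way in.
def round_trip(sc, df, output_dir):
    from tensorflowonspark import dfutil
    dfutil.saveAsTFRecords(df, output_dir)
    return dfutil.loadTFRecords(sc, output_dir)
```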

toTFExample(dtypes)[source]

mapPartitions function to convert a Spark RDD of Rows into an RDD of serialized tf.train.Example bytestrings.

Note that tf.train.Example is a fairly flat structure with limited datatypes, e.g. tf.train.FloatList, tf.train.Int64List, and tf.train.BytesList, so most DataFrame types will be coerced into one of these types.

Args:
dtypes

the DataFrame.dtypes of the source DataFrame.

Returns:

A mapPartitions function which converts the source DataFrame into tf.train.Example bytestrings.
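The coercion described above can be sketched in pure Python. The dtype strings follow Spark's DataFrame.dtypes naming; the mapping itself is illustrative, not the library's exact table:

```python
# Illustrative sketch: coerce a Spark SQL dtype string into one of the
# three tf.train list types (Int64List, FloatList, BytesList).
def tf_list_kind(spark_dtype):
    if spark_dtype in ("tinyint", "smallint", "int", "bigint", "boolean"):
        return "int64_list"      # integral types widen to int64
    if spark_dtype in ("float", "double"):
        return "float_list"      # floating-point types become a FloatList
    return "bytes_list"          # strings, binary, and everything else

print(tf_list_kind("bigint"))    # int64_list
print(tf_list_kind("string"))    # bytes_list
```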