tensorflowonspark.dfutil module
A collection of utility functions for loading/saving TensorFlow TFRecords files as Spark DataFrames.
- fromTFExample(iter, binary_features=[])
mapPartitions function to convert an RDD of serialized tf.train.Example bytestrings into an RDD of Row.
Note: TensorFlow represents both strings and binary types as tf.train.BytesList, and we need to disambiguate these types for Spark DataFrame DTypes (StringType and BinaryType), so we require a “hint” from the caller in the binary_features argument.
- Args:
- iter
the RDD partition iterator
- binary_features
a list of tf.train.Example features which are expected to be binary/bytearrays.
- Returns:
An array/iterator of DataFrame Row with features converted into columns.
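The string/binary disambiguation can be illustrated with a dependency-free sketch (this is not the library's code; the function name is made up for illustration):

```python
# Illustrative sketch (not the library's code): how a tf.train.BytesList value
# can be disambiguated into a Spark StringType vs. BinaryType column value,
# driven by the caller-supplied binary_features hint.

def convert_bytes_feature(name, raw_bytes, binary_features=[]):
    """Decode a BytesList value for a DataFrame column.

    Features named in binary_features stay as bytearrays (BinaryType);
    everything else is assumed to be UTF-8 text (StringType).
    """
    if name in binary_features:
        return bytearray(raw_bytes)       # kept raw -> Spark BinaryType
    return raw_bytes.decode("utf-8")      # decoded  -> Spark StringType

# The same wire type yields different column types depending on the hint:
text = convert_bytes_feature("label", b"cat")
blob = convert_bytes_feature("image", b"\x89PNG...", binary_features=["image"])
```

Without the hint, an image feature would be (incorrectly) decoded as a UTF-8 string, which is why the caller must supply `binary_features`.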
- infer_schema(example, binary_features=[])
Given a tf.train.Example, infer the Spark DataFrame schema (StructFields).
Note: TensorFlow represents both strings and binary types as tf.train.BytesList, and we need to disambiguate these types for Spark DataFrame DTypes (StringType and BinaryType), so we require a “hint” from the caller in the binary_features argument.
- Args:
- example
a tf.train.Example
- binary_features
a list of tf.train.Example features which are expected to be binary/bytearrays.
- Returns:
A DataFrame StructType schema
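The shape of the inference can be sketched as a mapping from proto feature kinds to Spark type names (a dependency-free illustration, not the library's code; Spark types are shown as strings so the sketch runs without pyspark):

```python
# Illustrative sketch (not the library's code): mapping tf.train.Example
# feature kinds to Spark SQL type names, with the binary_features hint
# resolving the BytesList ambiguity.

def infer_field_type(name, kind, binary_features=[]):
    """Return a Spark type name for a feature of the given proto 'kind'."""
    if kind == "int64_list":
        return "LongType"
    if kind == "float_list":
        return "FloatType"
    if kind == "bytes_list":
        # BytesList holds both text and raw bytes; only the hint can tell.
        return "BinaryType" if name in binary_features else "StringType"
    raise ValueError("unsupported feature kind: {}".format(kind))
```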
- isLoadedDF(df)
Returns True if the input DataFrame was produced by the loadTFRecords() method.
This is primarily used by the Spark ML Pipelines APIs.
- Args:
- df
Spark DataFrame
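One simple way such tracking can work (a sketch of the idea only, not necessarily the library's mechanism; all names here are hypothetical) is for the loader to keep a registry of the DataFrames it produced:

```python
# Illustrative sketch (not necessarily the library's mechanism): a loader
# records each DataFrame it produces, so a later isLoadedDF-style check can
# recognize DataFrames that came from loadTFRecords().

_loaded = {}  # hypothetical registry: DataFrame -> source directory

def register_loaded(df, input_dir):
    """Record that 'df' was produced by loading TFRecords from 'input_dir'."""
    _loaded[df] = input_dir

def is_loaded(df):
    """True only for DataFrames previously registered by the loader."""
    return df in _loaded

df = object()  # stand-in for a Spark DataFrame
register_loaded(df, "/data/tfrecords")
```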
- loadTFRecords(sc, input_dir, binary_features=[])
Load TFRecords from disk into a Spark DataFrame.
This will attempt to automatically convert the tf.train.Example features into Spark DataFrame columns of equivalent types.
Note: TensorFlow represents both strings and binary types as tf.train.BytesList, and we need to disambiguate these types for Spark DataFrame DTypes (StringType and BinaryType), so we require a “hint” from the caller in the binary_features argument.
- Args:
- sc
SparkContext
- input_dir
location of TFRecords on disk.
- binary_features
a list of tf.train.Example features which are expected to be binary/bytearrays.
- Returns:
A Spark DataFrame mirroring the tf.train.Example schema.
- saveAsTFRecords(df, output_dir)
Save a Spark DataFrame as TFRecords.
This will convert the DataFrame rows to TFRecords prior to saving.
- Args:
- df
Spark DataFrame
- output_dir
Path to save TFRecords
- toTFExample(dtypes)
mapPartitions function to convert a Spark RDD of Row into an RDD of serialized tf.train.Example bytestrings.
Note that tf.train.Example is a fairly flat structure with limited datatypes, e.g. tf.train.FloatList, tf.train.Int64List, and tf.train.BytesList, so most DataFrame types will be coerced into one of these types.
- Args:
- dtypes
the DataFrame.dtypes of the source DataFrame.
- Returns:
A mapPartitions function which converts the source DataFrame rows into tf.train.Example bytestrings.
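The coercion described above can be sketched as a mapping from Spark dtype strings (as returned by DataFrame.dtypes) to the three tf.train list types (an illustration only, not the library's code):

```python
# Illustrative sketch (not the library's code): coercing Spark DataFrame
# dtype strings into the three tf.train.Example list types. Spark has many
# column types; tf.train.Example only offers Int64List, FloatList, and
# BytesList, so related types collapse into one bucket.

def coerce_dtype(spark_dtype):
    """Map a Spark dtype string to the tf.train list type it coerces to."""
    if spark_dtype in ("tinyint", "smallint", "int", "bigint", "boolean"):
        return "int64_list"   # all integral-like types widen to Int64List
    if spark_dtype in ("float", "double"):
        return "float_list"   # floating types become FloatList
    if spark_dtype in ("string", "binary"):
        return "bytes_list"   # text and raw bytes both become BytesList
    raise ValueError("unsupported dtype: {}".format(spark_dtype))
```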