The code block shown below should store DataFrame transactionsDf on two different executors, utilizing the executors' memory as much as possible, but not writing anything to disk. Choose the answer that correctly fills the blanks in the code block to accomplish this.
from pyspark import StorageLevel
transactionsDf.__1__(StorageLevel.__2__).__3__
A. 1. cache 2. MEMORY_ONLY_2 3. count()
B. 1. persist 2. DISK_ONLY_2 3. count()
C. 1. persist 2. MEMORY_ONLY_2 3. select()
D. 1. cache 2. DISK_ONLY_2 3. count()
E. 1. persist 2. MEMORY_ONLY_2 3. count()
Which of the following statements about stages is correct?
A. Different stages in a job may be executed in parallel.
B. Stages consist of one or more jobs.
C. Stages ephemerally store transactions, before they are committed through actions.
D. Tasks in a stage may be executed by multiple machines at the same time.
E. Stages may contain multiple actions, narrow, and wide transformations.
Which of the following describes characteristics of the Dataset API?
A. The Dataset API does not support unstructured data.
B. In Python, the Dataset API mainly resembles Pandas' DataFrame API.
C. In Python, the Dataset API's schema is constructed via type hints.
D. The Dataset API is available in Scala, but it is not available in Python.
E. The Dataset API does not provide compile-time type safety.
Which of the following code blocks sorts the values inside the arrays in column attributes of DataFrame itemsDf in reverse alphabetical order (from last to first in the alphabet)?
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
A. itemsDf.withColumn('attributes', sort_array(col('attributes').desc()))
B. itemsDf.withColumn('attributes', sort_array(desc('attributes')))
C. itemsDf.withColumn('attributes', sort(col('attributes'), asc=False))
D. itemsDf.withColumn("attributes", sort_array("attributes", asc=False))
E. itemsDf.select(sort_array("attributes"))
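For readers who want to check the options against the sample data, the following is a minimal sketch, assuming a local PySpark installation. The helper names build_items_df and sort_attributes_descending are hypothetical and not part of the question; sort_array with asc=False is the PySpark function that sorts the values of an array column in descending order.

```python
def build_items_df(spark):
    """Recreate the sample itemsDf from the table above (hypothetical helper)."""
    rows = [
        (1, ["blue", "winter", "cozy"], "Sports Company Inc."),
        (2, ["red", "summer", "fresh", "cooling"], "YetiX"),
        (3, ["green", "summer", "travel"], "Sports Company Inc."),
    ]
    return spark.createDataFrame(rows, ["itemId", "attributes", "supplier"])


def sort_attributes_descending(items_df):
    """Return items_df with each array in 'attributes' sorted from Z to A."""
    # Imported lazily so the function can be defined without PySpark installed.
    from pyspark.sql.functions import sort_array

    # withColumn replaces the existing column and keeps itemId and supplier.
    return items_df.withColumn("attributes", sort_array("attributes", asc=False))
```

With a SparkSession named spark, calling sort_attributes_descending(build_items_df(spark)).show(truncate=False) prints the reordered arrays next to the untouched itemId and supplier columns.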
Which of the following describes properties of a shuffle?
A. Operations involving shuffles are never evaluated lazily.
B. Shuffles involve only single partitions.
C. Shuffles belong to a class known as "full transformations".
D. A shuffle is one of many actions in Spark.
E. In a shuffle, Spark writes data to disk.
The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where insufficient executor memory is available, in a fault-tolerant way. Find the error.
Code block:
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)
A. Caching is not supported in Spark, data are always recomputed.
B. Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
C. The storage level is inappropriate for fault-tolerant storage.
D. The code block uses the wrong operator for caching.
E. The DataFrameWriter needs to be invoked.
The code block shown below should return only the average prediction error (column predError) of a random subset, without replacement, of approximately 15% of rows in DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, __3__).__4__(avg('predError'))
A. 1. sample 2. True 3. 0.15 4. filter
B. 1. sample 2. False 3. 0.15 4. select
C. 1. sample 2. 0.85 3. False 4. select
D. 1. fraction 2. 0.15 3. True 4. where
E. 1. fraction 2. False 3. 0.85 4. select
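The word "approximately" in the question above can be illustrated with a plain-Python analogy. This is a sketch under a stated assumption: sampling without replacement is modeled as an independent keep-or-drop decision per row with probability equal to the fraction. It is not Spark's implementation, and the function bernoulli_sample and the pred_errors data are hypothetical.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Every row is visited once, so no row can appear twice
    (sampling without replacement).
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# Hypothetical stand-in for the predError column: 10,000 distinct values.
pred_errors = list(range(10_000))

subset = bernoulli_sample(pred_errors, fraction=0.15, seed=42)
share = len(subset) / len(pred_errors)   # close to 0.15, rarely exactly 0.15
average = sum(subset) / len(subset)      # average over the sampled rows only
```

Because each row is decided independently, the size of the subset varies from run to run around 15% of the input, which is why the question speaks of approximately 15% of rows.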
Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?
A. transactionsDf.withColumn("predErrorSqrt", sqrt(predError))
B. transactionsDf.select(sqrt(predError))
C. transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())
D. transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))
E. transactionsDf.select(sqrt("predError"))
Which of the following code blocks concatenates rows of DataFrames transactionsDf and transactionsNewDf, omitting any duplicates?
A. transactionsDf.concat(transactionsNewDf).unique()
B. transactionsDf.union(transactionsNewDf).distinct()
C. spark.union(transactionsDf, transactionsNewDf).distinct()
D. transactionsDf.join(transactionsNewDf, how="union").distinct()
E. transactionsDf.union(transactionsNewDf).unique()
The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__.__3__(__4__))
A. 1. select 2. col("storeId") 3. cast 4. StringType
B. 1. select 2. col("storeId") 3. as 4. StringType
C. 1. cast 2. "storeId" 3. as 4. StringType()
D. 1. select 2. col("storeId") 3. cast 4. StringType()
E. 1. select 2. storeId 3. cast 4. StringType()