A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?
A. The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model
B. The process will leak data from the training set to the test set during the evaluation phase
C. The process will be unable to parallelize tuning due to the distributed nature of pipeline
D. The process will leak data prep information from the validation sets to the training sets for each model
A data scientist has created a linear regression model that uses log(price) as a label variable. Using this model, they have performed inference and the predictions and actual label values are in Spark DataFrame preds_df.
They are using the following code block to evaluate the model:
regression_evaluator.setMetricName("rmse").evaluate(preds_df)
Which of the following changes should the data scientist make to evaluate the RMSE in a way that is comparable with price?
A. They should exponentiate the computed RMSE value
B. They should take the log of the predictions before computing the RMSE
C. They should evaluate the MSE of the log predictions to compute the RMSE
D. They should exponentiate the predictions before computing the RMSE
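To make the scale issue in this question concrete, here is a minimal sketch in plain Python rather than Spark. The prices and predictions are invented for illustration, and the `rmse` helper is hypothetical, not part of any Spark API. It compares an RMSE computed on log values with one computed after moving both predictions and labels back to the price scale.

```python
import math

# Hypothetical prices and model outputs; the model predicts log(price).
actual_price = [100.0, 200.0, 400.0]
predicted_log_price = [math.log(110.0), math.log(180.0), math.log(400.0)]


def rmse(predictions, labels):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# RMSE on the log scale: the result is in log-price units, not price units.
actual_log_price = [math.log(y) for y in actual_price]
rmse_log_scale = rmse(predicted_log_price, actual_log_price)

# RMSE on the price scale: exponentiate first, then compute the error.
predicted_price = [math.exp(p) for p in predicted_log_price]
rmse_price_scale = rmse(predicted_price, actual_price)

# Exponentiating the log-scale RMSE afterwards yields a different quantity.
exp_of_log_rmse = math.exp(rmse_log_scale)

print(rmse_log_scale, rmse_price_scale, exp_of_log_rmse)
```

On a Spark DataFrame the same transformation would typically be applied with `pyspark.sql.functions.exp` to the relevant columns before calling the evaluator; the exact column names depend on setup code that is not shown in the question.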
Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?
A. MLflow Experiment Tracking
B. Spark ML
C. Autoscaling clusters
D. Autoscaling clusters
E. Delta Lake
A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.
Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?
A. A holdout set is not necessary when using a train-validation split
B. Reproducibility is achievable when using a train-validation split
C. Fewer hyperparameter values need to be tested when using a train-validation split
D. Bias is avoidable when using a train-validation split
E. Fewer models need to be trained when using a train-validation split
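The cost difference between the two validation strategies can be sketched with simple arithmetic. The function below is a hypothetical helper and the grid size is an assumed value. It counts only the fits performed during tuning and ignores any final refit on the full training data.

```python
def models_trained(num_param_combinations, num_folds=None):
    """Count model fits during tuning.

    With k-fold cross-validation each hyperparameter combination is fit
    once per fold; with a single train-validation split it is fit once.
    """
    fits_per_combination = num_folds if num_folds is not None else 1
    return num_param_combinations * fits_per_combination


grid_size = 12  # hypothetical number of hyperparameter combinations

five_fold_fits = models_trained(grid_size, num_folds=5)
split_fits = models_trained(grid_size)

print(five_fold_fits, split_fits)
```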
A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?
A. import pyspark.pandas as ps
   df = ps.DataFrame(spark_df)
B. import pyspark.pandas as ps
   df = ps.to_pandas(spark_df)
C. spark_df.to_sql()
D. import pandas as pd
   df = pd.DataFrame(spark_df)
E. spark_df.to_pandas()
A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.
Which of the following terms is used to describe this combination of models?
A. Bootstrap aggregation
B. Support vector machines
C. Bucketing
D. Ensemble learning
E. Stacking
A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.
Which of the following classification metrics should be used to evaluate the model?
A. RMSE
B. Precision
C. Area under the residual operating curve
D. Accuracy
E. Recall
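As a reminder of how the listed classification metrics differ, here is a small sketch that computes precision, recall, and accuracy from confusion-matrix counts. The labels and predictions are invented for illustration; 1 marks a patient with the infection.

```python
# Hypothetical ground truth and model predictions (1 = infected).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
true_positives = sum(1 for t, p in pairs if t == 1 and p == 1)
false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)
false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)
true_negatives = sum(1 for t, p in pairs if t == 0 and p == 0)

# Recall: share of actual positive cases that the model identified.
recall = true_positives / (true_positives + false_negatives)

# Precision: share of predicted positives that were actually positive.
precision = true_positives / (true_positives + false_positives)

# Accuracy: share of all predictions that were correct.
accuracy = (true_positives + true_negatives) / len(pairs)

print(recall, precision, accuracy)
```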
A data scientist is using the following code block to tune hyperparameters for a machine learning model:

Which change can they make to the above code block to improve the likelihood of a more accurate model?
A. Increase num_evals to 100
B. Change fmin() to fmax()
C. Change SparkTrials() to Trials()
D. Change tpe.suggest to random.suggest
A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.
Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?
A. Spark ML decision trees test every feature variable in the splitting algorithm
B. Spark ML decision trees automatically prune overfit trees
C. Spark ML decision trees test more split candidates in the splitting algorithm
D. Spark ML decision trees test a random sample of feature variables in the splitting algorithm
E. Spark ML decision trees test binned feature values as representative split candidates
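The difference between exhaustive and binned split candidates for a continuous feature can be illustrated with a toy sketch. This is not the code used by either library; both helper functions are hypothetical, and the equal-frequency binning here only approximates the idea behind a bin limit such as Spark ML's `maxBins` parameter.

```python
def exact_split_candidates(values):
    """Midpoints between every pair of adjacent distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def binned_split_candidates(values, max_bins):
    """Roughly equal-frequency bin boundaries: at most max_bins - 1 of them."""
    ordered = sorted(values)
    n = len(ordered)
    boundaries = []
    for i in range(1, max_bins):
        boundary = ordered[(i * n) // max_bins]
        # Skip repeated boundaries that arise when values are duplicated.
        if not boundaries or boundary != boundaries[-1]:
            boundaries.append(boundary)
    return boundaries


feature = [float(v) for v in range(1, 101)]  # hypothetical feature column

exact = exact_split_candidates(feature)
binned = binned_split_candidates(feature, max_bins=4)

print(len(exact), binned)
```

With far fewer thresholds to evaluate, the binned search can choose a different split than the exhaustive one, which is one way two implementations can produce different trees from identical data.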
What is the name of the method that transforms categorical features into a series of binary indicator feature variables?
A. Leave-one-out encoding
B. Target encoding
C. One-hot encoding
D. Categorical
E. String indexing
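For reference, turning a categorical column into binary indicator columns can be sketched in a few lines of plain Python. The `one_hot_encode` helper and the sample data are hypothetical, and library encoders differ in details such as category ordering and whether one category is dropped.

```python
def one_hot_encode(values):
    """Return the sorted categories and one binary indicator row per value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one indicator is set per row
        encoded.append(row)
    return categories, encoded


colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature
categories, encoded = one_hot_encode(colors)

print(categories)
print(encoded)
```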