Want to pass your Databricks Certified Machine Learning Associate DATABRICKS-MACHINE-LEARNING-ASSOCIATE exam in the very first attempt? Try Pass2lead! It is equally effective for both starters and IT professionals.
A data scientist has replaced missing values in their feature set with each respective feature variable's median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.
Which of the following approaches can they take to include as much information as possible in the feature set?
A. Impute the missing values using each respective feature variable's mean value instead of the median value
B. Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them
C. Remove all feature variables that originally contained missing values from the feature set
D. Create a binary feature variable for each feature that contained missing values indicating whether each row's value has been imputed
E. Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
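The technique described in option D, pairing imputation with a "was imputed" indicator, can be sketched in plain Python. This is a minimal illustration, not part of the original question: the function name `impute_with_indicator` and the sample `age` column are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import median


def impute_with_indicator(values):
    """Median-impute a column and return a parallel 0/1 'was imputed' flag."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # median of the observed values only
    imputed = [fill if v is None else v for v in values]
    was_imputed = [1 if v is None else 0 for v in values]
    return imputed, was_imputed


# Hypothetical feature column; None marks a missing entry.
age = [34.0, None, 29.0, 41.0, None]
age_filled, age_was_imputed = impute_with_indicator(age)
print(age_filled)
print(age_was_imputed)
```

The indicator column keeps the fact that a value was missing available to the model, which the imputed column alone no longer carries.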
A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:

Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?
A. The data will be limited to a single executor preventing the model from being loaded multiple times
B. The model will be limited to a single executor preventing the data from being distributed
C. The model only needs to be loaded once per executor rather than once per batch during the inference process
D. The data will be distributed across multiple executors during the inference process
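The code block the question refers to is not reproduced here, so the following is only a plain-Python sketch of the general pattern, without Spark. It contrasts loading a model inside a per-batch function with loading it once inside a function that receives an iterator of batches. `load_model`, the toy doubling "model", and the batch contents are all hypothetical stand-ins. In PySpark, the iterator variant would typically correspond to a pandas UDF whose type hints use `Iterator[pd.Series]`.

```python
from typing import Iterator, List

load_calls = 0  # counts how often the (hypothetical) model is loaded


def load_model():
    """Stand-in for an expensive model load, e.g. from a model registry."""
    global load_calls
    load_calls += 1
    return lambda x: 2.0 * x  # toy "model": doubles its input


def predict_per_batch(batch: List[float]) -> List[float]:
    model = load_model()  # reloaded on every batch
    return [model(x) for x in batch]


def predict_iterator(batches: Iterator[List[float]]) -> Iterator[List[float]]:
    model = load_model()  # loaded once, before the first batch is consumed
    for batch in batches:
        yield [model(x) for x in batch]


batches = [[1.0, 2.0], [3.0], [4.0, 5.0]]

load_calls = 0
per_batch_out = [predict_per_batch(b) for b in batches]
per_batch_loads = load_calls

load_calls = 0
iterator_out = list(predict_iterator(iter(batches)))
iterator_loads = load_calls
```

Both variants produce the same predictions. They differ only in how often the load step runs: once per batch in the first case, once per call of the iterator function in the second.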
A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster.
Which of the following approaches will guarantee a reproducible training and test set for each model?
A. Manually configure the cluster
B. Write out the split data sets to persistent storage
C. Set a seed in the data splitting operation
D. Manually partition the input data
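The idea behind option B, splitting once and then reusing the stored result, can be sketched without Spark. This is an illustration under stated assumptions: a temporary directory stands in for persistent storage, and the helper names `split_and_persist` and `load_split` are hypothetical. In Spark the analogous step would be writing each DataFrame returned by `randomSplit` to a table or Parquet path and reading it back for every model. A seed alone may not be enough there, because the rows that land in each split can depend on how the data is partitioned.

```python
import json
import random
import tempfile
from pathlib import Path


def split_and_persist(rows, out_dir, train_fraction=0.8, seed=42):
    """Split once, then write both parts to disk so later runs reuse them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    parts = {"train": shuffled[:cut], "test": shuffled[cut:]}
    for name, part in parts.items():
        (Path(out_dir) / f"{name}.json").write_text(json.dumps(part))


def load_split(out_dir, name):
    """Read a previously persisted split; no randomness is involved here."""
    return json.loads((Path(out_dir) / f"{name}.json").read_text())


out_dir = tempfile.mkdtemp()  # stand-in for durable storage
split_and_persist(range(10), out_dir)
train = load_split(out_dir, "train")
test = load_split(out_dir, "test")
```

Every later training run that calls `load_split` sees exactly the same rows, regardless of how the compute environment has changed since the split was written.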