Homework 1

IMPORTANT Homework Submission Instructions

1. All homeworks must be submitted in one PDF file to Gradescope.

2. Please make sure to select the corresponding HW pages on Gradescope for each question.

3. For all coding components, complete the solutions in a Jupyter notebook or Google Colab, and export

the notebook (including both code and outputs) into a PDF file. Concatenate the theory solutions

PDF file with the coding solutions PDF file into one PDF file which you will submit.

4. Failure to adhere to the above submission format may result in penalties.


1 Concepts of Learning – Theory (Stephen)

The goal of this question is to get you familiarized with the conceptual taxonomy of different types of machine

learning tasks and to get you thinking about how ML could be applied in real life.

Some common types of machine learning tasks/problems are listed below:

classification

regression

ranking

clustering

conditional probability estimation

density estimation

pattern mining

You are given a list of situations below. Assign one machine learning task from the list above to each

situation below.

The situations are deliberately designed to simulate real-life applications and hence are open-ended. For each

situation, there may be more than one answer that could be appropriate, depending on your interpretation

of the data and task. You need only give ONE reasonable answer for each situation to get full

credit. Please limit any justification to at most two sentences.

1. You are developing a multiplayer online real-time strategy game and you need to find a way to match

players of similar skill against one another. Ideally, the players will be consistently challenged, not

bored by less experienced players or destroyed by more experienced players. You have the players play

a few games so you can retrieve features related to their playing ability.

2. You are a medical doctor trying to determine where the different segments of brain tumors are located

in MRI images. You have 1,000 images of patients diagnosed with brain tumors. Using your expert

knowledge, you hand label 100 images yourself by highlighting the appropriate pixels corresponding to

edema, necrosis, or enhancing segments. You would like an ML algorithm to do this work for you.

3. You are a composer building a system that automatically generates ragtime piano pieces. Part of this

process involves coming up with a harmonic progression for individual phrases. You have analyzed

the harmonic progressions of 20 ragtime pieces. You know that phrases are goal-oriented, ending on

either a “I” or “V” harmony, so you decide to generate harmonic progressions backwards from the ends

of phrases. You need to find a distribution over possible harmonies for time t given the harmony at

time t + 1. This way, you can sample from the possible harmonies and avoid using the most common

progression every time.

4. You are the Director of Pricing Algorithms & Data Science for Petco. You are trying to determine the

pricing algorithm for cat food. Specifically, you need to estimate how the prices of certain items should

change over time at specific stores. You have access to a database containing years of customer, store,

and product data (store location, Petco product price over time, competitor product price over time,

number of customers/day, etc.).

5. You work for Discover bank and need to create an algorithm to detect fraudulent transactions. You have

historic transaction data including transaction amount, time of transaction, location, etc. Based on

this data, you need to build an understanding of the underlying distribution of standard transactions.

You can then monitor new transactions to see how well they fit into the expected distribution, flagging

unusual transactions as potentially fraudulent.


6. You are working on the election campaign for your favorite senator. You must determine for each

potential voter whether they are “strong supporters,” “undecided,” “swing voters,” or “unlikely to

vote.” With this information, your team can focus its campaign towards winning over the undecided

and swing voter categories rather than wasting time and money on strong supporters or unlikely voters.

You have demographic data on potential voters including age, gender, location, party affiliation, etc.

7. You work for CarGurus and need to design a generalized search algorithm. Given the users’ search

words you are tasked with finding the most relevant cars and presenting them in order of relevance.

You are provided with a dataset of queries and their corresponding search results. Each result is labeled

with a relevance score describing how well it matches the search query.

8. You are the owner of a restaurant that is famous for your vegetable soup. You are trying to determine

how many pounds of vegetables to buy for next week. If you buy too much, the leftover vegetables

go to waste. If you buy too little, you will run out of vegetables prematurely and disappoint your

customers. You have data about all past weeks (how many customers you had, whether there was a

holiday, number of rainy days, etc.).

9. You are a marketing analyst at Express. You are trying to determine the public opinion on an experimental

line of green suits. You develop a natural language processing algorithm to read Threads

and Twitter/X posts with the appropriate hashtags and determine whether each post is “positive” or

“negative.”

10. You work for Spotify, improving their music recommendation system. You have access to millions of

users’ listening data. Your goal is to find common connections between the genres and songs/pieces

that these users are listening to. Using what is learned, the system can recommend music that is likely

to interest a user based on their listening/search history.

2 Model Selection – Theory (Yiyang)

Suppose you are a data scientist working for a healthcare company. Your team has been tasked with

developing a predictive model to identify the risk of patients getting strokes (−1 for healthy patients and 1

for stroke patients) based on various factors such as age, lifestyle, genetic markers, and medical history.

To solve this task, you have developed two models, M1 and M2, both of which have similar mean predictive

accuracy on the training set. However, M1 is a high-degree polynomial classifier with 10 parameters, while

M2 is a linear classifier with 5 parameters.

(a) Based on the current result, which model would you expect to generalize better to the test set, and

why? Explain your reason in one sentence.

(b) Suppose you have developed a third model, M3, which is a polynomial regression model with 100

parameters. This model has significantly better predictive accuracy than both M1 and M2 on the training

dataset. However, when you test M3 on a test dataset, its performance significantly drops. What might be

the reason for this drop in performance?

(c) What can you do to avoid the problem of performance dropping in Problem (b)? How does that change

M3’s model complexity?

Now, you look closer into your collected dataset and find that the proportion of stroke patients and non-stroke

patients in both your training and validation datasets is 1:8, and the confusion matrices for M1 and

M2 on the validation dataset are shown below:


M1                    True Stroke    True Healthy
Predicted Stroke      85             10
Predicted Healthy     15             890

M2                    True Stroke    True Healthy
Predicted Stroke      70             8
Predicted Healthy     30             892

(d) Calculate the sensitivity, specificity, and accuracy of M1 and M2. Would your answer

to part (a) change? Why or why not? If you want to train a good classifier using the same dataset,

what techniques can be employed at training time to overcome the challenges of working with this imbalanced

dataset?
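As a reminder of the definitions used in part (d), the following is a minimal sketch of how the three metrics can be computed from confusion-matrix counts. The function name `binary_metrics` and the counts in the usage line are hypothetical and are not taken from the tables above; "positive" is assumed to mean the stroke class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    Assumption: tp/fn count true positives split by prediction,
    and tn/fp count true negatives split by prediction.
    """
    sensitivity = tp / (tp + fn)  # fraction of true positives that are detected
    specificity = tn / (tn + fp)  # fraction of true negatives that are detected
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, accuracy


# Hypothetical counts for illustration only.
sens, spec, acc = binary_metrics(tp=40, fp=5, fn=10, tn=45)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, accuracy={acc:.2f}")
```

On an imbalanced dataset, accuracy is dominated by the majority class, so it is worth reading it alongside sensitivity and specificity rather than on its own.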

3 Regularization (Jon)

Let’s try to determine whether adding an additional regularization term to a model’s objective function

reduces the model’s complexity.

Consider 0-1 loss, and models (parameterized by θ) that use m(θ) variables, where we regularize the number

of variables. So the objective is:

$$\min_{\theta} \; L(\theta) + \lambda\, m(\theta),$$

where L(θ) is 1/n times the number of misclassifications and n is the number of data points.
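To make the objective concrete, here is a small sketch that evaluates L(θ) + λm(θ) for a toy linear classifier. It rests on two assumptions that the problem statement does not fix: the model predicts sign(θ⊤x) with labels in {−1, 1}, and m(θ) is taken to be the number of nonzero coefficients. The names `zero_one_loss` and `regularized_objective` and the toy data are hypothetical.

```python
import numpy as np


def zero_one_loss(y_true, y_pred):
    """(1/n) times the number of misclassified points."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))


def regularized_objective(theta, X, y, lam):
    """Evaluate L(theta) + lam * m(theta).

    Assumptions: predictions are sign(X @ theta) with labels in {-1, 1},
    and m(theta) counts the nonzero coefficients of theta.
    """
    theta = np.asarray(theta, dtype=float)
    scores = np.asarray(X, dtype=float) @ theta
    y_pred = np.where(scores >= 0, 1, -1)
    m = int(np.count_nonzero(theta))
    return zero_one_loss(y, y_pred) + lam * m


# Toy data for illustration only.
X_toy = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0], [-2.0, 1.0]])
y_toy = np.array([1, 1, -1, -1])

dense = regularized_objective([1.0, 0.1], X_toy, y_toy, lam=0.05)
sparse = regularized_objective([1.0, 0.0], X_toy, y_toy, lam=0.05)
print(dense, sparse)
```

In this toy case both models classify every point correctly, so the two objective values differ only through the λm(θ) term.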

1. What is the largest value of λ that cannot affect the accuracy of the optimal solution θ0? In other words,

if λ < N for some number N, then for any θ0 ∈ arg min_θ L(θ) and any θλ ∈ arg min_θ [L(θ) + λm(θ)], L(θ0) = L(θλ). What is the largest possible N?

Hint: this answer relies on using the 0-1 loss function. N will depend on the number of variables in θ0. There are two parts for this proof: showing that if λ < N, the condition holds, and if λ > N, it is possible for the condition not to hold.

2. Consider θ1 and θ2, which both have the same optimal objective value, L(θ1)+λm(θ1) = L(θ2)+λm(θ2).

We have that model θ2 has 2 fewer variables than θ1. Express θ1’s training error L(θ1) in terms of θ2’s

training error L(θ2) and λ.

3. We again have two models, θ3 and θ4 which were optimized using objectives that had different regularization

parameters λ3 and λ4 where λ3 > λ4. We know that the second one is more accurate

on the training set, L(θ4) < L(θ3) + ϵ. We also know that the objective of θ3 equals that of θ4, L(θ3) + λ3m(θ3) = L(θ4) + λ4m(θ4). We also know that θ3 is not too much smaller than θ4 in that it uses at most z fewer variables. Then, it is true that θ4 is at most a certain size, specifically: m(θ4) < function(λ3, λ4, ϵ, z). What is this function?

4 Classifiers and Metrics - Coding (Stark)

Age    likeRowing    Experience    Income    Y
20     1             0             20        0
18     1             1             33        0
11     0             1             21        1
31     0             0             9         1
22     1             1             7         1
21     1             0             10        0
13     1             0             23        1
15     1             1             16        0
16     0             1             15        1
17     1             0             6         0

You are given the dataset above with feature vector x including Age, likeRowing, Experience, and Income, and the binary label Y, whether the student is accepted to the Stanford rowing team. You are also given a linear classifier g(x) = θ⊤x + θ0 and a non-linear classifier f(x) = tanh(θ⊤x + θ0), where θ = (0.05, −3, 2.1, 0.008), θ0 = 0.3, and the “tanh” function is $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$. (In this question, you are expected to write functions from scratch, but packages including Matplotlib and NumPy are allowed.)

(a) First calculate the value of g(x) for each data point. What is the largest threshold value that would minimize (mis)classification error?

(b) Calculate the value f(x) for each data point. What is the largest threshold value that would minimize (mis)classification error? Compute the confusion matrix, precision, recall, and F1 score for one such threshold.

(c) For classifiers f(x) and g(x), plot the ROC curves. Please plot each ROC curve as a continuous, connected set of lines. Plot all the points on the ROC curve that represent decision points with the minimum classification error.

(d) For the ROC curves in (c), calculate the AUC from scratch using only the numpy package (do not use sklearn or similar packages).

5 K-Nearest Neighbors with Parameter Tuning - Coding (Harry & Eric)

In this problem, you will implement from scratch the k-NN algorithm on the breast cancer dataset. You will also implement your own cross-validation algorithm in order to tune your model. You should not use a pre-existing k-NN algorithm or cross-validation algorithm such as from Sklearn. If you are unsure whether a package is allowed, feel free to ask in EdDiscussion.

The breast cancer dataset contains 30 feature variables and a target variable which you are trying to predict. The data has already been split into a training and test set for you.

(a) Is accuracy or F1 score a more appropriate performance metric to use for this task? Why?

(b) Implement a k-NN algorithm from scratch to classify the dataset. Use k = 31 and the Euclidean distance, and make sure to normalize the data. Report your model’s F1 score on the test set.

Note: See the sklearn article on the MinMaxScaler for more information on how to perform the normalization without leaking information between the train and test sets. Make sure to still implement it from scratch.

(c) Use cross-validation with 5 folds and the F1 score to tune the value of k and the distance function used (possible distance functions to use could be Euclidean distance, Manhattan distance, or cosine similarity). Make sure to use at least five values of k between 1 and 63, and try at least two distance functions. For each distance function, show a plot where the x-axis depicts the value of k and the y-axis depicts the average F1 score for that value of k during cross-validation. Which pair of parameters performed the best?

Note: You may find the array_split method from NumPy helpful when implementing cross-validation.

(d) Using the best parameters determined in part (c), report the performance of a k-NN classifier on the test set. Compare this to your model in part (b). Is it as you expected?
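The two notes in Problem 5 (normalizing without leakage, and splitting with NumPy) can be illustrated with the rough sketch below. It is one possible arrangement, not a required design: the helper names `fit_min_max`, `apply_min_max`, and `k_fold_indices` are hypothetical, shuffling with a fixed seed is an assumption, and the k-NN classifier itself is deliberately left out.

```python
import numpy as np


def fit_min_max(X_train):
    """Learn per-feature minimum and range from the training data only."""
    X_train = np.asarray(X_train, dtype=float)
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features: avoid dividing by zero
    return lo, span


def apply_min_max(X, lo, span):
    """Scale any split (train, validation, or test) with the training statistics."""
    return (np.asarray(X, dtype=float) - lo) / span


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs built with np.array_split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx


# Toy usage: random features stand in for the real dataset.
X_demo = np.random.default_rng(1).normal(size=(12, 3))
for train_idx, val_idx in k_fold_indices(len(X_demo), n_folds=5):
    lo, span = fit_min_max(X_demo[train_idx])      # statistics from the training fold
    X_tr = apply_min_max(X_demo[train_idx], lo, span)
    X_va = apply_min_max(X_demo[val_idx], lo, span)  # reuse them on the held-out fold
```

Refitting the scaling statistics inside each fold keeps the held-out fold from influencing the normalization, which mirrors the train/test separation described in the note for part (b). Scaled validation values may fall slightly outside [0, 1], which is expected.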
