Academic Project
Data Analysis
Python
scikit-learn
pandas
NumPy
Matplotlib
Static Data Visualization

I created this Jupyter Notebook as my final project for Cornell's INFO 1998: Introduction to Machine Learning course in Spring 2022. It uses Python's scikit-learn library to build a simple machine learning model that predicts the flowering dates of cherry blossoms in Kyoto, Japan from temperature data.

I chose this topic after seeing several news articles about how cherry trees are flowering earlier and earlier in many places due to climate change. After looking into studies on the topic, I found an impressive dataset that contained cherry blossom full-flowering dates as well as reconstructed average March temperatures in Kyoto, with records running from 800 A.D. all the way to 2010.

After doing exploratory data analysis and looking at some static visualizations of the dataset, I initialized, trained, and tested a linear regression model using scikit-learn. Surprisingly, despite the simplicity of the approach, the model produced an R² score (coefficient of determination) of 0.97, meaning that it explained 97% of the variance in flowering dates based solely on March temperature data.
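A minimal sketch of that kind of workflow might look like the following. It uses synthetic stand-in data rather than the Kyoto dataset, and the variable names (`march_temp_c`, `flowering_doy`), the 80/20 split, and the random seeds are assumptions for illustration, not the notebook's actual code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Kyoto dataset): warmer Marches
# are made to produce earlier flowering days, plus some noise.
rng = np.random.default_rng(0)
march_temp_c = rng.uniform(4.0, 10.0, size=200)  # hypothetical March mean temps
flowering_doy = 125.0 - 3.0 * march_temp_c + rng.normal(0.0, 1.0, size=200)

X = march_temp_c.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = flowering_doy                # flowering date as day of year

# Hold out 20% of the rows so the score is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Initialize, train, and test the model.
model = LinearRegression()
model.fit(X_train, y_train)

# For regressors, score() returns R^2, the coefficient of determination.
r2 = model.score(X_test, y_test)
slope = model.coef_[0]

print(f"R^2 on held-out data: {r2:.2f}")
print(f"Change in flowering day per degree C: {slope:.2f}")
```

The negative slope in this toy data mirrors the pattern the project is about: a warmer March goes with an earlier flowering date.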

After seeing this result, I was curious about how flowering dates might be associated with weather factors other than temperature. Answering that question meant finding another dataset and doing some cleaning and preprocessing. I eventually found a suitable dataset from the Japan Meteorological Agency that contained monthly averages for a variety of weather factors from 1880 to the present.

I then built a correlation matrix to quickly get an overview of how strongly each weather factor - humidity, precipitation, cloud cover, and sunshine duration - was associated with flowering dates. After seeing how models built with these factors performed, I found that no factor was nearly as strongly associated with flowering dates as temperature.
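As a rough illustration of the correlation-matrix step, here is a small sketch using pandas. The data is synthetic, the column names are hypothetical, and including temperature alongside the other four factors is my assumption for the sake of comparison; none of it is taken from the Japan Meteorological Agency dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real weather dataset).
rng = np.random.default_rng(1)
n = 120
temperature = rng.normal(8.0, 1.5, size=n)

weather = pd.DataFrame({
    "temperature": temperature,
    "humidity": rng.normal(65.0, 5.0, size=n),
    "precipitation": rng.normal(110.0, 30.0, size=n),
    "cloud_cover": rng.normal(6.0, 1.0, size=n),
    "sunshine_duration": rng.normal(170.0, 20.0, size=n),
})

# In this toy data only temperature drives the flowering day of year.
weather["flowering_doy"] = (
    125.0 - 3.0 * temperature + rng.normal(0.0, 2.0, size=n)
)

# DataFrame.corr() computes pairwise Pearson correlations by default.
corr = weather.corr()

# Keep each factor's correlation with flowering date, then rank by strength.
flowering_corr = corr["flowering_doy"].drop("flowering_doy")
ranked = flowering_corr.abs().sort_values(ascending=False)

print(ranked)
```

Ranking by absolute value puts strongly negative and strongly positive associations on the same footing, which is what matters when asking which factor is most closely tied to flowering dates.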

I completed this project as a second-year student at Cornell, when I was still getting the hang of using Python for data analysis and visualization. Now that I've learned more advanced machine learning techniques and spent time better understanding how to interpret model results, I'd be interested in revisiting this project to see how I could improve my work.

That said, this was a fun project that introduced me to the basics of machine learning and scikit-learn. The process of finding datasets related to interesting real-world topics, cleaning them up, and then exploring them for unexpected insights is something I've enjoyed doing many times since.