SUMMARY
A key feature of this course is that students should be using their own data whenever possible. This is critical to forging a learning experience that is customized to each student’s aspirations and the eccentricities of their chosen research domain. This assignment begins the process of helping you identify the data sets with which you want to work, and aligns with the notion of understanding your audience(s) and your communication goal(s).
ASSIGNMENT
The basic structure of this assignment is for you to identify, import, describe, and host a data set. I’ll break down the specifics for each of these actions below.
Identify a Data Set
The main criterion is that the data set has to matter to you in some way. Often, this will mean that it is your data set: it was collected by you and has a central role in your current or past graduate research. Awesome! Another scenario is that the data you want to use comes from your current job. Maybe it isn’t part of a research project, but you are motivated to learn how to work with the data or you are very interested in learning more about it. Also awesome!
Some of you might not have your own data. Perhaps you have just started your graduate training. Maybe your job doesn’t yet have data that you need to work with. No problem!
It is perfectly fine to find publicly available data sets online, as long as the data set is interesting to you! You just need to make sure that the data:
- Are publicly available.
- Are not restricted by some kind of license or copyright.
- Do not contain private information.
- Are not covered by HIPAA, FERPA, CMMC, or other federal regulations related to data.
If you need help finding a data set, just let me know.
Some fun potential categories for data sources include:
- Sports Analytics from your favorite sport or team.
- Publicly available genomics databases.
- Kaggle.
- The Movie Database (TMDB).
- Classic data sets from your field.
Import the Data Set
This one is probably straightforward if your data set comes from your own research and lives on your local hard drive already.
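If your data live in a delimited text file, the import step can be as small as the sketch below. It is only an illustration, not the required workflow for this course: the column names and values are made up, `load_rows` is a hypothetical helper, and an in-memory string stands in for a file on your hard drive.

```python
import csv
import io


def load_rows(source):
    """Read delimited text from an open file-like object into a list of dicts, one per row."""
    reader = csv.DictReader(source)
    return list(reader)


# Made-up sample standing in for a real file. For a file on disk you would
# pass open("my_data.csv", newline="") instead (the file name is hypothetical).
sample = io.StringIO("species,mass_g,site\nfinch,18.2,A\nwren,10.9,B\n")

rows = load_rows(sample)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

Note that the `csv` module reads every value as text, so numeric columns still need to be converted before analysis.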
Describe the Data Set
This is the bulk of the assignment. I want you to use the framework described in Dr. Munzner’s textbook to understand your data set and describe it to someone who is unfamiliar with your work. The basis of this approach is described in this figure from the textbook I use in BCB 520. It summarizes the kinds of data types, data set types, and attribute types you might have in your data.
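As a starting point for the description, a first pass at attribute types can be roughed out in code. The sketch below is a crude heuristic and not part of the textbook's framework: `guess_attribute_type` and `describe_table` are hypothetical helpers, the rows are made up, and the rule is simply "every value parses as a number" versus "anything else". It cannot tell ordinal from categorical attributes, and it will mislabel numeric ID codes as quantitative, so treat the output as a draft to correct by hand.

```python
def guess_attribute_type(values):
    """Guess 'quantitative' if every non-empty value parses as a number, else 'categorical'."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "unknown"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "categorical"
    return "quantitative"


def describe_table(rows):
    """Map each column name to a guessed attribute type."""
    columns = rows[0].keys()
    return {col: guess_attribute_type([r[col] for r in rows]) for col in columns}


# Made-up rows in the shape produced by csv.DictReader (all values are text).
rows = [
    {"species": "finch", "mass_g": "18.2", "site": "A"},
    {"species": "wren", "mass_g": "10.9", "site": "B"},
]

summary = describe_table(rows)
for col, kind in summary.items():
    print(f"{col}: {kind}")
```

The written description still has to come from you: whether an attribute is ordinal, what each item represents, and what kind of data set you have are judgments the code cannot make.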
Host your Data Set
Ultimately, we are moving toward each of you hosting your assignments within an online repository that can serve as your data science portfolio. For this course, we are going to assume this is GitHub. Create a new blog post for this assignment.
Consider a Communication Scenario
Let’s start with a 5-minute presentation to the class. Due… I don’t know… one week from today (Tuesday, September 17, 2024). I want a five-minute presentation that uses your data. At the beginning of the presentation, I want you to clearly specify:
- The characteristics of your audience that you consider important, and for which you must prepare.
- Your goals for the presentation. I’d like one overarching (or longer-term) goal, one communication goal, and one more goal that you are free to specify. Have a look at the lists from the previous lecture for inspiration.
RESOURCES
A YouTube Video from Posit on Building your Data Science Portfolio
A fun Spotify example from TidyTuesday by Kaylin Pavlik.