This assignment provides you the opportunity to synthesize all of the concepts we’ve covered in the course to date. The basic framework is that you will create a COMPLETE data visualization BLOG post that is suitable as a showcase component of your Data Science Portfolio. The point is to SHOW people your skills.
STRUCTURE
The basic formatting guidelines for this assignment are:
Include code fold or code tools options (or both) that allow users to view and copy your code while maintaining overall readability of your post.
Suppress all output and warnings that might distract from your visualizations and writing.
Properly title your assignment. The main title should be “BCB 520 - Midterm Portfolio Post”, and the subtitle should be a descriptive title related to the topic.
Include author, date, categories, and a description in your YAML header.
Write clear, complete sentences for a target audience with some scientific background but little training in formal data science. Imagine your report is intended for high level University leadership (presidents and vice presidents) and the State Board of Education.
Use the header hierarchy and create a sensible document outline with white space. Format for readability! Use bold and italic fonts to emphasize things! Use color by customizing your .css file!
In addition to the above formatting guidelines, your portfolio post must contain the following sections:
Preamble
Write a brief paragraph describing the primary questions or purpose of the post. Use clear, concise language to justify the post to a naive reader. Why is the question or purpose important?
Data
Write a summary of the data sources you have used. Include a Data Dictionary table that fully describes each individual data source used. You may supplement the data supplied to you with additional sources if you like, but this is not required.
Visualizations
Create your visualizations in response to the questions and prompts. Answer the questions and prompts directly. Draw a conclusion or inference related to each. Identify limitations and the types of data you would need to mitigate those limitations. Also include text that explains any steps or design choices you considered while exploring the vizualization options [this normally wouldn’t appear in the report to the vice president, but I’d like a window into your process]. Be sure to include clearly labeled axes and a concise but complete figure caption for each visualization. Make deliberate choices for color palettes, point marks, line types, etc. Demonstrate that you understand the concepts we have covered!
Conclusions or Summary
Summarize your results. What new questions have emerged as a result of your visualizations? What interesting next steps have emerged?
RUBRIC
I will evaluate the following for your portfolio post:
1. Clarity of writing (15%): Complete, clear sentences. Good Grammar. Understandable to target audience. Logical flow of ideas.
2. Adherence to format (15%): Did you follow directions?
3. Viz Execution (40%): Are the visualizations effective? Do they adhere to the principles of effectiveness? Are choices for idiom, marks, channels, etc made deliberately and well justified?
5. Creativity (30%): Did you push your boundaries and learn new techniques? Is the overall post compelling and interesting? Are the visualizations inspiring, creative, unique, and generally impressive? If I were recruiting a new data scientist (and I often am), would this portfolio post impress me, or would it damage your candidacy during review?
TASK
I am assigning you a real task that has multiple paths to a high quality response. It is related to one of the projects within IIDS, in which we are helping the UI Office of Research and Economic Development leverage institutional data for strategic decisions.
In this case, we’ll work with award (grant) data from four federal sponsors:
The National Science Foundation
The National Institutes of Health
The Department of Energy
The US Department of Agriculture
THE DATA
Department of Agriculture (NIFA)
Their award database is here, and I’ve downloaded the awards to the University of Idaho and put the file (USDAtoUI) in the post directory.
Department of Energy
I’ve downloaded their whole award data set from here and put the file (DOEawards.xlsx) in the github directory for this post.
National Institutes of Health
We can grab data from the NIH API, which is well described here.
National Science Foundation
The NSF also has an API, and the following code will pull down awards to the University of Idaho into a data frame called NSFtoUI.
Code
library(httr)library(jsonlite)library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(readxl)# Base URL for the APIbase_url <-"https://www.research.gov/awardapi-service/v1/awards.json?awardeeName=%22regents+of+the+university+of+idaho%22"printFields <-"rpp,offset,id,agency,awardeeCity,awardeeCountryCode,awardeeDistrictCode,awardeeName,awardeeStateCode,awardeeZipCode,cfdaNumber,coPDPI,date,startDate,expDate,estimatedTotalAmt,fundsObligatedAmt,ueiNumber,fundProgramName,parentUeiNumber,pdPIName,perfCity,perfCountryCode,perfDistrictCode,perfLocation,perfStateCode,perfZipCode,poName,primaryProgram,transType,title,awardee,poPhone,poEmail,awardeeAddress,perfAddress,publicationResearch,publicationConference,fundAgencyCode,awardAgencyCode,projectOutComesReport,abstractText,piFirstName,piMiddeInitial,piLastName,piEmail"# Initialize an empty data frame to store resultsNSFtoUI <-tibble()# Number of results per page (as per API settings)results_per_page <-25# Variable to keep track of the current page numbercurrent_page <-1# Variable to control the loopkeep_going <-TRUEwhile(keep_going) {# Calculate the offset for the current page offset <- (current_page -1) * results_per_page +1# Construct the full URL with offset url <-paste0(base_url, "&offset=", offset, "&printFields=", printFields)# Make the API call response <-GET(url)# Check if the call was successfulif (status_code(response) ==200) {# Extract and parse the JSON data json_data <-content(response, type ="text", encoding ="UTF-8") parsed_data <-fromJSON(json_data, flatten =TRUE)# Extract the 'award' data and add to the all_awards data frame awards_data <- parsed_data$response$award NSFtoUI <-bind_rows(NSFtoUI, as_tibble(awards_data))# Debug: Print the current page number and number of awards fetchedprint(paste("Page:", current_page, "- Awards fetched:", length(awards_data$id)))# Check if the current page has less than results_per_page awards, then it's the last pageif (length(awards_data$id) < results_per_page) { keep_going <-FALSE } else { current_page <- current_page +1 } } else {print(paste("Failed to fetch data: Status code", status_code(response))) keep_going <-FALSE }}
Provide a visualization that shows our active awards from each sponsor. I need to see their start date and end date, the amount of the award, and the name of the Principal Investigator. I’m really interested in seeing how far into the future our current portfolio will exist. Are there a bunch of awards about to expire? Are there a bunch that just got funded and will be active for a while? Does this vary across sponsors?
QUESTION 2:
What is the proportional representation of new awards to the UI from these various sources over the past 5 to 10 years? Are there any trends that are encouraging or discouraging?
QUESTION 3:
How is UI performing with these sponsors when compared to the following peer institutions?
Boise State University
Idaho State University
Montana State University
University of Montana
Washington State University
Note that “performing” can mean a variety of different things. You must choose your metrics of performance and justify them.
Source Code
---title: "Midterm Assignment"author: "Barrie Robison"date: "2024-03-07"categories: [Assignment, DataViz, Tables, Project]image: "Cthulhuhockeycard.png"code-fold: truecode-tools: truedescription: "Are we good at grants and stuff?"---## OVERVIEWThis assignment provides you the opportunity to synthesize all of the concepts we've covered in the course to date. The basic framework is that you will create a COMPLETE data visualization BLOG post that is suitable as a showcase component of your Data Science Portfolio. The point is to [SHOW]{.red} people your skills.## STRUCTUREThe basic formatting guidelines for this assignment are:1. Include `code fold` or `code tools` options (or both) that allow users to view and copy your code while maintaining overall readability of your post.2. Suppress all output and warnings that might distract from your visualizations and writing.3. Properly title your assignment. The main title should be **"BCB 520 - Midterm Portfolio Post"**, and the subtitle should be a descriptive title related to the topic.4. Include author, date, categories, and a description in your YAML header.5. Write clear, complete sentences for a target audience with some scientific background but little training in formal data science. Imagine your report is intended for high level University leadership (presidents and vice presidents) and the State Board of Education.6. Use the header hierarchy and create a sensible document outline with white space. Format for readability! Use **bold** and *italic* fonts to emphasize things! Use [color]{.red} by customizing your `.css` file!**In addition to the above formatting guidelines, your portfolio post must contain the following sections:**### PreambleWrite a brief paragraph describing the primary questions or purpose of the post. Use clear, concise language to justify the post to a naive reader. Why is the question or purpose important?### DataWrite a summary of the data sources you have used. Include a `Data Dictionary` table that fully describes each individual data source used. You may supplement the data supplied to you with additional sources if you like, but this is not required. ### VisualizationsCreate your visualizations in response to the questions and prompts. Answer the questions and prompts directly. Draw a conclusion or inference related to each. Identify limitations and the types of data you would need to mitigate those limitations. Also include text that explains any steps or design choices you considered while exploring the vizualization options [this normally wouldn't appear in the report to the vice president, but I'd like a window into your process]. Be sure to include clearly labeled axes and a concise but complete figure caption for each visualization. Make deliberate choices for color palettes, point marks, line types, etc. Demonstrate that you understand the concepts we have covered!### Conclusions or SummarySummarize your results. What new questions have emerged as a result of your visualizations? What interesting next steps have emerged?## RUBRICI will evaluate the following for your portfolio post:**1. Clarity of writing (15%):** Complete, clear sentences. Good Grammar. Understandable to target audience. Logical flow of ideas.**2. Adherence to format (15%):** Did you follow directions?**3. Viz Execution (40%):** Are the visualizations effective? Do they adhere to the principles of effectiveness? Are choices for idiom, marks, channels, etc made deliberately and well justified?**5. Creativity (30%):** Did you push your boundaries and learn new techniques? Is the overall post compelling and interesting? Are the visualizations inspiring, creative, unique, and generally impressive? If I were recruiting a new data scientist (and I often am), would this portfolio post impress me, or would it damage your candidacy during review?# TASKI am assigning you a real task that has multiple paths to a high quality response. It is related to one of the projects within IIDS, in which we are helping the UI Office of Research and Economic Development leverage institutional data for strategic decisions.In this case, we'll work with award (grant) data from four federal sponsors:1. The National Science Foundation2. The National Institutes of Health3. The Department of Energy4. The US Department of Agriculture## THE DATA### Department of Agriculture (NIFA)Their award database is [here](https://portal.nifa.usda.gov/lmd4/recent_awards), and I've downloaded the awards to the University of Idaho and put the file (USDAtoUI) in the post directory.### Department of EnergyI've downloaded their whole award data set from [here](https://pamspublic.science.energy.gov/WebPAMSExternal/Interface/Awards/AwardSearchExternal.aspx?controlName=ContentTabs) and put the file (DOEawards.xlsx) in the github directory for this post.### National Institutes of HealthWe can grab data from the NIH API, which is well described [here](https://api.reporter.nih.gov/documents/Data%20Elements%20for%20RePORTER%20Project%20API_V2.pdf).### National Science FoundationThe NSF also has an API, and the following code will pull down awards to the University of Idaho into a data frame called `NSFtoUI`.```{r}library(httr)library(jsonlite)library(tidyverse)library(readxl)# Base URL for the APIbase_url <-"https://www.research.gov/awardapi-service/v1/awards.json?awardeeName=%22regents+of+the+university+of+idaho%22"printFields <-"rpp,offset,id,agency,awardeeCity,awardeeCountryCode,awardeeDistrictCode,awardeeName,awardeeStateCode,awardeeZipCode,cfdaNumber,coPDPI,date,startDate,expDate,estimatedTotalAmt,fundsObligatedAmt,ueiNumber,fundProgramName,parentUeiNumber,pdPIName,perfCity,perfCountryCode,perfDistrictCode,perfLocation,perfStateCode,perfZipCode,poName,primaryProgram,transType,title,awardee,poPhone,poEmail,awardeeAddress,perfAddress,publicationResearch,publicationConference,fundAgencyCode,awardAgencyCode,projectOutComesReport,abstractText,piFirstName,piMiddeInitial,piLastName,piEmail"# Initialize an empty data frame to store resultsNSFtoUI <-tibble()# Number of results per page (as per API settings)results_per_page <-25# Variable to keep track of the current page numbercurrent_page <-1# Variable to control the loopkeep_going <-TRUEwhile(keep_going) {# Calculate the offset for the current page offset <- (current_page -1) * results_per_page +1# Construct the full URL with offset url <-paste0(base_url, "&offset=", offset, "&printFields=", printFields)# Make the API call response <-GET(url)# Check if the call was successfulif (status_code(response) ==200) {# Extract and parse the JSON data json_data <-content(response, type ="text", encoding ="UTF-8") parsed_data <-fromJSON(json_data, flatten =TRUE)# Extract the 'award' data and add to the all_awards data frame awards_data <- parsed_data$response$award NSFtoUI <-bind_rows(NSFtoUI, as_tibble(awards_data))# Debug: Print the current page number and number of awards fetchedprint(paste("Page:", current_page, "- Awards fetched:", length(awards_data$id)))# Check if the current page has less than results_per_page awards, then it's the last pageif (length(awards_data$id) < results_per_page) { keep_going <-FALSE } else { current_page <- current_page +1 } } else {print(paste("Failed to fetch data: Status code", status_code(response))) keep_going <-FALSE }}```## QUESTION 1:Provide a visualization that shows our active awards from each sponsor. I need to see their start date and end date, the amount of the award, and the name of the Principal Investigator. I'm really interested in seeing how far into the future our current portfolio will exist. Are there a bunch of awards about to expire? Are there a bunch that just got funded and will be active for a while? Does this vary across sponsors?## QUESTION 2:What is the proportional representation of new awards to the UI from these various sources over the past 5 to 10 years? Are there any trends that are encouraging or discouraging?## QUESTION 3: How is UI performing with these sponsors when compared to the following peer institutions?1. Boise State University2. Idaho State University3. Montana State University4. University of Montana5. Washington State UniversityNote that "performing" can mean a variety of different things. You must choose your metrics of performance and justify them.