
# Question: Cornwall and Devon are areas in the Southwest region of England known to have topsoil heavily contaminated with arsenic. You are tasked with assessing the burden of environmental contamination in these areas.

01 Nov 2023,6:21 PM

## Question 1

Cornwall and Devon are areas in the Southwest region of England known to have topsoil heavily contaminated with arsenic. You are tasked with assessing the burden of environmental contamination in these areas.

Instructions: Your full UCL student ID number represents the total number of topsoil samples taken from residential garden soils across the region, and the last 4 digits of your UCL ID represent the number of topsoil samples with elevated arsenic concentrations exceeding the UK soil acceptable limits. Use this information to answer questions 1a and 1b.

Note: Your student ID number contains eight digits, and it should look something akin to these examples: 18020105 or 19012500. Using 19012500 as a motivating example to explain the above instruction: 19012500 (full ID) will represent the total number of topsoil samples; and 2500 (last four digits) is the number of samples exceeding the acceptable limits.

If the last four digits of your ID begin with a zero (for instance, 0105 from 18020105), you can choose to use the last three digits (105) or the last five digits (20105) instead, to arrive at a number that does not start with 0.

a) What is the prevalence of soil samples detected to exceed the UK acceptable limits (express as %)? [1]

b) Develop a function in R (you may name it yourself) that calculates the prevalence of arsenic contamination. The function must express the result as a percentage. [2]
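As an illustration of the prevalence calculation, here is a minimal sketch using the motivating example ID 19012500 from the note above. The function name `arsenic_prevalence` and its argument names are hypothetical; any name may be used.

```r
# Prevalence (%) = samples exceeding the limit / total samples * 100.
# `arsenic_prevalence` is a hypothetical name chosen for this sketch.
arsenic_prevalence <- function(n_exceeding, n_total) {
  stopifnot(n_total > 0, n_exceeding >= 0, n_exceeding <= n_total)
  (n_exceeding / n_total) * 100
}

# Example ID 19012500: 19012500 samples in total, 2500 exceeding the limit.
prev <- arsenic_prevalence(n_exceeding = 2500, n_total = 19012500)
round(prev, 4)
```

With your own ID, replace the two arguments with the full ID and its last four digits (or three or five digits if the last four start with 0).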

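Fifteen random garden soil samples were studied for arsenic concentrations. Each soil arsenic value (mg/kg) is obtained by multiplying the factor in the table below by the last three digits of your UCL ID number.

| Garden ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Factor | 0.07 | 0.41 | 0.73 | 0.28 | 0.25 | 0.34 | 0.39 | 0.26 | 0.16 | 0.33 | 0.30 | 0.66 | 0.56 | 0.17 | 0.48 |
| Soil arsenic (mg/kg) | | | | | | | | | | | | | | | |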

c) Calculate the summary statistics for the soil arsenic (mg/kg) and interpret these descriptive measures. [3]

d) What are the best approaches for visualising the above data? Justify your answer and write R code to generate the appropriate plots. [4]
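The steps for the 15 garden samples could be sketched as follows. The multiplier 500 is an assumption taken from the last three digits of the example ID 19012500. A histogram and a boxplot are shown as two common choices for a single continuous variable, not as the only acceptable answer.

```r
# Factors for gardens 1 to 15, copied from the table above.
factor_values <- c(0.07, 0.41, 0.73, 0.28, 0.25, 0.34, 0.39, 0.26,
                   0.16, 0.33, 0.30, 0.66, 0.56, 0.17, 0.48)

# Assumption: last three digits of the example ID 19012500.
multiplier <- 500
soil_arsenic <- factor_values * multiplier

# Descriptive measures: centre, spread and range.
summary(soil_arsenic)
sd(soil_arsenic)
IQR(soil_arsenic)

# Two candidate plots for one continuous variable.
hist(soil_arsenic,
     main = "Soil arsenic in 15 garden samples",
     xlab = "Soil arsenic (mg/kg)")
boxplot(soil_arsenic, horizontal = TRUE,
        xlab = "Soil arsenic (mg/kg)")
```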

10 marks

## Question 2

An England-wide campaign was launched to target residential gardens and bring arsenic contamination levels below the acceptable limit of 32 mg/kg. A geographical study design was used to compare the environmental soil arsenic contamination levels (mg/kg) from 30, 46 and 32 gardens selected from the West Midlands, East Midlands, and East of England, respectively.

a) Use “Question_2.csv” to create a variable to answer questions 2b, 2c, 2d and 2e.

Use your full UCL ID number in the set.seed() function to begin creating a personalised column representing the arsenic level from the ‘sample’ variable in the Question_2.csv dataset. Generate random values from a uniform distribution using the function runif() with the parameters n = 111, min = 1 and max = 5, then subtract the generated values from the variable called “Sample” to create personalised values for the arsenic level. [1]

b) State the appropriate hypothesis for comparing the distributions across the three regions. [1]

c) What is the best methodology for testing this hypothesis? State the correct statistical test and justify your choice. [4]

d) Write down the correct R code to compute the statistical test and p-value. [2]

e) Are there any differences across the three regions, and what conclusion can you draw from this analysis? [2]
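The personalisation step and a three-group comparison could be sketched as below. Everything about the data is an assumption: the data frame is a made-up stand-in for Question_2.csv, the column name `Region` is hypothetical, and the stand-in has 108 rows (30 + 46 + 32), so `n` is taken from the row count rather than hard-coded to 111. The Kruskal-Wallis test appears as one candidate for comparing three independent groups when normality is doubtful; a one-way ANOVA via `aov()` would be the parametric alternative.

```r
# Made-up stand-in for Question_2.csv (real use: q2 <- read.csv("Question_2.csv")).
set.seed(1)  # seed for the stand-in values only
q2 <- data.frame(
  Region = factor(rep(c("West Midlands", "East Midlands", "East of England"),
                      times = c(30, 46, 32))),
  Sample = rlnorm(108, meanlog = 3.4, sdlog = 0.4)
)

# Personalisation: seed with the ID (example ID shown), draw uniform values,
# then subtract them from the Sample column.
set.seed(19012500)
offset <- runif(n = nrow(q2), min = 1, max = 5)
q2$Arsenic <- q2$Sample - offset

# Shapiro-Wilk p-value per region, to inform the choice of test.
normality_p <- tapply(q2$Arsenic, q2$Region,
                      function(x) shapiro.test(x)$p.value)
normality_p

# Candidate test: Kruskal-Wallis rank sum test across the three regions.
kw <- kruskal.test(Arsenic ~ Region, data = q2)
kw$statistic
kw$p.value
```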

10 marks

## Question Three

One hundred patients were admitted to Charing Cross Hospital. Upon admission their condition was critical, as they turned out to be symptomatic cases of COVID-19. The patients’ symptoms were treated on the spot and monitored round the clock on a 3-hourly basis until their condition became stable after a week. Blood samples were taken on a 3-hourly basis to monitor viral loads of infection; these were examined on admission and again a week later to see whether there was a reduction (an indicator that a patient is recovering well).

The lab readings for viral loads from the serologic analysis are stored in “Question_3.csv”. Multiplying the viral load readings by the last two digits of your UCL ID standardises the values.

At the patient level, you want to assess whether these patients are recovering well.

a) What is the hypothesis for determining whether patients are making a recovery? [2]

b) Write the code for personalising the dataset, briefly discuss some of the issues with the records in “Question_3.csv” and suggest what can be done to mitigate them. Apply the appropriate data cleaning to derive the desired format for answering 3c. [10]

c) Use the most appropriate method for testing the hypothesis in 3a and justify your selection. Write out the full R script for analysing the data and performing the statistical test. [5]

d) What conclusions can you draw about this cohort of patients? Provide a full interpretation. [3]
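A before-and-after workflow of this kind could be sketched as below. The data frame is a made-up stand-in, the column names `PatientID`, `Admission` and `Week1` are hypothetical, and the missing values and duplicate row are planted only to illustrate cleaning; the real issues in Question_3.csv may differ. The multiplier 5 assumes the last two digits of the example ID 18020105. The paired Wilcoxon signed-rank test is shown as one candidate; `t.test(..., paired = TRUE)` is the parametric alternative if the differences look normal.

```r
# Made-up stand-in for Question_3.csv, with planted record problems.
set.seed(1)
q3 <- data.frame(
  PatientID = 1:100,
  Admission = runif(100, min = 2, max = 6),
  Week1     = runif(100, min = 1, max = 5)
)
q3$Week1[c(5, 40)] <- NA    # illustrative missing readings
q3 <- rbind(q3, q3[1, ])    # illustrative duplicated record

# Personalisation: multiply readings by the last two ID digits (assumed 05).
multiplier <- 5
q3$Admission <- q3$Admission * multiplier
q3$Week1     <- q3$Week1 * multiplier

# Cleaning: drop incomplete pairs, then drop repeated patients.
q3_clean <- q3[complete.cases(q3[, c("Admission", "Week1")]), ]
q3_clean <- q3_clean[!duplicated(q3_clean$PatientID), ]

# Check normality of the paired differences before choosing a test.
load_change <- q3_clean$Admission - q3_clean$Week1
shapiro.test(load_change)

# Candidate one-sided paired test: is the admission load greater than week 1?
res <- wilcox.test(q3_clean$Admission, q3_clean$Week1,
                   paired = TRUE, alternative = "greater")
res$p.value
```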

20 marks

## Question Four

A study was launched to assess the mean Body Mass Index (BMI) of inhabitants of villages across Zambia, Zimbabwe, and Malawi, to determine the impact on BMI of environmental levels of aridity (i.e., dryness) in the villages and of food shortages experienced by the farmers who supply foodstuffs to those villages.

Use the dataset ‘Question_4.csv’ to examine the relationship between village-level BMI and Aridity Index. It contains the following independent variables: Farmers affected by food shortages (categorical with 0 = “Affected” and 1 = “Not Affected”) and Aridity Index (continuous, whereby a high value means higher dryness and vice versa). The dependent variable is village-level BMI, estimated as a mean.

To apply the personalization, use the following steps:

• Use your full UCL ID number in the set.seed() function to ensure your data is reproducible and personalised
• Create a personalised column using a normal distribution with n = 7,201, mean = 0 and standard deviation = 1.5 using the rnorm() function
• Replace the “EstimatedBMI” variable with the sum of the personalised normal column and the original “EstimatedBMI” variable
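The three steps above could be sketched as follows. The data frame is a made-up stand-in for Question_4.csv; only `EstimatedBMI` is a column name given in the question, while `FoodShortage` and `AridityIndex` are hypothetical. The seed uses the example ID 19012500.

```r
# Made-up stand-in for Question_4.csv (real use: q4 <- read.csv("Question_4.csv")).
set.seed(1)  # seed for the stand-in values only
q4 <- data.frame(
  FoodShortage = rbinom(7201, size = 1, prob = 0.5),
  AridityIndex = runif(7201, min = 0, max = 10),
  EstimatedBMI = rnorm(7201, mean = 22, sd = 2)
)
original_bmi <- q4$EstimatedBMI  # kept only to check the result

# Step 1: seed with the full ID (example ID shown).
set.seed(19012500)
# Step 2: personalised normal column.
personal_noise <- rnorm(n = 7201, mean = 0, sd = 1.5)
# Step 3: replace EstimatedBMI with original + personalised values.
q4$EstimatedBMI <- q4$EstimatedBMI + personal_noise

summary(q4$EstimatedBMI)
```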

a) Personalise the dataset based on the instructions given above. Write the code to fit a multivariable linear regression model in R using the estimated BMI at village level as the dependent variable and the presence of food shortage and aridity index as the independent variables.

Show the FULL results of the model output, include the 95% confidence intervals, and provide a screenshot of the output. [6]

b) Provide a FULL interpretation of the regression coefficients for the presence of food shortage and aridity index. Include the 95% confidence intervals and state whether each relationship is statistically significant. [10]

c) Construct the multivariable linear regression model. What is the predicted level of mean BMI in villages where the farms have no food shortages and the aridity index is 5.5? [4]

d) In your opinion, is this a good, poor or invalid model? Justify your answer. [5]

25 marks
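A multivariable regression workflow of this shape could be sketched as below. The data are simulated with made-up coefficients, and the column names `FoodShortage` and `AridityIndex` are hypothetical, so the numbers carry no meaning for the real dataset. The sketch shows the calls for the model fit, the 95% confidence intervals, the prediction at an aridity index of 5.5 for "Not Affected" farms, and the standard diagnostic plots.

```r
# Simulated stand-in data; coefficients below are invented for illustration.
set.seed(1)
n <- 7201
q4 <- data.frame(
  FoodShortage = rbinom(n, size = 1, prob = 0.5),
  AridityIndex = runif(n, min = 0, max = 10)
)
q4$EstimatedBMI <- 24 + 1.0 * q4$FoodShortage -
  0.4 * q4$AridityIndex + rnorm(n, mean = 0, sd = 1.5)

# Code 0/1 as a labelled factor, following the coding given in the question.
q4$FoodShortage <- factor(q4$FoodShortage, levels = c(0, 1),
                          labels = c("Affected", "Not Affected"))

model <- lm(EstimatedBMI ~ FoodShortage + AridityIndex, data = q4)
summary(model)                # coefficients, p-values, R-squared
confint(model, level = 0.95)  # 95% confidence intervals

# Prediction for farms not affected by shortages, aridity index 5.5.
new_village <- data.frame(
  FoodShortage = factor("Not Affected", levels = levels(q4$FoodShortage)),
  AridityIndex = 5.5
)
pred <- predict(model, newdata = new_village, interval = "confidence")
pred

# Residual diagnostics, useful when judging model validity.
par(mfrow = c(2, 2))
plot(model)
```

The prediction is the intercept plus the "Not Affected" coefficient plus 5.5 times the aridity coefficient, which is what `predict()` computes here.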

## Question Five

A subset of the villages from Zambia which were impacted by food shortage were selected to assess the direct impacts of environmental levels of aridity in a village on village-level estimated BMI.

Use the data “Question_5.csv”. To apply the personalization, use the following steps:

• Use full UCL number in the set.seed() function
• Create a personalised column from a normal distribution with n = 506, mean = 0 and standard deviation = 0.1053 using the rnorm() function, then update the personalised column by adding it to the existing “EstimatedBMI” column in “Question_5.csv”.

a) Create the personalised column and describe its overall relationship with the aridity index. Is there anything peculiar about these two variables? [5]

b) Use a univariable regression model to assess the relationship between village-level BMI and aridity index (Hint: consider whether you need a data transformation and justify your decision). Also, use the appropriate parameters to construct the regression model. [15]

c) Provide the appropriate interpretation of the regression parameter for aridity index. [5]

1. Use a non-linear regression model with the inclusion of a quadratic term and compare the model performance with the model in 5b.
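Comparing a straight-line fit with a quadratic fit could be sketched as below. The data are simulated with an invented curved relationship, and the column name `AridityIndex` is hypothetical. The log-transformed model is included only as one candidate transformation; whether any transformation is needed depends on the real data.

```r
# Simulated stand-in for Question_5.csv with an invented curved relationship.
set.seed(1)
n <- 506
q5 <- data.frame(AridityIndex = runif(n, min = 0.5, max = 10))
q5$EstimatedBMI <- 26 * exp(-0.03 * q5$AridityIndex) +
  rnorm(n, mean = 0, sd = 0.1053)

# Scatter plot to inspect the shape of the relationship.
plot(q5$AridityIndex, q5$EstimatedBMI,
     xlab = "Aridity index", ylab = "Estimated BMI")

# Univariable linear model, a log-response variant, and a quadratic model.
fit_linear <- lm(EstimatedBMI ~ AridityIndex, data = q5)
fit_log    <- lm(log(EstimatedBMI) ~ AridityIndex, data = q5)
fit_quad   <- lm(EstimatedBMI ~ AridityIndex + I(AridityIndex^2), data = q5)

# Compare the nested linear and quadratic models on the same response scale.
anova(fit_linear, fit_quad)
AIC(fit_linear, fit_quad)
c(linear    = summary(fit_linear)$adj.r.squared,
  quadratic = summary(fit_quad)$adj.r.squared)
```

The AIC of `fit_log` is left out of the comparison because its response is on a different scale, so its likelihood is not directly comparable with the other two models.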

## Question Six

Select the study design accordingly to answer this question. There are broadly 4 different study design types listed as Pilot, Ecological, Cross-sectional, and Longitudinal.

0 – 1 = Pilot

2 – 3 = Ecological study

4 – 6 = Cross-sectional study

7 – 9 = Longitudinal study

Instructions: Use your UCL student ID number to select two study designs to answer 6a. Using this ID number (18020155) as a motivating example – the fourth and sixth digits should fall in one of the defined ranges for the different study design types. For instance, the fourth digit in the above ID is ‘2’, select Ecological study. The sixth digit is ‘1’, therefore select Pilot study.
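The digit-to-design lookup can be checked with a short helper. This is a minimal sketch; the function name `design_for_digit` is hypothetical, and the ID is the motivating example 18020155.

```r
# Map a single ID digit (0 to 9) to a study design using the ranges above.
# `design_for_digit` is a hypothetical helper name.
design_for_digit <- function(digit) {
  stopifnot(length(digit) == 1, digit %in% 0:9)
  if (digit <= 1) {
    "Pilot"
  } else if (digit <= 3) {
    "Ecological"
  } else if (digit <= 6) {
    "Cross-sectional"
  } else {
    "Longitudinal"
  }
}

# Split the example ID into its eight digits.
id_digits <- as.integer(strsplit("18020155", "")[[1]])

design_for_digit(id_digits[4])  # fourth digit is 2: "Ecological"
design_for_digit(id_digits[6])  # sixth digit is 1: "Pilot"
```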

a) Use the fourth and sixth digits of your UCL ID number to select two study design types and discuss five differences between them (if both digits give the same study design, move to the next digit). Construct a table to contrast the selected study design types. [5]

b) Use the seventh digit of your UCL ID number to select a study design. Write a short proposal (250 words) outlining a quantitative study that explores the following topic:

“Impact of heatwave risk on vulnerable residents in an urban area in Europe” [10]

c) Discuss five problems that can arise from the type of study in question 6b. [10]

25 marks
