Question: Implementing a big data solution

22 May 2024, 6:08 AM

Implement a big data solution, i.e. you must use the tools taught in this course: PySpark DataFrames, RDDs or PySpark SQL, the AWS services presented in the course, or a combination thereof. Note that solutions based on pandas, Excel or similar non-parallel approaches will NOT be awarded any implementation marks.

1. Programmatically confirm that all papers have unique IDs and output the number of papers in the file.

2. What is the average number of authors per paper?

3. How many different journals were the papers published in?

4. Find the 5 authors with the highest number of publications. Give their names along with the number of publications they contributed to.

5. To gain some additional information about publication quality, you’d like to join the paper information with the journal information you have. Following this, find the top 5 authors with the highest cumulative impact factor (note that journals have different impact factors, listed in the IF column of the journal file). Output both the author information and the cumulative impact factor.

6. You’d like some additional information about publication trends. How many publications with impact factor > 1 were published in each of the years from 2010 to 2020? Ensure that your answer for each year is visible in your report.


You will be using two datasets in this assignment. The first is a dataset of academic publications, a subset of the S2ORC dataset ( The second dataset contains some journal information. This file, journal information.csv, is available in CSV format from Blackboard. The file has a header, and each non-empty row contains information about a single journal. The full filepath on Databricks for this file should be: /FileStore/tables/journal information.cs


You should write a 3,000-word structured report that presents and explains your solutions, justifies your choice of techniques, discusses the implications of your answers for your client, and explores any further investigations.
