DATA8001: Data Science & Analytics Report, MTU, Ireland
University: Munster Technological University (MTU)
Subject: DATA8001: Data Science & Analytics
Question
Maeve, a manager at a financial institution, has contacted you. She is asking for your assistance in assessing the creditworthiness of potential future customers. She has a data set of 807 past loan customer cases, with 14 attributes for each case. The attributes include financial standing, the reason for the loan, employment, demographic information, foreign-national status, years of residence in the district, and the outcome/label variable Credit Standing, which classifies each case as either a good loan or a bad loan.
a) Exploratory Data Analysis (EDA): Carry out some EDA on the data set, including at least one trivariate analysis. Do you notice anything unusual or any patterns in the data set? Detail these and outline any actions you propose to take before you start model building in part b).
b) Split the data set into a 75% training set and a 25% test set using set.seed(ABC), where ABC is the last 3 digits of your student number.
c) Using the code for entropy and information gain given in the labs, and using only the categorical predictor variables, show which predictor variable should be used for the root node split. Use only the training set from b) to do this; you are not constrained to binary splits.
d) Now redo part c), but constrained to binary splits only, i.e. splits with only 2 possible outcomes. Design your splitting method before coding; explain your method, then implement your code, analyze your results, and comment.
e) Now include the continuous numeric predictor variables, again using only binary splits. Which is now the root node split? Analyze your results and comment.
f) Now investigate the next level of the tree, i.e. which predictor variable(s) should be used to split the branches produced by the first split found in part e). Again, only binary splits are allowed. Detail in words the approach you are going to use.
g) Now write the R code for part f). Analyze your results and comment. (10 marks)
h) Use the tree function from the package tree, or an equivalent, to build a decision tree; compare the results to those in g) and comment. If you use pruning here, you should explain all the methodologies you use.
i) Now see if you can improve your results by using a random forest model. Give your results, explain, and comment.
j) Due to GDPR, you are no longer allowed to use the following variables to build your model: Age, Personal.Status and Foreign.National. Now redo your work for parts h) and i). Give your results and comment.
k) Maeve's company uses a process that mixes a grading system with human input to grade each past loan as good or bad. Maeve suspects that during a particular period this process performed very poorly and produced inaccurate results. The ID numbers can be taken as timestamp values. Develop a strategy to find a series of consecutive ID numbers where the gradings show a higher-than-normal rate of suspiciously incorrect results. Maeve can then pinpoint the time using the ID numbers. Detail how you go about your investigation and how you find this pattern.
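The mechanics behind parts b) to d) can be sketched in a few lines of R. This is only an illustrative sketch, not the lab code the brief refers to: the helper names entropy and info_gain, the toy data frame, its column names, and the seed value 123 are all assumptions made for the example. The "one level versus the rest" scheme is just one possible way to form binary splits on a categorical predictor.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                      # drop empty classes so 0 * log2(0) never occurs
  -sum(p * log2(p))
}

# Information gain from partitioning `labels` by `groups`.
# One child node per distinct group value, so the split may be multiway.
# Assumes there are no missing values in `groups`.
info_gain <- function(labels, groups) {
  parts <- split(labels, groups, drop = TRUE)
  weights <- sapply(parts, length) / length(labels)
  entropy(labels) - sum(weights * sapply(parts, entropy))
}

# Toy data invented purely for illustration (not taken from Maeve's data set).
toy <- data.frame(
  Employment      = c("Short", "Short", "Long", "Long", "Medium", "Medium", "Long", "Short"),
  Credit.Standing = c("Bad",   "Bad",   "Good", "Good", "Good",   "Bad",    "Good", "Bad"),
  stringsAsFactors = TRUE
)

# Part b) style split: fix the seed, then sample 75% of the row indices.
set.seed(123)                        # 123 is a placeholder for the student-number digits
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Part c) style: multiway gain for a single categorical predictor.
# The full toy data is used here only to keep the arithmetic easy to follow;
# the brief asks for the training set.
multiway_gain <- info_gain(toy$Credit.Standing, toy$Employment)

# Part d) style: binary gain for each "this level versus all other levels" split.
binary_gains <- sapply(levels(toy$Employment), function(lv) {
  info_gain(toy$Credit.Standing, toy$Employment == lv)
})
```

On this toy data the labels are evenly balanced, so the parent entropy is 1 bit and the multiway gain for Employment is 0.75. Among the binary candidates, "Medium versus the rest" yields a gain of 0 because both child nodes stay evenly balanced. For a predictor with many levels, grouping several levels on each side of the split may find better binary splits than one-versus-rest, at the cost of more candidates to evaluate.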