Data preparation (Zhang, 2003) is the process of cleaning and transforming raw data into a consistent form that is suited to data exploration. It is an important step because it involves rearranging data, improving its format, and merging data sets to improve data quality. The process includes filling in missing values, reformatting categorical data, and similar tasks.
This process can be summarized in three steps:
- Loading the data
- Filling missing values
- Reformatting values
The data set provided with the task is a CSV file that uses the ‘#’ symbol as its separator. The data is loaded into the Python notebook with the pandas library, whose read_csv function imports CSV data into a data frame.
Fig 1: Loading the dataset
The read_csv function is given the column names and the separator so that it can load the data into the data frame.
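The loading step can be sketched as follows. This is a minimal illustration only: the column names and the three rows are made up, and an in-memory buffer stands in for the task file, whose name and full column list are not given here. It also assumes the file has no header row.

```python
import io

import pandas as pd

# Made-up sample standing in for the task's '#'-separated file.
raw = io.StringIO(
    "volvo#gas#188.8#16845\n"
    "audi#gas#176.6#13950\n"
    "bmw#diesel##20970\n"
)

# Hypothetical column names; the real list comes from the task description.
columns = ["make", "fuel-type", "length", "price"]

# sep sets the '#' delimiter; names supplies the headers the file lacks.
df = pd.read_csv(raw, sep="#", names=columns)

print(df.shape)
print(df.head())
```

The empty field in the third row is read as a missing value (NaN), which is what the later filling step works on.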
A total of 65 missing values were found in the initially loaded data.
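A count like this can be obtained with isna and sum. The sketch below uses a small made-up data frame, so its total is not the figure reported for the task data.

```python
import numpy as np
import pandas as pd

# Made-up rows with a few gaps, for illustration only.
df = pd.DataFrame({
    "make": ["volvo", "audi", None],
    "length": [188.8, np.nan, 176.6],
    "price": [16845.0, np.nan, 13950.0],
})

# isna() marks each missing cell; the first sum counts them per column,
# the second sum adds the per-column counts into one total.
missing_per_column = df.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing:", total_missing)
```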
The missing data (Zhang, 2003) is handled column by column, filling the missing entries of each column with appropriate values.
To group similar columns, the names of the numerical columns and of the object-type columns were stored in lists by using the select_dtypes function from the pandas library.
Fig 2: Storing all names of all numerical columns
The list is then iterated through, and the fillna function is used to fill the missing values in each numerical column with the mean of that column.
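A minimal sketch of this mean-filling step, assuming a made-up data frame with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Made-up data with one gap in each numerical column.
df = pd.DataFrame({
    "make": ["volvo", "audi", "bmw"],
    "length": [180.0, np.nan, 170.0],
    "price": [16000.0, 14000.0, np.nan],
})

# Collect the names of all numerical columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Fill each numerical column's gaps with that column's own mean.
# mean() skips NaN by default, so the gaps do not distort the value.
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mean())

print(df)
```

Filling with the mean keeps the column average unchanged, although it does reduce the spread of the column slightly.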
To fill the missing data in the object columns, the names of those columns are stored using the select_dtypes function.
The list is then iterated through, and the unique values in each column are printed to check whether the column contains any missing data.
The columns contained some values that were the same but written in different formats. These were processed with the replace function, which replaces each variant with a single value.
Fig 3: Replacing the data with the correct value
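The inspection and replacement can be sketched as below. The variant spellings and the mapping are hypothetical, chosen only to illustrate the idea. The non-numeric columns are selected with exclude="number" so that the sketch does not depend on how the installed pandas version labels text columns.

```python
import pandas as pd

# Made-up data in which one make is spelled three different ways.
df = pd.DataFrame({
    "make": ["volkswagen", "vw", "vokswagen", "audi"],
    "fuel-type": ["gas", "gas", "diesel", "gas"],
})

# Names of the non-numeric (object-type) columns.
object_cols = df.select_dtypes(exclude="number").columns.tolist()

# Print the distinct values so that odd spellings and gaps stand out.
for col in object_cols:
    print(col, df[col].unique())

# Hypothetical mapping from each variant to the value that should be kept.
corrections = {"vw": "volkswagen", "vokswagen": "volkswagen"}
df["make"] = df["make"].replace(corrections)

print(df["make"].unique())
```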
The data is then reformatted using the strip() and lower() functions, which remove all leading and trailing spaces from the values and convert them to lowercase.
Fig 4: Reformatting the data
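A minimal sketch of this reformatting, assuming made-up values with stray spaces and mixed case:

```python
import pandas as pd

# Made-up values with inconsistent spacing and capitalisation.
df = pd.DataFrame({
    "make": ["  Volvo", "AUDI ", "bmw"],
    "price": [16845.0, 13950.0, 20970.0],
})

# Apply the clean-up to the text columns only.
text_cols = df.select_dtypes(exclude="number").columns.tolist()

for col in text_cols:
    # .str applies the string method to every value in the column.
    df[col] = df[col].str.strip().str.lower()

print(df["make"].tolist())
```

After this step, values such as "  Volvo" and "volvo" are treated as the same category.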
The final data is free of missing values and consists of consistent values that are ready for exploration.
Data exploration (Zuur, 2010) is the process of examining the dataset to find similarities and relationships between its columns. It gives a clear idea of the data present in the data set and helps us to draw conclusions about the hypothesis.
To get an initial idea, a correlation matrix (Kohonen, 1972) is calculated, and the seaborn library is used to plot it as a heatmap.
Fig 5: Correlation matrix
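The calculation behind such a heatmap can be sketched as follows. The data and column names are made up. The plotting helper plot_heatmap is a hypothetical name, and it assumes that seaborn and matplotlib are installed.

```python
import pandas as pd

# Made-up numerical data for illustration.
df = pd.DataFrame({
    "make": ["a", "b", "c", "d", "e"],
    "width": [64.0, 65.5, 66.5, 68.0, 70.0],
    "city-mpg": [38, 30, 26, 22, 17],
    "price": [7000.0, 9500.0, 13000.0, 18000.0, 30000.0],
})

# Pearson correlation between every pair of numerical columns.
corr = df.select_dtypes(include="number").corr()

# Correlation of each column with price, strongest positive first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)


def plot_heatmap(matrix):
    """Draw the matrix as an annotated heatmap (needs seaborn, matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()
```

Calling plot_heatmap(corr) in a notebook draws the figure. Values close to 1 or -1 mark the columns that move most strongly with the price.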
From the correlation matrix, it is evident that some of the columns have a greater influence over the pricing of the car. Using this data, we carry out the following:
Fig 6: Car make: bar graph
The Make column of the data set is used to show how often each carmaker appears in the dataset. The figure above shows that Volvo is the most common carmaker in the dataset.
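The frequencies behind a bar graph like this can be computed with value_counts. The sketch uses a made-up list of makes, so the counts are illustrative only.

```python
import pandas as pd

# Made-up sample of the make column.
makes = pd.Series(
    ["volvo", "volvo", "volvo", "audi", "audi", "bmw"], name="make"
)

# Number of rows per carmaker, sorted from most to least frequent.
counts = makes.value_counts()
most_common = counts.idxmax()

print(counts)
print("Most frequent make:", most_common)

# With matplotlib installed, counts.plot(kind="bar") draws the bar graph.
```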
Fig 7: Fuel-type: Pie chart
The Fuel-type column is used to show the share of gas and diesel cars in the data set as a pie chart, from which it is evident that gas is the most common fuel type, at 88.24%.
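The percentages shown on such a pie chart can be computed directly. The counts below are made up, so the resulting share differs from the figure reported for the task data.

```python
import pandas as pd

# Made-up sample: seven gas cars and one diesel car.
fuel = pd.Series(["gas"] * 7 + ["diesel"], name="fuel-type")

# normalize=True returns fractions, which are scaled to percentages.
share = fuel.value_counts(normalize=True) * 100

print(share.round(2))

# With matplotlib installed, share.plot(kind="pie", autopct="%.2f%%")
# draws the pie chart with the percentages printed on the slices.
```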
Fig 8: Length of the car: Box plot
The length column is used to draw a box plot, which shows the median length and the outliers. The plot indicates that most cars have a length between 170 and 190.
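The quantities a box plot displays can also be computed by hand. The sketch below uses made-up lengths and the common 1.5 × IQR rule for outliers, which is the default whisker setting in matplotlib. Whether the original figure used that default is an assumption.

```python
import pandas as pd

# Made-up car lengths, including one unusually long car.
length = pd.Series(
    [165.0, 170.0, 172.0, 175.0, 178.0, 180.0, 185.0, 190.0, 230.0]
)

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = length.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = length[(length < lower_fence) | (length > upper_fence)]

print("Median:", median)
print("Box:", q1, "to", q3)
print("Outliers:", outliers.tolist())
```

Note that the central line of a box plot is the median rather than the mean, so the two can differ when outliers are present.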
Fig 9: Width vs price: scatter plot
A scatter plot is used to show the relation between the width of the car and its price.
This graph is plotted to check how the price depends on the width of the car. It gives the impression that the two have a roughly linear relation: as the width increases, the price of the car generally increases.
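One way to put a number on this visual impression is the Pearson correlation together with a straight-line fit. The sketch below uses made-up widths and prices, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up widths and prices for illustration.
df = pd.DataFrame({
    "width": [64.0, 65.0, 66.0, 67.0, 68.0],
    "price": [7000.0, 9000.0, 12000.0, 14000.0, 18000.0],
})

# Pearson correlation: close to +1 means a strong increasing linear trend.
r = df["width"].corr(df["price"])

# Least-squares straight line: price is approximately slope * width + intercept.
slope, intercept = np.polyfit(df["width"], df["price"], deg=1)

print("Correlation:", round(r, 3))
print("Price change per unit of width:", round(slope, 1))

# With matplotlib installed, df.plot(kind="scatter", x="width", y="price")
# draws the scatter plot itself.
```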
- Stroke vs Peak RPM graph
Fig 10: Stroke vs peak RPM scatter plot
This plot shows the relation between the stroke and the peak RPM of the car. It suggests that most of the cars have a stroke between 2.7 and 3 and a peak RPM between 4700 and 5500.
Fig 11: Drive wheel vs price bar graph
The bar graph of drive wheel against price indicates that the dataset contains about twice as many rear-wheel-drive cars as front-wheel-drive cars.
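A comparison like this can be checked with a grouped summary that reports both the number of cars and the mean price per drive type. The column names and rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows; "rwd" and "fwd" stand for rear- and front-wheel drive.
df = pd.DataFrame({
    "drive-wheels": ["rwd", "rwd", "fwd", "rwd", "fwd", "rwd"],
    "price": [20000.0, 18000.0, 8000.0, 22000.0, 10000.0, 16000.0],
})

# Count of cars and mean price for each drive type.
summary = df.groupby("drive-wheels")["price"].agg(["count", "mean"])

print(summary)

# With matplotlib installed, summary["mean"].plot(kind="bar") draws
# the mean price per drive type as a bar graph.
```

Looking at the count and the mean side by side makes it clear whether a bar graph is comparing how many cars there are or how much they cost.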
Fig 12: Scatter matrix of all the numerical columns
From the above scatter matrix, it is evident that most of the columns, such as wheelbase, length, height, and engine size, have a linear relation with the price of the car.
The city mpg and the highway mpg have an exponential relation with the price of the car.
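One simple check for an exponential relation is to compare the correlation of mpg with the price against its correlation with the logarithm of the price. If the relation is exponential, the logarithm makes it linear. The data below is synthetic and generated from an exponential curve on purpose, so it only demonstrates the check and says nothing about the task data.

```python
import numpy as np
import pandas as pd

# Synthetic data built from an exact exponential decay (illustration only).
mpg = np.array([15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
df = pd.DataFrame({
    "city-mpg": mpg,
    "price": 40000.0 * np.exp(-0.05 * mpg),
})

# Correlation with the raw price and with the log of the price.
r_linear = df["city-mpg"].corr(df["price"])
r_log = df["city-mpg"].corr(np.log(df["price"]))

print("Correlation with price:", round(r_linear, 4))
print("Correlation with log(price):", round(r_log, 4))

# With matplotlib installed, pd.plotting.scatter_matrix(df) draws the
# scatter matrix of all numerical columns.
```

A markedly stronger correlation after taking the logarithm supports an exponential rather than a linear relation.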
References
Stieglitz, Stefan, et al. “Social media analytics–Challenges in topic discovery, data collection, and data preparation.” International Journal of Information Management 39 (2018): 156-168.
Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. “A protocol for data exploration to avoid common statistical problems.” Methods in Ecology and Evolution 1.1 (2010): 3-14.
Kohonen, Teuvo. “Correlation matrix memories.” IEEE Transactions on Computers 100.4 (1972): 353-359.
Dziuban, Charles D., and Edwin C. Shirkey. “When is a correlation matrix appropriate for factor analysis? Some decision rules.” Psychological Bulletin 81.6 (1974): 358.
Fixler, Dennis, and Kimberly D. Zieschang. “Measuring the nominal value of financial services in the national income accounts.” Economic Inquiry 29.1 (1991): 53-68.