There are several datasets available for this assignment. These have been installed as blob storage on Microsoft Azure. Further data is available on data.gov.uk should you wish more detail.
The data sets are:
- all_crimes21_hdr.txt.gz (2.32 GiB compressed, 43×10^6 records)
- LSOA_pop_v2.csv (2.4 MB uncompressed)
- postcodes.gz (0.6 GB compressed)
- posttrans.csv (23.5 kB uncompressed)
All the files are in CSV format, and some are compressed with gzip. Spark reads gzip-compressed files natively, so you can load the compressed files exactly as you would plain CSV files.
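As a minimal sketch of what this means in practice, the snippet below writes a tiny gzip-compressed CSV and reads it back using only Python's standard library. The file name `sample_crimes.csv.gz`, the helper `read_gzip_csv` and the two data rows are made up for illustration; only the three column names come from the crimes header. In PySpark the equivalent would typically be `spark.read.csv(path, header=True)`, which decompresses `.gz` input without further options.

```python
import csv
import gzip
import os
import tempfile


def read_gzip_csv(path):
    """Return the rows of a gzip-compressed CSV file as dicts keyed by header."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical two-row sample; the values are invented for this example.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_crimes.csv.gz")
with gzip.open(sample_path, mode="wt", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Month", "LSOA code", "Crime type"])
    writer.writerow(["2017-01", "E01000001", "Burglary"])
    writer.writerow(["2017-02", "E01000002", "Robbery"])

rows = read_gzip_csv(sample_path)
print(rows[0]["Crime type"])
```

One point worth keeping in mind: a gzip file cannot be split, so Spark reads each compressed file in a single task. Repartitioning the data after loading may help with the larger files.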
Location measurement
In these datasets location is specified in several ways:
- The crimes data uses ‘anonymized’ longitude and latitude and an LSOA code.
- The LSOA dataset uses the LSOA code.
- The postcodes data uses area-centred longitude and latitude, postcode, LSOA code, map grid reference and many other measures.
When considering which to use, you should bear in mind the level of detail needed:
- An LSOA covers a mean of 1500 people, so it gives less detail than the other measures but is easier to handle.
- Longitude and latitude are accurate to 2-3 m. You will need to convert these into postcodes.
- A full postcode (e.g. NE1 8ST, NE2 1XE) corresponds to approximately 6 households. You can generate a larger area by summing the data over the first part of the postcode (NE1, NE2).
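Summing over the first part of the postcode can be sketched as follows. The helper names `outward_code` and `households_by_outward` are hypothetical, and the household counts in `sample` are invented for the example. The sketch relies on the fact that the second part of a UK postcode is always three characters long, which lets it handle postcodes written without a space.

```python
from collections import defaultdict


def outward_code(postcode):
    """Return the first part of a UK postcode, e.g. 'NE1' for 'NE1 8ST'."""
    cleaned = postcode.strip().upper()
    if " " in cleaned:
        return cleaned.split()[0]
    # No space present: the second part is always three characters, so drop it.
    return cleaned[:-3]


def households_by_outward(records):
    """Sum household counts over the first part of each postcode."""
    totals = defaultdict(int)
    for postcode, households in records:
        totals[outward_code(postcode)] += households
    return dict(totals)


# Invented (postcode, households) pairs, purely for illustration.
sample = [("NE1 8ST", 6), ("NE1 7RU", 5), ("NE2 1XE", 7), ("ne21xe", 1)]
totals = households_by_outward(sample)
print(totals)
```

The same grouping could be expressed in Spark by deriving a column holding the first part of the postcode and aggregating over it.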
The Crimes Data
all_crimes18_hdr.txt.gz contains about 43 million reported and logged crimes from 2010-2017. The source site offers the data by month and by force, so the downloaded files have been merged into one file for this assignment.
You can find out more about the data here. Only ‘street’ files have been included. Outcomes are included. The data are licensed under the Open Government Licence v.3.0.
The header row of the crimes data is:
‘Crime ID’, ‘Month’, ‘Reported by’, ‘Falls within’, ‘Longitude’, ‘Latitude’, ‘Location’, ‘LSOA code’, ‘LSOA name’, ‘Crime type’, ‘Last outcome category’
Note that Longitude and Latitude are anonymized as described on the police web site here. Since the police use around 750,000 ‘anonymous’ map points, it is unlikely that these coincide with the longitudes and latitudes given in the postcode dataset. For this reason, you may prefer to use the LSOA (Lower Layer Super Output Area, UK Office for National Statistics) as the region indicator.
The file posttrans.csv will allow the translation of crimes’ longitude and latitude into actual postcodes.
Location Data.
The headers of the LSOA_pop_v2.csv file are:
“date”, “geography”, “geography code”, “Rural Urban”, “Variable: All usual residents; measures: Value”, “Variable: Males; measures: Value”, “Variable: Females; measures: Value”, “Variable: Lives in a household; measures: Value”, “Variable: Lives in a communal establishment; measures: Value”, “Variable: Schoolchild or full-time student aged 4 and over at their non term-time address; measures: Value”, “Variable: Area (Hectares); measures: Value”, “Variable: Density (number of persons per hectare); measures: Value”
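To illustrate how the LSOA population figures might be combined with the crimes data, here is a small sketch that computes crimes per 1,000 residents for each LSOA. It assumes that the “geography code” column holds the same LSOA codes as the ‘LSOA code’ column of the crimes data, which should be checked against the real files. The function name `crimes_per_1000` and all sample values are invented for the example.

```python
from collections import Counter

# Column names taken from the two header rows.
POPULATION_COLUMN = "Variable: All usual residents; measures: Value"


def crimes_per_1000(crime_rows, population_rows):
    """Return {LSOA code: crimes per 1,000 usual residents}."""
    # Count crimes per LSOA, skipping rows with no LSOA code.
    counts = Counter(row["LSOA code"] for row in crime_rows if row["LSOA code"])
    rates = {}
    for row in population_rows:
        code = row["geography code"]  # assumed to be the LSOA code
        residents = int(row[POPULATION_COLUMN])
        if residents > 0:
            rates[code] = 1000 * counts.get(code, 0) / residents
    return rates


# Invented sample rows, purely for illustration.
crimes = [
    {"LSOA code": "E01000001"},
    {"LSOA code": "E01000001"},
    {"LSOA code": "E01000001"},
    {"LSOA code": "E01000002"},
    {"LSOA code": ""},
]
population = [
    {"geography code": "E01000001", POPULATION_COLUMN: "1500"},
    {"geography code": "E01000002", POPULATION_COLUMN: "2000"},
    {"geography code": "E01000003", POPULATION_COLUMN: "1000"},
]
rates = crimes_per_1000(crimes, population)
print(rates)
```

In Spark the same idea would be a group-by count on the crimes data followed by a join with the population data on the LSOA code.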
Postcodes Data.
The headers of the postcodes.gz file are:
‘Postcode’, ‘InUse?’, ‘Latitude’, ‘Longitude’, ‘Easting’, ‘Northing’, ‘GridRef’, ‘County’, ‘District’, ‘Ward’, ‘DistrictCode’, ‘WardCode’, ‘Country’, ‘CountyCode’, ‘Constituency’, ‘Introduced’, ‘Terminated’, ‘Parish’, ‘NationalPark’, ‘Population’, ‘Households’, ‘BuiltUpArea’, ‘Builtupsubdivision’, ‘Lowerlayersuperoutputarea’, ‘Rural/urban’, ‘Region’, ‘Altitude’
Copyright and Licenses
You may re-use this information (not including logos or Northern Ireland data) free of charge in any format or medium, under the terms of the relevant data owners’ licence. In addition, the following attribution statements must be acknowledged or displayed whenever the owners’ data is used:
Contains Ordnance Survey data © Crown copyright and database right 2021
Contains Royal Mail data © Royal Mail copyright and database right 2021
Source: Office for National Statistics licensed under the Open Government Licence v.3.0
Posttrans Data.
This file allows you to take a longitude and latitude pair from the crimes data and identify the corresponding postcode. The data was calculated by dead reckoning.
The headers of the posttrans.csv file are:
Postcode,Lon,Lat
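A minimal sketch of the translation step is shown below. It assumes that the Lon and Lat values in posttrans.csv are written exactly as the Longitude and Latitude values appear in the crimes data, so that a plain dictionary lookup on the pair of strings works. If the two files format the numbers differently, the keys would need to be normalised first, for example by rounding parsed floats. The function names and the sample coordinates are invented for the example.

```python
def build_lookup(posttrans_rows):
    """Map each (Lon, Lat) pair of strings to its postcode."""
    return {
        (row["Lon"].strip(), row["Lat"].strip()): row["Postcode"]
        for row in posttrans_rows
    }


def postcode_for(crime_row, lookup):
    """Return the postcode for a crime row, or None if its point is unknown."""
    key = (crime_row["Longitude"].strip(), crime_row["Latitude"].strip())
    return lookup.get(key)


# Invented rows, purely for illustration.
posttrans = [
    {"Postcode": "NE1 8ST", "Lon": "-1.6100", "Lat": "54.9700"},
    {"Postcode": "NE2 1XE", "Lon": "-1.6000", "Lat": "54.9800"},
]
lookup = build_lookup(posttrans)

known = {"Longitude": "-1.6100", "Latitude": "54.9700"}
unknown = {"Longitude": "0.0000", "Latitude": "51.0000"}
print(postcode_for(known, lookup), postcode_for(unknown, lookup))
```

Because posttrans.csv is small, in Spark this lookup could also be done as a join on the two coordinate columns.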