Assumption
Use Strategy B (Latest Date-renew) if:
You’re importing current data for active system use (e.g., active
memberships, current customers).
You assume the most recently renewed record is the most up-to-
date and accurate.
Assumptions:
First assumption: the most recently renewed record is the most
up-to-date and accurate, so we give priority to date renew, not date
joined.
Second assumption: in the data, some records with the same name also
have the same address and phone number, so we assume they all belong to
the same person and pick one entry on the basis of date renew.
Third assumption: we keep multiple records where we assume they are
different people living at the same address.
Fourth assumption: we delete provisional entries and give priority to
the other categories, assuming a provisional entry is contractual.
Fifth assumption: where several entries have the same date renew, we
prioritise full time first, then associate, and then member.
1. The same address was assigned to 4 members with the name Achim, so I removed
3 entries and kept only 1 on the basis of date renew (400-31).
2. 5 records had the same address and the same first name, Adam, so I picked
400-34 on the basis of date renew and also on the basis of class, because it is
full time.
3. 3 records had the same name and address, so on the basis of date renew I
chose 400-111.
4. On the basis of date renew, chose 400-145.
5. On the basis of date renew, chose 400-132.
6. On the basis of date renew, chose 400-182.
7. On the basis of full time, chose 400-165.
8. Renewal date is the same, so on the basis of full time chose 400-173.
9. On the basis of date renew, chose 400-159.
10. On the basis of full time, chose 400-80.
11. Renewal date is the same, so on the basis of full time chose 400-101.
12. On the basis of date renew, chose 400-146.
13. On the basis of date renew the latest entry is 400-177, but its title is
missing. We took the title from the other entry with the same name and address,
which is Herr, and will update the record. The other entry's address value is
missing, so we are not taking it.
14. On the basis of full time, chose 400-192.
15. On the basis of date renew, chose 400-92.
16. On the basis of date renew, chose 400-152.
17. On the basis of date renew, chose 400-102.
18. On the basis of date renew, chose 400-108; the phone number is missing, but
there is an address for contact.
19. Chose 400-164 because of date renew and because it is associate, not
provisional. The provisional entry, 400-74, had the latest renew date, but it is
temporary, so we deleted it.
We also keep 400-170, on the assumption that they both live at the same
address.
20. The address value is missing for 400-189 and the date renew is also the
latest, so we chose 400-180.
21. We deleted 400-181 because it is provisional and chose 400-183. In 400-183
the address was human unreadable, so we corrected it to Neuenborg.
22. On the basis of date renew, chose 400-163 and deleted 400-117 because it is
provisional.
23. On the basis of date renew, chose 400-171 and deleted 400-32 as it is
provisional.
24. On the basis of date renew, chose 400-97 and deleted 400-91 as it is
provisional.
25. Date renew is the same, but 400-150 is full time and the other one, 400-186,
is provisional and also has its address value missing.
26. On the basis of date renew, chose 400-103.
27. Deleted 400-190 as it is provisional, so chose 400-33.
28. On the basis of date renew, chose 400-81.
29. On the basis of date renew, chose 400-28; the phone number is missing, but
there is an address for contact.
30. On the basis of date renew, chose 400-24; 400-27 and 400-6 will be deleted
as they are provisional.
31. 400-131 and 400-26 have the same name and contact details, so on the basis
of date renew we picked 400-131 and updated Gender on the basis of 400-26.
32. On the basis of date renew, chose 400-151 and deleted 400-96 as it is a
provisional entry.
33. On the basis of date renew, chose 400-21 and deleted the other entries, as
they have the same name and contact details.
34. On the basis of date, chose 400-7 and deleted the other entries.
35. On the basis of date, chose 400-106 and deleted all other entries. In
400-106, add 2 was human unreadable, so we corrected it to St. Leon-Rot.
36. On the basis of date renew, chose 400-38.
37. On the basis of date renew, chose 400-129.
38. On the basis of date renew, chose 400-193, deleted the other entry, and
assumed gender male.
39. As entry 400-1 is provisional, we deleted it and chose the other entry, and
on the basis of the other similar entry we assumed gender female.
40. On the basis of date renew, chose 400-142. The country detail is missing,
but on the basis of the other entry we assume the country is the same, as they
share the same address, so we updated it to US-ENGLISH.
41. On the basis of date renew, chose 400-85 and deleted 400-2 as it is a
provisional entry.
Analysis of Assumptions and Data Cleaning Actions
Overview of Assumptions
The assumptions guiding the data cleaning process are:
1. Primary Assumption (Strategy B): The most recently renewed
record (based on "Date-renew") is the most up-to-date and
accurate, so it takes priority over "Date-joined" for active system
use (e.g., active memberships).
2. Duplicate Identification: Records with the same name, address,
and phone number are assumed to belong to the same person, and
the entry with the latest "Date-renew" is chosen.
3. Third Assumption: Multiple records with the same address are
retained, assuming they represent different individuals living at the
same address.
4. Fourth Assumption: Provisional entries are deleted, prioritizing
other categories (e.g., full-time, associate, member), as provisional
entries are assumed to be contractual or temporary.
5. Fifth Assumption: When "Date-renew" is the same for multiple
entries, prioritize based on membership class: full-time first, then
associate, then member.
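As a minimal sketch of how assumptions 1, 4, and 5 could combine into a single selection rule: the `Record` fields, the class labels, the sample values, and the `pick_record` helper are all hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical priority order for the tiebreaker (lower value wins).
CLASS_PRIORITY = {"full time": 0, "associate": 1, "member": 2}


@dataclass
class Record:
    record_id: str
    name: str
    address: str
    member_class: str
    date_renew: date


def pick_record(duplicates):
    """Pick one record from a group assumed to describe the same person."""
    # Assumption 4: provisional entries are dropped before any comparison.
    candidates = [r for r in duplicates if r.member_class != "provisional"]
    if not candidates:
        return None  # only provisional entries: leave for manual review
    # Assumption 1: latest Date-renew wins; assumption 5: class breaks ties.
    return min(
        candidates,
        key=lambda r: (
            -r.date_renew.toordinal(),
            CLASS_PRIORITY.get(r.member_class, len(CLASS_PRIORITY)),
        ),
    )


# Made-up sample group; IDs and dates are illustrative only.
group = [
    Record("A-1", "Sample Name", "1 Sample St", "provisional", date(2024, 5, 1)),
    Record("A-2", "Sample Name", "1 Sample St", "associate", date(2023, 3, 1)),
    Record("A-3", "Sample Name", "1 Sample St", "full time", date(2023, 3, 1)),
]
chosen = pick_record(group)
print(chosen.record_id)
```

In this sample the provisional entry is ignored even though its date is the latest, and the full-time entry wins the tie between the two remaining records.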
Critical Analysis of Assumptions
1. Primary Assumption (Strategy B: Prioritize Latest Date-
renew)
o Analysis: Prioritizing the latest "Date-renew" for active
system use is a reasonable approach, as it assumes the most
recent update reflects the most accurate information. This is
particularly relevant for fields like address or phone number,
which may change over time. However, this assumption
doesn’t account for potential errors in the most recent entry.
For example, if an employee’s latest "Date-renew" record
contains a typo (e.g., an incorrect address), this error will be
propagated, and earlier correct data might be discarded.
o Implication: While this method generally improves data
accuracy, it risks retaining incorrect data if the latest entry is
flawed. It also assumes "Date-renew" is consistently updated,
which may not always be the case.
o Suggestion: Validate the "Date-renew" field for consistency
(e.g., check for outliers or missing values) before using it as
the primary criterion. Flag records with significant
discrepancies between entries for manual review.
2. Duplicate Identification (Same Name, Address, and Phone
Number)
o Analysis: Assuming records with the same name, address,
and phone number belong to the same person is practical but
has limitations. Names may not be unique (e.g., multiple
"Adams" at the same address could be different people, such
as family members). Phone numbers can change or be shared,
and addresses may be entered inconsistently (e.g., "123 Main
St" vs. "123 Main Street"). Relying on "Date-renew" to pick the
correct entry assumes the latest record is always the most
accurate, which, as noted, may not hold true.
o Implication: This method may incorrectly merge distinct
individuals (e.g., family members living at the same address)
or miss duplicates if data entry errors exist in the name,
address, or phone number fields.
o Suggestion: Use fuzzy matching to account for minor
variations in names or addresses. If available, incorporate a
unique identifier like an employee ID to confirm duplicates.
Cross-reference with other fields (e.g., email, date-joined) to
reduce false positives.
3. Third Assumption: Retain Multiple Records with the Same
Address
o Analysis: Assuming multiple records with the same address
represent different individuals (e.g., family members or
roommates) is reasonable, especially if other fields like name
or phone number differ. However, this contradicts the second
assumption, which merges records with the same name and
address. For example, in record 19, you retained both 400-164
and 400-170, assuming they live at the same address, but in
record 1, you removed three entries for "Achim" at the same
address, assuming they are the same person. This
inconsistency in applying the assumptions could lead to
errors.
o Implication: Retaining multiple records at the same address
risks keeping duplicates if they are actually the same person,
while merging them risks losing data if they are distinct
individuals.
o Suggestion: Clarify the criteria for determining whether
same-address records are duplicates. For example, if names
differ, retain them as separate individuals; if names are the
same, use additional fields (e.g., phone number, membership
class) to confirm before merging.
4. Fourth Assumption: Delete Provisional Entries
o Analysis: Deleting provisional entries and prioritizing other
categories (e.g., full-time, associate) assumes provisional
entries are temporary or contractual, which aligns with
common HR practices. However, this risks losing valuable data
if the provisional entry contains unique or more recent
information (e.g., a correct phone number not present in the
full-time record). For example, in record 19, you deleted 400-74
(provisional) despite its later "Date-renew," which might have
contained updated information.
o Implication: This approach ensures consistency by
prioritizing permanent records but may discard useful data
from provisional entries, especially if they were recently
updated.
o Suggestion: Before deleting provisional entries, compare
their data with the retained record. If the provisional entry has
more recent or unique information (e.g., a new address),
merge that data into the permanent record rather than
deleting it outright.
5. Fifth Assumption: Prioritize by Membership Class When
Date-renew is the Same
o Analysis: When "Date-renew" is the same, prioritizing full-
time over associate, then member, assumes that higher
membership classes have more accurate or reliable data. This
is a practical tiebreaker, as full-time employees may have
more stable or verified records. However, this assumption
doesn’t account for cases where a lower-class record (e.g.,
member) might have more accurate data. For example, in
record 8 (400-173), you chose the full-time entry over another
with the same "Date-renew," but the other entry might have
had a more accurate address or phone number.
o Implication: This method may improve data quality for
higher-status employees but risks discarding valid data from
lower-status records.
o Suggestion: Use membership class as a secondary criterion
after "Date-renew." If conflicts persist, flag these records for
manual review to ensure no valid data is lost.
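The suggestions above about validating "Date-renew" and flagging conflicts for manual review could be sketched as follows. The dictionary keys, the reference date, and the `review_flags` helper are assumptions made for illustration only.

```python
from datetime import date


def review_flags(group, today):
    """Return reasons why a duplicate group should be reviewed by hand."""
    flags = []
    dates = [r["date_renew"] for r in group]
    if any(d is None for d in dates):
        flags.append("missing Date-renew")
    known = [d for d in dates if d is not None]
    if any(d > today for d in known):
        flags.append("Date-renew in the future")
    # A tie on the latest date means the class tiebreaker decides the
    # outcome, so the group is worth a second look.
    if known and known.count(max(known)) > 1:
        flags.append("tie on latest Date-renew")
    # Records that disagree on contact fields may hide a typo in the latest one.
    for field in ("address", "phone"):
        values = {r[field] for r in group if r[field]}
        if len(values) > 1:
            flags.append(f"conflicting {field}")
    return flags


# Made-up sample data.
group = [
    {"id": "B-1", "date_renew": date(2023, 6, 1),
     "address": "1 Sample St", "phone": "111"},
    {"id": "B-2", "date_renew": date(2023, 6, 1),
     "address": "1 Sample Street", "phone": ""},
]
print(review_flags(group, today=date(2024, 1, 1)))
```

A group that returns an empty list can follow the automatic rule; anything else is queued for a person to check before records are deleted.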
Analysis of Specific Actions (Records 1–41)
The 41 actions taken to clean the dataset involve removing duplicates,
deleting provisional entries, and resolving conflicts using "Date-renew"
and membership class. Below, I’ll highlight key patterns, potential issues,
and suggestions:
Consistent Use of Date-renew: Most records (e.g., 1, 3–6, 9, 12,
15–17, 26, 28–30, 32–37) use the latest "Date-renew" to select the
correct entry, which aligns with the primary assumption. This is a
strong approach for fields that change over time, but as noted, it
risks retaining errors in the latest entry.
Provisional Entry Deletion: Records 19, 21–24, 27, 30, 32, 39,
and 41 delete provisional entries, which aligns with the fourth
assumption. However, in record 19, the provisional entry (400-74)
had a later "Date-renew," suggesting it might have contained
updated information that was discarded.
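Membership Class as Tiebreaker: Records 7, 8, 10, 11, 14, and
25 select an entry by membership class (e.g., full-time over associate
or provisional), which aligns with the fifth assumption. Only records
8, 11, and 25 state that "Date-renew" was the same; records 7, 10, and
14 cite full-time status alone, so it is unclear whether the tie
condition held there. This is a practical tiebreaker but risks losing
data from lower-class records.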
Same Address Handling: Record 19 retains both 400-164 and
400-170, assuming they live at the same address, which aligns with
the third assumption. However, this contradicts actions like record 1,
where three entries for "Achim" at the same address were merged.
This inconsistency needs resolution.
Missing Data Handling: Records 13, 18, 20, 29, and 40 address
missing data (e.g., title, phone number, address, country). For
example, in record 13, you inferred the title "Herr" for 400-177
based on another entry, and in record 40, you assumed the country
as "US-ENGLISH" based on a similar entry. While this fills gaps, it
risks introducing errors if the assumptions are incorrect (e.g., the
country might differ despite the same address).
Gender Inference: Records 31, 38, and 39 infer gender based on
other entries or assumptions (e.g., 400-193 assumed male; in record
39, the entry kept after deleting 400-1 assumed female). This is
risky, as names and titles may not accurately reflect gender, and it
doesn’t account for non-binary or transgender individuals.
Human-Unreadable Data Correction: Records 21 (corrected
"Neuenborg") and 35 (corrected "St. Leon-Rot") address human-
unreadable address fields, which is a good practice to improve data
quality. However, the correction process isn’t detailed, so it’s unclear
how these corrections were validated.
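One way to make gap-filling such as the title in record 13 or the gender in record 31 traceable is to copy a value from a duplicate only when the retained record lacks it, and to log where each value came from. The sketch below assumes dictionary-shaped records; the field names, IDs, and the `fill_missing` helper are hypothetical.

```python
def fill_missing(retained, others, fields):
    """Fill empty fields of `retained` from `others`; return (merged, log)."""
    merged = dict(retained)
    log = []
    for field in fields:
        if merged.get(field):
            continue  # never overwrite a value the retained record already has
        for other in others:
            if other.get(field):
                merged[field] = other[field]
                log.append((field, other["id"]))  # provenance of the filled value
                break
    return merged, log


# Made-up sample data.
retained = {"id": "C-2", "title": "", "phone": "222"}
others = [{"id": "C-1", "title": "Herr", "phone": "111"}]
merged, log = fill_missing(retained, others, ["title", "phone"])
print(merged["title"], merged["phone"], log)
```

The log gives a reviewer a list of every inferred value and its source record, which is the documentation the corrections above currently lack.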
Potential Issues in the Actions
1. Inconsistent Handling of Same-Address Records: The third
assumption (retaining multiple records at the same address) is
applied inconsistently. For example, record 19 retains two records,
while record 1 merges four records for "Achim." This inconsistency
could lead to either duplicates or data loss.
2. Risk of Data Loss from Provisional Deletions: Deleting
provisional entries (e.g., 400-74 in record 19) without merging
useful data risks losing recent updates, especially if the provisional
entry has a later "Date-renew."
3. Gender and Country Inference Risks: Inferring gender (e.g.,
records 38, 39) and country (e.g., record 40) based on other entries
or assumptions introduces potential errors, particularly for diverse
populations.
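4. Missing Data Handling: Records with missing phone numbers
(e.g., 18, 29) are retained on the basis of a recent "Date-renew"
because an address is available for contact, and record 20 involves an
entry with a missing address. This is practical, but it leaves a single
contact channel for these individuals, with no fallback if that channel
is outdated.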
Recommendations for Improvement
1. Resolve Inconsistency in Same-Address Handling: Clarify the
criteria for merging or retaining same-address records. If names are
the same, merge based on "Date-renew" and other fields (e.g.,
phone number). If names differ, retain as separate individuals unless
other evidence suggests they are duplicates.
2. Merge Data from Provisional Entries: Before deleting provisional
entries, compare their data with the retained record and merge any
unique or more recent information (e.g., a new phone number or
address).
3. Avoid Gender and Country Inference: Unless critical, avoid
inferring gender or country. If needed, add self-reported fields in
future data collection. For existing records, leave ambiguous entries
as "unknown" to avoid bias.
4. Validate Data Corrections: Document how human-unreadable
data (e.g., "Neuenborg," "St. Leon-Rot") was corrected (e.g., using
external sources like maps or HR records) to ensure accuracy.
5. Flag Ambiguous Cases for Manual Review: Records with
missing data (e.g., 18, 20, 29) or same "Date-renew" (e.g., 8, 11)
should be flagged for manual review to ensure no valid data is lost.
6. Use Fuzzy Matching for Duplicates: Apply fuzzy matching to
account for variations in names or addresses, reducing the risk of
missing duplicates due to typos.
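As a rough illustration of the fuzzy matching recommended in item 6, Python's standard `difflib.SequenceMatcher` can score how similar two address strings are. The normalisation step and the 0.8 threshold are assumptions that would need tuning against the real data.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case, drop punctuation, and collapse whitespace."""
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())


def similar(a, b, threshold=0.8):
    """True when two strings are close enough to be treated as a likely match."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold


# The first pair is the address variation mentioned in the analysis above.
print(similar("123 Main St", "123 Main Street"))
print(similar("123 Main St", "9 Oak Avenue"))
```

A match found this way is best treated as a candidate duplicate for review, not as proof that two records describe the same person.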
Conclusion
The assumptions and actions taken provide a structured approach to
cleaning the dataset, with a focus on prioritizing recent "Date-renew"
entries, deleting provisional records, and using membership class as a
tiebreaker. However, inconsistencies in handling same-address records,
potential data loss from provisional deletions, and risks from
gender/country inference highlight areas for improvement. By refining the
assumptions, incorporating data merging for provisional entries, and
adding validation steps, the cleaning process can be enhanced to ensure a
more accurate and reliable dataset.