We all have used Ms Excel. Or still do. Companies love Ms Excel or store information in databases.
A common challenge is: How to find duplicate row items?
Like names that occur multiple times in an administration or other elements you collected. Sometimes it makes sense or it is a MUST DO to find duplicates in a collection. With Ms Excel, Google Sheet or LibreOffice this simple thing turns out not to be simple at all!
Finding duplicates is and should be a solved problem. However many stubborn software engineers will not hesitate to solve it without using proven methods. Methods that are proven since these problems by theory and years of practice and failures.
Creating failure proof software and IT problems is complex. Even for a simple problem such as finding duplicates in a collection.
Simple ways to do this in Python are for example:
- Use the Python ‘set’ method.
- Make use of the fact that in Python dictionaries, keys must be unique.
- Use the pandas.DataFrame.duplicated API call. This call does a lot of heavy lifting for you!
- Use a built-in method that comes with your FOSS database.
Creating a function to report duplicates is simple. But it gets more complex if the following requirements apply:
- It must be fail-safe. Power of systems and even Cloud environments have interruptions. So use an algorithm that is resilient against hardware failures and software failures.
- Performance can be a thing. E.g. if you need to check more than 100M rows within 2 seconds. Yes: These types of requirements do exist!
- Security and Privacy: Often data is not meant to be seen or used by everyone. So making sure this keeps this way, even with data cleaning tasks, is often not simple.
- Requirements will change: First you check for duplicate rows, then you are asked to check for only duplicate row fields, like ‘names’ or addresses.
- ACID requirements can become important.
Most AI tools will give you quick solutions. However: These solutions are seldom good enough and function correct for edge cases. And edge cases will happen, most data collections have edge cases that are undetected by bad software. And be aware: system failures are inevitable and infrastructure you use (e.g. networks, SaaS environments) will break. So always keep thinking!