A good CSV or spreadsheet import experience is rare. Here are some hints and best practices that will make both your users and your developers happy.
1. Provide a sample file
Correct data is crucial. To ensure that, you can of course write documentation or create video tutorials. However, one of the best and easiest solutions is frequently overlooked: Provide a file with example content. It shows the different columns and how the data within needs to be formatted.
Don’t hide the file somewhere in the documentation, but provide a link to it near the import button. That way, whenever someone needs the import functionality, they can look up how it works. Additionally, the example file improves developer experience. Put the file in the same code repository as the importer, so your tests can use it. Thanks to version control, you can track every change and the reasons behind it. Developers don’t have to remember to update external documentation (which may get out of sync with the actual functionality).
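For instance, a minimal JUnit test could parse the sample file straight from the repository and guard the header row against silent changes. This is just a sketch; the file name sample-import.csv and the email header column are assumptions:

```java
import com.opencsv.CSVReader;
import org.junit.jupiter.api.Test;

import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;

class SampleFileTest {

    @Test
    void sampleFileParsesAndHasExpectedHeader() throws Exception {
        // The sample file lives in src/test/resources, next to the importer code.
        try (CSVReader reader = new CSVReader(new InputStreamReader(
                getClass().getResourceAsStream("/sample-import.csv"),
                StandardCharsets.UTF_8))) {
            List<String[]> rows = reader.readAll();
            assertFalse(rows.isEmpty());
            // Guards against someone renaming a column without updating the importer.
            assertEquals("email", rows.get(0)[0]);
        }
    }
}
```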
A surprising number of users will not be able to resist the urge to download this file and immediately import it again (on production, of course). I don’t know where this urge comes from, but it’s there. If you find an easy way to prevent importing the sample file, do it.
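One simple guard, sketched below, is to reject any upload that is byte-identical to the sample file. The hash constant is a placeholder you would compute from your actual sample file:

```java
import java.security.MessageDigest;
import java.util.HexFormat;

final class SampleFileGuard {

    // Placeholder: replace with the SHA-256 of your real sample file.
    private static final String SAMPLE_SHA256 = "replace-with-real-hash";

    /** Returns true if the uploaded bytes are exactly the sample file. */
    static boolean isSampleFile(byte[] uploadedBytes) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(uploadedBytes);
        return HexFormat.of().formatHex(digest).equals(SAMPLE_SHA256);
    }
}
```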
For the cherry on top, provide one sample and one empty template with just the column headers.
2. File format
CSV import is comparatively easy to implement, for instance in Java with OpenCSV. You still have to provide instructions on how a file can be exported to CSV, which delimiter and quote characters must be selected, etc. Supporting XLSX (Excel) as well is more user-friendly, but you will have to fight with different Excel versions, bugs that only occur with specific locale settings, etc.
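Reading with OpenCSV might look like the following sketch; the semicolon delimiter, quote character, and file name are assumptions that must match whatever your export instructions say:

```java
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;

import java.io.FileReader;

public class CsvReadExample {
    public static void main(String[] args) throws Exception {
        try (CSVReader reader = new CSVReaderBuilder(new FileReader("import.csv"))
                .withCSVParser(new CSVParserBuilder()
                        .withSeparator(';')   // must match the documented export settings
                        .withQuoteChar('"')
                        .build())
                .build()) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                System.out.println(String.join(" | ", row));
            }
        }
    }
}
```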
Don’t use the position of the columns for mapping imported data, but the column header. That allows users to reorder columns, and it allows developers to provide optional columns.
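OpenCSV supports header-based mapping via annotated beans. In this sketch, the column names email and amount are made up for illustration; users can reorder the columns freely, and amount can be left out entirely:

```java
import com.opencsv.bean.CsvBindByName;
import com.opencsv.bean.CsvToBeanBuilder;

import java.io.FileReader;
import java.util.List;

public class HeaderMappingExample {

    public static class ImportRow {
        @CsvBindByName(column = "email", required = true)
        private String email;

        @CsvBindByName(column = "amount") // optional column, may be absent
        private String amount;
    }

    public static void main(String[] args) throws Exception {
        // Rows are matched by header name, not by position.
        List<ImportRow> rows = new CsvToBeanBuilder<ImportRow>(new FileReader("import.csv"))
                .withType(ImportRow.class)
                .build()
                .parse();
        System.out.println("Parsed " + rows.size() + " rows");
    }
}
```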
Think about date and number formats and internationalization: depending on the locale, 1.234,56 and 1,234.56 denote the same number, and 31.12.2024 and 12/31/2024 the same date.
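A small sketch of locale-aware parsing with plain JDK classes, assuming German-formatted input:

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class LocaleParsingExample {
    public static void main(String[] args) throws Exception {
        // "1.234,56" is valid German input that a US-locale parser would misread.
        DecimalFormat german = (DecimalFormat) NumberFormat.getInstance(Locale.GERMANY);
        german.setParseBigDecimal(true); // avoid double rounding for money values
        BigDecimal amount = (BigDecimal) german.parse("1.234,56");

        DateTimeFormatter germanDate = DateTimeFormatter.ofPattern("dd.MM.uuuu");
        LocalDate date = LocalDate.parse("31.12.2024", germanDate);

        System.out.println(amount + " / " + date); // 1234.56 / 2024-12-31
    }
}
```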
3. Validate entire CSV import file first, then process
The structure of your importer’s code is very important. Always validate the entire file before anything is persisted. In case of errors, it is often easier to fix them and import the entire file again than to work only with the failing rows.
Another reason for splitting the validation and processing phases is that you should give the user feedback about wrong data as quickly as possible. Nothing is more annoying than having an import run for hours and then finally getting back: “Oh, there was an error on line 24,567, you probably have to do the whole thing again!”
It’s tempting to fail immediately after the first error, or the first category of errors. While this provides feedback more quickly than validating the entire file, it’s a much more frustrating user experience, as the user will fix and upload again and again and again.
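The overall shape could look like this sketch: collect every error first, persist only when the whole file is clean. The RowError record, the validation rule, and persistRow are placeholders:

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseImporter {

    record RowError(int lineNumber, String message) {}

    public List<RowError> importFile(List<String[]> rows) {
        List<RowError> errors = new ArrayList<>();

        // Phase 1: validate everything; never stop at the first problem.
        for (int i = 0; i < rows.size(); i++) {
            errors.addAll(validateRow(i + 1, rows.get(i)));
        }
        if (!errors.isEmpty()) {
            return errors; // nothing persisted; report all errors at once
        }

        // Phase 2: persist, knowing the data is structurally sound.
        for (String[] row : rows) {
            persistRow(row);
        }
        return List.of();
    }

    private List<RowError> validateRow(int lineNumber, String[] row) {
        List<RowError> errors = new ArrayList<>();
        if (row.length < 2 || row[0].isBlank()) {
            errors.add(new RowError(lineNumber, "Missing required value in column 1"));
        }
        return errors;
    }

    private void persistRow(String[] row) {
        // e.g. repository.save(...) — intentionally left as a placeholder
    }
}
```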
In an architecture where the data has to be processed by several systems (microservices, external systems), you can keep this split between validation and processing possible by providing validation endpoints in those systems, if you control them. If you don’t have control over the external systems, it may make sense to replicate their validation logic in your importer.
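Such a contract can be as small as the following hypothetical interface; the names are assumptions for illustration, not a real API:

```java
import java.util.List;
import java.util.Map;

/**
 * Hypothetical contract for a downstream service: check the rows without
 * persisting anything, so the importer can run a side-effect-free
 * validation phase across all systems involved.
 */
public interface DownstreamValidationClient {

    /** Returns error messages keyed by row index; empty map means all rows are valid. */
    Map<Integer, String> validateOnly(List<String[]> rows);
}
```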
You will have to decide whether the import is usually fast enough to provide direct feedback in your UI (progress bar, success message), or whether it makes more sense to run it entirely in the background and send an e-mail or another notification once it has finished. If possible, combine both: the user gets the validation result instantly (so they don’t have to check their e-mail for import errors), while the processing phase, which often takes a long time, runs in the background and reports via e-mail.
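A sketch of that hybrid approach, with hypothetical validate, process, and sendMail placeholders:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HybridImportService {

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Returns validation errors immediately; processing continues in the background. */
    public List<String> startImport(List<String[]> rows, String userEmail) {
        List<String> errors = validate(rows);   // fast; the user waits for this
        if (!errors.isEmpty()) {
            return errors;                      // instant feedback in the UI
        }
        CompletableFuture
                .runAsync(() -> process(rows), executor) // slow part, off the request thread
                .whenComplete((ok, ex) -> sendMail(userEmail,
                        ex == null ? "Import finished" : "Import failed: " + ex.getMessage()));
        return List.of();
    }

    private List<String> validate(List<String[]> rows) { return List.of(); }
    private void process(List<String[]> rows) { /* persist rows */ }
    private void sendMail(String to, String subject) { /* e.g. via Jakarta Mail */ }
}
```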
4. The feedback loop
Importing a file, getting an error message saying there is a wrong currency code in line 34,567, having to scroll through the entire file to find it, importing again, getting another error… Sadly, this is the current state of a lot of import experiences. It could be so much simpler!
Only present errors like the following in your UI:
- The file is not in CSV/XLSX format but JPEG. Please use a supported format. Download the example file here.
- The file is too large (2.6 GB). The maximum file size is 2 GB. Find instructions on how to split the file here.
Everything related to the content of the CSV file should stay in a CSV file. Add an “Error message” column as the last column and put the errors for the corresponding row there. Make sure that the error message is clear and contains all errors that affect that row. If you import from XLSX (Excel), you can do even better and mark the failing fields visually (e.g. with a red background). Automatically download or present this file to the user if the imported data contains errors.
The user can then fix these errors with the same tools they used to create the file in the first place (e.g. Excel). Make sure that this “Error message” column is ignored by the importer (if present). If the validation phase takes a long time, you may also send the result and the CSV file with the data and error messages via e-mail. Your users will be very grateful for such a smooth experience.
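Producing such a feedback file with OpenCSV could look like this sketch; the error map keyed by row index is an assumption about how your validation phase reports its results:

```java
import com.opencsv.CSVWriter;

import java.io.FileWriter;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ErrorFileWriter {

    /**
     * Copies the original rows and appends an "Error message" column,
     * filled with the combined error text for each failing row.
     * errorsByLine maps a 0-based row index to its error message.
     */
    public void write(List<String[]> originalRows,
                      Map<Integer, String> errorsByLine,
                      String outputPath) throws Exception {
        try (CSVWriter writer = new CSVWriter(new FileWriter(outputPath))) {
            for (int i = 0; i < originalRows.size(); i++) {
                String[] row = originalRows.get(i);
                String[] withError = Arrays.copyOf(row, row.length + 1);
                withError[row.length] = i == 0
                        ? "Error message"                   // header row
                        : errorsByLine.getOrDefault(i, ""); // all errors for this row
                writer.writeNext(withError);
            }
        }
    }
}
```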
Using such a feedback loop also works well in cases where you want to import successful rows immediately and only want to return the problematic data to the user.
In addition to a UI with an import button, you may want to provide an e-mail address like form-45a-imports@example.org that imports the attached file(s) and replies via e-mail.
5. Huge imports
If you work with a lot of data, you have to ensure that every part of the system can handle it, from web server settings to e-mail inbox quotas. You may need instructions or tools to split large files. You may reach the point where you focus your efforts on direct interfaces between possible data sources (software from other companies) and your system instead of imports.
Especially in your processing phase, measure your CPU and memory usage and see whether you need to optimize.
You will need to chunk your data and think about scaling the importer. With huge data, there is also more room for errors, so you will need functionality to automatically or manually retry failing rows. If you are working with Java, check out Spring Batch and Spring Cloud Data Flow.
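A plain-Java sketch of chunking with a bounded retry per chunk (frameworks like Spring Batch add restartability and bookkeeping on top); persistChunk is a placeholder:

```java
import java.util.List;

public class ChunkedImporter {

    private static final int CHUNK_SIZE = 1_000;
    private static final int MAX_RETRIES = 3;

    public void importAll(List<String[]> rows) {
        for (int from = 0; from < rows.size(); from += CHUNK_SIZE) {
            List<String[]> chunk =
                    rows.subList(from, Math.min(from + CHUNK_SIZE, rows.size()));
            persistWithRetry(chunk, from);
        }
    }

    private void persistWithRetry(List<String[]> chunk, int offset) {
        for (int attempt = 1; ; attempt++) {
            try {
                persistChunk(chunk);
                return;
            } catch (RuntimeException e) {
                if (attempt >= MAX_RETRIES) {
                    // Hand over to manual retry tooling instead of failing silently.
                    throw new IllegalStateException("Chunk starting at row " + offset
                            + " failed after " + attempt + " attempts", e);
                }
            }
        }
    }

    private void persistChunk(List<String[]> chunk) { /* batch insert */ }
}
```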
6. If in doubt, avoid the CSV import altogether
Of course, you can also look at external tools like flatfile.io. You can make data entry so smooth and easy that imports aren’t necessary. Or you can connect your system directly to all other possible data sources via an API or other means of integration.
That’s what I have learned about data import. Feel free to tell me your thoughts in the comments!