Data management#

Effective data management is essential for successful data science projects. It ensures reproducibility, efficiency, and collaboration. Here are some key practices:

Clear directory structure#

Keep a clear and consistent directory structure for your project. Here is an example from [1]. Feel free to adapt it to your needs.

|- notebooks/
   |- 01-first-logical-notebook.ipynb
   |- 02-second-logical-notebook.ipynb
   |- prototype-notebook.ipynb
   |- archive/
      |- no-longer-useful.ipynb
|- projectname/
   |- projectname/
      |- __init__.py
      |- config.py
      |- data.py
      |- utils.py
   |- setup.py
|- README.md
|- data/
   |- 01_raw/
   |- 02_interim/
   |- 03_final/
|- scripts/
   |- script1.py
   |- script2.py
   |- archive/
      |- no-longer-useful.py
|- environment.yml
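
If you start new projects often, the folders of such a layout can be created with a few lines of Python. The following is only a minimal sketch: the function name `create_skeleton` and the constant `PROJECT_DIRS` are made up for this illustration, and the directory list simply mirrors the example above.

```python
from pathlib import Path

# Directories taken from the example layout; adapt the list to your project.
PROJECT_DIRS = [
    "notebooks/archive",
    "projectname/projectname",
    "data/01_raw",
    "data/02_interim",
    "data/03_final",
    "scripts/archive",
]


def create_skeleton(root, dirs=PROJECT_DIRS):
    """Create the project directories under `root`.

    Existing directories are left untouched, so the function is safe to re-run.
    """
    root = Path(root)
    created = []
    for rel in dirs:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_skeleton("my-project")` creates only the folders. Files such as `README.md` or `environment.yml` still have to be added by hand.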

Organize your output data#

Organize your data using a clear folder structure. For example, keep raw data, processed data, and results in separate folders:

|- data/
   |- 01_raw/
   |- 02_interim/
   |- 03_final/

Use descriptive and consistent file naming conventions. For example, include relevant details such as date, version, or data type in the filename (e.g., routes_10am.feather, routes_shortest.feather).
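
One way to keep names consistent is to build them in a single helper function instead of typing them by hand in every script. The sketch below is an assumption about how such a helper could look. The function `build_filename` and the chosen pattern (name, variant, date, version) are illustrative and not a fixed convention.

```python
from datetime import date


def build_filename(name, variant, day, version, ext="feather"):
    """Build a filename such as 'routes_shortest_2024-05-01_v02.feather'."""
    return f"{name}_{variant}_{day.isoformat()}_v{version:02d}.{ext}"


print(build_filename("routes", "shortest", date(2024, 5, 1), 2))
# routes_shortest_2024-05-01_v02.feather
```

Using ISO dates (`YYYY-MM-DD`) and zero-padded version numbers means that files sort chronologically and by version in a normal directory listing.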

Config files#

Config files store parameters and settings for your scripts or applications (e.g., .yaml, .json). They allow you to change the behavior of your code without modifying the source code itself. If you also save a copy of the config file in your output folder, you will always know which parameters were used to generate that output.
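
As a rough illustration, the sketch below loads a JSON config and stores a copy of it next to the results. All names in it are hypothetical: the function `run_with_config`, the file names `config_used.json` and `result.json`, and the parameters `n_routes` and `mode` were chosen only for this example. JSON is used because it needs nothing beyond the standard library. A YAML file would work the same way with an additional YAML package.

```python
import json
import shutil
from pathlib import Path


def load_config(config_path):
    """Read parameters from a JSON config file into a dict."""
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)


def run_with_config(config_path, output_dir):
    """Run a (placeholder) analysis and keep the config next to its output."""
    config = load_config(config_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Copy the exact config that was used into the output folder.
    shutil.copy(config_path, output_dir / "config_used.json")

    # Placeholder for the real processing step; parameter names are hypothetical.
    result = {"n_routes": config["n_routes"], "mode": config["mode"]}
    with open(output_dir / "result.json", "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2)
    return result
```

With a config file containing `{"n_routes": 100, "mode": "shortest"}`, a call like `run_with_config("config.json", "data/03_final/run_01")` would leave both the result and the parameters that produced it in the same folder.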

Documentation and Workflow Tracking#

Keep a README or documentation file in your data folder. Describe your data sources, processing steps, and file naming conventions. Document any assumptions or decisions made during data cleaning and analysis.

Version Control and Backups#

Use version control (e.g., Git) for your scripts and config files. Avoid storing large raw data files in version control; use data repositories or cloud storage for these files. Regularly back up your data and ensure sensitive data is stored securely.

Use GitLab/GitHub tools for project management#

  • Issues: Create issues for every open task or bug. This helps you keep track of what needs to be done and what has been completed.

  • Kanban boards: Use boards to prioritise tasks and keep an overview of the project status.

Resources#

[1] Some of this content is taken from this blog post: How to organize your Python data science project. It is a bit outdated, but still gives good explanations of what is important.

Here are some more recent packages and tools that can help you set up a directory structure for your data science project. You don’t have to stick to them exactly as they are, but they can serve as a good starting point: