Guide to Python Project Structure and Packaging

TIPS:

  1. Decide on a project layout: src or flat.
  2. Create a virtual environment.
  3. Package your project source code folder.
  4. Install your newly packaged project (editable install).
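
For step 2, a virtual environment is usually created from the terminal with `python -m venv .venv`. The same can be sketched with the standard-library venv module. The helper name create_project_venv and the folder name .venv are my own choices for illustration, not something the project requires.

```python
import tempfile
import venv
from pathlib import Path


def create_project_venv(project_root: Path, name: str = ".venv", with_pip: bool = True) -> Path:
    """Create a virtual environment inside project_root and return its path."""
    env_dir = project_root / name
    # Similar in spirit to running `python -m venv .venv` in the project root.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


if __name__ == "__main__":
    # Demonstrated in a throwaway directory so no real project is touched.
    # with_pip=False keeps the demo fast; a real project would want pip.
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = create_project_venv(Path(tmp), with_pip=False)
        print("pyvenv.cfg created:", (env_dir / "pyvenv.cfg").is_file())
```

The environment still has to be activated in the shell before running pip install.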

CAVEATS

How you organize a Python project will likely depend on the platform that runs your code. In my experience, the structure presented in this article works well for running locally and in Docker containers. You may need to adjust it for other targets, such as an Azure Function or an Azure ML pipeline.

Nevertheless, once you’re comfortable with the general behaviors of Python importing, you should find it intuitive to adapt to other platforms.

TL;DR

  • Structuring a Python project well matters both for imports to work inside the project and for distributing it to other users as a package.
  • In my experience, proper packaging starts with a good project structure.
  • If you are new to the modules / packages concept, and the internal working of the import operation, here is a good guide.
  • There are two popular structures: src layout, flat layout.
  • In the past, setup.py was commonly used to configure the package building process, but the way forward when using a build tool from outside the standard library, like setuptools, is via setup.cfg and pyproject.toml.

Update: Using hyphens to name the project package (e.g., mel-housing instead of melhousing) seems to fail the setup process somewhere, leading to the same ModuleNotFoundError, although the package can still be installed (as shown by running pip list). I have not figured out the reason, so I just avoid hyphens for now. Please comment if you know why.
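
One likely explanation, offered as an assumption rather than a confirmed diagnosis of this particular setup: a hyphen is not allowed in a Python identifier, so a folder named mel-housing can never appear in an import statement (Python would read `import mel-housing` as a subtraction). The name that pip list shows is the distribution name, which may contain hyphens, while the folder that gets imported must be a valid identifier. The helper is_importable_name below is hypothetical and only illustrates the rule.

```python
import keyword


def is_importable_name(name: str) -> bool:
    """Return True if every dotted part of name can appear in an import statement."""
    parts = name.split(".")
    return all(part.isidentifier() and not keyword.iskeyword(part) for part in parts)


if __name__ == "__main__":
    for candidate in ("melhousing", "mel_housing", "mel-housing"):
        print(f"{candidate!r}: usable in an import statement = {is_importable_name(candidate)}")
```

Under this reading, an underscore (mel_housing) is a safe replacement for the hyphen in the folder name.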

I. INTRODUCTION

As I moved away from the Jupyter Notebook environment and adopted more flexible, albeit more complicated, project structures, I immediately ran into many questions, especially about where to put the main source code, unit tests, top-level scripts, etc., so that my scripts can correctly import and run the custom objects defined in the source files. In the longer run, this also matters for distributing the project to other users via packaging.

After days of (painful) research, I put together a guide mainly for my personal use, and I hope it can help those of you facing the same situation.

There are two main general structures: the flat layout vs the src layout as clearly explained in the official Python packaging guide here.

The “flat layout” refers to organising a project’s files in a folder or repository, such that the various configuration files and import packages are all in the top-level directory.

.
├── README.md
├── noxfile.py
├── pyproject.toml
├── setup.py
├── awesome_package/
│   ├── __init__.py
│   └── module.py
└── tools/
    ├── generate_awesomeness.py
    └── decrease_world_suck.py

The “src layout” deviates from the flat layout by moving the code that is intended to be importable (i.e. import awesome_package, also known as import packages) into a subdirectory. This subdirectory is typically named src/, hence “src layout”.

# the src-layout
.
├── README.md
├── noxfile.py
├── pyproject.toml
├── setup.py
├── src/
│   └── awesome_package/
│       ├── __init__.py
│       └── module.py
└── tools/
    ├── generate_awesomeness.py
    └── decrease_world_suck.py

After some days of researching the pros and cons of these two structures, referencing some well-known repositories, and trial and error, I decided to adopt the src layout, as it seems more intuitive for me personally to understand and use. I am also using setuptools to build the package. For the project itself, I’m using a sample dataset extracted from the Melbourne Housing Market dataset on Kaggle. You can find the project repository here.

II. PROJECT STRUCTURE

The main source code for distribution / installation is inside the melhousing/ sub-folder. It contains the sub-packages classes/ and constants/.

Other folders not to be packaged with the source code for distribution:

  • The sample data is inside data/.
  • Tests are inside tests/. This is the “Tests outside application code” layout described in the pytest guide.
  • Experiments that use the source code (scripts and notebooks) are inside scripts/ within the root folder (top-level scripts). This folder has its own configuration package, a one-stop place to configure how the experiments are run (hyper-parameters / metadata).
  • Files for the packaging process: LICENSE, setup.py, setup.cfg, pyproject.toml, etc.

Important: I don’t take the word “src” literally. I usually rename this folder to my liking, typically to the project name (e.g., “melhousing” in this case) or the overall package name. This folder acts as a central place for all of my custom sub-packages, which can be installed in editable mode into the virtual environment and then used by other application code. In this case, please treat all the sub-folders inside “melhousing” as sub-packages (despite the rather confusing names I used, my bad), and the Python files inside them as modules.

.
├── melhousing
│   ├── __init__.py
│   ├── classes
│   │   ├── __init__.py
│   │   └── housingdata.py
│   ├── constants
│   │   ├── __init__.py
│   │   └── myconstants.py
│   ├── another_sub_package
│   │   ├── __init__.py
│   │   └── another_module.py
│   └── scripts_internal
│       └── main_internal.py
├── data
│   ├── processed
│   │   └── processeddata.csv
│   └── raw
│       └── sampledata.csv
├── scripts
│   ├── expconfig
│   │   ├── __init__.py
│   │   ├── locconfig.json
│   │   └── locconfig.py
│   └── main.py
├── tests
│   ├── __init__.py
│   └── test_demo.py
├── pyproject.toml
├── requirements.txt
├── LICENSE
├── README.md
├── setup.cfg
└── setup.py

Content of the top-level script (scripts/main.py): create a HousingData object from the file in the data folder, and print a random listing, together with other information about the data source and the custom classes in the package.

Note: I’m not going through what the custom HousingData class is all about (another article focusing on the actual analytics will do this). It is here just for demonstration purposes.

# inside <root>/scripts/main.py
import os
import random

from expconfig.locconfig import locconfig
from melhousing.classes import housingdata as hd

# configure input/output locations
inputfile = os.path.join(locconfig['raw']['dir'], locconfig['raw']['file'])
outputfile = os.path.join(locconfig['processed']['dir'], locconfig['processed']['file'])

# create HousingData object
def make_housing_data():
    housing_data = hd.HousingData(inputfile)
    return housing_data

def make_random_row_number():
    housing_data = make_housing_data()
    # randrange excludes the upper bound, so the index is always valid
    rownumber = random.randrange(housing_data._number_of_listings)
    return rownumber

def main(rownumber: int):
    print('-'*50)
    print('location configuration:')
    print(locconfig)
    print('-'*50)
    print('csv file read from:')
    print(inputfile)
    print('-'*50)
    housing_data = make_housing_data()
    print(type(housing_data))
    print(type(housing_data._listing_list[rownumber]))
    print('-'*50)
    print('number of Listing objects inside HousingData class:')
    print(housing_data._number_of_listings)
    print('random listing object index:')
    print(rownumber)
    print('-'*50)
    print(f'listing object index {rownumber} details:')
    print(housing_data._listing_list[rownumber])
    print('-'*50)
    print('corresponding dataframe details:')
    print(housing_data._listing_dataframe.iloc[rownumber])

if __name__ == "__main__":
    main(make_random_row_number())
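
The expconfig package itself is not shown in this article. As a rough sketch only, locconfig.py could load the locconfig.json file sitting next to it and expose the result as a dictionary. The loader name load_locconfig is hypothetical; the keys follow the configuration printed in the script output further down.

```python
import json
import tempfile
from pathlib import Path


def load_locconfig(json_path: Path) -> dict:
    """Read the location configuration (directories and file names) from JSON."""
    with json_path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Inside a real locconfig.py this could be (hypothetical):
#     locconfig = load_locconfig(Path(__file__).with_name("locconfig.json"))

if __name__ == "__main__":
    sample = {
        "raw": {"dir": "./data/raw", "file": "sampledata.csv"},
        "processed": {"dir": "./data/processed", "file": "processeddata.csv"},
    }
    # Write the sample to a temporary file so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "locconfig.json"
        path.write_text(json.dumps(sample), encoding="utf-8")
        locconfig = load_locconfig(path)
        print(locconfig["raw"]["dir"], locconfig["raw"]["file"])
```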

III. PROBLEM STATEMENT

Goal: make the whole melhousing package accessible to all scripts that are not directly in the root folder, e.g., scripts inside tests/, scripts/, etc.

Running the experiment with main.py, which uses resources from the melhousing package, results in a ModuleNotFoundError. When a script is run directly, Python adds the script’s own folder (scripts/) to its import search path, not the project root, so melhousing is not found. Note that the expconfig package can still be imported (shown by the message its __init__.py prints to the terminal), because this package is in the same folder as main.py.

$ python ./scripts/main.py

imported config package
Traceback (most recent call last):
  File "/mnt/c/Users/phuon/GitHub/melhousing/./experiments/main.py", line 5, in <module>
    from melhousing.classes import housingdata as hd
ModuleNotFoundError: No module named 'melhousing'
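
The behaviour can be reproduced in isolation. The sketch below builds a throwaway project with the same shape (a package next to a scripts/ folder) and runs the script from the project root. The names pkg_demo, cfg_demo and run_script_from_root are made up for this illustration.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script_from_root() -> subprocess.CompletedProcess:
    """Recreate a tiny project and run scripts/main.py from its root folder."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # root/pkg_demo is a "sister" of root/scripts, like melhousing.
        (root / "pkg_demo").mkdir()
        (root / "pkg_demo" / "__init__.py").write_text("")
        # root/scripts/cfg_demo sits next to main.py, like expconfig.
        (root / "scripts" / "cfg_demo").mkdir(parents=True)
        (root / "scripts" / "cfg_demo" / "__init__.py").write_text("")
        (root / "scripts" / "main.py").write_text(
            "import cfg_demo\n"
            "print('imported cfg_demo')\n"
            "import pkg_demo\n"
        )
        # Drop variables that would change how sys.path is built.
        env = {k: v for k, v in os.environ.items()
               if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}
        return subprocess.run(
            [sys.executable, str(root / "scripts" / "main.py")],
            cwd=root, env=env, capture_output=True, text=True,
        )


if __name__ == "__main__":
    result = run_script_from_root()
    print(result.stdout.strip())
    print(result.stderr.strip().splitlines()[-1])
```

Even though the working directory is the project root, only the folder containing main.py is searched, so the sibling package imports and the sister package does not.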

Next, some unit testing with pytest. The tests are simply hard-coded to check that toy objects (functions, constants) from the melhousing package can be imported and return the exact expected values.

# inside ./tests/test_demo.py
from melhousing.constants.myconstants import constant1, constant2
from melhousing.constants.filepaths import demopath
from melhousing.classes import housingdata as hd

def test_constants():
    assert constant1 == 'constant1 says hello'
    assert constant2 == 123
def test_filepaths():
    assert demopath == 'filepaths.py says hello'
def test_housingdata():
    assert hd.say_hello() == 'housingdata says hello!'

if __name__ == '__main__':
    test_constants()
    test_filepaths()
    test_housingdata()

After installing the pytest framework into the venv, we simply run pytest ./tests. The tests pass because pytest helps import the melhousing package from the “sister” folder, even without the editable installation with pip: since tests/ contains an __init__.py, pytest’s default import mode adds the project root to the import search path. (This behaviour can hide packaging mistakes, which is one reason the pytest documentation recommends the src layout; more details here.) However, if you run the test script on its own, without pytest’s help, the same ModuleNotFoundError is returned.

$ pytest ./tests
=========================== test session starts ===========================
platform linux -- Python 3.9.14, pytest-7.2.1, pluggy-1.0.0
rootdir: /mnt/c/Users/phuon/GitHub/melhousing
collected 3 items

tests/test_demo.py ...                                              [100%]

============================ 3 passed in 4.49s ============================
$ python3 ./tests/test_demo.py
Traceback (most recent call last):
  File "/mnt/c/Users/phuon/GitHub/melhousing/./tests/test_demo.py", line 1, in <module>
    from melhousing.constants.myconstants import constant1, constant2
ModuleNotFoundError: No module named 'melhousing'
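
Why pytest succeeds where plain python fails can be sketched as follows. As I understand the pytest documentation, its default "prepend" import mode walks upward from the test file to the first folder that has no __init__.py and puts that folder on sys.path. The function below is my own simplified imitation of that rule (pytest_style_basedir is a hypothetical name, not pytest code).

```python
import tempfile
from pathlib import Path


def pytest_style_basedir(test_file: Path) -> Path:
    """Return the first directory at or above the test file's folder without __init__.py."""
    current = test_file.resolve().parent
    # Keep climbing while the current folder is itself a package.
    while (current / "__init__.py").exists() and current.parent != current:
        current = current.parent
    return current


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        (root / "tests").mkdir()
        (root / "tests" / "__init__.py").write_text("")
        test_file = root / "tests" / "test_demo.py"
        test_file.write_text("")
        # tests/ is a package, so the basedir is the project root,
        # which is also where the melhousing folder lives.
        print(pytest_style_basedir(test_file) == root)
```

If tests/ had no __init__.py, the same rule would stop at tests/ itself, and the import of melhousing would then depend on the package being installed.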

What to do: Install the melhousing folder as a package (in editable mode) into your virtual environment, using one of two methods: 1) setup.py, or 2) setup.cfg + pyproject.toml.

IV. SOLUTION

Method 1: Using setup.py

Create a setup.py file in the project root with the content below. Essentially, it tells setuptools:

  • the name of the package, melhousing, which is the name pip will know the package by.
  • the version of the package that pip will report, and that PyPI will publish if you later distribute the package there.
  • which packages to install. You can leave the find_packages() arguments empty and let it search your root folder automatically for packages. However, it is advisable to specify which folder you want to install as a package (melhousing in this case), to prevent any unintended installation of other packages in the root folder.

from setuptools import setup, find_packages

setup(name='melhousing',
    version='1.0',
    # 'melhousing.*' also picks up sub-packages such as melhousing.classes
    packages=find_packages(include=['melhousing', 'melhousing.*']))
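
A note on the include patterns: as far as I can tell from the setuptools documentation, they are shell-style wildcards matched against the full dotted package name. A pattern such as 'melhousing' therefore matches only the top-level package, and 'melhousing.*' is needed to also pick up the sub-packages in a regular (non-editable) build. The sketch below mimics that filter with the standard-library fnmatch module. It is not setuptools code, and filter_packages is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def filter_packages(candidates, include):
    """Keep the dotted package names that match at least one include pattern."""
    return [name for name in candidates
            if any(fnmatchcase(name, pattern) for pattern in include)]


if __name__ == "__main__":
    # Packages a scan of the project root could plausibly report.
    candidates = ["melhousing", "melhousing.classes", "melhousing.constants", "tests"]
    print(filter_packages(candidates, ["melhousing"]))
    print(filter_packages(candidates, ["melhousing", "melhousing.*"]))
```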

Now, install in editable mode (with the -e flag; note the . at the end), so that Python directs any import of the package to the folder under development (instead of copying the code to another location on disk, e.g., site-packages), and any changes to the code are reflected the next time the Python interpreter is run. The terminal output will resemble this:

$ pip install -e .

Obtaining file:///mnt/c/Users/phuon/GitHub/melhousing
	Installing build dependencies ... done
	Checking if build backend supports build_editable ... done
	Getting requirements to build editable ... done
	Installing backend dependencies ... done
	Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: melhousing
	Building editable for melhousing (pyproject.toml) ... done
Successfully built melhousing
Installing collected packages: melhousing
Successfully installed melhousing-1.0
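
To double-check where Python will now load the package from, the standard library can report the import location and the installed version. This is a small sketch with a hypothetical helper name, describe_import. After an editable install, I would expect the origin to point into the project folder rather than site-packages.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def describe_import(package_name: str) -> dict:
    """Report whether a top-level package is importable, from where, and its version."""
    spec = importlib.util.find_spec(package_name)
    try:
        version: Optional[str] = metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # Importable modules without an installed distribution end up here.
        version = None
    return {
        "found": spec is not None,
        "origin": spec.origin if spec is not None else None,
        "version": version,
    }


if __name__ == "__main__":
    # Prints found=False if melhousing has not been installed in this environment.
    print(describe_import("melhousing"))
```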

Running the main.py script again, it can now import the package resources and return the objects’ attributes as intended.

imported config package
imported melhousing package
imported classes package
--------------------------------------------------
location configuration:
{'raw': {'dir': './data/raw', 'file': 'sampledata.csv'}, 'processed': {'dir': './data/processed', 'file': 'processeddata.csv'}}
--------------------------------------------------
csv file read from:
./data/raw/sampledata.csv
--------------------------------------------------
<class 'melhousing.classes.housingdata.HousingData'>
<class 'melhousing.classes.housingdata.Listings'>
--------------------------------------------------
number of Listing objects inside HousingData class:
500
random listing object index:
492
--------------------------------------------------
listing object index 492 details:
Suburb: Ascot Vale
Address: 44 The Parade
...
Lattitude: -37.7729
Longtitude: 144.9179
Regionname: Western Metropolitan
Propertycount: 6567
--------------------------------------------------
corresponding dataframe details:
Suburb Ascot Vale
Address 44 The Parade
...
Lattitude -37.7729
Longtitude 144.9179
Regionname Western Metropolitan
Propertycount 6567
Name: 492, dtype: object

Method 2 (preferred): Using setup.cfg + pyproject.toml

setup.py is itself a Python script that imports setuptools, and setuptools is not part of the Python standard library. As setuptools became the popular tool for building distributions, this raised an issue: the dependencies needed to run setup.py could not be known before running it. PEP 518 explains the problem and introduces the pyproject.toml file to declare build dependencies, while PEP 517 defines how build backends such as setuptools are invoked. With setuptools, the package configuration itself can then be written declaratively in setup.cfg.

PEP 518:

You can’t execute a setup.py file without knowing its dependencies, but currently there is no standard way to know what those dependencies are in an automated fashion without executing the setup.py file where that information is stored. It’s a catch-22 of a file not being runnable without knowing its own contents which can’t be known programmatically unless you run the file.

PEP 518 (on the file format):

The build system dependencies will be stored in a file named pyproject.toml that is written in the TOML format [6]. This format was chosen as it is human-usable (unlike JSON [7]), it is flexible enough (unlike configparser [9]), stems from a standard (also unlike configparser [9]), and it is not overly complex (unlike YAML [8]).

Use Case 1: When you want to install the whole src-equivalent folder as a package

After organizing our project folders, there must be a way to specify which programs and libraries are needed to execute the actual packaging (build system requirement). We do that with the pyproject.toml file. Since we are using setuptools, we can just include it and keep the file very minimal.

Next, setup.cfg is used as the main file to define the package’s metadata and the other options that were previously passed to the setup() function. You can also include other metadata about your project under the [metadata] section.

With these two files doing all the heavy lifting, the setup.py file becomes a bare-bones stub.

Note: other required dependencies that are usually specified in the requirements.txt file can also be migrated to either pyproject.toml or setup.cfg under the appropriate section (you can read more here). They will then be downloaded and installed automatically when the package is installed. In this article, for clarity, I keep them separate in requirements.txt, so you have to do one extra step to pip install them into your venv as usual.

# inside pyproject.toml:

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

# inside setup.cfg:

[metadata]
name = melhousing
version = 1.0
author = <Firstname Lastname>
author_email = youremail@email.com
description = Your description of the project
long_description = file: README.md
long_description_content_type = text/markdown
url = yourprojecturl.com
classifiers =
    Programming Language :: Python :: 3
    Operating System :: OS Independent
license_files = LICENSE

[options]
packages = find:

[options.packages.find]
# the trailing * also matches sub-packages such as melhousing.classes
include = melhousing*

# inside setup.py - minimal stub:

from setuptools import setup
if __name__ == '__main__':
    setup()
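
Since setup.cfg is an INI-style file, it can be sanity-checked with the standard-library configparser before building. The sketch below parses a trimmed copy of the configuration from a string. It only shows how the sections and multi-line values are read, not how setuptools itself interprets them, and read_setup_cfg and multiline_values are hypothetical helper names.

```python
import configparser

# A trimmed copy of the setup.cfg content, embedded for the demo.
SETUP_CFG = """\
[metadata]
name = melhousing
version = 1.0
classifiers =
    Programming Language :: Python :: 3
    Operating System :: OS Independent

[options]
packages = find:

[options.packages.find]
include = melhousing*
"""


def read_setup_cfg(text: str) -> configparser.ConfigParser:
    """Parse setup.cfg content given as a string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def multiline_values(raw: str) -> list:
    """Split an indented multi-line value into its non-empty lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


if __name__ == "__main__":
    cfg = read_setup_cfg(SETUP_CFG)
    print(cfg["metadata"]["name"], cfg["metadata"]["version"])
    print(multiline_values(cfg["metadata"]["classifiers"]))
    print(cfg["options.packages.find"]["include"])
```

The indentation of the classifier lines matters: it is what marks them as continuation lines of the classifiers value.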

Now, install your project as an editable package just like before from the project root folder:

pip install -e .

Use Case 2: When you want to install sub-packages inside the src-equivalent folder

According to the official user guide documentation:

If your packages are not in the root of the repository or do not correspond exactly to the directory structure, you also need to configure package_dir

In our case, this means that if you want to install only specific sub-packages inside melhousing, for instance classes, and consequently write import classes instead of import melhousing.classes in your scripts, you need to make the changes as follows. Note that you can replace classes with * to include all sub-packages, and that package_dir and where take the same value.

# inside setup.cfg:

[options]
packages = find:
package_dir =
    =melhousing

[options.packages.find]
where = melhousing
include = classes
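
To see why this makes the sub-package importable as classes, it helps to remember that discovered package names are relative to the where folder. The sketch below is a simplified stand-in for the discovery step, written with the standard library only. It is not setuptools code, and discover_packages is a hypothetical name.

```python
import os
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path


def discover_packages(where: Path, include=("*",)) -> list:
    """List dotted package names found under where, relative to where."""
    found = []
    for dirpath, dirnames, filenames in os.walk(where):
        current = Path(dirpath)
        if current == where:
            continue
        if "__init__.py" not in filenames:
            dirnames[:] = []  # not a package, so do not descend further
            continue
        name = ".".join(current.relative_to(where).parts)
        if any(fnmatchcase(name, pattern) for pattern in include):
            found.append(name)
    return sorted(found)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for folder in ("melhousing", "melhousing/classes", "melhousing/constants"):
            (root / folder).mkdir()
            (root / folder / "__init__.py").write_text("")
        # Searching from the project root yields names prefixed with melhousing.
        print(discover_packages(root))
        # Searching from inside melhousing yields the bare sub-package names.
        print(discover_packages(root / "melhousing", include=("classes",)))
        print(discover_packages(root / "melhousing", include=("*",)))
```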

V. CONCLUSION

Key takeaways:

  • Structuring a Python project well matters both for imports to work inside the project and for distributing it to other users as a package.
  • There are two popular structures: src layout, flat layout.
  • In the past, setup.py was commonly used to configure the package building process, but the way forward when using a build tool from outside the standard library, like setuptools, is via setup.cfg and pyproject.toml.

Thank you for reading and I welcome any comments / feedback.
