I'm building a Django ETL engine that extracts data from GitHub using the enterprise API to gather metrics on internal company collaboration. I've designed a schema that I now realize won't scale because of the PK (primary key) that is automatically set by the ORM. One of the main features of the extraction is to get the id of the person who created a repository, commented on a post, etc.
My initial thought was to let the ORM automatically set the id as the PK, but this won't work: the GET request is going to run once a week, and each run will raise errors because overwriting the existing ID primary key fails.
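
To make the failure concrete, here is a minimal sketch of the weekly re-insert problem. It uses plain `sqlite3` from the standard library rather than the Django ORM and MySQL so that it runs standalone; the table, the function names, and the sample values are all hypothetical. A second plain insert with the same id is rejected, while an update-then-insert helper (similar in spirit to what the ORM's `update_or_create` does) succeeds.

```python
import sqlite3

# Hypothetical stand-in for the Repository table; names and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repository (id INTEGER PRIMARY KEY, repo_name TEXT, qty_forks INTEGER)"
)


def insert_repo(repo_id, name, forks):
    """Plain INSERT: raises IntegrityError if a row with this id already exists."""
    conn.execute(
        "INSERT INTO repository (id, repo_name, qty_forks) VALUES (?, ?, ?)",
        (repo_id, name, forks),
    )


def upsert_repo(repo_id, name, forks):
    """Update the row when the id exists, otherwise insert a new one."""
    cur = conn.execute(
        "UPDATE repository SET repo_name = ?, qty_forks = ? WHERE id = ?",
        (name, forks, repo_id),
    )
    if cur.rowcount == 0:
        insert_repo(repo_id, name, forks)


insert_repo(1, "etl-engine", 3)  # first weekly run

try:
    insert_repo(1, "etl-engine", 5)  # second weekly run, same GitHub id
    second_insert_failed = False
except sqlite3.IntegrityError:
    second_insert_failed = True

upsert_repo(1, "etl-engine", 5)  # the same data, applied as an update instead
rows = conn.execute("SELECT id, repo_name, qty_forks FROM repository").fetchall()
```

This keeps only the latest values per id. If the weekly history matters, overwriting is not enough and each run needs its own row.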
I've done some research, and one potential solution is to create a meta class as referenced here: Django model primary key as a pair. However, I am unsure whether creating a few meta classes is going to defeat the entire point of a meta class to begin with.
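
As a rough illustration of the "pair" idea, the sketch below keeps a database-assigned row id as the primary key and enforces uniqueness on the pair (GitHub id, fetch date) instead. It again uses standalone `sqlite3` rather than Django, and the table name, column names, and dates are assumptions made up for the example. In Django, the same constraint is usually declared through `unique_together` in a model's inner `Meta` class.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical snapshot table: the database assigns row_id, while the pair
# (github_id, fetched_on) is what has to be unique.
conn.execute(
    """
    CREATE TABLE contributor_snapshot (
        row_id INTEGER PRIMARY KEY AUTOINCREMENT,
        github_id INTEGER NOT NULL,
        fetched_on TEXT NOT NULL,
        contribution_qty INTEGER,
        UNIQUE (github_id, fetched_on)
    )
    """
)


def record(github_id, fetched_on, qty):
    """Store one contributor's numbers for one weekly run."""
    conn.execute(
        "INSERT INTO contributor_snapshot (github_id, fetched_on, contribution_qty) "
        "VALUES (?, ?, ?)",
        (github_id, fetched_on, qty),
    )


record(42, "2024-01-01", 10)
record(42, "2024-01-08", 14)  # same person, next weekly run: allowed

try:
    record(42, "2024-01-08", 99)  # same person, same run: rejected
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

history = conn.execute(
    "SELECT fetched_on, contribution_qty FROM contributor_snapshot "
    "WHERE github_id = ? ORDER BY fetched_on",
    (42,),
).fetchall()
```

The design choice here is that the GitHub id stops being the primary key and becomes ordinary data, so repeated weekly runs add rows instead of colliding.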
Here is the schema I have set up in models.py:

    """ Construction of tables in MySQL instance """
    from django.db import models
    from datetime import datetime


    class Repository(models.Model):
        id = models.PositiveIntegerField(null=False, primary_key=True)
        repo_name = models.CharField(max_length=50)
        creation_date = models.CharField(max_length=21, null=True)
        last_updated = models.CharField(max_length=30, null=True)
        qty_watchers = models.PositiveIntegerField(null=True)
        qty_forks = models.PositiveIntegerField(null=True)
        qty_issues = models.PositiveIntegerField(null=True)
        main_language = models.CharField(max_length=30, null=True)
        repo_size = models.PositiveIntegerField(null=True)
        timestamp = models.DateTimeField(auto_now=True)


    class Contributor(models.Model):
        id = models.IntegerField(null=False, primary_key=True)
        contributor_cec = models.CharField(max_length=30, null=True)
        contribution_qty = models.PositiveIntegerField(null=True)
        get_request = models.CharField(max_length=100, null=True)
        timestamp = models.DateTimeField(auto_now=True)


    class Teams(models.Model):
        id = models.IntegerField(primary_key=True, null=False)
        team_name = models.CharField(max_length=100, null=True)
        timestamp = models.DateTimeField(auto_now=True)


    class TeamMembers(models.Model):
        id = models.IntegerField(null=False, primary_key=True)
        team_member_cec = models.CharField(max_length=30, null=True)
        get_request = models.CharField(max_length=100, null=True)
        timestamp = models.DateTimeField(auto_now=True)


    class Discussions(models.Model):
        id = models.IntegerField(null=False, primary_key=True)
        login = models.CharField(max_length=30, null=True)
        title = models.CharField(max_length=30, null=True)
        body = models.CharField(max_length=1000, null=True)
        comments = models.IntegerField(null=True)
        updated_at = models.CharField(max_length=21, null=True)
        get_request = models.CharField(max_length=100, null=True)
        timestamp = models.DateTimeField(auto_now=True)

Is there a way to overwrite the id field and make the PK the timestamp field, since each time the GET request is run that field will be populated with static data that will not change over the lifetime of the app?
Alternatively, is there a way to ditch the multi-table inheritance architecture and go for something different?
The core metrics that I will be extracting from this are things like top contributor to repository, repository with most commits, and most replied to comments. I'd like to be able to run some kind of filters on the data to extract these metrics, but I know this is heavily reliant upon the schema setup.
Thank you!