3.8. Theory Relations¶
- relation¶
In relational database theory, a relation, as originally defined by E. F. Codd [4], is a set of tuples (d1, d2, ..., dn), where each element dj is a member of Dj, a data domain. Codd's original definition notwithstanding, and contrary to the usual definition in mathematics, there is no ordering to the elements of the tuples of a relation. Instead, each element is termed an attribute value. An attribute is a name paired with a domain (nowadays more commonly referred to as a type or data type). An attribute value is an attribute name paired with an element of that attribute's domain, and a tuple is a set of attribute values in which no two distinct elements have the same name. Thus, in some accounts, a tuple is described as a function, mapping names to values. [6]
- retention¶
Data retention defines the policies of persistent data and records management for meeting legal and business data archival requirements. In the field of telecommunications, data retention generally refers to the storage of call detail records (CDRs) of telephony and internet traffic and transaction data (IPDRs) by governments and commercial organisations. In the case of government data retention, the data that is stored is usually of telephone calls made and received, emails sent and received, and websites visited. Location data is also collected. [5]
- consistency¶
Consistency (or Correctness) in database systems refers to the requirement that any given database transaction must change affected data only in allowed ways. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code) but merely that any programming errors cannot result in the violation of any defined database constraints. [10] [7]
- integrity¶
Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle and is a critical aspect of the design, implementation, and usage of any system that stores, processes, or retrieves data. The term is broad in scope and may have widely different meanings depending on the specific context – even under the same general umbrella of computing. It is at times used as a proxy term for data quality, while data validation is a prerequisite for data integrity. Data integrity is the opposite of data corruption. The overall intent of any data integrity technique is the same: ensure data is recorded exactly as intended (such as a database correctly rejecting mutually exclusive possibilities). Moreover, upon later retrieval, ensure the data is the same as when it was originally recorded. In short, data integrity aims to prevent unintentional changes to information. Data integrity is not to be confused with data security, the discipline of protecting data from unauthorized parties. [9] [8]
- DBA¶
Database Administrator
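The definition of a relation given in the glossary can be illustrated with a small sketch. It assumes, purely for illustration, that a tuple is modelled as a `frozenset` of `(name, value)` pairs and a relation as a `set` of such tuples. The helper `make_tuple` is hypothetical, and the attribute names are borrowed from the examples later in this chapter.

```python
def make_tuple(**attributes):
    """Model a tuple as an unordered set of (attribute name, value) pairs."""
    return frozenset(attributes.items())


# A relation is a set of tuples: rows are unordered and duplicates collapse
relation = {
    make_tuple(firstname='Mark', lastname='Watney'),
    make_tuple(lastname='Watney', firstname='Mark'),  # same tuple, other order
    make_tuple(firstname='Melissa', lastname='Lewis'),
}

# A tuple can also be viewed as a function mapping names to values
as_mapping = dict(make_tuple(firstname='Mark', lastname='Watney'))

print(len(relation))
print(as_mapping['lastname'])
```

Because attribute order does not matter, the first two rows are the same tuple and the relation holds only two elements.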
3.8.1. Base¶
>>> from dataclasses import dataclass
>>>
>>>
>>> @dataclass
... class Astronaut:
...     firstname: str
...     lastname: str
>>>
>>>
>>> CREW = [
...     Astronaut('Mark', 'Watney'),
...     Astronaut('Melissa', 'Lewis'),
...     Astronaut('Rick', 'Martinez')]

3.8.2. Extend¶
>>> from dataclasses import dataclass
>>>
>>>
>>> @dataclass
... class Astronaut:
...     firstname: str
...     lastname: str
...     role: str
>>>
>>>
>>> CREW = [
...     Astronaut('Mark', 'Watney', 'Botanist'),
...     Astronaut('Melissa', 'Lewis', 'Commander'),
...     Astronaut('Rick', 'Martinez', 'Pilot')]

>>> from dataclasses import dataclass
>>>
>>>
>>> @dataclass
... class Astronaut:
...     firstname: str
...     lastname: str
...     role: str
...     mission_year: int
...     mission_name: str
>>>
>>>
>>> CREW = [
...     Astronaut('Mark', 'Watney', 'Botanist', 2035, 'Ares 3'),
...     Astronaut('Melissa', 'Lewis', 'Commander', 2035, 'Ares 3'),
...     Astronaut('Rick', 'Martinez', 'Pilot', 2035, 'Ares 3')]

3.8.3. Boolean Vector¶
>>> from dataclasses import dataclass
>>>
>>>
>>> @dataclass
... class Mission:
...     year: int
...     name: str
>>>
>>>
>>> @dataclass
... class Astronaut:
...     firstname: str
...     lastname: str
...     role: str
...     missions: list[Mission]
>>>
>>>
>>> CREW = [
...     Astronaut('Mark', 'Watney', 'Botanist', missions=[
...         Mission(2035, 'Ares 3')]),
...     Astronaut('Melissa', 'Lewis', 'Commander', missions=[
...         Mission(2035, 'Ares 3'),
...         Mission(2031, 'Ares 1')]),
...     Astronaut('Rick', 'Martinez', 'Pilot', missions=[])]
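
The section title suggests a flat encoding in which every known mission becomes its own boolean column. A minimal sketch of that idea follows. Using the mission name as the column name is an assumption made for illustration, not something the source prescribes.

```python
from dataclasses import dataclass


@dataclass
class Mission:
    year: int
    name: str


@dataclass
class Astronaut:
    firstname: str
    lastname: str
    role: str
    missions: list[Mission]


CREW = [
    Astronaut('Mark', 'Watney', 'Botanist', missions=[
        Mission(2035, 'Ares 3')]),
    Astronaut('Melissa', 'Lewis', 'Commander', missions=[
        Mission(2035, 'Ares 3'),
        Mission(2031, 'Ares 1')]),
    Astronaut('Rick', 'Martinez', 'Pilot', missions=[])]

# Every mission name that occurs anywhere; sorted for a stable column order
columns = sorted({m.name for astro in CREW for m in astro.missions})

# One flat row per astronaut, one True/False column per mission
result = []
for astro in CREW:
    flown = {m.name for m in astro.missions}
    row = {'firstname': astro.firstname,
           'lastname': astro.lastname,
           'role': astro.role}
    for name in columns:
        row[name] = name in flown
    result.append(row)

for row in result:
    print(row)
```

Every row has the same columns, which suits tabular tools, but the mission year is lost and each new mission adds a column.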

3.8.4. FFill¶
>>> from dataclasses import dataclass
>>>
>>>
>>> @dataclass
... class Mission:
...     year: int
...     name: str
>>>
>>>
>>> @dataclass
... class Astronaut:
...     firstname: str
...     lastname: str
...     role: str
...     missions: list[Mission]
>>>
>>>
>>> CREW = [
...     Astronaut('Mark', 'Watney', 'Botanist', missions=[
...         Mission(2035, 'Ares 3')]),
...     Astronaut('Melissa', 'Lewis', 'Commander', missions=[
...         Mission(2035, 'Ares 3'),
...         Mission(2031, 'Ares 1')]),
...     Astronaut('Rick', 'Martinez', 'Pilot', missions=[])]
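
Assuming "FFill" refers to forward fill, the flat format here has one row per mission, with the astronaut columns given only on the first row of each group and filled down afterwards. The sketch below is one possible reading. The `ROWS` layout and the `ffill` helper are hypothetical and written only for this example.

```python
# Sparse rows: astronaut columns appear only on the first row of each group
ROWS = [
    {'firstname': 'Mark', 'lastname': 'Watney', 'year': 2035, 'name': 'Ares 3'},
    {'firstname': 'Melissa', 'lastname': 'Lewis', 'year': 2035, 'name': 'Ares 3'},
    {'firstname': None, 'lastname': None, 'year': 2031, 'name': 'Ares 1'},
    {'firstname': 'Rick', 'lastname': 'Martinez', 'year': None, 'name': None},
]


def ffill(rows, keys):
    """Return new rows where missing values under `keys` are copied
    from the most recent row that had a value."""
    last = {}
    filled = []
    for row in rows:
        row = dict(row)  # copy, so the input rows stay untouched
        for key in keys:
            if row[key] is None:
                row[key] = last.get(key)
            else:
                last[key] = row[key]
        filled.append(row)
    return filled


result = ffill(ROWS, keys=('firstname', 'lastname'))

for row in result:
    print(row)
```

Only the astronaut columns are filled. The empty mission columns for an astronaut without missions stay `None`.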




3.8.5. Relations¶
>>> from dataclasses import dataclass
>>>
>>>
>>> @dataclass
... class Mission:
...     year: int
...     name: str
>>>
>>>
>>> @dataclass
... class Astronaut:
...     firstname: str
...     lastname: str
...     role: str
...     missions: list[Mission]
>>>
>>>
>>> CREW = [
...     Astronaut('Mark', 'Watney', 'Botanist', missions=[
...         Mission(2035, 'Ares 3')]),
...     Astronaut('Melissa', 'Lewis', 'Commander', missions=[
...         Mission(2035, 'Ares 3'),
...         Mission(2031, 'Ares 1')]),
...     Astronaut('Rick', 'Martinez', 'Pilot', missions=[])]
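
In a relational layout the same data would be split into separate tables linked by keys, so that each mission is stored once. The following is a rough sketch using plain dictionaries as tables. The numeric ids, the `ASTRONAUT_MISSION` association table and the `missions_of` helper are assumptions introduced for illustration.

```python
# Each "table" maps a primary key to a row
ASTRONAUTS = {
    1: {'firstname': 'Mark', 'lastname': 'Watney', 'role': 'Botanist'},
    2: {'firstname': 'Melissa', 'lastname': 'Lewis', 'role': 'Commander'},
    3: {'firstname': 'Rick', 'lastname': 'Martinez', 'role': 'Pilot'},
}

MISSIONS = {
    1: {'year': 2035, 'name': 'Ares 3'},
    2: {'year': 2031, 'name': 'Ares 1'},
}

# Association table: (astronaut_id, mission_id) pairs
ASTRONAUT_MISSION = [(1, 1), (2, 1), (2, 2)]


def missions_of(astronaut_id):
    """Return the mission rows linked to the given astronaut."""
    return [MISSIONS[mission_id]
            for astro_id, mission_id in ASTRONAUT_MISSION
            if astro_id == astronaut_id]


print(missions_of(2))
print(missions_of(3))
```

Changing the name of a mission now touches a single row in `MISSIONS`, at the price of a lookup (a join) whenever the full picture is needed.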


3.8.6. Serialization¶
>>> from dataclasses import dataclass
>>>
>>>
>>> @dataclass
... class Mission:
...     year: int
...     name: str
>>>
>>>
>>> @dataclass
... class Astronaut:
...     firstname: str
...     lastname: str
...     role: str
...     missions: list[Mission]
>>>
>>>
>>> CREW = [
...     Astronaut('Mark', 'Watney', 'Botanist', missions=[
...         Mission(2035, 'Ares 3')]),
...     Astronaut('Melissa', 'Lewis', 'Commander', missions=[
...         Mission(2035, 'Ares 3'),
...         Mission(2031, 'Ares 1')]),
...     Astronaut('Rick', 'Martinez', 'Pilot', missions=[])]
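
One way to serialize such nested objects is to convert them to plain dictionaries and dump them as JSON. The sketch below uses `dataclasses.asdict` and the standard `json` module. The `astronaut_from_dict` helper is hypothetical and only shows how the nested relation could be rebuilt on load.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Mission:
    year: int
    name: str


@dataclass
class Astronaut:
    firstname: str
    lastname: str
    role: str
    missions: list[Mission]


CREW = [
    Astronaut('Mark', 'Watney', 'Botanist', missions=[
        Mission(2035, 'Ares 3')]),
    Astronaut('Melissa', 'Lewis', 'Commander', missions=[
        Mission(2035, 'Ares 3'),
        Mission(2031, 'Ares 1')]),
    Astronaut('Rick', 'Martinez', 'Pilot', missions=[])]

# asdict() also converts dataclasses nested inside lists to plain dicts
payload = json.dumps([asdict(astro) for astro in CREW])


def astronaut_from_dict(row):
    """Rebuild an Astronaut together with its nested Mission objects."""
    missions = [Mission(**mission) for mission in row['missions']]
    return Astronaut(row['firstname'], row['lastname'], row['role'], missions)


restored = [astronaut_from_dict(row) for row in json.loads(payload)]

print(payload)
print(restored == CREW)
```

JSON keeps the nesting, so the one-to-many relation survives the round trip without flattening.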




3.8.7. Normal forms¶
UNF: Unnormalized form
1NF: First normal form
2NF: Second normal form
3NF: Third normal form
EKNF: Elementary key normal form
BCNF: Boyce–Codd normal form
4NF: Fourth normal form
ETNF: Essential tuple normal form
5NF: Fifth normal form
DKNF: Domain-key normal form
6NF: Sixth normal form
3.8.8. Recap¶
DBAs and programmers use different data formats than data scientists
Data scientists prefer flat formats, without relations and joins
DBAs and programmers prefer relational data
For DBAs and programmers, flat data formats mean data duplication
Normalization makes data manipulation more consistent
Normalization uses less space and makes UPDATEs easier
Normalization leads to many SELECTs with JOINs, which require computation
In the 21st century, storage is cheap, while computing power costs money
Currently, SELECTs are far more common than INSERTs and UPDATEs (let's say 80%-15%-5%; this is only a rough estimate, please don't quote these numbers)
Normalization does not work well at large (big data) scale
Big data requires a simplified approach, typically without any relations
Data consistency is then enforced by business logic
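The trade-off listed above can be sketched with the standard `sqlite3` module. The table and column names and the renamed mission `'Ares III'` are invented for this illustration. The point is only that an UPDATE touches one row, while reading a flat view back needs JOINs.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE astronaut (
        id INTEGER PRIMARY KEY,
        firstname TEXT,
        lastname TEXT);
    CREATE TABLE mission (
        id INTEGER PRIMARY KEY,
        year INTEGER,
        name TEXT);
    CREATE TABLE astronaut_mission (
        astronaut_id INTEGER REFERENCES astronaut(id),
        mission_id INTEGER REFERENCES mission(id));

    INSERT INTO astronaut VALUES (1, 'Mark', 'Watney'), (2, 'Melissa', 'Lewis');
    INSERT INTO mission VALUES (1, 2035, 'Ares 3'), (2, 2031, 'Ares 1');
    INSERT INTO astronaut_mission VALUES (1, 1), (2, 1), (2, 2);
""")

# Normalized data: renaming a mission is a single-row UPDATE
db.execute("UPDATE mission SET name = 'Ares III' WHERE id = 1")

# Reading the flat view preferred for analysis requires two JOINs
flat = db.execute("""
    SELECT a.firstname, a.lastname, m.year, m.name
    FROM astronaut AS a
    JOIN astronaut_mission AS am ON am.astronaut_id = a.id
    JOIN mission AS m ON m.id = am.mission_id
    ORDER BY a.id, m.year
""").fetchall()

for row in flat:
    print(row)
```

In the flat result the renamed mission appears on two rows. Stored flat, the same rename would have to update both of them.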
3.8.9. References¶
- 1
Database normalization. Wikipedia. URL: https://en.wikipedia.org/wiki/Database_normalization
- 2
SQL. Wikipedia. Year: 2021. Retrieved: 2021-12-16. URL: https://en.wikipedia.org/wiki/SQL
- 3
Shafranovich, Y. The application/sql Media Type. Internet Engineering Task Force (IETF). Retrieved: 2021-12-16. Year: 2013. URL: https://datatracker.ietf.org/doc/html/rfc6922
- 4
Codd, E. F. Further Normalization of the Data Base Relational Model. (Presented at Courant Computer Science Symposia Series 6, Data Base Systems, New York City, May 24–25, 1971.) IBM Research Report RJ909 (August 31, 1971). Republished in Randall J. Rustin (ed.), Data Base Systems: Courant Computer Science Symposia Series 6. Prentice-Hall, 1972.
- 5
Data retention. Wikipedia. Year: 2021. Retrieved: 2021-12-16. URL: https://en.wikipedia.org/wiki/Data_retention
- 6
Relation (database). Wikipedia. Year: 2021. Retrieved: 2021-12-16. URL: https://en.wikipedia.org/wiki/Relation_(database)
- 7
Consistency. Wikipedia. Year: 2021. Retrieved: 2021-12-16. URL: https://en.wikipedia.org/wiki/Consistency_(database_systems)
- 8
Data Integrity. Wikipedia. Year: 2021. Retrieved: 2021-12-16. URL: https://en.wikipedia.org/wiki/Data_integrity
- 9
Boritz, J. IS Practitioners' Views on Core Concepts of Information Integrity. International Journal of Accounting Information Systems. Elsevier. Year: 2011. Retrieved: 2011-08-12.
- 10
Date, C. J. SQL and Relational Theory: How to Write Accurate SQL Code. 2nd edition. O'Reilly Media, Inc., 2012, p. 180.
3.8.10. Assignments¶
"""
* Assignment: OOP Relations Syntax
* Complexity: easy
* Lines of code: 7 lines
* Time: 5 min
English:
1. Use Dataclass to define class `Point` with attributes:
a. `x: int` with default value `0`
b. `y: int` with default value `0`
2. Use Dataclass to define class `Path` with attributes:
a. `points: list[Point]` with default empty list
3. Run doctests - all must succeed
Polish:
1. Użyj Dataclass do zdefiniowania klasy `Point` z atrybutami:
a. `x: int` z domyślną wartością `0`
b. `y: int` z domyślną wartością `0`
2. Użyj Dataclass do zdefiniowania klasy `Path` z atrybutami:
a. `points: list[Point]` z domyślną pustą listą
3. Uruchom doctesty - wszystkie muszą się powieść
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> from inspect import isclass
>>> assert isclass(Point)
>>> assert isclass(Path)
>>> assert hasattr(Point, 'x')
>>> assert hasattr(Point, 'y')
>>> Point()
Point(x=0, y=0)
>>> Point(x=0, y=0)
Point(x=0, y=0)
>>> Point(x=1, y=2)
Point(x=1, y=2)
>>> Path([Point(x=0, y=0),
... Point(x=0, y=1),
... Point(x=1, y=0)])
Path(points=[Point(x=0, y=0), Point(x=0, y=1), Point(x=1, y=0)])
"""
from dataclasses import dataclass, field
"""
* Assignment: OOP Relations Model
* Complexity: easy
* Lines of code: 10 lines
* Time: 8 min
English:
1. In `DATA` we have two classes
2. Model data using classes and relations
3. Create instances dynamically based on `DATA`
4. Run doctests - all must succeed
Polish:
1. W `DATA` mamy dwie klasy
2. Zamodeluj problem wykorzystując klasy i relacje między nimi
3. Twórz instancje dynamicznie na podstawie `DATA`
4. Uruchom doctesty - wszystkie muszą się powieść
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> assert type(result) is list
>>> assert all(type(astro) is Astronaut
... for astro in result)
>>> assert all(type(addr) is Address
... for astro in result
... for addr in astro.addresses)
>>> result # doctest: +NORMALIZE_WHITESPACE
[Astronaut(firstname='José',
lastname='Jiménez',
addresses=[Address(street='2101 E NASA Pkwy', city='Houston', postcode=77058, region='Texas', country='USA'),
Address(street=None, city='Kennedy Space Center', postcode=32899, region='Florida', country='USA')]),
Astronaut(firstname='Mark',
lastname='Watney',
addresses=[Address(street='4800 Oak Grove Dr', city='Pasadena', postcode=91109, region='California', country='USA'),
Address(street='2825 E Ave P', city='Palmdale', postcode=93550, region='California', country='USA')]),
Astronaut(firstname='Иван',
lastname='Иванович',
addresses=[Address(street=None, city='Космодро́м Байкону́р', postcode=None, region='Кызылординская область', country='Қазақстан'),
Address(street=None, city='Звёздный городо́к', postcode=141160, region='Московская область', country='Россия')]),
Astronaut(firstname='Melissa',
lastname='Lewis',
addresses=[]),
Astronaut(firstname='Alex',
lastname='Vogel',
addresses=[Address(street='Linder Hoehe', city='Cologne', postcode=51147, region='North Rhine-Westphalia', country='Germany')])]
"""
from dataclasses import dataclass
DATA = [
{"firstname": "José", "lastname": "Jiménez", "addresses": [
{"street": "2101 E NASA Pkwy", "city": "Houston", "postcode": 77058, "region": "Texas", "country": "USA"},
{"street": None, "city": "Kennedy Space Center", "postcode": 32899, "region": "Florida", "country": "USA"}]},
{"firstname": "Mark", "lastname": "Watney", "addresses": [
{"street": "4800 Oak Grove Dr", "city": "Pasadena", "postcode": 91109, "region": "California", "country": "USA"},
{"street": "2825 E Ave P", "city": "Palmdale", "postcode": 93550, "region": "California", "country": "USA"}]},
{"firstname": "Иван", "lastname": "Иванович", "addresses": [
{"street": None, "city": "Космодро́м Байкону́р", "postcode": None, "region": "Кызылординская область", "country": "Қазақстан"},
{"street": None, "city": "Звёздный городо́к", "postcode": 141160, "region": "Московская область", "country": "Россия"}]},
{"firstname": "Melissa", "lastname": "Lewis", "addresses": []},
{"firstname": "Alex", "lastname": "Vogel", "addresses": [
{"street": "Linder Hoehe", "city": "Cologne", "postcode": 51147, "region": "North Rhine-Westphalia", "country": "Germany"}]}
]
class Astronaut:
    ...


class Address:
    ...
# Iterate over `DATA` and create instances
# type: list[Astronaut]
result = ...
"""
* Assignment: OOP Relations HasPosition
* Complexity: medium
* Lines of code: 18 lines
* Time: 8 min
English:
1. Define class `Point`
2. Class `Point` has attributes `x: int = 0` and `y: int = 0`
3. Define class `HasPosition`
4. In `HasPosition` define method `get_position(self) -> Point`
5. In `HasPosition` define method `set_position(self, x: int, y: int) -> None`
6. In `HasPosition` define method `change_position(self, left: int = 0, right: int = 0, up: int = 0, down: int = 0) -> None`
7. Assume the top-left corner of the screen is the initial coordinate position:
a. going right add to `x`
b. going left subtract from `x`
c. going up subtract from `y`
d. going down add to `y`
8. Run doctests - all must succeed
Polish:
1. Zdefiniuj klasę `Point`
2. Klasa `Point` ma atrybuty `x: int = 0` oraz `y: int = 0`
3. Zdefiniuj klasę `HasPosition`
4. W `HasPosition` zdefiniuj metodę `get_position(self) -> Point`
5. W `HasPosition` zdefiniuj metodę `set_position(self, x: int, y: int) -> None`
6. W `HasPosition` zdefiniuj metodę `change_position(self, left: int = 0, right: int = 0, up: int = 0, down: int = 0) -> None`
7. Przyjmij górny lewy róg ekranu za punkt początkowy:
a. idąc w prawo dodajesz `x`
b. idąc w lewo odejmujesz `x`
c. idąc w górę odejmujesz `y`
d. idąc w dół dodajesz `y`
8. Uruchom doctesty - wszystkie muszą się powieść
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> from inspect import isclass, ismethod
>>> assert isclass(Point)
>>> assert isclass(HasPosition)
>>> assert hasattr(Point, 'x')
>>> assert hasattr(Point, 'y')
>>> assert hasattr(HasPosition, 'get_position')
>>> assert hasattr(HasPosition, 'set_position')
>>> assert hasattr(HasPosition, 'change_position')
>>> assert ismethod(HasPosition().get_position)
>>> assert ismethod(HasPosition().set_position)
>>> assert ismethod(HasPosition().change_position)
>>> class Astronaut(HasPosition):
...     pass
>>> astro = Astronaut()
>>> astro.set_position(x=1, y=2)
>>> astro.get_position()
Point(x=1, y=2)
>>> astro.set_position(x=1, y=1)
>>> astro.change_position(right=1)
>>> astro.get_position()
Point(x=2, y=1)
>>> astro.set_position(x=1, y=1)
>>> astro.change_position(left=1)
>>> astro.get_position()
Point(x=0, y=1)
>>> astro.set_position(x=1, y=1)
>>> astro.change_position(down=1)
>>> astro.get_position()
Point(x=1, y=2)
>>> astro.set_position(x=1, y=1)
>>> astro.change_position(up=1)
>>> astro.get_position()
Point(x=1, y=0)
"""
from dataclasses import dataclass
"""
* Assignment: OOP Relations Nested
* Complexity: medium
* Lines of code: 6 lines
* Time: 13 min
English:
1. Convert `DATA` to a format with one column per attribute, for example:
a. `mission1_year`, `mission2_year`,
b. `mission1_name`, `mission2_name`
2. Note, that enumeration starts with one
3. Run doctests - all must succeed
Polish:
1. Przekonwertuj `DATA` do formatu z jedną kolumną dla każdego atrybutu, np:
a. `mission1_year`, `mission2_year`,
b. `mission1_name`, `mission2_name`
2. Zwróć uwagę, że enumeracja zaczyna się od jeden
3. Uruchom doctesty - wszystkie muszą się powieść
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> assert type(result) is list
>>> assert len(result) > 0
>>> assert all(type(x) is dict for x in result)
>>> result # doctest: +NORMALIZE_WHITESPACE
[{'firstname': 'Mark',
'lastname': 'Watney',
'mission1_year': '2035',
'mission1_name': 'Ares3'},
{'firstname': 'Melissa',
'lastname': 'Lewis',
'mission1_year': '2030',
'mission1_name': 'Ares1',
'mission2_year': '2035',
'mission2_name': 'Ares3'},
{'firstname': 'Rick',
'lastname': 'Martinez'}]
"""
DATA = [
{"firstname": "Mark", "lastname": "Watney", "missions": [
{"year": "2035", "name": "Ares3"}]},
{"firstname": "Melissa", "lastname": "Lewis", "missions": [
{"year": "2030", "name": "Ares1"},
{"year": "2035", "name": "Ares3"}]},
{"firstname": "Rick", "lastname": "Martinez", "missions": []}
]
# flatten data, each mission field prefixed with mission and number
# type: list[dict]
result = ...
"""
* Assignment: OOP Relations Flatten
* Complexity: medium
* Lines of code: 5 lines
* Time: 13 min
English:
1. How do you write relational data to a CSV file (e.g. a contact has many addresses)?
2. Convert `DATA` to `result: list[dict[str, str]]`
3. Non-functional requirements:
a. Use `,` to separate fields
b. Use `;` to separate columns
4. Run doctests - all must succeed
Polish:
1. Jak zapisać w CSV dane relacyjne (kontakt ma wiele adresów)?
2. Przekonwertuj `DATA` do `result: list[dict[str, str]]`
3. Wymagania niefunkcjonalne:
a. Użyj `,` do oddzielenia pól
b. Użyj `;` do oddzielenia kolumn
4. Uruchom doctesty - wszystkie muszą się powieść
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> result # doctest: +NORMALIZE_WHITESPACE
[{'firstname': 'Pan', 'lastname': 'Twardowski', 'missions': '1967,Apollo 1;1970,Apollo 13;1973,Apollo 18'},
{'firstname': 'Ivan', 'lastname': 'Ivanovic', 'missions': '2023,Artemis 2;2024,Artemis 3'},
{'firstname': 'Mark', 'lastname': 'Watney', 'missions': '2035,Ares 3'},
{'firstname': 'Melissa', 'lastname': 'Lewis', 'missions': ''}]
"""
class Astronaut:
    def __init__(self, firstname, lastname, missions=()):
        self.firstname = firstname
        self.lastname = lastname
        self.missions = list(missions)


class Mission:
    def __init__(self, year, name):
        self.year = year
        self.name = name


DATA = [
    Astronaut('Pan', 'Twardowski', missions=[
        Mission('1967', 'Apollo 1'),
        Mission('1970', 'Apollo 13'),
        Mission('1973', 'Apollo 18')]),
    Astronaut('Ivan', 'Ivanovic', missions=[
        Mission('2023', 'Artemis 2'),
        Mission('2024', 'Artemis 3')]),
    Astronaut('Mark', 'Watney', missions=[
        Mission('2035', 'Ares 3')]),
    Astronaut('Melissa', 'Lewis')]
# Convert DATA
# Use `,` to separate fields
# Use `;` to separate columns
# type: list[dict]
result = ...
"""
* Assignment: OOP Relations Nested
* Complexity: medium
* Lines of code: 7 lines
* Time: 13 min
English:
1. Convert `DATA` to a format with one column per attribute, for example:
a. `address1_street`, `address2_street`,
b. `address1_city`, `address2_city`
2. Note, that enumeration starts with one
3. Run doctests - all must succeed
Polish:
1. Przekonwertuj `DATA` do formatu z jedną kolumną dla każdego atrybutu, np:
a. `address1_street`, `address2_street`,
b. `address1_city`, `address2_city`
2. Zwróć uwagę, że enumeracja zaczyna się od jeden
3. Uruchom doctesty - wszystkie muszą się powieść
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> assert type(result) is list
>>> assert len(result) > 0
>>> assert all(type(x) is dict for x in result)
>>> result # doctest: +NORMALIZE_WHITESPACE
[{'firstname': 'José',
'lastname': 'Jiménez',
'address1_street': '2101 E NASA Pkwy',
'address1_city': 'Houston',
'address1_post_code': 77058,
'address1_region': 'Texas',
'address1_country': 'USA',
'address2_street': '',
'address2_city': 'Kennedy Space Center',
'address2_post_code': 32899,
'address2_region': 'Florida',
'address2_country': 'USA'},
{'firstname': 'Mark',
'lastname': 'Watney',
'address1_street': '4800 Oak Grove Dr',
'address1_city': 'Pasadena',
'address1_post_code': 91109,
'address1_region': 'California',
'address1_country': 'USA',
'address2_street': '2825 E Ave P',
'address2_city': 'Palmdale',
'address2_post_code': 93550,
'address2_region': 'California',
'address2_country': 'USA'},
{'firstname': 'Иван',
'lastname': 'Иванович',
'address1_street': '',
'address1_city': 'Космодро́м Байкону́р',
'address1_post_code': '',
'address1_region': 'Кызылординская область',
'address1_country': 'Қазақстан',
'address2_street': '',
'address2_city': 'Звёздный городо́к',
'address2_post_code': 141160,
'address2_region': 'Московская область',
'address2_country': 'Россия'},
{'firstname': 'Melissa',
'lastname': 'Lewis'},
{'firstname': 'Alex',
'lastname': 'Vogel',
'address1_street': 'Linder Hoehe',
'address1_city': 'Cologne',
'address1_post_code': 51147,
'address1_region': 'North Rhine-Westphalia',
'address1_country': 'Germany'}]
"""
import json
DATA = """[
{"firstname": "José", "lastname": "Jiménez", "addresses": [
{"street": "2101 E NASA Pkwy", "city": "Houston", "post_code": 77058, "region": "Texas", "country": "USA"},
{"street": "", "city": "Kennedy Space Center", "post_code": 32899, "region": "Florida", "country": "USA"}]},
{"firstname": "Mark", "lastname": "Watney", "addresses": [
{"street": "4800 Oak Grove Dr", "city": "Pasadena", "post_code": 91109, "region": "California", "country": "USA"},
{"street": "2825 E Ave P", "city": "Palmdale", "post_code": 93550, "region": "California", "country": "USA"}]},
{"firstname": "Иван", "lastname": "Иванович", "addresses": [
{"street": "", "city": "Космодро́м Байкону́р", "post_code": "", "region": "Кызылординская область", "country": "Қазақстан"},
{"street": "", "city": "Звёздный городо́к", "post_code": 141160, "region": "Московская область", "country": "Россия"}]},
{"firstname": "Melissa", "lastname": "Lewis", "addresses": []},
{"firstname": "Alex", "lastname": "Vogel", "addresses": [
{"street": "Linder Hoehe", "city": "Cologne", "post_code": 51147, "region": "North Rhine-Westphalia", "country": "Germany"}]}
]"""
# flatten data, each address field prefixed with address and number
# type: list[dict]
result = ...