Entity Relationship Diagrams, a quick guide

Sounds very complicated doesn’t it? What exactly is an entity and how can they have relationships? What use could that possibly be to anyone? Well let’s take it one step at a time and begin with what an entity relationship diagram is, shall we? An entity relationship diagram, or ERD, is a diagram in a flow chart type style that shows how different things like customers, orders, products, etc, are connected to each and the types of interactions they have together.

ERD’s are generally used for designing a data base of some description. However, before you can really start putting one together there’s a few things you have to decide first, chiefly figuring what the entities are. An entity is usually considered to be anything that has a noun. So for example in this list of potential entities, what are the correct ones?

  • Dog

  • Cat

  • Walking

  • Friendly

  • Owner

Owner, dog and cat are the actual entities here because they have to be things that can have attributes that describe them. So for example, dog would have attributes like breed, age, colour, attitude, etc. The owner could have attributes such as name, age, address, number of pets, career, etc. Attributes are generally considered to be the adjectives that describe an entity.


Understanding an Entity


Now that we have that out of the way you may be wondering what does this have to do with databases? Probably better to just show you an example of an ERD before we get in to the details. Have a look at the below diagram and see if it makes any sense to you.

It probably seems more complicated than it is but lets talk through it anyway. This is the ERD for a company that wants to set up a database for details about it’s employees and departments as well as any appointments made with clients. The entities are; job description, employee, department, appointment, report, attachment, client and address.

Let’s take the employee entity to start with. Bear in mind that each employee will be entered in to the database following the format of this entity, so in this case, each of them will have to have their first name, last name, home phone number, job description and department information available for the database administrator. These are pretty self explanatory but you probably noticed the employee ID attribute as well. This is what’s called a primary key and is vital for any database. This is an ID that is unique for each individual entity in the database and is used to reference a specific person or object. So if an employee quits, the database administrator just gets their ID number and deletes it from the database, instead of manually deleting all of their details from every system they have used. You’ve probably been given a staff/student ID number in the past and now you know why!


Foreign Keys


Did you notice anything about the job description and department attributes? They are what’s called foreign keys. These are not necessary for every single entity as they reference another entity that the original entity may be a part of. Let’s say there are four departments in this company; sales, accounts, admin and marketing. The department entity has the primary key “departmentID” so each of the departments could be identified as sales:1, accounts:2, admin: 3, marketing: 4.

In this case the foreign key simply states that the employee with, let’s say, an ID of 10 (the primary key) , works in department 3 (foreign key saying they work in the admin department) and has a job description (which would be stored in the job description entity with it’s own ID), let’s say 15 for IT administrator. So from this info we can gather a lot of data fairly quickly without having to go to too much trouble.


Connecting Lines and Cardinality


But how does this relate to the rest of the diagram and what are those lines and circles connecting each entity? Those lines, or connecting lines as they are so creatively called, represent the type of relationship each entity has to one another. So in this diagram we can see that “department” is connected to “employee”, which makes sense, but is not connected to “address” as there is no logical connection between the two of them. You can see at the end of each connecting line there is a combination of vertical lines, three lines that fan out, known as crow’s feet, as well as circles.

These symbols refer to the cardinality of the connection between entities, or how many instances of an entity relate to one instance of another entity(1). So in plain old English that means that the “client” entity can have one address and that the address entity can have many clients. This makes sense when you consider that a client would have at least one work address, such as their office or factory, but that same address could hold many of this company’s clients.

Now those vertical lines and circles come in to effect. The vertical line denotes that the side of the relationship it is on is mandatory and the circle denotes the other side is optional. So for the full relationship between job description and employee it would read as the following; an employee has to have at least one job description, shown as the following:

As well as this a job description can have many employees, for example a sales agent job description can be applied to many employees, denoted as the following:

Makes sense right? So if we are to take the whole diagram and read it in such a way we would get the following:

  • An employee has to have a job description but a job description can have many employees.
  • An employee has to have a department and a department has to have many employees.
  • An employee can have many appointments and an appointment has to have at least one employee.
  • An appointment can have a report but a report has to part of an appointment.
  • A report can have many attachments but an attachment has to be part of a report.
  • An appointment has to have at least one client but a client can have many appointments.
  • A client has to have an address but an address can have many clients.

Many-to-many relationships


Something else that needs to be consider when putting together an ERD is if there are any many-to-many relationships. Databases are not fans to many-to-many relationships as they can cause issue when running queries later on. Consider the following:

While it might make perfect sense to say that a project can have many employees working on it and an employee can be part of many projects this would cause problems in the database. How are we to know who is doing what project? We would need multiple primary keys in each entity which would cause a lot of unnecessary confusion. A lot of database won’t even allow you to store a many-to-many relationship to avoid this. According to Michael Blaha of DataVersity (2)many to many relationships “permit multiple records for a combination of (two entities), which was not the intent of (original) relationship“.

So how do we get around this? It’s quite straight forward really, we simply add another entity to hold the details of both the entities, such as this:

By taking this approach we can easily look up details of who is doing what project without causing issue with the database.


Actions within an ERD


Sometimes you might come across an ERD that contains diamond shapes on the connecting line between entities. These are called actions and they are used to “show how two entities share information in the database(1). I think that this diagram quite accurately sums up the whole concept:

They are used more as a way for us to to understand the exact process that is taking place. If you were to look at this example without the actions it would be quite clear that a writer creates a novel and a consumer buys it but for a more complex database these actions may not be so obvious.


Conclusions


So there you have it, ERD’s summed up in about thirteen hundred words. As you can see these are pretty useful tools for organizing and understanding your data and saves you having to consider all of this when filling out your database. They cover the who, what, when and where of your database but the why is most likely because it is your job, or else you just really like databases! You can imagine that these diagrams can get very large and complicated very quickly but you now have the tools to get a general understanding of what is what regarding ERDs. I’ll leave you with this beast of a diagram, see if you can get your head around it because I’m done…enjoy!


References


1: Entity Relationship Diagram – Common ERD Symbols and Notations . 2017. Entity Relationship Diagram – Common ERD Symbols and Notations . [ONLINE] Available at: https://www.smartdraw.com/entity-relationship-diagram/. [Accessed 13 March 2017].

2: DATAVERSITY. 2017. More Database Design Errors – Confusion with Many-to-Many Relationships. [ONLINE] Available at: http://www.dataversity.net/more-database-design-errors-confusion-with-many-to-many-relationships/. [Accessed 14 March 2017].

Business Intelligence, it’s very intelligent

This is widely discussed topic among the data community and for good reason. Early examples of this go back to times when shop keepers would make a habit of getting to know their regular customer so that they might understand what type of new products the customer, or their children, close friends, neighbours, etc. could be interested in. This was all aimed at not only generating new customers but also about deciding how to diversify and expand their product line and therefore, profit potential. Granted they may never have thought of it in such clinical terms, they probably just thought of getting to know the customers as the “expected” thing to do but essentially that’s what was happening.

This kind of activity has fallen by the way side over the last few decades as people, for the most part, want to just get their products and be on their way, without wasting 40 minutes shooting the breeze with the cashier in a Spar. However that hasn’t stopped companies for still utilising this approach, not one bit. But it doesn’t just apply to customers, businesses these days want to know every detail they can get their hands on about their suppliers, their competive enviroment, etc, because as the old saying goes “knowledge is power.”

What exactly is it though?

Business intelligence is defined as:

The collective information about your customers, your competitors, your business partners, your competitive environment and your own internal operations that gives you the ability to make effective, important and often strategic business decisions.”

From this you can imagine how critical good business intelligence can be for an organisation. Take for example a clothes manufacturer whose primary customers are middle aged to elderly people. If they one day make a decision to release to market a trendy new summer collection, because a very small segment of younger people tweeted that they liked one particular outfit from them, the company would have spent a large amount of time, money and man power to make a product line few of their regular customers actually want. This is an example of bad business intelligence (BI).

So how do companies get this intelligence?

Companies these days have a multitude of options when it comes to gathering data to use for making BI decisions. For this blog however we will only discuss some of the most widely used ones.

  • Data warehouses

Data warehouses are one such tool and are repositories for all of the data that a business has collected through their various systems. They can be located either in a cloud storage system or else within a company’s mainframe server. The idea is to have as much data from as many different sources as possible.

Two approaches are taken when it comes to data warehousing, top-down or bottom-up. Top-down entails considering the “companies global needs, planning the design and implementation of the data warehouse as a whole”[3]. This can be a difficult and expensive endeavour since it is an all or nothing approach.

For a bottom-up approach a data warehouse is build up piece by piece with several data marts ( a specific resposititory for a particular communityor knowledge base [4]) being added to it as it gets larger. This apporach is generally favoured as it delivers a usable prototype quickly and can be altered easily.

  • Hadoop

This is “an open source framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs” [2]. Hadoop came about as a result of the need for automation regarding finding data quickly and it is chiefly concerned with distributing computing and processing power across several machines to speed up the search process.

This is important when you consider that the larger the amount of processing power available, i.e the number of computers used or nodes, the faster huge quantites of data can be sorted, stored and analysied. Google use a similar platform to Hadoop [1] so imagine if they used less processing power and you tried to search for something. It would be like the “good ole days” of internet dial up speeds. Searching for an image could take 20 minutes to respond. If you wanted to have your Hadoop platform increase speed and performance it is simply a matter of connecting more nodes.

So by using these tools you can see how a company could collect and process huge amounts of data quickly and have themselves in a position where they use this information to their advantage.

So what to do once you have the data?

Once enough data has been collected and analised using the above mentioned methods, BI managers can then go about generating reports based off of this data. These reports can range from querying where the business can cut costs or making their production process more effecient to predicting trends in their competitive enviroment or how upcoming P.E.S.T.E.L factors could effect the business.

After careful consideration and analysis of these reports the business can then go about making the most informed decision they can regarding how they should move forward. BI helps companies better understand and anticipate all aspects of their business and has become somewhat of an essential process to keep up with the competition these.

Remember our friendly shopkeeper from earlier? Imagine he had an enterprise scale BI analysis system. He would know what you wanted before you even realised you needed it! He could do home deliveries for essential products so that the instant you ran out of milk for your next cup of tea, he could knock on the door with a fresh bottle for you just when you need it the most. Now that…is customer service!

References

1. Does Google search engine use hadoop technology? – Quora. 2017. Does Google search engine use hadoop technology? – Quora. [ONLINE] Available at: https://www.quora.com/Does-Google-search-engine-use-hadoop-technology. [Accessed 23 February 2017].

2. Hadoop, What is it and why does it matter? – sas.com 2016. [ONLINE] Available at: https://www.sas.com/en_us/insights/big-data/hadoop.html [Accessed 23 February 2017]

3. Informatica. 2012 – Learn Big Data,Data Warehousing Fundamentals,Informatica,SQL,Cognos. -Top-Down vs. Bottom-Up In Data Warehousing- Informatica. [ONLINE] Available at: http://completedwh.blogspot.ie/2012/12/top-down-vs-bottom-up-in-data.html [Accessed 23 February 2017].

4. SearchSQLServer. 2017. What is data warehouse? – Definition from WhatIs.com. [ONLINE] Available at:http://searchsqlserver.techtarget.com/definition/data-warehouse. [Accessed 23 February 2017].

Master Data Management, it helps everyone

In this day and age every company seems to have a huge amount of data that they need to manage, be it from customers, competitors, suppliers, sales trends, etc. It’s not hard to image that all of this data can pose a rather daunting task for the people who actually need to sort through it to gather some useful information. Some times a corporation will take over an entire company simply so they can get access to the data they have on particular customer segments. So what happens if this company’s customer data is in a completely different format to the how the corporation has the rest of their data stored, or if it is entered incorrectly in to the new system?

Master data is usually in the form of entities such as customers, account, stock, etc. and is often used across multiple programs, applications and departments so it’s easy to imagine that an instance of bad data management could cause major issues. For example if a material supplier has incorrect information about a client looking for a large order, such as their shipping address, they may send the order to a different address entirely, which will cost time and money to rectify and may lose the client altogether as a result. Also when we consider that companies base most of their strategic decisions off trends they gather from their master data, poor data management could have long term strategic problems for the company.

According to Roger Wolter and Kirk Haselden of the Microsoft Corporation ”We define Master Data Management (MDM) as the technology, tools, and processes required to create and maintain consistent and accurate lists of master data”. Data can be divided in to one of four general groupings: people, things, concepts and places. Bearing this in mind a data entity such as people can be further subdivided in to customers, staff, suppliers, consultants etc.

So how does a company decide what elements are needed for which particular data entity? A customer for example my only be created through a sales agent in one company while another company my only allow them to be created through an online portal. As a result of this a company may only need limited data about the customer, such as basic contact information. However in another company customers may be on a contractual basis and would need to give extensive data to the company, such as credit checks and health information. All of this defines exactly what is the customer entity will need to be in order to satisfy the needs of the company. Bear in mind a company could have thousands of customer so if the data is wrong it usually is only noticed after it’s too late.

When a company decides to use master data management tools to increase their data’s effectiveness, rolling it out across the whole company can be expensive and challenging. Often the best approach is to start off small, usually on a trial basis within one department. Once all of the department’s data has been collated, is uniform in how it is stored, the employees have been adequately trained in how best to store, format and access the data, as well as it being up to date and accurate, it is then stored on a master data file accessible by anyone who needs it within the department. From here a company can monitor how streamline the use of this data is and if it was worth the effort. If implemented properly the company should see a reduction in redundant data entries and an overall increase in effectiveness of the data itself. They can then roll this out across other departments and once this is done throughout the whole company the results should be highly visible. The idea is to have all of the master data files available to other departments and personnel as and when they need access to them. MDM is generally more useful for large enterprise companies when considering how much data they have. Using a MDM system in a corner shop for example would be overkill to say the least.

There are many steps to approach outlined above such as:

  • Identify sources of master data

  • Identify the producers and consumers of master data

  • Collect and analyse metadata about your master data

  • Appoint data stewards

  • Implement a data governance program and data governance council

  • Develop a master data model

  • Choose a toolset

  • Design the infrastructure

  • Generate and test the master data

  • Modify the producing and consuming systems

  • Implement the maintainance process

(Roger Wolter and Kirk Haselden, Microsoft Corporation, November 2006)

For the purposes of this blog we won’t go in to the specific details of each of these steps but if you are interested in learning more about them, follow this link:

https://msdn.microsoft.com/en-us/library/bb190163.aspx

The benefits of successfully implementing a MDM system for a large organisation center around the data being clean, concise, not replicated across several systems and most importantly correct. Surely you have experience times when dealing with data that you have multiple entries about the same data item all with different information about the same thing and have no way of knowing what is correct or not. MDM removes that uncertainty and gives everyone in the company the knowledge that all of the data they are seeing is absolutely correct and thusly allows them to make more effective decisions when using the data in their own work. Think of a programmer who is developing a company wide application. If the data is stored in multiple areas and it is full of duplicates, that programmer is going to have a nightmare trying to get their application connected to everything as well as getting it to return the correct information every time. If there has been a MDM system implemented then they only have to connect to one central repository and can rest easy knowing it will work just fine.

References

Duddy, Jeremy . 2014. What Are the Benefits of Master Data Management?. [ONLINE] Available at:http://blog.apterainc.com/what-are-the-benefits-of-master-data-management. [Accessed 3 February 2017].

Haselden, Kirk and Wolter, Roger ; Microsoft Corporation. 2006. The What, Why, and How of Master Data Management. [ONLINE] Available at: https://msdn.microsoft.com/en-us/library/bb190163.aspx. [Accessed 2 February 2017].

Koets, René. 2013. How bad Master Data impacts good business. [ONLINE] Available at:https://blog.kpmg.ch/big-data-first-get-the-basics-right-how-bad-master-data-impacts-good-business-and-what-to-do/. [Accessed 2 February 2017].

Rouse, Margaret. 2010. master data management (MDM). [ONLINE] Available at:http://searchdatamanagement.techtarget.com/definition/master-data-management. [Accessed 3 February 2017].

Using R Studio

What is R?

R is a powerful programming language that allows user to perform calculations on large datasets which can then be plotted out graphically in a number of ways. This enables users to quickly and easily spot trends in a datasets which in turn allows them to make better decisions based off these graphs. The use of R can be applied to many different industries such as financial analysis, statistical research and for business intelligence purposes.

What this post is all about?

For this blog post I had to firstly take a crash course in the basic use of R. This was done on the code school website which has a very user friendly step by step guide on how to make use of different aspects of the R language. It covered storing value in variables, creating matrices and vectors as well as how to plot data in different ways, from basic scatter diagrams to 3D graphical representations. After each section I received a badge which verified my successful completion of the module.

badges rcodeschool

Getting Started

To demonstrate how to use R I decided to use one of the built-in datasets that come as part of R Studio. Initially I tried using external ones but most of the ones I found where not segmented in to columns properly, which would mean re-writing hundreds of data entries in to the right columns, which would be time consuming and impractical. So I opted to use the built-in dataset for UK driver deaths between 1970-1985.

Entering the dataset to R was quite straight forward. This was done by entering the line:

  • data<-UKDriverDeaths

This just meas the the variable name “data” will use to reference to entire dataset. Having the dataset stored and ready for use is all well and good but it doesn’t give us much information. So to make use of it I set up three other basic variables which were the mean, the median and the standard deviation. This was done by entering the lines:

  • meanValue<-mean(data)

  • deviation <- sd(data)

  • median(data)

As we have some functional data from the dataset we can now say that the average number of driver deaths in the UK between 1970-1985 is 1670.307 (rounded down to 1670), the average change of that number between each year is the standard deviation which is 289.611 (rounded up to 290) and the median value is 1631.

Representing this data graphically

To make this data a little bit easier to make sense of, the best course of action is to plot it out graphically. R makes this process very simple and quick. Firstly let us just have a straight forward histogram of the data to show where it has increased or decreased over each year. This is done by entering the line:

  • hist(data, col=”green”)

This basically just tells R to take the dataset and make a histogram with a green color such as this one:

histogram

From this graph we can see the frequency of the exact number of deaths over the time frame with 1500 deaths per year being the most usual. A higher number of deaths gets less frequent for each year which is fortunate.

Next I generated a graph with the mean value represented by a straight line running through the center of. This was done by entering the lines:

  • plot(data)

  • abline(h = meanValue)

This gave back the following graph:

graphmean

From this is graph it is easy to see where the number of deaths peaks and dips over the 15 year time frame. We can see a pattern between ’75 and around ’83 that when the number spikes there is a very sudden decrease in the number of driver deaths for perhaps a few months out of the year or so. This could be down to awareness programs implemented by government authorities to reduce the amount of people getting killed on the roads.

However the effect is short lived as the numbers quickly spike again and the pattern is repeated until a huge decrease of driver deaths in 1983. It was around this period that the UK government set a national causality reduction target which would account for the figures dropping far below the average of 1670.

What else can you do with R?

Since R can quickly manipulate data in various ways it isn’t hard to see why it is so widely used. Being that the dataset is from 30 years there isn’t much we could recommend to do in the past but with a similar and more recent dataset we could merge it with another one that has the locations of where the most driver deaths occur and with the combined datasets we could graphically plot out which areas have a consistently high number of driver deaths. After doing this, some factors in the area could be examined to determine how to best avoid an increased number of driver deaths.

Also with R any dataset can be quickly and easily added to it and plotted out in seconds. This enables you to gather a huge amount of data and spot where trends are showing with ease. Like the above example I could gleam some basic factors and conclusion about what the data represents without having to go in great detail combing through each data instance. Having this ability is invaluable when it comes to gaining useful business intelligence which enables better decision making and that can not only save on costs but depending on the sector in question, could save lives too.

Fusion Tables: Simple to use and a great data tool

Irish population map by county

For this aspect of data management we will now be looking at how fusion tables work. A fusion table is a tool that allows us to gather some very unique insights in to different datasets. The main aspect of this uses a mapping feature (google maps) to plot out various pieces of information over a geographic area.

Using Fusion Tables

To achieve the image represented above two things are required. They are a KML file, which contains the mapping information about each county so that we can have an accurate representation in the map view, as well as a spreadsheet containing the population dataset. In this case the population dataset, obtained from the CSO, had some of the larger counties and their population figures divided up by different regions, for example with Dublin the original dataset had the figures for the Fingal area, Co. Dublin and Dublin city center represented separately. So for the sake of convenience and to make the data more easily readable any counties listed like this had all of the different regions figures combined and represented as just one county itself.

After this was achieved and all of the population data was standardized, the csv(population) and kml(mapping) files were loaded in to google fusion tables as two separate tables. In order to ensure that the files would merge smoothly, the name of each county was used as the column for which they should be related to each other. Once this was processed and loaded the two tables were merged together to form an accurate population map.

To make the map more understandable a legend was applied. This was done by selecting the required range, 30,000 to 1,273,070 and dividing it up in to 5 different buckets, each associated with a different shade of the same color. Each bucket contains a different range of the population figures. The legend then shows beside the map for a quick reference to the actual figures each colored county represents.

The above example shows how the population of Ireland is dispersed across the 26 counties. Each county has a corresponding color value which informs us that the darker the color, the more populated the county is. It is clear to see from the map that while Galway, Cork and Dublin are the most populated areas while the central strip of counties like Kilkenny, Loais, Offaly and Roscommon are the least populated.

Additional Information

Irish houses with internet access

Irish houses with no internet access

While this information in itself doesn’t tell us much, it is possible to merge this map with another data set. I chose to merge this with datasets regarding private access to the internet at home, one map containing the figures of those who have it and a separate map with the figures for those who don’t. It can be seen that the range of those with no access at home is relatively small compared to the actual population numbers themselves. For example Dublin has a population of 1,273,070 and only 7.1% have no internet access. This is good news for the country in general and it can be assumed that of the 7.1% with no internet access, they are more than likely people who have no interested in it such as people of the older generation.

However it can be seen that the counties with the lowest numbers of people with home internet access are also the same that have the highest rates of home internet access. This is simply due to the population size of each county, i.e the county with the highest population will undoubtedly have the highest numbers of both. The figures are relative to the population of each county.

Conclusions/Recommendations

The conclusions we can draw from the use of fusion tables are many. For example one could predict that in the coming decades the built up and highly populated areas around Dublin, Cork and Galway will spill over in to the less populated areas of Kilkenny, Loais, Offaly and Roscommon.

What can be done to facilitate this? Firstly when the government are considering where needs to be developed, they should look at these counties since the east and west populations seems to be naturally migrating inland anyway. Granted that these are very rural areas and many are used for farming but once urban and suburban developments don’t have a detrimental impact on the farming sector, we might soon see the more under developed areas of the country become highly populated metropolitan areas like Dublin and Cork. Furthermore it would be wise for large businesses and home owners to consider buying property in these underdeveloped areas as in the years to come they will be valued real estate, providing the housing market doesn’t take another unexpected downturn.

References

http://www.cso.ie/en/census/census2011reports/census2011thisisirelandpart2/ (accessed 7/10/2016)

http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/ (accessed 10/10/2016)

http://www.independent.ie/editorial/test/map_lead.kml (accessed 7/10/2016)

http://www.cso.ie/px/pxeirestat/Statire/SelectVarVal/saveselections.asp (accessed 10/10/2016)

http://www.cso.ie/px/pxeirestat/Statire/SelectVarVal/saveselections.asp (accessed 10/10/2016)