On the Graph You Should Not Connect the Dots Discrete Continuous

IMHO, whoever first omitted the precise timing of changes in the number of cars bears the first responsibility for any misleading results. If you had this information (even if measured with error), time would be a proper continuous variable, not necessarily a grouped continuous variable (see Anderson, 1984). You'd be free to group observations into hour-based bins if you really wanted to, at which point you'd assume responsibility for any misleading results that follow. Otherwise, by preserving precise times of arrival, you could accurately graph your number-of-cars time series over continuous time.
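To make the difference between exact timing and hourly bins concrete, here is a minimal sketch. The event log, the `cars_at` helper, and the assumption that the lot starts empty are all hypothetical; they are not part of the original question's data.

```python
from bisect import bisect_right

def cars_at(events, minute):
    """Number of cars present at `minute`.

    `events` is a list of (time_in_minutes, change) tuples sorted by time,
    where change is +1 for an arrival and -1 for a departure.
    Assumption: the lot starts empty.
    """
    times = [t for t, _ in events]
    # Every event with time <= minute has already happened.
    n_past = bisect_right(times, minute)
    return sum(change for _, change in events[:n_past])

# Hypothetical arrival/departure log (minutes since the lot opened).
events = [(5, +1), (12, +1), (47, +1), (70, -1), (95, +1), (118, +1)]

# With exact times, the count is a step function of continuous time.
exact_at_50 = cars_at(events, 50)                      # 3

# Hourly binning keeps only the value at each whole hour.
hourly = [cars_at(events, 60 * h) for h in range(3)]   # [0, 3, 4]
```

The hourly list no longer shows that a car left at minute 70, which is exactly the kind of information the binned data cannot recover.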

Anyway, assuming you're stuck with number of cars per hour, I agree with @John: you should draw a line connecting your hourly observations. If you lack information about when each incremental change occurred, it's rather hard to say you're misleading anyone unless you fail to describe the limits of the information graphed. Similarly, if you graph your hourly data with a simple bar chart without a line connecting the bins, you're not really guilty of misleading anyone as long as you don't claim that the changes between hourly observations occur precisely as depicted: on the hour, all at once. If someone misunderstands (as will probably happen with any sufficiently publicized statistic or data), you won't have misled them, especially if you describe your data and collection procedure in sufficient detail. This much should not be hard to do.

Given basic clarity and thoroughness of data and graph descriptions, there should be no disadvantage to drawing a line to connect your bins. The advantage of connecting your bins is in fact what you seem to think is the disadvantage: drawing those lines mimics a halfway decent equation for the number of cars as a function of continuous time, even though it's based on discrete, hourly observations. You can use a straight line between observations to represent a fairly reasonable assumption that change occurs linearly over each hour, not all at once. Based on such an assumption, any reader can make a decent guess as to which minute after a given hour's measurement will see the next car arrive or leave by this fairly common-sense four-step procedure:

  1. Find the point on the line where number of cars $=1+$ the previous hour's observation
  2. Draw a line straight down from this point to find where it intersects with the hour axis
  3. Measure the distance of this point on the hour axis from the point of the previous observation
  4. Compute distance $\div$ distance between observations $\times 60 =$ the minute after the hour of the next car's arrival.
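The four steps above collapse into a single division once the straight-line assumption is written down. Here is a minimal sketch; the function name `minute_of_next_change` and the example counts are hypothetical, and a falling line is treated as the next car leaving rather than arriving.

```python
def minute_of_next_change(prev_count, next_count, minutes_between=60):
    """Minutes after the earlier observation at which the straight line
    between two observations has moved by exactly one car.

    Returns None when the two counts are equal (a flat line implies no change).
    """
    net_change = next_count - prev_count
    if net_change == 0:
        return None
    # Steps 1-3: the line reaches prev_count +/- 1 after 1/|net_change|
    # of the horizontal distance between the two observations.
    # Step 4: scale that fraction to minutes.
    return minutes_between / abs(net_change)

# Hypothetical hourly counts: 10 cars at one hour, 14 at the next.
print(minute_of_next_change(10, 14))   # 15.0 -> next arrival about :15
print(minute_of_next_change(10, 7))    # 20.0 -> next departure about :20
print(minute_of_next_change(5, 5))     # None
```

This is only as good as the linearity assumption; it says nothing about when the car really arrived.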

Of course, one can estimate the next car's arrival down to the precise second too, and you can't stop readers from doing this by not providing the line; drawing the line just becomes the first of five steps. Thus, if someone actually wants to know how many cars were there in the meantime... well, they can't, because the information isn't available, but they can estimate. If you knock a step off the process for them, I imagine they'll be grateful.

Doing this for your readers with simple, straight lines only implies your comfort with the assumption that change occurs linearly between hourly observations, or more pejoratively stated, your lack of interest in any inaccuracies in this assumption. Inaccuracies aren't hard to imagine. First, change necessarily occurs as a nonlinear, zero-inflated function of time. It's nonlinear because the change event is ternary: a car arrives, a car leaves, or neither happens; cars don't arrive or leave in fractional increments. It's zero-inflated because most moments in time won't see a car arrive or leave. You can get around this by treating the line as describing probabilities rather than exact counts: at any given moment, it indicates how likely it is that cars will arrive or leave to reach the nearest whole number.
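One way to picture that probabilistic reading is a small simulation. This is a sketch under assumptions of my own, not a model from the data: time is cut into minutes, at most one car moves per minute, cars only move in the direction of the net hourly change, and the per-minute probability equals the slope of the line. The name `simulate_hour` and the counts are hypothetical.

```python
import random

def simulate_hour(prev_count, next_count, rng, minutes=60):
    """One random minute-by-minute path of whole-number car counts.

    Assumption: in each minute a single car moves with probability
    |next_count - prev_count| / minutes, which must not exceed 1.
    """
    net_change = next_count - prev_count
    step = 1 if net_change > 0 else -1
    p = abs(net_change) / minutes
    count = prev_count
    path = [count]
    for _ in range(minutes):
        if rng.random() < p:   # most minutes see no change (zero inflation)
            count += step
        path.append(count)
    return path

rng = random.Random(0)  # fixed seed so the run is repeatable
paths = [simulate_hour(10, 14, rng) for _ in range(5000)]

# Each path only ever holds whole numbers, yet the average across paths
# stays close to the straight line from 10 to 14.
mean_at_30 = sum(path[30] for path in paths) / len(paths)   # close to 12
mean_at_60 = sum(path[60] for path in paths) / len(paths)   # close to 14
```

Individual paths are step functions that sit still most of the time; the straight line only shows up as their average.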

Yet another inaccuracy of the assumption behind straight lines between hourly observations remains. You might expect the rate of change (in terms of probability as above) to change more smoothly over time than your straight lines drawn separately between points imply. In more mathematical terms, you might want the derivative of your number of cars(hour) function to be continuous across hours. You might be able to do this by fitting a polynomial function to your data, but if your purpose is predictive, beware of overfitting.
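To see what a discontinuous derivative means for connected dots, a short sketch with made-up hourly counts (not the questioner's data) computes the slope of each straight segment and how much it jumps at each interior hour:

```python
# Made-up hourly counts, for illustration only.
counts = [10, 14, 15, 15, 9, 4]

# Slope (cars per hour) of each straight segment between consecutive hours.
slopes = [b - a for a, b in zip(counts, counts[1:])]

# Change in slope at each interior hour; any nonzero value means the
# derivative of the piecewise-linear curve is discontinuous there.
jumps = [s2 - s1 for s1, s2 in zip(slopes, slopes[1:])]

print(slopes)   # [4, 1, 0, -6, -5]
print(jumps)    # [-3, -1, -6, 1]
```

A single smooth curve, such as the polynomial fit mentioned above, has no such jumps, but it buys that smoothness with extra assumptions about what happens between observations.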

Another advantage of lines over histogram-style bars (i.e., with no intermediate spacing for adjacent values of hour...let alone charts with bars that don't "touch" each other) arises from your polytomous lot variable. You can superimpose your separate time series for each lot on the same graph to facilitate comparisons, which will help you see whether your lot variable is interesting. Here's a demonstration with some made-up data:

Kudos to McCown!

I'm not even going to try to figure out how to do that coherently with bars; I'll leave that to @ChristianStade-Schuldt ;) To be fair, it's even easier to not connect these points as he suggested, but adding the lines helps disambiguate the points corresponding to separate time series from one another. In the end, it's still going to be a little subjective, so judge for yourself:

I for one find myself drawing the lines in my mind anyway. BTW, if you feel the lines in the first figure detract anything from the visual impact of the exact points, don't forget that you can always increase the size of the points, change their shape, or present their values numerically in a separate table.

Reference
Anderson, J. A. (1984). Regression and ordered categorical variables. Journal of the Royal Statistical Society B, 46, 1–30.

Source: https://stats.stackexchange.com/questions/88267/connecting-the-dots-in-a-graph