@qnguyen345
Contributor

@qnguyen345 qnguyen345 commented May 9, 2025

- [ ] Closes #xxx

  • Added tests to cover all new or modified code.
    - [ ] Clearly documented all new API functions with PEP257 and numpydoc compliant docstrings.
    - [ ] Added new API functions to docs/api.rst.
  • Non-API functions clearly documented with docstrings or comments as necessary.
  • Adds description and name entries in the appropriate "what's new" file in docs/whatsnew for all changes. Includes link to the GitHub Issue with :issue:`num` or this Pull Request with :pull:`num`. Includes contributor name and/or GitHub username (link with :ghuser:`user`).
  • Pull request is nearly complete and ready for detailed review.
  • Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

There can be days where the system does not produce the expected power output. We can measure daily performance against a PVWatts model to identify those outlier days. Using the system metadata and NSRDB weather data, we can model the system's expected DC power output with PVWatts. We can then compare the modeled daily time series to the measured time series to get a percent difference. If the percent difference exceeds a threshold, meaning the system is producing much less or much more than expected, we flag that day as an anomaly.
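
A minimal sketch of the comparison described above, assuming the measured and modeled power are pandas Series on the same datetime index. The function names, the use of daily sums, and normalizing by the modeled value are illustrative assumptions, not the code in this PR; the default threshold of 50 percent follows the docstring under review.

```python
import pandas as pd


def daily_percent_difference(actual_power, modeled_power):
    """Percent difference between measured and modeled daily totals.

    Hypothetical helper: both inputs are assumed to be pandas Series
    with a DatetimeIndex covering the same period.
    """
    actual_daily = actual_power.resample('D').sum()
    modeled_daily = modeled_power.resample('D').sum()
    # Negative values mean the system produced less than modeled.
    return 100 * (actual_daily - modeled_daily) / modeled_daily


def flag_anomalous_days(pct_diff, threshold=50):
    """Boolean Series, True where the daily percent difference
    exceeds the threshold in either direction."""
    return pct_diff.abs() > threshold
```

Keeping the percent difference and the boolean flag in separate functions lets the same flagging step be reused with any modeled series, not only PVWatts output.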

return deviation > max_deviation * mad


def run_pvwatts_data_checks(power_series, nsrdb_weather_df):
Member

Add a leading underscore to the name, as this is a private function.

azimuth : Float
Azimuth angle of site in degrees.
dc_capacity : Float
DC capacity of the site.
Member

Use "data stream" instead of "site".

return power_series


def run_pvwatts_model(tilt, azimuth, dc_capacity, dc_inverter_limit,
Member

Private method; add a leading underscore.

Percent difference threshold for flagging data as anomalies.
Defaulted to 50.
dc_capacity : None or Float
DC capacity of the site. If the inverter dc capacity is not
Member

data stream instead of site

Returns
-------
master_df : Pandas dataframe with datetime index
Member

Rename master_df; the name is too generic.

Member

Return a pandas Series of percent differences, and add a new function that determines whether each value is anomalous, returning a boolean Series with a datetime index.

@cwhanse
Member

cwhanse commented May 9, 2025

My reaction is that run_pvwatts_model doesn't belong as a function in pvanalytics. As the code is, the PVWatts model is hardwired into the data check function get_anomalous_days so I couldn't use a different performance model as input for this check - that reduces reusability.

get_anomalous_days ought to have a more specific name.

+1 to @kperrynrel's comments about the output of get_anomalous_days.

@kperrynrel
Member

Hey @cwhanse, Quyen put this together on our end as this was a specific request from @williamhobbs. Southern wants to run an outlier check for "abnormal" daily behavior based on expected PVWatts output (they're using a lot of the PVAnalytics routines already). If you don't think it's a good fit, we could send him the code directly? Can you think of another open source repo where it may be more appropriate?

@cwhanse
Member

cwhanse commented May 9, 2025

Would the example be sufficient for @williamhobbs? The prepackaged PVWatts model could be a function in the example, although then it's not importable.

For identifying the outliers from a percent absolute difference in daily values, only predicted and actual are needed, and this function could accept lots of inputs: power, energy, temperature...

@williamhobbs

(I think this issue is almost 100% relevant to my comment below: #143.)

Here's my summary of our in-person conversation, @cwhanse. Hopefully this captures everything (with new sketches!):

We talked about a more general function/set of functions to flag deviations in a signal (like power or back of module temperature) from a reference, which could be from a physically adjacent piece of hardware (like inverter or Tbom sensor) or from a simulated signal, which I'm most interested in. It would be up to the user to provide the reference timeseries.

Anomalies could be flagged if the deviation (absolute value?) exceeds some time-based threshold, e.g., off by 20% for 1 hr or 10% for one day. The threshold could be a curve based on a function with one or two parameters, or maybe a piece-wise function based on a table. See the sketch below.

image
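
One way to read the time-based threshold idea above, as a rough sketch only: average the percent deviation over each window length in a small table and flag a timestamp when any windowed average exceeds its limit. The function name, the `thresholds` table format, and the `freq` argument are hypothetical, and the code assumes regularly sampled data and a nonzero reference.

```python
import pandas as pd


def flag_sustained_deviation(signal, reference, thresholds, freq):
    """Flag timestamps where the mean percent deviation over any
    window exceeds that window's limit.

    thresholds maps a window length (e.g. '1h') to a limit in percent.
    freq is the sampling interval of the data (e.g. '15min').
    Hypothetical helper: assumes regular sampling, nonzero reference.
    """
    step = pd.Timedelta(freq)
    pct_dev = 100 * (signal - reference) / reference
    flags = pd.Series(False, index=signal.index)
    for window, limit in thresholds.items():
        # Number of samples spanning the window.
        n = int(pd.Timedelta(window) / step)
        # Rolling mean is NaN until a full window is available,
        # so incomplete windows are never flagged.
        mean_dev = pct_dev.rolling(n).mean()
        flags |= mean_dev.abs() > limit
    return flags
```

For the example in the comment, the table would be something like `{'1h': 20, '1D': 10}`; a flag is placed at the timestamp that ends the offending window.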

There could be a possible second support function that you feed historical "good" data to and it returns the threshold curve at some confidence interval (e.g., 95% or 99% of historical deviations were below this curve). I could see this being very useful, otherwise there could be a lot of trial and error for users.

image
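
A rough sketch of what such a support function might look like, assuming the threshold curve is simply the empirical quantile of the windowed absolute percent deviation in the "good" data. The function name and arguments are hypothetical, and this uses plain quantiles per window rather than any fitted curve.

```python
import pandas as pd


def fit_threshold_curve(signal, reference, windows, freq, confidence=0.95):
    """Empirical threshold (in percent) for each window length.

    For every window, returns the value that `confidence` of the
    historical windowed absolute percent deviations fall below.
    Hypothetical helper: assumes regular sampling, nonzero reference.
    """
    step = pd.Timedelta(freq)
    pct_dev = 100 * (signal - reference) / reference
    curve = {}
    for window in windows:
        n = int(pd.Timedelta(window) / step)
        mean_dev = pct_dev.rolling(n).mean().abs()
        # quantile() ignores the NaN values from incomplete windows.
        curve[window] = mean_dev.quantile(confidence)
    return pd.Series(curve, name='threshold_pct')
```

With noisy but healthy data, the returned thresholds would be expected to shrink as the window grows, since averaging over longer periods smooths out short-lived deviations.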

I imagine these concepts already exist somewhere. My quick web-searching turns up network traffic anomaly detection, but it seems to be based only on past trends, not on an independent reference "expectation".

@cwhanse
Member

cwhanse commented Jun 4, 2025

@williamhobbs @kperrynrel @qnguyen345

I propose we close this PR and replace with the following development goals:

  1. add functions to outlier.py that label outliers in a timeseries based on deviations from a reference signal. The current functions in outlier.py find outliers using either the time series' marginal distribution (zscore, tukey) or sliding windows of the time series (hampel). I don't know how to define an outlier from the deviation of signal from reference, but perhaps the existing outlier methods apply?

  2. add functions to outlier.py to identify threshold curves in a time series. The geometry is intuitive, but the statistics can seem complicated and out of reach. Maybe there's a version of quantile regression that can be applied here.
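
As a rough sketch of goal 1, one option is to apply a marginal-distribution rule to the residual between signal and reference rather than to the signal itself. The example below applies Tukey fences to that residual using plain pandas; the function name and the choice of a simple difference as the residual are assumptions, and whether this is a sensible definition of an outlier is the open question raised above.

```python
import pandas as pd


def tukey_on_residual(signal, reference, k=1.5):
    """Flag points whose deviation from the reference falls outside
    the Tukey fences of all deviations.

    Hypothetical helper: the residual is taken as signal - reference.
    """
    residual = signal - reference
    q1 = residual.quantile(0.25)
    q3 = residual.quantile(0.75)
    iqr = q3 - q1
    # True where the residual lies more than k * IQR outside the quartiles.
    return (residual < q1 - k * iqr) | (residual > q3 + k * iqr)
```

The same residual could be fed to a z-score or sliding-window rule instead; only the step that computes the deviation from the reference is specific to this goal.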

@williamhobbs

@cwhanse - your proposal sounds good to me.

Maybe the quantile regression in statsmodels (already required of pvanalytics, I think) can be used for a non-linear fit like this, https://stats.stackexchange.com/a/474426/375881, but where the x-axis is the time averaging window length.

@kperrynrel
Member

@cwhanse @williamhobbs also good with closing this and reopening another PR with the newly recommended logic. Thanks!
