Adding an outlier check for dc capacity/power #223


Closed

qnguyen345 wants to merge 5 commits into pvlib:main from qnguyen345:pvwatts_model_comparison

Contributor

qnguyen345 commented May 9, 2025

~~- [ ] Closes #xxx~~

Added tests to cover all new or modified code.
~~- [ ] Clearly documented all new API functions with PEP257 and numpydoc compliant docstrings.~~
~~- [ ] Added new API functions to docs/api.rst.~~
Non-API functions clearly documented with docstrings or comments as necessary.
Adds description and name entries in the appropriate "what's new" file
in docs/whatsnew
for all changes. Includes link to the GitHub Issue with :issue:`num`
or this Pull Request with :pull:`num`. Includes contributor name
and/or GitHub username (link with :ghuser:`user`).
Pull request is nearly complete and ready for detailed review.
Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

There can be days where the system does not produce the expected power output. We can compare daily performance against a PVWatts model to identify those outlier days. We can model a system's expected DC capacity/power output with PVWatts using the system metadata and NSRDB weather data, then compare the modeled daily time series to the measured time series to get a percent difference. If the percent difference exceeds a threshold, meaning the system is producing much less or much more than expected, we can flag that day as an anomaly.
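A minimal sketch of the daily comparison described above, assuming the measured and modeled series share a datetime index. The helper name `daily_percent_difference` and the synthetic data are hypothetical and are not the API added in this PR; the 50 percent threshold mirrors the default mentioned in the docstring under review.

```python
import pandas as pd


def daily_percent_difference(actual, modeled):
    """Percent difference between daily totals of two power series.

    Hypothetical helper; both inputs are assumed to share a DatetimeIndex.
    """
    actual_daily = actual.resample("D").sum()
    modeled_daily = modeled.resample("D").sum()
    # Mask days where the model predicts zero output to avoid dividing by zero.
    modeled_daily = modeled_daily.where(modeled_daily != 0)
    return (actual_daily - modeled_daily) / modeled_daily * 100


# Synthetic hourly data covering three days (illustrative values only).
index = pd.date_range("2025-05-01", periods=72, freq="h")
modeled = pd.Series(100.0, index=index)
actual = modeled.copy()
actual.loc["2025-05-02"] = 30.0  # simulate an underperforming day

pct_diff = daily_percent_difference(actual, modeled)
is_anomalous = pct_diff.abs() > 50  # threshold in percent
```

Keeping the percent difference and the boolean flag as separate steps lets the same difference series be reused with other thresholds.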

qnguyen345 added 4 commits

May 5, 2025 17:41


          added script to detect anomalous days in inverter dc time series

7c6f6f9


          added pvwatts model to detect anomalies in outlier script

f6fa87f


          changed .mean() to .sum() to get total daily power output and correct…

b79c15c

…ions for abs()


          added scripts to detect outliers with PVWatts

f72ae85

kperrynrel reviewed

pvanalytics/quality/outliers.py Outdated

    
                  return deviation > max_deviation * mad

              def run_pvwatts_data_checks(power_series, nsrdb_weather_df):

Member

kperrynrel May 9, 2025

Add underscore as this is a private method

pvanalytics/quality/outliers.py Outdated

    
                  azimuth : Float

                      Azimuth angle of site in degrees.

                  dc_capacity : Float

                      DC capacity of the site.

Member

kperrynrel May 9, 2025

data stream

pvanalytics/quality/outliers.py Outdated

    
                  return power_series

              def run_pvwatts_model(tilt, azimuth, dc_capacity, dc_inverter_limit,

Member

kperrynrel May 9, 2025

private method

pvanalytics/quality/outliers.py Outdated

    
                      Percent difference threshold for flagging data as anomalies.

                      Defaulted to 50.

                  dc_capacity : None or Float

                      DC capacity of the site. If the inverter dc capacity is not

Member

kperrynrel May 9, 2025

data stream instead of site

pvanalytics/quality/outliers.py Outdated

    
                  Returns

                  -------

                  master_df : Pandas dataframe with datetime index

Member

kperrynrel May 9, 2025

rename master_df as it's generic

Member

kperrynrel May 9, 2025

Return a pandas Series of percent difference, and add a new function that determines whether each day is anomalous, returning a boolean Series with a datetime index.

Member

cwhanse commented May 9, 2025

My reaction is that run_pvwatts_model doesn't belong as a function in pvanalytics. As the code is, the PVWatts model is hardwired into the data check function get_anomalous_days so I couldn't use a different performance model as input for this check - that reduces reusability.

get_anomalous_days ought to have a more specific name.

+1 to @kperrynrel's comments about the output of get_anomalous_days.

Member

kperrynrel commented May 9, 2025

Hey @cwhanse, Quyen put this together on our end as this was a specific request from @williamhobbs. Southern wants to run an outlier check for "abnormal" daily behavior based on expected PVWatts output (they're using a lot of the PVAnalytics routines already). If you don't think it's a good fit, we could send him the code directly? Can you think of another open source repo where it may be more appropriate?

Member

cwhanse commented May 9, 2025

Would the example be sufficient for @williamhobbs? The prepackaged PVWatts model could be a function in the example, although then it's not importable.

For identifying the outliers from a percent absolute difference in daily values, only predicted and actual are needed, and this function could accept lots of inputs: power, energy, temperature...


          added suggested fixes

b1bfadd

williamhobbs commented May 22, 2025

(I think this issue is almost 100% relevant to my comment below: #143.)

Here's my summary of our in-person conversation, @cwhanse. Hopefully this captures everything (with new sketches!):

We talked about a more general function/set of functions to flag deviations in a signal (like power or back of module temperature) from a reference, which could be from a physically adjacent piece of hardware (like inverter or Tbom sensor) or from a simulated signal, which I'm most interested in. It would be up to the user to provide the reference timeseries.

Anomalies could be flagged if the deviation (absolute value?) exceeds some time-based threshold, e.g., off by 20% for 1 hr or 10% for one day. The threshold could be a curve based on a function with one or two parameters, or maybe a piece-wise function based on a table. See the sketch below.
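One way the time-based threshold idea could look, as a rough sketch only. The function name `flag_deviations`, the threshold table, and the synthetic data are all assumptions made for illustration; the table encodes the "20% for 1 hr or 10% for one day" example as a piecewise lookup keyed by averaging window.

```python
import pandas as pd

# Hypothetical threshold table: averaging window -> allowed relative deviation.
THRESHOLDS = {"1h": 0.20, "1D": 0.10}


def flag_deviations(signal, reference, thresholds=THRESHOLDS):
    """Flag timestamps where the window-averaged relative deviation of
    ``signal`` from ``reference`` exceeds the threshold for that window."""
    flags = pd.Series(False, index=signal.index)
    for window, limit in thresholds.items():
        # Compare rolling means so single low-reference samples do not dominate.
        sig_mean = signal.rolling(window).mean()
        ref_mean = reference.rolling(window).mean()
        deviation = ((sig_mean - ref_mean) / ref_mean).abs()
        flags |= deviation > limit
    return flags


# Synthetic 15-minute data: flat reference, signal with two kinds of shortfall.
index = pd.date_range("2025-05-01", periods=3 * 96, freq="15min")
reference = pd.Series(100.0, index=index)
signal = reference.copy()
signal.loc["2025-05-01 12:00":"2025-05-01 12:45"] = 70.0  # 30% low for one hour
signal.loc["2025-05-03"] = 87.0  # 13% low for a whole day

flags = flag_deviations(signal, reference)
```

In this sketch the short, deep dip trips the 1-hour limit, while the shallow day-long shortfall only trips the 1-day limit once enough of the trailing window is affected.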

There could be a second support function that you feed historical "good" data to and it returns the threshold curve at some confidence level (e.g., 95% or 99% of historical deviations were below this curve). I could see this being very useful, otherwise there could be a lot of trial and error for users.

I imagine these concepts already exist somewhere. My quick web-searching turns up network traffic anomaly detection, but it seems to be based only on past trends, not on an independent reference "expectation".
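A sketch of what such a support function might look like, using plain empirical quantiles rather than a fitted curve. The name `empirical_threshold_curve`, the window choices, and the synthetic "good" data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def empirical_threshold_curve(signal, reference, windows, confidence=0.95):
    """For each averaging window, return the deviation magnitude that a
    fraction ``confidence`` of the historical deviations fall below."""
    curve = {}
    for window in windows:
        sig_mean = signal.rolling(window).mean()
        ref_mean = reference.rolling(window).mean()
        deviation = ((sig_mean - ref_mean) / ref_mean).abs()
        curve[window] = deviation.quantile(confidence)
    return pd.Series(curve, name="threshold")


# Synthetic "good" history: hourly data with 5% multiplicative noise.
rng = np.random.default_rng(0)
index = pd.date_range("2025-01-01", periods=60 * 24, freq="h")
reference = pd.Series(100.0, index=index)
signal = reference * (1 + rng.normal(0, 0.05, size=len(index)))

curve = empirical_threshold_curve(signal, reference, ["1h", "6h", "1D"])
```

With independent noise like this, the threshold shrinks as the averaging window grows, which matches the intuition of a curve that tolerates large short-lived deviations but only small sustained ones.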

Member

cwhanse commented Jun 4, 2025

@williamhobbs @kperrynrel @qnguyen345

I propose we close this PR and replace with the following development goals:

- Add functions to outliers.py that label outliers in a time series based on deviations from a reference signal. The current functions in outliers.py find outliers using either the time series' marginal distribution (zscore, tukey) or sliding windows of the time series (hampel). I don't know how to define an outlier from the deviation of a signal from a reference, but perhaps the existing outlier methods apply?
- Add functions to outliers.py to identify threshold curves in a time series. The geometry is intuitive, but the statistics can seem complicated and out of reach. Maybe there's a version of quantile regression that can be applied here.
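As one possible reading of the first goal, the sketch below applies Tukey-style fences to the residual between signal and reference instead of to the raw signal. It is written in plain pandas rather than calling the existing pvanalytics functions, and the name `tukey_outliers_from_reference` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def tukey_outliers_from_reference(signal, reference, k=1.5):
    """Flag points whose residual (signal - reference) lies outside the
    Tukey fences [Q1 - k*IQR, Q3 + k*IQR] of the residual distribution."""
    residual = signal - reference
    q1, q3 = residual.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (residual < q1 - k * iqr) | (residual > q3 + k * iqr)


# Synthetic data: noisy signal around a flat reference with one injected fault.
rng = np.random.default_rng(0)
index = pd.date_range("2025-05-01", periods=200, freq="h")
reference = pd.Series(50.0, index=index)
signal = reference + rng.normal(0, 1, size=len(index))
signal.iloc[100] -= 40.0  # inject one large shortfall

flags = tukey_outliers_from_reference(signal, reference)
```

Whether a marginal-distribution test like this is appropriate depends on how the residual behaves over time, so this should be read as a starting point for the discussion rather than a proposed implementation.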

williamhobbs commented Jun 4, 2025

@cwhanse - your proposal sounds good to me.

Maybe the quantile regression in statsmodels (already required of pvanalytics, I think) can be used for a non-linear fit like this, https://stats.stackexchange.com/a/474426/375881, but where the x-axis is the time averaging window length.

Member

kperrynrel commented Jun 5, 2025

@cwhanse @williamhobbs also good with closing this and reopening another PR with the newly recommended logic. Thanks!

cwhanse mentioned this pull request

Detect outliers by different from reference signal #224

Open

cwhanse closed this
